Running AI Inference on Kubernetes: GPU Scheduling, Ollama, and Resource Sharing

Getting a cluster to hand a model the GPU it needs and no more

By Smarc

09-01-2025 1856 words 9 min read

Kubernetes was designed for a world of stateless web services you could scale by adding more identical replicas. GPUs are the opposite of that: scarce, expensive, and absolutely not interchangeable with CPU. So the moment you decide to run model inference on your cluster, you discover that Kubernetes treats your graphics card as a curious unknown — it doesn’t schedule on it, it can’t see it, and your pods come up GPU-less and confused.

Getting from there to “Ollama is serving a model on a cluster GPU, and three workloads are politely sharing it” took me an evening of swearing and a clearer mental model than I started with. Here’s the map I wish I’d had.

Why Kubernetes can’t see your GPU

By default the kubelet only knows about CPU and memory. A GPU is, as far as it’s concerned, none of its business. The bridge is the device plugin framework: a small agent runs on each GPU node, discovers the hardware, and advertises it to the kubelet as a schedulable resource with a name like nvidia.com/gpu.

For NVIDIA cards, the practical route is the GPU Operator, which installs the device plugin, the right drivers, and the container runtime hooks in one go. It’s a Helm chart; the honest reason to use it rather than assembling the pieces by hand is that the drivers, the container toolkit, and the device plugin all have to agree on versions, and the operator manages that dance for you. Once it’s running, kubectl describe node shows your GPU as allocatable capacity:

$ kubectl describe node gpu-node-01 | grep -A3 Allocatable
Allocatable:
  cpu:                16
  memory:             64270000Ki
  nvidia.com/gpu:     1

If nvidia.com/gpu doesn’t appear in that output, nothing else in this guide will work — the scheduler literally has no GPU to place pods on. Skip this step and every “why won’t my pod find CUDA” rabbit hole leads back here.
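As a rough sketch of the install and the check that follows it, assuming Helm is already pointed at the cluster: the release name and the gpu-operator namespace are choices made for this example, and both function names are hypothetical helpers rather than anything the operator ships.

```shell
# Sketch: install the GPU Operator from NVIDIA's Helm repository.
# Assumption: release name and namespace are both "gpu-operator".
install_gpu_operator() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
  helm repo update &&
  helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace --wait
}

# Print one row per node with the GPU count it advertises as allocatable.
# The dot in the resource name has to be escaped inside the column path.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Run install_gpu_operator once, then list_gpu_capacity. A node whose device plugin isn’t working shows no value in the GPU column, which is the same failure as the missing nvidia.com/gpu line in kubectl describe node.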

Asking for a GPU

With the plugin in place, requesting a GPU looks like any other resource request — you put it under limits. The wrinkle is that GPUs are non-divisible by default: you request a whole one, you get a whole one, and a second pod asking for the same card waits.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models

Note the persistent volume. Ollama caches pulled models on disk, and without somewhere durable to put them, every pod restart re-downloads several gigabytes. Mount a PVC at /root/.ollama and that pain goes away. What backs that PVC matters more than you’d think for model serving — a 20 GB model loading off slow network storage adds seconds to every cold start, so I put the model cache on fast block storage, the kind of decision I worked through in Longhorn vs OpenEBS. Read-heavy, occasionally-written, latency-sensitive: exactly the profile a model cache has.
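The Deployment refers to a claim called ollama-models without showing it. A minimal sketch of that claim might look like the following; the 50Gi size is a guess at a comfortable cache, and leaving out storageClassName means the cluster’s default StorageClass is used, so both are assumptions to adjust.

```shell
# Write a PVC manifest whose name matches the Deployment's claimName.
# Assumptions: 50Gi is an illustrative size; no storageClassName is set,
# so whatever StorageClass the cluster marks as default will back it.
cat > ollama-models-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
EOF
```

Apply it with kubectl apply -f ollama-models-pvc.yaml before the Deployment, in the same namespace. ReadWriteOnce is enough for the single replica shown above.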

If you’ve already got Ollama running comfortably in a plain container — the setup from running LLMs locally with Ollama — this Deployment is that same workload, just handed to the cluster scheduler instead of docker run. Nothing about how Ollama itself behaves changes; you’re only changing who decides where it runs and how the GPU gets allocated.

The resource-sharing problem

Here’s the catch that bites everyone: that nvidia.com/gpu: 1 claims the entire card. If you have one GPU and two workloads that each ask for one, the second sits Pending until the first gives the card up, which for a long-running model server means forever. For a homelab with a single decent card, that’s a disaster: you wanted to run a chat model and an embedding model and the occasional batch job, not pick one.

There are three honest answers. Time-slicing lets the device plugin advertise one physical GPU as several virtual ones; the workloads take turns on the hardware. It’s the simplest to set up and right for bursty, low-contention use, like a homelab where things rarely run flat out at once. MPS (Multi-Process Service) lets several processes genuinely run concurrently on one GPU and can cap each client’s share of memory and compute, though it stops short of hardware isolation. MIG, on data-centre cards that support it, carves the GPU into hardware-partitioned slices with real memory boundaries.

For most self-hosters, time-slicing via the device plugin config is the answer:

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
data:
  config.yaml: |
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Now one physical GPU advertises as four, and four pods can each request nvidia.com/gpu: 1 and all land on the same card. They share the hardware by taking turns. The honest caveat: there’s no memory isolation. If two models together want more VRAM than the card has, one of them gets an out-of-memory error and dies. Time-slicing shares compute time, not memory — you still have to do the VRAM arithmetic in your head.
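The ConfigMap does nothing until the device plugin is told to read it. With the GPU Operator, that is usually done by patching its ClusterPolicy. The sketch below assumes the ConfigMap above was created in the gpu-operator namespace and that the ClusterPolicy has the name cluster-policy; the function name is a hypothetical helper, and "default" names the key inside the ConfigMap (config.yaml here).

```shell
# Sketch: point the operator's device plugin at the time-slicing ConfigMap.
# Assumptions: operator namespace "gpu-operator", ClusterPolicy named
# "cluster-policy", ConfigMap "device-plugin-config" with key "config.yaml".
enable_time_slicing() {
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec":{"devicePlugin":{"config":{"name":"device-plugin-config","default":"config.yaml"}}}}'
}
```

After calling enable_time_slicing, give the device plugin pods a minute to restart, then rerun kubectl describe node. If the change took, the nvidia.com/gpu count should match the replicas value rather than the number of physical cards.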

Verifying the model actually serves

A pod that’s Running isn’t the same as a model that’s answering. Ollama needs to pull the model before it can serve it, and on a fresh volume that’s a multi-gigabyte download that happens after the container starts. The full end-to-end test is to expose the deployment with a Service and hit its API; the quickest check is to pull and run a model from inside the pod:

$ kubectl exec -it deploy/ollama -- ollama pull gemma3:12b
$ kubectl exec -it deploy/ollama -- ollama run gemma3:12b "say hi"
Hi! How can I help you today?

If that first pull is where you’re stuck, the pod is fine and the model just isn’t cached yet, which is precisely the argument for the persistent volume above. Once one pod has pulled the model onto the PVC, restarts skip the download because the weights are already on disk; all that’s left is loading them into VRAM.
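For a check that goes through the network path rather than kubectl exec, a Service plus one HTTP request is enough. This is a sketch: the Service name ollama and the helper ollama_say_hi are choices made for the example, while /api/generate and the "stream": false field come from Ollama’s HTTP API.

```shell
# Write a ClusterIP Service in front of the pods labelled app=ollama.
# Assumption: the Service is simply named "ollama".
cat > ollama-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
EOF

# Ask the HTTP API for a single, non-streamed completion.
# $1 is the base URL, e.g. http://localhost:11434 behind a port-forward.
ollama_say_hi() {
  curl -sS "$1/api/generate" \
    -d '{"model": "gemma3:12b", "prompt": "say hi", "stream": false}'
}
```

Apply the manifest with kubectl apply -f ollama-service.yaml, run kubectl port-forward svc/ollama 11434:11434 in a second terminal, then call ollama_say_hi http://localhost:11434. The reply is a JSON object whose response field holds the generated text.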

The sharing decision in practice

Choosing between the three sharing modes comes down to one question: do your workloads ever genuinely contend for the card at the same instant? In a homelab, the honest answer is usually no. The chat model sits idle between prompts, the embedding service fires in short bursts, the nightly batch job runs when you’re asleep. That idle-heavy, rarely-simultaneous profile is exactly what time-slicing is good at, and it’s why I run it rather than reaching for MPS or MIG.

MPS earns its keep when you have several processes that really do want to compute concurrently and you’re willing to trade a little setup complexity for better throughput under contention. MIG is a data-centre feature — it needs an A100, H100, or similar, and it gives you hardware-partitioned slices with real, enforced memory boundaries. For a card that cost less than a laptop, MIG isn’t on the menu, and that’s fine; time-slicing covers the homelab case completely as long as you keep the VRAM arithmetic honest.

One number to hold in your head: the replicas value you set in the time-slicing config is a promise about scheduling, not about memory. Advertising the card as four virtual GPUs lets four pods schedule, but all four still draw from the same physical VRAM pool. So if your card has 16 GB and you let four models onto it, their combined resident footprint has to stay under 16 GB, or whichever process next asks for memory the card doesn’t have fails with a CUDA out-of-memory error. I set the replica count to match how many small models I actually intend to co-locate, not the theoretical maximum, precisely so the scheduler doesn’t cheerfully pack the card past its memory limit and leave me chasing intermittent out-of-memory crashes at 2am.
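The arithmetic is simple enough to script. The helper below is a hypothetical sketch, not part of any NVIDIA or Kubernetes tooling: it adds up per-model footprints you supply, plus some headroom, and compares the sum to the card’s capacity. Every figure is in MiB and comes from you; nothing here is measured.

```shell
# vram_fits TOTAL_MIB HEADROOM_MIB MODEL_MIB...
# Succeeds only if headroom plus every model's footprint fits on the card.
# Footprints should cover weights plus working memory, not weights alone.
vram_fits() {
  total=$1
  headroom=$2
  shift 2
  used=$headroom
  for model in "$@"; do
    used=$((used + model))
  done
  echo "need ${used} MiB of ${total} MiB"
  [ "$used" -le "$total" ]
}
```

With made-up footprints on a 16 GB card, vram_fits 16384 1024 8200 5100 700 prints "need 15024 MiB of 16384 MiB" and succeeds; add a second 5100 MiB model and the total rises to 20124 MiB, so it fails. For real numbers rather than estimates, nvidia-smi --query-gpu=memory.used,memory.total --format=csv on the node reports what is actually resident.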

Keeping the GPU workloads where they belong

Once a couple of nodes have GPUs and the rest don’t, you want inference pods to land only on the capable ones, and you want non-GPU workloads to stay off them. Taints and tolerations do the fencing: taint the GPU nodes so nothing schedules there unless it explicitly tolerates the taint, then add that toleration to your inference deployments. A toleration only permits a pod to land on a tainted node; it doesn’t steer it there. A nodeSelector on a GPU-node label supplies the pull, pinning inference pods to the GPU nodes. Between them, your cluster stops trying to run nginx on the expensive silicon.

# on the deployment's pod spec
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
nodeSelector:
  gpu: "true"

Taint the node once with kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule and label it with kubectl label node gpu-node-01 gpu=true, and the fencing is done. The GPU Operator also labels GPU nodes automatically with its own keys, such as nvidia.com/gpu.present=true; selecting on one of those instead of a hand-applied label is worth doing if nodes come and go.

When it goes wrong

GPU scheduling fails in a small number of recognisable ways, and knowing the shape of each saves hours.

Pod stuck Pending, event says “0/N nodes are available: N Insufficient nvidia.com/gpu”. Either no node advertises a GPU (the device plugin isn’t running; check kubectl get pods -n gpu-operator) or every GPU is already claimed. Remember that without time-slicing, one card equals exactly one schedulable unit, so a second pod requesting a GPU waits for the first to release it. This is the single most common surprise.

Pod runs but the process falls back to CPU. The container can’t reach the driver. Confirm the image is a CUDA-capable build and that nvidia-smi works inside the pod: kubectl exec -it <pod> -- nvidia-smi. If that command fails, the container runtime hooks aren’t wired up — reinstall or reconcile the GPU Operator rather than fighting it pod by pod.

Out-of-memory kills under time-slicing. This is the trap from the sharing section, made concrete. Time-slicing shares compute, not VRAM, so two models whose combined weights exceed the card’s memory will collide and one dies with a CUDA OOM. There’s no scheduler that protects you here; you have to size the models to fit the card yourself. nvidia-smi on the node shows real memory use across all the pods sharing the GPU — watch it, don’t guess.

Driver/toolkit version mismatch after an upgrade. A kernel update on the node can break the NVIDIA driver until the module rebuilds. The GPU Operator normally handles this, but if a node reboots into a mismatched state, drain it, let the operator reconcile, then uncordon. Fighting individual pods while the node’s driver is broken is wasted effort.
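The drain-and-reconcile sequence can be sketched as two small helpers. Both function names are hypothetical, and the gpu-operator namespace is the one used elsewhere in this guide; the kubectl flags themselves are standard.

```shell
# Evict ordinary workloads from the node. DaemonSet pods, which include
# the operator's per-node components, are left in place.
drain_gpu_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Show the operator's pods on that node so you can watch them recover.
operator_pods_on() {
  kubectl get pods -n gpu-operator -o wide \
    --field-selector "spec.nodeName=$1"
}
```

Run drain_gpu_node gpu-node-01, repeat operator_pods_on gpu-node-01 until everything is Running or Completed, and only then run kubectl uncordon gpu-node-01.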

Is it worth it?

If you have exactly one GPU box and one model to serve, Kubernetes is overkill — run Ollama in a plain container and go and have your evening back. The operator, the device plugin, the time-slicing config and the taints are a lot of yak-shaving for a single workload.

But the moment you have several models or batch jobs competing for one or more GPUs, the calculus flips. Kubernetes gives you a real scheduler, declarative resource requests, and a sane way to share scarce hardware between workloads that would otherwise fight. That’s exactly the problem it was built for — it just needed teaching what a GPU is first. For a homelab that’s outgrown a single container, it’s worth the evening.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #AI #kubernetes #GPU #self-hosting
