Running AI Inference on Kubernetes: GPU Scheduling, Ollama, and Resource Sharing
Getting a cluster to hand a model the GPU it needs and no more

Kubernetes was designed for a world of stateless web services you could scale by adding more identical replicas. GPUs are the opposite of that: scarce, expensive, and absolutely not interchangeable with CPU. So the moment you decide to run model inference on your cluster, you discover that Kubernetes treats your graphics card as a curious unknown — it doesn’t schedule on it, it can’t see it, and your pods come up GPU-less and confused.
Getting from there to “Ollama is serving a model on a cluster GPU, and three workloads are politely sharing it” took me an evening of swearing and a clearer mental model than I started with. Here’s the map I wish I’d had.
1 Why Kubernetes can’t see your GPU
Out of the box, the kubelet tracks only built-in resources such as CPU, memory, and ephemeral storage. A GPU is, as far as it’s concerned, none of its business. The bridge is the device plugin framework: a small agent runs on each GPU node, discovers the hardware, and advertises it to the kubelet as a schedulable extended resource with a name like nvidia.com/gpu.
For NVIDIA cards, the practical route is the GPU Operator, which installs the device plugin, the right drivers, and the container runtime hooks in one go. Once it’s running, kubectl describe node shows your GPU as allocatable capacity, and the scheduler can finally place pods on it. Skip this step and every “why won’t my pod find CUDA” rabbit hole leads back here.
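As a rough guide to what "allocatable capacity" looks like, here is the relevant part of a node's status once the device plugin has registered. This is an illustrative excerpt for a node with a single card, not output captured from a real cluster; the count will match however many GPUs the node has.

```yaml
# Excerpt of `kubectl get node <gpu-node> -o yaml` (illustrative; one GPU assumed)
status:
  capacity:
    nvidia.com/gpu: "1"     # what the device plugin discovered on the node
  allocatable:
    nvidia.com/gpu: "1"     # what the scheduler is allowed to hand out
```

If nvidia.com/gpu is missing from both maps, the plugin has not registered and no pod requesting a GPU will schedule on that node.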
2 Asking for a GPU
With the plugin in place, requesting a GPU looks like any other resource request, with one difference: you put it under limits. (If you also set requests, the value has to match the limit.) The wrinkle is that GPUs are non-divisible by default: you request a whole one, you get a whole one, and a second pod asking for the same card waits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
Note the persistent volume. Ollama caches pulled models on disk, and without somewhere durable to put them, every pod restart re-downloads several gigabytes. Mount a PVC at /root/.ollama and that pain goes away.
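The Deployment above assumes a claim named ollama-models already exists. A minimal sketch of one follows; the 50Gi size and the reliance on a default StorageClass are assumptions, so adjust both to your cluster and to the models you plan to pull.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models          # must match claimName in the Deployment
spec:
  accessModes:
    - ReadWriteOnce            # one node mounts it at a time, fine for a single replica
  resources:
    requests:
      storage: 50Gi            # assumption: size it to the models you expect to cache
  # storageClassName is omitted, so the cluster's default StorageClass is used if one is set
```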
3 The resource-sharing problem
Here’s the catch that bites everyone: that nvidia.com/gpu: 1 claims the entire card. If you have one GPU and two workloads that each ask for one, the second sits Pending until the first releases the card, and with a long-running server like Ollama that means forever. For a homelab with a single decent card, that’s a disaster: you wanted to run a chat model and an embedding model and the occasional batch job, not pick one.
There are three honest answers. Time-slicing lets the device plugin advertise one physical GPU as several virtual ones; the workloads take turns on the hardware. It’s the simplest to set up and right for bursty, low-contention use, like a homelab where things rarely run flat out at once. MPS (Multi-Process Service) lets several processes genuinely run concurrently on one GPU and can cap each client’s share of compute and memory, though those are software limits rather than hardware walls. MIG, on data-centre cards that support it, carves the GPU into hardware-partitioned slices with real memory boundaries.
For most self-hosters, time-slicing via the device plugin config is the answer:
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
data:
  config.yaml: |
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
Now one physical GPU advertises as four, and four pods can each request nvidia.com/gpu: 1 and all land on the same card. They share the hardware by taking turns. The honest caveat: there’s no memory isolation. If two models together want more VRAM than the card has, one of them gets an out-of-memory error and dies. Time-slicing shares compute time, not memory, so you still have to do the VRAM arithmetic in your head.
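A ConfigMap on its own does nothing until the device plugin is told to read it. With the GPU Operator, that wiring lives in the ClusterPolicy resource. The fragment below is a sketch of the relevant fields, assuming the policy is called cluster-policy and the ConfigMap sits in the operator's own namespace; check both names against your install before merging it in.

```yaml
# Fragment of the GPU Operator's ClusterPolicy (resource name is an assumption)
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: device-plugin-config   # the ConfigMap defined above
      default: config.yaml         # which key in the ConfigMap's data to apply by default
```

After the device plugin pods restart, the GPU node's allocatable nvidia.com/gpu count should read 4 per physical card.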
4 Keeping the GPU workloads where they belong
Once a couple of nodes have GPUs and the rest don’t, you want inference pods to land only on the capable ones, and you want non-GPU workloads to stay off them. Taints and tolerations do the fencing: taint the GPU nodes so nothing schedules there unless it explicitly tolerates the taint, then add that toleration to your inference deployments. A toleration only permits a pod onto a tainted node; it doesn’t steer it there. A nodeSelector on a GPU-node label supplies that pull. Between them, your cluster stops trying to run nginx on the expensive silicon.
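Both halves of that fence can be sketched in a few lines. The taint key nvidia.com/gpu is a common convention rather than a requirement, and the gpu: "true" label is hypothetical; substitute whatever key and label your nodes actually carry.

```yaml
# Node side: the taint as it appears in the Node spec
# (usually applied with `kubectl taint nodes <node> nvidia.com/gpu=present:NoSchedule`).
spec:
  taints:
    - key: nvidia.com/gpu      # assumption: conventional key, any key works
      value: present
      effect: NoSchedule       # pods without a matching toleration are not scheduled here
---
# Pod side: goes under spec.template.spec of the inference Deployment.
nodeSelector:
  gpu: "true"                  # hypothetical label, e.g. `kubectl label nodes <node> gpu=true`
tolerations:
  - key: nvidia.com/gpu
    operator: Exists           # tolerate the taint whatever its value
    effect: NoSchedule
```

The taint keeps everything else off the GPU nodes; the selector keeps the inference pods from wandering onto CPU-only ones.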
5 Is it worth it?
If you have exactly one GPU box and one model to serve, Kubernetes is overkill: run Ollama in a plain container and get your evening back. The operator, the device plugin, the time-slicing config and the taints are a lot of yak-shaving for a single workload.
But the moment you have several models or batch jobs competing for one or more GPUs, the calculus flips. Kubernetes gives you a real scheduler, declarative resource requests, and a sane way to share scarce hardware between workloads that would otherwise fight. That’s exactly the problem it was built for — it just needed teaching what a GPU is first. For a homelab that’s outgrown a single container, it’s worth the evening.




