Local AI on Your Own Metal: Running LLMs Offline with Ollama

Private, free, and surprisingly capable

24-02-2026

Local AI on Your Own Metal: Running LLMs Offline with Ollama

Contents

Not so long ago the idea of a capable language model running on the computer under your desk, with no internet connection and no monthly bill, sounded faintly absurd. The assumption baked into the whole industry was that the clever part lived in someone else’s datacentre, reachable only through an API and a credit card. That assumption no longer holds. A tool called Ollama has made running open-weight language models on your own hardware about as difficult as installing a music player. This guide shows you how to do it, what to expect from the machine you already own, and where the honest limits lie.

I run local models for the same reason I run most things myself: I like knowing where my data goes and I dislike renting capabilities I could own. If you’ve read my piece on what running a blog on your own infrastructure teaches you, you’ll recognise the instinct. Local AI is the same bet applied to the newest, shiniest, most cloud-shaped corner of computing — and it pays off more than you’d expect.

Why Run a Model Locally at All

There are three solid reasons, and most people care about at least one. The first is privacy. When you type into a cloud chatbot, your words travel to a remote server and may be logged, retained, or used to train future models. A locally run model never sees the network; your prompts about your finances, your health, or your half-finished novel stay on the machine in front of you.

The second is cost. Cloud APIs charge per token, and a heavy habit adds up. A model running locally costs nothing per query beyond the electricity it draws. The third is independence. A local model works on a plane, in a cabin with no signal, or during the inevitable afternoon when the cloud provider has an outage. Once the weights are on your disk, nobody can rate-limit you, deprecate the model out from under you, or change the terms of service.

What Ollama Actually Is

Ollama is a small, friendly piece of software that handles the awkward parts of running open-weight language models so you do not have to. Underneath it sits the well-known inference engine that runs these models efficiently on ordinary hardware, but Ollama wraps that in a clean experience: a single command to pull a model, a single command to chat with it, and a background service that manages loading the weights into memory and unloading them when idle.

Think of it as a package manager and runtime for models. You ask for a model by name, it downloads a ready-to-run, quantised version, and you start talking to it. It also exposes a local HTTP server so your own scripts and applications can call the model programmatically. All of this runs on Linux, macOS, and Windows, with no account and no telemetry required to function.

Installing on Linux

Installation on Linux is a single command, which fetches and sets up the service:

1
curl -fsSL https://ollama.com/install.sh | sh

The script installs the Ollama binary and registers it as a systemd service that starts automatically. You can confirm it is running:

1
2
systemctl status ollama
ollama --version

If you prefer to inspect what you are running before piping a script into your shell, the install script is plain text and worth a read first. The service listens locally on port 11434, which becomes important when we reach the API.

Pulling and Running a Model

The headline command is gloriously simple. To download a model and drop straight into an interactive chat:

1
ollama run llama3.2

The first invocation downloads the weights, which may be a few gigabytes, and subsequent runs start instantly from the local copy. Once the prompt appears you simply type, and the model responds in your terminal. Type /bye to exit. To see what you have downloaded and reclaim space later:

1
2
ollama list
ollama rm llama3.2

There is a whole library of models to choose from — several thousand at this point — ranging from tiny ones that run on a laptop to large ones that demand serious hardware, and each is pulled by name in exactly the same way. The names you’ll meet most often are Meta’s Llama family (llama3.2 in 1B and 3B sizes for modest machines, larger variants for serious ones), Google’s gemma3 (which comes as small as 270M and goes up to 27B, and handles images as well as text), Mistral’s models, Alibaba’s Qwen, and DeepSeek’s reasoning-focused releases. Which one is right for your hardware and your actual work is a whole question in itself, and I’ve put the popular small models head to head in a practical comparison of Llama, Mistral and Gemma — start there if you want a recommendation rather than a shrug.

Model Sizes, Memory, and Quantisation

Here is the single most important thing to understand before you go shopping for a model: the number in a model’s name, such as 7B or 70B, counts its parameters in billions, and that number is the dominant predictor of how much memory you need. As a rough rule of thumb for the quantised formats Ollama uses by default, you should expect to need somewhere around the model’s parameter count in gigabytes of memory, give or take, so a 7B model wants roughly eight gigabytes free and a 70B model is well beyond a typical desktop.

Quantisation is the trick that makes this feasible. Full-precision weights are bulky, so models are compressed by storing each weight with fewer bits, dramatically shrinking the memory footprint with only a modest cost to quality. A 4-bit quantised model is the common sweet spot, fitting comfortably where the original would not while still producing genuinely useful output. Start with a smaller model than you think you need, confirm it runs smoothly, then size up only if your hardware has room to spare.

GPU Versus CPU

Ollama will happily run on a processor alone, and for smaller models on a modern machine this is perfectly usable, if unhurried. But a capable graphics card transforms the experience. Language model inference is exactly the kind of massively parallel arithmetic that GPUs excel at, so offloading the work to one can speed up responses by a large margin.

Ollama detects a supported GPU automatically and uses it without configuration, falling back to the processor when no suitable card is present or when a model is too large to fit in video memory. The constraint that matters most for the GPU path is VRAM: the model’s weights must fit in the card’s memory to run fully accelerated. If they do not, Ollama splits the work between card and processor, which still helps but slows things down. When choosing a model, treat your available VRAM as the ceiling, and remember that a smaller, fully accelerated model often feels far nicer to use than a larger one crawling along on the CPU.

Talking to It Over HTTP

The terminal chat is convenient, but the real power comes from the local API, which lets your own programs use the model. Ollama serves requests at http://localhost:11434, and a single curl call demonstrates the generate endpoint:

1
2
3
4
5
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a reverse proxy does in two sentences.",
  "stream": false
}'

Setting stream to false returns the whole response at once as JSON, which is the simplest mode to script against; leaving it on streams tokens as they are produced for a live, typewriter feel. There is also a chat endpoint that accepts a structured conversation history, which is what you reach for when building anything resembling an assistant. Because the interface is plain HTTP and JSON, wiring a local model into a note-taking app, a code editor, or a home automation script is straightforward in any language.

When It Goes Wrong: Troubleshooting

Ollama is unusually forgiving, but a handful of problems come up often enough to be worth pre-empting.

The model runs, but painfully slowly. This is almost always the CPU fallback in disguise: the model didn’t fit in VRAM, so Ollama split it between card and processor, and the processor half is dragging. Run ollama ps while a model is loaded — it shows how much of the model is on GPU versus CPU. If you see a chunk on CPU, you’ve overshot your VRAM; drop to a smaller model or a more aggressive quantisation and the speed usually returns in full.

The GPU isn’t being used at all. On Linux with an Nvidia card this is nearly always the driver and container-toolkit layer rather than Ollama itself. Confirm the host sees the card with nvidia-smi first; if that works but Ollama still runs on CPU, the service may have started before the driver was ready — restart it with systemctl restart ollama and check its logs with journalctl -u ollama. AMD cards need a supported ROCm setup, which is more finicky; check Ollama’s hardware notes for your specific GPU.

Something else is already on port 11434. If the API won’t answer, another Ollama instance (or a container publishing the same port) may be holding it. ss -tlnp | grep 11434 shows you the culprit. This matters especially if you also run inference in containers — the port collision is a classic self-inflicted wound.

You’ve filled the disk with models. Weights are large and they accumulate. ollama list shows what you’ve got and their sizes; ollama rm <model> reclaims the space. A monthly glance at this saves an unpleasant surprise later.

If you find yourself wanting to run this at more than single-machine scale — sharing a GPU across several workloads, scheduling inference jobs — that’s a different and larger problem, and I’ve written it up separately in running AI inference on Kubernetes. For a single box under your desk, though, none of that complexity is needed.

What It Is Good For, and Where It Stops

Local models shine at a broad sweep of everyday tasks: drafting and rewriting text, summarising documents, answering general questions, brainstorming, explaining code, and powering small automations that you would never want to send to a cloud service. For a great deal of practical work, a well-chosen local model is entirely sufficient and the privacy and cost benefits are pure upside.

It would be dishonest, though, to pretend the experience matches the very largest cloud models. The frontier systems run on hardware no home can match, and that shows in the hardest reasoning tasks, the longest documents, and the most obscure knowledge. A model small enough to run on your desk will sometimes be less accurate, less nuanced, and more prone to confidently stating something wrong. The sensible posture is to treat local models as a capable, private workhorse for the bulk of what you do, while recognising the cloud’s frontier still leads on the very hardest problems.

The Datacentre on Your Desk

The arc from the clumsy text generators of a few years ago to a genuinely useful assistant running offline on ordinary hardware is one of the more quietly remarkable shifts in computing. Ollama distils all of it into two commands and a local port. You give up a little capability at the extreme top end and gain privacy, zero marginal cost, and the simple satisfaction of an intelligence that lives entirely on your own metal. Download a small model this afternoon, ask it something, and watch the future arrive on the machine you already own.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged#linux #ai #ollama #llm #selfhosted