Local AI on Your Own Metal: Running LLMs Offline with Ollama
Private, free, and surprisingly capable

Not so long ago the idea of a capable language model running on the computer under your desk, with no internet connection and no monthly bill, sounded faintly absurd. We have written before about the leap from the stumbling early days of GPT-2 to the polished conversations of modern chatbots, and the assumption baked into all of it was that the clever part lived in someone else’s datacentre. That assumption no longer holds. A tool called Ollama has made running open-weight language models on your own hardware about as difficult as installing a music player. This guide shows you how to do it, what to expect from the machine you already own, and where the honest limits lie.
1 Why Run a Model Locally at All
There are three solid reasons, and most people care about at least one. The first is privacy. When you type into a cloud chatbot, your words travel to a remote server and may be logged, retained, or used to train future models. A locally run model never sees the network; your prompts about your finances, your health, or your half-finished novel stay on the machine in front of you.
The second is cost. Cloud APIs charge per token, and a heavy habit adds up. A model running locally costs nothing per query beyond the electricity it draws. The third is independence. A local model works on a plane, in a cabin with no signal, or during the inevitable afternoon when the cloud provider has an outage. Once the weights are on your disk, nobody can rate-limit you, deprecate the model out from under you, or change the terms of service.
2 What Ollama Actually Is
Ollama is a small, friendly piece of software that handles the awkward parts of running open-weight language models so you do not have to. Underneath it sits the well-known inference engine that runs these models efficiently on ordinary hardware, but Ollama wraps that in a clean experience: a single command to pull a model, a single command to chat with it, and a background service that manages loading the weights into memory and unloading them when idle.
Think of it as a package manager and runtime for models. You ask for a model by name, it downloads a ready-to-run, quantised version, and you start talking to it. It also exposes a local HTTP server so your own scripts and applications can call the model programmatically. All of this runs on Linux, macOS, and Windows, with no account and no telemetry required to function.
3 Installing on Linux
Installation on Linux is a single command, which fetches and sets up the service:
curl -fsSL https://ollama.com/install.sh | shThe script installs the Ollama binary and registers it as a systemd service that starts automatically. You can confirm it is running:
systemctl status ollama
ollama --versionIf you prefer to inspect what you are running before piping a script into your shell, the install script is plain text and worth a read first. The service listens locally on port 11434, which becomes important when we reach the API.
4 Pulling and Running a Model
The headline command is gloriously simple. To download a model and drop straight into an interactive chat:
ollama run llama3.2The first invocation downloads the weights, which may be a few gigabytes, and subsequent runs start instantly from the local copy. Once the prompt appears you simply type, and the model responds in your terminal. Type /bye to exit. To see what you have downloaded and reclaim space later:
ollama list
ollama rm llama3.2There is a whole library of models to choose from, ranging from tiny ones that run on a laptop to large ones that demand serious hardware, and each is pulled by name in exactly the same way.
5 Model Sizes, Memory, and Quantisation
Here is the single most important thing to understand before you go shopping for a model: the number in a model’s name, such as 7B or 70B, counts its parameters in billions, and that number is the dominant predictor of how much memory you need. As a rough rule of thumb for the quantised formats Ollama uses by default, you should expect to need somewhere around the model’s parameter count in gigabytes of memory, give or take, so a 7B model wants roughly eight gigabytes free and a 70B model is well beyond a typical desktop.
Quantisation is the trick that makes this feasible. Full-precision weights are bulky, so models are compressed by storing each weight with fewer bits, dramatically shrinking the memory footprint with only a modest cost to quality. A 4-bit quantised model is the common sweet spot, fitting comfortably where the original would not while still producing genuinely useful output. Start with a smaller model than you think you need, confirm it runs smoothly, then size up only if your hardware has room to spare.
6 GPU Versus CPU
Ollama will happily run on a processor alone, and for smaller models on a modern machine this is perfectly usable, if unhurried. But a capable graphics card transforms the experience. Language model inference is exactly the kind of massively parallel arithmetic that GPUs excel at, so offloading the work to one can speed up responses by a large margin.
Ollama detects a supported GPU automatically and uses it without configuration, falling back to the processor when no suitable card is present or when a model is too large to fit in video memory. The constraint that matters most for the GPU path is VRAM: the model’s weights must fit in the card’s memory to run fully accelerated. If they do not, Ollama splits the work between card and processor, which still helps but slows things down. When choosing a model, treat your available VRAM as the ceiling, and remember that a smaller, fully accelerated model often feels far nicer to use than a larger one crawling along on the CPU.
7 Talking to It Over HTTP
The terminal chat is convenient, but the real power comes from the local API, which lets your own programs use the model. Ollama serves requests at http://localhost:11434, and a single curl call demonstrates the generate endpoint:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain what a reverse proxy does in two sentences.",
"stream": false
}'Setting stream to false returns the whole response at once as JSON, which is the simplest mode to script against; leaving it on streams tokens as they are produced for a live, typewriter feel. There is also a chat endpoint that accepts a structured conversation history, which is what you reach for when building anything resembling an assistant. Because the interface is plain HTTP and JSON, wiring a local model into a note-taking app, a code editor, or a home automation script is straightforward in any language.
8 What It Is Good For, and Where It Stops
Local models shine at a broad sweep of everyday tasks: drafting and rewriting text, summarising documents, answering general questions, brainstorming, explaining code, and powering small automations that you would never want to send to a cloud service. For a great deal of practical work, a well-chosen local model is entirely sufficient and the privacy and cost benefits are pure upside.
It would be dishonest, though, to pretend the experience matches the very largest cloud models. The frontier systems run on hardware no home can match, and that shows in the hardest reasoning tasks, the longest documents, and the most obscure knowledge. A model small enough to run on your desk will sometimes be less accurate, less nuanced, and more prone to confidently stating something wrong. The sensible posture is to treat local models as a capable, private workhorse for the bulk of what you do, while recognising the cloud’s frontier still leads on the very hardest problems.
9 The Datacentre on Your Desk
The arc from the clumsy text generators of a few years ago to a genuinely useful assistant running offline on ordinary hardware is one of the more quietly remarkable shifts in computing. Ollama distils all of it into two commands and a local port. You give up a little capability at the extreme top end and gain privacy, zero marginal cost, and the simple satisfaction of an intelligence that lives entirely on your own metal. Download a small model this afternoon, ask it something, and watch the future arrive on the machine you already own.