Your First Local AI Coding Assistant: Wiring Ollama into Your Editor

Autocomplete that never leaves your laptop

Cloud coding assistants are wonderful right up until you remember where your code is going. Every keystroke, every half-finished function, every comment grumbling about a colleague’s API design is shipped off to someone else’s server. For a side project that scarcely matters; for proprietary code under a strict NDA it can be a genuine problem. The good news is that you can run a capable coding assistant entirely on your own machine, with no network round-trips and no data leaving the building. If you have already met Ollama in our introductory piece, this guide takes the next step: wiring a local model directly into your editor so it suggests code as you type.

The appeal is threefold. The first reason is privacy: your code never travels anywhere. For regulated industries, client work, or simply a healthy distrust of where your data ends up, that is reason enough. The second is offline capability: on a train, a plane, or a flaky conference Wi-Fi, a local assistant keeps working while cloud tools stare blankly at a connection error. The third is cost and control: no per-token billing, no monthly subscription, no surprise rate limits — just your hardware doing the work whenever you want it.

Be honest with yourself about the trade-off, though. A model running on your laptop will not match the very largest cloud models for raw capability. What it offers instead is a private, always-available assistant that is genuinely good at the bread-and-butter tasks: completing lines, drafting functions, explaining unfamiliar code, and writing boilerplate. For a great deal of everyday programming, that is exactly what you need.

Before touching your editor, get the foundations in place:

  1. Ollama installed and running. If you followed the earlier introduction you already have it. If not, install it from the official site and confirm it works with ollama --version.
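  2. A code-capable model pulled. General chat models can write code, but models trained specifically on code do it better and often support fill-in-the-middle completion, the mode inline autocomplete relies on: the model sees the text before and after your cursor and fills the gap. Pull one with a command like the following.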
# A solid general-purpose code model for chat and explanation
ollama pull qwen2.5-coder:7b

# A small, fast model well suited to inline autocomplete
ollama pull qwen2.5-coder:1.5b

Confirm Ollama is serving its local API, which is what your editor will talk to:

curl http://localhost:11434/api/tags

A JSON list of your installed models means the endpoint is live and listening on the default port 11434.
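If you would rather script the same check, here is a minimal Python sketch using only the standard library. The helper names (model_names, fetch_model_names) are made up for this example, and the sample payload is hand-written to mirror the shape of the /api/tags response rather than copied from a real run.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def fetch_model_names(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Hand-written payload shaped like the real response, so this runs offline.
sample = {"models": [{"name": "qwen2.5-coder:7b"}, {"name": "qwen2.5-coder:1.5b"}]}
print(model_names(sample))  # ['qwen2.5-coder:1.5b', 'qwen2.5-coder:7b']
```

With Ollama running, calling fetch_model_names() should return the same kind of list for your own machine; a connection error means the server is not listening on that address.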

You need a bridge between your editor and Ollama. Continue is a popular open-source extension that supports both VS Code and the JetBrains family, and it speaks to local models happily.

For VS Code:

  1. Open the Extensions panel with Ctrl+Shift+X (or Cmd+Shift+X on macOS).
  2. Search for Continue and install it.
  3. A new Continue icon appears in the sidebar; click it to open the assistant panel.

For JetBrains IDEs such as IntelliJ or PyCharm, open Settings → Plugins, search the marketplace for Continue, install it, and restart the IDE. The configuration that follows is shared across both platforms.

Continue is configured through a YAML file, config.yaml (config.json in older versions), in a .continue folder in your home directory. The key move is to declare Ollama as the provider so nothing reaches out to the cloud. A minimal configuration looks like this:

name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Coder Chat
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
  - name: Coder Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete

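If your Ollama instance runs somewhere other than the default address, add an explicit apiBase to that model's entry. The snippet shows the default value; substitute your own host and port: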

    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434

Save the file and Continue picks up the changes. From now on, both the chat panel and the inline suggestions are served by models running on your own machine.
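A common stumbling block is a configuration that names a model you never pulled. The sketch below is one way to catch that, not part of Continue itself. It assumes you have already parsed the config into a dict (with whatever YAML library you prefer) and have a list of installed model names, for example from /api/tags. The function names are hypothetical.

```python
def normalise(name: str) -> str:
    """Ollama stores untagged models under the implicit ':latest' tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(config: dict, installed: list[str]) -> list[str]:
    """Return Ollama models the config references that are not installed."""
    have = {normalise(n) for n in installed}
    wanted = [
        m["model"]
        for m in config.get("models", [])
        if m.get("provider") == "ollama"
    ]
    return [name for name in wanted if normalise(name) not in have]


# The article's configuration, written as the dict a YAML parser would produce.
config = {
    "models": [
        {"name": "Coder Chat", "provider": "ollama", "model": "qwen2.5-coder:7b"},
        {"name": "Coder Autocomplete", "provider": "ollama", "model": "qwen2.5-coder:1.5b"},
    ]
}

print(missing_models(config, ["qwen2.5-coder:7b"]))  # ['qwen2.5-coder:1.5b']
```

An empty list means every model the config asks for is available locally; anything else is a name to feed to ollama pull.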

Notice that the configuration above defines two separate models for two distinct jobs, and the distinction matters.

The chat model is the one you converse with: you select a block of code, ask it to refactor a function, request an explanation, or have it draft a test. This is a deliberate, on-demand interaction, so you can afford a larger and slower model that produces higher-quality answers. The 7-billion-parameter coder above is a sensible choice.

The autocomplete model is the one that suggests the rest of a line as you type, in the same ghostly grey text you may know from cloud assistants. This needs to be fast — suggestions that arrive a second after you have already typed the next line are useless. A smaller, snappier model in the 1-to-3-billion range is the right tool, even if it is a touch less clever. Splitting the roles lets each job use the model best suited to it rather than forcing one compromise on both.
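To get a feel for whether a model is quick enough for autocomplete, you can time a single fill-in-the-middle request against Ollama's /api/generate endpoint yourself. This is a rough probe under stated assumptions, not a description of how Continue issues its requests: the helper names, the 64-token cap, and the 30-second timeout are all choices made for this sketch.

```python
import json
import time
import urllib.request


def build_fim_request(prefix: str, suffix: str,
                      model: str = "qwen2.5-coder:1.5b",
                      max_tokens: int = 64) -> dict:
    """Build a non-streaming fill-in-the-middle request body."""
    return {
        "model": model,
        "prompt": prefix,   # text before the cursor
        "suffix": suffix,   # text after the cursor
        "stream": False,    # one JSON object back instead of a stream
        "options": {"num_predict": max_tokens, "temperature": 0},
    }


def timed_completion(payload: dict,
                     base_url: str = "http://localhost:11434") -> tuple[str, float]:
    """Send the request and return (completion text, seconds elapsed)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["response"], time.perf_counter() - start


payload = build_fim_request("def add(a, b):\n    return ", "\n\nprint(add(2, 3))\n")
print(json.dumps(payload, indent=2))
# With Ollama running: text, seconds = timed_completion(payload)
```

Run the timed call a few times, since the first request also pays for loading the model into memory. If the later calls still take around a second, try a smaller model for the autocomplete role.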

Local models live or die by memory: system RAM on most machines, unified memory on Apple Silicon, or VRAM if you have a dedicated GPU. A rough guide for quantised models: a 1.5B model is comfortable in around 2 GB, a 7B model wants roughly 6 to 8 GB, and anything in the 13B-and-up range starts to demand 16 GB or more before it runs smoothly.
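The arithmetic behind those figures is simple enough to sketch. The function below estimates the size of the weights alone, assuming roughly 4 bits per parameter. That is an assumption for illustration, since real quantisation formats typically use somewhat more. The figures quoted above are higher because they also leave room for the context cache, the runtime, and everything else your machine is doing.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantised weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 parameters * (bits / 8) bytes each, divided by 1e9
    return params_billion * bits_per_weight / 8


for label, size in [("1.5B", 1.5), ("7B", 7.0), ("13B", 13.0)]:
    print(f"{label}: ~{weights_gb(size):.2f} GB of weights at 4-bit")
# 1.5B: ~0.75 GB of weights at 4-bit
# 7B: ~3.50 GB of weights at 4-bit
# 13B: ~6.50 GB of weights at 4-bit
```

Treat the result as a floor rather than a budget: if the weights alone would take most of your free memory, the model will not run comfortably.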

If you have a machine with 8 GB of RAM, lean on smaller models and keep expectations modest. With 16 GB you can run a 7B chat model and a small autocomplete model side by side comfortably. With 32 GB or a dedicated GPU you have real headroom for larger, more capable models. The honest truth is that the experience scales directly with your hardware, so start small, confirm the workflow feels good, and size up only if your machine has room to spare. A fast small model you actually use beats a sluggish large one you keep disabling.

The standout benefit is that you can now write code under the strictest confidentiality with an assistant at your side. No proprietary logic, no customer data, no embarrassing variable names ever leave your laptop. Auditors relax, NDAs stay intact, and you stop wondering where your keystrokes went.

The limits are worth stating plainly. Local models lag the frontier cloud models on the hardest reasoning, the most obscure libraries, and very large multi-file refactors. They have smaller context windows, so they hold less of your codebase in mind at once. And they ask something of your hardware that cloud tools never do — your fan will spin up, your battery will drain faster, and a heavy model can make an older laptop sweat. The sensible posture is pragmatic: use the local assistant as your private default for everyday work, and reach for a cloud tool only when a task genuinely exceeds what your machine can handle.

A private coding assistant is no longer exotic. With Ollama serving a code model locally and Continue wiring it into your editor, you get inline autocomplete and a capable chat companion that work offline, cost nothing per use, and never leak a line of your code. Split the chat and autocomplete roles across an appropriately sized pair of models, match your ambitions to your RAM, and you have a setup that respects both your productivity and your privacy. Start small, see how it feels, and enjoy the rare luxury of an AI helper that lives entirely on your own machine.