Your First Local AI Coding Assistant: Wiring Ollama into Your Editor

Autocomplete that never leaves your laptop

29-04-2026

Cloud coding assistants are wonderful right up until you remember where your code is going. Every keystroke, every half-finished function, every comment grumbling about a colleague’s API design is shipped off to someone else’s server. For a side project that scarcely matters; for proprietary code under a strict NDA it can be a genuine problem. The good news is that you can run a capable coding assistant entirely on your own machine, with no network round-trips and no data leaving the building. If you have already met Ollama in our introductory piece, this guide takes the next step: wiring a local model directly into your editor so it suggests code as you type.

Why Bother Going Local

The appeal is threefold. The first reason is privacy: your code never travels anywhere. For regulated industries, client work, or simply a healthy distrust of where your data ends up, that is reason enough. The second is offline capability: on a train, a plane, or a flaky conference Wi-Fi, a local assistant keeps working while cloud tools stare blankly at a connection error. The third is cost and control: no per-token billing, no monthly subscription, no surprise rate limits — just your hardware doing the work whenever you want it.

Be honest with yourself about the trade-off, though. A model running on your laptop will not match the very largest cloud models for raw capability. What it offers instead is a private, always-available assistant that is genuinely good at the bread-and-butter tasks: completing lines, drafting functions, explaining unfamiliar code, and writing boilerplate. For a great deal of everyday programming, that is exactly what you need.

This guide assumes you already have Ollama serving models; if not, the companion piece on running LLMs on your own metal with Ollama covers installation and pulling models from scratch. If your ambitions later outgrow a single laptop — say you want a shared inference endpoint the whole household or team can hit — the same models scale onto a cluster, as covered in running AI inference on Kubernetes.

What You Need First

Before touching your editor, get the foundations in place:

Ollama installed and running. If you followed the earlier introduction you already have it. If not, install it from the official site and confirm it works with ollama --version.
A code-capable model pulled. General chat models can write code, but models trained specifically on code do it better and often support a special completion mode. Pull one with a command like the following.

# A solid general-purpose code model for chat and explanation
ollama pull qwen2.5-coder:7b

# A small, fast model well suited to inline autocomplete
ollama pull qwen2.5-coder:1.5b

Confirm Ollama is serving its local API, which is what your editor will talk to:

curl http://localhost:11434/api/tags

A JSON list of your installed models means the endpoint is live and listening on the default port 11434.
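If you would rather script that check than eyeball the JSON, the same endpoint can be read from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the response carries a top-level "models" list whose entries each have a "name" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def installed_models(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def missing_models(tags_response: dict, wanted: list[str]) -> list[str]:
    """Return the wanted models that are not installed yet."""
    have = set(installed_models(tags_response))
    return [name for name in wanted if name not in have]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> dict:
    """GET /api/tags and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    wanted = ["qwen2.5-coder:7b", "qwen2.5-coder:1.5b"]
    try:
        tags = fetch_tags()
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        print(f"Ollama not reachable at {OLLAMA_URL}: {exc}")
    else:
        print("Installed:", installed_models(tags))
        print("Missing:", missing_models(tags, wanted))
```

An empty "Missing" list means both models from the pull commands above are ready for the editor to use.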

Installing the Editor Extension

You need a bridge between your editor and Ollama. Continue is a popular open-source extension that supports both VS Code and the JetBrains family, and it speaks to local models happily.

For VS Code:

Open the Extensions panel with Ctrl+Shift+X (or Cmd+Shift+X on macOS).
Search for Continue and install it.
A new Continue icon appears in the sidebar; click it to open the assistant panel.

For JetBrains IDEs such as IntelliJ or PyCharm, open Settings → Plugins, search the marketplace for Continue, install it, and restart the IDE. The configuration that follows is shared across both platforms.

Pointing It at Local Ollama

Continue is configured through a file in a .continue folder in your home directory: config.yaml in current versions, config.json in older ones. The key move is to declare Ollama as the provider so nothing reaches out to the cloud. A minimal configuration looks like this:

name: Local Assistant
version: 1.0.0
models:
  - name: Coder Chat
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
  - name: Coder Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete

If your Ollama instance runs somewhere other than the default, add an explicit endpoint with apiBase, replacing the default address shown here with your own:

    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434

One detail that trips people up: for the autocomplete role, the model must support fill-in-the-middle (FIM) completion, because inline suggestions need to predict code between what you have typed and what comes after the cursor, not just continue from the end. The qwen2.5-coder, starcoder2, and deepseek-coder families ship FIM-capable builds on Ollama; a general chat model in the autocomplete slot will either produce nonsense or nothing at all. You can also tune the behaviour with autocompleteOptions:

  - name: Coder Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 350
      maxPromptTokens: 1024

The debounceDelay (in milliseconds) is how long Continue waits after you stop typing before it asks the model for a suggestion — nudge it up if a slower machine is firing requests it never finishes, down if suggestions feel laggy. Save the file and Continue picks up the changes. From now on, both the chat panel and the inline suggestions are served by models running on your own machine.
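To make the effect of debounceDelay concrete, here is a small simulation of debouncing in general. It is not Continue's actual implementation, only a sketch of the idea: a request fires once the delay has passed with no further keystroke. The function name and the keystroke timings are made up for illustration.

```python
def requests_fired(keystroke_times_ms: list[int], debounce_delay_ms: int) -> list[int]:
    """Return the times (ms) at which a debounced completion request would fire.

    A request fires debounce_delay_ms after a keystroke, but only if no
    further keystroke arrives before the delay has elapsed.
    """
    fired = []
    for i, t in enumerate(keystroke_times_ms):
        is_last = i == len(keystroke_times_ms) - 1
        if is_last or keystroke_times_ms[i + 1] - t >= debounce_delay_ms:
            fired.append(t + debounce_delay_ms)
    return fired


# Three quick keystrokes, a pause, then two more.
keys = [0, 100, 200, 900, 1000]
print(requests_fired(keys, 350))  # [550, 1350]
print(requests_fired(keys, 50))   # [50, 150, 250, 950, 1050]
```

With a 350 ms delay the burst of typing produces two requests; with 50 ms every keystroke produces one, which is the pile-up a slower machine struggles to clear.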

Chat Versus Autocomplete

Notice that the configuration above defines two separate models for two distinct jobs, and the distinction matters.

The chat model is the one you converse with: you select a block of code, ask it to refactor a function, request an explanation, or have it draft a test. This is a deliberate, on-demand interaction, so you can afford a larger and slower model that produces higher-quality answers. The 7-billion-parameter coder above is a sensible choice.

The autocomplete model is the one that suggests the rest of a line as you type, in the same ghostly grey text you may know from cloud assistants. This needs to be fast — suggestions that arrive a second after you have already typed the next line are useless. A smaller, snappier model in the 1-to-3-billion range is the right tool, even if it is a touch less clever. Splitting the roles lets each job use the model best suited to it rather than forcing one compromise on both.
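Under the hood, an inline suggestion is a fill-in-the-middle request: the text before the cursor goes in as the prompt and the text after it as the suffix. The sketch below builds such a request for Ollama's /api/generate endpoint, which accepts a suffix field. The helper names are hypothetical, the option values are arbitrary, and it assumes the pulled model build supports a suffix; Continue's own prompt construction may well differ.

```python
import json
import urllib.request


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_request(prefix: str, suffix: str,
                      model: str = "qwen2.5-coder:1.5b",
                      max_tokens: int = 64) -> dict:
    """Build a JSON body for POST /api/generate with a prefix and a suffix."""
    return {
        "model": model,
        "prompt": prefix,   # code before the cursor
        "suffix": suffix,   # code after the cursor
        "stream": False,    # one JSON object instead of a token stream
        "options": {"num_predict": max_tokens, "temperature": 0},
    }


def complete(payload: dict, base_url: str = "http://localhost:11434") -> str:
    """Send the request to a running Ollama server and return the suggestion."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)["response"]


# A buffer with the cursor sitting just after "return ".
source = "def area(radius):\n    return \n\nprint(area(2))\n"
cursor = source.index("return ") + len("return ")
prefix, suffix = split_at_cursor(source, cursor)
payload = build_fim_request(prefix, suffix)
print(json.dumps(payload, indent=2))
# With Ollama running locally: print(complete(payload))
```

Because the model sees the print call that follows the cursor, it can propose an expression that fits between the two halves rather than rambling on from the end of the file.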

There is a third role worth knowing about: embeddings. If you want the assistant to answer questions about your whole codebase rather than just the file in front of you, Continue can index your project and retrieve relevant chunks, which needs a small embedding model such as nomic-embed-text. Pull it with Ollama and add an embed role entry, and the chat model gains context it would otherwise never see. It is optional, but on a large project it is the difference between an assistant that guesses and one that actually knows where your PaymentService lives.
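The retrieval step is easier to picture with a toy example. An embedding model turns each chunk of code into a vector, and chunks whose vectors point in a similar direction to the question's vector are handed to the chat model. The sketch below ranks chunks by cosine similarity; the file names and three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), and Continue's real indexing is more elaborate than this.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk names whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical index: chunk name -> embedding vector (toy values).
index = {
    "billing/payment_service.py": [0.9, 0.1, 0.0],
    "ui/theme.py": [0.0, 0.2, 0.9],
    "billing/invoice.py": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for "where is PaymentService defined?"

print(top_chunks(query, index))
```

Here the two billing files rank above the UI theme, so they are the ones that would be added to the chat model's context.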

Hardware and Model Size Realities

Local models live or die by memory: system RAM on most machines, unified memory on Apple Silicon, and VRAM when a dedicated GPU does the work. A rough guide for quantised models: a 1.5B model is comfortable in around 2 GB, a 7B model wants roughly 6 to 8 GB, and anything in the 13B-and-up range starts to demand 16 GB or more before it runs smoothly.

If you have a machine with 8 GB of RAM, lean on smaller models and keep expectations modest. With 16 GB you can run a 7B chat model and a small autocomplete model side by side comfortably. With 32 GB or a dedicated GPU you have real headroom for larger, more capable models. The honest truth is that the experience scales directly with your hardware, so start small, confirm the workflow feels good, and size up only if your machine has room to spare. A fast small model you actually use beats a sluggish large one you keep disabling.
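The rough figures above can be sanity-checked with a little arithmetic. The weights alone take about (parameters × bits per weight ÷ 8) bytes, which sets a floor on the memory a model needs. The sketch below assumes 4-bit quantisation and a 2 GB allowance for everything else; both numbers are illustrative assumptions, not measurements.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion: float, ram_gb: float,
                bits_per_weight: float = 4.0, headroom_gb: float = 2.0) -> bool:
    """Crude check: weights plus an assumed headroom must fit in memory.

    headroom_gb is a guess covering the context cache, the runtime and
    whatever else the machine is doing.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + headroom_gb <= ram_gb


for size in (1.5, 7, 13):
    print(f"{size}B at 4-bit: about {weight_memory_gb(size):.2f} GB of weights")
```

That gives 0.75, 3.50 and 6.50 GB for the three sizes. The guide figures are noticeably higher because the weights are only part of the bill: the context cache, the runtime, your editor and the operating system all need room too.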

The Privacy Dividend, and the Limits

The standout benefit is that you can now write code under the strictest confidentiality with an assistant at your side. No proprietary logic, no customer data, no embarrassing variable names ever leave your laptop. Auditors relax, NDAs stay intact, and you stop wondering where your keystrokes went.

The limits are worth stating plainly. Local models lag the frontier cloud models on the hardest reasoning, the most obscure libraries, and very large multi-file refactors. They have smaller context windows, so they hold less of your codebase in mind at once. And they ask something of your hardware that cloud tools never do — your fan will spin up, your battery will drain faster, and a heavy model can make an older laptop sweat. The sensible posture is pragmatic: use the local assistant as your private default for everyday work, and reach for a cloud tool only when a task genuinely exceeds what your machine can handle.

When It Doesn’t Work: Troubleshooting

A local setup has more moving parts than a cloud plugin, so a few things reliably go wrong on first run. Here is how to diagnose them.

No suggestions appear at all. The usual culprits are an autocomplete model that is not FIM-capable or a role that was misspelt in the config. Confirm the model is pulled with ollama list, confirm the role reads exactly autocomplete, and check Continue’s own log output (in VS Code, the Continue panel has a console) for a model-not-found error.

Continue can’t reach Ollama. If chat returns a connection error, the extension and the server disagree about where Ollama lives. Verify the endpoint by hand:

curl http://localhost:11434/api/tags

If that returns JSON but Continue still fails, you are likely running Ollama in a container or on another host, and localhost inside the editor’s context is not the same localhost. Set an explicit apiBase to the reachable address. If Ollama runs on a different machine, remember it only binds to loopback by default; you must set OLLAMA_HOST=0.0.0.0 on the server for it to accept connections from your workstation.

Suggestions are painfully slow or the machine grinds. The model is too large for your RAM and is spilling to swap or falling back to CPU. Drop to a smaller autocomplete model, raise debounceDelay so you are not queuing requests, and watch memory with ollama ps to see what is actually loaded. A model that keeps unloading and reloading between chat and autocomplete will feel dreadful — giving each role its own resident model, if you have the RAM, avoids the thrash.
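The information ollama ps prints is also available over the local API at /api/ps, which is handy if you want to log it while you work. The sketch below summarises a response; it assumes each entry carries name, size and size_vram fields (total bytes and bytes held in GPU memory), and the function name and sample values are hypothetical.

```python
def summarise_loaded(ps_response: dict) -> list[dict]:
    """Summarise loaded models: name, size in GB, and fraction held on the GPU."""
    rows = []
    for m in ps_response.get("models", []):
        size = m.get("size", 0)
        vram = m.get("size_vram", 0)
        rows.append({
            "name": m.get("name", "?"),
            "size_gb": round(size / 1e9, 2),
            "gpu_fraction": round(vram / size, 2) if size else 0.0,
        })
    return rows


# Made-up sample shaped like a /api/ps response.
sample = {
    "models": [
        {"name": "qwen2.5-coder:7b", "size": 6_000_000_000, "size_vram": 3_000_000_000},
        {"name": "qwen2.5-coder:1.5b", "size": 1_500_000_000, "size_vram": 1_500_000_000},
    ]
}
for row in summarise_loaded(sample):
    print(row)
```

A gpu_fraction well below 1.0 suggests the model only partly fits in GPU memory and is being split with the CPU, which is one common cause of sluggish suggestions.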

The first request after a pause is slow, then it speeds up. That is Ollama loading the model into memory on demand. By default a model is unloaded after five idle minutes; set OLLAMA_KEEP_ALIVE to a longer duration, such as 30m, so it stays resident between edits.
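Keep-alive can also be set per request rather than server-wide. As a sketch, sending /api/generate a body with a model name, no prompt and a keep_alive value asks Ollama to load the model and hold it for that long, which makes a handy warm-up before a coding session. The helper names and the 30m value are illustrative assumptions.

```python
import json
import urllib.request


def build_preload_request(model: str, keep_alive: str = "30m") -> dict:
    """Body for POST /api/generate that loads a model without generating text."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def preload(model: str, keep_alive: str = "30m",
            base_url: str = "http://localhost:11434") -> bool:
    """Ask a running Ollama server to load the model; True if it reports done."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_preload_request(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return bool(json.load(resp).get("done", False))


print(build_preload_request("qwen2.5-coder:1.5b"))
# With Ollama running locally: preload("qwen2.5-coder:1.5b")
```

Warming the autocomplete model this way means the first suggestion of the day arrives as quickly as the hundredth.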

Conclusion

A private coding assistant is no longer exotic. With Ollama serving a code model locally and Continue wiring it into your editor, you get inline autocomplete and a capable chat companion that work offline, cost nothing per use, and never leak a line of your code. Split the chat and autocomplete roles across an appropriately sized pair of models, match your ambitions to your RAM, and you have a setup that respects both your productivity and your privacy. Start small, see how it feels, and enjoy the rare luxury of an AI helper that lives entirely on your own machine.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #ai #ollama #coding #howto
