LangChain vs LlamaIndex: Orchestrating LLMs Without Going Mad

Two frameworks for wiring models into something useful, and when each earns its keep

By Smarc

30-04-2024

The moment you try to build anything real with a language model, you discover the hard part isn’t the model. It’s everything around it: loading documents, splitting them sensibly, embedding them, stuffing the right context into a prompt, calling a tool, parsing the reply, and doing it all again. You can write this yourself — I did, twice, badly — or you can reach for a framework. The two that dominate are LangChain and LlamaIndex, and the internet will cheerfully tell you to use both, neither, or that one is bloated and the other is a toy. Here’s what I actually think after building with each.

They started solving different problems

This is the key to the whole comparison, and most arguments online miss it. The two frameworks come from different starting assumptions.

LangChain is a general orchestration toolkit. Its worldview is that an LLM application is a chain of steps — and increasingly an agent that decides which steps to take. Prompts, model calls, tools, memory, output parsing: LangChain wants to be the glue for all of it. It’s broad, it has an integration for seemingly everything, and that breadth is both its strength and the source of every complaint about it.

LlamaIndex started life as GPT Index, and its obsession is retrieval. If your problem is “I have a pile of documents and I want the model to answer questions using them” — the thing everyone now calls RAG — LlamaIndex was built from the ground up for exactly that. Its abstractions are about indexing, querying, and getting the right chunks in front of the model.

The RAG case, side by side

Both can do retrieval-augmented generation, so let’s see them do it. Here’s the LlamaIndex version, which is almost rude in how little it asks of you:

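# Assumes OPENAI_API_KEY is set: the defaults use OpenAI for embeddings and the LLM
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()  # read every file in ./docs
index = VectorStoreIndex.from_documents(documents)       # chunk, embed, store in memory
query_engine = index.as_query_engine()                   # retrieval plus answer synthesis

print(query_engine.query("What's our refund policy?"))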

Four meaningful lines and you have a working document Q&A system. LlamaIndex made sensible default choices about chunking, embedding, and retrieval so you didn’t have to. That is the entire pitch, and it’s a good one.
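If you're curious what those defaults are doing on your behalf, here's a deliberately crude sketch of the same pipeline in plain Python. It is not LlamaIndex's implementation: the word-count "embedding", the fixed-size chunker, and every function name below are my own stand-ins, chosen so the thing runs with no API key and no model.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size character chunks (real splitters are smarter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector standing in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into a prompt for the model."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical documents, standing in for the files under ./docs
docs = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Shipping: orders leave the warehouse within two working days.",
    "Support: email us and we reply within one business day.",
]
chunks = [c for d in docs for c in chunk(d)]
top = retrieve("What is the refund policy?", chunks, k=1)
print(build_prompt("What is the refund policy?", top))
```

The final step, sending that prompt to a model, is left out on purpose. Everything before it is the part a framework's defaults quietly decide for you.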

LangChain can do the same, but it shows you more of the wiring — you assemble the loader, the splitter, the vector store, the retriever, and the chain yourself. That’s more code and more decisions, which is annoying for a simple RAG app and liberating the moment you need to do something the defaults didn’t anticipate.

Here’s the same job in LangChain, so you can feel the difference in your hands rather than take my word for it:

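# Needs langchain, langchain-community, langchain-openai, langchain-text-splitters,
# faiss-cpu and unstructured installed, plus OPENAI_API_KEY in the environment.
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Load every file under ./docs (DirectoryLoader parses with unstructured by default)
docs = DirectoryLoader("./docs").load()
# Split into chunks of up to 1000 characters with 100 characters of overlap
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)
# Embed the chunks and build an in-memory FAISS index
store = FAISS.from_documents(chunks, OpenAIEmbeddings())
# Retrieve the top 4 chunks per question and hand them to the chat model
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
# invoke() returns a dict; the answer text lives under "result"
print(qa.invoke({"query": "What's our refund policy?"})["result"])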

Six or seven decisions are now yours: which loader, which splitter and its chunk size, which embedding model, which vector store, how many chunks to retrieve. For a refund-policy bot that’s needless ceremony. For a system where retrieval quality is the whole ballgame — where you’ve measured that 1,000-character chunks lose the answer but 1,500 keep it, or that you need to filter by metadata before the vector search runs — that explicitness stops being overhead and starts being the reason you can fix things. LlamaIndex lets you reach in and tune all of the same knobs, incidentally; the difference is which behaviour is the default and which is the opt-in. If you’re still deciding whether to reach for retrieval at all versus adjusting the model itself, I’ve laid out that trade-off in fine-tuning vs prompting vs RAG — get that choice wrong and no framework will save you.
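To make the "filter by metadata before the vector search runs" point concrete, here's a toy sketch in plain Python. The Chunk class, the two-number "embeddings", and the search function are all invented for illustration; real vector stores expose their own filter syntax, and this only shows the order of operations.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: tuple[float, ...]  # pretend embedding, hand-written for the demo
    metadata: dict[str, str] = field(default_factory=dict)


def dot(a, b) -> float:
    """Dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search(chunks, query_vector, k=4, where=None):
    """Filter on metadata first, then rank only the survivors by similarity."""
    where = where or {}
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda c: dot(c.vector, query_vector), reverse=True)
    return ranked[:k]


# Hypothetical corpus: an outdated policy sits closest to the query vector
corpus = [
    Chunk("2023: refund window was 14 days", (0.9, 0.1), {"year": "2023"}),
    Chunk("2024: refund window is 30 days", (0.8, 0.2), {"year": "2024"}),
    Chunk("2024: shipping takes two days", (0.1, 0.9), {"year": "2024"}),
]
query = (1.0, 0.0)

print(search(corpus, query, k=1)[0].text)
print(search(corpus, query, k=1, where={"year": "2024"})[0].text)
```

Without the filter the stale 2023 chunk wins on similarity alone; with it, the search never sees that chunk. That is the sort of fix that's only possible when the retrieval step is something you can get your hands on.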

Where LangChain pulls ahead

The instant your application stops being “answer questions about documents” and starts being “do a multi-step task, calling tools and making decisions along the way”, LangChain’s broader scope becomes the point. Agents, tool use, conversational memory, branching logic — this is the territory it was built for.

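# initialize_agent is deprecated (more on version churn below) but still shows the idea.
# Needs langchain, langchain-community, langchain-openai, numexpr and wikipedia installed.
from langchain.agents import initialize_agent
from langchain_community.agent_toolkits.load_tools import load_tools
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
tools = load_tools(["llm-math", "wikipedia"], llm=llm)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")

# run() is deprecated; invoke() returns a dict with the answer under "output"
result = agent.invoke({"input": "What's the population of France divided by 4?"})
print(result["output"])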

That agent will reason about which tool to use, look up the population, do the arithmetic, and return an answer — and you can hand it your own tools just as easily. LlamaIndex has grown agent and tool features too, but this remains LangChain’s home turf.

A word of caution before you get excited about agents, because the demo above hides a real risk. The instant you let a model choose which tool to call and with what arguments, you’ve handed it a lever on the world. A tool that runs shell commands, hits an internal API, or writes to a database is only as safe as the guardrails around it, and a model that’s been prompt-injected by a poisoned document will happily pull that lever in a direction you did not intend. Both initialize_agent and its modern successor create_react_agent are conveniences, not safety mechanisms. Scope every tool tightly, treat every model-chosen argument as untrusted input, and read “when your AI agent goes rogue” before you wire one of these to anything that can spend money or delete files. The framework will not stop you shooting yourself in the foot; it just makes the trigger easier to reach.
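Here's roughly what "treat every model-chosen argument as untrusted input" looks like when written down. This is a framework-free sketch under my own assumptions: lookup_order, its validator, and the dispatch function are hypothetical names, not part of LangChain or LlamaIndex, and a real tool would need checks that fit what it actually touches.

```python
from typing import Any


class ToolError(Exception):
    """Raised when a model-requested tool call is rejected."""


def lookup_order(order_id: str) -> str:
    # Hypothetical read-only tool; a real one would query your order system.
    return f"order {order_id}: shipped"


def validate_lookup_order(args: dict[str, Any]) -> dict[str, Any]:
    """Accept exactly one argument, and only if it looks like an order number."""
    if set(args) != {"order_id"}:
        raise ToolError("expected exactly one argument: order_id")
    order_id = args["order_id"]
    if not (isinstance(order_id, str) and order_id.isascii()
            and order_id.isdigit() and len(order_id) <= 10):
        raise ToolError("order_id must be 1 to 10 ASCII digits")
    return {"order_id": order_id}


# Allowlist: the model can only reach tools that are registered here,
# and every tool comes paired with a validator for its arguments.
TOOLS = {"lookup_order": (lookup_order, validate_lookup_order)}


def dispatch(tool_name: str, args: dict[str, Any]) -> str:
    """Run a tool the model asked for, but only after both checks pass."""
    if tool_name not in TOOLS:
        raise ToolError(f"tool not allowed: {tool_name!r}")
    func, validate = TOOLS[tool_name]
    return func(**validate(args))


print(dispatch("lookup_order", {"order_id": "1234"}))

# Requests a prompt-injected model might produce; both are refused.
for name, args in [("run_shell", {"cmd": "rm -rf /"}),
                   ("lookup_order", {"order_id": "1; DROP TABLE orders"})]:
    try:
        dispatch(name, args)
    except ToolError as exc:
        print("rejected:", exc)
```

The design choice worth copying is that validation lives next to the tool rather than in the prompt. Asking the model nicely not to misuse a tool is not a control; refusing the call is.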

There’s also a version-churn tax specific to LangChain that’s worth naming. The API you learn today may be deprecated in six months — initialize_agent itself is on the way out in favour of LangGraph’s explicit state machines, langchain split into langchain-core, langchain-community, and per-provider packages like langchain-openai, and import paths that worked in tutorials from a year ago now throw. This isn’t fatal, but it means you should pin your versions, read the migration notes when you upgrade, and be sceptical of any Stack Overflow answer older than the last release. LlamaIndex went through its own great renaming (everything moved under llama_index.core and friends), so neither is innocent here — but LangChain’s larger surface area means more of it moves at once.

The honest gripes

LangChain has a reputation, and it’s partly earned. The abstractions move fast, the documentation has historically struggled to keep pace, and there’s a genuine “do I need this layer at all?” question lurking under simple use cases. I’ve spent real time debugging a LangChain chain only to conclude that three direct API calls would have been clearer. The framework adds the most value precisely where your application is complex, and the least where it’s simple.

LlamaIndex is more focused and therefore less likely to leave you bewildered — but that focus is a ceiling as well as a floor. Push it far past retrieval into elaborate agentic workflows and you’ll feel it straining against its own grain.

Troubleshooting: the failures you’ll actually hit

Both frameworks fail in ways the happy-path tutorials never mention, and the failures cluster into a handful of recurring shapes. Here’s the field guide I wish I’d had.

“It retrieves the wrong chunks.” This is the single most common complaint about RAG, and it’s almost never the framework’s fault — it’s your chunking. If your documents are split so that the answer straddles a chunk boundary, no retriever will find it whole. Symptoms: the model answers confidently but wrong, or says “I don’t have that information” about something plainly in your docs. Fix by increasing chunk_overlap (100–200 characters buys you a lot of forgiveness), trying larger chunks, or switching to a semantic splitter that respects paragraph and section breaks instead of blindly cutting every N characters. Always inspect what’s actually being retrieved — in LlamaIndex, response.source_nodes; in LangChain, call retriever.invoke(query) directly and read what comes back. Nine times out of ten the problem is visible the moment you look.
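The boundary problem is easy to reproduce without any framework. The splitter below is a minimal character-based sketch of my own, not LangChain's or LlamaIndex's splitter, and the text is made up so that the answer sentence lands across a chunk boundary.

```python
def split(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text every chunk_size characters, repeating chunk_overlap characters
    from the end of each chunk at the start of the next."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


filler = "Shipping takes two working days. "
answer = "Refunds are accepted within 30 days."
text = filler + answer + " " + filler

for overlap in (0, 25):
    chunks = split(text, chunk_size=50, chunk_overlap=overlap)
    found = any(answer in c for c in chunks)
    print(f"overlap={overlap:>2} chunks={len(chunks)} answer_intact={found}")
```

With no overlap the answer sentence is cut in two and no single chunk contains it; with overlap, one of the chunks holds it whole. The sizes are tiny so the effect is visible, but the same thing happens at 1,000 characters.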

Dependency hell. Both frameworks pull in a sprawling tree of optional integrations, and installing “langchain” or “llama-index” as a single package drags along more than you need and occasionally conflicts with your own pins. Install only the sub-packages you actually use (langchain-core plus the one or two integration packages), work inside a virtual environment or container, and lock your versions. If you’re the sort who’d rather run each app in its own isolated userspace, the Podman approach to rootless containers pairs nicely with this — one container per LLM app, its dependency graph sealed off from everything else.

Silent cost blowups. An agent that loops — reasoning, calling a tool, reasoning again — can make dozens of model calls per user request, and if you’re on a paid API each of those is money. A misbehaving agent stuck in a reasoning loop can burn through a budget alarmingly fast. Set max_iterations on your agents, add a hard timeout, and log token usage from day one so a bug shows up as a graph spike rather than a bill.
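If you want to see the shape of those three limits in one place, here's a plain-Python sketch. The run_agent function and the step callable are hypothetical: step stands in for "one round of reasoning plus a tool call" and reports how many tokens it spent. As far as I know LangChain's AgentExecutor offers max_iterations and max_execution_time for the first two; the token cap here is my own addition.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the agent loop hits any of its limits."""


def run_agent(step, max_iterations=5, max_seconds=30.0, max_tokens=4000):
    """Call step() until it returns an answer or a limit trips.

    step is a hypothetical callable returning (answer_or_None, tokens_used).
    """
    deadline = time.monotonic() + max_seconds
    tokens = 0
    for iteration in range(1, max_iterations + 1):
        if time.monotonic() > deadline:
            raise BudgetExceeded(f"timed out after {iteration - 1} iterations")
        answer, used = step()
        tokens += used
        # Log usage on every round so a loop shows up in the logs, not the bill
        print(f"iteration={iteration} tokens_total={tokens}")
        if answer is not None:
            return answer
        if tokens > max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {tokens} > {max_tokens}")
    raise BudgetExceeded(f"no answer after {max_iterations} iterations")


def stuck_step():
    """Simulates an agent that keeps reasoning and never reaches an answer."""
    return None, 120


try:
    run_agent(stuck_step, max_iterations=3)
except BudgetExceeded as exc:
    print("stopped:", exc)
```

Note that the time check only runs between steps, so a single step that hangs will still hang. A real hard timeout belongs on the model and tool calls themselves as well.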

Opaque errors deep in the abstractions. When a chain fails five layers down, the stack trace can be genuinely useless. LangChain’s answer is set_debug(True) and, better, LangSmith tracing, which shows you every prompt, every model call, and every tool invocation in order. LlamaIndex has callback handlers that do the same. Turn observability on before you need it; debugging blind is where the “I could have written three API calls myself” resentment comes from.
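If you're not ready to adopt a tracing service, the underlying idea fits in a decorator. This is a homemade sketch, not LangSmith and not LlamaIndex's callback system: traced, TRACE, and the three toy steps are names I've made up, and fake_model returns a canned string instead of calling anything.

```python
import functools
import time

TRACE: list[dict] = []  # one record per step, in call order


def traced(func):
    """Record each call's inputs, output or error, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"step": func.__name__, "args": args, "kwargs": kwargs}
        TRACE.append(record)
        start = time.perf_counter()
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["seconds"] = time.perf_counter() - start
    return wrapper


@traced
def build_prompt(question: str) -> str:
    return f"Q: {question}"


@traced
def fake_model(prompt: str) -> str:
    return "answer=42"  # stand-in for a real model call


@traced
def parse(reply: str) -> int:
    return int(reply.split(":")[1])  # bug: the reply uses "=", not ":"


try:
    parse(fake_model(build_prompt("meaning of life?")))
except IndexError:
    pass  # the trace below is what tells us where and why it failed

for record in TRACE:
    outcome = record.get("error") or record.get("output")
    print(f'{record["step"]}: {outcome!r}')
```

Reading the trace top to bottom shows the exact string the parser was handed, which is usually all you need. The hosted tools give you the same view with nicer rendering and without having to decorate anything yourself.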

How to actually choose

My rule of thumb is embarrassingly simple. If the heart of your project is retrieval over your own documents, start with LlamaIndex; it’ll have you running in an afternoon and the defaults are good. If the heart of your project is orchestration — agents, tools, multi-step reasoning, lots of moving parts — start with LangChain and accept the learning curve as the cost of its reach.

And yes, the “use both” advice is real and not a cop-out: a common pattern is LlamaIndex handling the retrieval layer as a tool that a LangChain agent calls. They interoperate fine, and using each for what it’s best at is more sensible than forcing one to do everything.

Is it worth a framework at all?

For a genuinely simple application — one prompt, one model call, parse the result — skip both and call the API directly. A framework you don’t need is just indirection between you and a bug. But the moment you’re juggling documents, tools, memory, or multiple steps, hand-rolling that plumbing stops being a learning exercise and starts being a maintenance burden. That’s where these frameworks earn their place: not because the LLM call is hard, but because everything wrapped around it is, and someone has already solved it more carefully than you will at 1am. Pick the one whose centre of gravity matches your problem, and don’t be too proud to use both.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #machine-learning #llm
