What Is a Token, Really? How LLMs Read, Reason, and Bill You

The hidden unit behind every AI conversation

13-03-2026

Every conversation you have with a language model is quietly measured, chopped, and counted in a unit you almost never see. It is not the word, nor quite the letter. It is the token: the atom of AI text, the thing the model actually reads, the thing your bill is calculated from, and the reason your carefully crafted prompt sometimes behaves in ways that feel slightly arbitrary. Understand tokens and a great deal about how these systems read, reason, and charge suddenly clicks into place.

Not Quite Words, Not Quite Letters

When you read a sentence, you process it as words. A language model does not. Before any text reaches the model, it passes through a step called tokenisation, which splits the text into chunks called tokens. A token is usually a piece of a word — a common run of characters that appears frequently in text. Sometimes a whole word is a single token; sometimes a word is two or three; sometimes a single token includes a leading space.

The rough rule of thumb in English is that one token corresponds to about four characters, or roughly three-quarters of a word. So a hundred words is in the neighbourhood of 130 tokens. This is only an average, and the precise count depends on the specific words and the specific model, but it is a useful figure to keep in your head when you are estimating size or cost.
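The arithmetic behind that rule of thumb fits in a few lines. This is only a sketch of the estimate described above; the function names are made up for illustration, and the two constants are the rough English averages, not properties of any particular tokeniser.

```python
CHARS_PER_TOKEN = 4      # rough English average, not exact for any model
WORDS_PER_TOKEN = 0.75   # about three-quarters of a word per token


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate a token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text: str) -> int:
    """Estimate a token count from the raw character length."""
    return round(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens_from_words(100))  # 133
print(estimate_tokens_from_text("Every conversation is measured in tokens."))
```

The two estimates will not always agree with each other, let alone with a real tokeniser, which is a reminder that this is a sizing aid and nothing more.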

Why Pieces Instead of Whole Words

Splitting text into sub-word pieces sounds fiddly, so why do it? Because it strikes a clever balance. If a model used whole words as its units, its vocabulary would need to contain every word it might ever encounter — including names, typos, slang, technical jargon, and words in dozens of languages. That vocabulary would be impossibly large, and it would still stumble over any word it had never seen.

If instead the model used single characters, the vocabulary would be tiny, but every sentence would become a very long sequence of units, making the model slow and forcing it to reassemble meaning from scratch each time. Sub-word tokens sit in the sweet spot. Common words get their own token for efficiency, while rare or novel words are built from smaller familiar pieces. An uncommon word such as “unfriendliness” might break into “un”, “friend”, “li”, and “ness”: fragments the model has seen countless times, even if it has rarely met the whole word. Because the smallest available pieces are single characters or bytes, almost nothing is ever truly out of vocabulary.
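To make the idea concrete, here is a toy splitter that carves a word into the longest known pieces and falls back to single characters when nothing matches. It is a deliberately simplified greedy longest-match scheme, not the algorithm any particular model uses, and the vocabulary is invented for the example.

```python
def split_into_pieces(word: str, vocab: set[str]) -> list[str]:
    """Greedily split a word into the longest pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary, chosen only to illustrate the behaviour.
TOY_VOCAB = {"un", "friend", "li", "ness", "cat"}

print(split_into_pieces("unfriendliness", TOY_VOCAB))  # ['un', 'friend', 'li', 'ness']
print(split_into_pieces("cat", TOY_VOCAB))             # ['cat']
print(split_into_pieces("zcat", TOY_VOCAB))            # ['z', 'cat']
```

The character fallback is what keeps the splitter from ever failing: an unfamiliar word just costs more pieces.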

How the Split Is Actually Decided

The vocabulary is not hand-written by a person deciding “cat” deserves its own token. It is learned from data by an algorithm, most commonly byte-pair encoding, or BPE. The idea is elegantly simple. You start with the raw characters (or, in many modern tokenisers, raw bytes), then repeatedly find the most frequent adjacent pair and merge it into a new single unit. Do this tens of thousands of times over a huge corpus and the common building blocks of the language surface on their own: frequent words become single tokens, common prefixes and suffixes become their own tokens, and rare sequences remain in smaller pieces. The vocabulary is frozen once the tokeniser has been built, before the model itself is trained, and every piece of text the model ever sees is chopped up using that fixed set of merges.
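The merge loop is short enough to sketch in full. What follows is a minimal character-level toy, assuming a tiny word list in place of a huge corpus; real tokenisers add byte-level handling, pre-tokenisation rules, and many optimisations, and the function names here are illustrative only.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of pair into a single symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words: list[str], num_merges: int) -> list[tuple]:
    """Learn an ordered list of merges from a list of words."""
    # Each distinct word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused.
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def tokenise(word: str, merges: list[tuple]) -> list[str]:
    """Split a word by replaying the frozen merges in the order learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_merges(["cat"] * 10 + ["cab"] * 3, num_merges=2)
print(merges)                    # [('c', 'a'), ('ca', 't')]
print(tokenise("cat", merges))   # ['cat']
print(tokenise("cab", merges))   # ['ca', 'b']
print(tokenise("cart", merges))  # ['ca', 'r', 't']
```

Note how “cat”, the frequent word, ends up as one token, while “cart”, which never appeared in the training list, is still representable but costs three.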

This is why tokenisation feels arbitrary from the outside: it reflects the statistics of the training data, not the rules of grammar you learned at school. A word that happens to be common in English text gets efficient treatment; a word common in your particular domain but rare on the open web may fragment badly. Programmers notice this constantly: a variable name like getUserById can split into several tokens because the tokeniser never saw that exact run of characters often enough to merge it into one. Emoji, unusual Unicode, and languages underrepresented in the training data all pay a similar tax, often needing several tokens where English needs one. If you work in a language other than English, you may quietly be paying more per sentence for the same meaning.

Watching a Word Split

A few concrete examples make the behaviour tangible. A word as common as “cat” is typically a single token. “Strawberry” might surprise you by splitting into something like “straw” and “berry”, two tokens. Numbers often fragment in unintuitive ways: “12345” can become several tokens rather than one. Whitespace usually attaches to the front of the following word, so “ token” with its leading space is a different token from “token” at the start of a line.

This explains a famous party trick: ask a model to count the letters in a word and it sometimes gets it wrong. From the model’s point of view it never saw the individual letters at all. It saw a token or two, opaque little bundles, and counting characters inside a bundle it cannot see clearly is genuinely hard. The model reasons over tokens, not letters, and the seams between those two worlds are where odd behaviour creeps in.

From Token to Number

Splitting text into tokens is only the first half of the story. The model does no arithmetic on the word “berry” itself; it works on numbers. Each token in the vocabulary has an integer ID, and that ID is looked up in a large table to fetch a long list of numbers called an embedding — a vector that encodes, in a way learned during training, something about the token’s meaning and how it tends to combine with others. So the real pipeline is: text goes in, tokeniser splits it into tokens, each token becomes an ID, each ID becomes a vector, and only then does the model start its work. When people talk about a model “reading” your prompt, this is what is actually happening under the bonnet.
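The last two steps of that pipeline, token to ID and ID to vector, are plain table lookups. The sketch below assumes a three-entry vocabulary and a random embedding table purely for illustration; in a real model the table is learned during training and is vastly larger.

```python
import random

# Hypothetical vocabulary: every token maps to an integer ID.
VOCAB = {"straw": 0, "berry": 1, " token": 2}
EMBEDDING_DIM = 4

random.seed(0)
# One row of numbers per token ID. Random placeholders stand in for
# the values a real model would learn.
EMBEDDING_TABLE = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in VOCAB
]


def to_ids(tokens: list[str]) -> list[int]:
    """Look up the integer ID of each token."""
    return [VOCAB[token] for token in tokens]


def to_vectors(ids: list[int]) -> list[list[float]]:
    """Fetch the embedding row for each ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in ids]


tokens = ["straw", "berry"]   # pretend the tokeniser produced this split
ids = to_ids(tokens)
vectors = to_vectors(ids)

print(ids)                             # [0, 1]
print(len(vectors), len(vectors[0]))   # 2 4
```

Nothing in `vectors` records which letters made up “straw”; by this point the model is holding two rows of numbers.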

Knowing this demystifies a lot. The model has no direct access to letters because letters were dissolved into token vectors several steps back. It cannot easily reverse a token into its component characters any more than you can recover the individual ingredients from a baked cake by looking at it. That is not a bug to be fixed with a better prompt; it is a structural consequence of how the input is represented.

Estimating Tokens Without Guessing

If you want a real count rather than the four-characters rule of thumb, do not reach for a generic word counter and do not use a tokeniser built for a different model — the vocabularies differ, so counts from the wrong tool can be off by a fifth or more, and worse on code. Most providers expose a token-counting endpoint or a small library that runs their exact tokeniser, and that is the only way to get an authoritative figure. The pattern is the same everywhere: hand it your text, get back the token count, multiply by the per-token price to estimate cost before you ever send the real request.

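# Count tokens with the provider's own tokeniser, then estimate input cost.
# Assumes the Anthropic Python SDK, whose client matches the calls used here.
import anthropic

PRICE_PER_INPUT_TOKEN = 5 / 1_000_000   # e.g. $5 / 1,000,000 tokens
client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment
with open("prompt.txt", encoding="utf-8") as f:
    prompt = f.read()

resp = client.messages.count_tokens(
    model="the-model-you-will-actually-use",
    messages=[{"role": "user", "content": prompt}],
)
tokens = resp.input_tokens
cost = tokens * PRICE_PER_INPUT_TOKEN
print(f"{tokens} tokens, roughly ${cost:.4f} to send once")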

The habit worth building is to count before you commit, especially for anything you will send at scale. A prompt that looks harmless in a text editor can be considerably larger once tokenised, and a batch job that runs it ten thousand times turns a rounding error into a real invoice.

The Context Window

Every model can only consider a fixed amount of text at once, and that ceiling is measured in tokens. This is the context window. It covers everything in play during a single exchange: your prompt, any documents or chat history you have included, the system instructions, and the model’s own reply. When the total exceeds the window, something has to give: a chat application will usually let the oldest material drop off the front, while a direct API call may simply be rejected.

This is why a long chat can seem to develop amnesia, forgetting details you mentioned at the start. They simply scrolled out of the window. It is also why feeding a model an entire book in one go is not always possible: the book may not fit. Context windows have grown enormously, from a few thousand tokens in early models to hundreds of thousands or more, but they remain finite, and the limit is always counted in tokens rather than pages or words.
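The “scrolling out” behaviour can be sketched as a budget check that walks the history from newest to oldest. This is one simple trimming policy among several, not what any particular chat product does; `rough_count` is a stand-in based on the four-characters rule, and in practice you would pass in your provider's real token counter.

```python
from typing import Callable


def rough_count(text: str) -> int:
    """Stand-in token counter using the four-characters rule of thumb."""
    return max(1, round(len(text) / 4))


def fit_to_window(
    messages: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_count,
) -> list[str]:
    """Keep the most recent messages that fit inside the token budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so the oldest material is dropped first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]   # roughly 10 tokens each
print(len(fit_to_window(history, max_tokens=25)))  # 2
```

With a budget of 25 tokens only the two most recent messages survive, which is exactly the amnesia described above.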

Why You Pay Per Token

Tokens are not just how models read; they are how providers bill. Pricing is quoted per token, and almost always split into two rates: one for input tokens (everything you send) and one for output tokens (everything the model generates). Output is typically the pricier of the two, because generating each token requires a fresh pass through the model, whereas input tokens are digested more cheaply in bulk.

This split has real consequences. A long prompt with a short answer is cheap on output but can be expensive on input if you keep sending the same lengthy context over and over. A short prompt that produces pages of generated text tilts the cost the other way. Knowing which side of the ledger dominates your use case tells you where to trim — and it is the same per-unit thinking that makes per-seat SaaS pricing so grating once you notice you are paying for capacity you never touch.
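A small calculator makes the two sides of the ledger visible. The rates below are placeholders: the input rate reuses the $5 per million example figure, and the output rate is simply assumed to be three times that, so substitute your provider's actual prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float    # dollars per 1,000,000 input tokens
    output_per_million: float   # dollars per 1,000,000 output tokens


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one request, input and output combined."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Hypothetical rates, for illustration only.
EXAMPLE = Pricing(input_per_million=5.0, output_per_million=15.0)

long_prompt = request_cost(EXAMPLE, input_tokens=20_000, output_tokens=200)
long_answer = request_cost(EXAMPLE, input_tokens=200, output_tokens=4_000)

print(f"long prompt, short answer: ${long_prompt:.3f}")   # $0.103
print(f"short prompt, long answer: ${long_answer:.3f}")   # $0.061
print(f"long prompt x 10,000 runs: ${long_prompt * 10_000:,.2f}")
```

In the first case nearly all the cost is input, so trimming the context is what pays; in the second, capping the output is.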

How Tokens Shape Prompt Design

Once you think in tokens, prompt design changes character. Every word you include in your prompt is a word you pay for and a word that consumes context-window space. Bloated, repetitive instructions cost money and crowd out room for the content that actually matters. Tight, well-structured prompts are cheaper and frequently better, because the model is not wading through padding to find your real request.

There is a subtler effect too, sometimes called lost in the middle. Models tend to pay closest attention to the beginning and end of their context and can gloss over material buried in the centre of a very long prompt. So it is not merely that there is a limit — even within the limit, where you place the important information matters. Put your key instructions and most relevant facts near the start or the end, not adrift in the middle of a wall of text.
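One way to act on this is to assemble prompts with a fixed layout: instructions first, bulky reference material in the middle, and the actual question last. The helper below is a hypothetical sketch of that layout, not a guaranteed fix, and the section labels are arbitrary.

```python
def build_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Place key instructions at the start and the question at the end."""
    parts = [instructions]
    # Reference material sits in the middle, where attention is weakest.
    for index, document in enumerate(documents, start=1):
        parts.append(f"Document {index}:\n{document}")
    parts.append(question)
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Answer using only the documents provided.",
    documents=["First reference text.", "Second reference text."],
    question="Which document mentions the second reference?",
)
print(prompt)
```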

Spending Fewer Tokens

A handful of habits keep your token bill and your context usage lean:

Trim the prompt. Remove pleasantries, redundant instructions, and repeated context. State what you need once, clearly.
Cap the output. If you only want a short answer, ask for one explicitly, or set a maximum length. Unbounded generation is where surprise costs hide.
Summarise long histories. In an ongoing conversation, periodically condense earlier turns into a short summary rather than resending the entire transcript each time.
Send only relevant context. Do not paste a whole document when a couple of paragraphs will do. This is exactly the discipline that retrieval-based systems automate.
Reuse stable context. If you send the same large instructions repeatedly, look into caching features that let providers charge less for context they have already seen. This matters most once you start wiring models into automated workflows — an AI agent that manages your inbox resends its instructions on every single turn, so a bloated system prompt quietly multiplies your bill across hundreds of calls.

None of these require deep technical skill. They are mostly a matter of remembering that text is not free and that brevity is, more often than not, also clarity.
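To see why reusing stable context matters for an agent, it helps to run the numbers. The sketch below assumes a simplified caching model in which the system prompt is billed at full price on the first call and at a discounted rate afterwards; the 90% discount and the $5 per million rate are illustrative assumptions, and real caching schemes have their own rules, such as expiry times and write surcharges.

```python
def agent_input_cost(
    system_tokens: int,
    turn_tokens: int,
    calls: int,
    price_per_million: float,
    cached_multiplier: float = 1.0,
) -> float:
    """Input cost in dollars of an agent that resends its system prompt.

    cached_multiplier is the fraction of the normal price charged for the
    system prompt after the first call; 1.0 means no caching at all.
    """
    if calls <= 0:
        return 0.0
    first_call = system_tokens + turn_tokens
    later_call = system_tokens * cached_multiplier + turn_tokens
    billable_tokens = first_call + later_call * (calls - 1)
    return billable_tokens * price_per_million / 1_000_000


# Hypothetical workload: a 3,000-token system prompt resent on 500 calls.
without_cache = agent_input_cost(3_000, 200, calls=500, price_per_million=5.0)
with_cache = agent_input_cost(
    3_000, 200, calls=500, price_per_million=5.0, cached_multiplier=0.1
)

print(f"without caching: ${without_cache:.2f}")
print(f"with caching:    ${with_cache:.2f}")
```

Under these assumptions the uncached system prompt accounts for most of the bill, which is why it is the first thing worth trimming or caching.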

Conclusion

The token is the unsung unit underneath every AI interaction — smaller than a word, the true thing a model reads, the measure of how much it can hold in mind, and the basis of every charge on your bill. Once you stop thinking in words and start thinking in tokens, the quirks fall into place: why models miscount letters, why long chats forget, why a verbose prompt costs more than a crisp one. Counting in tokens is the quiet skill that separates someone who uses these tools from someone who uses them well.

Frequently asked questions

What is a token in an LLM?

A token is the unit of text a language model actually reads, smaller than a word but usually larger than a single letter. Before text reaches the model it is split into these chunks during a step called tokenisation.

How many tokens are in a word?

As a rough rule of thumb in English, one token is about four characters or roughly three-quarters of a word, so a hundred words is in the neighbourhood of 130 tokens. The exact count depends on the specific words and model.

Why do LLMs charge per token and which costs more, input or output?

Providers bill per token, split into input tokens (everything you send) and output tokens (everything the model generates). Output is typically pricier because each generated token requires a fresh pass through the model, whereas input is digested more cheaply in bulk.

Why do language models miscount the letters in a word?

Because the model reasons over tokens, not individual letters. It never sees the separate characters; it sees opaque token bundles, so counting the letters hidden inside them is genuinely hard.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.
