What Is a Token, Really? How LLMs Read, Reason, and Bill You
The hidden unit behind every AI conversation

Every conversation you have with a language model is quietly measured, chopped, and counted in a unit you almost never see. It is not the word, nor quite the letter. It is the token: the atom of AI text, the thing the model actually reads, the thing your bill is calculated from, and the reason your carefully crafted prompt sometimes behaves in ways that feel slightly arbitrary. Understand tokens and a great deal about how these systems read, reason, and charge suddenly clicks into place.
1 Not Quite Words, Not Quite Letters
When you read a sentence, you process it as words. A language model does not. Before any text reaches the model, it passes through a step called tokenisation, which splits the text into chunks called tokens. A token is usually a piece of a word — a common run of characters that appears frequently in text. Sometimes a whole word is a single token; sometimes a word is two or three; sometimes a single token includes a leading space.
The rough rule of thumb in English is that one token corresponds to about four characters, or roughly three-quarters of a word. So a hundred words is in the neighbourhood of 130 tokens. This is only an average, and the precise count depends on the specific words and the specific model, but it is a useful figure to keep in your head when you are estimating size or cost.
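The rule of thumb is easy to turn into a quick estimator. The sketch below is only that: the two constants are the averages quoted above, the function names are invented for illustration, and a real tokeniser for a specific model will give a different count.

```python
import math

# Averages for English text taken from the rule of thumb above.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate a token count from character length (about 4 chars per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate a token count from a word count (about 0.75 words per token)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(100))  # 134, in the neighbourhood of 130
```

Rounding up is a deliberate choice here: when budgeting size or cost, a slight overestimate is safer than a slight underestimate.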
2 Why Pieces Instead of Whole Words
Splitting text into sub-word pieces sounds fiddly, so why do it? Because it strikes a clever balance. If a model used whole words as its units, its vocabulary would need to contain every word it might ever encounter — including names, typos, slang, technical jargon, and words in dozens of languages. That vocabulary would be impossibly large, and it would still stumble over any word it had never seen.
If instead the model used single characters, the vocabulary would be tiny, but every sentence would become a very long sequence of units, making the model slow and forcing it to reassemble meaning from scratch each time. Sub-word tokens sit in the sweet spot. Common words get their own token for efficiency, while rare or novel words are built from smaller familiar pieces. An uncommon word such as “unfriendliness” might break into “un”, “friend”, “li”, and “ness”: fragments the model has seen countless times, even if it has rarely met the whole word. Because any string can be assembled from smaller pieces, down to single characters or bytes if necessary, almost nothing is truly out of vocabulary.
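To make the idea concrete, here is a minimal sketch of sub-word splitting. It uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification: production tokenisers typically learn their vocabulary from data (for example with byte-pair encoding) and may split words differently. The vocabulary and the function name are assumptions made up for this illustration.

```python
# Toy vocabulary: hypothetical pieces chosen for this illustration only.
TOY_VOCAB = {"un", "friend", "li", "ness", "cat", "straw", "berry"}


def split_into_pieces(word: str, vocab: set[str]) -> list[str]:
    """Greedily take the longest vocabulary piece at each position.

    Falls back to a single character when no piece matches, so every
    input can still be represented.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


print(split_into_pieces("unfriendliness", TOY_VOCAB))  # ['un', 'friend', 'li', 'ness']
print(split_into_pieces("cat", TOY_VOCAB))             # ['cat']
print(split_into_pieces("zebra", TOY_VOCAB))           # ['z', 'e', 'b', 'r', 'a']
```

The last call shows the trade-off in miniature: a word with no familiar pieces is still representable, but it costs five units instead of one.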
3 Watching a Word Split
A few concrete examples make the behaviour tangible. The word “cat” is so common it is a single token. “Strawberry” might surprise you by splitting into something like “straw” and “berry”, two tokens. Numbers often fragment in unintuitive ways: “12345” can become several tokens rather than one. Whitespace usually attaches to the front of the following word, so “ token” with its leading space is a different token from “token” at the start of a line.
This explains a famous party trick: ask a model to count the letters in a word and it sometimes gets it wrong. From the model’s point of view it never saw the individual letters at all. It saw a token or two, opaque little bundles, and counting characters inside a bundle it cannot see clearly is genuinely hard. The model reasons over tokens, not letters, and the seams between those two worlds are where odd behaviour creeps in.
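A small sketch shows the gap between what we see and what the model receives. The token-to-ID table below is entirely hypothetical; real vocabularies and ID numbers differ from model to model.

```python
# Hypothetical token-to-ID table; real vocabularies and IDs differ by model.
TOY_IDS = {"straw": 101, "berry": 102, "token": 103, " token": 104}


def encode(pieces: list[str]) -> list[int]:
    """Map already-split pieces to their integer IDs."""
    return [TOY_IDS[piece] for piece in pieces]


word = "strawberry"
print(word.count("r"))             # 3: easy when the letters are visible
print(encode(["straw", "berry"]))  # [101, 102]: what the model receives instead
print(encode(["token"]) == encode([" token"]))  # False: the leading space matters
```

Nothing in the list `[101, 102]` says how many times the letter “r” appears, which is why counting letters is harder for the model than it looks from the outside.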
4 The Context Window
Every model can only consider a fixed amount of text at once, and that ceiling is measured in tokens. This is the context window. It covers everything in play during a single exchange: your prompt, any documents or chat history you have included, the system instructions, and the model’s own reply. When the total exceeds the window, something has to give. Chat applications usually let the oldest material drop off the front, while a direct API call that is too long is often simply rejected.
This is why a long chat can seem to develop amnesia, forgetting details you mentioned at the start. They simply scrolled out of the window. It is also why feeding a model an entire book in one go is not always possible: the book may not fit. Context windows have grown enormously, from a few thousand tokens in early models to hundreds of thousands or more, but they remain finite, and the limit is always counted in tokens rather than pages or words.
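The “oldest material drops off” behaviour can be sketched in a few lines. This is one plausible strategy, not a description of how any particular product works: `count_tokens` is a crude stand-in for a real tokeniser, and the function and parameter names are assumptions for illustration.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokeniser: about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_window(system: str, history: list[str], max_tokens: int,
                  reserve_for_reply: int) -> list[str]:
    """Keep the most recent turns that fit in the window once the system
    instructions and room for the reply have been accounted for."""
    budget = max_tokens - reserve_for_reply - count_tokens(system)
    kept = []
    # Walk backwards from the newest turn and stop when the budget runs out.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # three turns of 10 tokens each
kept = fit_to_window("s" * 8, history, max_tokens=30, reserve_for_reply=5)
print(len(kept))  # 2: the oldest turn no longer fits
```

Reserving space for the reply matters because the model’s own output counts against the same window as everything you send.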
5 Why You Pay Per Token
Tokens are not just how models read; they are how providers bill. Pricing is quoted per token, and almost always split into two rates: one for input tokens (everything you send) and one for output tokens (everything the model generates). Output is typically the pricier of the two, because the reply is generated one token at a time, each requiring its own pass through the model, whereas the input can be processed together in a single pass.
This split has real consequences. A long prompt with a short answer is cheap on output but can be expensive on input if you keep sending the same lengthy context over and over. A short prompt that produces pages of generated text tilts the cost the other way. Knowing which side of the ledger dominates your use case tells you where to trim.
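The arithmetic is simple enough to sketch. The prices below are hypothetical round numbers chosen for the example, not any provider’s actual rates, and the function name is an assumption.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Cost of one request, given separate per-million-token rates."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical rates: 1.00 per million input tokens, 4.00 per million output tokens.
IN_RATE, OUT_RATE = 1.0, 4.0

long_prompt = request_cost(20_000, 200, IN_RATE, OUT_RATE)  # big context, short answer
long_answer = request_cost(200, 5_000, IN_RATE, OUT_RATE)   # short prompt, long answer

print(f"{long_prompt:.4f}")  # 0.0208, almost all of it input
print(f"{long_answer:.4f}")  # 0.0202, almost all of it output
```

The two requests cost nearly the same, but for opposite reasons, which is why it pays to check which side dominates before deciding what to trim.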
6 How Tokens Shape Prompt Design
Once you think in tokens, prompt design changes character. Every word you include in your prompt is a word you pay for and a word that consumes context-window space. Bloated, repetitive instructions cost money and crowd out room for the content that actually matters. Tight, well-structured prompts are cheaper and frequently better, because the model is not wading through padding to find your real request.
There is a subtler effect too, sometimes called “lost in the middle”. Models tend to pay closest attention to the beginning and end of their context and can gloss over material buried in the centre of a very long prompt. So it is not merely that there is a limit: even within the limit, where you place the important information matters. Put your key instructions and most relevant facts near the start or the end, not adrift in the middle of a wall of text.
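One way to act on this, sketched below, is to reorder supporting passages so the most relevant ones sit at the edges of the prompt and the least relevant end up in the centre. It assumes you already have the passages ranked from most to least relevant; the function name is invented for illustration, and whether the reordering helps will depend on the model.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder items ranked best-first so the best sit at the start and end.

    Items are dealt alternately to the front and the back, which leaves
    the lowest-ranked items in the middle.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here “A” and “B”, the two most relevant passages, open and close the sequence, while “E”, the least relevant, lands in the centre.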
7 Spending Fewer Tokens
A handful of habits keep your token bill and your context usage lean:
- Trim the prompt. Remove pleasantries, redundant instructions, and repeated context. State what you need once, clearly.
- Cap the output. If you only want a short answer, ask for one explicitly, or set a maximum length. Unbounded generation is where surprise costs hide.
- Summarise long histories. In an ongoing conversation, periodically condense earlier turns into a short summary rather than resending the entire transcript each time.
- Send only relevant context. Do not paste a whole document when a couple of paragraphs will do. This is exactly the discipline that retrieval-based systems automate.
- Reuse stable context. If you send the same large instructions repeatedly, look into caching features that let providers charge less for context they have already seen.
None of these require deep technical skill. They are mostly a matter of remembering that text is not free and that brevity is, more often than not, also clarity.
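As an illustration of the “send only relevant context” habit, here is a minimal sketch that picks the paragraphs sharing the most words with a question. Word overlap is a deliberately naive relevance score used only for this example; real retrieval systems generally use more capable methods, and the function names are assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Lower-cased set of the alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant(paragraphs: list[str], question: str, k: int = 2) -> list[str]:
    """Return up to k paragraphs that share the most words with the question,
    in their original order. Paragraphs with no overlap are never included."""
    q = words(question)
    scored = [(len(words(p) & q), i) for i, p in enumerate(paragraphs)]
    # Highest score first; earlier paragraphs win ties.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    keep = sorted(i for score, i in top if score > 0)
    return [paragraphs[i] for i in keep]


doc = [
    "Tokens are billed per request.",
    "The weather was mild.",
    "Output tokens cost more than input tokens.",
]
for paragraph in select_relevant(doc, "How are output tokens billed?"):
    print(paragraph)
```

The unrelated paragraph about the weather is left out, so it costs nothing and takes up no room in the context window.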
8 Conclusion
The token is the unsung unit underneath every AI interaction — smaller than a word, the true thing a model reads, the measure of how much it can hold in mind, and the basis of every charge on your bill. Once you stop thinking in words and start thinking in tokens, the quirks fall into place: why models miscount letters, why long chats forget, why a verbose prompt costs more than a crisp one. Counting in tokens is the quiet skill that separates someone who uses these tools from someone who uses them well.