Every time you type a message into an AI chatbot, something happens before the model reads your words. The text gets broken into smaller pieces called tokens. These tokens are the actual units that an LLM processes and generates, and that providers bill you for.
Understanding tokens matters because they control two things most users care about: cost and capability. Among the core concepts in LLM basics, tokens are one of the most practical to learn early. The number of tokens in your conversation determines how much you pay and how much text the model can handle.
Whether you are writing a single prompt or processing a 50-page document, tokens shape the experience. Getting familiar with how they work helps you manage costs, write better prompts, and avoid running into limits mid-conversation. The concept connects directly to how large language models process every piece of text you send them.
What Tokens Are and How Tokenization Works
A token is the smallest unit of text that a language model processes. It is not necessarily a word, a letter, or a sentence. Instead, a token is a text fragment the tokenizer has learned to treat as a single chunk.
Token: The smallest unit of text an LLM processes. A token can be a whole word, part of a word, a number, or a punctuation mark. Most English words equal 1 to 2 tokens.
When you send text to a model, a program called a tokenizer splits your input into pieces. The tokenizer assigns each piece a numeric ID from a fixed vocabulary. The model then works entirely with these numbers, not with readable text.
Think of it like a translator converting your message into a code the model understands. The model never sees the word “hello.” It sees the number 9906, or whatever ID its vocabulary assigns to that token.
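As a rough illustration of that lookup, the sketch below maps text pieces to numbers and back. The vocabulary, the IDs, and the function names `encode` and `decode` are all made up for this example; a real tokenizer has a far larger vocabulary and assigns different IDs.

```python
# Toy vocabulary: every piece and ID here is invented for illustration.
TOY_VOCAB = {"hello": 0, " world": 1, "!": 2}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(pieces):
    """Turn already-split text pieces into their numeric IDs."""
    return [TOY_VOCAB[piece] for piece in pieces]


def decode(token_ids):
    """Turn numeric IDs back into readable text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in token_ids)


ids = encode(["hello", " world", "!"])
print(ids)          # the model only ever works with these numbers
print(decode(ids))  # the tokenizer converts them back to text
```

The round trip is the point: text goes in, numbers are processed, and text comes back out on the other side.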
How Tokenizers Split Text
Most modern tokenizers use a method called Byte Pair Encoding (BPE) or a close variant. BPE works by analyzing a large body of text during training. It starts with individual characters and repeatedly merges the most common pairs into single tokens.
The result is a vocabulary of typically 30,000 to 200,000 tokens that balances efficiency and coverage. Common English words like “the” or “hello” become single tokens. Less common words get split into smaller pieces.
The word “tokenization,” for example, might become two tokens: “token” and “ization.” The word “cat” stays whole because it appears often enough to earn its own token.
Numbers, punctuation, and whitespace also become tokens. A period is one token. A comma is one token.
Even spaces sometimes merge with the following word into a combined token. That is why you might see tokens like “ world” with a leading space.
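The merge loop behind BPE can be sketched in a few lines. This is a minimal teaching version, not any provider's actual training code: the word counts are hypothetical, the function names are invented here, and ties between equally common pairs simply go to the pair seen first.

```python
from collections import Counter


def most_common_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Start from single characters and apply up to `num_merges` merge steps."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical word counts standing in for a training corpus.
merges, words = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
print(words)
```

After two merges, the frequent fragment “low” has become a single symbol, while the rarer endings are still spelled out character by character.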
Different model providers build their own tokenizer vocabularies. OpenAI’s current tokenizer (o200k_base) has about 200,000 tokens in its vocabulary.
Google’s SentencePiece approach produces a different-sized vocabulary with different merge patterns. This is why the same text creates a different number of tokens across models.
Why Not Just Use Words?
Languages are messy. English alone has hundreds of thousands of words, plus slang, abbreviations, technical terms, and typos.
A word-level system would need an enormous vocabulary. It would also fail on any word it had not seen before.
Token-based systems handle this gracefully. An unfamiliar word simply gets split into smaller known pieces.
The word “cryptocurrency” might split into “crypt” and “ocurrency.” A common word like “cat” stays whole. This flexibility is part of how LLMs work under the hood.
The subword approach also helps with new terminology. When a brand-new product name appears for the first time, the model does not crash. It tokenizes the name into known fragments and processes them normally.
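To see the fallback in action, here is a small sketch that splits a word by always taking the longest known piece from the left. Note that this greedy longest-match approach is a simplification chosen for readability; production BPE tokenizers apply learned merges instead. The vocabulary and the name “zorbix” are invented for the example.

```python
# Made-up vocabulary: a few multi-character pieces plus single letters as a fallback.
TOY_PIECES = {"crypt", "ocurrency", "cat", "token", "ization"}
SINGLE_CHARS = set("abcdefghijklmnopqrstuvwxyz")
VOCAB = TOY_PIECES | SINGLE_CHARS


def split_word(word, vocab=VOCAB):
    """Split `word` by repeatedly taking the longest known piece from the left."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Only reached if even the single character is missing from the vocabulary.
            raise ValueError(f"no known piece covers {word[start]!r}")
    return pieces


print(split_word("cat"))             # known word stays whole
print(split_word("cryptocurrency"))  # splits into two known pieces
print(split_word("zorbix"))          # invented name falls back to single letters
```

Because single characters are always in the vocabulary, every lowercase word can be split somehow, which is the property that lets a tokenizer cope with names it has never seen.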
The 0.75 Rule for Estimating Token Counts
A useful rule of thumb: one English word equals roughly 1.3 tokens. Flipped around, one token covers about 0.75 words. A 1,000-word document becomes approximately 1,300 tokens.
This ratio shifts depending on what you are writing. Conversational English stays close to the 1.3 average. Technical text with long compound words pushes higher, sometimes reaching 1.5 or 1.6 tokens per word.
Code is especially token-heavy. Programming syntax, variable names, and special characters all produce additional tokens. A 100-line Python script might use twice as many tokens as 100 lines of plain English.
Non-English languages often require more tokens as well. Chinese, Korean, and Arabic text typically needs 2 to 4 times more tokens per word than English. This happens because most tokenizers were trained primarily on English text, giving other languages less efficient tokenization.
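The rule of thumb is simple enough to turn into a quick estimator. The sketch below uses only the ratios quoted in this section; the function names and the `kind` labels are assumptions for illustration, and the results are estimates rather than real tokenizer counts.

```python
# Tokens-per-word ratios taken from the rule of thumb above; real counts vary by tokenizer.
TOKENS_PER_WORD = {
    "conversational": 1.3,
    "technical": 1.6,  # upper end of the 1.5 to 1.6 range
}


def estimate_tokens(word_count, kind="conversational"):
    """Estimate tokens for English text from a word count."""
    return round(word_count * TOKENS_PER_WORD[kind])


def estimate_words(token_count):
    """Estimate how many English words fit in a token budget (0.75 words per token)."""
    return round(token_count * 0.75)


print(estimate_tokens(1_000))               # 1300
print(estimate_tokens(1_000, "technical"))  # 1600
print(estimate_words(200_000))              # 150000
```

For anything where the exact count matters, such as billing or staying under a hard limit, count with the provider's own tokenizer instead.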
How Tokens Affect Your Day-to-Day LLM Use
Tokens and Pricing
API providers charge per token for both input (your prompt) and output (the model’s response). When you send a 500-word prompt and receive a 1,000-word response, you pay for roughly 650 input tokens and 1,300 output tokens.
These costs vary widely between models. As of February 2026, GPT-5 costs $1.25 per million input tokens and $10.00 per million output tokens, according to OpenAI’s API pricing. Claude Sonnet 4.5 costs $3.00 per million input tokens and $15.00 per million output tokens, per Anthropic’s pricing page.
Output tokens almost always cost more than input tokens, often 3 to 8 times more. Token for token, the model’s response is the bigger cost driver, not your prompt.
Shorter, focused prompts that produce concise answers save money at scale. A chatbot handling 10,000 conversations daily will notice significant cost differences between average reply lengths. Reducing average output from 500 tokens to 200 tokens per response cuts output costs by 60%.
To put these numbers in perspective, consider processing a 10,000-word report. That report is roughly 13,000 input tokens. Sending it to GPT-5 costs about $0.016 at $1.25 per million input tokens, while sending it to Claude Opus 4.6 costs about $0.065 at $5.00 per million.
The output cost for a 1,000-word summary adds more on top. These per-document differences seem small, but they multiply quickly for businesses processing hundreds of documents daily.
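The arithmetic behind these figures is easy to reproduce. The sketch below uses the per-million prices quoted in this article; the `PRICES` table and the `request_cost` function are illustrative names, and prices change often, so check the provider's pricing page before relying on the numbers.

```python
# Prices in USD per one million tokens, as quoted in this article (February 2026).
PRICES = {
    "GPT-5": {"input": 1.25, "output": 10.00},
    "Claude Opus 4.6": {"input": 5.00, "output": 25.00},
}


def request_cost(model, input_tokens, output_tokens):
    """Return the cost in USD of one request."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# 10,000-word report in (about 13,000 tokens), 1,000-word summary out (about 1,300 tokens).
for model in PRICES:
    input_only = request_cost(model, 13_000, 0)
    with_summary = request_cost(model, 13_000, 1_300)
    print(f"{model}: input ${input_only:.4f}, with summary ${with_summary:.4f}")
```

Even though the summary is a tenth the length of the report, its higher per-token rate makes it a large share of the total.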
Tokens and Context Windows
Every model has a maximum number of tokens it can handle in a single conversation. This limit is called the context window. It includes your input and the model’s output combined.
Context windows in 2026 range from 128,000 tokens for legacy models up to 1,000,000 tokens for frontier releases. Google’s Gemini 3.1 Pro offers a 1 million token window according to Google AI pricing. Claude Opus 4.6 matches this with 1 million tokens in beta.
A 200,000-token window holds approximately 150,000 words. That is roughly the length of two full novels. Larger windows accommodate multi-document analysis and lengthy conversations without losing earlier context.
When your conversation exceeds the context window, the model loses access to earlier messages. Long documents, extended conversations, and detailed instructions all consume tokens from this shared pool. Users often hit this limit without realizing it during long chat sessions.
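A quick pre-flight check can tell you whether a document is likely to fit. This is a sketch under simple assumptions: the token count is estimated at 1.3 tokens per word, the function names are invented here, and `max_output_tokens` stands for however much room you want to reserve for the reply.

```python
def fits_in_context(input_tokens, max_output_tokens, context_window):
    """Check whether the input plus the reserved reply space fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def remaining_tokens(input_tokens, max_output_tokens, context_window):
    """Tokens left over (a negative result means the request is too large)."""
    return context_window - input_tokens - max_output_tokens


# A 120,000-word document at roughly 1.3 tokens per word.
doc_tokens = round(120_000 * 1.3)
print(doc_tokens)
print(fits_in_context(doc_tokens, 4_000, 200_000))   # fits in a 200,000-token window
print(fits_in_context(doc_tokens, 4_000, 128_000))   # too large for a 128,000-token window
print(remaining_tokens(doc_tokens, 4_000, 200_000))  # headroom for follow-up questions
```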
Tokens and Response Quality
Token limits shape output quality in ways that are less obvious. When the model has room to process more tokens, it can maintain coherence over longer responses. Cramming too much into a single prompt leaves fewer tokens for the reply.
This creates a practical trade-off. A detailed 2,000-token prompt gives the model more context but less room to reply. A minimal 200-token prompt provides plenty of output space but less guidance.
Experienced users balance this through prompt design that provides enough context without wasting tokens on unnecessary detail. Including only the most relevant information often produces better results than pasting entire documents.
Tokens and Conversation Memory
In a multi-turn conversation, every previous message stays in the context window. Your first message, the model’s first reply, your second message, and so on. All of these accumulate tokens.
After enough back-and-forth exchanges, the total exceeds the limit. When this happens, chat interfaces typically drop the oldest messages without telling you. The model loses access to instructions or context from early in the conversation.
This explains why long chats sometimes feel like the model “forgot” what you discussed. It did not forget in a human sense. Those earlier messages were simply removed from its processing window to make room for newer ones.
For users who rely on long conversations, this creates a practical concern. Custom instructions, persona definitions, and style preferences set at the start of a conversation can disappear once the token total grows too large. The model then reverts to its default behavior, which often confuses users who assumed those instructions would persist.
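The dropping behavior can be pictured as a sliding window over the message list. The sketch below is one plausible way to do it, not a description of how any particular platform works: `trim_history` and `rough_count` are hypothetical helpers, and the token counts are crude word-based estimates.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the total fits within `max_tokens`."""
    kept = list(messages)
    total = sum(count_tokens(m) for m in kept)
    while kept and total > max_tokens:
        total -= count_tokens(kept.pop(0))  # the oldest message goes first
    return kept


def rough_count(message):
    """Crude estimate: about 1.3 tokens per whitespace-separated word."""
    return round(len(message.split()) * 1.3)


history = [
    "Always answer in a formal tone.",  # early instruction
    "Sure, I will keep a formal tone.",
    "Summarize the attached report for me.",
    "Here is a short summary of the report.",
]
print(trim_history(history, max_tokens=20, count_tokens=rough_count))
```

With a 20-token budget, the first two messages fall out of the window, and the instruction about tone goes with them. That is the “forgetting” effect in miniature.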
Token Characteristics Across Major Models
Each provider uses its own tokenizer, which means the same text produces different token counts depending on the model. The table below compares major models available in February 2026.
| Model | Provider | Context Window | Input Cost (per 1M) | Output Cost (per 1M) | Tokenizer Type |
|---|---|---|---|---|---|
| GPT-5 | OpenAI | 400,000 | $1.25 | $10.00 | BPE (o200k_base) |
| GPT-5 nano | OpenAI | 400,000 | $0.05 | $0.40 | BPE (o200k_base) |
| Claude Opus 4.6 | Anthropic | 1,000,000 | $5.00 | $25.00 | Custom BPE |
| Claude Haiku 4.5 | Anthropic | 200,000 | $1.00 | $5.00 | Custom BPE |
| Gemini 3.1 Pro | Google | 1,000,000 | $2.00 | $12.00 | SentencePiece |
| Gemini 2.5 Flash | Google | 1,000,000 | $0.15 | $0.60 | SentencePiece |
The price spread is dramatic. Sending one million input tokens through GPT-5 nano costs $0.05. The same volume through Claude Opus 4.6 costs $5.00.
That is a 100x cost difference for the same amount of text, though the models differ greatly in capability.
Different tokenizers also produce different token counts for identical text. A paragraph tokenized by OpenAI’s o200k_base system might produce 85 tokens.
Google’s SentencePiece tokenizer might return 91 for the same paragraph. These small differences compound over thousands of API calls.
The practical takeaway is that comparing “price per token” across providers requires caution. A model with cheaper tokens that produces more tokens for the same text may not actually cost less. Calculating total cost for a sample workload gives a more accurate picture than raw per-token rates.
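A short calculation shows how a lower per-token price can still lose. Both providers below are hypothetical, as are their prices; the token counts scale up the illustrative 85 versus 91 split to a full document.

```python
def workload_cost(tokens_per_doc, docs, price_per_million):
    """Total input cost in USD for `docs` documents."""
    return tokens_per_doc * docs * price_per_million / 1_000_000


# Hypothetical providers: B is cheaper per token but splits the same text into more tokens.
cost_a = workload_cost(tokens_per_doc=850, docs=100_000, price_per_million=1.00)
cost_b = workload_cost(tokens_per_doc=910, docs=100_000, price_per_million=0.95)

print(f"Provider A: ${cost_a:.2f}")
print(f"Provider B: ${cost_b:.2f}")
```

Provider B advertises a 5% lower rate, yet its tokenizer produces about 7% more tokens for the same documents, so the total bill comes out higher.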
When Tokens Work For and Against You
Strengths
- Flexible handling of any text. The subword approach means models process misspelled words, new slang, and mixed-language text without breaking. No input is truly “unknown” to a tokenizer, because unfamiliar words always fall back to smaller known pieces.
- Predictable pricing structure. Token-based billing makes costs transparent and calculable ahead of time. You can estimate expenses before running a task by counting tokens in your planned input and expected output length.
- Efficient compression of common patterns. Frequent words such as “the” or “and” compress into single tokens, so they take up minimal space and leave more of the context window available for meaningful, unique content.
Limitations
- Non-English text costs more. Languages using non-Latin scripts often require 2 to 4 times more tokens per word than English. This means non-English users pay more and get shorter effective context windows for the same amount of readable text.
- Counting tokens is unintuitive. Users cannot reliably predict token counts by looking at their text. A 10-word sentence might produce 12 tokens or 18, depending on which words appear and how the tokenizer splits them.
- Hidden overhead eats into your limit. System prompts, conversation history, and formatting instructions all consume tokens invisibly. A chatbot with a 500-token system prompt starts every conversation with less available context than the stated limit suggests.
- Multimodal inputs add up quickly. Images, PDFs, and audio files get converted into token equivalents before processing. A single high-resolution image can cost over 1,000 tokens, reducing the space left for text in the same request.
System prompts and conversation history consume tokens from your context window before you type a single word. A chatbot with a detailed system prompt might use 2,000 or more tokens before the conversation starts, reducing the space available for your actual content.
Common Misunderstandings About Tokens
“Tokens are the same as words.” This is the most widespread misconception. Tokens and words overlap often, but they are not the same unit.
Short, common words map to one token. Longer or rarer words split into multiple tokens.
Punctuation and spaces count as separate tokens too. This distinction trips up most beginners when they start exploring how language models actually process text.
“All models count tokens the same way.” Providers use different tokenizers with different vocabularies. The same 500-word email produces a different token count in ChatGPT, Claude, and Gemini. This matters when comparing what LLMs actually cost, because a “cheaper per token” model might tokenize your text into more pieces.
“Only my prompt uses tokens.” Both sides of the conversation count toward your total. Your input tokens and the model’s output tokens draw from the same context window and both appear on your bill. People often underestimate usage because they only consider what they typed.
“A bigger context window automatically means better results.” More space does not guarantee better output. Models can struggle with accuracy over very long inputs, even within their stated limits. Information buried in the middle of a long context often gets recalled less reliably than content near the start or end.
OpenAI offers a free Tokenizer tool that shows exactly how your text splits into tokens. Paste any text and watch it highlight each token boundary in real time.
“Tokens only matter if you use the API.” Users on subscription plans like ChatGPT Plus, Claude Pro, or Gemini Advanced never see per-token charges. But tokens still limit their context windows and can cause the model to drop earlier messages. The practical limits of language models apply regardless of how you pay for access.
Why Token Awareness Matters Before You Start
Tokens are the currency of every LLM interaction. They determine what you pay, how much the model remembers, and whether your document fits in a single conversation. Ignoring them leads to unexpected costs, truncated responses, and conversations where the model loses track of your instructions.
The good news is that token awareness does not require deep technical knowledge. Knowing the 0.75 rule gives you a quick estimator. Checking your model’s context window tells you how much content fits.
Understanding that output costs more than input helps you budget effectively. These three basics cover most practical situations.
Once you have a feel for how tokens work, decisions about AI tools become clearer. Picking a model with a larger context window makes sense for document analysis.
A cheaper per-token model works better for high-volume, simple tasks. Choosing the right model becomes a more informed decision when token economics are part of the equation.