

What is a token in AI?

In the world of AI, tokens are the fundamental building blocks of language processing. Think of them as the “atoms” of text.

Large Language Models (LLMs) don’t actually read words the way humans do; they process numbers. Tokenization is the bridge that turns raw text into a format the machine can understand.


How Tokenization Works

Instead of looking at a sentence as one long string of characters, an AI breaks it down into smaller chunks. Depending on the model’s “vocabulary,” a token can be:

  • A whole word (e.g., “apple”)
  • A part of a word (e.g., “ing” or “pre”)
  • A single character or punctuation mark (e.g., “!” or “z”)
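The splitting itself can be sketched with a toy "longest-match" tokenizer. The vocabulary below is hand-made for illustration; real models learn vocabularies of tens of thousands of entries from data (for example via Byte Pair Encoding), but the core idea is the same: match the longest known chunk, and fall back to single characters.

```python
# Toy greedy "longest-match" tokenizer over a tiny hand-made vocabulary.
# Illustrates that a token can be a whole word, a word fragment,
# or a single character.

VOCAB = {"apple", "ing", "pre", "run", "!", " "}  # illustrative entries only

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i...
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # ...and fall back to a single character otherwise.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("prerunning!"))  # ['pre', 'run', 'n', 'ing', '!']
```

Notice how "prerunning" splits into known fragments plus a lone "n" that the vocabulary doesn't cover; production tokenizers are built so that any input can always be broken down this way.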

The “4-Character” Rule of Thumb

While the exact math varies by model, a good rule of thumb for English text is that one token is roughly four characters, or about three-quarters of a word. That works out to roughly 750 words per 1,000 tokens.

Common words like “the” or “and” usually count as a single token, while complex or rare words like “idiosyncratic” might be broken into three or four pieces.
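That heuristic is easy to turn into a back-of-the-envelope estimator. The numbers below come from the rule of thumb above; actual counts depend on the specific model's tokenizer, so treat this as a planning tool, not an exact measure.

```python
# Rough token and word estimates for English text, assuming the common
# "1 token ≈ 4 characters ≈ 3/4 of a word" heuristic described above.

def estimate_tokens(text: str) -> int:
    # ~4 characters per token for typical English prose.
    return max(1, round(len(text) / 4))

def estimate_words(token_budget: int) -> int:
    # 1,000 tokens ≈ 750 words, i.e. about 0.75 words per token.
    return int(token_budget * 0.75)

print(estimate_tokens("the quick brown fox"))  # 19 chars → ~5 tokens
print(estimate_words(1000))                    # → 750
```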


Why Tokens Matter

  1. Context Window: Every model has a limit on how many tokens it can process at once. If your prompt plus the conversation history exceeds this limit, the model starts to “forget” the beginning of the conversation.
  2. Cost: Most AI providers (like OpenAI or Anthropic) charge based on the number of tokens processed, not the number of words.
  3. Computing Power: More tokens require more mathematical operations. Efficient tokenization allows models to handle more information faster.

A Concrete Example

If you give an AI the sentence:

“Running is fun.”

The tokenizer might break it down like this:

  1. Run
  2. ning
  3. is
  4. fun
  5. .

Each of these is assigned a specific ID number. The AI then performs complex math on these numbers to predict which token should come next.
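That token-to-ID mapping can be sketched in a few lines. The vocabulary and ID numbers below are invented for illustration and are not any real model's IDs; one realistic detail worth noting is that many tokenizers attach the leading space to the token that follows it, which is why “ is” and “ fun” appear with spaces here.

```python
# Minimal sketch of the token → ID mapping: each token string gets a
# fixed integer ID, and decoding simply concatenates the strings back.
# The vocabulary and IDs are made up for illustration.

VOCAB = {"Run": 1501, "ning": 7023, " is": 318, " fun": 1257, ".": 13}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(tokens: list[str]) -> list[int]:
    return [VOCAB[t] for t in tokens]

def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode(["Run", "ning", " is", " fun", "."])
print(ids)          # [1501, 7023, 318, 1257, 13]
print(decode(ids))  # Running is fun.
```

The model never sees the strings at all, only these integers; everything it “knows” about language is expressed as math over sequences of IDs like this one.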