2.4 Adding Special Context Tokens

Adding Special Tokens and Handling Unknowns

Real-world tokenizers must deal with two practical issues:

  1. Marking the end of one text sample and the beginning of the next

  2. Handling tokens that do not exist in the vocabulary

To address these, we extend the vocabulary with:

  • <|endoftext|> — marks the end of a sample

  • <|unk|> — stands for “unknown token”

Adding Special Tokens to the Vocabulary for Handling Special Cases

Now let's run the following code:
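
The exact listing depends on how you tokenized the text in the earlier sections. The sketch below is a minimal, assumption-laden version: it uses a regex split in the spirit of SimpleTokenizerV1 and a placeholder file name. The essential step is appending the two special tokens to the end of the sorted token list, so they receive the two highest IDs.

```php
<?php
// Minimal sketch - not necessarily the original listing. The splitting regex
// and 'your-training-text.txt' are placeholders; adjust them to match the
// tokenizer and training file you used in the earlier sections.

$rawText = file_get_contents('your-training-text.txt');

// Split on punctuation and whitespace, keeping the punctuation as tokens
$parts = preg_split('/([,.:;?_!"()\']|--|\s)/u', $rawText, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tokens = array_values(array_filter(array_map('trim', $parts), 'strlen'));

// Sorted list of unique tokens, with the two special tokens appended last
$allTokens = array_unique($tokens);
sort($allTokens);
$allTokens[] = '<|endoftext|>';
$allTokens[] = '<|unk|>';

$vocab = array_flip($allTokens); // token => ID

echo count($vocab), PHP_EOL;     // vocabulary size, including the special tokens

// Print the last few entries to confirm the special tokens were added
foreach (array_slice($vocab, -5, null, true) as $token => $id) {
    echo "$token => $id", PHP_EOL;
}
```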

Result:
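
Assuming the sketch above, the output shows the size of the extended vocabulary followed by its last few entries; the two special tokens <|endoftext|> and <|unk|> appear at the very end with the two highest token IDs.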

Improved Tokenizer with Unknown Token Support (SimpleTokenizerV2)

SimpleTokenizerV2 enhances encoding by replacing any token that is not in the vocabulary with <|unk|>.

This makes the tokenizer robust even when encountering unseen text.

SimpleTokenizerV2
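
A possible implementation is sketched below. The splitting regex and the decoding rules are assumptions modeled on the V1 tokenizer; adjust them to match your earlier code.

```php
<?php
// A sketch of SimpleTokenizerV2, assuming a $vocab array (token => ID) like
// the one built above. Tokens missing from the vocabulary are mapped to
// <|unk|> instead of breaking the encoding.

class SimpleTokenizerV2
{
    /** @var array<string, int> token => ID */
    private array $strToInt;

    /** @var array<int, string> ID => token */
    private array $intToStr;

    public function __construct(array $vocab)
    {
        $this->strToInt = $vocab;
        $this->intToStr = array_flip($vocab);
    }

    /** Text -> array of token IDs */
    public function encode(string $text): array
    {
        $parts = preg_split('/([,.:;?_!"()\']|--|\s)/u', $text, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $tokens = array_values(array_filter(array_map('trim', $parts), 'strlen'));

        $ids = [];
        foreach ($tokens as $token) {
            // Replace any token that is not in the vocabulary with <|unk|>
            if (!array_key_exists($token, $this->strToInt)) {
                $token = '<|unk|>';
            }
            $ids[] = $this->strToInt[$token];
        }

        return $ids;
    }

    /** Array of token IDs -> text */
    public function decode(array $ids): string
    {
        $tokens = array_map(fn (int $id): string => $this->intToStr[$id], $ids);
        $text   = implode(' ', $tokens);

        // Remove the spaces that implode() placed before punctuation
        return preg_replace('/\s+([,.:;?!"()\'])/u', '$1', $text);
    }
}
```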

Example: Encoding Across Sentences
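
The usage below is illustrative. The sentences are arbitrary, and the words "Hello" and "palace" are assumed not to occur in the training text, so they are encoded as <|unk|>.

```php
<?php
// Join two unrelated sentences with <|endoftext|> and encode them in one call.
$tokenizer = new SimpleTokenizerV2($vocab);

$text1 = 'Hello, do you like tea?';
$text2 = 'In the sunlit terraces of the palace.';
$text  = $text1 . ' <|endoftext|> ' . $text2;

echo $text, PHP_EOL;
print_r($tokenizer->encode($text));
```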

Result:

Notice that:

  • Any word not in our training text becomes <|unk|>

  • <|endoftext|> safely separates two samples

This behavior is essential when training models on fragmented or varied datasets.

Understanding the Token → ID → Token Cycle

A complete tokenizer must be invertible, meaning:

  1. Encoding maps text → IDs

  2. Decoding maps IDs → text

  3. Re-encoding the decoded text produces the same IDs (although the original wording of unknown tokens is lost to <|unk|>)

Our implementations satisfy this requirement:
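
Continuing the example above, a quick round-trip check looks like this (a sketch; it assumes $tokenizer and $text from the previous snippet are still in scope):

```php
<?php
// Round-trip check: encode, decode, then re-encode the decoded text.
$ids     = $tokenizer->encode($text);
$decoded = $tokenizer->decode($ids);

echo $decoded, PHP_EOL;                          // unknown words appear as <|unk|>
var_dump($ids === $tokenizer->encode($decoded)); // expected: bool(true)
```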

This “round-trip” property is vital when training or debugging a GPT model. It ensures the model sees exactly the same textual units during training that you expect during inference.

Below is the full code example:

Listing 2.4 Converting tokens to token IDs
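
The following is a consolidated sketch rather than the original listing. It assumes the SimpleTokenizerV2 class defined above is in scope and that 'your-training-text.txt' is replaced with your own training file.

```php
<?php
// 1. Build the vocabulary and append the special tokens
$rawText = file_get_contents('your-training-text.txt');
$parts = preg_split('/([,.:;?_!"()\']|--|\s)/u', $rawText, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tokens = array_values(array_filter(array_map('trim', $parts), 'strlen'));
$allTokens = array_unique($tokens);
sort($allTokens);
$allTokens[] = '<|endoftext|>';
$allTokens[] = '<|unk|>';
$vocab = array_flip($allTokens);

// 2. Encode two unrelated samples joined by <|endoftext|>
$tokenizer = new SimpleTokenizerV2($vocab);
$text = 'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.';
$ids  = $tokenizer->encode($text);
print_r($ids);

// 3. Decode back to text; unknown words come back as <|unk|>
echo $tokenizer->decode($ids), PHP_EOL;
```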

Result:
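
The script prints the list of token IDs for the combined text, followed by the decoded string. Apart from the words that were replaced by <|unk|>, the decoded text matches the input.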

Summary

By this point you've built a fully functional token-to-token-ID system:

  • You created a vocabulary from real text.

  • You mapped each token to a unique integer.

  • You constructed two tokenizer classes:

    • SimpleTokenizerV1 – minimal, drops unknown tokens

    • SimpleTokenizerV2 – robust, supports unknown and special tokens

  • You implemented encoding and decoding cycles.

  • You observed how <|unk|> and <|endoftext|> are used in practical modeling scenarios.

This completes the foundation needed for feeding data into a neural network.

In the next chapters, you'll begin turning these token sequences into training batches suitable for building your own GPT-style model in PHP.
