2.4 Adding Special Context Tokens
Adding Special Tokens and Handling Unknowns
Real-world tokenizers must deal with two practical issues:
End-of-sequence markers
Tokens that do not exist in the vocabulary
To address these, we extend the vocabulary with:
<|endoftext|> — marks the end of a sample
<|unk|> — stands for “unknown token”

Now let's add these two tokens to the vocabulary and run the tokenizer again.
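A minimal PHP sketch, assuming $allTokens holds the sorted list of unique tokens built from the training text in the previous section (the variable names here are illustrative, not the book's exact code):

```php
<?php
// $allTokens: sorted list of unique tokens from the training text (previous section).
// Append the two special tokens so they receive the last two IDs.
$allTokens[] = '<|endoftext|>';
$allTokens[] = '<|unk|>';

// Rebuild the vocabulary: token => integer ID
$vocab = [];
foreach ($allTokens as $id => $token) {
    $vocab[$token] = $id;
}

echo count($vocab) . PHP_EOL;   // vocabulary size, now two entries larger

// Print the last few entries to confirm the special tokens were added
foreach (array_slice($vocab, -3) as $token => $id) {
    echo "$token => $id" . PHP_EOL;
}
```

The exact vocabulary size depends on your training text; what matters is that <|endoftext|> and <|unk|> now occupy the last two IDs.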
Improved Tokenizer with Unknown Token Support (SimpleTokenizerV2)
SimpleTokenizerV2 enhances encoding by replacing any unrecognized tokens with <|unk|>:
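A minimal PHP sketch of such a class might look like the following (the punctuation-based splitting rules are an assumption carried over from SimpleTokenizerV1; the essential change is the <|unk|> fallback in encode()):

```php
<?php

class SimpleTokenizerV2
{
    /** @var array<string, int> token => ID */
    private array $strToInt;

    /** @var array<int, string> ID => token */
    private array $intToStr;

    public function __construct(array $vocab)
    {
        $this->strToInt = $vocab;
        $this->intToStr = array_flip($vocab);
    }

    /** Split text into tokens, replace unknown tokens with <|unk|>, return IDs. */
    public function encode(string $text): array
    {
        $parts = preg_split('/([,.:;?_!"()\']|--|\s)/', $text, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $tokens = array_filter(array_map('trim', $parts), 'strlen');

        $ids = [];
        foreach ($tokens as $token) {
            if (!array_key_exists($token, $this->strToInt)) {
                $token = '<|unk|>';   // unrecognized token => unknown marker
            }
            $ids[] = $this->strToInt[$token];
        }

        return $ids;
    }

    /** Map IDs back to tokens and join them into readable text. */
    public function decode(array $ids): string
    {
        $tokens = array_map(fn (int $id): string => $this->intToStr[$id], $ids);
        $text = implode(' ', $tokens);

        // Remove the space that precedes punctuation characters
        return preg_replace('/\s+([,.:;?_!"()\'])/', '$1', $text);
    }
}
```

Compared with SimpleTokenizerV1, the only change is the lookup guard inside encode(): tokens missing from the vocabulary are mapped to <|unk|> instead of being dropped.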
This makes the tokenizer robust even when encountering unseen text.
Example: Encoding Across Sentences
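For instance, two unrelated sentences can be joined with <|endoftext|> and encoded in one call. The sample sentences below are illustrative and deliberately contain words the vocabulary has never seen; $vocab is the extended vocabulary from above.

```php
<?php
$text1 = 'Hello, do you like tea?';
$text2 = 'In the sunlit terraces of the palace.';

// Join independent samples with the end-of-text marker
$text = implode(' <|endoftext|> ', [$text1, $text2]);
echo $text . PHP_EOL;

$tokenizer = new SimpleTokenizerV2($vocab);
$ids = $tokenizer->encode($text);

print_r($ids);                            // IDs for both sentences, with <|endoftext|> between them
echo $tokenizer->decode($ids) . PHP_EOL;  // unknown words come back as <|unk|>
```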
In the resulting sequence of token IDs, notice that:
Any word not in our training text becomes <|unk|>
<|endoftext|> safely separates the two samples
This behavior is essential when training models on fragmented or varied datasets.
Understanding the Token → ID → Token Cycle
A complete tokenizer must be invertible, meaning:
Encoding maps text → IDs
Decoding maps IDs → text
Encoding the decoded text produces the same IDs (unknown words, however, are not recovered; they remain <|unk|>)
Our implementations satisfy this requirement:
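A quick check in PHP, reusing $tokenizer and $text from the example above:

```php
<?php
$ids     = $tokenizer->encode($text);
$decoded = $tokenizer->decode($ids);

// Re-encoding the decoded text yields the same IDs, because unknown
// words were already normalized to <|unk|> in the first pass.
var_dump($ids === $tokenizer->encode($decoded));   // bool(true)
```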
This “round-trip” property is vital when training or debugging a GPT model. It ensures the model sees exactly the same textual units during training that you expect during inference.
Below you may find the full code example:
Listing 2.4
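A self-contained PHP sketch covering the same steps is shown below. The sample text file name and sentences are assumptions, and SimpleTokenizerV2 refers to the class sketched earlier, assumed to be saved as SimpleTokenizerV2.php.

```php
<?php
// listing_2_4.php: end-to-end sketch

require __DIR__ . '/SimpleTokenizerV2.php';

// 1. Build the vocabulary from the training text (any plain-text file will do)
$rawText = file_get_contents(__DIR__ . '/the-verdict.txt');
$parts   = preg_split('/([,.:;?_!"()\']|--|\s)/', $rawText, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tokens  = array_filter(array_map('trim', $parts), 'strlen');

$allTokens = array_values(array_unique($tokens));
sort($allTokens);

// 2. Extend the vocabulary with the special tokens
$allTokens[] = '<|endoftext|>';
$allTokens[] = '<|unk|>';
$vocab = array_flip($allTokens);   // token => ID

// 3. Encode, decode, and re-encode to verify the round trip
$tokenizer = new SimpleTokenizerV2($vocab);

$text1 = 'Hello, do you like tea?';
$text2 = 'In the sunlit terraces of the palace.';
$text  = implode(' <|endoftext|> ', [$text1, $text2]);

$ids = $tokenizer->encode($text);
print_r($ids);
echo $tokenizer->decode($ids) . PHP_EOL;
var_dump($ids === $tokenizer->encode($tokenizer->decode($ids)));
```

Running the script prints the encoded IDs, the decoded text with <|unk|> in place of unseen words, and bool(true) for the round-trip check.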
Summary
By this point you've built a fully functional token-to-token-ID system:
You created a vocabulary from real text.
You mapped each token to a unique integer.
You constructed two tokenizer classes:
SimpleTokenizerV1 – minimal, drops unknown tokens
SimpleTokenizerV2 – robust, supports unknown and special tokens
You implemented encoding and decoding cycles.
You observed how <|unk|> and <|endoftext|> are used in practical modeling scenarios.
This completes the foundation needed for feeding data into a neural network.
In the next chapters, you'll begin turning these token sequences into training batches suitable for building your own GPT-style model in PHP.