Real-world tokenizers must deal with two practical issues:
End-of-sequence markers
Tokens that do not exist in the vocabulary
To address these, we extend the vocabulary with:
<|endoftext|> — marks the end of a sample
<|unk|> — stands for “unknown token”
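As a minimal sketch of this extension, assume a tiny hypothetical token list named `$baseTokens` (not the vocabulary built from the book). The two special tokens are appended at the end, so they receive the two highest IDs:

```php
<?php
// Hypothetical toy token list; the real vocabulary comes from the training text.
$baseTokens = ['!', ',', '.', 'do', 'like', 'tea', 'you'];

// Append the special tokens so they get the two highest IDs.
$tokens = array_merge($baseTokens, ['<|endoftext|>', '<|unk|>']);

// Build the token -> ID mapping.
$toyVocab = array_flip($tokens);

echo count($toyVocab), PHP_EOL;            // 9
echo $toyVocab['<|endoftext|>'], PHP_EOL;  // 7
echo $toyVocab['<|unk|>'], PHP_EOL;        // 8
```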
Adding Special Tokens to the Dictionary for Handling Special Cases
Now let's run the following code:
Result:
Improved Tokenizer with Unknown Token Support (SimpleTokenizerV2)
SimpleTokenizerV2 enhances encoding by replacing any unrecognized tokens with <|unk|>:
This makes the tokenizer robust even when encountering unseen text.
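To make the difference concrete, here is a minimal sketch that contrasts dropping unknown tokens with replacing them, working on tokens that are already split. The vocabulary `$toyVocab` and both helper functions are hypothetical and exist only for illustration:

```php
<?php
// Hypothetical toy vocabulary (token -> ID), for illustration only.
$toyVocab = ['do' => 0, 'like' => 1, 'tea' => 2, 'you' => 3, '<|unk|>' => 4];

// Strategy 1: silently skip tokens that are not in the vocabulary.
function encodeDropping(array $tokens, array $vocab): array {
    $ids = [];
    foreach ($tokens as $token) {
        if (isset($vocab[$token])) {
            $ids[] = $vocab[$token];
        }
    }
    return $ids;
}

// Strategy 2: map every out-of-vocabulary token to the <|unk|> ID.
function encodeWithUnk(array $tokens, array $vocab): array {
    $ids = [];
    foreach ($tokens as $token) {
        $ids[] = $vocab[$token] ?? $vocab['<|unk|>'];
    }
    return $ids;
}

$tokens = ['do', 'you', 'like', 'coffee'];

echo implode(', ', encodeDropping($tokens, $toyVocab)), PHP_EOL; // 0, 3, 1
echo implode(', ', encodeWithUnk($tokens, $toyVocab)), PHP_EOL;  // 0, 3, 1, 4
```

With dropping, the output is shorter than the input and the position of the missing word is lost. With `<|unk|>`, the sequence keeps its length and the gap stays visible.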
SimpleTokenizerV2
Example: Encoding Across Sentences
Result:
Notice that:
Any word not in our training text becomes <|unk|>
<|endoftext|> safely separates two samples
This behavior is essential when training models on fragmented or varied datasets.
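When there are more than two samples, the same concatenation can be sketched with `implode()`. The `$samples` array here is a hypothetical stand-in for separate documents:

```php
<?php
// Hypothetical samples; in practice these would be separate documents.
$samples = [
    'Hello, do you like tea?',
    'In the sunlit terraces of the palace.',
];

// Join independent samples with the end-of-text marker between them.
$joined = implode(' <|endoftext|> ', $samples);

echo $joined, PHP_EOL;
```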
Understanding the Token → ID → Token Cycle
A complete tokenizer must be invertible, meaning:
Encoding maps text → IDs
Decoding maps IDs → text
Encoding decoded text produces the same IDs
(the text itself is not recovered exactly: unknown words come back as <|unk|>)
Our implementations satisfy this requirement:
This “round-trip” property is vital when training or debugging a GPT model. It ensures the model sees exactly the same textual units during training that you expect during inference.
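The round trip can be checked mechanically. The following sketch uses hypothetical helper functions `toyEncode()` and `toyDecode()` with a small made-up vocabulary, reusing the split and join patterns from the tokenizer classes. It is an illustration under those assumptions, not the chapter's implementation:

```php
<?php
// Hypothetical toy vocabulary (token -> ID), for illustration only.
$toyVocab = array_flip([',', '.', '?', 'do', 'like', 'tea', 'you', '<|endoftext|>', '<|unk|>']);

// Split on punctuation and whitespace, mapping unseen tokens to <|unk|>.
function toyEncode(string $text, array $vocab): array {
    $parts = preg_split('/([,.:;?_!"()\']|--|\s)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $ids = [];
    foreach ($parts as $part) {
        $token = trim($part);
        if ($token !== '') {
            $ids[] = $vocab[$token] ?? $vocab['<|unk|>'];
        }
    }
    return $ids;
}

// Join tokens with spaces, then remove the space before punctuation.
function toyDecode(array $ids, array $vocab): string {
    $idToToken = array_flip($vocab);
    $tokens = [];
    foreach ($ids as $id) {
        $tokens[] = $idToToken[$id];
    }
    return preg_replace('/\s+([,.?!"()\'])/u', '$1', implode(' ', $tokens));
}

$text = 'Hello, do you like tea?';
$ids = toyEncode($text, $toyVocab);
$decoded = toyDecode($ids, $toyVocab);

echo $decoded, PHP_EOL;                             // <|unk|>, do you like tea?
var_dump(toyEncode($decoded, $toyVocab) === $ids);  // bool(true)
```

The decoded text differs from the input ("Hello" became `<|unk|>`), yet re-encoding it yields the same IDs.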
Below you may find the full code example:
Listing 2.4
Converting tokens to token IDs
Result:
Summary
By this point you've built a fully functional token-to-token-ID system:
You created a vocabulary from real text.
You mapped each token to a unique integer.
You constructed two tokenizer classes:
SimpleTokenizerV1 – minimal, drops unknown tokens
SimpleTokenizerV2 – robust, supports unknown and special tokens
You implemented encoding and decoding cycles.
You observed how <|unk|> and <|endoftext|> are used in practical modeling scenarios.
This completes the foundation needed for feeding data into a neural network.
In the next chapters, you'll begin turning these token sequences into training batches suitable for building your own GPT-style model in PHP.
foreach ($cleaned as $token) {
    if (isset($this->strToInt[$token])) {
        $processed[] = $token;
    } else {
        $processed[] = "<|unk|>";
    }
}
class SimpleTokenizerV2 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }
        // Replace unknown tokens with <|unk|>
        $processed = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $processed[] = $token;
            } else {
                $processed[] = "<|unk|>";
            }
        }
        // Convert tokens to IDs
        $ids = [];
        foreach ($processed as $token) {
            $ids[] = $this->strToInt[$token];
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }
        // Join with spaces
        $text = implode(' ', $tokens);
        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);
        return $text;
    }
}
$tokenizer = new SimpleTokenizerV2($vocab);
$text1 = "Hello, do you like tea?";
$text2 = "In the sunlit terraces of the palace.";
$text = $text1 . " <|endoftext|> " . $text2;
pprint($text);
pprint($tokenizer->encode($text));
pprint($tokenizer->decode($tokenizer->encode($text)));
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[2193, 4, 713, 2186, 1214, 1938, 10, 2192, 104, 1958, 2193, 2193, 1383, 1958, 2193, 6]
<|unk|>, do you like tea? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.
$tokenizer->decode(
    $tokenizer->encode($text)
);
<?php

declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';
require_once __DIR__ . '/../utils-php/utils.php';

use function Apphp\PrettyPrint\pprint;

$startTime = microtime(true);

if (!is_dir('data')) {
    mkdir('data', 0777, true);
}

$filePath = 'data/the-queen-of-spades.txt';
$url = 'https://aiwithphp.org/data/the-queen-of-spades.txt';

// Download the file if it doesn't exist
if (!file_exists($filePath)) {
    $context = stream_context_create(['http' => ['timeout' => 30]]);
    $response = @file_get_contents($url, false, $context);
    if ($response === false) {
        throw new RuntimeException("Failed to download file from $url");
    }
    file_put_contents($filePath, $response);
}

// Read the file content
$rawText = file_get_contents($filePath);
if ($rawText === false) {
    throw new RuntimeException("Unable to read file: $filePath");
}
// Print total number of characters and first 100 characters
pprint("Total number of characters: " . strlen($rawText));
pprint(substr($rawText, 0, 100));
// Regular expression split (similar to Python’s re.split)
$pattern = '/([,.:;?_!"()\'"*]|--|\s)/u';
$preprocessed = preg_split($pattern, $rawText, -1, PREG_SPLIT_DELIM_CAPTURE);
// Clean up: trim and remove empty entries
foreach ($preprocessed as $item) {
    if (($item = trim($item)) !== '') {
        $cleaned[] = $item;
    }
}
$preprocessed = $cleaned ?? [];
// Print first 30 tokens
pprint(array_slice($preprocessed, 0, 30));
pprint('Total tokens: ', count($preprocessed));
// Build a sorted vocabulary of unique tokens
$allWords = array_values(array_unique($preprocessed));
sort($allWords);
pprint("Vocabulary size: " . count($allWords));
// Print the first 16 tokens with their index
pprint('First 16 tokens:');
foreach (array_slice($allWords, 0, 16) as $i => $token) {
    pprint("('$token', $i)");
}
pprint('');
// Create vocabulary mapping (token -> ID)
$vocab = [];
foreach ($allWords as $i => $token) {
    $vocab[$token] = $i;
}
class SimpleTokenizerV1 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }
        // Convert tokens to IDs, skipping tokens missing from the vocabulary
        $ids = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $ids[] = $this->strToInt[$token];
            }
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }
        // Join with spaces
        $text = implode(' ', $tokens);
        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);
        return $text;
    }
}
// Test the tokenizer
# ----------------------------------------
$tokenizer = new SimpleTokenizerV1($vocab);
$text = "It's the last he painted, you know, Mrs. Gisburn said with pardonable pride.";
$ids = $tokenizer->encode($text);
pprint($ids);
pprint($tokenizer->decode($ids));
pprint($tokenizer->decode($tokenizer->encode($text)));
# Extend vocab
# ----------------------------------------
$allTokens = $preprocessed;
sort($allTokens);
$vocab = array_merge($allWords, ["<|endoftext|>", "<|unk|>"]);
$countAllWords = count($vocab);
pprint($countAllWords);
for ($i = $countAllWords - 5; $i < $countAllWords; $i++) {
    pprint($vocab[$i] . ', ' . $i);
}
$vocab = array_flip($vocab);
class SimpleTokenizerV2 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }
        // Replace unknown tokens with <|unk|>
        $processed = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $processed[] = $token;
            } else {
                $processed[] = "<|unk|>";
            }
        }
        // Convert tokens to IDs
        $ids = [];
        foreach ($processed as $token) {
            $ids[] = $this->strToInt[$token];
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }
        // Join with spaces
        $text = implode(' ', $tokens);
        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);
        return $text;
    }
}
$tokenizer = new SimpleTokenizerV2($vocab);
$text1 = "Hello, do you like tea?";
$text2 = "In the sunlit terraces of the palace.";
$text = $text1 . " <|endoftext|> " . $text2;
pprint($text);
pprint($tokenizer->encode($text));
pprint($tokenizer->decode($tokenizer->encode($text)), end: "\n\n");
$endTime = microtime(true);
$elapsedTime = $endTime - $startTime;
pprint("Execution time: " . number_format($elapsedTime, 4) . " seconds");
Total number of characters: 56042
I
There was a card party at the rooms of Narumov of the Horse Guards. The long winter night passed
[I, There, was, a, card, party, at, the, rooms, of, Narumov, of, the, Horse, Guards, ., The, long, winter, night, passed, away, imperceptibly, ,, and, it, was, five, o, ']
Total tokens: 11884
Vocabulary size: 2192
First 16 tokens:
('!', 0)
('"', 1)
(''', 2)
('*', 3)
(',', 4)
('--', 5)
('.', 6)
('17', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('About', 12)
('Ace', 13)
('After', 14)
('Aleksandrovich', 15)
[107, 2, 1699, 1958, 1190, 1036, 1424, 4, 2186, 1176, 4, 6, 1703, 2155, 6]
It' s the last he painted, you know,. said with.
It' s the last he painted, you know,. said with.
2194
yourself, 2189
youth, 2190
à, 2191
<|endoftext|>, 2192
<|unk|>, 2193
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[2193, 4, 713, 2186, 1214, 1938, 10, 2192, 104, 1958, 2193, 2193, 1383, 1958, 2193, 6]
<|unk|>, do you like tea? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.