2.4 Adding Special Context Tokens

Adding Special Tokens and Handling Unknowns

Real-world tokenizers must deal with two practical issues:

End-of-sequence markers
Tokens that do not exist in the vocabulary

To address these, we extend the vocabulary with:

<|endoftext|> — marks the end of a sample
<|unk|> — stands for “unknown token”

Now let's run the following code:

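// $allWords is the sorted, unique token list built earlier;
// append the two special tokens so they receive the highest IDs
$vocab = array_merge($allWords, ["<|endoftext|>", "<|unk|>"]);
$countAllWords = count($vocab);
pprint($countAllWords);
// Print the last five entries together with their IDs
for ($i = $countAllWords - 5; $i < $countAllWords; $i++) {
    pprint($vocab[$i] . ', ' . $i);
}
// Turn the list into a token => ID map
$vocab = array_flip($vocab);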

Result:

2194
yourself, 2189
youth, 2190
à, 2191
<|endoftext|>, 2192
<|unk|>, 2193
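
To see the extension step in isolation, here is a minimal sketch on a toy vocabulary. The token list and the helper name extendVocab are hypothetical and exist only for illustration. The sketch shows that tokens appended after sorting receive the highest IDs, and that array_flip turns the list into a token => ID map.

```php
<?php
// Hypothetical helper (not part of the chapter's listing): append special
// tokens to an already sorted token list and return a token => ID map.
function extendVocab(array $sortedTokens, array $specialTokens): array
{
    // Skip special tokens that are already present so IDs stay unique.
    foreach ($specialTokens as $special) {
        if (!in_array($special, $sortedTokens, true)) {
            $sortedTokens[] = $special;
        }
    }
    // array_flip turns [0 => 'a', 1 => 'b'] into ['a' => 0, 'b' => 1].
    return array_flip($sortedTokens);
}

$toyTokens = ['!', ',', 'do', 'like', 'tea', 'you']; // toy data, already sorted
$toyVocab = extendVocab($toyTokens, ['<|endoftext|>', '<|unk|>']);

echo $toyVocab['<|endoftext|>'], "\n"; // 6
echo $toyVocab['<|unk|>'], "\n";       // 7
```

With six ordinary tokens occupying IDs 0 to 5, the two special tokens land on 6 and 7, mirroring how they land on 2192 and 2193 in the real vocabulary.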

Improved Tokenizer with Unknown Token Support (SimpleTokenizerV2)

SimpleTokenizerV2 enhances encoding by replacing any unrecognized tokens with <|unk|>:

foreach ($cleaned as $token) {
    if (isset($this->strToInt[$token])) {
        $processed[] = $token;
    } else {
        $processed[] = "<|unk|>";
    }
}

This makes the tokenizer robust even when encountering unseen text.
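
The same fallback can be written more compactly with PHP's null coalescing operator. The sketch below is an alternative formulation, not the chapter's implementation; the function name tokensToIds and the toy vocabulary are assumptions made for illustration.

```php
<?php
// Hypothetical helper: map tokens to IDs, falling back to the <|unk|> ID.
function tokensToIds(array $tokens, array $strToInt): array
{
    if (!isset($strToInt['<|unk|>'])) {
        throw new InvalidArgumentException('Vocabulary must contain <|unk|>');
    }
    $unkId = $strToInt['<|unk|>'];

    $ids = [];
    foreach ($tokens as $token) {
        // ?? yields the right-hand side when the key is missing.
        $ids[] = $strToInt[$token] ?? $unkId;
    }
    return $ids;
}

$toyVocab = ['do' => 0, 'tea' => 1, 'you' => 2, '<|unk|>' => 3]; // toy data
$toyIds = tokensToIds(['do', 'you', 'like', 'tea'], $toyVocab);
echo implode(', ', $toyIds), "\n"; // 0, 2, 3, 1
```

Checking for <|unk|> up front turns a missing special token into an explicit error instead of an undefined-index warning in the middle of encoding.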

SimpleTokenizerV2

class SimpleTokenizerV2 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }

        // Replace unknown tokens with <|unk|>
        $processed = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $processed[] = $token;
            } else {
                $processed[] = "<|unk|>";
            }
        }

        // Convert tokens to IDs
        $ids = [];
        foreach ($processed as $token) {
            $ids[] = $this->strToInt[$token];
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }

        // Join with spaces
        $text = implode(' ', $tokens);

        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);

        return $text;
    }
}
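
One detail of encode() is easy to miss: the split pattern contains none of the characters <, | or >, so a special token written literally in the input survives as a single token, provided it is surrounded by whitespace or by punctuation the pattern knows. The sketch below checks this with an equivalent pattern; the helper name splitTokens is an assumption for illustration.

```php
<?php
// Hypothetical helper reproducing the split-and-clean step of encode().
function splitTokens(string $text): array
{
    $pattern = '/([,.:;?_!"()\']|--|\s)/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = [];
    foreach ($parts as $part) {
        $trimmed = trim($part);
        if ($trimmed !== '') {
            $tokens[] = $trimmed;
        }
    }
    return $tokens;
}

$tokens = splitTokens('tea? <|endoftext|> In the palace.');
print_r($tokens);
```

If the marker is glued to a neighbouring word, as in "tea<|endoftext|>In", the pattern finds nothing to split on and the whole string becomes one token, which then maps to <|unk|>. This is why the example that follows pads the marker with spaces.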

Example: Encoding Across Sentences

$tokenizer = new SimpleTokenizerV2($vocab);
$text1 = "Hello, do you like tea?";
$text2 = "In the sunlit terraces of the palace.";
$text = $text1 . " <|endoftext|> " . $text2;
pprint($text);
pprint($tokenizer->encode($text));
pprint($tokenizer->decode($tokenizer->encode($text)));

Result:

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[2193, 4, 713, 2186, 1214, 1938, 10, 2192, 104, 1958, 2193, 2193, 1383, 1958, 2193, 6]
<|unk|>, do you like tea? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.

Notice that:

Any word not in our training text becomes <|unk|>
<|endoftext|> safely separates two samples

This behavior is essential when training models on fragmented or varied datasets.
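
When a corpus consists of several independent documents, the separator is typically placed between every pair of samples before encoding. A minimal sketch, assuming a hypothetical helper joinSamples and the two sentences from the example above:

```php
<?php
// Hypothetical helper: concatenate independent text samples, placing the
// <|endoftext|> marker (padded with spaces) between neighbouring samples.
function joinSamples(array $samples, string $separator = '<|endoftext|>'): string
{
    // The surrounding spaces let a whitespace-based splitter isolate the marker.
    return implode(' ' . $separator . ' ', $samples);
}

$joined = joinSamples([
    'Hello, do you like tea?',
    'In the sunlit terraces of the palace.',
]);
echo $joined, "\n";
// Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
```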

Understanding the Token → ID → Token Cycle

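A tokenizer should be as close to invertible as its vocabulary allows, meaning:

Encoding maps text → IDs
Decoding maps IDs → text
Re-encoding the decoded text produces the same IDs

For the examples in this section both tokenizers keep the IDs stable across a round trip, but the decoded text is not always identical to the input: SimpleTokenizerV2 returns out-of-vocabulary words as <|unk|>, and SimpleTokenizerV1 drops them entirely. You can inspect the round trip with: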

$tokenizer->decode(
    $tokenizer->encode($text)
);

This “round-trip” property is vital when training or debugging a GPT model. It ensures the model sees exactly the same textual units during training that you expect during inference.
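
The round trip can also be checked mechanically. The sketch below uses a toy vocabulary and two simplified functions, toyEncode and toyDecode, which are assumptions that mirror the logic of SimpleTokenizerV2 rather than the class itself. It shows that the IDs are stable across a round trip even though the text is not.

```php
<?php
// Toy vocabulary (assumption): a few tokens plus the two special ones.
$strToInt = array_flip([',', '.', '?', 'do', 'like', 'tea', 'you', '<|endoftext|>', '<|unk|>']);
$intToStr = array_flip($strToInt);

// Simplified encode: split, clean, and map unknown tokens to <|unk|>.
function toyEncode(string $text, array $strToInt): array
{
    $parts = preg_split('/([,.:;?_!"()\']|--|\s)/u', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $ids = [];
    foreach ($parts as $part) {
        $token = trim($part);
        if ($token !== '') {
            $ids[] = $strToInt[$token] ?? $strToInt['<|unk|>'];
        }
    }
    return $ids;
}

// Simplified decode: join with spaces, then remove spaces before punctuation.
function toyDecode(array $ids, array $intToStr): string
{
    $tokens = array_map(fn (int $id): string => $intToStr[$id], $ids);
    return preg_replace('/\s+([,.?!"()\'])/u', '$1', implode(' ', $tokens));
}

$text = 'Hello, do you like tea?';
$ids = toyEncode($text, $strToInt);
$decoded = toyDecode($ids, $intToStr);

var_dump(toyEncode($decoded, $strToInt) === $ids); // bool(true): IDs are stable
var_dump($decoded === $text);                      // bool(false): "Hello" became <|unk|>
```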

The full code example is given below:

Listing 2.4

Converting tokens to token IDs

declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';
require_once __DIR__ . '/../utils-php/utils.php';
use function Apphp\PrettyPrint\pprint;

$startTime = microtime(true);

if (!is_dir('data')) {
    mkdir('data', 0777, true);
}

$filePath = 'data/the-queen-of-spades.txt';
$url = 'https://aiwithphp.org/data/the-queen-of-spades.txt';

// Download the file if it doesn't exist
if (!file_exists($filePath)) {
    $context = stream_context_create(['http' => ['timeout' => 30]]);
    $response = @file_get_contents($url, false, $context);
    if ($response === false) {
        throw new RuntimeException("Failed to download file from $url");
    }

    file_put_contents($filePath, $response);
}

// Read the file content
$rawText = file_get_contents($filePath);
if ($rawText === false) {
    throw new RuntimeException("Unable to read file: $filePath");
}

// Print the total length and the first 100 bytes of the text
// (strlen and substr count bytes, so multibyte characters such as "à" count more than once)
pprint("Total number of characters: " . strlen($rawText));
pprint(substr($rawText, 0, 100));

// Regular expression split (similar to Python’s re.split)
$pattern = '/([,.:;?_!"()\'"*]|--|\s)/u';
$preprocessed = preg_split($pattern, $rawText, -1, PREG_SPLIT_DELIM_CAPTURE);

// Clean up: trim and remove empty entries
foreach ($preprocessed as $item) {
    if (($item = trim($item)) !== '') {
        $cleaned[] = $item;
    }
}
$preprocessed = $cleaned ?? [];

// Print first 30 tokens
pprint(array_slice($preprocessed, 0, 30));
pprint('Total tokens: ', count($preprocessed));

// Build a sorted, unique vocabulary
$allWords = array_values(array_unique($preprocessed));
sort($allWords);
pprint("Vocabulary size: " . count($allWords));

// Print first 15 tokens with their index
pprint('First 15 tokens:');
foreach (array_slice($allWords, 0, 16) as $i => $token) {
    pprint("('$token', $i)");
}
pprint('');

// Create vocabulary mapping (token -> ID)
$vocab = [];
foreach ($allWords as $i => $token) {
    $vocab[$token] = $i;
}

class SimpleTokenizerV1 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }

        // Convert tokens to IDs
        $ids = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $ids[] = $this->strToInt[$token];
            }
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }

        // Join with spaces
        $text = implode(' ', $tokens);

        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);

        return $text;
    }
}

// Test the tokenizer
# ----------------------------------------
$tokenizer = new SimpleTokenizerV1($vocab);
$text = "It's the last he painted, you know, Mrs. Gisburn said with pardonable pride.";

$ids = $tokenizer->encode($text);
pprint($ids);
pprint($tokenizer->decode($ids));
pprint($tokenizer->decode($tokenizer->encode($text)));

# Extend vocab
# ----------------------------------------
// $allWords is already sorted and unique, so the special tokens get the two highest IDs
$vocab = array_merge($allWords, ["<|endoftext|>", "<|unk|>"]);
$countAllWords = count($vocab);
pprint($countAllWords);
for ($i = $countAllWords - 5; $i < $countAllWords; $i++) {
    pprint($vocab[$i]. ', ' . $i);
}
$vocab = array_flip($vocab);


class SimpleTokenizerV2 {
    private array $strToInt;
    private array $intToStr;

    public function __construct(array $vocab) {
        $this->strToInt = $vocab;
        // Create reverse mapping: int -> str
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array {
        // Split on punctuation and whitespace, capturing delimiters
        $pattern = '/([,.:;?_!"()\'"]|--|\s)/u';
        $preprocessed = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

        // Clean up: trim and remove empty entries
        $cleaned = [];
        foreach ($preprocessed as $item) {
            $trimmed = trim($item);
            if ($trimmed !== '') {
                $cleaned[] = $trimmed;
            }
        }

        // Replace unknown tokens with <|unk|>
        $processed = [];
        foreach ($cleaned as $token) {
            if (isset($this->strToInt[$token])) {
                $processed[] = $token;
            } else {
                $processed[] = "<|unk|>";
            }
        }

        // Convert tokens to IDs
        $ids = [];
        foreach ($processed as $token) {
            $ids[] = $this->strToInt[$token];
        }
        return $ids;
    }

    public function decode(array $ids): string {
        // Convert IDs back to tokens
        $tokens = [];
        foreach ($ids as $id) {
            if (isset($this->intToStr[$id])) {
                $tokens[] = $this->intToStr[$id];
            }
        }

        // Join with spaces
        $text = implode(' ', $tokens);

        // Remove spaces before specified punctuation
        $text = preg_replace('/\s+([,.?!"()\'])/u', '$1', $text);

        return $text;
    }
}

$tokenizer = new SimpleTokenizerV2($vocab);
$text1 = "Hello, do you like tea?";
$text2 = "In the sunlit terraces of the palace.";
$text = $text1 . " <|endoftext|> " . $text2;
pprint($text);
pprint($tokenizer->encode($text));
pprint($tokenizer->decode($tokenizer->encode($text)), end: "\n\n");

$endTime = microtime(true);
$elapsedTime = $endTime - $startTime;
pprint("Execution time: " . number_format($elapsedTime, 4) . " seconds");

Result:

Total number of characters: 56042
I

There was a card party at the rooms of Narumov of the Horse Guards. The long winter night passed 
[I, There, was, a, card, party, at, the, rooms, of, Narumov, of, the, Horse, Guards, ., The, long, winter, night, passed, away, imperceptibly, ,, and, it, was, five, o, ']
Total tokens:  11884
Vocabulary size: 2192
First 15 tokens:
('!', 0)
('"', 1)
(''', 2)
('*', 3)
(',', 4)
('--', 5)
('.', 6)
('17', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('About', 12)
('Ace', 13)
('After', 14)
('Aleksandrovich', 15)

[107, 2, 1699, 1958, 1190, 1036, 1424, 4, 2186, 1176, 4, 6, 1703, 2155, 6]
It' s the last he painted, you know,. said with.
It' s the last he painted, you know,. said with.
2194
yourself, 2189
youth, 2190
à, 2191
<|endoftext|>, 2192
<|unk|>, 2193
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[2193, 4, 713, 2186, 1214, 1938, 10, 2192, 104, 1958, 2193, 2193, 1383, 1958, 2193, 6]
<|unk|>, do you like tea? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.

Summary

By this point you've built a fully functional token-to-token-ID system:

You created a vocabulary from real text.
You mapped each token to a unique integer.
You constructed two tokenizer classes:
- SimpleTokenizerV1 – minimal, drops unknown tokens
- SimpleTokenizerV2 – robust, supports unknown and special tokens
You implemented encoding and decoding cycles.
You observed how <|unk|> and <|endoftext|> are used in practical modeling scenarios.

This completes the foundation needed for feeding data into a neural network.

In the next chapters, you'll begin turning these token sequences into training batches suitable for building your own GPT-style model in PHP.
