2.3 Converting Tokens to Token IDs
Once we have a deterministic way to split raw text into tokens, the next essential step is converting those tokens into token IDs. Token IDs are simply integers that uniquely represent each token in our vocabulary. Neural networks cannot operate directly on text - they operate on numbers - so this conversion is foundational for building any GPT-style model.
In this chapter, you’ll learn:
Why token IDs matter
How to build a vocabulary from real text
How to map tokens to IDs (and back)
How to handle unknown tokens
How our SimpleTokenizerV1 and SimpleTokenizerV2 implement token-to-ID conversion in pure PHP

What Is a Token ID?
A token is an atomic unit of text: a word, a punctuation mark, or a meaningful sequence like "--".
A token ID is an integer assigned to a token. Every unique token in the vocabulary receives exactly one integer index:
Token    ID
The      0
queen    1
of       2
spades   3
…        …
This mapping allows us to turn a sequence of words into a sequence of numbers that a model can learn from.
Building a Vocabulary from the Text
Before converting anything into IDs, we must decide which tokens exist in our mini-GPT’s world. In this chapter, we use a real public-domain text - The Queen of Spades - as our training corpus.
The code below reads or downloads the text file, tokenizes it, and builds a vocabulary:
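A minimal sketch of that step, assuming a preg_split()-based splitter like the ones used in common GPT-from-scratch tutorials; the short inline sample stands in for the downloaded The Queen of Spades file, and the exact regex is an assumption:

```php
<?php
// Sketch of the vocabulary-building step. In the chapter the corpus is
// "The Queen of Spades"; normally $rawText would come from file_get_contents().
$rawText = "The queen of spades signifies secret ill-will. -- The queen, again!";

// Split on punctuation, "--", and whitespace, keeping the delimiters
// (the exact pattern is an assumption modeled on common GPT tutorials).
$parts = preg_split('/([,.:;?_!"()\']|--|\s)/', $rawText, -1, PREG_SPLIT_DELIM_CAPTURE);

// Trim and drop empty / whitespace-only pieces.
$preprocessed = array_values(array_filter(array_map('trim', $parts), fn ($t) => $t !== ''));

// Alphabetized list of unique tokens.
$allWords = array_unique($preprocessed);
sort($allWords);

echo count($allWords) . " unique tokens\n"; // prints "12 unique tokens"
```

On the full novel the same code produces a vocabulary of a few thousand entries instead of twelve.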
At this point:
$preprocessed (later cleaned) contains the full tokenized text.
$allWords contains an alphabetized list of unique tokens.
Each token can be assigned an integer index based on its position in the sorted list.
This simple approach mirrors early NLP systems and is exactly how many introductory GPT tutorials begin.
Creating the Token → ID Mapping
To convert tokens into IDs, we construct a PHP associative array:
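One way to build that array is with array_flip(), which turns each token's position in the alphabetized list into its integer ID (the short hand-made $allWords list here is illustrative):

```php
<?php
// Creating the token => ID mapping from the sorted unique token list.
// (The small hand-made $allWords list is illustrative, not the full corpus.)
$allWords = ['!', ',', '--', '.', 'Chance', 'queen', 'spades'];

// array_flip() turns values into keys, so each token maps to its
// position in the alphabetized list.
$vocab = array_flip($allWords);

echo $vocab['Chance']; // prints 4
```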
Now $vocab['Chance'] returns something like 40, depending on sort order.
This mapping is passed into our tokenizer classes.
Implementing a Basic Tokenizer (SimpleTokenizerV1)
SimpleTokenizerV1 is a minimal, direct tokenizer:
It splits text using the same pattern used during vocabulary creation.
It cleans whitespace.
If a token appears in the vocabulary, it is added to the ID list.
If a token does not appear in the vocabulary, it is discarded.
Encoding Text into IDs
Here’s the essential part of encode():
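The chapter's exact listing isn't reproduced here; the sketch below implements the behavior described above, with the class internals (the $strToInt property name, the regex) as assumptions:

```php
<?php
// Sketch of SimpleTokenizerV1::encode() as described above: split the text
// with the vocabulary-building regex, trim whitespace, and keep only tokens
// that exist in the vocabulary. The class internals are an assumption.
class SimpleTokenizerV1
{
    private array $strToInt; // token => ID

    public function __construct(array $vocab)
    {
        $this->strToInt = $vocab;
    }

    public function encode(string $text): array
    {
        $parts  = preg_split('/([,.:;?_!"()\']|--|\s)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        $tokens = array_filter(array_map('trim', $parts), fn ($t) => $t !== '');

        $ids = [];
        foreach ($tokens as $token) {
            if (isset($this->strToInt[$token])) {
                $ids[] = $this->strToInt[$token]; // known token => its ID
            }
            // unknown tokens are silently discarded in V1
        }
        return $ids;
    }
}

$tokenizer = new SimpleTokenizerV1(array_flip(['.', 'of', 'queen', 'spades']));
print_r($tokenizer->encode('queen of hearts.')); // [2, 1, 0] -- "hearts" is dropped
```

Note that isset() works correctly even for the token with ID 0, because it checks the key, not the value.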
If the token exists, it is converted into its integer ID.
Decoding IDs Back to Text
decode() performs the inverse operation:
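A sketch of that inverse step; variable names such as $intToStr are assumptions, not the chapter's code:

```php
<?php
// Sketch of the decode() direction: map each ID back to its token,
// join with spaces, then tighten the space before punctuation.
$vocab    = array_flip(['!', ',', '.', 'The', 'of', 'queen', 'spades']);
$intToStr = array_flip($vocab); // invert the mapping: ID => token

$ids    = [3, 5, 4, 6, 2];
$tokens = array_map(fn ($id) => $intToStr[$id], $ids);

$text = implode(' ', $tokens); // "The queen of spades ."
// Normalize punctuation spacing: "spades ." => "spades."
$text = preg_replace('/\s+([,.:;?!"()\'])/', '$1', $text);

echo $text; // prints "The queen of spades."
```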
After decoding, the tokens are joined and punctuation spacing is normalized.
Example: Encoding and Decoding
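The example can be reconstructed as a self-contained round trip; the class sketch and the tiny stand-in corpus below are illustrative assumptions rather than the chapter's exact listing:

```php
<?php
// End-to-end round trip: build a vocabulary, encode a sentence, decode it.
class SimpleTokenizerV1
{
    private array $strToInt; // token => ID
    private array $intToStr; // ID => token

    public function __construct(array $vocab)
    {
        $this->strToInt = $vocab;
        $this->intToStr = array_flip($vocab);
    }

    public function encode(string $text): array
    {
        $parts  = preg_split('/([,.:;?_!"()\']|--|\s)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        $tokens = array_filter(array_map('trim', $parts), fn ($t) => $t !== '');

        $ids = [];
        foreach ($tokens as $token) {
            if (isset($this->strToInt[$token])) {
                $ids[] = $this->strToInt[$token]; // unknown tokens are skipped
            }
        }
        return $ids;
    }

    public function decode(array $ids): string
    {
        $text = implode(' ', array_map(fn ($id) => $this->intToStr[$id], $ids));
        // Tighten the space before punctuation: "spades ." => "spades."
        return preg_replace('/\s+([,.:;?!"()\'])/', '$1', $text);
    }
}

// Build the vocabulary from a tiny stand-in corpus.
$corpus   = "The queen of spades signifies secret ill-will.";
$parts    = preg_split('/([,.:;?_!"()\']|--|\s)/', $corpus, -1, PREG_SPLIT_DELIM_CAPTURE);
$allWords = array_unique(array_filter(array_map('trim', $parts), fn ($t) => $t !== ''));
sort($allWords);
$vocab = array_flip($allWords);

$tokenizer = new SimpleTokenizerV1($vocab);
$ids       = $tokenizer->encode("The queen of spades.");
echo $tokenizer->decode($ids); // prints "The queen of spades."
```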

Even this simple pipeline is enough to convert natural language into numerical form. However, you may notice that some tokens weren't found, such as Mrs., pardonable, or pride. In the next chapter, we'll learn how to handle such unknown tokens.