2.7 Creating Token Embeddings
In the previous sections, we prepared text data for a large language model (LLM) by tokenizing the text and converting the text tokens into token IDs. Now we will look at the final step of preparing the input: transforming token IDs into vector representations called token embeddings. This step is essential because neural networks such as GPT cannot operate directly on discrete symbols; they need continuous numerical vectors that can be processed by the matrix multiplications inside deep network layers.

Why We Need Embeddings
Token embeddings convert each token ID into a vector of real numbers. These vectors capture semantic meaning and allow the model to learn relationships between words and subwords. At the beginning of training, however, the vectors do not contain any meaningful information; the model must learn suitable values during training using backpropagation.
Before training begins, we must initialize the embedding weights. Typically, the embedding matrix is filled with random numbers drawn from a normal distribution. This serves as a starting point for optimization: during training, the embedding vectors gradually adapt to represent the statistical structure of the language.
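The demo later in this section implements its own small Embedding class, but the same idea can be illustrated with PyTorch's built-in torch.nn.Embedding. The snippet below is only a reference sketch; using PyTorch and the toy sizes chosen here are assumptions for illustration, not requirements of the demo.

```python
import torch

torch.manual_seed(123)  # deterministic initialization for reproducibility

vocab_size = 6      # toy vocabulary with 6 token IDs (0..5)
embedding_dim = 3   # each token is mapped to a 3-dimensional vector

# The layer stores a (vocab_size x embedding_dim) weight matrix whose
# entries are drawn from a standard normal distribution at creation time.
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
print(embedding_layer.weight)

# Converting a token ID into its embedding vector is just a row lookup.
print(embedding_layer(torch.tensor([3])))
```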
It's important to remember:
“The last step in preparing input text for training an LLM is to convert token IDs into vector representations. Before this, we initialize the embedding weights with random values. This initialization serves as the starting point for the learning process.
The text preparation process includes tokenizing the text, converting the text tokens into token IDs, and converting these IDs into embedding vectors. Here we use the previously created token IDs to obtain the embedding vectors.
Continuous vector representation is required because LLMs such as GPT are deep neural networks trained using the backpropagation algorithm.”
Embedding Layer and One-Hot Encoding
If you are familiar with one-hot encoding, you know that each token can be represented by a vector the size of the vocabulary, with exactly one position equal to 1 and all others equal to 0. However, this representation is extremely inefficient: the vectors are very large and sparse, and they do not contain any learned relationships.
An embedding layer can be seen as a more efficient, trainable version of this approach. Instead of building a one-hot vector and multiplying it with a weight matrix, the embedding layer directly looks up the corresponding row of the embedding matrix. Because the lookup is mathematically equivalent to that matrix multiplication, the embedding layer can be optimized with backpropagation like any other neural network layer.
In practice, this means the embedding layer stores a single weight matrix: each row corresponds to one token in the vocabulary, and the number of columns equals the embedding dimension. Looking up a token ID simply returns the matching row.
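A small NumPy sketch makes this equivalence explicit; the vocabulary size, embedding dimension, and seed below are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

vocab_size, embedding_dim = 6, 3
weight = rng.normal(size=(vocab_size, embedding_dim))  # the embedding matrix

token_id = 3

# One-hot approach: build a vocabulary-sized vector and multiply it with
# the weight matrix (wasteful: almost every multiplication is by zero).
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
via_one_hot = one_hot @ weight

# Embedding-layer approach: simply read row `token_id` of the matrix.
via_lookup = weight[token_id]

print(np.allclose(via_one_hot, via_lookup))  # True: both give the same vector
```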
Demo Example
To make the concepts concrete, let us examine the demonstration code. It defines a class named Embedding that implements the following:
deterministic random initialization using a seed,
creation of an embedding matrix with values drawn from a normal distribution,
lookup of embedding vectors for one token or several tokens,
safe handling of out-of-range token IDs,
rounding helpers for readable output.
Below is the complete demo code:
Listing 2.7
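What follows is a minimal NumPy sketch of how such a class could look; the method names (lookup, lookup_batch, rounded) and the toy dimensions are illustrative assumptions, not necessarily the exact contents of Listing 2.7.

```python
import numpy as np


class Embedding:
    """A minimal, from-scratch embedding lookup table (illustrative sketch)."""

    def __init__(self, vocab_size, embedding_dim, seed=123):
        # Deterministic random initialization: the same seed always
        # produces the same starting matrix.
        rng = np.random.default_rng(seed)
        # One row per token ID, one column per embedding dimension,
        # with values drawn from a normal distribution.
        self.weight = rng.normal(size=(vocab_size, embedding_dim))
        self.vocab_size = vocab_size

    def _check(self, token_id):
        # Safe handling of out-of-range token IDs.
        if not 0 <= token_id < self.vocab_size:
            raise ValueError(
                f"token ID {token_id} is outside the vocabulary "
                f"(valid range: 0..{self.vocab_size - 1})"
            )

    def lookup(self, token_id):
        """Return the embedding vector for a single token ID."""
        self._check(token_id)
        return self.weight[token_id]

    def lookup_batch(self, token_ids):
        """Return a matrix of embedding vectors for several token IDs."""
        for token_id in token_ids:
            self._check(token_id)
        return self.weight[list(token_ids)]

    @staticmethod
    def rounded(array, digits=3):
        """Rounding helper for readable printed output."""
        return np.round(array, digits)


if __name__ == "__main__":
    emb = Embedding(vocab_size=6, embedding_dim=3, seed=123)
    print(emb.rounded(emb.weight))                    # the full embedding matrix
    print(emb.rounded(emb.lookup(3)))                 # one token
    print(emb.rounded(emb.lookup_batch([2, 3, 5])))   # a batch of tokens
```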
Now, let's run this code and look at the result.
Understanding the Output
When you run this script, it prints:
The full embedding matrix (rounded values).
The embedding vector for a single token.
A batch of embedding vectors for several tokens.
Each embedding is simply a row of the initialized matrix. During training, these vectors will be updated many times: over thousands of gradient steps, they start capturing meaningful structure such as similarity, analogy relationships, and contextual patterns.
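As a toy illustration of such an update, the sketch below pulls a single embedding row toward an arbitrary target vector with plain gradient descent. The squared-error objective and the target are purely hypothetical; a real LLM updates the embeddings through a language-modeling loss backpropagated through the whole network.

```python
import numpy as np

rng = np.random.default_rng(123)
weight = rng.normal(size=(6, 3))   # freshly initialized embedding matrix

token_id = 2
target = np.ones(3)                # hypothetical training signal, illustration only
lr = 0.1

for step in range(3):
    vec = weight[token_id]                 # embedding lookup (row 2)
    grad = 2 * (vec - target)              # gradient of the squared error w.r.t. that row
    weight[token_id] = vec - lr * grad     # only this one row receives an update
    print(np.round(weight[token_id], 3))   # the row drifts toward the target
```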
Now that we've created embedding vectors from token IDs, in the next chapters we'll learn how to modify these vectors to encode information about each token's position in the text.