2.7 Creating Token Embeddings
In the previous sections, we prepared text data for a large language model (LLM) by tokenizing the text and converting the text tokens into token IDs. Now we will look at the final step of preparing the input: transforming token IDs into vector representations called token embeddings. This step is essential because neural networks such as GPT cannot operate directly on discrete symbols; they need continuous numerical vectors that can be processed by the matrix multiplications inside deep network layers.

Why We Need Embeddings
Token embeddings convert each token ID into a vector of real numbers. These vectors capture semantic meaning and allow the model to learn relationships between words and subwords. At the beginning of training, however, the vectors do not contain any meaningful information; the model must learn suitable values during training using backpropagation.
Before training begins, we must initialize the embedding weights. Typically, the embedding matrix is filled with random numbers drawn from a normal distribution. This serves as a starting point for optimization: during training, the embedding vectors gradually adapt to represent the statistical structure of the language.
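The demo later in this section implements its own small Embedding class, but the same idea can be illustrated with PyTorch's built-in torch.nn.Embedding. The snippet below is only a reference sketch; using PyTorch and the toy sizes chosen here are assumptions for illustration, not requirements of the demo.

```python
import torch

torch.manual_seed(123)  # deterministic initialization for reproducibility

vocab_size = 6      # toy vocabulary with 6 token IDs (0..5)
embedding_dim = 3   # each token is mapped to a 3-dimensional vector

# The layer stores a (vocab_size x embedding_dim) weight matrix whose
# entries are drawn from a standard normal distribution at creation time.
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
print(embedding_layer.weight)

# Converting a token ID into its embedding vector is just a row lookup.
print(embedding_layer(torch.tensor([3])))
```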
It's important to remember:
“The last step in preparing input text for training an LLM is to convert token IDs into vector representations. Before this, we initialize the embedding weights with random values. This initialization serves as the starting point for the learning process.
The text preparation process includes tokenizing the text, converting the text tokens into token IDs, and converting these IDs into embedding vectors. Here we use the previously created token IDs to obtain the embedding vectors.
Continuous vector representation is required because LLMs such as GPT are deep neural networks trained using the backpropagation algorithm.”
Embedding Layer and One-Hot Encoding
If you are familiar with one-hot encoding, you know that each token can be represented by a vector the size of the vocabulary, with exactly one position equal to 1 and all others equal to 0. However, this representation is extremely inefficient: the vectors are very large and sparse, and they do not contain any learned relationships.
An embedding layer can be seen as a more efficient, trainable version of this approach. Instead of building a one-hot vector and multiplying it with a weight matrix, the embedding layer directly looks up the corresponding row of the embedding matrix. Because the lookup is mathematically equivalent to that matrix multiplication, the embedding layer can be optimized with backpropagation like any other neural network layer.
In practice, this means the embedding layer stores a single weight matrix: each row corresponds to one token in the vocabulary, and the number of columns equals the embedding dimension. Looking up a token ID simply returns the matching row.
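A small NumPy sketch makes this equivalence explicit; the vocabulary size, embedding dimension, and seed below are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

vocab_size, embedding_dim = 6, 3
weight = rng.normal(size=(vocab_size, embedding_dim))  # the embedding matrix

token_id = 3

# One-hot approach: build a vocabulary-sized vector and multiply it with
# the weight matrix (wasteful: almost every multiplication is by zero).
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
via_one_hot = one_hot @ weight

# Embedding-layer approach: simply read row `token_id` of the matrix.
via_lookup = weight[token_id]

print(np.allclose(via_one_hot, via_lookup))  # True: both give the same vector
```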
Demo Example
To make the concepts concrete, let us examine the demonstration code. It defines a class named Embedding that implements the following:
deterministic random initialization using a seed,
creation of an embedding matrix with values drawn from a normal distribution,
lookup of embedding vectors for one token or several tokens,
safe handling of out-of-range token IDs,
rounding helpers for readable output.
Below is the complete demo code:
Listing 2.7
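What follows is a minimal NumPy sketch of how such a class could look; the method names (lookup, lookup_batch, rounded) and the toy dimensions are illustrative assumptions, not necessarily the exact contents of Listing 2.7.

```python
import numpy as np


class Embedding:
    """A minimal, from-scratch embedding lookup table (illustrative sketch)."""

    def __init__(self, vocab_size, embedding_dim, seed=123):
        # Deterministic random initialization: the same seed always
        # produces the same starting matrix.
        rng = np.random.default_rng(seed)
        # One row per token ID, one column per embedding dimension,
        # with values drawn from a normal distribution.
        self.weight = rng.normal(size=(vocab_size, embedding_dim))
        self.vocab_size = vocab_size

    def _check(self, token_id):
        # Safe handling of out-of-range token IDs.
        if not 0 <= token_id < self.vocab_size:
            raise ValueError(
                f"token ID {token_id} is outside the vocabulary "
                f"(valid range: 0..{self.vocab_size - 1})"
            )

    def lookup(self, token_id):
        """Return the embedding vector for a single token ID."""
        self._check(token_id)
        return self.weight[token_id]

    def lookup_batch(self, token_ids):
        """Return a matrix of embedding vectors for several token IDs."""
        for token_id in token_ids:
            self._check(token_id)
        return self.weight[list(token_ids)]

    @staticmethod
    def rounded(array, digits=3):
        """Rounding helper for readable printed output."""
        return np.round(array, digits)


if __name__ == "__main__":
    emb = Embedding(vocab_size=6, embedding_dim=3, seed=123)
    print(emb.rounded(emb.weight))                    # the full embedding matrix
    print(emb.rounded(emb.lookup(3)))                 # one token
    print(emb.rounded(emb.lookup_batch([2, 3, 5])))   # a batch of tokens
```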
Now, let's run this code and look at the result.
Understanding the Output
When you run this script, it prints:
The full embedding matrix (rounded values).
The embedding vector for a single token.
A batch of embedding vectors for several tokens.
Each embedding is simply a row of the initialized matrix. During training, these vectors will be updated many times: over thousands of gradient steps, they start capturing meaningful structure such as similarity, analogy relationships, and contextual patterns.
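As a toy illustration of such an update, the sketch below pulls a single embedding row toward an arbitrary target vector with plain gradient descent. The squared-error objective and the target are purely hypothetical; a real LLM updates the embeddings through a language-modeling loss backpropagated through the whole network.

```python
import numpy as np

rng = np.random.default_rng(123)
weight = rng.normal(size=(6, 3))   # freshly initialized embedding matrix

token_id = 2
target = np.ones(3)                # hypothetical training signal, illustration only
lr = 0.1

for step in range(3):
    vec = weight[token_id]                 # embedding lookup (row 2)
    grad = 2 * (vec - target)              # gradient of the squared error w.r.t. that row
    weight[token_id] = vec - lr * grad     # only this one row receives an update
    print(np.round(weight[token_id], 3))   # the row drifts toward the target
```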
Now that we've created embedding vectors from token IDs, in the next chapters we'll learn how to modify these vectors to encode information about each token's position in the text.