2.1 Vector Representations of Words
When we read a sentence, we naturally understand how words relate to each other. We know that dog and cat are more similar than dog and car. But for a computer, words are just symbols – sequences of characters. To make language understandable for a machine, we need a mathematical way to represent meaning. That’s where vector representations of words come in.

In traditional programming, we might store words as strings. But strings have no mathematical structure – we cannot add them, measure the distance between them, or ask how similar they are. In machine learning, every input must be numerical, so we need to transform words into numbers in a meaningful way.
From One-Hot Encoding to Dense Vectors
The simplest idea is one-hot encoding. Imagine we have a vocabulary of 10,000 words. Each word becomes a vector of length 10,000, where all elements are zero except one – the position corresponding to that word. For example:
dog = [0, 0, 0, 1, 0, 0, ...]
cat = [0, 1, 0, 0, 0, 0, ...]

This method is easy to implement, but it has serious problems. Each vector is very long and sparse (mostly zeros), and all words are equally distant from each other. There’s no notion that dog and cat might be close in meaning.
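To make this concrete, here is a minimal PHP sketch of one-hot encoding; the five-word vocabulary is invented purely for illustration:

```php
<?php
// Hypothetical toy vocabulary; a real one would contain thousands of words.
$vocabulary = ['the', 'cat', 'sits', 'dog', 'mat'];

// Build a one-hot vector: all zeros except a single 1 at the word's index.
function oneHot(string $word, array $vocabulary): array
{
    $index = array_search($word, $vocabulary, true);
    if ($index === false) {
        throw new InvalidArgumentException("Unknown word: $word");
    }
    $vector = array_fill(0, count($vocabulary), 0);
    $vector[$index] = 1;
    return $vector;
}

print_r(oneHot('dog', $vocabulary)); // [0, 0, 0, 1, 0]
print_r(oneHot('cat', $vocabulary)); // [0, 1, 0, 0, 0]
```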
To capture relationships between words, we move to dense vector representations, or embeddings. Instead of assigning each word a unique index, we assign it a vector of real numbers – for example, a 100-dimensional vector like [0.12, -0.48, 0.09, …]. These vectors are trained so that words appearing in similar contexts have similar vectors.
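In PHP, such an embedding table can be sketched as an associative array mapping each word to an array of floats. The values below are invented placeholders, not trained vectors, and real embeddings typically have 50 to 300 dimensions:

```php
<?php
// Hypothetical 4-dimensional embeddings; the numbers are placeholders,
// not trained values. After training, similar words end up with similar vectors.
$embeddings = [
    'dog' => [0.12, -0.48, 0.09, 0.33],
    'cat' => [0.10, -0.51, 0.11, 0.30],
    'car' => [-0.62, 0.27, 0.55, -0.18],
];

// Looking up a word's vector is a single array access.
$dogVector = $embeddings['dog'];
```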
Meaning in the Geometry
One of the most beautiful discoveries in natural language processing is that these dense vectors form a geometric space where meaning becomes measurable. In this space, similar words are close to each other. For example:
king and queen are near each other.
The vector king – man + woman is close to queen.
This is not magic; it’s mathematics. When we train word embeddings on large amounts of text, the model learns statistical patterns of word usage. Words used in similar sentences tend to have similar meanings.
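The sketch below shows how this geometry can be probed in PHP: a cosine similarity function and the king – man + woman combination applied to tiny hand-made vectors. Real, trained embeddings have far more dimensions; the numbers here are chosen only so the analogy works out:

```php
<?php
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0; $normA = 0.0; $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Element-wise a - b + c, the "king - man + woman" combination.
function analogy(array $a, array $b, array $c): array
{
    $result = [];
    foreach ($a as $i => $value) {
        $result[$i] = $value - $b[$i] + $c[$i];
    }
    return $result;
}

// Hypothetical 3-dimensional vectors, invented purely for illustration.
$king  = [0.8, 0.6, 0.1];
$man   = [0.7, 0.1, 0.0];
$woman = [0.7, 0.1, 0.9];
$queen = [0.8, 0.6, 1.0];

$combined = analogy($king, $man, $woman);
echo cosineSimilarity($combined, $queen), PHP_EOL; // close to 1.0
```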
Word2Vec – Learning by Context
A popular algorithm to build such representations is Word2Vec. It was introduced by a team at Google in 2013. The idea is simple but powerful: learn word vectors by predicting surrounding words.
Word2Vec has two main versions:
CBOW (Continuous Bag of Words) – predicts the current word based on its context.
Skip-Gram – predicts the context based on the current word.
For example, in the sentence "The cat sits on the mat", the Skip-Gram model learns that the word cat often appears near sits and mat. During training, it adjusts vector values so that words appearing together become closer in vector space.
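Here is a minimal sketch of how such (center, context) training pairs could be generated in PHP; the window size of 2 and the whitespace tokenization are simplifying assumptions:

```php
<?php
// Generate (center, context) pairs for Skip-Gram with a fixed window size.
function skipGramPairs(array $tokens, int $window = 2): array
{
    $pairs = [];
    foreach ($tokens as $i => $center) {
        $start = max(0, $i - $window);
        $end   = min(count($tokens) - 1, $i + $window);
        for ($j = $start; $j <= $end; $j++) {
            if ($j !== $i) {
                $pairs[] = [$center, $tokens[$j]];
            }
        }
    }
    return $pairs;
}

$tokens = explode(' ', strtolower('The cat sits on the mat'));
foreach (skipGramPairs($tokens) as [$center, $context]) {
    echo "$center -> $context", PHP_EOL; // e.g. "cat -> sits", "cat -> on"
}
```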
Why It Works
The secret of Word2Vec lies in its objective function. It doesn't try to assign meanings directly – it simply optimizes predictions for surrounding words. But through millions of such predictions, it learns to encode syntactic and semantic information. Words sharing similar patterns of context naturally align close together.
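As a heavily simplified illustration, the sketch below performs only the positive half of a Skip-Gram-style update for one (center, context) pair: it scores the pair with a dot product, squashes the score with a sigmoid, and nudges both vectors so the score increases. Real Word2Vec training also uses negative samples (or a softmax over the vocabulary), which this sketch omits:

```php
<?php
// Sigmoid squashes a raw score into the range (0, 1).
function sigmoid(float $x): float
{
    return 1.0 / (1.0 + exp(-$x));
}

// One simplified gradient step for a single (center, context) pair.
// The dot product of the two vectors is the model's "score" for the pair;
// training pushes the score higher for words that actually co-occur.
function positivePairUpdate(array &$center, array &$context, float $learningRate = 0.05): void
{
    $dot = 0.0;
    foreach ($center as $i => $value) {
        $dot += $value * $context[$i];
    }
    $error = sigmoid($dot) - 1.0; // gradient of the loss for a true pair (label = 1)
    foreach ($center as $i => $value) {
        $gradCenter  = $error * $context[$i];
        $gradContext = $error * $value;
        $center[$i]  -= $learningRate * $gradCenter;
        $context[$i] -= $learningRate * $gradContext;
    }
}
```

Repeating such updates over millions of pairs is what gradually pulls co-occurring words closer together in the vector space.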
Using Word Vectors in PHP
In this book, we will learn how to implement such mechanisms with PHP alone. You'll see how to create simple vector operations, build small datasets, and train models that produce embeddings. Even though PHP is not a typical machine learning language, its mathematical operations and array handling make it fully capable of demonstrating these concepts.
We'll begin by building our own lightweight structures for vectors and matrices. Then we'll implement functions for normalization, similarity, and gradient updates – the same principles used in modern neural architectures.
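As a small preview of that toolbox, here is a sketch of one such helper, an L2 normalization function that scales a vector to unit length (the exact structures we build later may differ):

```php
<?php
// L2 normalization: scale a vector so its length becomes 1.
function normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn (float $v): float => $v * $v, $vector)));
    if ($norm === 0.0) {
        return $vector; // avoid division by zero for the zero vector
    }
    return array_map(fn (float $v): float => $v / $norm, $vector);
}

print_r(normalize([3.0, 4.0])); // [0.6, 0.8]
```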
By the end of this chapter, you should understand:
Why we need numeric representations for words.
How one-hot encoding works and why it's limited.
What dense embeddings are and why they're powerful.
How Word2Vec learns embeddings by predicting surrounding words.
In the next chapters, we'll start coding – we'll build our first PHP code to represent vectors and measure similarity between words. This will be the foundation of your own Large Language Model built entirely in PHP.