2.6 Data Sampling With a Sliding Window

When we prepare training data for a Large Language Model (LLM), we cannot simply give the model the whole text at once. Instead, we must create many small examples that the model can learn from. Each example contains two parts:

  1. Input – what the model receives

  2. Target – what the model must predict

LLMs learn by predicting the next token (a token can be a word or a sub-word) based on the previous ones. During training, the model repeatedly sees pairs of sequences:

  • The input sequence is a block of text.

  • The target sequence is the same block, but shifted one token to the right.

How the sliding window works

Imagine that our text is very long — thousands of tokens. We need to break it into many smaller pieces of length max_length, because models can only process limited context.

To do this, we use a sliding window:

  1. Take a block of tokens from position i to i + max_length.

  2. Then move the window forward by stride tokens.

  3. Take the next block.

  4. Repeat until we reach the end of the text.

This method produces many overlapping samples. It helps the model learn because each example contains slightly different contexts.

For every window:

  • The input contains tokens from positions [i ... i + max_length - 1]

  • The target contains tokens from positions [i+1 ... i + max_length]

So the model learns to predict each next token inside the window.
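To make the indexing concrete, here is a minimal sketch with made-up token IDs (the values below are purely illustrative and are not taken from the text used later in this walkthrough):

```python
# Purely illustrative token IDs; any list of integers behaves the same way.
token_ids = [100, 101, 102, 103, 104, 105, 106, 107, 108]

max_length = 4  # window size
i = 2           # window start position

input_ids  = token_ids[i : i + max_length]          # positions [i ... i + max_length - 1]
target_ids = token_ids[i + 1 : i + max_length + 1]  # positions [i + 1 ... i + max_length]

print(input_ids)   # [102, 103, 104, 105]
print(target_ids)  # [103, 104, 105, 106]
```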

In other words, the input to the LLM consists of blocks of text, and the LLM's task during training is to predict the word (token) that follows each input block.

Step-By-Step Walkthrough

Let’s walk through what this code is doing and why. We’ll build up the idea of a sliding-window dataset the same way you would inside a real GPT training loop.

1. First, we tokenize the entire text

Before we can create any training samples, we must turn the raw text into numbers. So we load the tokenizer and simply ask it to encode the complete book:
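A minimal sketch of this step, assuming the tiktoken GPT-2 tokenizer and a placeholder file name ("the-text.txt"); swap in whatever text you are actually working with:

```python
import tiktoken

# Placeholder file name; use the text you are training on.
with open("the-text.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair-encoding tokenizer
enc_text = tokenizer.encode(raw_text)

print(len(enc_text))  # total number of tokens (about 13k for the text used in this walkthrough)
```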

Now we know exactly how many tokens our dataset will contain, because the script prints the total token count.

So that’s our raw material: 13k tokens we will soon slice into training samples.

2. Let's take a quick look at how next-token prediction works

To get an intuition, we take a small piece of the token stream starting at position 50. From this chunk, we pick the first few tokens as input (x) and shift them by one position to form the target (y):
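A sketch of this step, reusing enc_text from the tokenization sketch above; context_size is simply a local name for how many tokens we look at here:

```python
# Skip the first 50 tokens and work with a small sample of the stream.
enc_sample = enc_text[50:]

context_size = 4

x = enc_sample[:context_size]       # input: the first four tokens of the sample
y = enc_sample[1:context_size + 1]  # target: the same tokens shifted one position to the right

print(f"x: {x}")
print(f"y:      {y}")
```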

Result: with the token IDs in this walkthrough, x is [550, 1839, 11, 15063] and y is [1839, 11, 15063, 351].

Here, every token in x is trained to predict the token that follows it in the text, which sits at the same position in y.

So the relationships are:

  • 550 → 1839

  • 1839 → 11

  • 11 → 15063

  • 15063 → 351

This is the basic training objective of GPT: for every input token, predict the next one.

3. Now let's gradually grow the context

Let’s take a bigger and bigger slice of the context and see what the next token should be each time.
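A sketch of the loop, continuing with enc_sample and context_size from the previous step:

```python
# Print progressively longer contexts together with the token each one should predict.
for i in range(1, context_size + 1):
    context = enc_sample[:i]   # the first i tokens
    desired = enc_sample[i]    # the token that should come next
    print(context, "---->", desired)
```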

Result:

  • [550] → 1839

  • [550, 1839] → 11

  • [550, 1839, 11] → 15063

  • [550, 1839, 11, 15063] → 351

Then we decode those tokens back into text so we can see what we’re predicting:
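The same loop again, but with the token IDs decoded back into text via tiktoken's decode (still a sketch under the assumptions above):

```python
# Decode the IDs so the prediction task reads as text rather than numbers.
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
```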

The output shows the same context-target pairs as before, but as readable text instead of token IDs.

Now the idea becomes concrete: we build up a context, and the model should learn what word naturally comes next.

4. Time to build a sliding-window dataset

Now that we’ve seen the concept, we turn the entire text into many such training samples.

To implement efficient data loaders, we collect the input data into a tensor x, where each row represents one input. The second tensor, y, contains the corresponding target values (the tokens that follow).

We do this by sliding a window across the token list:
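Here is a sketch of such a dataset as a PyTorch Dataset; the class name GPTDatasetV1 and the exact signature are assumptions for this walkthrough, not a fixed API:

```python
import torch
from torch.utils.data import Dataset


class GPTDatasetV1(Dataset):
    """Slices a token list into (input, target) windows for next-token prediction."""

    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Slide a window of max_length tokens over the sequence, advancing by stride.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]           # positions [i ... i + max_length - 1]
            target_chunk = token_ids[i + 1 : i + max_length + 1]  # positions [i + 1 ... i + max_length]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```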

Here’s what is happening:

  • We take max_length tokens as our input.

  • Then we take the same window but shifted one step to the right as our target.

  • After that, we slide forward by stride tokens and repeat.

If stride is smaller than max_length, the windows overlap. If stride equals max_length, the windows don't overlap; they “tile” the sequence.

This simple mechanism turns one giant token list into thousands of tiny training examples.

5. Let's test it with very small samples

We create a dataset with max_length = 1, meaning each training sample is just a single token predicting the next one.
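A sketch of this test, reusing the GPTDatasetV1 class from step 4 together with PyTorch's DataLoader; since the exact output depends on the text, the comments only show the pattern:

```python
from torch.utils.data import DataLoader

dataset = GPTDatasetV1(enc_text, max_length=1, stride=1)
dataloader = DataLoader(dataset, batch_size=1, shuffle=False)

data_iter = iter(dataloader)
print(next(data_iter))  # [tensor([[t0]]), tensor([[t1]])] -- the 1st token predicts the 2nd
print(next(data_iter))  # [tensor([[t1]]), tensor([[t2]])] -- the 2nd token predicts the 3rd
```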

The first batch pairs the very first token of the text with the second token as its target, and the next batch pairs the second token with the third.

This is a good sanity check: the dataset produces exactly what we expect, with each input token predicting the next token in the text.

When creating multiple batches from the input dataset, we move the window through the text. If the stride is 1, we shift the window one position to the right to create the next batch. By setting the stride equal to the window size, we can avoid overlaps between batches.

6. Now let’s create bigger windows: length 4, stride 4

Here we tell the dataset:

  • “Give me windows of size 4…”

  • “…and after each window, jump forward exactly 4 tokens.”

This means there is no overlap: each block is its own chunk of text, as the sketch below shows.
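A sketch of this configuration, again reusing the GPTDatasetV1 class and DataLoader; the batch size of 8 is an arbitrary choice for the demonstration:

```python
# Windows of four tokens, advancing four tokens at a time: no overlap between samples.
dataset = GPTDatasetV1(enc_text, max_length=4, stride=4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=False)

inputs, targets = next(iter(dataloader))
print("Inputs:\n", inputs)    # shape [8, 4]: one four-token window per row
print("Targets:\n", targets)  # the same windows, each shifted one token to the right
```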

The inputs come out as a tensor with one four-token window per row, and the targets contain the same windows shifted by one token. Every row of the inputs predicts the corresponding row of the targets, offset by one position.

This is precisely the structure GPT needs for autoregressive learning.

Below you may find the full code example:

Listing 2.6 Data sampling with a sliding window
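A self-contained sketch that puts the pieces from the walkthrough together into one runnable script; the helper name create_dataloader_v1, its default arguments, and the file name are assumptions:

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    """Sliding-window dataset: each sample pairs a token window with the same window shifted by one."""

    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Slide a window of max_length tokens over the sequence, advancing by stride.
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i : i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=8, max_length=4, stride=4, shuffle=False):
    """Tokenize the text and wrap the sliding-window dataset in a DataLoader."""
    tokenizer = tiktoken.get_encoding("gpt2")
    token_ids = tokenizer.encode(txt)
    dataset = GPTDatasetV1(token_ids, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


if __name__ == "__main__":
    # Placeholder file name; use the text you are actually working with.
    with open("the-text.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()

    dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)

    inputs, targets = next(iter(dataloader))
    print("Inputs:\n", inputs)
    print("Targets:\n", targets)
```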

Final View

So what have we built?

  • We took a long text.

  • We tokenized it.

  • We sampled small contexts to understand next-token prediction.

  • Then we generated a full training dataset by sliding a window across all tokens.

  • Each window became an input sequence, and the same window shifted by one token became its target.

  • Changing the stride changes how many samples we produce and how much they overlap.

This sliding-window mechanism is the backbone of almost all GPT dataset creation.
