2.6 Data Sampling With a Sliding Window
When we prepare training data for a Large Language Model (LLM), we cannot simply give the model the whole text at once. Instead, we must create many small examples that the model can learn from. Each example contains two parts:
Input – what the model receives
Target – what the model must predict
LLMs learn by predicting the next token (a token can be a word or a sub-word) based on the previous ones. During training, the model repeatedly sees pairs of sequences:
The input sequence is a block of text.
The target sequence is the same block, but shifted one token to the right.
How the sliding window works
Imagine that our text is very long — thousands of tokens. We need to break it into many smaller pieces of length max_length, because models can only process limited context.
To do this, we use a sliding window:
Take a block of max_length tokens starting at position i.
Then move the window forward by stride tokens.
Take the next block.
Repeat until we reach the end of the text.
When the stride is smaller than the window, this method produces many overlapping samples, which helps the model learn because each example contains a slightly different context.
For every window:
The input contains tokens from positions [i ... i + max_length - 1].
The target contains tokens from positions [i + 1 ... i + max_length].
So the model learns to predict each next token inside the window.
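As a tiny illustration of these index ranges, here is a minimal sketch with made-up token IDs (the values, window size, and starting position are purely illustrative, not from the original code):

```python
# Toy token IDs, purely for illustration
tokens = [10, 20, 30, 40, 50, 60, 70, 80]
max_length = 4
i = 2  # current window start

input_chunk = tokens[i : i + max_length]           # positions i ... i + max_length - 1
target_chunk = tokens[i + 1 : i + max_length + 1]  # positions i + 1 ... i + max_length

print(input_chunk)   # [30, 40, 50, 60]
print(target_chunk)  # [40, 50, 60, 70]
```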

Step-By-Step Walkthrough
Let’s walk through what this code is doing and why. We’ll build up the idea of a sliding-window dataset the same way you would inside a real GPT training loop.
1. First, we tokenize the entire text
Before we can create any training samples, we must turn the raw text into numbers. So we load the tokenizer and simply ask it to encode the complete book:
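The original code isn't reproduced here, but a minimal sketch of this step might look like the following, assuming the GPT-2 tokenizer from tiktoken (the file name is a placeholder):

```python
import tiktoken

# Read the raw text; the file name below is a placeholder, not the original one
with open("book.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Encode the complete text with the GPT-2 BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode(raw_text)

print("Total tokens:", len(token_ids))
```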
The script prints the total token count, so we know exactly how many tokens our dataset will contain. That's our raw material: roughly 13k tokens that we will soon slice into training samples.
2. Let's take a small look at how next-token prediction works
To get an intuition, we take a small piece of the token stream starting at position 50.
From this chunk, we pick the first few tokens as input (x) and shift them by one position to form the target (y):
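A sketch of this step, continuing from the token_ids produced above (the exact context size is an assumption; the chunk position follows the text):

```python
# Continue from token_ids produced in the tokenization sketch above
enc_sample = token_ids[50:]   # small chunk starting at position 50
context_size = 4              # how many tokens we look at (an assumption)

x = enc_sample[:context_size]        # input: the first few tokens
y = enc_sample[1:context_size + 1]   # target: the same tokens shifted by one

print(f"x: {x}")
print(f"y:      {y}")
# With the walkthrough's token stream this prints something like:
# x: [550, 1839, 11, 15063]
# y:      [1839, 11, 15063, 351]
```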
Here, every token in x wants to predict the token that follows it in y.
So the relationships are:
550 → 1839
1839 → 11
11 → 15063
15063 → 351
This is the basic training objective of GPT: for every input token, predict the next one.
3. Now let's gradually grow the context
Let’s take a bigger and bigger slice of the context and see what the next token should be each time.
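Continuing the sketch from the previous step, this loop could look roughly like the following:

```python
# Grow the context one token at a time and show which token should come next
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)
```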
Then we decode those tokens back into text so we can see what we’re predicting:
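The decoded version is the same loop, with both the context and the desired token passed through the tokenizer's decode method (again continuing the sketch above):

```python
# Same loop, but decoded back into text so the pairs are human-readable
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
```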
Now the idea becomes concrete: we build up a context, and the model should learn what word naturally comes next.
4. Time to build a sliding-window dataset
Now that we’ve seen the concept, we turn the entire text into many such training samples.

We do this by sliding a window across the token list:
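One possible sketch of such a dataset, assuming PyTorch; the class name and structure here are assumptions, not necessarily what the original listing uses:

```python
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Slices one long token list into (input, target) training windows."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs = []
        self.targets = []

        # Slide a window of size max_length across the token list,
        # moving forward by `stride` tokens each time
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.inputs.append(torch.tensor(input_chunk))
            self.targets.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```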
Here’s what is happening:
We take max_length tokens as our input.
Then we take the same window, shifted one step to the right, as our target.
After that, we slide forward by stride tokens and repeat.
If stride is smaller than max_length, the windows overlap.
If stride equals max_length, the windows don't overlap; they “tile” the sequence.
This simple mechanism turns one giant token list into thousands of tiny training examples.
5. Let's test it with very small samples
We create a dataset with max_length = 1, meaning each training sample is just a single token predicting the next one.
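Using the dataset class sketched above (again, the exact API is an assumption), this sanity check might look like:

```python
from torch.utils.data import DataLoader

# Sanity check: windows of a single token, moving one token at a time
# (SlidingWindowDataset and token_ids come from the sketches above)
dataset = SlidingWindowDataset(token_ids, max_length=1, stride=1)
dataloader = DataLoader(dataset, batch_size=1, shuffle=False)

data_iter = iter(dataloader)
print(next(data_iter))  # first (input, target) pair
print(next(data_iter))  # second (input, target) pair
```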
Printing the first two batches shows the first token of the text paired with the second, and then the second token paired with the third.
This is a good sanity check: the dataset produces exactly what we expect — each input token predicts the next token in the text.

6. Now let’s create bigger windows: length 4, stride 4
Here we tell the dataset:
“Give me windows of size 4…”
“…and after each window, jump forward exactly 4 tokens.”
This means no overlap — each block is its own chunk of text.
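Continuing with the sketched dataset class, this configuration could be created as follows (the batch size here is an assumption):

```python
# Windows of length 4 with stride 4: consecutive windows do not overlap
# (SlidingWindowDataset, DataLoader, and token_ids come from the sketches above)
dataset = SlidingWindowDataset(token_ids, max_length=4, stride=4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=False)

inputs, targets = next(iter(dataloader))
print("Inputs:\n", inputs)    # each row: 4 consecutive token IDs
print("Targets:\n", targets)  # each row: the same IDs shifted one step right
```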
Each row of the inputs tensor holds four consecutive token IDs, and the matching row of the targets tensor holds the same IDs shifted one token to the right. So every input token predicts the token that follows it.
This is precisely the structure GPT needs for autoregressive learning.
Below you may find the full code example:
Listing 2.6
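The original listing isn't reproduced here; the following is one possible end-to-end version of the steps above, assuming PyTorch and tiktoken (the file name, class name, and default parameters are placeholders):

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader


class SlidingWindowDataset(Dataset):
    """Turns one long token list into (input, target) training windows."""

    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs = []
        self.targets = []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


def create_dataloader(text, batch_size=8, max_length=4, stride=4, shuffle=False):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = SlidingWindowDataset(text, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


if __name__ == "__main__":
    with open("book.txt", "r", encoding="utf-8") as f:  # placeholder file name
        raw_text = f.read()

    dataloader = create_dataloader(raw_text, batch_size=8, max_length=4, stride=4)
    inputs, targets = next(iter(dataloader))
    print("Inputs:\n", inputs)
    print("Targets:\n", targets)
```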
Final View
So what have we built?
We took a long text.
We tokenized it.
We sampled small contexts to understand next-token prediction.
Then we generated a full training dataset by sliding a window across all tokens.
Each window became an input sequence, and the next-shifted window became its target.
Changing stride changes how many overlapping samples we produce.
This sliding-window mechanism is the backbone of almost all GPT dataset creation.