GPT-2 implementation
Notes from Andrej Karpathy's Video on Training a GPT from Scratch
These notes summarize key insights from Andrej Karpathy's video on training a GPT model from scratch. They were written after we had already implemented transformers, so many comparisons are made between the approaches in the video and our previous implementation.
For reference, you can find my blog post on implementing transformers here:
🔗 Training a Transformer
Key Takeaways:
- GPT-2 is a decoder-only architecture, so we remove the encoder and the cross-attention components.
- In the Attention Is All You Need paper, positional encoding was fixed. However, GPT-2 trains the embeddings for positional encoding.
- Andrej Karpathy believes that the trained embeddings in GPT-2 exhibit some sinusoidal properties, similar to those in the Transformer paper.
- The position of LayerNorm was moved from the end of the attention block to the beginning (pre-norm). The reason for this change isn't spelled out in the video; the commonly cited motivation is that pre-norm keeps a clean residual path and stabilizes training of deep stacks. Will update when I know more.
- A final LayerNorm was added after all attention blocks.
- The sequence length in GPT-2 is fixed at 1024, which is also the size the positional embeddings are trained for. In our Transformer implementation, we set the sequence length dynamically. (A short embedding sketch follows the mask snippet below.)
- In the feedforward layer, Andrej uses GELU instead of ReLU, the tanh approximation of GELU to be precise. Since GELU doesn't have a hard threshold at zero the way ReLU does, small negative inputs still pass a little signal and gradient through instead of being cut off entirely.
- We keep d_ff = 4 * d_model in this implementation too.
- In Transformers, the attention mask typically depends on the input sequence length: padding tokens (used to pad sequences to a fixed length) are masked so the model can't attend to them. In GPT-2, however, the mask is a fixed causal (lower-triangular) matrix whose size is determined by the context length of 1024:
# causal mask: a (1, 1, context_len, context_len) lower-triangular matrix of ones
torch.tril(torch.ones(self.context_len, self.context_len)).view(
    1, 1, self.context_len, self.context_len)
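To make the learned-positional-embedding point above concrete, here is a minimal sketch of the GPT-2-style embedding setup (the attribute names and the hidden size 768 are illustrative, not taken from our code):

import torch
import torch.nn as nn

d_model, vocab_size, context_len = 768, 50257, 1024   # GPT-2 small-style sizes, assumed here

tok_emb = nn.Embedding(vocab_size, d_model)    # token embeddings, learned
pos_emb = nn.Embedding(context_len, d_model)   # positional embeddings are learned too,
                                               # unlike the fixed sinusoids in "Attention Is All You Need"

idx = torch.randint(0, vocab_size, (2, 128))   # (batch, seq_len) of token ids
pos = torch.arange(idx.size(1))                # positions 0 .. seq_len-1
x = tok_emb(idx) + pos_emb(pos)                # (2, 128, 768), the input to the attention blocks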
After building the architecture, I noticed that it had about 160M parameters, significantly more than the expected ~124M. The extra parameters were there because I hadn't yet applied a clever optimization that OpenAI uses.
In a transformer model, the embedding layer maps token IDs to vectors, while the projection layer (the language-model head used for output predictions) maps hidden vectors back to logits over the vocabulary. OpenAI ties the weights of these two layers, meaning they share the same tensor in memory.
By doing this, we eliminate one of the two large vocab_size x d_model matrices, cutting the number of trainable parameters by roughly 40M (50257 x 768 ≈ 38.6M). This is also why we set bias=False in the nn.Linear of the projection layer.
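A minimal sketch of what weight tying looks like in PyTorch (variable names are illustrative):

import torch.nn as nn

vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)            # token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # vector -> vocabulary logits
lm_head.weight = tok_emb.weight                        # both layers now share one (vocab_size, d_model) tensor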
Andrej explained how greedy decoding always selects the highest-probability token, while top-k sampling introduces randomness by selecting from the top 50 tokens using torch.multinomial, leading to more diverse outputs.
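A hedged sketch of top-k sampling with k = 50, assuming logits of shape (batch, vocab_size):

import torch
import torch.nn.functional as F

def sample_top_k(logits, k=50):
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)  # keep only the k highest logits
    probs = F.softmax(topk_vals, dim=-1)                  # renormalize over those k
    choice = torch.multinomial(probs, num_samples=1)      # sample an index within the top k
    return torch.gather(topk_idx, -1, choice)             # map back to vocabulary ids

logits = torch.randn(2, 50257)       # stand-in for final-position logits of a batch of 2
next_token = sample_top_k(logits)    # shape (2, 1)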
When weights are tied but the model is untrained, it predictably repeats the same token, whereas without weight tying, the outputs appear as meaningless gibberish due to random initialization.
Karpathy highlighted that an untrained model's initial loss should approximate -ln(1/vocab_size), since each token in the vocabulary has an equal chance of being selected before training begins. This serves as a useful validation of correct initialization.
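A quick sanity check with GPT-2's vocabulary size of 50,257 tokens (that number isn't stated above, but it is the standard GPT-2 BPE vocab):

import math

vocab_size = 50257
print(-math.log(1 / vocab_size))   # ≈ 10.82; the FP32 log below indeed starts around 11.03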
A key error in our training code was that we were building samples from overlapping token windows, which turned out to be a big mistake. We caught it by training on a few mini-batches: the loss quickly dropped to 0.2 without even finishing an epoch, which is far too good to be true. Funny in hindsight, but a real lapse in judgement.
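For intuition, the difference comes down to the stride used when slicing the token stream into samples (a toy sketch, not the original buggy code):

tokens = list(range(1000))   # stand-in for a tokenized corpus
seq_len = 8

# Overlapping windows (stride 1): the same tokens are reused in many samples,
# so the model effectively sees the data many times within a single "epoch".
overlapping = [tokens[i : i + seq_len + 1] for i in range(len(tokens) - seq_len)]

# Non-overlapping windows (stride seq_len + 1): each token appears in exactly one sample.
non_overlapping = [tokens[i : i + seq_len + 1]
                   for i in range(0, len(tokens) - seq_len, seq_len + 1)]

print(len(overlapping), len(non_overlapping))   # 992 vs 111 samples from the same data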
Once we were ready with a basic training loop, our next step was to include mixed precision training. Karpathy sensei builds this up step by step: first he shows the baseline where everything runs in FP32.
The following were the details of training the model on FP32 precision.
Epoch 1/10 [Train]: 0%| 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025991439819336, dt: 1174.2680072784424, tokens/s : 1744.0652281301386
Epoch 1/10 [Train]: 0%| | 100/94242 [00:40<10:45:38, 2.43it/s]step: 100, loss: 7.5851287841796875, dt: 395.0457572937012, tokens/s : 5184.209581264764
Epoch 1/10 [Train]: 0%|▎ | 200/94242 [01:20<10:31:35, 2.48it/s]step: 200, loss: 6.906430244445801, dt: 418.8878536224365, tokens/s : 4889.136751732982
Epoch 1/10 [Train]: 0%|▍ | 300/94242 [02:01<10:41:26, 2.44it/s]step: 300, loss: 6.423778533935547, dt: 407.1216583251953, tokens/s : 5030.4373597440135
Epoch 1/10 [Train]: 0%|▋ | 400/94242 [02:42<10:37:16, 2.45it/s]step: 400, loss: 6.498997211456299, dt: 412.72664070129395, tokens/s : 4962.122136143414
The following were the details of training the model on TF32 precision.
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.02596378326416, dt: 1072.5276470184326, tokens/s : 1909.5078860608642
Epoch 1/10 [Train]: 0%|▏ | 100/94242 [00:31<8:07:40, 3.22it/s]step: 100, loss: 7.585122585296631, dt: 310.6203079223633, tokens/s : 6593.258546739574
Epoch 1/10 [Train]: 0%|▎ | 200/94242 [01:03<8:12:15, 3.18it/s]step: 200, loss: 6.910003662109375, dt: 302.4632930755615, tokens/s : 6771.0695707077675
Epoch 1/10 [Train]: 0%|▍ | 300/94242 [01:34<8:07:02, 3.21it/s]step: 300, loss: 6.423272132873535, dt: 303.3599853515625, tokens/s : 6751.05517831095
Although Karpathy sees roughly a 3x increase in tokens processed per second in the video, we didn't observe a similar jump. The most likely reason is that our GPU doesn't support TF32 (it requires Ampere-class tensor cores or newer). But then why do we still see an increase in tokens/sec at all?
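For reference, TF32 is typically switched on globally like this (a sketch of the standard PyTorch knobs, not necessarily the exact lines in our run):

import torch

torch.set_float32_matmul_precision("high")     # allow TF32 tensor cores for float32 matmuls
torch.backends.cuda.matmul.allow_tf32 = True   # equivalent lower-level switch for matmuls
torch.backends.cudnn.allow_tf32 = True         # let cuDNN convolutions use TF32 as well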
The following were the details of training the model on mixed precision with BFloat16.
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025543212890625, dt: 1149.9764919281006, tokens/s : 1780.9059701439933
Epoch 1/10 [Train]: 0%|▏ | 100/94242 [00:25<6:30:14, 4.02it/s]step: 100, loss: 7.585273742675781, dt: 249.59468841552734, tokens/s : 8205.302817143578
Epoch 1/10 [Train]: 0%|▎ | 200/94242 [00:50<6:33:51, 3.98it/s]step: 200, loss: 6.9183244705200195, dt: 249.06373023986816, tokens/s : 8222.795017273744
Epoch 1/10 [Train]: 0%|▍ | 300/94242 [01:16<6:32:43, 3.99it/s]step: 300, loss: 6.416199684143066, dt: 249.7537136077881, tokens/s : 8200.078270772656
Epoch 1/10 [Train]: 0%|▋ | 400/94242 [01:41<6:32:50, 3.98it/s]step: 400, loss: 6.4893035888671875, dt: 253.40914726257324, tokens/s : 8081.791924732447
With BF16 mixed precision, GPU memory usage also dropped by about 2 GB.
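A hedged sketch of the BF16 training step with torch.autocast, assuming model, optimizer, and train_loader already exist and that the forward pass returns (logits, loss):

import torch

for x, y in train_loader:
    x, y = x.to("cuda"), y.to("cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)   # forward pass runs eligible ops in BF16
    loss.backward()                  # BF16 keeps FP32's exponent range, so no GradScaler is needed (unlike FP16)
    optimizer.step()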
Next came the Flash Attention implementation and, counter-intuitively, increasing the vocab size (Karpathy pads 50257 up to the "nicer" number 50304, which divides cleanly by powers of two, so the CUDA kernels run faster). Mind = blown. A sketch of the attention swap follows the logs below.
Epoch 1/10 [Train]: 0%| | 0/94242 [00:00<?, ?it/s]step: 0, loss: 10.933685302734375, dt: 891.303300857544, tokens/s : 2297.7587966179085
Epoch 1/10 [Train]: 0%|▏ | 100/94242 [00:14<3:35:17, 7.29it/s]step: 100, loss: 7.229774475097656, dt: 134.56296920776367, tokens/s : 15219.64038143296
Epoch 1/10 [Train]: 0%|▎ | 200/94242 [00:28<3:36:35, 7.24it/s]step: 200, loss: 6.723935604095459, dt: 127.14266777038574, tokens/s : 16107.889160376917
Epoch 1/10 [Train]: 0%|▍ | 300/94242 [00:41<3:36:07, 7.24it/s]step: 300, loss: 6.507452964782715, dt: 135.31255722045898, tokens/s : 15135.328472606432
Epoch 1/10 [Train]: 0%|▋ | 400/94242 [00:55<3:33:34, 7.32it/s]step: 400, loss: 6.184662818908691, dt: 144.84047889709473, tokens/s : 14139.6936519041
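A minimal sketch of the Flash Attention swap, using PyTorch's fused scaled_dot_product_attention in place of the manual mask + softmax math (tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, n_heads, seq_len, head_dim); is_causal replaces the explicit tril mask
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = k = v = torch.randn(2, 12, 1024, 64)   # on a supported GPU/dtype this dispatches to the Flash kernel
out = causal_attention(q, k, v)            # (2, 12, 1024, 64)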
One of the first major challenges I encountered while training GPT-2 was when the dataset size grew significantly. Initially, I was working with a smaller dataset — about 900 million tokens — and everything worked smoothly. But as I scaled to larger models and increased the dataset size to 40 GB, my original data loading approach simply didn’t hold up. Let me explain this with a piece of code I originally used.
import logging

import torch
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)


class GPT2Dataset(Dataset):
    def __init__(self, data, seq_len, tokenizer):
        super().__init__()
        if not data:
            raise ValueError("Input data cannot be empty")
        if seq_len <= 0:
            raise ValueError(f"Sequence length must be positive, got {seq_len}")
        if not hasattr(tokenizer, 'encode'):
            raise ValueError("Tokenizer must have an 'encode' method")

        self.seq_len = seq_len
        self.data = data
        self.tokenizer = tokenizer

        logger.info(f"Tokenizing dataset with sequence length {seq_len}")
        self.tokens = self.tokenizer.encode(self.data, allowed_special={'<|endoftext|>'})
        logger.info(f"Total tokens: {len(self.tokens)}")

        num_samples = len(self.tokens) // (self.seq_len + 1)
        self.tokens = self.tokens[: num_samples * (self.seq_len + 1)]
        self.tokens = torch.tensor(self.tokens, dtype=torch.long).reshape(num_samples, self.seq_len + 1)
        logger.info(f"Created {num_samples} training samples")

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        x = self.tokens[idx, :-1]  # Input: all but last token
        y = self.tokens[idx, 1:]   # Target: all but first token
        return x, y
This approach worked fine for smaller datasets: I could tokenize the entire text corpus in memory up front and slice training samples straight from the resulting tensor. However, as I scaled up, several issues emerged.