Sai's AI Lab

GPT-2 implementation

Notes from Andrej Karpathy's Video on Training a GPT from Scratch

These notes summarize key insights from Andrej Karpathy's video on training a GPT model from scratch. They were written after we had already implemented transformers, so many comparisons are made between the approaches in the video and our previous implementation.

For reference, you can find my blog post on implementing transformers here:
🔗 Training a Transformer

Key Takeaways:

# Lower-triangular causal mask, reshaped so it broadcasts over the
# (batch, head) dimensions of the attention scores.
torch.tril(torch.ones(self.context_len, self.context_len)).view(
            1, 1, self.context_len, self.context_len)
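
For context, here is a minimal, self-contained sketch of how a mask built this way is typically applied inside causal self-attention. The shapes and variable names (`mask`, `att`, `context_len`) are illustrative assumptions, not our actual model code:

import torch
import torch.nn.functional as F

# Toy setup: 1 batch, 1 head, a context of 4 tokens.
context_len = 4
mask = torch.tril(torch.ones(context_len, context_len)).view(1, 1, context_len, context_len)

# Fake raw attention scores of shape (batch, heads, T, T).
att = torch.randn(1, 1, context_len, context_len)

# Positions where the mask is 0 (future tokens) are pushed to -inf before the
# softmax, so each token can only attend to itself and earlier positions.
att = att.masked_fill(mask[:, :, :context_len, :context_len] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
print(att[0, 0])  # each row is a valid distribution over current and past tokens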

A key error in our training code was that we were creating overlapping token sequences, which turned out to be a big mistake. We caught it after training the model for a few mini-batches: the loss dropped to 0.2 before we had even finished an epoch, a telltale sign that the model kept seeing tokens it had already been trained on. Funny in hindsight, but also a major lapse in judgement.
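
A minimal sketch of the difference, assuming a flat token list `tokens` and a sequence length `seq_len` (both names, and the exact stride that caused our bug, are illustrative):

# Stand-in for a tokenized corpus.
tokens = list(range(100))
seq_len = 8

# Overlapping version: a stride of 1 means the same target tokens show up in
# many training samples, so the loss collapses misleadingly fast.
overlapping = [tokens[i : i + seq_len + 1] for i in range(len(tokens) - seq_len)]

# Non-overlapping version: step by seq_len + 1 so every token belongs to
# exactly one (input, target) pair.
chunked = [
    tokens[i : i + seq_len + 1]
    for i in range(0, len(tokens) - seq_len, seq_len + 1)
]

print(len(overlapping), len(chunked))  # 92 vs 11 samples for this toy corpus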

Once we were ready with a basic training loop, our next step was to include mixed precision training. Karpathy sensei approached this in stages: first he explained how, by default, everything runs in FP32.

The following are the training logs for the model in full FP32 precision.

Epoch 1/10 [Train]:   0%|   0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025991439819336, dt: 1174.2680072784424, tokens/s : 1744.0652281301386
Epoch 1/10 [Train]:   0%|   | 100/94242 [00:40<10:45:38,  2.43it/s]step: 100, loss: 7.5851287841796875, dt: 395.0457572937012, tokens/s : 5184.209581264764
Epoch 1/10 [Train]:   0%| | 200/94242 [01:20<10:31:35,  2.48it/s]step: 200, loss: 6.906430244445801, dt: 418.8878536224365, tokens/s : 4889.136751732982
Epoch 1/10 [Train]:   0%| | 300/94242 [02:01<10:41:26,  2.44it/s]step: 300, loss: 6.423778533935547, dt: 407.1216583251953, tokens/s : 5030.4373597440135
Epoch 1/10 [Train]:   0%| | 400/94242 [02:42<10:37:16,  2.45it/s]step: 400, loss: 6.498997211456299, dt: 412.72664070129395, tokens/s : 4962.122136143414

The following are the training logs with TF32 enabled.

Epoch 1/10 [Train]:   0%|   | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.02596378326416, dt: 1072.5276470184326, tokens/s : 1909.5078860608642
Epoch 1/10 [Train]:   0%| | 100/94242 [00:31<8:07:40,  3.22it/s]step: 100, loss: 7.585122585296631, dt: 310.6203079223633, tokens/s : 6593.258546739574
Epoch 1/10 [Train]:   0%| | 200/94242 [01:03<8:12:15,  3.18it/s]step: 200, loss: 6.910003662109375, dt: 302.4632930755615, tokens/s : 6771.0695707077675
Epoch 1/10 [Train]:   0%| | 300/94242 [01:34<8:07:02,  3.21it/s]step: 300, loss: 6.423272132873535, dt: 303.3599853515625, tokens/s : 6751.05517831095

Although in the video Karpathy sees roughly a 3x increase in the number of tokens processed per second, we didn't notice a similar jump. The most likely reason is that our GPU does not support TF32. But then why do we still see an increase in tokens/sec?
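
For reference, TF32 matmuls are switched on in PyTorch with a single global setting; this is the standard API call shown in isolation, not our exact training script:

import torch

# Allow PyTorch to run float32 matmuls on TF32 tensor cores where the hardware
# supports them (Ampere and newer). On older GPUs this is effectively a no-op,
# which would explain seeing only a modest speedup from this flag alone.
torch.set_float32_matmul_precision("high")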

The following are the training logs with BFloat16 mixed precision.

Epoch 1/10 [Train]:   0%|    | 0/94242 [00:00<?, ?it/s]step: 0, loss: 11.025543212890625, dt: 1149.9764919281006, tokens/s : 1780.9059701439933
Epoch 1/10 [Train]:   0%|  | 100/94242 [00:25<6:30:14,  4.02it/s]step: 100, loss: 7.585273742675781, dt: 249.59468841552734, tokens/s : 8205.302817143578
Epoch 1/10 [Train]:   0%|  | 200/94242 [00:50<6:33:51,  3.98it/s]step: 200, loss: 6.9183244705200195, dt: 249.06373023986816, tokens/s : 8222.795017273744
Epoch 1/10 [Train]:   0%|  | 300/94242 [01:16<6:32:43,  3.99it/s]step: 300, loss: 6.416199684143066, dt: 249.7537136077881, tokens/s : 8200.078270772656
Epoch 1/10 [Train]:   0%|  | 400/94242 [01:41<6:32:50,  3.98it/s]step: 400, loss: 6.4893035888671875, dt: 253.40914726257324, tokens/s : 8081.791924732447

One more thing worth noting: with BFloat16 mixed precision our GPU memory usage also dropped by about 2 GB, since most activations are now held in 16 bits instead of 32.
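
In PyTorch, BFloat16 mixed precision is usually enabled by wrapping the forward pass and loss in an autocast context. The snippet below is a generic sketch with placeholder names (`model`, `x`, `y`, `optimizer`), not our exact loop:

import torch
import torch.nn.functional as F

def training_step(model, x, y, optimizer):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in BF16 where it is safe to do so; the
    # parameters and optimizer state stay in FP32. BF16 keeps FP32's exponent
    # range, so no gradient scaler is needed (unlike FP16).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                                   # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()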

Next came the Flash Attention implementation and changing the vocab size: increasing it, of all things, made training faster. Mind = blown.

Epoch 1/10 [Train]:   0%|   | 0/94242 [00:00<?, ?it/s]step: 0, loss: 10.933685302734375, dt: 891.303300857544, tokens/s : 2297.7587966179085
Epoch 1/10 [Train]:   0%|  | 100/94242 [00:14<3:35:17,  7.29it/s]step: 100, loss: 7.229774475097656, dt: 134.56296920776367, tokens/s : 15219.64038143296
Epoch 1/10 [Train]:   0%|  | 200/94242 [00:28<3:36:35,  7.24it/s]step: 200, loss: 6.723935604095459, dt: 127.14266777038574, tokens/s : 16107.889160376917
Epoch 1/10 [Train]:   0%|  | 300/94242 [00:41<3:36:07,  7.24it/s]step: 300, loss: 6.507452964782715, dt: 135.31255722045898, tokens/s : 15135.328472606432
Epoch 1/10 [Train]:   0%|  | 400/94242 [00:55<3:33:34,  7.32it/s]step: 400, loss: 6.184662818908691, dt: 144.84047889709473, tokens/s : 14139.6936519041
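
A hedged sketch of what the Flash Attention change typically looks like in PyTorch: the manual scores, mask, and softmax sequence is replaced by a single call to F.scaled_dot_product_attention with is_causal=True, which dispatches to a fused (flash) kernel when one is available. The shapes below are illustrative:

import torch
import torch.nn.functional as F

B, n_head, T, head_dim = 2, 4, 16, 32
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_head, T, head_dim)
v = torch.randn(B, n_head, T, head_dim)

# On a supported GPU the fused kernel applies the causal mask internally and
# never materializes the full (T, T) attention matrix, which is where the
# speed and memory win comes from.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 32])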

One of the first major challenges I encountered while training GPT-2 was when the dataset size grew significantly. Initially, I was working with a smaller dataset — about 900 million tokens — and everything worked smoothly. But as I scaled to larger models and increased the dataset size to 40 GB, my original data loading approach simply didn’t hold up. Let me explain this with a piece of code I originally used.

import logging

import torch
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)


class GPT2Dataset(Dataset):
    def __init__(self, data, seq_len, tokenizer):
        super().__init__()

        if not data:
            raise ValueError("Input data cannot be empty")
        if seq_len <= 0:
            raise ValueError(f"Sequence length must be positive, got {seq_len}")
        if not hasattr(tokenizer, 'encode'):
            raise ValueError("Tokenizer must have an 'encode' method")

        self.seq_len = seq_len
        self.data = data
        self.tokenizer = tokenizer

        logger.info(f"Tokenizing dataset with sequence length {seq_len}")
        self.tokens = self.tokenizer.encode(self.data, allowed_special={'<|endoftext|>'})
        logger.info(f"Total tokens: {len(self.tokens)}")

        num_samples = len(self.tokens) // (self.seq_len + 1)
        self.tokens = self.tokens[: num_samples * (self.seq_len + 1)]
        self.tokens = torch.tensor(self.tokens, dtype=torch.long).reshape(num_samples, self.seq_len + 1)
        logger.info(f"Created {num_samples} training samples")

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        x = self.tokens[idx, :-1]  # Input: all but last token
        y = self.tokens[idx, 1:]   # Target: all but first token
        return x, y

This approach worked fine for smaller datasets. I could tokenize the entire text corpus in memory and create training samples on the fly. However, as I scaled up, several issues emerged.