Understanding Makemore
Build character-level language models from scratch — from bigram counting through MLPs, batch normalization, and WaveNet — generating new names with PyTorch.
Download Courseware
Prefer to inspect a complete implementation? Download pre-completed courseware for this course.
- Pre-completed project: complete reference implementations — bigram, MLP, batch-normalized MLP, and WaveNet models, ready to train on names.
Bigrams
The simplest language model — predicting the next character from just the current one, using both counting and neural network approaches.
- Explain what makemore does — generating new names by learning character-level patterns
- Describe the names.txt dataset and what character-level modeling means
- Position makemore in the Karpathy neural network learning progression
- Build a character-level vocabulary with a special start/end token
- Explain why a special boundary token is needed for generation
- Implement encode and decode functions for character sequences
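A minimal sketch of the vocabulary and encode/decode objectives. The word list is a stand-in for the real names.txt contents, and `.` at index 0 is one conventional choice of start/end boundary token:

```python
# Stand-in for the contents of names.txt.
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))           # unique characters in the data
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                  # special start/end token at index 0
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer character indices."""
    return [stoi[c] for c in s]

def decode(ix):
    """Map a list of integer indices back to a string."""
    return "".join(itos[i] for i in ix)
```

Putting the boundary token at index 0 keeps row 0 and column 0 of the later bigram matrix reserved for "word starts here" and "word ends here".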
Counting Bigrams
- Construct a bigram frequency matrix from character pairs in the dataset
- Convert raw counts to a probability distribution using row normalization
- Evaluate a language model using average negative log likelihood
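These three objectives fit in a few lines of PyTorch. A sketch with a stand-in word list (the real dataset is names.txt); the `+1` count smoothing is one common way to avoid taking `log(0)`:

```python
import torch

words = ["emma", "olivia", "ava"]              # stand-in for names.txt
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                   # boundary token
vocab = len(stoi)

# Count every adjacent character pair, padding each word with '.'.
N = torch.zeros((vocab, vocab), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Row-normalize smoothed counts into P(next char | current char).
P = (N + 1).float()
P /= P.sum(1, keepdim=True)

# Average negative log likelihood of the data under the model.
log_likelihood, n = 0.0, 0
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[c1], stoi[c2]]).item()
        n += 1
nll = -log_likelihood / n
```

Lower average NLL means the model assigns higher probability to the bigrams that actually occur in the data.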
- Implement a neural network that produces the same bigram probabilities as counting
- Explain how one-hot encoding plus a linear layer plus softmax is equivalent to a lookup table
- Connect cross-entropy loss to negative log likelihood
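The equivalence claims above can be checked directly. A sketch: one-hot times a weight matrix selects rows of that matrix, and cross-entropy on the logits matches the negative log likelihood computed from the softmax probabilities:

```python
import torch
import torch.nn.functional as F

vocab = 27
g = torch.Generator().manual_seed(0)
W = torch.randn((vocab, vocab), generator=g)   # one linear layer of logits

xs = torch.tensor([0, 5, 13])                  # some current-character indices
ys = torch.tensor([5, 13, 0])                  # their next-character targets

# One-hot encoding followed by a matmul just selects rows of W...
xenc = F.one_hot(xs, num_classes=vocab).float()
logits = xenc @ W
# ...so it is equivalent to a plain lookup-table index:
assert torch.allclose(logits, W[xs])

# Softmax gives a probability row per example; cross-entropy on the
# logits equals the average negative log likelihood of the targets.
probs = logits.softmax(dim=1)
loss_ce = F.cross_entropy(logits, ys)
loss_nll = -probs[torch.arange(3), ys].log().mean()
assert torch.allclose(loss_ce, loss_nll)
```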
The MLP Language Model
Using embeddings and a multi-layer perceptron to predict the next character from multiple previous characters.
- Explain why dense embeddings outperform one-hot encoding for representing characters
- Describe how a context window of block_size characters provides more information than a bigram
- Implement an embedding lookup table that maps character indices to dense vectors
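The embedding lookup objective reduces to fancy indexing into a table. A sketch with assumed sizes (27 characters, 10-dimensional embeddings, a block_size of 3):

```python
import torch

vocab, emb_dim, block_size = 27, 10, 3
g = torch.Generator().manual_seed(0)
C = torch.randn((vocab, emb_dim), generator=g)   # embedding table

# A batch of 4 contexts, each holding block_size previous characters.
X = torch.randint(0, vocab, (4, block_size), generator=g)

# Indexing with an integer tensor looks up one emb_dim vector per character.
emb = C[X]                                       # shape (4, block_size, emb_dim)
```

Unlike a one-hot representation, the rows of `C` are dense, trainable, and much smaller than the vocabulary size.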
- Trace the forward pass from concatenated embeddings through a hidden layer to output probabilities
- Explain the role of the tanh activation function in the hidden layer
- Calculate the parameter count of the MLP and understand model capacity
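A sketch of the full forward pass with one possible set of hyperparameters (10-dim embeddings, block_size 3, 200 hidden units); the parameter count falls straight out of the tensor shapes:

```python
import torch
import torch.nn.functional as F

vocab, emb_dim, block_size, n_hidden = 27, 10, 3, 200
g = torch.Generator().manual_seed(0)

C  = torch.randn((vocab, emb_dim), generator=g)               # embedding table
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab), generator=g)
b2 = torch.randn(vocab, generator=g)

X = torch.randint(0, vocab, (32, block_size), generator=g)    # toy batch
Y = torch.randint(0, vocab, (32,), generator=g)

emb = C[X]                                    # (32, 3, 10)
h = torch.tanh(emb.view(32, -1) @ W1 + b1)    # (32, 200); tanh squashes to (-1, 1)
logits = h @ W2 + b2                          # (32, 27)
loss = F.cross_entropy(logits, Y)

# Parameter count: 27*10 + 30*200 + 200 + 200*27 + 27 = 11,897
n_params = sum(p.numel() for p in [C, W1, b1, W2, b2])
```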
- Implement minibatch training and explain why it is preferred over full-batch training
- Find an appropriate learning rate using an exponential sweep
- Diagnose overfitting and underfitting by comparing train and dev loss
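A sketch of minibatch training combined with an exponential learning-rate sweep (rates from 10^-3 to 10^0; the data here is random stand-in tensors, not real name contexts). In practice the sweep is run once, the loss is plotted against the rate, and a value from the flat low region is picked:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
vocab, emb_dim, block_size, n_hidden = 27, 10, 3, 64

C  = torch.randn((vocab, emb_dim), generator=g)
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g) * 0.1
b1 = torch.zeros(n_hidden)
W2 = torch.randn((n_hidden, vocab), generator=g) * 0.1
b2 = torch.zeros(vocab)
params = [C, W1, b1, W2, b2]
for p in params:
    p.requires_grad = True

# Toy training tensors; the real X, Y come from names.txt contexts.
Xtr = torch.randint(0, vocab, (1000, block_size), generator=g)
Ytr = torch.randint(0, vocab, (1000,), generator=g)

# Exponentially spaced candidate learning rates: 10**-3 .. 10**0.
lrs = 10 ** torch.linspace(-3, 0, 100)
losses = []
for i in range(100):
    ix = torch.randint(0, Xtr.shape[0], (32,), generator=g)  # random minibatch
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(32, -1) @ W1 + b1)
    loss = F.cross_entropy(h @ W2 + b2, Ytr[ix])
    for p in params:
        p.grad = None
    loss.backward()
    for p in params:
        p.data -= lrs[i] * p.grad
    losses.append(loss.item())
```

Minibatches make each step cheap and noisy; many noisy steps beat few exact full-batch steps for the same compute budget.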
Activations and Batch Normalization
Diagnosing deep network pathologies — saturated activations, vanishing gradients — and fixing them with proper initialization and batch normalization.
- Diagnose saturated activations by examining activation histograms across layers
- Explain why poor weight initialization causes tanh activations to saturate near ±1
- Apply Kaiming initialization to fix activation scaling in deep networks
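The saturation diagnosis and its fix can be demonstrated numerically. A sketch: with unit-variance weights, pre-activations have standard deviation about sqrt(fan_in), pinning tanh near ±1; scaling the weights by the tanh gain (5/3) over sqrt(fan_in) — the Kaiming prescription — fixes this:

```python
import torch

fan_in, n_hidden = 30, 200
g = torch.Generator().manual_seed(0)
x = torch.randn(10000, fan_in, generator=g)

# Naive init: pre-activation std ~ sqrt(fan_in) ~ 5.5, so tanh saturates.
W_naive = torch.randn(fan_in, n_hidden, generator=g)
h_naive = torch.tanh(x @ W_naive)

# Kaiming init for tanh: scale by gain (5/3) / sqrt(fan_in).
W_kaiming = torch.randn(fan_in, n_hidden, generator=g) * (5 / 3) / fan_in**0.5
h_kaiming = torch.tanh(x @ W_kaiming)

# Fraction of activations pinned near +-1 drops dramatically.
sat_naive = (h_naive.abs() > 0.99).float().mean().item()
sat_kaiming = (h_kaiming.abs() > 0.99).float().mean().item()
```

Saturated units have near-zero local tanh derivative, so almost no gradient flows through them.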
Gradient Flow
- Visualize gradient distributions across layers to diagnose training problems
- Explain the vanishing gradient problem and its connection to activation saturation
- Use the gradient-to-data ratio as a diagnostic for healthy training
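One way to compute the diagnostic in the last objective, as a sketch: compare the spread of a parameter's gradient to the spread of the parameter itself (a common rule of thumb is that the resulting per-step update should be roughly a thousandth of the data scale, but the exact threshold is a heuristic, not a law):

```python
import torch

g = torch.Generator().manual_seed(0)
W = torch.randn(30, 200, generator=g, requires_grad=True)
x = torch.randn(64, 30, generator=g)

# Any differentiable scalar objective will do for illustration.
loss = torch.tanh(x @ W).pow(2).mean()
loss.backward()

# Gradient-to-data ratio: how large a step is relative to the weights.
ratio = (W.grad.std() / W.data.std()).item()
```

Plotting this ratio per layer over training exposes layers that are learning far faster or slower than the rest of the network.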
- Implement batch normalization from scratch with mean, variance, gamma, and beta
- Explain why normalizing activations stabilizes training in deep networks
- Distinguish between training mode (batch statistics) and inference mode (running statistics)
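A from-scratch sketch covering all three objectives: batch statistics with learnable `gamma`/`beta` in training mode, and exponentially averaged running statistics in inference mode (the momentum value is an assumption, matching PyTorch's default for `BatchNorm1d`):

```python
import torch

n_hidden = 200
g = torch.Generator().manual_seed(0)
gamma = torch.ones(n_hidden)        # learnable scale
beta  = torch.zeros(n_hidden)       # learnable shift
running_mean = torch.zeros(n_hidden)
running_var  = torch.ones(n_hidden)
momentum, eps = 0.1, 1e-5

def batchnorm(x, training):
    global running_mean, running_var
    if training:
        mean = x.mean(0, keepdim=True)           # batch statistics
        var  = x.var(0, keepdim=True)
        with torch.no_grad():                    # running stats are not trained
            running_mean = (1 - momentum) * running_mean + momentum * mean.squeeze(0)
            running_var  = (1 - momentum) * running_var  + momentum * var.squeeze(0)
    else:
        mean, var = running_mean, running_var    # inference: running statistics
    xhat = (x - mean) / torch.sqrt(var + eps)    # normalize to zero mean, unit var
    return gamma * xhat + beta

# Pre-activations with a bad scale and offset get re-standardized.
x = torch.randn(32, n_hidden, generator=g) * 5 + 3
out = batchnorm(x, training=True)
```

Because every layer now sees roughly unit-Gaussian inputs regardless of what earlier layers do, training in deep stacks becomes far less sensitive to initialization.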
Becoming a Backprop Ninja
Manually computing gradients through every layer — cross-entropy, linear layers, tanh, batch normalization, and embeddings.
- Explain why understanding manual backpropagation matters beyond autograd
- Compute the gradient of cross-entropy loss by hand at the tensor level
- Verify manual gradients against PyTorch autograd using torch.allclose
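The hand-computed cross-entropy gradient and its `torch.allclose` verification look like this, as a sketch: the gradient of the loss with respect to the logits is the softmax probabilities, minus 1 at each correct class, divided by the batch size:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
n, vocab = 32, 27
logits = torch.randn(n, vocab, generator=g, requires_grad=True)
y = torch.randint(0, vocab, (n,), generator=g)

loss = F.cross_entropy(logits, y)
loss.backward()                                  # autograd's answer

# Manual gradient: softmax, subtract 1 at the target class, average.
dlogits = F.softmax(logits, dim=1).detach()
dlogits[torch.arange(n), y] -= 1
dlogits /= n
```

The subtraction of 1 at the correct class is why confidently wrong predictions receive large gradients and confidently right ones almost none.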
- Compute gradients through a linear layer using matrix multiplication and transposes
- Derive the backward pass through tanh using its local derivative
- Implement the backward pass through batch normalization step by step
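A sketch of the first two of these backward passes (linear and tanh) checked against autograd; the transpose-matmul pattern for the linear layer and the `1 - tanh(z)^2` local derivative are the two core moves:

```python
import torch

g = torch.Generator().manual_seed(0)
x = torch.randn(32, 30, generator=g, requires_grad=True)
W = torch.randn(30, 200, generator=g, requires_grad=True)
b = torch.randn(200, generator=g, requires_grad=True)

h = torch.tanh(x @ W + b)     # forward: linear layer then tanh
loss = h.sum()
loss.backward()               # autograd's answer

# Manual backward pass, reusing the forward activations.
dh = torch.ones_like(h)               # d(sum)/dh = 1 everywhere
dpre = dh * (1 - h**2)                # tanh local derivative: 1 - tanh(z)^2
dW = x.detach().T @ dpre              # linear layer: transpose-matmul pattern
dx = dpre @ W.detach().T
db = dpre.sum(0)                      # bias gradient sums over the batch
```

Note how `h` itself is reused in the tanh derivative; caching forward activations is exactly what makes backprop cheap.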
- Complete a full manual backward pass through an entire neural network
- Explain how gradients flow through an embedding table lookup
- Articulate how manual backpropagation skill transfers to real-world model development
Building a WaveNet
From flat MLPs to hierarchical character fusion — tree-structured models and the connection to dilated causal convolutions.
- Explain why hierarchical merging improves on flat concatenation of context characters
- Implement FlattenConsecutive to progressively merge pairs of character representations
- Navigate multi-dimensional tensor shapes through the hierarchical forward pass
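A sketch of `FlattenConsecutive` and the shape bookkeeping it implies, assuming an 8-character context with 10-dimensional embeddings (the linear/batchnorm layers that sit between merges in the real model are omitted here):

```python
import torch

class FlattenConsecutive:
    """Merge every n consecutive character representations into one vector."""
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)   # fuse n neighbors channel-wise
        if x.shape[1] == 1:                      # squeeze a trivial time dim
            x = x.squeeze(1)
        return x

# 8 context characters merged pairwise three times: 8 -> 4 -> 2 -> 1.
x = torch.randn(4, 8, 10)
x = FlattenConsecutive(2)(x)   # (4, 4, 20)
x = FlattenConsecutive(2)(x)   # (4, 2, 40)
x = FlattenConsecutive(2)(x)   # (4, 80) after the squeeze
```

Each merge halves the time dimension and doubles the channel dimension, giving the tree-structured fusion that flat concatenation lacks.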
- Describe how dilated causal convolutions implement hierarchical merging efficiently
- Trace the complete progression from counting bigrams to WaveNet and its loss improvements
- Connect makemore's character-level models to the transformer architectures covered in Understanding GPT