Understanding Makemore
Build character-level language models from scratch — from bigram counting through MLPs, batch normalization, and WaveNet — generating new names with PyTorch.
Download Courseware
Prefer to inspect a complete implementation? Download pre-completed courseware for this course.
- Pre-completed project: complete reference implementations — bigram, MLP, batch-normalized MLP, and WaveNet models, ready to train on names.
Bigrams
The simplest language model — predicting the next character from just the current one, using both counting and neural network approaches.
- Explain what makemore does — generating new names by learning character-level patterns
- Describe the names.txt dataset and what character-level modeling means
- Position makemore in the Karpathy neural network learning progression
- Build a character-level vocabulary with a special start/end token
- Explain why a special boundary token is needed for generation
- Implement encode and decode functions for character sequences
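A minimal sketch of the vocabulary and encode/decode objectives. The word list is a stand-in for the real names.txt contents, and `.` at index 0 is one conventional choice of start/end boundary token:

```python
# Stand-in for the contents of names.txt.
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))           # unique characters in the data
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                  # special start/end token at index 0
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer character indices."""
    return [stoi[c] for c in s]

def decode(ix):
    """Map a list of integer indices back to a string."""
    return "".join(itos[i] for i in ix)
```

Putting the boundary token at index 0 keeps row 0 and column 0 of the later bigram matrix reserved for "word starts here" and "word ends here".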
Counting Bigrams
- Construct a bigram frequency matrix from character pairs in the dataset
- Convert raw counts to a probability distribution using row normalization
- Evaluate a language model using average negative log likelihood
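These three objectives fit in a few lines of PyTorch. A sketch with a stand-in word list (the real dataset is names.txt); the `+1` count smoothing is one common way to avoid taking `log(0)`:

```python
import torch

words = ["emma", "olivia", "ava"]              # stand-in for names.txt
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                   # boundary token
vocab = len(stoi)

# Count every adjacent character pair, padding each word with '.'.
N = torch.zeros((vocab, vocab), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Row-normalize smoothed counts into P(next char | current char).
P = (N + 1).float()
P /= P.sum(1, keepdim=True)

# Average negative log likelihood of the data under the model.
log_likelihood, n = 0.0, 0
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[c1], stoi[c2]]).item()
        n += 1
nll = -log_likelihood / n
```

Lower average NLL means the model assigns higher probability to the bigrams that actually occur in the data.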
- Implement a neural network that produces the same bigram probabilities as counting
- Explain how one-hot encoding plus a linear layer plus softmax is equivalent to a lookup table
- Connect cross-entropy loss to negative log likelihood
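The equivalence claims above can be checked directly. A sketch: one-hot times a weight matrix selects rows of that matrix, and cross-entropy on the logits matches the negative log likelihood computed from the softmax probabilities:

```python
import torch
import torch.nn.functional as F

vocab = 27
g = torch.Generator().manual_seed(0)
W = torch.randn((vocab, vocab), generator=g)   # one linear layer of logits

xs = torch.tensor([0, 5, 13])                  # some current-character indices
ys = torch.tensor([5, 13, 0])                  # their next-character targets

# One-hot encoding followed by a matmul just selects rows of W...
xenc = F.one_hot(xs, num_classes=vocab).float()
logits = xenc @ W
# ...so it is equivalent to a plain lookup-table index:
assert torch.allclose(logits, W[xs])

# Softmax gives a probability row per example; cross-entropy on the
# logits equals the average negative log likelihood of the targets.
probs = logits.softmax(dim=1)
loss_ce = F.cross_entropy(logits, ys)
loss_nll = -probs[torch.arange(3), ys].log().mean()
assert torch.allclose(loss_ce, loss_nll)
```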
The MLP Language Model
Using embeddings and a multi-layer perceptron to predict the next character from multiple previous characters.
- Explain why dense embeddings outperform one-hot encoding for representing characters
- Describe how a context window of block_size characters provides more information than a bigram
- Implement an embedding lookup table that maps character indices to dense vectors
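The embedding lookup objective reduces to fancy indexing into a table. A sketch with assumed sizes (27 characters, 10-dimensional embeddings, a block_size of 3):

```python
import torch

vocab, emb_dim, block_size = 27, 10, 3
g = torch.Generator().manual_seed(0)
C = torch.randn((vocab, emb_dim), generator=g)   # embedding table

# A batch of 4 contexts, each holding block_size previous characters.
X = torch.randint(0, vocab, (4, block_size), generator=g)

# Indexing with an integer tensor looks up one emb_dim vector per character.
emb = C[X]                                       # shape (4, block_size, emb_dim)
```

Unlike a one-hot representation, the rows of `C` are dense, trainable, and much smaller than the vocabulary size.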
- Trace the forward pass from concatenated embeddings through a hidden layer to output probabilities
- Explain the role of the tanh activation function in the hidden layer
- Calculate the parameter count of the MLP and understand model capacity
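A sketch of the full forward pass with one possible set of hyperparameters (10-dim embeddings, block_size 3, 200 hidden units); the parameter count falls straight out of the tensor shapes:

```python
import torch
import torch.nn.functional as F

vocab, emb_dim, block_size, n_hidden = 27, 10, 3, 200
g = torch.Generator().manual_seed(0)

C  = torch.randn((vocab, emb_dim), generator=g)               # embedding table
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab), generator=g)
b2 = torch.randn(vocab, generator=g)

X = torch.randint(0, vocab, (32, block_size), generator=g)    # toy batch
Y = torch.randint(0, vocab, (32,), generator=g)

emb = C[X]                                    # (32, 3, 10)
h = torch.tanh(emb.view(32, -1) @ W1 + b1)    # (32, 200); tanh squashes to (-1, 1)
logits = h @ W2 + b2                          # (32, 27)
loss = F.cross_entropy(logits, Y)

# Parameter count: 27*10 + 30*200 + 200 + 200*27 + 27 = 11,897
n_params = sum(p.numel() for p in [C, W1, b1, W2, b2])
```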
- Implement minibatch training and explain why it is preferred over full-batch training
- Find an appropriate learning rate using an exponential sweep
- Diagnose overfitting and underfitting by comparing train and dev loss
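A sketch of minibatch training combined with an exponential learning-rate sweep (rates from 10^-3 to 10^0; the data here is random stand-in tensors, not real name contexts). In practice the sweep is run once, the loss is plotted against the rate, and a value from the flat low region is picked:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
vocab, emb_dim, block_size, n_hidden = 27, 10, 3, 64

C  = torch.randn((vocab, emb_dim), generator=g)
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g) * 0.1
b1 = torch.zeros(n_hidden)
W2 = torch.randn((n_hidden, vocab), generator=g) * 0.1
b2 = torch.zeros(vocab)
params = [C, W1, b1, W2, b2]
for p in params:
    p.requires_grad = True

# Toy training tensors; the real X, Y come from names.txt contexts.
Xtr = torch.randint(0, vocab, (1000, block_size), generator=g)
Ytr = torch.randint(0, vocab, (1000,), generator=g)

# Exponentially spaced candidate learning rates: 10**-3 .. 10**0.
lrs = 10 ** torch.linspace(-3, 0, 100)
losses = []
for i in range(100):
    ix = torch.randint(0, Xtr.shape[0], (32,), generator=g)  # random minibatch
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(32, -1) @ W1 + b1)
    loss = F.cross_entropy(h @ W2 + b2, Ytr[ix])
    for p in params:
        p.grad = None
    loss.backward()
    for p in params:
        p.data -= lrs[i] * p.grad
    losses.append(loss.item())
```

Minibatches make each step cheap and noisy; many noisy steps beat few exact full-batch steps for the same compute budget.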
Activations and Batch Normalization
Diagnosing deep network pathologies — saturated activations, vanishing gradients — and fixing them with proper initialization and batch normalization.
- Diagnose saturated activations by examining activation histograms across layers
- Explain why poor weight initialization causes tanh activations to saturate near ±1
- Apply Kaiming initialization to fix activation scaling in deep networks
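The saturation diagnosis and its fix can be demonstrated numerically. A sketch: with unit-variance weights, pre-activations have standard deviation about sqrt(fan_in), pinning tanh near ±1; scaling the weights by the tanh gain (5/3) over sqrt(fan_in) — the Kaiming prescription — fixes this:

```python
import torch

fan_in, n_hidden = 30, 200
g = torch.Generator().manual_seed(0)
x = torch.randn(10000, fan_in, generator=g)

# Naive init: pre-activation std ~ sqrt(fan_in) ~ 5.5, so tanh saturates.
W_naive = torch.randn(fan_in, n_hidden, generator=g)
h_naive = torch.tanh(x @ W_naive)

# Kaiming init for tanh: scale by gain (5/3) / sqrt(fan_in).
W_kaiming = torch.randn(fan_in, n_hidden, generator=g) * (5 / 3) / fan_in**0.5
h_kaiming = torch.tanh(x @ W_kaiming)

# Fraction of activations pinned near +-1 drops dramatically.
sat_naive = (h_naive.abs() > 0.99).float().mean().item()
sat_kaiming = (h_kaiming.abs() > 0.99).float().mean().item()
```

Saturated units have near-zero local tanh derivative, so almost no gradient flows through them.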
Gradient Flow
- Visualize gradient distributions across layers to diagnose training problems
- Explain the vanishing gradient problem and its connection to activation saturation
- Use the gradient-to-data ratio as a diagnostic for healthy training
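One way to compute the diagnostic in the last objective, as a sketch: compare the spread of a parameter's gradient to the spread of the parameter itself (a common rule of thumb is that the resulting per-step update should be roughly a thousandth of the data scale, but the exact threshold is a heuristic, not a law):

```python
import torch

g = torch.Generator().manual_seed(0)
W = torch.randn(30, 200, generator=g, requires_grad=True)
x = torch.randn(64, 30, generator=g)

# Any differentiable scalar objective will do for illustration.
loss = torch.tanh(x @ W).pow(2).mean()
loss.backward()

# Gradient-to-data ratio: how large a step is relative to the weights.
ratio = (W.grad.std() / W.data.std()).item()
```

Plotting this ratio per layer over training exposes layers that are learning far faster or slower than the rest of the network.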
- Implement batch normalization from scratch with mean, variance, gamma, and beta
- Explain why normalizing activations stabilizes training in deep networks
- Distinguish between training mode (batch statistics) and inference mode (running statistics)
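A from-scratch sketch covering all three objectives: batch statistics with learnable `gamma`/`beta` in training mode, and exponentially averaged running statistics in inference mode (the momentum value is an assumption, matching PyTorch's default for `BatchNorm1d`):

```python
import torch

n_hidden = 200
g = torch.Generator().manual_seed(0)
gamma = torch.ones(n_hidden)        # learnable scale
beta  = torch.zeros(n_hidden)       # learnable shift
running_mean = torch.zeros(n_hidden)
running_var  = torch.ones(n_hidden)
momentum, eps = 0.1, 1e-5

def batchnorm(x, training):
    global running_mean, running_var
    if training:
        mean = x.mean(0, keepdim=True)           # batch statistics
        var  = x.var(0, keepdim=True)
        with torch.no_grad():                    # running stats are not trained
            running_mean = (1 - momentum) * running_mean + momentum * mean.squeeze(0)
            running_var  = (1 - momentum) * running_var  + momentum * var.squeeze(0)
    else:
        mean, var = running_mean, running_var    # inference: running statistics
    xhat = (x - mean) / torch.sqrt(var + eps)    # normalize to zero mean, unit var
    return gamma * xhat + beta

# Pre-activations with a bad scale and offset get re-standardized.
x = torch.randn(32, n_hidden, generator=g) * 5 + 3
out = batchnorm(x, training=True)
```

Because every layer now sees roughly unit-Gaussian inputs regardless of what earlier layers do, training in deep stacks becomes far less sensitive to initialization.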
Becoming a Backprop Ninja
Manually computing gradients through every layer — cross-entropy, linear layers, tanh, batch normalization, and embeddings.
- Explain why understanding manual backpropagation matters beyond autograd
- Compute the gradient of cross-entropy loss by hand at the tensor level
- Verify manual gradients against PyTorch autograd using torch.allclose
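The hand-computed cross-entropy gradient and its `torch.allclose` verification look like this, as a sketch: the gradient of the loss with respect to the logits is the softmax probabilities, minus 1 at each correct class, divided by the batch size:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
n, vocab = 32, 27
logits = torch.randn(n, vocab, generator=g, requires_grad=True)
y = torch.randint(0, vocab, (n,), generator=g)

loss = F.cross_entropy(logits, y)
loss.backward()                                  # autograd's answer

# Manual gradient: softmax, subtract 1 at the target class, average.
dlogits = F.softmax(logits, dim=1).detach()
dlogits[torch.arange(n), y] -= 1
dlogits /= n
```

The subtraction of 1 at the correct class is why confidently wrong predictions receive large gradients and confidently right ones almost none.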
- Compute gradients through a linear layer using matrix multiplication and transposes
- Derive the backward pass through tanh using its local derivative
- Implement the backward pass through batch normalization step by step
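A sketch of the first two of these backward passes (linear and tanh) checked against autograd; the transpose-matmul pattern for the linear layer and the `1 - tanh(z)^2` local derivative are the two core moves:

```python
import torch

g = torch.Generator().manual_seed(0)
x = torch.randn(32, 30, generator=g, requires_grad=True)
W = torch.randn(30, 200, generator=g, requires_grad=True)
b = torch.randn(200, generator=g, requires_grad=True)

h = torch.tanh(x @ W + b)     # forward: linear layer then tanh
loss = h.sum()
loss.backward()               # autograd's answer

# Manual backward pass, reusing the forward activations.
dh = torch.ones_like(h)               # d(sum)/dh = 1 everywhere
dpre = dh * (1 - h**2)                # tanh local derivative: 1 - tanh(z)^2
dW = x.detach().T @ dpre              # linear layer: transpose-matmul pattern
dx = dpre @ W.detach().T
db = dpre.sum(0)                      # bias gradient sums over the batch
```

Note how `h` itself is reused in the tanh derivative; caching forward activations is exactly what makes backprop cheap.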
- Complete a full manual backward pass through an entire neural network
- Explain how gradients flow through an embedding table lookup
- Articulate how manual backpropagation skill transfers to real-world model development
Building a WaveNet
From flat MLPs to hierarchical character fusion — tree-structured models and the connection to dilated causal convolutions.
- Explain why hierarchical merging improves on flat concatenation of context characters
- Implement FlattenConsecutive to progressively merge pairs of character representations
- Navigate multi-dimensional tensor shapes through the hierarchical forward pass
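A sketch of `FlattenConsecutive` and the shape bookkeeping it implies, assuming an 8-character context with 10-dimensional embeddings (the linear/batchnorm layers that sit between merges in the real model are omitted here):

```python
import torch

class FlattenConsecutive:
    """Merge every n consecutive character representations into one vector."""
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)   # fuse n neighbors channel-wise
        if x.shape[1] == 1:                      # squeeze a trivial time dim
            x = x.squeeze(1)
        return x

# 8 context characters merged pairwise three times: 8 -> 4 -> 2 -> 1.
x = torch.randn(4, 8, 10)
x = FlattenConsecutive(2)(x)   # (4, 4, 20)
x = FlattenConsecutive(2)(x)   # (4, 2, 40)
x = FlattenConsecutive(2)(x)   # (4, 80) after the squeeze
```

Each merge halves the time dimension and doubles the channel dimension, giving the tree-structured fusion that flat concatenation lacks.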
- Describe how dilated causal convolutions implement hierarchical merging efficiently
- Trace the complete progression from counting bigrams to WaveNet and its loss improvements
- Connect makemore's character-level models to the transformer architectures covered in Understanding GPT