
Understanding Makemore

Build character-level language models from scratch — from bigram counting through MLPs, batch normalization, and WaveNet — and generate new names with PyTorch.


download courseware

Prefer to inspect a complete implementation? Download pre-completed courseware for this course.

  • Pre-completed project
    Complete reference implementations — bigram, MLP, batch-normalized MLP, and WaveNet models, ready to train on names.

Bigrams

The simplest language model — predicting the next character from just the current one, using both counting and neural network approaches.

  • [ ] What Is Makemore? [text] free
    • Explain what makemore does — generating new names by learning character-level patterns
    • Describe the names.txt dataset and what character-level modeling means
    • Position makemore in the Karpathy neural network learning progression
  • [ ] The Dataset and Encoding [text] free
    • Build a character-level vocabulary with a special start/end token
    • Explain why a special boundary token is needed for generation
    • Implement encode and decode functions for character sequences
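The encoding objectives above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the `names.txt` dataset is replaced here by a tiny hard-coded word list, and `.` is used as the single start/end boundary token at index 0 (the convention makemore uses).

```python
# Build a character vocabulary from a tiny word list (stand-in for names.txt),
# reserving '.' as the special start/end boundary token at index 0.
words = ["emma", "olivia", "ava"]
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # real characters start at 1
stoi["."] = 0                                     # boundary token
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer character indices."""
    return [stoi[ch] for ch in s]

def decode(ixs):
    """Map integer indices back to a string."""
    return "".join(itos[i] for i in ixs)

print(encode("emma"))          # integer indices for each character
print(decode(encode("emma")))  # round-trips back to 'emma'
```

The single `.` token can serve as both start and end marker because a name never contains a dot: seeing `.` on the input side means "beginning of name", and sampling `.` means "stop generating".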
  • [ ] Counting Bigrams [text]
    • Construct a bigram frequency matrix from character pairs in the dataset
    • Convert raw counts to a probability distribution using row normalization
    • Evaluate a language model using average negative log likelihood
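The three counting objectives fit together as one short pipeline — count, row-normalize, score. A minimal sketch with a stand-in word list (the real course uses the full `names.txt`):

```python
import math

words = ["emma", "olivia", "ava"]
chars = ["."] + sorted(set("".join(words)))  # '.' boundary token at index 0
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# Count bigrams, wrapping each word with the '.' boundary token.
N = [[0] * V for _ in range(V)]
for w in words:
    seq = ["."] + list(w) + ["."]
    for a, b in zip(seq, seq[1:]):
        N[stoi[a]][stoi[b]] += 1

# Row-normalize counts into conditional probabilities P(next | current).
P = []
for row in N:
    s = sum(row)
    P.append([c / s if s > 0 else 0.0 for c in row])

# Average negative log likelihood over all bigrams in the data:
# the lower it is, the more probability the model assigns to the dataset.
nll, count = 0.0, 0
for w in words:
    seq = ["."] + list(w) + ["."]
    for a, b in zip(seq, seq[1:]):
        nll += -math.log(P[stoi[a]][stoi[b]])
        count += 1
print(nll / count)
```

Each row of `P` is a probability distribution over the next character given the current one, which is exactly what sampling a new name walks through.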
  • [ ] The Neural Bigram [text]
    • Implement a neural network that produces the same bigram probabilities as counting
    • Explain how one-hot encoding plus a linear layer plus softmax is equivalent to a lookup table
    • Connect cross-entropy loss to negative log likelihood
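The lookup-table equivalence in the second objective can be demonstrated directly: multiplying a one-hot vector by a weight matrix just selects one row of that matrix, so a linear layer plus softmax behaves like indexing into a table of probability distributions. A small sketch (the vocabulary size of 27 assumes 26 letters plus the `.` token):

```python
import torch

V = 27  # vocabulary size: 26 lowercase letters + '.' boundary token
g = torch.Generator().manual_seed(42)
W = torch.randn((V, V), generator=g)  # one row of logits per input character

ix = torch.tensor([5])  # some current-character index
xenc = torch.nn.functional.one_hot(ix, num_classes=V).float()

# one_hot @ W selects row ix of W, so softmax over these logits is a
# differentiable lookup into a table of next-character distributions.
logits = xenc @ W
probs = logits.softmax(dim=1)
assert torch.allclose(probs[0], W[5].softmax(dim=0))
```

The difference from the counting model is that `W` is learned by gradient descent on cross-entropy loss — which, for a one-hot target, is exactly the negative log likelihood from the counting approach.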

The MLP Language Model

Using embeddings and a multi-layer perceptron to predict the next character from multiple previous characters.

    • Explain why dense embeddings outperform one-hot encoding for representing characters
    • Describe how a context window of block_size characters provides more information than a bigram
    • Implement an embedding lookup table that maps character indices to dense vectors
    • Trace the forward pass from concatenated embeddings through a hidden layer to output probabilities
    • Explain the role of the tanh activation function in the hidden layer
    • Calculate the MLP's parameter count and relate it to model capacity
    • Implement minibatch training and explain why it is preferred over full-batch training
    • Find an appropriate learning rate using an exponential sweep
    • Diagnose overfitting and underfitting by comparing train and dev loss
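The forward pass described above — embedding lookup, concatenation, tanh hidden layer, output logits — can be sketched end to end. The sizes here (`block_size=3`, 2-dimensional embeddings, 100 hidden units, batch of 32) are illustrative choices, not prescribed hyperparameters:

```python
import torch

V, block_size, emb_dim, hidden = 27, 3, 2, 100
g = torch.Generator().manual_seed(42)

C  = torch.randn((V, emb_dim), generator=g)                   # embedding table
W1 = torch.randn((block_size * emb_dim, hidden), generator=g)
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, V), generator=g)
b2 = torch.randn(V, generator=g)

X = torch.randint(0, V, (32, block_size), generator=g)  # minibatch of contexts

emb = C[X]                                   # (32, block_size, emb_dim) lookup
h = torch.tanh(emb.view(32, -1) @ W1 + b1)   # concatenate, then hidden layer
logits = h @ W2 + b2                         # (32, V) scores over next char
probs = logits.softmax(dim=1)
print(probs.shape)
```

The parameter count is just the sum of element counts: `V*emb_dim + block_size*emb_dim*hidden + hidden + hidden*V + V`, which is one concrete handle on model capacity.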

Activations and Batch Normalization

Diagnosing deep network pathologies — saturated activations, vanishing gradients — and fixing them with proper initialization and batch normalization.

    • Diagnose saturated activations by examining activation histograms across layers
    • Explain why poor weight initialization causes tanh activations to saturate near ±1
    • Apply Kaiming initialization to fix activation scaling in deep networks
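The saturation problem and its fix can be seen numerically in a few lines. This sketch compares unscaled weights against Kaiming-style scaling for tanh (gain 5/3 divided by the square root of fan-in); the layer width of 200 is an arbitrary illustrative choice:

```python
import torch

fan_in = 200
g = torch.Generator().manual_seed(42)
x = torch.randn((1000, fan_in), generator=g)  # roughly unit-variance inputs

# Naive init: pre-activations have std ~sqrt(fan_in), so tanh pins near ±1.
W_bad = torch.randn((fan_in, fan_in), generator=g)
h_bad = torch.tanh(x @ W_bad)

# Kaiming-style scaling for tanh keeps pre-activations in tanh's active range.
W_good = torch.randn((fan_in, fan_in), generator=g) * (5 / 3) / fan_in**0.5
h_good = torch.tanh(x @ W_good)

print((h_bad.abs() > 0.99).float().mean())   # large fraction saturated
print((h_good.abs() > 0.99).float().mean())  # far smaller fraction
```

A saturated tanh unit has a near-zero local derivative, so its weights barely receive gradient — which is why the activation histogram and the gradient problems in the next lesson are two views of the same pathology.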
  • [ ] Gradient Flow [text]
    • Visualize gradient distributions across layers to diagnose training problems
    • Explain the vanishing gradient problem and its connection to activation saturation
    • Use the gradient-to-data ratio as a diagnostic for healthy training
    • Implement batch normalization from scratch with mean, variance, gamma, and beta
    • Explain why normalizing activations stabilizes training in deep networks
    • Distinguish between training mode (batch statistics) and inference mode (running statistics)
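The training-mode half of batch normalization can be sketched from scratch with just the four ingredients named above — batch mean, batch variance, and the learnable `gamma` and `beta`. The input scaling here is an arbitrary example of badly-conditioned pre-activations:

```python
import torch

g = torch.Generator().manual_seed(42)
x = torch.randn((32, 100), generator=g) * 5 + 3  # badly-scaled pre-activations

gamma = torch.ones(100)   # learnable per-feature scale
beta  = torch.zeros(100)  # learnable per-feature shift
eps = 1e-5

# Training mode: normalize each feature with the batch's own statistics.
mean = x.mean(0, keepdim=True)
var = x.var(0, keepdim=True)
xhat = (x - mean) / (var + eps).sqrt()
out = gamma * xhat + beta

# At inference, running averages of mean/var (accumulated during training)
# would replace the batch statistics, so single examples can be processed
# without depending on whatever batch they happen to arrive in.
print(out.mean().item(), out.std().item())  # near 0 and 1
```

With `gamma=1, beta=0` the output is standardized per feature; the learnable parameters exist so the network can undo the normalization wherever that helps.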

Becoming a Backprop Ninja

Manually computing gradients through every layer — cross-entropy, linear layers, tanh, batch normalization, and embeddings.

    • Explain why understanding manual backpropagation matters beyond autograd
    • Compute the gradient of cross-entropy loss by hand at the tensor level
    • Verify manual gradients against PyTorch autograd using torch.allclose
    • Compute gradients through a linear layer using matrix multiplication and transposes
    • Derive the backward pass through tanh using its local derivative
    • Implement the backward pass through batch normalization step by step
    • Complete a full manual backward pass through an entire neural network
    • Explain how gradients flow through an embedding table lookup
    • Articulate how manual backpropagation skill transfers to real-world model development
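The verification workflow from these objectives — derive a gradient by hand, then check it against autograd with `torch.allclose` — can be shown on the simplest case, the tanh layer. A minimal sketch (the tensor sizes are arbitrary):

```python
import torch

g = torch.Generator().manual_seed(42)
x = torch.randn((4, 5), generator=g, requires_grad=True)
y = torch.tanh(x)
loss = y.sum()
loss.backward()  # autograd's answer lands in x.grad

# Manual backward pass: d/dx tanh(x) = 1 - tanh(x)^2, and because the loss
# is a plain sum, dloss/dy is 1 everywhere — so dx is just the local
# derivative, reusing the already-computed forward value y.
dx_manual = 1.0 - y.detach() ** 2
assert torch.allclose(dx_manual, x.grad)
```

The same pattern scales to every layer in the course: write down the local derivative at the tensor level, chain it with the incoming gradient, and let `torch.allclose` adjudicate.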

Building a WaveNet

From flat MLPs to hierarchical character fusion — tree-structured models and the connection to dilated causal convolutions.

    • Explain why hierarchical merging improves on flat concatenation of context characters
    • Implement FlattenConsecutive to progressively merge pairs of character representations
    • Navigate multi-dimensional tensor shapes through the hierarchical forward pass
    • Describe how dilated causal convolutions implement hierarchical merging efficiently
    • Trace the complete progression from counting bigrams to WaveNet and its loss improvements
    • Connect makemore's character-level models to the transformer architectures covered in Understanding GPT
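The core tensor move behind the hierarchical forward pass can be sketched with a stripped-down `FlattenConsecutive`: instead of flattening all context characters at once, it merges them two at a time, doubling the channel dimension and halving the time dimension at each level. The projection matrix here is a stand-in for the linear layer that would sit between levels:

```python
import torch

class FlattenConsecutive:
    """Merge each group of n consecutive time steps into one wider vector."""
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:       # squeeze away a singleton time axis
            x = x.squeeze(1)
        return x

x = torch.randn(4, 8, 10)         # (batch, context of 8 chars, 10 channels)
h1 = FlattenConsecutive(2)(x)     # (4, 4, 20): adjacent pairs merged
proj = torch.randn(20, 10)        # stand-in for a between-level linear layer
h2 = FlattenConsecutive(2)(h1 @ proj)  # (4, 2, 20): pairs of pairs merged
print(h1.shape, h2.shape)
```

Each level fuses information from a wider span of the context, which is the same tree-shaped receptive field that dilated causal convolutions compute — just expressed with reshapes instead of convolution kernels.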