Per-position computation after attention, and skip connections that make deep networks trainable.
This lesson requires an active subscription.