Click any block in the diagram to explore it in detail.
The Transformer processes sequences entirely through attention — no recurrence, no convolution. Each block on the left is a distinct computation stage. Start with Input Embedding to follow the data flow from top to bottom.
The raw input sentence is first tokenized into subword units using byte-pair encoding (BPE), producing a sequence of integer IDs. The embedding layer converts each integer ID into a dense, learnable vector of dimension dmodel = 512.
This is implemented as a simple lookup table — a matrix of shape [vocab_size × 512] where each row corresponds to one token in the vocabulary. Initially random, these vectors are learned during training to place semantically similar tokens nearby in the 512-dimensional space.
The paper also shares the embedding weight matrix with the pre-softmax linear transformation, and scales the embedding values by √dmodel to keep them in a suitable range relative to the positional encodings added next.
token_ids = [464, 3797, 3332] # "The cat sat"
# Embedding lookup
E = nn.Embedding(num_embeddings=37000, embedding_dim=512)
x = E(token_ids) * sqrt(512) # → [3 × 512]
The decoder also begins with an embedding layer, identical in structure to the encoder's. During training, the decoder receives the ground-truth target tokens shifted one position to the right — this is called teacher forcing. If the target is "The cat sat", the decoder input is "<START> The cat sat" and it tries to predict "The cat sat <END>".
During inference, the decoder runs autoregressively: it starts with just <START> and generates one token at a time, feeding each prediction back as the next input. The masking mechanism (see Masked Attention) prevents it from peeking ahead at future positions.
The same weight matrix is shared between the input embedding, output embedding, and the pre-softmax linear layer — a technique from "Using the Output Embedding to Improve Language Models" that reduces parameters and improves generalization.
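As a concrete sketch of this weight tying (NumPy, with random stand-in values — the snippets elsewhere use PyTorch-style pseudocode), one shared matrix serves both as the embedding lookup and, transposed, as the pre-softmax projection:

```python
import numpy as np

vocab, d = 37000, 512
E = np.random.randn(vocab, d) * 0.02   # one shared matrix (hypothetical init)

ids = np.array([464, 3797, 3332])      # "The cat sat"
x = E[ids] * np.sqrt(d)                # embedding lookup → [3, 512]

h = np.random.randn(3, d)              # stand-in for final decoder states
logits = h @ E.T                       # pre-softmax projection → [3, 37000]
```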
Attention is a set operation — it's order-agnostic. Without additional information, "cat sat on the mat" and "mat the on sat cat" would produce identical attention outputs. Positional encodings solve this by adding a unique position-dependent vector to each token embedding.
The paper uses fixed sinusoidal encodings rather than learned ones. Each position gets a 512-dimensional vector where even dimensions use sine and odd dimensions use cosine, at geometrically spaced frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/512))
PE(pos, 2i+1) = cos(pos / 10000^(2i/512))
The wavelengths range from 2π (high frequency, captures local position) to 10000·2π (low frequency, captures global position). This was chosen because for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos) — allowing the model to easily attend by relative offset.
The encoding is simply added to the embedding (not concatenated), keeping the dimension at 512. Learned embeddings were tested and produced nearly identical results, but sinusoids were chosen as they can generalize to longer sequences than seen during training.
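The sinusoid table can be computed once up front. A minimal NumPy sketch of the formulas above (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                  # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]               # [1, d_model/2]
    angles = pos / np.power(10000.0, 2 * i / d_model)  # geometric frequencies
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = positional_encoding(50)
# each row pe[pos] is added to the token embedding at that position
```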
This is the core operation of the Transformer. Every token in the input simultaneously computes how much it should attend to every other token — including itself. The result for each token is a weighted blend of all token representations, weighted by relevance.
Three projections are learned: Query (what am I looking for?), Key (what do I offer?), Value (what information do I carry?). Q·Kᵀ measures compatibility between every pair of tokens; softmax normalizes these into probabilities; the result weights the Values.
This runs 8 times in parallel (multi-head), each head using projections down to dk = 64 dimensions. Each head can specialize: one might track syntactic dependencies, another co-reference, another proximity. Their outputs are concatenated and projected back to 512 dimensions.
The division by √dk (= √64 = 8) prevents the dot products from growing large and pushing softmax into saturation with near-zero gradients. In the encoder, all positions can attend to all other positions — no masking.
Q = x @ W_Q # [seq × 64]
K = x @ W_K # [seq × 64]
V = x @ W_V # [seq × 64]
scores = (Q @ K.T) / sqrt(64) # [seq × seq]
weights = softmax(scores) # attention map
out = weights @ V # [seq × 64]
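The snippet above shows a single head. A runnable NumPy sketch of all 8 heads together (random stand-in weights, per-head projections stacked for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, h, d_k = 3, 512, 8, 64

x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))    # output projection

heads = []
for i in range(h):                           # each head attends independently
    Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]     # [seq, 64] each
    A = softmax(Q @ K.T / np.sqrt(d_k))              # [seq, seq] attention map
    heads.append(A @ V)                              # [seq, 64]
out = np.concatenate(heads, axis=-1) @ W_O           # concat → [seq, 512]
```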
The decoder generates output one token at a time. During training with teacher forcing, all target tokens are fed in simultaneously — but we must prevent position 3 from "seeing" position 5 (that would be cheating). This is enforced via causal masking.
Before the softmax, the attention scores for all future positions are set to −∞. Since e^(−∞) = 0, those positions receive zero weight after the softmax. The result is a triangular attention pattern: each position can only attend to itself and the positions before it.
scores_masked = scores + mask
# where mask[i,j] = 0 if j ≤ i, else −∞
weights = softmax(scores_masked)
This masking is what makes the decoder autoregressive — the prediction at each position depends only on known preceding outputs. During inference this is naturally satisfied since future tokens haven't been generated yet, but the mask is still applied for consistency.
Like the encoder's self-attention, this also uses 8 heads with dk = 64. The Q, K, V all come from the decoder's own partial output sequence.
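A small NumPy sketch of the causal mask (dummy zero scores stand in for Q·Kᵀ/√dk):

```python
import numpy as np

seq = 4
# strictly upper-triangular mask: 0 where j <= i, -inf where j > i (future)
mask = np.triu(np.full((seq, seq), -np.inf), k=1)

scores = np.zeros((seq, seq))            # stand-in for Q·Kᵀ/√d_k
masked = scores + mask
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
# row i is nonzero only over positions 0..i; here it comes out uniform
```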
This is where the decoder "reads" the encoded source sentence. It's structurally identical to self-attention, but the Q, K, V come from different places — which is the key distinction:
Q (Query) comes from the decoder's previous sub-layer — representing "what is the decoder currently looking for?"
K (Key) and V (Value) come from the encoder's final output — representing the full, contextualized representation of the source sentence.
This allows every decoder position to attend over every position in the source input simultaneously. It's the mechanism that connects source and target — the decoder can focus on the relevant part of the input when generating each output word.
Q = decoder_state @ W_Q ← from decoder
K = encoder_output @ W_K ← from encoder
V = encoder_output @ W_V ← from encoder
CrossAttention(Q, K, V) = softmax(QKᵀ/√d_k) · V
This replaces the classic "attention mechanism" from earlier seq2seq models (Bahdanau et al. 2015) — the idea of aligning source and target — but does so in parallel across all positions rather than sequentially.
This small block does two things that are critical for training deep networks:
Residual Connection (Add): The input to any sub-layer is added directly to its output — x + sublayer(x). This creates a "highway" for gradients to flow backwards through many layers without vanishing. Borrowed from ResNets (He et al. 2016), this is why you can stack 6 (or 48, or 96) layers and still train effectively.
Layer Normalization (Norm): After adding, the result is normalized across the feature dimension — for each token independently, subtract the mean and divide by the standard deviation, then apply learned scale (γ) and shift (β) parameters. This stabilizes the distribution of activations across layers, allowing higher learning rates and faster convergence.
LayerNorm(z) = γ · (z − μ) / √(σ² + ε) + β
where μ, σ² are computed per token across the 512 dims
Note: this is Layer Norm (normalizes across features for each sample), not Batch Norm (normalizes across samples for each feature). LayerNorm works better for sequences of variable length.
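A minimal NumPy sketch of the formula above, normalizing each token's features independently (scalar γ, β for simplicity — in practice they are learned 512-dim vectors):

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # per-token statistics over the feature (last) dimension
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(3, 512) * 5 + 2     # arbitrary activations
y = layer_norm(x)
# each row of y now has mean ≈ 0 and std ≈ 1
```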
After attention has mixed information across positions, the feed-forward network processes each position independently and identically. Think of attention as the "communication" step (tokens talk to each other) and the FFN as the "computation" step (each token thinks on its own).
It's a simple two-layer MLP: expand from 512 → 2048 with ReLU, then project back 2048 → 512. The intermediate 4× expansion gives the model more representational capacity to transform features.
W₁ : [512 → 2048]
W₂ : [2048 → 512]
Activation: ReLU = max(0, x)
The weights W₁ and W₂ are shared across all positions within a layer but different between layers. This is equivalent to two 1×1 convolutions. While attention handles dependencies between positions, the FFN provides the depth that allows complex feature transformations within each position.
In modern LLMs (GPT-3, etc.) this FFN accounts for the majority of parameters — often 2/3 of total model parameters — and has been shown to act as a key-value memory store.
x = relu(x @ W1 + b1) # → [seq × 2048]
x = x @ W2 + b2 # → [seq × 512]
return x
After the final decoder layer, each position has a 512-dimensional representation. To produce a probability distribution over the vocabulary, we first project from 512 dimensions up to the full vocabulary size (37,000 for English-German).
This is a simple matrix multiplication: [seq × 512] → [seq × 37000]. The result (called logits) is a raw score for each token in the vocabulary at each position. Higher logit = more likely next token.
This weight matrix is shared with the input and output embedding matrices — the same transformation that maps token IDs to vectors is (transposed) used to map decoder representations back to token probabilities. This weight tying significantly reduces parameters and tends to improve performance.
shape: [seq × 512] → [seq × 37000]
The final softmax converts the 37,000-dimensional logit vector into a valid probability distribution — all values between 0 and 1, summing to exactly 1. Each value is the model's estimated probability that this token is the correct next word.
Output: probability over all 37,000 tokens
During training: cross-entropy loss is computed against the ground-truth token. With label smoothing (ε = 0.1), the target distribution puts roughly 0.9 on the correct token and spreads the remaining 0.1 uniformly over the rest of the vocabulary — preventing overconfidence.
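Building the smoothed target is a one-liner. A NumPy sketch of one common formulation (uniform ε mass over the whole vocabulary plus 1 − ε on the true token; exact placement of the ε mass varies slightly by implementation):

```python
import numpy as np

def smoothed_target(correct_id, vocab_size, eps=0.1):
    # every token gets eps/vocab_size; the correct one gets 1-eps on top
    t = np.full(vocab_size, eps / vocab_size)
    t[correct_id] += 1.0 - eps
    return t

t = smoothed_target(correct_id=7, vocab_size=37000)
```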
During inference: rather than always picking the highest-probability token (greedy), the paper uses beam search with beam size 4. This keeps the 4 most probable partial sequences at each step, producing better overall sequences at the cost of some computation. A length penalty (α=0.6) prevents the model from favoring shorter outputs.
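A toy sketch of beam search over a hypothetical next-token model (a fixed 3-token distribution stands in for the decoder; the length penalty follows the (5+len)/6 form from Wu et al. 2016, which the paper adopts):

```python
import numpy as np

def beam_search(step_logprobs, beam=4, steps=3, alpha=0.6):
    # step_logprobs(seq) returns log-probs over the vocab for the next token
    beams = [((), 0.0)]                  # (token tuple, total log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_logprobs(seq)):
                candidates.append((seq + (tok,), score + lp))
        # keep only the `beam` highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    # pick the winner under the length-normalized score
    def norm(c):
        seq, score = c
        return score / (((5 + len(seq)) / 6.0) ** alpha)
    return max(beams, key=norm)[0]

# hypothetical model: always prefers token 0, then 1, then 2
toy = lambda seq: np.log(np.array([0.6, 0.3, 0.1]))
best = beam_search(toy, beam=4, steps=3)   # → (0, 0, 0)
```

With a real decoder, `step_logprobs` would run the model on the partial sequence and return the final softmax in log space; the rest of the loop is unchanged.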