Click any block in the diagram to explore it in detail.
The Transformer processes sequences entirely through attention — no recurrence, no convolution. Each block on the left is a distinct computation stage. Start with Input Embedding to follow the data flow from top to bottom.
The raw input sentence is first tokenized into subword units using byte-pair encoding (BPE), producing a sequence of integer IDs. The embedding layer converts each integer ID into a dense, learnable vector of dimension dmodel = 512.
This is implemented as a simple lookup table — a matrix of shape [vocab_size × 512] where each row corresponds to one token in the vocabulary. Initially random, these vectors are learned during training to place semantically similar tokens nearby in the 512-dimensional space.
The paper also shares the embedding weight matrix with the pre-softmax linear transformation, and scales the embedding values by √dmodel to keep them in a suitable range relative to the positional encodings added next.
token_ids = [464, 3797, 3332] # "The cat sat"
# Embedding lookup
E = nn.Embedding(num_embeddings=37000, embedding_dim=512)
x = E(token_ids) * sqrt(512) # → [3 × 512]
The decoder also begins with an embedding layer, identical in structure to the encoder's. During training, the decoder receives the ground-truth target tokens shifted one position to the right — this is called teacher forcing. If the target is "The cat sat", the decoder input is "<START> The cat sat" and it tries to predict "The cat sat <END>".
During inference, the decoder runs autoregressively: it starts with just <START> and generates one token at a time, feeding each prediction back as the next input. The masking mechanism (see Masked Attention) prevents it from peeking ahead at future positions.
The same weight matrix is shared between the input embedding, output embedding, and the pre-softmax linear layer — a technique from "Using the Output Embedding to Improve Language Models" that reduces parameters and improves generalization.
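As a concrete sketch of this weight tying (NumPy, with random stand-in values — the snippets elsewhere use PyTorch-style pseudocode), one shared matrix serves both as the embedding lookup and, transposed, as the pre-softmax projection:

```python
import numpy as np

vocab, d = 37000, 512
E = np.random.randn(vocab, d) * 0.02   # one shared matrix (hypothetical init)

ids = np.array([464, 3797, 3332])      # "The cat sat"
x = E[ids] * np.sqrt(d)                # embedding lookup → [3, 512]

h = np.random.randn(3, d)              # stand-in for final decoder states
logits = h @ E.T                       # pre-softmax projection → [3, 37000]
```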
Attention is a set operation — it's order-agnostic. Without additional information, "cat sat on the mat" and "mat the on sat cat" would produce identical attention outputs. Positional encodings solve this by adding a unique position-dependent vector to each token embedding.
The paper uses fixed sinusoidal encodings rather than learned ones. Each position gets a 512-dimensional vector where even dimensions use sine and odd dimensions use cosine, at geometrically spaced frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/512))
PE(pos, 2i+1) = cos(pos / 10000^(2i/512))
The wavelengths range from 2π (high frequency, captures local position) to 10000·2π (low frequency, captures global position). This was chosen because for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos) — allowing the model to easily attend by relative offset.
The encoding is simply added to the embedding (not concatenated), keeping the dimension at 512. Learned embeddings were tested and produced nearly identical results, but sinusoids were chosen as they can generalize to longer sequences than seen during training.
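The sinusoid table can be computed once up front. A minimal NumPy sketch of the formulas above (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                  # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]               # [1, d_model/2]
    angles = pos / np.power(10000.0, 2 * i / d_model)  # geometric frequencies
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = positional_encoding(50)
# each row pe[pos] is added to the token embedding at that position
```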
This is the core operation of the Transformer. Every token in the input simultaneously computes how much it should attend to every other token — including itself. The result for each token is a weighted blend of all token representations, weighted by relevance.
Three projections are learned: Query (what am I looking for?), Key (what do I offer?), Value (what information do I carry?). Q·Kᵀ measures compatibility between every pair of tokens; softmax normalizes these into probabilities; the result weights the Values.
This runs 8 times in parallel (multi-head), each head using projections down to dk = 64 dimensions. Each head can specialize: one might track syntactic dependencies, another co-reference, another proximity. Their outputs are concatenated and projected back to 512 dimensions.
The division by √dk (= √64 = 8) prevents the dot products from growing large and pushing softmax into saturation with near-zero gradients. In the encoder, all positions can attend to all other positions — no masking.
Q = x @ W_Q # [seq × 64]
K = x @ W_K # [seq × 64]
V = x @ W_V # [seq × 64]
scores = (Q @ K.T) / sqrt(64) # [seq × seq]
weights = softmax(scores) # attention map
out = weights @ V # [seq × 64]
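The snippet above shows a single head. A runnable NumPy sketch of all 8 heads together (random stand-in weights, per-head projections stacked for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, h, d_k = 3, 512, 8, 64

x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))    # output projection

heads = []
for i in range(h):                           # each head attends independently
    Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]     # [seq, 64] each
    A = softmax(Q @ K.T / np.sqrt(d_k))              # [seq, seq] attention map
    heads.append(A @ V)                              # [seq, 64]
out = np.concatenate(heads, axis=-1) @ W_O           # concat → [seq, 512]
```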
The decoder generates output one token at a time. During training with teacher forcing, all target tokens are fed in simultaneously — but we must prevent position 3 from "seeing" position 5 (that would be cheating). This is enforced via causal masking.
Before the softmax, the attention scores for all future positions are set to −∞. Since e^(−∞) = 0, those positions receive zero weight after the softmax. The result is a triangular attention pattern: each position can only attend to itself and the positions before it.
scores_masked = scores + mask
# where mask[i,j] = 0 if j ≤ i, else −∞
weights = softmax(scores_masked)
This masking is what makes the decoder autoregressive — the prediction at each position depends only on known preceding outputs. During inference this is naturally satisfied since future tokens haven't been generated yet, but the mask is still applied for consistency.
Like the encoder's self-attention, this also uses 8 heads with dk = 64. The Q, K, V all come from the decoder's own partial output sequence.
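A small NumPy sketch of the causal mask (dummy zero scores stand in for Q·Kᵀ/√dk):

```python
import numpy as np

seq = 4
# strictly upper-triangular mask: 0 where j <= i, -inf where j > i (future)
mask = np.triu(np.full((seq, seq), -np.inf), k=1)

scores = np.zeros((seq, seq))            # stand-in for Q·Kᵀ/√d_k
masked = scores + mask
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
# row i is nonzero only over positions 0..i; here it comes out uniform
```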
This is where the decoder "reads" the encoded source sentence. It's structurally identical to self-attention, but the Q, K, V come from different places — which is the key distinction:
Q (Query) comes from the decoder's previous sub-layer — representing "what is the decoder currently looking for?"
K (Key) and V (Value) come from the encoder's final output — representing the full, contextualized representation of the source sentence.
This allows every decoder position to attend over every position in the source input simultaneously. It's the mechanism that connects source and target — the decoder can focus on the relevant part of the input when generating each output word.
Q = decoder_state @ W_Q ← from decoder
K = encoder_output @ W_K ← from encoder
V = encoder_output @ W_V ← from encoder
CrossAttention(Q, K, V) = softmax(QKᵀ/√d_k) · V
This replaces the classic "attention mechanism" from earlier seq2seq models (Bahdanau et al. 2015) — the idea of aligning source and target — but does so in parallel across all positions rather than sequentially.
This small block does two things that are critical for training deep networks:
Residual Connection (Add): The input to any sub-layer is added directly to its output — x + sublayer(x). This creates a "highway" for gradients to flow backwards through many layers without vanishing. Borrowed from ResNets (He et al. 2016), this is why you can stack 6 (or 48, or 96) layers and still train effectively.
Layer Normalization (Norm): After adding, the result is normalized across the feature dimension — for each token independently, subtract the mean and divide by the standard deviation, then apply learned scale (γ) and shift (β) parameters. This stabilizes the distribution of activations across layers, allowing higher learning rates and faster convergence.
LayerNorm(z) = γ · (z − μ) / √(σ² + ε) + β
where μ, σ² are computed per token across the 512 dims
Note: this is Layer Norm (normalizes across features for each sample), not Batch Norm (normalizes across samples for each feature). LayerNorm works better for sequences of variable length.
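A minimal NumPy sketch of the formula above, normalizing each token's features independently (scalar γ, β for simplicity — in practice they are learned 512-dim vectors):

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # per-token statistics over the feature (last) dimension
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(3, 512) * 5 + 2     # arbitrary activations
y = layer_norm(x)
# each row of y now has mean ≈ 0 and std ≈ 1
```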
After attention has mixed information across positions, the feed-forward network processes each position independently and identically. Think of attention as the "communication" step (tokens talk to each other) and the FFN as the "computation" step (each token thinks on its own).
It's a simple two-layer MLP: expand from 512 → 2048 with ReLU, then project back 2048 → 512. The intermediate 4× expansion gives the model more representational capacity to transform features.
W₁ : [512 → 2048]
W₂ : [2048 → 512]
Activation: ReLU = max(0, x)
The weights W₁ and W₂ are shared across all positions within a layer but different between layers. This is equivalent to two 1×1 convolutions. While attention handles dependencies between positions, the FFN provides the depth that allows complex feature transformations within each position.
In modern LLMs (GPT-3, etc.) this FFN accounts for the majority of parameters — often 2/3 of total model parameters — and has been shown to act as a key-value memory store.
x = relu(x @ W1 + b1) # → [seq × 2048]
x = x @ W2 + b2 # → [seq × 512]
return x
After the final decoder layer, each position has a 512-dimensional representation. To produce a probability distribution over the vocabulary, we first project from 512 dimensions up to the full vocabulary size (37,000 for English-German).
This is a simple matrix multiplication: [seq × 512] → [seq × 37000]. The result (called logits) is a raw score for each token in the vocabulary at each position. Higher logit = more likely next token.
This weight matrix is shared with the input and output embedding matrices — the same transformation that maps token IDs to vectors is (transposed) used to map decoder representations back to token probabilities. This weight tying significantly reduces parameters and tends to improve performance.
shape: [seq × 512] → [seq × 37000]
The final softmax converts the 37,000-dimensional logit vector into a valid probability distribution — all values between 0 and 1, summing to exactly 1. Each value is the model's estimated probability that this token is the correct next word.
Output: probability over all 37,000 tokens
During training: cross-entropy loss is computed against the ground-truth token. With label smoothing (ε = 0.1), the target distribution puts roughly 0.9 on the correct token and spreads the remaining 0.1 uniformly over the rest of the vocabulary — preventing overconfidence.
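Building the smoothed target is a one-liner. A NumPy sketch of one common formulation (uniform ε mass over the whole vocabulary plus 1 − ε on the true token; exact placement of the ε mass varies slightly by implementation):

```python
import numpy as np

def smoothed_target(correct_id, vocab_size, eps=0.1):
    # every token gets eps/vocab_size; the correct one gets 1-eps on top
    t = np.full(vocab_size, eps / vocab_size)
    t[correct_id] += 1.0 - eps
    return t

t = smoothed_target(correct_id=7, vocab_size=37000)
```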
During inference: rather than always picking the highest-probability token (greedy), the paper uses beam search with beam size 4. This keeps the 4 most probable partial sequences at each step, producing better overall sequences at the cost of some computation. A length penalty (α=0.6) prevents the model from favoring shorter outputs.
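A toy sketch of beam search over a hypothetical next-token model (a fixed 3-token distribution stands in for the decoder; the length penalty follows the (5+len)/6 form from Wu et al. 2016, which the paper adopts):

```python
import numpy as np

def beam_search(step_logprobs, beam=4, steps=3, alpha=0.6):
    # step_logprobs(seq) returns log-probs over the vocab for the next token
    beams = [((), 0.0)]                  # (token tuple, total log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_logprobs(seq)):
                candidates.append((seq + (tok,), score + lp))
        # keep only the `beam` highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    # pick the winner under the length-normalized score
    def norm(c):
        seq, score = c
        return score / (((5 + len(seq)) / 6.0) ** alpha)
    return max(beams, key=norm)[0]

# hypothetical model: always prefers token 0, then 1, then 2
toy = lambda seq: np.log(np.array([0.6, 0.3, 0.1]))
best = beam_search(toy, beam=4, steps=3)   # → (0, 0, 0)
```

With a real decoder, `step_logprobs` would run the model on the partial sequence and return the final softmax in log space; the rest of the loop is unchanged.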