Transformer Architecture, Attention Heads, and the LLM Creation Pipeline
Introduction
The 2017 paper Attention Is All You Need (Vaswani et al.) introduced the Transformer — a sequence transduction model built entirely on attention mechanisms, dispensing with recurrence and convolution altogether. It is the foundation of every modern large language model: GPT, BERT, Claude, Gemini, LLaMA, and their successors all descend directly from this architecture.
This post works through the Transformer from first principles, assuming a software engineering background but no prior deep learning experience. It proceeds in four parts: the architecture itself, the data pipeline from raw text to trained model, the complete LLM creation and alignment pipeline, and a detailed treatment of attention heads — what they are, what they learn, and crucially, what makes them differ from each other.
1. The Transformer Architecture
1.1 The Problem It Solves
Before Transformers, sequence modeling used RNNs — effectively a for-loop over tokens. By token 50 the model had “forgotten” much about token 1, and the sequential dependency prevented parallelization. The Transformer’s key insight is attention: let every token look at every other token simultaneously and decide what is relevant — O(1) sequential operations, fully parallelizable, with explicit long-range dependencies.
1.2 Encoder-Decoder Structure
The Transformer maps directly to an encode-then-decode pattern. The Encoder (N=6 identical layers) reads the full input and builds a rich contextual representation via multi-head self-attention followed by a position-wise feed-forward network. The Decoder (also N=6 layers) generates output autoregressively, adding masked self-attention (preventing future peeking) and a cross-attention sub-layer that reads from the encoder’s output. Residual connections — output = LayerNorm(x + sublayer(x)) — wrap every sub-layer, providing a clean gradient highway through deep stacks.
Figure 1. Encoder-Decoder Architecture.
Key elements of the encoder-decoder architecture:
- Encoder — builds contextual representations of the full source sequence, producing K and V matrices.
- Cross-attention — injects encoder K and V into every decoder layer, enabling full-input attention at each depth.
- Decoder — generates output autoregressively, consuming its own previous tokens one at a time.
- Masked self-attention — applies −∞ to future positions (j > i), enforcing left-to-right causal generation.
1.3 Scaled Dot-Product Attention
The core operation:
Q = x @ W_Q # "what am I looking for?" [seq × d_k]
K = x @ W_K # "what do I offer?" [seq × d_k]
V = x @ W_V # "what do I carry?" [seq × d_v]
scores = Q @ K.T / sqrt(d_k) # pairwise relevance [seq × seq]
weights = softmax(scores) # normalize to probabilities
output = weights @ V # weighted blend of values
Formally: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Dividing by √d_k prevents dot products from growing large in high dimensions, which would push softmax into saturation and produce near-zero gradients. The softmax-normalized weights matrix is the attention pattern — a [seq × seq] distribution where each row describes how much one token attends to every other.
Figure 2. Scaled Dot-Product Attention.
Steps in the computation:
- Score — dot-product of Q and K, scaled by 1/√d_k to stabilise gradients in high dimensions.
- Mask (decoder only) — sets future-position scores to −∞ before normalisation.
- Softmax — normalises scores into attention weights summing to 1.0 per query position.
- Output — weighted sum of V vectors; high-attention tokens contribute proportionally more.
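The steps above can be run as a minimal NumPy sketch. The toy dimensions (seq=4, d_k=8) are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise relevance [seq, seq]
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # decoder: hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_k = 4, 8
x = rng.normal(size=(seq, d_k))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
# each row of w sums to 1.0; out has shape [seq, d_k]
```

Passing a lower-triangular boolean matrix as `mask` reproduces the decoder's causal masking.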
1.4 Multi-Head Attention
A single head produces one weighted average per position. Multi-head attention runs H=8 independent heads in lower-dimensional subspaces (d_k = d_model/H = 64), concatenates their outputs, and projects back:
heads = [Attention(x @ W_Q_i, x @ W_K_i, x @ W_V_i) for i in range(8)]
output = Concat(heads) @ W_O # [seq × d_model]
Each head learns a different relationship type. Total compute ≈ one full-dimension head, but the model gains 8 independent perspectives per token.
Figure 3. Multi-Head Attention.
How multi-head attention works:
- Parallel projection — each of H=8 heads projects to d_k=64 with its own W_Q, W_K, W_V matrices.
- Specialisation — each head learns a distinct relational pattern (syntactic, semantic, positional, etc.).
- Concatenation — outputs are concatenated to [seq × 512] and projected back to d_model by W_O.
- Efficiency — total compute matches one full-dimension head, with 8× representational diversity.
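The split-project-concatenate flow can be sketched in NumPy. The loop over heads is for clarity (real implementations fuse heads into batched matrix multiplies), and the 0.02 init scale is an arbitrary choice:

```python
import numpy as np

def multi_head_attention(x, W_Q, W_K, W_V, W_O, H=8):
    """x: [seq, d_model]; W_Q/W_K/W_V are lists of one [d_model, d_k] matrix per head."""
    heads = []
    for i in range(H):
        Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]      # [seq, d_k] each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        heads.append(w @ V)                               # one perspective per head
    return np.concatenate(heads, axis=-1) @ W_O           # [seq, H*d_k] -> [seq, d_model]

rng = np.random.default_rng(1)
d_model, d_k, H, seq = 512, 64, 8, 10
mk = lambda: [rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(H)]
W_Q, W_K, W_V = mk(), mk(), mk()
W_O = rng.normal(scale=0.02, size=(H * d_k, d_model))
out = multi_head_attention(rng.normal(size=(seq, d_model)), W_Q, W_K, W_V, W_O)
# out has shape [seq, d_model] = [10, 512]
```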
1.5 Feed-Forward Networks
After attention mixes information across positions, each position is processed independently by a two-layer MLP:
FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
# W1: [512 → 2048], W2: [2048 → 512]
The 4× inner expansion gives representational capacity to transform features. FFN weights account for roughly two thirds of all model parameters and act as key-value memory stores for factual associations.
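A direct sketch of the position-wise FFN with the base-model shapes; the init scale is arbitrary:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP applied independently to each [d_model] row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU expansion, then projection

rng = np.random.default_rng(2)
d_model, d_ff, seq = 512, 2048, 10
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
out = ffn(rng.normal(size=(seq, d_model)), W1, b1, W2, b2)
# output shape matches input: [seq, d_model]
```

Because the FFN sees one position at a time, all cross-token mixing must happen in the attention sub-layer that precedes it.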
1.6 Positional Encoding
Attention is a set operation with no inherent notion of order. Fixed sinusoidal encodings are added to embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/512))
PE(pos, 2i+1) = cos(pos / 10000^(2i/512))
Low-index dimensions oscillate rapidly (encoding local position); high-index dimensions oscillate slowly (encoding global position). Because PE(pos+k) is a linear function of PE(pos), the model can easily learn to attend by relative offset.
Figure 4. Positional Encoding Heatmap.
Reading the heatmap:
- Axes — columns = positions 0–11; rows = dimension bands. Blue = positive, white = zero, orange = negative.
- Top rows (low-index) — fast sine cycles make neighbouring positions clearly distinct.
- Bottom rows (high-index) — slow variation provides a coarser global position signal.
- Result — the 512-dimension vector produces a unique fingerprint for every sequence position.
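The two formulas can be implemented in a few lines; this sketch fills even dimensions with sines and odd dimensions with cosines:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # [max_len, 1] positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(128)
# every row is a distinct fingerprint; at pos=0 all sines are 0 and all cosines are 1
```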
1.7 Model Dimensions (Base Transformer)
| Parameter | Value |
|---|---|
| d_model | 512 |
| N (layers) | 6 |
| H (heads) | 8 |
| d_k = d_v | 64 |
| d_ff | 2048 |
| Vocab | ~37,000 |
2. From Raw Text to Q, K, V
The path from a string to the Q, K, V matrices involves four discrete transforms.
Figure 5. Input Pipeline: Raw Text to Q, K, V.
Six steps transform a raw string into the attention inputs Q, K, V:
- ① Raw Text — the verbatim input string (a Python str); no processing has occurred yet.
- ② Token IDs — the tokenizer (BPE) splits the string into subword units and maps each to an integer from a frozen vocabulary; the sequence length seq is fixed here.
- ③ Embeddings — each integer ID is looked up in a learned matrix E ∈ ℝ^(vocab × d_model) and the result is scaled by √d_model to keep variance stable across layers.
- ④ x [seq × d_model] — fixed sinusoidal positional vectors are added element-wise to the embeddings, encoding each token's absolute position; x is the shared input to every transformer layer.
- ⑤ Q, K, V — x is linearly projected three times via learned weight matrices W_Q, W_K, W_V (each ℝ^(d_model × d_k)); the three projections can attend to different aspects of the same input.
- ⑥ Attention Output — scaled dot-product attention combines the projections: softmax(QKᵀ / √d_k) · V, yielding a context-weighted summary of shape [seq × d_v].
2.1 Tokenization
BPE iteratively merges the most frequent adjacent byte pair until the vocabulary reaches the target size (32k–100k tokens). The tokenizer is trained once then permanently frozen — vocab.json is a permanent artifact that must ship with the model at inference time.
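The merge rule can be shown with a toy sketch on the classic low/lower/newest/widest corpus. Real tokenizers (SentencePiece, tiktoken, and similar) handle pre-tokenization and symbol boundaries that this string-replace shortcut ignores:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE step: count adjacent symbol pairs, merge the most frequent one."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    (a, b), _ = pairs.most_common(1)[0]
    # simplified merge: replace the pair wherever it occurs (toy corpora only)
    merged = {w.replace(f"{a} {b}", f"{a}{b}"): f for w, f in words.items()}
    return merged, (a, b)

# each word is pre-split into characters, with its corpus frequency
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    words, pair = bpe_merge_step(words)
    merges.append(pair)
# first merge is ('e', 's'): it appears 6 + 3 = 9 times, the most frequent pair
```

The recorded merge list, applied in order, is exactly what ships in the frozen tokenizer artifact alongside vocab.json.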
2.2 Embedding
Each integer ID is converted to a dense [d_model]-dimensional learnable vector. Similar tokens converge to nearby positions in embedding space during training. The embedding matrix is weight-tied with the pre-softmax output projection — the same matrix used to decode output logits — reducing parameters and improving gradient flow.
2.3 Producing Q, K, V
x = embedding + positional_encoding is the input to every transformer layer. Three learned linear projections produce Q, K, V for each head. The weight matrices W_Q, W_K, W_V are the entire learned “personality” of a head — the attention pattern is the result of applying those weights to the current input, not something stored in the model.
3. The LLM Creation Pipeline
Modern LLMs follow a well-defined multi-stage pipeline. Each stage produces artifacts consumed by the next.
3.1 Training Data Collection
LLMs train on trillions of tokens from diverse sources. Data mix matters as much as volume — GPT-3’s approximate composition:
| Source | Weight | Sampling rate |
|---|---|---|
| Common Crawl | 60% | 0.44× per epoch |
| Books | 22% | 2.0× per epoch |
| Wikipedia | 3% | 3.4× per epoch |
| Other | 15% | ~2.0× per epoch |
Code data improves general reasoning beyond coding — code’s explicit logic chains transfer to multi-step problem solving. The Phi model results showed a 1.3B parameter model on textbook-quality data outperforming much larger models on raw web data, shifting the field toward quality over quantity.
3.2 Data Processing
Raw data cannot be used directly. Deduplication (MinHash + LSH on n-gram shingles) prevents the model from memorizing repeated text. Quality filtering (KenLM perplexity, heuristic rules, classifiers) removes spam, boilerplate, and gibberish. PII removal (regex + NER) redacts emails, phone numbers, and identifiers. Toxic content filtering down-weights harmful categories.
3.3 Tokenizer Training
A BPE or SentencePiece tokenizer is trained on a corpus sample and frozen before any model training begins. Special tokens — [BOS], [EOS], [PAD], chat turn tokens — are assigned reserved IDs with semantic meaning for the model’s operation.
3.4 Architecture Definition
The config file is specified before weight initialization and is essentially permanent — changing it requires retraining from scratch.
GPT-3 (175B): d_model=12288, n_layers=96, n_heads=96, d_ff=49152, ctx=2048
LLaMA-7B: d_model=4096, n_layers=32, n_heads=32, d_ff=11008, ctx=4096
Mistral-7B: d_model=4096, n_layers=32, n_heads=32, GQA-8, ctx=8192
Key modern choices: GQA (Grouped-Query Attention) for reduced KV cache memory; RoPE positional encoding for length extrapolation; RMSNorm (drops mean subtraction, ~10% faster); SwiGLU activation in FFN layers.
3.5 Pre-Training
Objective: next-token prediction — minimize cross-entropy loss averaged over all positions.
L = -1/T × Σᵢ log P(xᵢ | x₁...xᵢ₋₁ ; θ)
Perplexity = exp(L) → random (50k vocab): 50,000 | GPT-3 PTB: ~20
Optimizer: AdamW (β₁=0.9, β₂=0.95) with cosine LR decay, linear warmup, gradient clipping (norm=1.0). Infrastructure: FSDP data parallelism, tensor parallelism splitting attention heads across GPUs, pipeline parallelism across layers.
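The relationship between the loss and the perplexity figures above can be verified directly:

```python
import math

def cross_entropy(probs_of_correct_token):
    """Next-token cross-entropy, averaged over all T positions."""
    T = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / T

# an untrained model over a 50k vocab assigns each correct token p = 1/50000
uniform_loss = cross_entropy([1 / 50_000] * 100)
# a model at perplexity ~20 assigns the correct token p = 1/20 on average
trained_loss = cross_entropy([1 / 20] * 100)
# perplexity = exp(loss): 50,000 for the random model, 20 for the trained one
```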
3.6 Alignment
The base model predicts text unconditionally. Three sequential stages transform it into a safe, helpful assistant.
Figure 6. Alignment Pipeline: SFT → Reward Model → RLHF/DPO.
Three stages of the alignment pipeline:
- SFT — fine-tunes on curated (instruction, response) pairs; loss computed on response tokens only.
- Reward Model — trained on pairwise comparisons to produce a scalar human-preference score.
- RLHF / PPO — maximises reward with a KL penalty against the SFT policy to prevent reward hacking.
- DPO — reformulates the RLHF objective as direct classification on preference pairs, bypassing the reward model.
- SFT: fine-tune on human-written (instruction, response) pairs; loss only on response tokens. Small data (10k–1M examples), high impact.
- Reward Model: trained on (prompt, chosen, rejected) triples; loss = -log σ(r(chosen) − r(rejected)). Scores any response in milliseconds as a proxy for human judgment.
- RLHF / DPO: optimize the policy against the RM. PPO adds a KL penalty to prevent reward hacking. DPO reformulates RLHF as direct supervised classification on preference pairs — simpler, more stable, and now dominant in open models.
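The reward-model loss is compact enough to write out; this sketch evaluates it at a few points to show the behavior it encourages:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the RM already prefers the chosen response: loss is small
confident = reward_model_loss(2.0, -1.0)
# the RM prefers the rejected response: loss is large, pushing scores apart
wrong = reward_model_loss(-1.0, 2.0)
# equal scores give -log(0.5), i.e. the model is indifferent
indifferent = reward_model_loss(0.0, 0.0)
```

Only the score difference matters, which is why reward scores have no absolute scale across models.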
3.7 Model Artifacts
The trained model is a collection of weight tensors on disk:
| Artifact | Shape | Notes |
|---|---|---|
| Token embeddings | [vocab × d_model] | Shared with output projection |
| W_Q, W_K, W_V | [d_model × d_k] each | L × H sets; often fused as W_QKV |
| W_O | [H·d_v × d_model] | One per layer |
| W1, W2 (FFN) | [d_model × d_ff], [d_ff × d_model] | ~66% of total params |
| γ, β (norm) | [d_model] each | < 0.5% of total params |
Attention patterns ([seq × seq]) are not stored — computed fresh each forward pass. The KV cache (K, V per token per layer) is stored at inference time: ~524 KB/token for LLaMA-7B.
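The ~524 KB/token figure follows from the LLaMA-7B config (32 layers, d_model=4096, FP16), since the cache stores one K and one V vector per layer per token:

```python
def kv_cache_per_token(n_layers, d_model, bytes_per_value=2):
    """Bytes of KV cache per generated token; 2 bytes/value assumes FP16."""
    return 2 * n_layers * d_model * bytes_per_value  # leading 2 = one K + one V

llama_7b = kv_cache_per_token(n_layers=32, d_model=4096)
# 524,288 bytes per token; a full 4096-token context costs ~2.1 GB of cache
```

This is the memory pressure that GQA and PagedAttention (next section) are designed to relieve.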
3.8 Deployment
- Quantization: FP16 → INT4 (GPTQ, AWQ) shrinks 70B from 140 GB to 35 GB, 1–5% quality loss.
- PagedAttention (vLLM): manages KV cache like OS virtual memory — near-100% GPU utilization vs ~30% naive.
- Continuous batching: finished sequences leave and new requests join the batch at each generation step.
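The quantization arithmetic above is just parameter count times bits per parameter:

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes here, ignoring overhead)."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = model_memory_gb(70e9, 16)  # 70B weights at FP16 -> 140 GB
int4 = model_memory_gb(70e9, 4)   # same weights at INT4 -> 35 GB
```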
4. Attention Heads in Depth
4.1 What an Attention Head Is
A single head is one complete execution of scaled dot-product attention with its own W_Q, W_K, W_V. Those weight matrices are the head’s entire personality — they determine what it looks for, responds to, and passes along. The attention pattern is a runtime result, not a stored parameter.
4.2 What Heads Learn to Do
Interpretability research (Elhage et al. 2021, A Mathematical Framework for Transformer Circuits) has found that heads reliably develop distinct behaviors:
- Syntactic heads: connect verbs to subjects, adjectives to nouns — across arbitrary distances, with no grammar supervision.
- Coreference heads: resolve pronouns. Processing “it” in “the cat sat because it was tired” → near-exclusive attention on “cat”.
- Positional heads: attend at a nearly fixed offset (±1 or ±2 tokens), capturing local n-gram patterns.
- Copying heads: W_Q·W_Kᵀ structured so tokens attend to similar tokens; W_V·W_O structured to copy the embedding forward. Core component of induction circuits.
- Induction heads: two-layer circuit. Layer N’s previous-token head passes a “what came before me” signal; layer N+1’s induction head uses it to find tokens that follow the same predecessor — the basis of in-context learning.
4.3 Layer Depth and Head Behavior
Figure 7. Attention Head Roles by Layer Depth.
Head behaviour varies systematically with depth:
- Early layers — local syntactic scaffolding: positional, POS-tag, and copying heads within short windows.
- Middle layers — long-range semantic and coreference functions requiring full-sequence attention.
- Late layers — hardest to interpret; primarily route information toward next-token prediction.
- Cross-architecture — this depth progression is consistent across model families and scales.
4.4 What Makes Heads Differ: The Mechanism
Step 1 — Random initialization breaks symmetry. Identical initial weights → identical outputs → identical gradients → no divergence. Random initialization creates immediate differences that compound over hundreds of thousands of steps into completely distinct learned functions.
Step 2 — W_O feedback loop enforces specialization. W_O learns to expect different information from different heads. Head 3 drifts toward subject-tracking → W_O routes head 3’s output wherever subject information is needed → head 3 receives stronger gradient for that behavior → it specializes further. Heads negotiate a division of labor purely through gradient descent.
Step 3 — Two independent degrees of freedom per head.
Figure 8. QK and OV Circuits in Each Attention Head.
Two independent circuits per head:
- QK circuit (W_Q · W_Kᵀ) — controls where attention flows: which token pairs produce high scores.
- OV circuit (W_V · W_O) — controls what moves: features read from attended tokens into the residual stream.
- Independence — identical QK patterns can carry completely different information via different OV circuits.
- Example — Heads A, B, C share the same attention weights but extract syntax, semantics, and position respectively.
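The independence of the two circuits can be demonstrated numerically. This sketch (toy dimensions; the per-head W_O projection is omitted for brevity) gives two heads the same QK circuit but different value projections:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_k, seq = 64, 16, 6
x = rng.normal(size=(seq, d_model))

W_Q = rng.normal(scale=0.1, size=(d_model, d_k))
W_K = rng.normal(scale=0.1, size=(d_model, d_k))
QK = W_Q @ W_K.T                        # [d_model, d_model]: where attention flows
scores = x @ QK @ x.T / np.sqrt(d_k)    # identical to (x W_Q)(x W_K)^T / sqrt(d_k)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)           # the shared attention pattern

# two heads sharing this QK circuit but carrying different features forward
W_V_a, W_V_b = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(2))
out_a, out_b = w @ x @ W_V_a, w @ x @ W_V_b
# same attention pattern, completely different information moved
```

Because attention scores depend only on the product W_Q·W_Kᵀ, the individual Q and K matrices are not separately identifiable; interpretability work therefore analyzes the combined QK and OV circuits directly.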
Step 4 — Not all heads matter equally. Michel et al. (2019) showed that most heads can be pruned with minimal degradation. The load-bearing heads are few; the rest provide redundancy that regularizes training.
4.5 A Concrete Mental Model
Think of each head as a specialist reading the same document in parallel:
- Head A (syntactician): “who is the subject of each verb?”
- Head B (coreference): “what does each pronoun refer to?”
- Head C (proximity): “what came immediately before this word?”
- Head D (semantic): “what other tokens are topically related?”
W_O combines all their outputs into one enriched representation. No roles were assigned — they emerged as the most efficient division of labor for minimizing next-token prediction loss.
Conclusion
The Transformer’s elegance is that a single objective — predict the next token — applied at scale, produces grammar, factual recall, reasoning, and world knowledge as emergent byproducts. The key practitioner takeaways:
- Head specialization emerges from random initialization + gradient feedback through W_O — not architectural design.
- Every head has two independent degrees of freedom: the QK circuit (where attention flows) and the OV circuit (what information moves).
- FFN layers (~66% of parameters) act as key-value memory stores; most factual recall happens there, not in attention.
- The KV cache dominates inference memory at production scale — not the model weights.
- Data quality beats data quantity: Chinchilla scaling laws and the Phi results both show a smaller model on better data often outperforms a larger model on noisier data.
References
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
- Michel, P. et al. (2019). Are Sixteen Heads Really Better than One? NeurIPS 2019.
- Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.
- Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind.
- Touvron, H. et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.
- Rafailov, R. et al. (2023). Direct Preference Optimization. NeurIPS 2023.
- He, K. et al. (2016). Deep Residual Learning for Image Recognition. CVPR 2016.