17 minute read


Transformer Architecture & the LLM Pipeline


Introduction

The 2017 paper Attention Is All You Need (Vaswani et al.) introduced the Transformer — a sequence transduction model built entirely on attention mechanisms, dispensing with recurrence and convolution. It is the foundation of every modern large language model: GPT, BERT, Claude, Gemini, LLaMA, and their successors all descend directly from this architecture.

This post works through the Transformer from first principles, assuming a software engineering background but no prior deep learning experience. It proceeds in four parts: the architecture itself, the data pipeline from raw text to trained model, the complete LLM creation and alignment pipeline, and a detailed treatment of attention heads — what they are, what they learn, and crucially, what makes them differ from each other.


1. The Transformer Architecture

1.1 The Problem It Solves

Before Transformers, sequence modeling used RNNs — effectively a for-loop over tokens. By token 50 the model had “forgotten” much about token 1, and the sequential dependency prevented parallelization. The Transformer’s key insight is attention: let every token look at every other token simultaneously and decide what is relevant — O(1) sequential operations, fully parallelizable, with explicit long-range dependencies.

1.2 Encoder-Decoder Structure

The Transformer maps directly to an encode-then-decode pattern. The Encoder (N=6 identical layers) reads the full input and builds a rich contextual representation via multi-head self-attention followed by a position-wise feed-forward network. The Decoder (also N=6 layers) generates output autoregressively, adding masked self-attention (preventing future peeking) and a cross-attention sub-layer that reads from the encoder’s output. Residual connections — output = LayerNorm(x + sublayer(x)) — wrap every sub-layer, providing a clean gradient highway through deep stacks.


Figure 1. Encoder-Decoder Architecture.

Key elements of the encoder-decoder architecture:

  • Encoder — builds contextual representations of the full source sequence, producing K and V matrices.
  • Cross-attention — injects encoder K and V into every decoder layer, enabling full-input attention at each depth.
  • Decoder — generates output autoregressively, consuming its own previous tokens one at a time.
  • Masked self-attention — applies −∞ to future positions (j > i), enforcing left-to-right causal generation.

1.3 Scaled Dot-Product Attention

The core operation:

Q = x @ W_Q    # "what am I looking for?"   [seq × d_k]
K = x @ W_K    # "what do I offer?"         [seq × d_k]
V = x @ W_V    # "what do I carry?"         [seq × d_v]

scores  = Q @ K.T / sqrt(d_k)    # pairwise relevance  [seq × seq]
weights = softmax(scores)        # normalize to probabilities
output  = weights @ V            # weighted blend of values

Formally: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Dividing by √d_k prevents dot products from growing large in high dimensions, which would push softmax into saturation and produce near-zero gradients. The softmax output — the weights matrix — is the attention pattern: a [seq × seq] distribution where each row describes how much one token attends to every other.


Figure 2. Scaled Dot-Product Attention.

Steps in the computation:

  • Score — dot-product of Q and K, scaled by 1/√d_k to stabilise gradients in high dimensions.
  • Mask (decoder only) — sets future-position scores to −∞ before normalisation.
  • Softmax — normalises scores into attention weights summing to 1.0 per query position.
  • Output — weighted sum of V vectors; high-attention tokens contribute proportionally more.
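The pseudocode above can be made runnable. A minimal NumPy sketch with illustrative shapes and random inputs (not real model weights), including the decoder-side causal mask:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # [seq × seq] raw relevance
    if causal:                                    # decoder: hide future positions
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)     # large negative as a -inf stand-in
    weights = softmax(scores)                     # each row sums to 1.0
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_k, d_v = 5, 64, 64
Q = rng.normal(size=(seq, d_k))
K = rng.normal(size=(seq, d_k))
V = rng.normal(size=(seq, d_v))

out, w = attention(Q, K, V, causal=True)
```

With the causal mask in place, row 0 of the pattern puts all of its weight on position 0 — token 0 can only attend to itself.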

1.4 Multi-Head Attention

A single head produces one weighted average per position. Multi-head attention runs H=8 independent heads in lower-dimensional subspaces (d_k = d_model/H = 64), concatenates their outputs, and projects back:

heads  = [Attention(x @ W_Q_i, x @ W_K_i, x @ W_V_i) for i in range(8)]
output = Concat(heads) @ W_O     # [seq × d_model]

Each head learns a different relationship type. Total compute ≈ one full-dimension head, but the model gains 8 independent perspectives per token.


Figure 3. Multi-Head Attention.

How multi-head attention works:

  • Parallel projection — each of H=8 heads projects to d_k=64 with its own W_Q, W_K, W_V matrices.
  • Specialisation — each head learns a distinct relational pattern (syntactic, semantic, positional, etc.).
  • Concatenation — outputs are concatenated to [seq × 512] and projected back to d_model by W_O.
  • Efficiency — total compute matches one full-dimension head, with 8× representational diversity.
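The head loop is a short piece of NumPy. This sketch uses the base-model dimensions above with toy random weights, not a trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x, W_Q, W_K, W_V):
    # one complete scaled dot-product attention in a d_k-dim subspace
    d_k = W_Q.shape[-1]
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_k)
    return softmax(scores) @ (x @ W_V)            # [seq × d_k]

rng = np.random.default_rng(1)
seq, d_model, H = 10, 512, 8
d_k = d_model // H                                # 64 per head
x = rng.normal(size=(seq, d_model))

# one (W_Q, W_K, W_V) triple per head — the head's entire "personality"
W = [tuple(rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
     for _ in range(H)]
W_O = rng.normal(size=(H * d_k, d_model)) * 0.02

heads = [head(x, *w) for w in W]                  # 8 outputs of [seq × 64]
output = np.concatenate(heads, axis=-1) @ W_O     # back to [seq × 512]
```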

1.5 Feed-Forward Networks

After attention mixes information across positions, each position is processed independently by a two-layer MLP:

FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
# W1: [512 → 2048],  W2: [2048 → 512]

The 4× inner expansion gives representational capacity to transform features. FFN weights account for roughly two thirds of all model parameters and act as key-value memory stores for factual associations.
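The two-thirds claim is easy to verify from the shapes alone. A sketch of the FFN plus a per-layer parameter count at the base-model dimensions (toy weights, biases and norms ignored):

```python
import numpy as np

d_model, d_ff, seq = 512, 2048, 10
rng = np.random.default_rng(2)
x = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

# expand 4×, apply ReLU, project back — applied independently per position
ffn_out = np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# per-layer parameter split:
attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff       # W1, W2
# with d_ff = 4·d_model: 8d² / (4d² + 8d²) = 2/3 of per-layer weights
```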

1.6 Positional Encoding

Attention is a set operation with no inherent notion of order. Fixed sinusoidal encodings are added to embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/512))
PE(pos, 2i+1) = cos(pos / 10000^(2i/512))

Low-index dimensions oscillate rapidly (encoding local position); high-index dimensions oscillate slowly (encoding global position). Because PE(pos+k) is a linear function of PE(pos), the model can easily learn to attend by relative offset.


Figure 4. Positional Encoding Heatmap.

Reading the heatmap:

  • Axes — columns = positions 0–11; rows = dimension bands. Blue = positive, white = zero, orange = negative.
  • Top rows (low-index) — fast sine cycles make neighbouring positions clearly distinct.
  • Bottom rows (high-index) — slow variation provides a coarser global position signal.
  • Result — the 512-dimension vector produces a unique fingerprint for every sequence position.
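The full encoding table can be built in a few lines of NumPy. A sketch at the base model’s d_model = 512:

```python
import numpy as np

d_model, max_len = 512, 128
pos = np.arange(max_len)[:, None]        # [max_len × 1] positions
i = np.arange(0, d_model, 2)[None, :]    # the "2i" dimension indices
angle = pos / 10000 ** (i / d_model)     # one frequency per dimension pair

PE = np.zeros((max_len, d_model))
PE[:, 0::2] = np.sin(angle)              # even dims: sine
PE[:, 1::2] = np.cos(angle)              # odd dims: cosine
```

Position 0 comes out as the alternating vector [0, 1, 0, 1, …], and every value stays in [−1, 1], matching the heatmap’s colour range.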

1.7 Model Dimensions (Base Transformer)

Parameter    Value
d_model      512
N (layers)   6
H (heads)    8
d_k = d_v    64
d_ff         2048
Vocab        ~37,000

2. From Raw Text to Q, K, V

The path from a string to the Q, K, V matrices involves four discrete transforms.


Figure 5. Input Pipeline: Raw Text to Q, K, V.

Six steps transform a raw string into the attention inputs Q, K, V:

  • ① Raw Text — the verbatim input string (Python str); no processing has occurred yet.
  • ② Token IDs — the tokenizer (BPE) splits the string into subword units and maps each to an integer from a frozen vocabulary; the sequence length seq is fixed here.
  • ③ Embeddings — each integer ID is looked up in a learned matrix E ∈ ℝ^(vocab × d_model) and the result is scaled by √d_model to keep variance stable across layers.
  • ④ x [seq × d_model] — fixed sinusoidal positional vectors are added element-wise to the embeddings, encoding each token’s absolute position; x is the shared input to every transformer layer.
  • ⑤ Q, K, V — x is linearly projected three times via learned weight matrices W_Q, W_K, W_V (each ℝ^(d_model × d_k)); the three projections can attend to different aspects of the same input.
  • ⑥ Attention Output — scaled dot-product attention combines the projections: softmax(QKᵀ / √d_k) · V, yielding a context-weighted summary of shape [seq × d_v].

2.1 Tokenization

BPE iteratively merges the most frequent adjacent symbol pair until the vocabulary reaches the target size (32k–100k tokens). The tokenizer is trained once and then frozen — vocab.json is a permanent artifact that must ship with the model at inference time.
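The merge loop is simple enough to sketch end to end. A toy BPE trainer on a three-word corpus (illustrative only; production tokenizers add byte-level fallback, regex pre-splitting, and train on far larger corpora):

```python
from collections import Counter

def best_pair(words):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def apply_merge(words, pair):
    # rewrite every word, fusing each occurrence of the chosen pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# toy corpus: word → frequency, pre-split into characters
words = {tuple("low"): 7, tuple("lower"): 5, tuple("lowest"): 2}
merges = []
for _ in range(2):
    pair = best_pair(words)
    merges.append(pair)
    words = apply_merge(words, pair)
```

Two merges produce "lo" and then "low": the frequent stem becomes a single token while rarer suffixes stay split.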

2.2 Embedding

Each integer ID is converted to a dense [d_model]-dimensional learnable vector. Similar tokens converge to nearby positions in embedding space during training. The embedding matrix is weight-tied with the pre-softmax output projection — the same matrix used to decode output logits — reducing parameters and improving gradient flow.

2.3 Producing Q, K, V

x = embedding + positional_encoding is the input to every transformer layer. Three learned linear projections produce Q, K, V for each head. The weight matrices W_Q, W_K, W_V are the entire learned “personality” of a head — the attention pattern is the result of applying those weights to the current input, not something stored in the model.


3. The LLM Creation Pipeline

Modern LLMs follow a well-defined multi-stage pipeline. Each stage produces artifacts consumed by the next.

3.1 Training Data Collection

LLMs train on trillions of tokens from diverse sources. Data mix matters as much as volume — GPT-3’s approximate composition:

Source         Weight   Sampling rate
Common Crawl   60%      0.44× per epoch
Books          22%      2.0× per epoch
Wikipedia      3%       3.4× per epoch
Other          15%      ~2.0× per epoch

Code data improves general reasoning beyond coding — code’s explicit logic chains transfer to multi-step problem solving. The Phi model results showed a 1.3B parameter model on textbook-quality data outperforming much larger models on raw web data, shifting the field toward quality over quantity.

3.2 Data Processing

Raw data cannot be used directly. Deduplication (MinHash + LSH on n-gram shingles) prevents the model from memorizing repeated text. Quality filtering (KenLM perplexity, heuristic rules, classifiers) removes spam, boilerplate, and gibberish. PII removal (regex + NER) redacts emails, phone numbers, and identifiers. Toxic content filtering down-weights harmful categories.
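The MinHash step can be illustrated in a few lines. A toy sketch in which md5 with integer seeds stands in for a proper hash family (production systems use LSH banding on top of these signatures to avoid all-pairs comparison):

```python
import hashlib

def shingles(text, n=3):
    # n-gram shingles over whitespace tokens
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def signature(sh, num_hashes=64):
    # one min-hash per seeded hash function
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in sh)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # fraction of matching min-hashes estimates the true Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the cat sat on the mat and then the cat went to sleep"
b = "the cat sat on the mat and then the dog went to sleep"
sa, sb = shingles(a), shingles(b)
true_j = len(sa & sb) / len(sa | sb)                 # exact: 8/14
est_j = estimated_jaccard(signature(sa), signature(sb))
```

Near-duplicate pairs like these score high and get dropped; unrelated documents share almost no shingles and score near zero.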

3.3 Tokenizer Training

A BPE or SentencePiece tokenizer is trained on a corpus sample and frozen before any model training begins. Special tokens — [BOS], [EOS], [PAD], chat turn tokens — are assigned reserved IDs with semantic meaning for the model’s operation.

3.4 Architecture Definition

The config file is specified before weight initialization and is essentially permanent — changing it requires retraining from scratch.

GPT-3 (175B):  d_model=12288, n_layers=96, n_heads=96, d_ff=49152, ctx=2048
LLaMA-7B:      d_model=4096,  n_layers=32, n_heads=32, d_ff=11008,  ctx=4096
Mistral-7B:    d_model=4096,  n_layers=32, n_heads=32, GQA-8,       ctx=8192

Key modern choices: GQA (Grouped-Query Attention) for reduced KV cache memory; RoPE positional encoding for length extrapolation; RMSNorm (drops mean subtraction, ~10% faster); SwiGLU activation in FFN layers.

3.5 Pre-Training

Objective: next-token prediction — minimize cross-entropy loss averaged over all positions.

L = -1/T × Σᵢ log P(xᵢ | x₁...xᵢ₋₁ ; θ)
Perplexity = exp(L)   →  random (50k vocab): 50,000  |  GPT-3 PTB: ~20
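The perplexity bounds above are easy to reproduce. A sketch comparing a uniform (untrained) model over a 50k vocabulary with a confident one:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = 50_000

# untrained model: logits carry no information → uniform distribution
p = softmax(np.zeros(vocab))
loss = -np.log(p[123])          # cross-entropy for any target token
ppl = np.exp(loss)              # perplexity = exp(NLL) = vocab size

# confident, correct model: perplexity approaches 1
sharp = np.zeros(vocab)
sharp[123] = 20.0               # strong logit on the true token
ppl_sharp = np.exp(-np.log(softmax(sharp)[123]))
```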

Optimizer: AdamW (β₁=0.9, β₂=0.95) with cosine LR decay, linear warmup, gradient clipping (norm=1.0). Infrastructure: FSDP data parallelism, tensor parallelism splitting attention heads across GPUs, pipeline parallelism across layers.

3.6 Alignment

The base model predicts text unconditionally. Three sequential stages transform it into a safe, helpful assistant.


Figure 6. Alignment Pipeline: SFT → Reward Model → RLHF/DPO.

Three stages of the alignment pipeline:

  • SFT — fine-tunes on curated human-written (instruction, response) pairs, with loss computed on response tokens only. Small data (10k–1M examples), high impact.
  • Reward Model — trained on (prompt, chosen, rejected) triples with loss = −log σ(r(chosen) − r(rejected)); outputs a scalar preference score for any response in milliseconds, serving as a proxy for human judgment.
  • RLHF / DPO — optimizes the policy against the reward model. PPO maximizes reward with a KL penalty against the SFT policy to prevent reward hacking; DPO reformulates the RLHF objective as direct supervised classification on preference pairs — simpler, more stable, and now dominant in open models.
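The DPO objective can be sketched directly from the formula in Rafailov et al.: each response's implicit reward is β times its policy-vs-reference log-probability ratio, and the loss is a logistic loss on the reward margin. The log-probabilities below are made-up illustrative values:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit reward = beta * (policy logprob - reference logprob)
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# policy already prefers the chosen response → small loss
low = dpo_loss(logp_chosen=-10.0, logp_rejected=-40.0,
               ref_chosen=-20.0, ref_rejected=-20.0)

# policy prefers the rejected response → large loss, strong gradient
high = dpo_loss(logp_chosen=-40.0, logp_rejected=-10.0,
                ref_chosen=-20.0, ref_rejected=-20.0)
```

No reward model, no sampling, no PPO machinery: the preference pair itself supplies the training signal.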

3.7 Model Artifacts

The trained model is a collection of weight tensors on disk:

Artifact           Shape                                Notes
Token embeddings   [vocab × d_model]                    Shared with output projection
W_Q, W_K, W_V      [d_model × d_k] each                 L × H sets; often fused as W_QKV
W_O                [H·d_v × d_model]                    One per layer
W1, W2 (FFN)       [d_model × d_ff], [d_ff × d_model]   ~66% of total params
γ, β (norm)        [d_model] each                       < 0.5% of total params

Attention patterns ([seq × seq]) are not stored — computed fresh each forward pass. The KV cache (K, V per token per layer) is stored at inference time: ~524 KB/token for LLaMA-7B.
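The per-token figure is a straightforward multiplication. A back-of-envelope sketch with assumed LLaMA-7B-like values:

```python
# assumed LLaMA-7B-like config (MHA, no GQA)
n_layers = 32
d_model = 4096        # K and V each store d_model values per layer
bytes_per_value = 2   # FP16

per_token = 2 * n_layers * d_model * bytes_per_value   # K and V
per_token_kb = per_token / 1024                        # ≈ 512 KiB per token

# a full 4096-token context costs ~2 GiB of cache per sequence
context = 4096
total_gb = per_token * context / 1024**3
```

At batch sizes of dozens of concurrent sequences, this cache, not the weights, becomes the dominant memory consumer.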

3.8 Deployment

  • Quantization: FP16 → INT4 (GPTQ, AWQ) shrinks 70B from 140 GB to 35 GB, 1–5% quality loss.
  • PagedAttention (vLLM): manages KV cache like OS virtual memory — near-100% GPU utilization vs ~30% naive.
  • Continuous batching: finished sequences leave and new requests join the batch at each generation step.

4. Attention Heads in Depth

4.1 What an Attention Head Is

A single head is one complete execution of scaled dot-product attention with its own W_Q, W_K, W_V. Those weight matrices are the head’s entire personality — they determine what it looks for, responds to, and passes along. The attention pattern is a runtime result, not a stored parameter.

4.2 What Heads Learn to Do

Interpretability research (Elhage et al. 2021, A Mathematical Framework for Transformer Circuits) has found that heads reliably develop distinct behaviors:

  • Syntactic heads: connect verbs to subjects, adjectives to nouns — across arbitrary distances, with no grammar supervision.
  • Coreference heads: resolve pronouns. Processing “it” in “the cat sat because it was tired” → near-exclusive attention on “cat”.
  • Positional heads: attend at a nearly fixed offset (±1 or ±2 tokens), capturing local n-gram patterns.
  • Copying heads: W_Q·W_K structured so tokens attend to similar tokens; W_V·W_O structured to copy the embedding forward. Core component of induction circuits.
  • Induction heads: two-layer circuit. Layer N’s previous-token head passes a “what came before me” signal; layer N+1’s induction head uses it to find tokens that follow the same predecessor — the basis of in-context learning.

4.3 Layer Depth and Head Behavior


Figure 7. Attention Head Roles by Layer Depth.

Head behaviour varies systematically with depth:

  • Early layers — local syntactic scaffolding: positional, POS-tag, and copying heads within short windows.
  • Middle layers — long-range semantic and coreference functions requiring full-sequence attention.
  • Late layers — hardest to interpret; primarily route information toward next-token prediction.
  • Cross-architecture — this depth progression is consistent across model families and scales.

4.4 What Makes Heads Differ: The Mechanism

Step 1 — Random initialization breaks symmetry. Identical initial weights → identical outputs → identical gradients → no divergence. Random initialization creates immediate differences that compound over hundreds of thousands of steps into completely distinct learned functions.

Step 2 — W_O feedback loop enforces specialization. W_O learns to expect different information from different heads. Head 3 drifts toward subject-tracking → W_O routes head 3’s output wherever subject information is needed → head 3 receives stronger gradient for that behavior → it specializes further. Heads negotiate a division of labor purely through gradient descent.

Step 3 — Two independent degrees of freedom per head.


Figure 8. QK and OV Circuits in Each Attention Head.

Two independent circuits per head:

  • QK circuit (W_Q · W_Kᵀ) — controls where attention flows: which token pairs produce high scores.
  • OV circuit (W_V · W_O) — controls what moves: features read from attended tokens into the residual stream.
  • Independence — identical QK patterns can carry completely different information via different OV circuits.
  • Example — Heads A, B, C share the same attention weights but extract syntax, semantics, and position respectively.
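The independence of the two circuits is easy to demonstrate: hold one QK circuit fixed and swap only the OV side. A NumPy sketch with toy random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
seq, d_model, d_head = 6, 32, 8
x = rng.normal(size=(seq, d_model))

# one shared QK circuit: decides WHERE attention flows
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
weights = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))

# two different OV circuits reading through the identical pattern
W_V_a = rng.normal(size=(d_model, d_head))
W_V_b = rng.normal(size=(d_model, d_head))
out_a = weights @ (x @ W_V_a)   # same "where", one "what"
out_b = weights @ (x @ W_V_b)   # same "where", a different "what"
```

Both outputs were produced by the exact same attention pattern, yet they carry different information — precisely the QK/OV split described above.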

Step 4 — Not all heads matter equally. Michel et al. (2019) showed that most heads can be pruned with minimal degradation. The load-bearing heads are few; the rest provide redundancy that regularizes training.

4.5 A Concrete Mental Model

Think of each head as a specialist reading the same document in parallel:

  • Head A (syntactician): “who is the subject of each verb?”
  • Head B (coreference): “what does each pronoun refer to?”
  • Head C (proximity): “what came immediately before this word?”
  • Head D (semantic): “what other tokens are topically related?”

W_O combines all their outputs into one enriched representation. No roles were assigned — they emerged as the most efficient division of labor for minimizing next-token prediction loss.


Conclusion

The Transformer’s elegance is that a single objective — predict the next token — applied at scale, produces grammar, factual recall, reasoning, and world knowledge as emergent byproducts. The key practitioner takeaways:

  • Head specialization emerges from random initialization + gradient feedback through W_O — not architectural design.
  • Every head has two independent degrees of freedom: the QK circuit (where attention flows) and the OV circuit (what information moves).
  • FFN layers (~66% of parameters) act as key-value memory stores; most factual recall happens there, not in attention.
  • The KV cache dominates inference memory at production scale — not the model weights.
  • Data beats raw model size: Chinchilla showed a smaller model trained on more tokens outperforms a larger, undertrained one, and the Phi results showed a small model on textbook-quality data outperforming much larger models trained on noisier web data.

References

  • Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
  • Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
  • Michel, P. et al. (2019). Are Sixteen Heads Really Better than One? NeurIPS 2019.
  • Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind.
  • Touvron, H. et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.
  • Rafailov, R. et al. (2023). Direct Preference Optimization. NeurIPS 2023.
  • He, K. et al. (2016). Deep Residual Learning for Image Recognition. CVPR 2016.