Transformer Architecture, Attention Heads, and the LLM Creation Pipeline
Introduction
The 2017 paper Attention Is All You Need (Vaswani et al.) introduced the Transformer — a sequence transduction model built entirely on attention mechanisms, dispensing with recurrence and convolution altogether. It is the foundation of every modern large language model: GPT, BERT, Claude, Gemini, LLaMA, and their successors all descend directly from this architecture.
This post works through the Transformer from first principles, assuming a software engineering background but no prior deep learning experience. It proceeds in four parts: the architecture itself, the data pipeline from raw text to trained model, the complete LLM creation and alignment pipeline, and a detailed treatment of attention heads — what they are, what they learn, and crucially, what makes them differ from each other.
1. The Transformer Architecture
1.1 The Problem It Solves
Before Transformers, sequence modeling used RNNs — effectively a for-loop over tokens. By token 50 the model had “forgotten” much about token 1, and the sequential dependency prevented parallelization. The Transformer’s key insight is attention: let every token look at every other token simultaneously and decide what is relevant — O(1) sequential operations, fully parallelizable, with explicit long-range dependencies.
1.2 Encoder-Decoder Structure
The Transformer maps directly to an encode-then-decode pattern. The Encoder (N=6 identical layers) reads the full input and builds a rich contextual representation via multi-head self-attention followed by a position-wise feed-forward network. The Decoder (also N=6 layers) generates output autoregressively, adding masked self-attention (preventing future peeking) and a cross-attention sub-layer that reads from the encoder’s output. Residual connections — output = LayerNorm(x + sublayer(x)) — wrap every sub-layer, providing a clean gradient highway through deep stacks.
Figure 1. Encoder-Decoder Architecture.
Key elements of the encoder-decoder architecture:
- Encoder — builds contextual representations of the full source sequence, producing K and V matrices.
- Cross-attention — injects encoder K and V into every decoder layer, enabling full-input attention at each depth.
- Decoder — generates output autoregressively, consuming its own previous tokens one at a time.
- Masked self-attention — applies −∞ to future positions (j > i), enforcing left-to-right causal generation.
1.3 Scaled Dot-Product Attention
The core operation:
Q = x @ W_Q # "what am I looking for?" [seq × d_k]
K = x @ W_K # "what do I offer?" [seq × d_k]
V = x @ W_V # "what do I carry?" [seq × d_v]
scores = Q @ K.T / sqrt(d_k) # pairwise relevance [seq × seq]
weights = softmax(scores) # normalize to probabilities
output = weights @ V # weighted blend of values
Formally: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Dividing by √d_k prevents dot products from growing large in high dimensions, which would push softmax into saturation and produce near-zero gradients. The softmax-normalized weights matrix is the attention pattern — a [seq × seq] distribution where each row describes how much one token attends to every other.
Figure 2. Scaled Dot-Product Attention.
Steps in the computation:
- Score — dot-product of Q and K, scaled by 1/√d_k to stabilise gradients in high dimensions.
- Mask (decoder only) — sets future-position scores to −∞ before normalisation.
- Softmax — normalises scores into attention weights summing to 1.0 per query position.
- Output — weighted sum of V vectors; high-attention tokens contribute proportionally more.
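The steps above can be run as a minimal NumPy sketch. The toy dimensions (seq=4, d_k=8) are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise relevance [seq, seq]
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # decoder: hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_k = 4, 8
x = rng.normal(size=(seq, d_k))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
# each row of w sums to 1.0; out has shape [seq, d_k]
```

Passing a lower-triangular boolean matrix as `mask` reproduces the decoder's causal masking.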
1.4 Multi-Head Attention
A single head produces one weighted average per position. Multi-head attention runs H=8 independent heads in lower-dimensional subspaces (d_k = d_model/H = 64), concatenates their outputs, and projects back:
heads = [Attention(x @ W_Q_i, x @ W_K_i, x @ W_V_i) for i in range(8)]
output = Concat(heads) @ W_O # [seq × d_model]
Each head learns a different relationship type. Total compute ≈ one full-dimension head, but the model gains 8 independent perspectives per token.
Figure 3. Multi-Head Attention.
How multi-head attention works:
- Parallel projection — each of H=8 heads projects to d_k=64 with its own W_Q, W_K, W_V matrices.
- Specialisation — each head learns a distinct relational pattern (syntactic, semantic, positional, etc.).
- Concatenation — outputs are concatenated to [seq × 512] and projected back to d_model by W_O.
- Efficiency — total compute matches one full-dimension head, with 8× representational diversity.
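The split-project-concatenate flow can be sketched in NumPy. The loop over heads is for clarity (real implementations fuse heads into batched matrix multiplies), and the 0.02 init scale is an arbitrary choice:

```python
import numpy as np

def multi_head_attention(x, W_Q, W_K, W_V, W_O, H=8):
    """x: [seq, d_model]; W_Q/W_K/W_V are lists of one [d_model, d_k] matrix per head."""
    heads = []
    for i in range(H):
        Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]      # [seq, d_k] each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        heads.append(w @ V)                               # one perspective per head
    return np.concatenate(heads, axis=-1) @ W_O           # [seq, H*d_k] -> [seq, d_model]

rng = np.random.default_rng(1)
d_model, d_k, H, seq = 512, 64, 8, 10
mk = lambda: [rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(H)]
W_Q, W_K, W_V = mk(), mk(), mk()
W_O = rng.normal(scale=0.02, size=(H * d_k, d_model))
out = multi_head_attention(rng.normal(size=(seq, d_model)), W_Q, W_K, W_V, W_O)
# out has shape [seq, d_model] = [10, 512]
```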
1.5 Feed-Forward Networks
After attention mixes information across positions, each position is processed independently by a two-layer MLP:
FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
# W1: [512 → 2048], W2: [2048 → 512]
The 4× inner expansion gives representational capacity to transform features. FFN weights account for roughly two thirds of all model parameters and act as key-value memory stores for factual associations.
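A direct sketch of the position-wise FFN with the base-model shapes; the init scale is arbitrary:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP applied independently to each [d_model] row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU expansion, then projection

rng = np.random.default_rng(2)
d_model, d_ff, seq = 512, 2048, 10
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)
out = ffn(rng.normal(size=(seq, d_model)), W1, b1, W2, b2)
# output shape matches input: [seq, d_model]
```

Because the FFN sees one position at a time, all cross-token mixing must happen in the attention sub-layer that precedes it.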
1.6 Positional Encoding
Attention is a set operation with no inherent notion of order. Fixed sinusoidal encodings are added to embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i/512))
PE(pos, 2i+1) = cos(pos / 10000^(2i/512))
Low-index dimensions oscillate rapidly (encoding local position); high-index dimensions oscillate slowly (encoding global position). Because PE(pos+k) is a linear function of PE(pos), the model can easily learn to attend by relative offset.
Figure 4. Positional Encoding Heatmap.
Reading the heatmap:
- Axes — columns = positions 0–11; rows = dimension bands. Blue = positive, white = zero, orange = negative.
- Top rows (low-index) — fast sine cycles make neighbouring positions clearly distinct.
- Bottom rows (high-index) — slow variation provides a coarser global position signal.
- Result — the 512-dimension vector produces a unique fingerprint for every sequence position.
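The two formulas can be implemented in a few lines; this sketch fills even dimensions with sines and odd dimensions with cosines:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # [max_len, 1] positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(128)
# every row is a distinct fingerprint; at pos=0 all sines are 0 and all cosines are 1
```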
1.7 Model Dimensions (Base Transformer)
| Parameter | Value |
|---|---|
| d_model | 512 |
| N (layers) | 6 |
| H (heads) | 8 |
| d_k = d_v | 64 |
| d_ff | 2048 |
| Vocab | ~37,000 |
2. From Raw Text to Q, K, V
The path from a string to the Q, K, V matrices involves four discrete transforms.
Figure 5. Input Pipeline: Raw Text to Q, K, V.
Six steps transform a raw string into the attention inputs Q, K, V:
- ① Raw Text — the verbatim input string (a Python str); no processing has occurred yet.
- ② Token IDs — the tokenizer (BPE) splits the string into subword units and maps each to an integer from a frozen vocabulary; the sequence length seq is fixed here.
- ③ Embeddings — each integer ID is looked up in a learned matrix E ∈ ℝ^(vocab × d_model) and the result is scaled by √d_model to keep variance stable across layers.
- ④ x [seq × d_model] — fixed sinusoidal positional vectors are added element-wise to the embeddings, encoding each token's absolute position; x is the shared input to every transformer layer.
- ⑤ Q, K, V — x is linearly projected three times via learned weight matrices W_Q, W_K, W_V (each ℝ^(d_model × d_k)); the three projections can attend to different aspects of the same input.
- ⑥ Attention Output — scaled dot-product attention combines the projections: softmax(QKᵀ / √d_k) · V, yielding a context-weighted summary of shape [seq × d_v].
2.1 Tokenization
BPE iteratively merges the most frequent adjacent byte pair until the vocabulary reaches the target size (32k–100k tokens). The tokenizer is trained once then permanently frozen — vocab.json is a permanent artifact that must ship with the model at inference time.
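The merge rule can be shown with a toy sketch on the classic low/lower/newest/widest corpus. Real tokenizers (SentencePiece, tiktoken, and similar) handle pre-tokenization and symbol boundaries that this string-replace shortcut ignores:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE step: count adjacent symbol pairs, merge the most frequent one."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    (a, b), _ = pairs.most_common(1)[0]
    # simplified merge: replace the pair wherever it occurs (toy corpora only)
    merged = {w.replace(f"{a} {b}", f"{a}{b}"): f for w, f in words.items()}
    return merged, (a, b)

# each word is pre-split into characters, with its corpus frequency
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    words, pair = bpe_merge_step(words)
    merges.append(pair)
# first merge is ('e', 's'): it appears 6 + 3 = 9 times, the most frequent pair
```

The recorded merge list, applied in order, is exactly what ships in the frozen tokenizer artifact alongside vocab.json.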
2.2 Embedding
Each integer ID is converted to a dense [d_model]-dimensional learnable vector. Similar tokens converge to nearby positions in embedding space during training. The embedding matrix is weight-tied with the pre-softmax output projection — the same matrix used to decode output logits — reducing parameters and improving gradient flow.
2.3 Producing Q, K, V
x = embedding + positional_encoding is the input to every transformer layer. Three learned linear projections produce Q, K, V for each head. The weight matrices W_Q, W_K, W_V are the entire learned “personality” of a head — the attention pattern is the result of applying those weights to the current input, not something stored in the model.
3. The LLM Creation Pipeline
Modern LLMs follow a well-defined multi-stage pipeline. Each stage produces artifacts consumed by the next.
3.1 Training Data Collection
LLMs train on trillions of tokens from diverse sources. Data mix matters as much as volume — GPT-3’s approximate composition:
| Source | Weight | Sampling rate |
|---|---|---|
| Common Crawl | 60% | 0.44× per epoch |
| Books | 22% | 2.0× per epoch |
| Wikipedia | 3% | 3.4× per epoch |
| Other | 15% | ~2.0× per epoch |
Code data improves general reasoning beyond coding — code’s explicit logic chains transfer to multi-step problem solving. The Phi model results showed a 1.3B parameter model on textbook-quality data outperforming much larger models on raw web data, shifting the field toward quality over quantity.
3.2 Data Processing
Raw data cannot be used directly. Deduplication (MinHash + LSH on n-gram shingles) prevents the model from memorizing repeated text. Quality filtering (KenLM perplexity, heuristic rules, classifiers) removes spam, boilerplate, and gibberish. PII removal (regex + NER) redacts emails, phone numbers, and identifiers. Toxic content filtering down-weights harmful categories.
3.3 Tokenizer Training
A BPE or SentencePiece tokenizer is trained on a corpus sample and frozen before any model training begins. Special tokens — [BOS], [EOS], [PAD], chat turn tokens — are assigned reserved IDs with semantic meaning for the model’s operation.
3.4 Architecture Definition
The config file is specified before weight initialization and is essentially permanent — changing it requires retraining from scratch.
GPT-3 (175B): d_model=12288, n_layers=96, n_heads=96, d_ff=49152, ctx=2048
LLaMA-7B: d_model=4096, n_layers=32, n_heads=32, d_ff=11008, ctx=4096
Mistral-7B: d_model=4096, n_layers=32, n_heads=32, GQA-8, ctx=8192
Key modern choices: GQA (Grouped-Query Attention) for reduced KV cache memory; RoPE positional encoding for length extrapolation; RMSNorm (drops mean subtraction, ~10% faster); SwiGLU activation in FFN layers.
3.5 Pre-Training
Objective: next-token prediction — minimize cross-entropy loss averaged over all positions.
L = -1/T × Σᵢ log P(xᵢ | x₁...xᵢ₋₁ ; θ)
Perplexity = exp(L) → random (50k vocab): 50,000 | GPT-3 PTB: ~20
Optimizer: AdamW (β₁=0.9, β₂=0.95) with cosine LR decay, linear warmup, gradient clipping (norm=1.0). Infrastructure: FSDP data parallelism, tensor parallelism splitting attention heads across GPUs, pipeline parallelism across layers.
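The relationship between the loss and the perplexity figures above can be verified directly:

```python
import math

def cross_entropy(probs_of_correct_token):
    """Next-token cross-entropy, averaged over all T positions."""
    T = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / T

# an untrained model over a 50k vocab assigns each correct token p = 1/50000
uniform_loss = cross_entropy([1 / 50_000] * 100)
# a model at perplexity ~20 assigns the correct token p = 1/20 on average
trained_loss = cross_entropy([1 / 20] * 100)
# perplexity = exp(loss): 50,000 for the random model, 20 for the trained one
```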
3.6 Alignment
The base model predicts text unconditionally. Three sequential stages transform it into a safe, helpful assistant.
Figure 6. Alignment Pipeline: SFT → Reward Model → RLHF/DPO.
Three stages of the alignment pipeline:
- SFT — fine-tunes on curated (instruction, response) pairs; loss computed on response tokens only.
- Reward Model — trained on pairwise comparisons to produce a scalar human-preference score.
- RLHF / PPO — maximises reward with a KL penalty against the SFT policy to prevent reward hacking.
- DPO — reformulates the RLHF objective as direct classification on preference pairs, bypassing the reward model.
- SFT: fine-tune on human-written (instruction, response) pairs; loss only on response tokens. Small data (10k–1M examples), high impact.
- Reward Model: trained on (prompt, chosen, rejected) triples; loss = -log σ(r(chosen) − r(rejected)). Scores any response in milliseconds as a proxy for human judgment.
- RLHF / DPO: optimize the policy against the RM. PPO adds a KL penalty to prevent reward hacking. DPO reformulates RLHF as direct supervised classification on preference pairs — simpler, more stable, and now dominant in open models.
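The reward-model loss is compact enough to write out; this sketch evaluates it at a few points to show the behavior it encourages:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the RM already prefers the chosen response: loss is small
confident = reward_model_loss(2.0, -1.0)
# the RM prefers the rejected response: loss is large, pushing scores apart
wrong = reward_model_loss(-1.0, 2.0)
# equal scores give -log(0.5), i.e. the model is indifferent
indifferent = reward_model_loss(0.0, 0.0)
```

Only the score difference matters, which is why reward scores have no absolute scale across models.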
3.7 Model Artifacts
The trained model is a collection of weight tensors on disk:
| Artifact | Shape | Notes |
|---|---|---|
| Token embeddings | [vocab × d_model] | Shared with output projection |
| W_Q, W_K, W_V | [d_model × d_k] each | L × H sets; often fused as W_QKV |
| W_O | [H·d_v × d_model] | One per layer |
| W1, W2 (FFN) | [d_model × d_ff], [d_ff × d_model] | ~66% of total params |
| γ, β (norm) | [d_model] each | < 0.5% of total params |
Attention patterns ([seq × seq]) are not stored — computed fresh each forward pass. The KV cache (K, V per token per layer) is stored at inference time: ~524 KB/token for LLaMA-7B.
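The ~524 KB/token figure follows from the LLaMA-7B config (32 layers, d_model=4096, FP16), since the cache stores one K and one V vector per layer per token:

```python
def kv_cache_per_token(n_layers, d_model, bytes_per_value=2):
    """Bytes of KV cache per generated token; 2 bytes/value assumes FP16."""
    return 2 * n_layers * d_model * bytes_per_value  # leading 2 = one K + one V

llama_7b = kv_cache_per_token(n_layers=32, d_model=4096)
# 524,288 bytes per token; a full 4096-token context costs ~2.1 GB of cache
```

This is the memory pressure that GQA and PagedAttention (next section) are designed to relieve.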
3.8 Deployment
- Quantization: FP16 → INT4 (GPTQ, AWQ) shrinks 70B from 140 GB to 35 GB, 1–5% quality loss.
- PagedAttention (vLLM): manages KV cache like OS virtual memory — near-100% GPU utilization vs ~30% naive.
- Continuous batching: finished sequences leave and new requests join the batch at each generation step.
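The quantization arithmetic above is just parameter count times bits per parameter:

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes here, ignoring overhead)."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = model_memory_gb(70e9, 16)  # 70B weights at FP16 -> 140 GB
int4 = model_memory_gb(70e9, 4)   # same weights at INT4 -> 35 GB
```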
4. Attention Heads in Depth
4.1 What an Attention Head Is
A single head is one complete execution of scaled dot-product attention with its own W_Q, W_K, W_V. Those weight matrices are the head’s entire personality — they determine what it looks for, responds to, and passes along. The attention pattern is a runtime result, not a stored parameter.
4.2 What Heads Learn to Do
Interpretability research (Elhage et al. 2021, A Mathematical Framework for Transformer Circuits) has found that heads reliably develop distinct behaviors:
- Syntactic heads: connect verbs to subjects, adjectives to nouns — across arbitrary distances, with no grammar supervision.
- Coreference heads: resolve pronouns. Processing “it” in “the cat sat because it was tired” → near-exclusive attention on “cat”.
- Positional heads: attend at a nearly fixed offset (±1 or ±2 tokens), capturing local n-gram patterns.
- Copying heads: W_Q·W_Kᵀ structured so tokens attend to similar tokens; W_V·W_O structured to copy the embedding forward. Core component of induction circuits.
- Induction heads: two-layer circuit. Layer N’s previous-token head passes a “what came before me” signal; layer N+1’s induction head uses it to find tokens that follow the same predecessor — the basis of in-context learning.
4.3 Layer Depth and Head Behavior
Figure 7. Attention Head Roles by Layer Depth.
Head behaviour varies systematically with depth:
- Early layers — local syntactic scaffolding: positional, POS-tag, and copying heads within short windows.
- Middle layers — long-range semantic and coreference functions requiring full-sequence attention.
- Late layers — hardest to interpret; primarily route information toward next-token prediction.
- Cross-architecture — this depth progression is consistent across model families and scales.
4.4 What Makes Heads Differ: The Mechanism
Step 1 — Random initialization breaks symmetry. Identical initial weights → identical outputs → identical gradients → no divergence. Random initialization creates immediate differences that compound over hundreds of thousands of steps into completely distinct learned functions.
Step 2 — W_O feedback loop enforces specialization. W_O learns to expect different information from different heads. Head 3 drifts toward subject-tracking → W_O routes head 3’s output wherever subject information is needed → head 3 receives stronger gradient for that behavior → it specializes further. Heads negotiate a division of labor purely through gradient descent.
Step 3 — Two independent degrees of freedom per head.
Figure 8. QK and OV Circuits in Each Attention Head.
Two independent circuits per head:
- QK circuit (W_Q · W_Kᵀ) — controls where attention flows: which token pairs produce high scores.
- OV circuit (W_V · W_O) — controls what moves: features read from attended tokens into the residual stream.
- Independence — identical QK patterns can carry completely different information via different OV circuits.
- Example — Heads A, B, C share the same attention weights but extract syntax, semantics, and position respectively.
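The independence of the two circuits can be demonstrated numerically. This sketch (toy dimensions; the per-head W_O projection is omitted for brevity) gives two heads the same QK circuit but different value projections:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_k, seq = 64, 16, 6
x = rng.normal(size=(seq, d_model))

W_Q = rng.normal(scale=0.1, size=(d_model, d_k))
W_K = rng.normal(scale=0.1, size=(d_model, d_k))
QK = W_Q @ W_K.T                        # [d_model, d_model]: where attention flows
scores = x @ QK @ x.T / np.sqrt(d_k)    # identical to (x W_Q)(x W_K)^T / sqrt(d_k)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)           # the shared attention pattern

# two heads sharing this QK circuit but carrying different features forward
W_V_a, W_V_b = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(2))
out_a, out_b = w @ x @ W_V_a, w @ x @ W_V_b
# same attention pattern, completely different information moved
```

Because attention scores depend only on the product W_Q·W_Kᵀ, the individual Q and K matrices are not separately identifiable; interpretability work therefore analyzes the combined QK and OV circuits directly.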
Step 4 — Not all heads matter equally. Michel et al. (2019) showed that most heads can be pruned with minimal degradation. The load-bearing heads are few; the rest provide redundancy that regularizes training.
4.5 A Concrete Mental Model
Think of each head as a specialist reading the same document in parallel:
- Head A (syntactician): “who is the subject of each verb?”
- Head B (coreference): “what does each pronoun refer to?”
- Head C (proximity): “what came immediately before this word?”
- Head D (semantic): “what other tokens are topically related?”
W_O combines all their outputs into one enriched representation. No roles were assigned — they emerged as the most efficient division of labor for minimizing next-token prediction loss.
Conclusion
The Transformer’s elegance is that a single objective — predict the next token — applied at scale, produces grammar, factual recall, reasoning, and world knowledge as emergent byproducts. The key practitioner takeaways:
- Head specialization emerges from random initialization + gradient feedback through W_O — not architectural design.
- Every head has two independent degrees of freedom: the QK circuit (where attention flows) and the OV circuit (what information moves).
- FFN layers (~66% of parameters) act as key-value memory stores; most factual recall happens there, not in attention.
- The KV cache dominates inference memory at production scale — not the model weights.
- Data quality beats data quantity: Chinchilla scaling laws and the Phi results both show a smaller model on better data often outperforms a larger model on noisier data.
References
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
- Michel, P. et al. (2019). Are Sixteen Heads Really Better than One? NeurIPS 2019.
- Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020.
- Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind.
- Touvron, H. et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.
- Rafailov, R. et al. (2023). Direct Preference Optimization. NeurIPS 2023.
- He, K. et al. (2016). Deep Residual Learning for Image Recognition. CVPR 2016.