LLM Training Workflows: From Data Governance to Deployment



Executive Summary

Large language model training workflows are now best understood as integrated systems rather than single optimisation problems. The strongest public recipes combine carefully governed data pipelines, decoder-dominant transformer architectures, staged training and post-training, distributed systems co-design, and a much broader evaluation stack than perplexity alone. Open reports from OLMo, OLMo 2, Llama 3, BLOOM, Switch Transformer, and DeepSeek-V3 show that model quality is driven as much by data quality, curriculum, and systems engineering as by raw parameter count. Recent frontier-scale examples also show a split between dense models that prioritise stability and simplicity, and sparse mixture-of-experts models that trade operational complexity for better compute efficiency at very large scale.

Because target model size, domain, latency target, and budget vary widely, the most useful framing is by decision band. For general-purpose assistants, dense decoder-only transformers remain the default path from roughly 7B to 70B because they integrate cleanly with next-token pre-training, supervised fine-tuning, DPO, and modern serving stacks. Encoder-decoder models are still strong for sequence-to-sequence workloads, and MoE becomes more appealing when training budgets are large enough to justify expert routing, expert parallelism, and more complex failure handling. Public examples span OLMo 2 at 7B, 13B, and 32B, Llama 3 at 8B, 70B, and 405B, and DeepSeek-V3 at 671B total parameters with 37B active per token.

The strongest cross-cutting best practice is to design the workflow backwards from deployment and risk constraints. Teams that start with a benchmark target but neglect licensing provenance, deduplication, contamination control, checkpoint portability, post-training safety, or serving economics often discover too late that they have trained an expensive model that is difficult to ship or audit. Recent official guidance and public documentation from PyTorch, DeepSpeed, Megatron Core, AWS, Google Cloud, Azure, NIST, and the EU’s GPAI materials all emphasise reproducibility, scalable checkpointing, distributed fault tolerance, and lifecycle governance rather than only training throughput.

In practice, a robust LLM workflow begins with governed and deduplicated corpus assembly, moves through streaming preprocessing and tokenizer validation, then trains a dense or sparse transformer with AdamW-family optimisation and mixed precision. At scale, training uses some combination of data, tensor, pipeline, expert, and sharded parallelism. Checkpointing and restart design need to exist from day one. After the base run, teams usually add supervised fine-tuning plus preference optimisation, deployment-time adaptation with LoRA, QLoRA, quantisation, or pruning, and evaluation that covers capability, calibration, safety, and adversarial behaviour.

Fig 1. Process diagram of the full LLM training workflow, from source acquisition through governance, curation, preprocessing, training, post-training, evaluation, deployment optimisation, and MLOps. A defensible LLM training programme is a lifecycle. Data governance, checkpointing, evaluation, and deployment optimisation are not cleanly separable from the main training run.

Workflow Overview and Project Roadmap

Lifecycle Stages

Recent public model reports converge on a multi-stage lifecycle: pre-training data construction, large-scale pre-training, continued pre-training or annealing, supervised instruction tuning, preference optimisation, safety work, and deployment optimisation. Llama 3 explicitly describes separate pre-training and post-training stages, plus a long-context continued pre-training stage that stretches the context window from 8K to 128K. OLMo and OLMo 2 similarly expose data, training code, checkpoints, and late-stage curriculum changes as first-class parts of the workflow rather than afterthoughts.

The flow in Figure 1 is a synthesis of patterns that appear repeatedly in open reports and official framework documentation. It is not a claim that every team follows the exact same order. It reflects the most defensible default design for a modern training programme, where data governance, scalable infrastructure, evaluation, and release readiness are planned together.

Project Roadmap

A typical project timeline depends heavily on scope. A 7B-13B research model can often compress phases; a 30B-70B production model usually needs separate data, systems, and post-training workstreams; frontier-scale MoE efforts require longer platform hardening and risk management. The roadmap is therefore illustrative, not prescriptive, but it maps well to the staged processes described in Llama 3, OLMo and OLMo 2, InstructGPT, and current cloud training tutorials.

Fig 2. Gantt chart of an illustrative LLM training roadmap across data, platform, training, post-training, and release workstreams. Data, platform, training, post-training, safety, and release work overlap rather than forming a simple serial queue.

The most common planning pitfall is to treat evaluation, checkpointing, or compliance as post-run work. Recent public training stacks show the opposite. Llama 3 chose a simpler dense architecture partly to maximise training stability and keep scaling manageable; PyTorch DCP and torchrun exist because restarts and topology changes are normal at scale; and EU GPAI guidance ties obligations to the model lifecycle rather than only to the release event.

Data Sourcing, Governance, Preprocessing, and Tokenization

Data Pipeline Leverage

The data pipeline is usually the highest-leverage part of the workflow. ROOTS, OLMo, Dolma, FineWeb, and Llama 3 all make data curation central. BigScience ROOTS is notable because it treated corpus construction as a values-driven, multilingual, governance-heavy activity rather than a raw crawl problem. OLMo trained on a 2T-token sample of Dolma and released enough artefacts to reconstruct training order and batch composition. Llama 3 reports roughly 15T multilingual pre-training tokens and highlights improved preprocessing, curation, and data-quality controls as core contributors to performance.

Source family	Why teams use it	Workflow requirements	Notable failure mode
Open web crawl	Scale, diversity, low marginal cost	Aggressive filtering, deduplication, licence review, contamination control	Low-quality or repetitive text, hidden PII, attribution gaps
Curated or licensed corpora	Better legal posture and domain quality	Contract tracking, provenance, renewal logic	Expensive coverage gaps
Code repositories	Strong coding capability gains	Permissive-licence validation, repository-level metadata, opt-outs	Licence incompatibility, copied secrets, benchmark leakage
Synthetic or self-generated data	Cheap instruction expansion, post-training coverage	Teacher-model validation, diversity controls, anti-collapse checks	Self-amplified errors, style homogenisation
Proprietary enterprise data	Highest domain relevance	Strict access controls, retention rules, lineage, auditability	Privacy, confidentiality, unclear reuse rights

Provenance and Licensing

Source-level provenance, licence metadata, and documented inclusion criteria should be maintained from the start. Dataset cards exist because responsible use demands that teams record contents, intended use, biases, and collection context. The Stack is a strong example of code-specific governance: it was built around permissively licensed source code and argued that creators should have meaningful control over inclusion. This matters more now because EU GPAI obligations include transparency and copyright-related rules, and the AI Office’s GPAI materials tie obligations to both full training and certain modifications or fine-tuning events.

Deduplication and Mixture Design

Deduplication should happen before training and before evaluation. Lee et al. found that near-duplicate examples and long repeated substrings were common in standard corpora, that more than 1% of the unprompted output of models trained on those corpora was copied verbatim from the training set, and that deduplication both reduced memorisation and improved training efficiency. Deduplication is therefore not only a legal or privacy safeguard; it is also a quality and optimisation step.
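The near-duplicate filtering step can be sketched in a few lines. This is a minimal illustration and not the method used by Lee et al.: it compares word 5-gram shingle sets with Jaccard similarity and keeps a document only if it is not too similar to one already kept. The function names, the shingle size, and the 0.8 threshold are assumptions chosen for illustration.

```python
import re


def shingles(text: str, n: int = 5) -> set:
    """Return the set of lowercase word n-grams (shingles) in a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs: list, threshold: float = 0.8, n: int = 5) -> list:
    """Keep each document only if it is below the similarity threshold
    against every document already kept (first occurrence wins)."""
    kept, kept_shingles = [], []
    for doc in docs:
        candidate = shingles(doc, n)
        if all(jaccard(candidate, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(candidate)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so a sketch like this only suits small samples. Corpus-scale pipelines generally rely on approximate methods such as MinHash.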

Teams often over-index on scalar quality scores and under-invest in mixture design. Public recipes increasingly rebalance sources or apply a curriculum rather than holding a single fixed mixture for the whole run. OLMo 2 reports that a specialised late-stage data mix, Dolmino Mix 1124, improved downstream capabilities during annealing, while Llama 3 reports changing its data mix during training and increasing non-English proportions to improve multilingual performance. A static mixture can leave capability on the table even when the total token budget is large.
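A staged mixture can be expressed as a small lookup from training progress to source weights. The sketch below is illustrative only: the source names, the weights, and the 90% switch point are hypothetical and are not taken from OLMo 2 or Llama 3.

```python
import random

# Hypothetical stage table: (fraction of training at which the stage starts,
# unnormalised source weights). A late stage upweights a specialised source.
STAGES = [
    (0.0, {"web": 0.80, "code": 0.15, "math": 0.05}),
    (0.9, {"web": 0.50, "code": 0.20, "math": 0.30}),
]


def mixture_at(progress, stages=STAGES):
    """Return the normalised source weights active at a progress value in [0, 1]."""
    active = stages[0][1]
    for start, weights in stages:
        if progress >= start:
            active = weights
    total = sum(active.values())
    return {name: weight / total for name, weight in active.items()}


def sample_source(progress, rng, stages=STAGES):
    """Draw the source of the next training document according to the active mixture."""
    weights = mixture_at(progress, stages)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

Keeping the stage table as data rather than code makes it easy to log alongside each checkpoint, so a later regression can be traced to a specific mixture change.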

Synthetic Data

Synthetic data is now a standard part of post-training, but it works best as a controlled supplement rather than a blind multiplier. Self-Instruct improved instruction-following by generating its own training tasks and filtering invalid or overly similar examples. Constitutional AI shows a more controlled pattern: model-generated critiques and revisions, followed by preference-style optimisation, can improve harmlessness without relying entirely on human comparison labels. The pitfall is teacher collapse, where training on narrow or unverified synthetic data makes the model imitate the teacher's style without improving competence.

Preprocessing and Boundaries

Strong preprocessing pipelines combine text normalisation, careful document-boundary handling, chunking, and high-throughput streaming. OLMo appends EOS to each document, concatenates documents, then groups consecutive 2,048-token chunks into training instances. Llama 3 adds a document-aware attention mask to prevent self-attention across document boundaries within the same sequence, reports that this matters especially for very long continued pre-training, and uses 8K-context pre-training followed by staged long-context adaptation to 128K. Chunk boundaries and document boundaries are separate design choices.
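The difference between chunk boundaries and document boundaries is easier to see in code. The sketch below packs tokenised documents in the style described for OLMo (append EOS, concatenate, cut fixed-length chunks) and then builds a causal mask that also blocks attention across documents, in the spirit of the Llama 3 description. The EOS id of 0, the function names, and the choice to drop the trailing partial chunk are assumptions for illustration.

```python
EOS = 0  # hypothetical end-of-sequence token id; must not occur inside documents


def pack(docs, seq_len):
    """Append EOS to each tokenised document, concatenate everything, and cut
    the stream into fixed-length chunks. The trailing partial chunk is dropped."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def document_ids(chunk):
    """Label each position with the index of the document it belongs to.
    The EOS token is counted as part of the document it terminates."""
    ids, current = [], 0
    for token in chunk:
        ids.append(current)
        if token == EOS:
            current += 1
    return ids


def attention_mask(chunk):
    """mask[i][j] is True when position i may attend to position j:
    j must not be in the future and must be in the same document."""
    ids = document_ids(chunk)
    n = len(chunk)
    return [[j <= i and ids[i] == ids[j] for j in range(n)] for i in range(n)]
```

For example, packing the documents `[5, 6]` and `[7, 8, 9]` with `seq_len=4` gives the single chunk `[5, 6, 0, 7]`, in which the last position belongs to the second document and cannot attend to the first three.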

Streaming and Tokenization

Very large corpora should be streamable. Hugging Face’s dataset streaming documentation uses FineWeb as a concrete example and notes that the English split alone is 45 TB, which is large enough that “download then preprocess” stops being realistic for many teams. Streaming reduces local storage pressure and makes it easier to iterate over Parquet or JSONL at scale.

Tokenizer design is a model-quality decision, not a convenience decision. Hugging Face’s tokenization pipeline documentation emphasises normalisation and pre-tokenization as explicit stages, while OpenAI’s tiktoken documents fast BPE tokenisation and its desirable properties: reversibility, arbitrary-text coverage, and compression. Llama 3 makes the optimisation payoff concrete: it uses a 128K-token vocabulary combining 100K tiktoken tokens with 28K added non-English tokens, and reports improved compression on English plus better multilingual support. Better compression means more text seen for the same training compute.
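Compression can be measured before a vocabulary is committed. The sketch below computes average UTF-8 bytes per token for any tokenizer exposed as a callable and breaks the result down by data slice. The function names are hypothetical, and the whitespace and character tokenizers in the usage note are stand-ins rather than real LLM tokenizers.

```python
def bytes_per_token(texts, tokenize):
    """Average number of UTF-8 bytes represented by each token.
    Higher values mean the tokenizer compresses the text better."""
    total_bytes = sum(len(text.encode("utf-8")) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_bytes / total_tokens


def compare_by_slice(slices, tokenize):
    """Report compression separately for each named slice, for example one
    slice per language or domain, so weak coverage is not averaged away."""
    return {name: bytes_per_token(texts, tokenize) for name, texts in slices.items()}
```

With the toy tokenizers `str.split` and `list`, the text `"hello world"` scores 5.5 and 1.0 bytes per token respectively. Running the same comparison per language or domain on representative samples shows where a candidate vocabulary under-serves the target data.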

Architectures, Objectives, and Optimisation

Dense Decoder-only Models

Decoder-only transformers still dominate broad LLM training because they align cleanly with next-token pre-training, long-context serving, incremental decoding, and current post-training methods. GPT-3 established the modern decoder-only pattern at 175B scale; PaLM scaled the same general family further; OLMo 2 stays within dense autoregressive models; and Llama 3 explicitly says its largest model is a dense transformer and that Meta deliberately favoured a standard dense architecture over MoE to maximise stability and manage complexity.

Encoder-decoder Models

Encoder-decoder architectures remain strategically useful when the deployment workload is fundamentally sequence-to-sequence rather than open-ended dialogue. T5 unified many NLP tasks into text-to-text transfer and is still the cleanest public reference for encoder-decoder pre-training as a general framework. In practice, this means encoder-decoder stacks remain strong for translation, summarisation, and tightly conditioned generation, even though they are no longer the default for general chat assistants.

Sparse MoE Models

MoE is attractive when training budgets are large enough to exploit conditional computation. GShard and Switch Transformer showed the early scaling pattern: keep per-token compute roughly bounded while increasing total parameter count through sparse expert routing. DeepSeek-V3 is the most striking recent public example, with 671B total parameters, 37B active per token, an auxiliary-loss-free load-balancing strategy, and multi-token prediction in pre-training. The trade-off is clear: MoE can improve compute efficiency and capacity, but it raises routing, communication, stability, and checkpoint-management complexity.

Fig 3. Decision map comparing dense decoder-only, encoder-decoder, and sparse MoE architecture choices for LLM training. Architecture should follow workload, risk, and infrastructure constraints. Dense decoder-only models remain the default, while encoder-decoder and MoE designs are stronger fits for narrower use cases or larger platform budgets.

Architecture	Best fit	Advantages	Main drawbacks
Dense decoder-only	General assistants, coding, open-ended generation	Simplest end-to-end stack; strong serving ecosystem; next-token objective is natural	Expensive at very large scales; KV-cache cost grows with context
Encoder-decoder	Translation, summarisation, structured sequence-to-sequence	Strong conditional generation; efficient for input-conditioned tasks	Less standard for chat-centric ecosystems; serving stack less unified
Sparse MoE	Frontier-scale training with large infrastructure budgets	Higher total capacity at lower active compute per token	Harder routing, load-balancing, expert parallelism, and fault handling

Training Objectives

Training objectives are now layered. Pre-training remains dominated by next-token prediction for generative LLMs. Encoder-centric pre-training still uses masked losses or span corruption, as seen in BERT and T5. Post-training increasingly combines instruction tuning with preference optimisation such as PPO-based RLHF or DPO. InstructGPT remains the key public archetype for the classic RLHF stack: supervised demonstrations, preference data, reward modelling, then PPO. DPO is the major simplification: it optimises directly on preference pairs, with no explicit reward model or RL loop. Llama 3 says it preferred SFT plus rejection sampling plus DPO because more complex RL methods were harder to scale and less stable. OLMo 2 also points toward reinforcement learning with verifiable rewards (RLVR) as part of the newest post-training trend.
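The DPO objective for a single preference pair is compact enough to write out. The sketch below takes summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, and returns the negative log-sigmoid of the scaled margin. The argument names and the default beta of 0.1 are assumptions for illustration; a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log(sigmoid(margin))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference.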

Pre-training and post-training metrics should be kept separate. Teams often report post-training wins that are actually data-formatting or preference-learning wins, not base-model knowledge wins. Llama 3’s report is comparatively clear on this separation: pre-training teaches language and world knowledge, post-training teaches instruction following, tool use, and safety behaviour. This separation is useful operationally because it localises regressions.

Optimisation Defaults

AdamW remains the public default in major open recipes. OLMo uses AdamW, warms up for 5,000 steps, decays learning rate linearly to one-tenth of peak, and clips gradient norm at 1.0. Its reported hyperparameters at 7B include betas of (0.9, 0.95), epsilon 1e-5, weight decay 0.1, and a 4M-token batch. Llama 3 405B also uses AdamW, a linear warm-up followed by a cosine schedule, and staged batch-size increases from 4M tokens to 8M and then 16M tokens as training progresses. These reports support a practical rule: choose a stable optimiser family first, then scale batch size and schedule deliberately rather than locking them early.
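The two schedule shapes described above can be written as pure functions of the step count. This is a sketch under stated assumptions: the function names and the peak and minimum learning rates used in any call are hypothetical, and only the shapes (linear warm-up, then linear decay to one-tenth of peak or cosine decay) come from the reports.

```python
import math


def linear_decay_lr(step, peak_lr, warmup_steps, total_steps, final_ratio=0.1):
    """Linear warm-up to peak_lr, then linear decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For instance, `linear_decay_lr(step, 3e-4, 5000, 100000)` reaches the hypothetical peak of 3e-4 at the end of a 5,000-step warm-up and ends at 3e-5. Keeping the schedule as a pure function makes it trivial to plot and to resume from any checkpointed step.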

Mixed Precision and Scaling

Public stacks use mixed precision by default, but the exact format depends on hardware and scale. Switch Transformer showed that sparse trillion-parameter models could be trained in bfloat16. OLMo materialises parameters in bfloat16 during forward and backward passes while reducing gradients in full precision. DeepSeek-V3 reports an FP8 mixed-precision training framework for extremely large-scale training. Mixed precision is no longer optional, but moving from bf16 to fp8 is an infrastructure co-design decision rather than a simple training flag.

Under-specified schedules and poor batch scaling remain common reasons for failed runs. Chinchilla’s compute-optimal scaling results are the cautionary baseline here: many large dense models were effectively undertrained relative to their parameter count, and budget allocation between parameters and tokens matters. Llama 3 explicitly uses scaling-law experiments and then trains smaller models longer than compute-optimal for better inference efficiency. That is a stronger planning pattern than simply training the biggest model that fits.
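A rough budgeting calculation makes the parameter-versus-token trade-off concrete. The sketch below relies on two widely used approximations that are assumptions here, not figures from this article: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of about 20 tokens per parameter. Real planning should fit these constants from scaling-law pilots.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: training FLOPs ~= 6 * params * tokens
OPTIMAL_TOKENS_PER_PARAM = 20  # assumption: rule-of-thumb compute-optimal ratio


def training_flops(params, tokens):
    """Approximate total training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(flops, tokens_per_param=OPTIMAL_TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) at a fixed tokens-per-parameter ratio."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return params, params * tokens_per_param


def overtraining_ratio(params, tokens):
    """Tokens seen per parameter; values far above the optimal ratio indicate a
    model deliberately trained longer to be cheaper at inference time."""
    return tokens / params
```

Under these assumptions, an 8B model trained on roughly 15T tokens sees about 1,875 tokens per parameter, far above the rule-of-thumb ratio, which is consistent with trading extra training compute for inference efficiency.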

Distributed Training and AI Infrastructure

Composable Parallelism

At modern scales, architecture and infrastructure are inseparable. Megatron-LM’s core result is that tensor, pipeline, and data parallelism can be composed to train trillion-parameter models efficiently across thousands of GPUs. Megatron Core’s current guide makes this operationally concrete: tensor parallelism splits individual layers and is recommended for large hidden sizes; pipeline parallelism splits layers by depth and is useful for very deep models; sequence and context parallel features help with long sequences; and these techniques are designed to be mixed, not chosen in isolation.

Fig 4. Large training runs are infrastructure programmes. Blue arrows show data and activation movement, rose dashed arrows show scheduler control, and orange dotted arrows show checkpoint traffic.

Technique	What it distributes	Primary benefit	Primary tax
Data parallel / DDP	Batches	Simple scale-out	Redundant optimiser and model state
FSDP / ZeRO	Model, gradients, optimiser state	Much lower memory overhead	More communication, trickier state handling
Tensor parallelism	Layer internals	Fits wider layers, good compute balance	High-bandwidth intra-layer communication
Pipeline parallelism	Layer depth	Fits deep models, lowers memory per device	Bubbles, schedule complexity
Expert parallelism	MoE experts	Sparse scale-out for MoE	Routing skew, all-to-all pressure
Context / sequence parallelism	Long sequence dimension	Long-context scaling	Additional communication and planning complexity
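Because these techniques are composed, a layout is only valid if the parallel degrees multiply out to the device count. The sketch below derives the data-parallel degree from a world size and the chosen tensor, pipeline, and context degrees. The function names are hypothetical, and expert parallelism is left out because how it overlaps with the other dimensions is framework-specific.

```python
def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel, context_parallel=1):
    """Return how many model replicas fit on world_size devices, given the
    number of devices each replica spans."""
    devices_per_replica = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % devices_per_replica != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"{devices_per_replica} devices per model replica"
        )
    return world_size // devices_per_replica


def global_batch_sequences(micro_batch, grad_accum_steps, data_parallel):
    """Sequences consumed per optimiser step across all replicas."""
    return micro_batch * grad_accum_steps * data_parallel
```

For example, 1,024 devices with tensor parallelism 8 and pipeline parallelism 4 leave a data-parallel degree of 32. A check like this belongs in the launcher so that an invalid layout fails before any devices are allocated.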

Sharding

PyTorch FSDP is now an industry-grade sharded training solution, and DeepSpeed ZeRO remains the most explicit public taxonomy of sharded states. DeepSpeed’s documentation states that ZeRO stage 1 partitions optimiser states, stage 2 also partitions gradients, and stage 3 partitions parameters as well. It also documents CPU and NVMe offload options under ZeRO-3 and ZeRO-Infinity. This is the clearest public memory strategy ladder for teams that want to move from “model barely fits” to “model scales safely”.
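The memory ladder can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for optimiser state (fp32 master weights, momentum, and variance). These byte counts and the function name are assumptions for illustration, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_device(params, devices, stage):
    """Approximate model-state bytes per device under ZeRO stages 0 to 3.

    Stage 1 shards optimiser state, stage 2 also shards gradients,
    and stage 3 also shards the parameters themselves."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2 * params    # 16-bit parameters
    grads = 2 * params      # 16-bit gradients
    optimiser = 12 * params  # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optimiser /= devices
    if stage >= 2:
        grads /= devices
    if stage >= 3:
        weights /= devices
    return weights + grads + optimiser
```

For a 7.5B-parameter model on 64 devices, this gives roughly 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per device for stages 0 to 3 under the stated assumptions, which shows why each stage changes what fits.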

Checkpointing and Restartability

Distributed checkpointing is a core design requirement, not a convenience feature. PyTorch DCP loads and saves from multiple ranks in parallel, handles load-time resharding, and allows saving under one cluster topology and loading under another. torchrun explicitly supports both fault-tolerant fixed-size jobs and elastic jobs with worker-count ranges and restart limits. Teams should test checkpoint portability and restart logic before the main run, because the underlying toolchain is designed for failures and topology changes.
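The property worth testing before the main run is that state saved under one world size can be restored under another without loss. The toy sketch below illustrates that invariant with plain Python lists. It is not the PyTorch DCP API and does not reflect how DCP works internally: the function names are hypothetical, and a real system reshards without gathering the full state in one place.

```python
def shard(flat_state, world_size):
    """Split a flat list of values into contiguous, near-equal shards, one per rank."""
    base, extra = divmod(len(flat_state), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        shards.append(flat_state[start:start + size])
        start += size
    return shards


def reshard(shards, new_world_size):
    """Rebuild the logical state from saved shards and split it for a new topology."""
    flat_state = [value for rank_shard in shards for value in rank_shard]
    return shard(flat_state, new_world_size)


def logical_state(shards):
    """Flatten shards back into the topology-independent state for comparison."""
    return [value for rank_shard in shards for value in rank_shard]
```

A dry-run test can save under four ranks, load under three, and assert that `logical_state` is unchanged. The same assertion, applied to real model and optimiser state, is the check that a checkpoint format is genuinely portable.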

JAX and TPU Infrastructure

JAX remains important wherever TPUs are strategic. Flax’s pjit guide describes an SPMD model where users specify how inputs and outputs are partitioned and the compiler performs internal partitioning and communication. Google Cloud’s current tutorials show Llama 3 8B training on GKE using JAX, Ray Train, and TPUs, and TPU v5p documentation publishes chip-level specifications including 459 BF16 TFLOPs and 95 GiB HBM per chip.

Cloud Infrastructure Patterns

Official vendor documentation shows similar high-level patterns. AWS SageMaker supports distributed data and model parallel training and also supports standard PyTorch distributed tooling. AWS’s P5 page says UltraClusters can scale to 20,000 H100 or H200 GPUs. Azure’s distributed GPU guide stresses RDMA-capable InfiniBand SKUs for near-linear scaling, and its ND H100 v5 documentation notes eight H100s per VM and scale-up to thousands of GPUs with 3.2 Tbps interconnect bandwidth per VM. GCP’s TPU documentation shows pod-scale configurations, and GKE tutorials now explicitly position TPUs for LLM training and fine-tuning workflows.

Cost Planning

Exact costs depend on region, reserved capacity, discounts, storage, network, failed runs, and engineering overhead, so any single figure should be treated as illustrative. Public list prices are still useful for order-of-magnitude planning. AWS Capacity Blocks list p5.48xlarge at US$34.608 per instance-hour, or US$4.326 per H100 GPU-hour. GCP TPU pricing lists TPU v5p at US$4.20 per chip-hour in several US regions. At those rates, 100,000 H100 GPU-hours is roughly US$432,600 on AWS before ancillary costs, and 100,000 TPU v5p chip-hours is roughly US$420,000 on GCP. Using DeepSeek-V3’s reported 2.788M H800 GPU-hours as a purely illustrative proxy, the run would correspond to more than US$12 million at AWS’s public H100 list rate, though that comparison is hardware-mismatched and almost certainly different from the model’s real effective cost.
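The arithmetic behind these estimates is reproduced below so that rates can be swapped for a team's own quotes. The prices are the list prices quoted above; the function names and the optional overhead fraction are hypothetical additions for illustration.

```python
P5_INSTANCE_HOUR_USD = 34.608   # AWS Capacity Blocks list price quoted above
GPUS_PER_P5_INSTANCE = 8        # implied by the per-GPU figure quoted above
TPU_V5P_CHIP_HOUR_USD = 4.20    # GCP list price quoted above


def accelerator_hour_rate(instance_hour_usd, accelerators_per_instance):
    """Convert an instance-hour price into a per-accelerator-hour price."""
    return instance_hour_usd / accelerators_per_instance


def run_cost(accelerator_hours, rate_usd, overhead_fraction=0.0):
    """List-price compute cost, optionally inflated by a hypothetical overhead
    fraction for failed runs, storage, and networking."""
    return accelerator_hours * rate_usd * (1.0 + overhead_fraction)


h100_rate = accelerator_hour_rate(P5_INSTANCE_HOUR_USD, GPUS_PER_P5_INSTANCE)
aws_100k = run_cost(100_000, h100_rate)
gcp_100k = run_cost(100_000, TPU_V5P_CHIP_HOUR_USD)
deepseek_proxy = run_cost(2.788e6, h100_rate)
```

Here `h100_rate` comes to about US$4.326, `aws_100k` to about US$432,600, `gcp_100k` to about US$420,000, and `deepseek_proxy` to a little over US$12 million, matching the figures in the text. The overhead fraction is the parameter most worth calibrating from a team's own history.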

The most common expensive mistakes are topology-naive parallel choices, insufficient interconnect bandwidth, and checkpoint formats that cannot be resharded. Megatron’s documentation is explicit that parallelism order should be planned with network bandwidth and latency in mind, and PyTorch DCP exists because topology changes are normal. A model that trains on one cluster layout but cannot resume elsewhere is operationally fragile.

Evaluation, Fine-tuning, and Deployment

Evaluation Surfaces

A rigorous validation stack should cover at least four surfaces: core capability, calibration and truthfulness, safety and bias or toxicity, and adversarial robustness. HELM remains the best public reference for this broader framing because it explicitly evaluates multiple scenarios and multiple metrics, including calibration, robustness, fairness, bias, toxicity, and efficiency rather than only accuracy. MMLU, BIG-bench, and GSM8K remain useful for academic and reasoning-style evaluation, but they should be treated as slices rather than a complete picture.

Capability Benchmarks

MMLU’s 57-task breadth is still useful for knowledge and problem-solving breadth, BIG-bench probes more diverse emergent or difficult tasks, and GSM8K remains a durable check on multi-step arithmetic reasoning. The failure mode is leaderboard overfitting. Once a benchmark becomes ubiquitous, contamination and test-specific tuning become real risks, which is another reason to deduplicate, track provenance, and keep internal holdouts.
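A basic contamination check looks for long word n-grams that a benchmark item shares with the training corpus. The sketch below is a minimal exact-match version. The function names are hypothetical and the n-gram length is an assumption to be tuned per benchmark, since a short n produces many false positives.

```python
def word_ngrams(text, n):
    """Return the set of lowercase, whitespace-delimited word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(training_docs, n):
    """Collect every word n-gram that appears anywhere in the training documents."""
    index = set()
    for doc in training_docs:
        index |= word_ngrams(doc, n)
    return index


def flag_contaminated(benchmark_items, index, n):
    """Return the benchmark items that share at least one n-gram with training data.
    Items shorter than n words yield no n-grams and are never flagged."""
    return [item for item in benchmark_items
            if not word_ngrams(item, n).isdisjoint(index)]
```

Flagged items can be removed from the evaluation set or reported separately, so that headline scores are not inflated by leakage.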

Safety and Truthfulness

TruthfulQA is still one of the most important public reminders that scale alone does not solve hallucination: the benchmark’s authors found that the largest models tested were generally the least truthful. RealToxicityPrompts showed that pretrained language models can produce toxic continuations even from seemingly innocuous prompts. BBQ remains useful for probing social bias in question-answering settings. These benchmarks are imperfect, but together they show why release readiness cannot be inferred from capability alone.

Calibration and Adversarial Testing

Anthropic’s “Language Models Mostly Know What They Know” is one of the most practical calibration references: models can provide useful self-evaluations under the right prompting or training setup, which helps selective answering and abstention. For adversarial testing, both DeepMind’s “Red Teaming Language Models with Language Models” and Anthropic’s large red-team study show that automated or hybrid red teaming can uncover harmful behaviours more systematically than ad hoc human probing.
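Calibration can be tracked with a simple summary statistic once a model reports a confidence for each answer. The sketch below computes expected calibration error (ECE) with equal-width bins. ECE is a standard metric chosen here for illustration and is not the specific method of the Anthropic paper; the function name and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], one per answer.
    correct: booleans of the same length, True where the answer was right."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece
```

A model that claims 90% confidence but is right half the time scores about 0.4, while a perfectly calibrated model scores 0. Tracking this alongside accuracy helps decide when abstention thresholds are trustworthy.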

Fine-tuning Options

Full fine-tuning remains the gold standard when quality is paramount and infrastructure is abundant, but parameter-efficient methods have become the production default. LoRA freezes base weights and inserts low-rank adapters, reducing trainable parameters and avoiding extra inference latency. QLoRA pushes this further by backpropagating through a frozen 4-bit quantised base model into LoRA adapters, reporting 65B fine-tuning on a single 48 GB GPU while preserving strong task performance. For most domain adaptation tasks, these methods deliver the best cost-quality-operability balance.
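The parameter saving and the absence of extra inference latency both follow from the low-rank form. The sketch below counts trainable parameters for one adapted weight matrix and shows how the adapter product can be merged back into the base weight. It uses plain nested lists so it runs without dependencies; the function names and the scale argument are assumptions for illustration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """A rank-r adapter adds B (d_out x r) and A (r x d_in): r * (d_in + d_out) values."""
    return rank * (d_in + d_out)


def matmul(left, right):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*right)]
            for row in left]


def merge_adapter(base, b_matrix, a_matrix, scale=1.0):
    """Return base + scale * (B @ A). After merging, inference uses a single
    weight matrix, so the adapter adds no extra latency."""
    delta = matmul(b_matrix, a_matrix)
    return [[w + scale * d for w, d in zip(base_row, delta_row)]
            for base_row, delta_row in zip(base, delta)]
```

For a 4,096 by 4,096 projection at rank 8, the adapter trains 65,536 values against 16,777,216 in the frozen matrix, or about 0.4%.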

Deployment Optimisation

Post-training quantisation and pruning are standard parts of shipping. AWQ is a strong weight-only quantisation baseline that protects salient weights using activation statistics and reports strong generalisation across instruction-tuned models. SparseGPT shows that one-shot pruning can reach at least 50% sparsity in massive GPT-family models with small perplexity changes. The right choice depends on whether the bottleneck is memory, latency, throughput, or hardware compatibility. Quantisation is usually the first move; pruning is more hardware- and stack-dependent.

Serving Stacks

vLLM’s PagedAttention paper is the reference public explanation of why serving economics are dominated by KV-cache management: it reports 2-4x throughput gains at the same latency by reducing KV-cache fragmentation and enabling better batching. Hugging Face TGI and NVIDIA TensorRT-LLM show the main production alternatives. TGI emphasises continuous batching and broad open-model support, while TensorRT-LLM focuses on NVIDIA-optimised deployment and quantisation features. Aggressive batching and cache optimisation reduce cost per token, but interactive applications must still tune for p95 and p99 latency and admission control.
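Two small calculations show why KV-cache management dominates serving cost. The sketch below estimates cache bytes per token for a model configuration and compares the cache slots reserved by contiguous per-request allocation with block-based allocation. It illustrates the general idea only and is not vLLM's implementation; the function names, the 16-token block size, and the model shape in the usage note are assumptions.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def reserved_slots_contiguous(sequence_lengths, max_len):
    """Token slots reserved when every request pre-allocates max_len positions."""
    return len(sequence_lengths) * max_len


def reserved_slots_paged(sequence_lengths, block_size=16):
    """Token slots reserved when each request holds only the fixed-size blocks
    it has actually started to fill."""
    return sum(math.ceil(length / block_size) * block_size
               for length in sequence_lengths)
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 in 16-bit precision, each token costs 131,072 bytes of cache. Two requests of 100 and 700 tokens reserve 4,096 slots under contiguous allocation with a 2,048-token maximum, but only 816 slots with 16-token blocks, which is the headroom that allows larger batches.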

Adaptation or serving option	Best use case	Upside	Main caution
Full fine-tuning	Highest-quality domain adaptation	Maximum flexibility	Highest cost and storage
LoRA	Most supervised domain or style tuning	Cheap, fast, low-risk	Adapter management at scale
QLoRA	Low-memory fine-tuning	Very low hardware barrier	Quantised-training workflow complexity
AWQ / weight-only quantisation	Production inference	Memory and throughput gains	Hardware and kernel dependence
SparseGPT-style pruning	Specialised inference stacks	Lower compute and memory	Not universally speed-improving on all hardware
vLLM / TGI / TensorRT-LLM	Production serving	High-throughput open-model serving	Engine-specific tuning and observability work

Reproducibility, MLOps, and Emerging Risks

Reproducibility Scope

Reproducibility in LLM training is partly scientific and partly operational. PyTorch’s reproducibility notes stress that users can control randomness and sources of nondeterminism, but exact reproducibility is still a systems problem involving software versions, kernels, data order, launch arguments, and hardware behaviour. OLMo is the strongest open example of this idea in practice because it releases not just weights but training data, code, evaluation code, logs, and many intermediate artefacts. Reproducibility at LLM scale means full-run provenance, not only seeds.
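One lightweight way to capture run provenance is to fingerprint a manifest of everything that defines the run. The sketch below hashes a canonical JSON encoding of such a manifest. The function name and the manifest fields in the usage note are hypothetical, and the manifest must contain only JSON-serialisable values.

```python
import hashlib
import json


def manifest_fingerprint(manifest):
    """Return a SHA-256 hex digest of a run manifest.

    Keys are sorted and separators fixed so that the same logical manifest
    always produces the same digest, regardless of dictionary ordering."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A manifest might record entries such as the seed, code commit, tokenizer version, data-order file hash, and launch arguments. Storing the fingerprint with every checkpoint makes it cheap to detect when two runs that should be identical were configured differently.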

Experiment Tracking and Lineage

MLflow Tracking logs parameters, code versions, metrics, output files, and run metadata. OpenLineage tracks metadata about datasets, jobs, and runs so teams can understand upstream and downstream impacts of changes. This pairing is close to the modern minimum: experiment tracking for model development and lineage for data and process provenance. It is particularly useful when datasets, filters, or tokenisers are revised during continued pre-training or post-training.

Documentation Artefacts

Dataset cards and model cards are not merely community niceties. They are becoming operational necessities because they connect internal provenance to external disclosure. Hugging Face’s documentation frames dataset cards as responsible-use artefacts and model cards as useful for discoverability, reproducibility, and sharing. Teams that cannot explain data sources, intended use, known limitations, or safety constraints usually cannot satisfy governance reviews either.

Compliance and Risk Management

NIST’s Generative AI Profile extends the AI RMF specifically for generative systems, while EU GPAI materials make clear that transparency, copyright-related rules, and risk-mitigation expectations can apply at the model layer. Some obligations reach early into the development phase once the training-compute threshold is met or expected. Governance work should therefore start before the main run, not after a checkpoint is first demoed.

Hallucination and Carbon Accounting

Truthfulness and hallucination remain unsolved foundation-model problems, which is why benchmarks like TruthfulQA and calibration methods like P(True) remain important. On environmental cost, Patterson et al. argue that model choice, data-centre choice, and processor choice can reduce carbon footprint by orders of magnitude, and the BLOOM carbon study estimated final training emissions at about 24.7 tonnes CO2eq for dynamic power alone and about 50.5 tonnes when broader lifecycle factors were included. CodeCarbon exists because teams increasingly need runtime carbon accounting rather than speculative estimates.

Operational Failure Modes

The main failure modes are silent dataset drift, irreproducible tokenisers, missing checkpoint lineage, and compliance work that starts too late. Another common mistake is reporting training cost without including experiment churn, failed jobs, extra post-training rounds, or long-running serving costs. Public studies and vendor documentation support a fuller lifecycle view: training, deployment, evaluation, and governance all have material cost and risk surfaces.

Actionable Recommendations

Model and Data Decisions

For a team planning to train an LLM from scratch or through heavy continued pre-training, the most defensible path is usually to start with a dense decoder-only baseline unless there is a clear sequence-to-sequence or MoE-specific reason not to. The size band should be explicit, such as 7B-13B, 30B-70B, or frontier/MoE, and should be validated through scaling-law pilots rather than inherited from market hype.

The data programme should begin with a source registry and governance policy that captures provenance, licence, opt-out handling, PII policy, and retention before ingestion starts. Training corpora should be deduplicated internally and then checked against evaluation sets, with benchmark holdouts kept physically and procedurally separate from training data. Tokenizer compression and non-English or domain coverage should be validated on representative samples before committing the vocabulary.

Infrastructure and Post-training Decisions

The infrastructure programme should dry-run the full distributed stack before the main run, including checkpoint save and load under a different topology. AdamW plus warm-up, explicit clipping, and mixed precision remain the safest public defaults, and major training knobs should be changed one at a time. Post-training should stay modular: supervised fine-tuning first, then DPO or RLHF only where it materially improves target metrics.

Release Readiness

Release readiness should combine capability, safety, truthfulness, calibration, and adversarial tests. For most domain adaptation work, LoRA or QLoRA will be more economical than full fine-tuning unless the quality target justifies the extra infrastructure. Serving should be optimised for the product regime that matters, such as interactive latency or batch throughput, and the team should track runs, lineage, model cards, and carbon from the first experiment rather than only near release.

Public reports still under-specify some load-bearing details, especially exact frontier data mixtures, total failed-run overhead, and negotiated infrastructure costs. Cost estimates that use public cloud list pricing should therefore remain illustrative, and fine-grained hyperparameter values should always be re-tuned on the actual data, hardware, and context-length regime.

References and Further Reading

GPT-3: Language Models are Few-Shot Learners - dense decoder-only scaling baseline.
Training Compute-Optimal Large Language Models - Chinchilla scaling law reference for token and parameter budgeting.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer - T5 and encoder-decoder text-to-text transfer.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding - sparse conditional computation and sharding.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - MoE training in bfloat16.
DeepSeek-V3 Technical Report - public MoE example with 671B total parameters and 37B active per token.
OLMo: Accelerating the Science of Language Models and OLMo 2 - transparent open training workflows.
The Llama 3 Herd of Models - modern dense recipe, long-context training, and post-training.
Training Language Models to Follow Instructions with Human Feedback and Direct Preference Optimization - post-training and preference optimisation.
Megatron-LM, Megatron Core documentation, DeepSpeed ZeRO, and PyTorch distributed documentation - distributed training systems.
PyTorch Distributed Checkpoint and torchrun elastic launcher - restartability and topology-aware checkpointing.
HELM, MMLU, GSM8K, TruthfulQA, RealToxicityPrompts, and BBQ - broader model evaluation.
LoRA, QLoRA, AWQ, and SparseGPT - adaptation, quantisation, and pruning.
vLLM and PagedAttention, Hugging Face Text Generation Inference, and NVIDIA TensorRT-LLM - production serving options.
Hugging Face dataset cards, Hugging Face model cards, MLflow Tracking, and OpenLineage - documentation, tracking, and lineage.
NIST AI RMF Generative AI Profile and European Commission GPAI materials - governance and regulatory context.
