

LLM Fine-Tuning


Executive Summary

Fine-tuning for large language models has moved from a single practice to a layered post-training stack. In current practice, most teams do not begin with full end-to-end weight updates. They usually start with supervised fine-tuning on curated instruction or task data and use parameter-efficient fine-tuning such as LoRA or QLoRA for cost control. When they need stronger behavioural shaping, they add preference optimisation: usually DPO and, less often in applied settings, RLHF with PPO.

This shift is driven by three realities. Full fine-tuning is expensive to train and store. PEFT methods can achieve competitive quality with far lower operational burden. Simpler preference methods often deliver a better complexity-to-benefit ratio than full RLHF pipelines.

For most practitioners, the best default is SFT plus LoRA, moving to QLoRA when GPU memory is the binding constraint, and adding DPO only after a strong SFT baseline and a trustworthy evaluation harness are in place. Full fine-tuning still matters, but mostly when the organisation has large proprietary corpora, significant compute, mature MLOps, and a strong reason to reshape the entire model rather than attach specialisation modules. Adapters, IA3, prompt tuning, prefix tuning, and BitFit remain useful, but they are usually niche choices relative to LoRA and QLoRA in mainstream LLM adaptation.

Governance and security are not side topics in fine-tuning. The model lifecycle introduces distinct risks around training-data provenance, PII exposure, value-chain integrity, model theft, extraction, poisoning, prompt injection during testing or synthetic-data generation, and unsafe behavioural drift after adaptation. NIST’s AI RMF and Generative AI Profile connect these risks to governance, content provenance, pre-deployment testing, incident disclosure, third-party risk, and continuous monitoring.

In practice, technical fine-tuning choices should be paired with documented lineage, dataset and model cards, run tracking, versioned rollback, PII minimisation and de-identification, access controls, encryption in transit and at rest, and, when warranted, protection of data in use through confidential-computing patterns. Fine-tuning is not complete when the loss curve looks good. It is complete when the organisation can explain, reproduce, deploy, govern, and safely reverse the model update.

Acronyms and Key Concepts

| Term | Meaning | Why it matters |
|---|---|---|
| LLM | Large language model | The base model being adapted through fine-tuning or post-training. |
| SFT | Supervised fine-tuning | The usual first adaptation stage, where the model learns from curated demonstrations or task outputs. |
| PEFT | Parameter-efficient fine-tuning | A family of methods that freeze most base-model weights and train only small added or selected parameter sets. |
| LoRA | Low-rank adaptation | The most common PEFT method for LLMs; it adds small trainable low-rank matrices to selected linear layers. |
| QLoRA | Quantised LoRA | LoRA applied on top of a frozen low-bit base model, usually to reduce GPU memory requirements. |
| Adapter | Bottleneck module inserted into the model | A modular PEFT approach useful when many task-specific variants need to be stored, composed, or rolled back. |
| IA3 | Infused adapter by inhibiting and amplifying inner activations | A tiny PEFT method that learns activation-scaling vectors instead of larger adapter modules. |
| BitFit | Bias-only fine-tuning | A very small adaptation method that updates only bias parameters and is useful as a low-cost baseline. |
| Prompt tuning | Learned virtual input tokens | A prompt-side method that keeps the model frozen and learns soft prompt embeddings. |
| Prefix tuning | Learned per-layer key/value prefixes | A prompt-side method that conditions generation by injecting learned prefixes into attention layers. |
| DPO | Direct preference optimisation | A preference-tuning method that learns from chosen/rejected response pairs without a separate reward-model RL loop. |
| RLHF | Reinforcement learning from human feedback | A post-training approach that usually trains a reward model from preferences and optimises the model with reinforcement learning. |
| PPO | Proximal policy optimisation | The reinforcement-learning algorithm commonly associated with classic RLHF pipelines. |
| PII | Personally identifiable information | Sensitive data that must be minimised, protected, masked, or governed before fine-tuning. |
| FSDP | Fully sharded data parallel | A PyTorch distributed-training strategy that shards model parameters, gradients, and optimiser state. |
| AMP | Automatic mixed precision | A training technique that uses lower precision where safe to reduce memory and improve throughput. |
| CI/CD | Continuous integration and continuous delivery | The release discipline needed to evaluate, promote, monitor, and roll back fine-tuned models or adapters. |
| Model card | Structured model documentation | Captures intended use, limitations, training context, evaluation results, and lineage. |
| Dataset card | Structured dataset documentation | Captures provenance, collection context, intended use, quality concerns, and known bias. |
| Confidential computing | Hardware-backed protection for data in use | Relevant when fine-tuning on highly sensitive data that must be protected even during computation. |

Scope and Assumptions

Budget and Provider Neutrality

This report assumes no specific budget and no specific cloud provider. Hardware and storage figures are therefore presented as practitioner-grade ranges and relative comparisons, not fixed totals. They vary materially with model family, sequence length, optimiser, batch size, precision, activation checkpointing, parameter sharding, and whether the model is decoder-only or encoder-decoder. Official PyTorch and Hugging Face documentation also make clear that mixed precision, sharding, and quantisation choices can substantially change the effective memory envelope.

The legal and privacy discussion is governance guidance, not legal advice. Where laws vary by jurisdiction, this report emphasises durable principles that recur across frameworks: data minimisation, documented lawful basis or consent where required, privacy by design, integrity and confidentiality, and auditable accountability. The GDPR is relevant because it states these principles directly and concretely; NIST is relevant because it translates them into operational risk-management actions for AI systems.

Method and Hyperparameter Caveats

Some seminal methods published before 2021 are included because important adapter methods pre-date the report's main time horizon. When a typical hyperparameter is given, it should be read as a starting band for tuning, not a universal prescription. Official TRL and PEFT documentation expose defaults such as SFT learning rates, LoRA rank and scaling parameters, and DPO beta, but these remain starting points rather than guarantees of optimality.

Fine-Tuning Technologies and Architectures

Architectural Map

Modern fine-tuning techniques differ mainly in where they inject trainable capacity: into all weights, into low-rank residual paths, into bottleneck modules, into activation scalers, into bias terms, or into learned prompt-like vectors. Figure 1 synthesises the main architectural families as a conceptual map derived from the original method papers and official PEFT documentation.

Pastel architecture diagram showing full fine-tuning, prompt-side methods, and PEFT weight-side methods around a transformer model.
Fig 1. Fine-tuning methods differ by the location and size of the trainable surface. Full fine-tuning changes the backbone, prompt-side methods add learned context, and PEFT methods add small trainable paths inside the model.

Technique Comparison

The table below compares full fine-tuning, adapters, LoRA, QLoRA, prompt tuning, prefix tuning, BitFit, and IA3. Parameter percentages are typical order-of-magnitude ranges, not constants. The exact number depends on model architecture and which modules are targeted.

| Technique | Description and architecture | Trainable params | Training memory profile | Inference impact | Storage overhead | Typical starting hyperparameters | Best use cases | Example tools |
|---|---|---|---|---|---|---|---|---|
| Full fine-tuning | Updates the full model end to end; no frozen backbone. Best when maximum representational freedom is required. | 100% | Highest; 7B+ models usually need multi-GPU training or sharding/offload because parameters, gradients, and optimiser states dominate memory. | None beyond the new model itself | Full new checkpoint, often many GB | SFT often starts near a learning rate of 2e-5, with bf16 and gradient checkpointing on, followed by a sweep. | Large proprietary corpora; domain transfer where PEFT underfits; strategic base-model ownership | Transformers Trainer, TRL SFTTrainer, PyTorch FSDP, DeepSpeed-style backends |
| Adapters | Inserts small bottleneck modules after attention and feed-forward layers while the base model is frozen. | ~0.5%-5% | Much lower than full fine-tuning; often feasible on a single high-memory GPU for 7B-class models with checkpointing. | Small added forward-pass cost; bottleneck adapters with a nonlinearity generally cannot be merged into the base weights, so the cost is reduced only through serving optimisation | Small adapter checkpoint | Adapter-specific learning rates are often higher, around 1e-4; longer training can be useful. | Multi-task portfolios, modular composition, many task-specific variants | AdapterHub / Adapters library, AdapterTrainer |
| LoRA | Freezes the base model and injects low-rank matrices into selected linear layers, usually attention projections. | ~0.01%-1% | Low; one of the best quality-to-VRAM trade-offs for practical LLM tuning. | None if merged into base weights; otherwise slight overhead | Very small adapter file | r=8, lora_alpha=8 or 16-32, lora_dropout=0-0.1, explicit target modules | Default choice for instruction tuning, domain adaptation, multi-tenant customisation | PEFT LoRA, Transformers, TRL |
| QLoRA | Applies LoRA on top of a frozen 4-bit quantised base model using NF4, double quantisation, and paged optimisers. | Same as LoRA | Lowest among high-quality methods for large LLMs; the original paper showed 65B fine-tuning on a single 48GB GPU. | Usually low, but depends on deployment kernels and whether inference stays quantised. | Tiny adapter plus quantised base | Same LoRA knobs plus 4-bit config; NF4/FP4-capable backend | Memory-constrained fine-tuning, single-GPU experimentation, adapter farms | PEFT + bitsandbytes + Transformers |
| Prompt tuning | Learns soft prompt embeddings prepended to the input while keeping the backbone frozen. | Usually <0.1% | Very low | Slight prompt-length overhead because extra virtual tokens are processed | Extremely small | num_virtual_tokens, prompt initialisation, token dimension metadata | Very large shared base model, many small task variants, rapid prototyping | PEFT PromptTuningConfig |
| Prefix tuning | Learns continuous prefixes that act as per-layer key/value prompts. | Around 0.1% in the original paper; exact count grows with layers and virtual tokens | Very low to low | Small overhead because prefixes expand the effective context / KV path | Very small | num_virtual_tokens, optional prefix projection, layer count, token dimension | Low-data generation tasks, controllable style conditioning | PEFT PrefixTuningConfig |
| BitFit | Fine-tunes only bias terms; the base model is otherwise frozen. | ~0.01%-0.1% | Very low | None or negligible | Tiny | Select which bias tensors to unfreeze; standard optimiser settings, sometimes higher LR than full FT | Strong baseline, ablations, low-cost adaptation | Standard PyTorch / Transformers selective unfreezing |
| IA3 | Learns multiplicative vectors that rescale attention keys, values, and feed-forward activations. | Tiny; often hundredths of a percent | Very low | Negligible to minimal | Tiny | Choose target_modules and feedforward_modules | Resource-constrained tuning with tiny trainable surface | PEFT IA3Config |

Full Fine-Tuning

Full fine-tuning gives the model maximum freedom because every parameter can move. That is useful when the task distribution is far from the base model’s original instruction or domain distribution. The price is steep: large memory demand, heavy checkpoint storage, harder reproducibility, slower iteration, and a larger blast radius if the run goes wrong. It is the most powerful option technically, but rarely the most economical first move operationally.

LoRA and QLoRA

LoRA is the current practical workhorse because it attacks the real operational bottlenecks: trainable parameter count, per-task storage, and iteration speed. It usually improves the cost-quality-operability balance without forcing major changes in serving architecture, especially when adapters are merged before deployment. Its main weakness is that it still requires careful target-module selection and rank tuning, and it can underfit when the task truly needs broad representational rewiring.
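The parameter savings and the merge step can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the standard LoRA formulation, in which a frozen weight W receives the update (alpha / r) * B @ A, and it uses plain Python lists rather than any tensor library. The layer size, rank, and helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A, the merged weight used at inference."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 4096 x 4096 attention projection adapted with rank 8.
full = 4096 * 4096
added = lora_param_count(4096, 4096, r=8)
fraction = added / full  # share of this layer's weights that is trainable

# Tiny 2 x 2 example with rank 1 to show the merge itself.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # shape (r, d_in) = (1, 2)
b = [[0.5], [0.0]]  # shape (d_out, r) = (2, 1)
merged = merge_lora(w, a, b, alpha=2.0, r=1)
```

Because the merged matrix has the same shape as the original weight, a merged model serves with no extra forward-pass cost, which is why merging before deployment leaves the serving architecture unchanged.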

QLoRA is the choice when VRAM, not model ambition, is the constraint. It is particularly attractive for single-node experimentation and for teams that want to fine-tune large models without moving immediately to expensive multi-GPU clusters. The trade-off is additional dependency on quantisation kernels and more sensitivity to implementation details. It is excellent for training efficiency, but production serving choices should still be validated against actual latency and quality targets.

Adapters, Prompt Methods, BitFit, and IA3

Adapters remain valuable where modularity matters as much as raw efficiency. They are easier to reason about for some organisations because the adapter itself is a visible module that can be composed, versioned, and shared. Their main downside is extra runtime path length: unlike LoRA updates, bottleneck adapters contain a nonlinearity and generally cannot be folded into the base weights, so the overhead has to be reduced through serving optimisation. AdapterHub’s ecosystem remains the strongest open-source expression of the modular-adapter philosophy.

Prompt tuning and prefix tuning are best understood as prompt-space PEFT. They are compelling when the base model is already very capable and the goal is lightweight steering rather than substantial behavioural rewiring. Prefix tuning is usually more expressive than simple soft prompts because it touches every layer’s attention state, but that also means more dependence on sequence mechanics. Prompt tuning becomes more competitive as model size increases, which is one reason it remains relevant despite LoRA’s popularity.

BitFit and IA3 are often underrated because they are so small. BitFit is an excellent sanity-check baseline: if bias-only tuning gets most of the way, the task may not justify heavier adaptation. IA3 is the cleaner choice when the goal is a tiny trainable surface while still intervening in internal routing and activation scaling. Both methods are compelling in resource-constrained or high-multiplexing settings, but less often the first industrial choice for broad instruction tuning than LoRA.
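As a rough illustration of how small the BitFit surface is, the sketch below selects bias terms by parameter name. The names are hypothetical. In a PyTorch model the same filter would typically decide which parameters keep requires_grad enabled. Some decoder-only architectures ship linear layers without bias terms, in which case BitFit has little to train, so it is worth checking before using it as a baseline.

```python
def bitfit_trainable(param_names):
    """Return the parameter names a bias-only (BitFit-style) run would train."""
    return [n for n in param_names if n == "bias" or n.endswith(".bias")]


# Hypothetical parameter names for one transformer block.
names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.q_proj.bias",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.up_proj.bias",
    "layers.0.norm.weight",
]

trainable = bitfit_trainable(names)
frozen = [n for n in names if n not in trainable]
```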

Preference Optimisation and Post-Training Workflow

Layered Post-Training

Fine-tuning a useful assistant usually proceeds in layers. A common modern path is instruction or task data for SFT, then pairwise preference data for DPO or RLHF/PPO, then targeted safety tuning, followed by gated deployment and monitoring. OpenAI’s InstructGPT paper made the canonical SFT-then-RLHF pattern visible. DPO later showed that many preference-alignment objectives can be solved more simply, without training a separate reward model and running a full online RL loop. Anthropic’s Constitutional AI work showed that harmlessness tuning can itself use supervised and preference-style stages, including AI feedback.

Pastel flowchart showing source review, data cleaning, SFT, evaluation, preference optimization, safety tuning, model lineage, rollout, monitoring, rollback, and promotion.
Fig 2. Fine-tuning should be treated as a governed release workflow. Preference optimisation is only one stage between data controls, evaluation, safety gates, rollout, monitoring, and rollback.

SFT, DPO, and RLHF

| Method | What it optimises | Extra components | Compute and operability profile | Typical starting knobs | Strengths | Main trade-offs | Best use cases |
|---|---|---|---|---|---|---|---|
| SFT | Negative log-likelihood on demonstrations or curated outputs. | No reward model required | Lowest complexity; easiest to debug and reproduce. | LR, batch size, context length, packing, assistant-only loss, PEFT config | Stable, cheap, easy to evaluate, strong for domain and instruction learning | Copies demonstrator style and errors; does not directly optimise pairwise preference | Always the baseline; often sufficient for task/domain adaptation |
| RLHF with PPO | Uses a reward model trained from human preferences, then optimises the policy with PPO. | Preference dataset, reward model, online sampling loop, RL trainer | Highest complexity; more moving parts, more instability risk, more infrastructure overhead | PPO clip / KL control and implementation-specific knobs | Can optimise non-trivial behavioural objectives and online reward shaping | Expensive and fragile; reward hacking, instability, harder reproducibility | Frontier alignment work and cases where reward modelling is central |
| DPO | Directly fits preferences with a classification-style loss, avoiding a reward-model-plus-RL loop. | Preference pairs and a reference policy | Lower complexity than PPO-based RLHF; easier to operationalise | TRL defaults include learning_rate=1e-6, beta=0.1, loss_type="sigmoid", gradient_checkpointing=True | Simpler training, lower implementation burden, often strong quality | Only as good as the preference data; can overfit preference style | Most production preference tuning when pairwise data is available |
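The classification-style DPO loss in the table can be written out for a single preference pair. This is a minimal sketch of the sigmoid variant, assuming each argument is the summed log-probability of a full response under either the policy being trained or the frozen reference model. The function name and the example values are illustrative, not taken from any library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(x)) computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss falls.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference terms are what keep the policy anchored: the loss rewards widening the chosen-versus-rejected gap relative to the reference, not simply raising the likelihood of the chosen text.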

Practitioner Workflow

A robust workflow begins with data triage, not training. Confirm licence terms, usage rights, sensitive-data rules, provenance, and intended purpose before sampling examples for tuning. NIST specifically recommends diligence on training data usage, protection of third-party IP and training data, and re-evaluating risk when fine-tuned models are adapted to new domains.

Next, create a high-quality SFT baseline. Use PEFT first unless there is strong evidence that full fine-tuning is required. The SFT run should be tracked end to end: base model, exact dataset version, tokenizer revision, prompt template, precision, seed, optimiser, evaluation results, and artefacts. MLflow is useful here because it logs parameters, code versions, metrics, and output artefacts for each run. Model cards complement it by recording datasets, evaluation results, and base-model relationships for fine-tuned, adapter, and quantised models.
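One lightweight way to make that tracking concrete is to capture the run inputs in a single immutable record and derive a stable fingerprint from it. The sketch below is an assumption-laden illustration, not an MLflow feature: the class, field names, and example values are hypothetical, and in practice the record would be logged alongside the run in whatever tracking system is in use.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Inputs that must be pinned for an SFT run to be reproducible."""
    base_model: str
    dataset_version: str
    tokenizer_revision: str
    prompt_template: str
    precision: str
    seed: int
    optimizer: str

    def fingerprint(self) -> str:
        # Sorted keys make the hash independent of field ordering.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


# Hypothetical values for illustration only.
run_a = RunManifest("example-base-7b", "support-v3", "rev-1",
                    "chat-template-v2", "bf16", 42, "adamw")
run_b = RunManifest("example-base-7b", "support-v3", "rev-1",
                    "chat-template-v2", "bf16", 43, "adamw")
```

Two runs with the same fingerprint claim identical inputs, so a metric difference between them points at nondeterminism or an untracked factor rather than a configuration change.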

Only then should preference optimisation be considered. If the gap is mostly style or ranking among already-good outputs, DPO is usually the better first choice because it is operationally lighter than PPO-based RLHF. If online reward shaping, dynamic exploration, or richer reward modelling is required, RLHF/PPO still has a place, but it should be adopted deliberately and with more evaluation scaffolding.

Safety tuning should be a distinct training phase. NIST recommends red-teaming for prompt injection, adversarial prompts, data poisoning, membership inference, model extraction, and related ML attacks, and it advises verifying that fine-tuning does not compromise safety and security controls. Constitutional AI is a useful research reference for structured harmlessness tuning beyond ordinary task optimisation.

AI Infrastructure for Fine-Tuning

Infrastructure Stack

Fine-tuning infrastructure is more than a training script. It includes dataset storage, preprocessing jobs, training clusters, model and adapter artifact storage, registries, serving infrastructure, observability, and governance controls. The architecture is smaller than full pre-training infrastructure, but it has a wider operational surface because adapters, datasets, experiments, and release aliases multiply quickly.

Pastel architecture diagram showing data lake, preprocessing jobs, training cluster, artifact store, model registry, serving layer, observability, and governance controls.
Fig 3. A fine-tuning platform needs data controls before training, artifact and registry controls after training, and observability across both training and serving. PEFT reduces model-change size, but it does not remove the need for lineage and rollback infrastructure.

Compute and Memory Planning

The dominant memory contributors depend on the method. Full fine-tuning has to carry model parameters, gradients, optimiser states, activations, and often sharded checkpoint state. LoRA reduces the trainable surface while keeping the base model frozen. QLoRA goes further by quantising the base model to 4-bit and training small LoRA parameters on top. Mixed precision, activation checkpointing, FSDP or ZeRO-style sharding, sequence length, and batch size can all shift the practical envelope.
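A back-of-envelope estimate shows why these choices matter. The sketch below rests on a commonly used rule of thumb rather than a measurement: mixed-precision training with an Adam-style optimiser needs about 16 bytes per trainable parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 master weights and two optimiser moments). Activations, quantisation constants, adapter parameters, and framework overhead are deliberately ignored, so real runs need more headroom.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone at a given storage precision."""
    return n_params * bits_per_param / 8 / GIB


def full_finetune_state_gib(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights, gradients, and optimiser state under the assumed rule of thumb.

    Activations are not included.
    """
    return n_params * bytes_per_param / GIB


full_7b = full_finetune_state_gib(7e9)        # full fine-tuning state, 7B model
frozen_7b_16bit = weight_memory_gib(7e9, 16)  # frozen 16-bit base, as in LoRA
frozen_65b_4bit = weight_memory_gib(65e9, 4)  # frozen 4-bit base, as in QLoRA
```

Under these assumptions a 7B full fine-tune needs roughly 100 GiB before activations, while the frozen 4-bit weights of a 65B model come to roughly 30 GiB, which is consistent with the QLoRA result of fitting a 65B fine-tune on a single 48GB GPU.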

For most teams, the engineering question is not “Can the run fit once?” but “Can the run be repeated, evaluated, checkpointed, secured, and rolled back?” A single successful notebook run is not production infrastructure. A production-grade setup needs reproducible configuration, versioned datasets, secrets isolation, role-based access to artifacts, scheduled evaluation, and promotion controls.

Storage, Registry, and Serving

PEFT methods materially change storage economics. Full fine-tuning produces a full new checkpoint for each adapted model. LoRA, QLoRA, adapters, IA3, BitFit, prompt tuning, and prefix tuning can produce tiny task artifacts relative to the base model. That enables adapter portfolios, tenant-specific variants, and faster rollback by switching adapter aliases rather than replacing the entire serving checkpoint.

Serving design still matters. Some deployments merge LoRA adapters into the base model before release to avoid runtime overhead. Others keep adapters dynamic so one base model can serve multiple task or tenant variants. Dynamic adapters increase operational flexibility, but they also require stricter registry discipline, compatibility checks, and runtime observability so that the active base-model and adapter combination is always known.

Evaluation, Reproducibility, and Operations

Evaluation Surfaces

Evaluation for fine-tuned LLMs should be multi-objective, not a benchmark monoculture. HELM is the clearest formal statement of this: it argues that common practice over-indexed on accuracy and that robust evaluation should also cover calibration, robustness, fairness, bias, toxicity, and efficiency. For practical teams, this means combining task-specific business metrics, general capability benchmarks, and safety/security tests tied to the intended deployment context.

| Evaluation layer | What to measure | Useful benchmarks / metrics | Caveats |
|---|---|---|---|
| Core task quality | Accuracy, exact match, F1, ROUGE, pass@k, domain-specific acceptance rate | Task-native metrics first; GSM8K for reasoning; MMLU for broad knowledge | General benchmarks do not substitute for the real task distribution |
| Conversational preference | Human preference rate, pairwise win rate, judge model agreement | MT-Bench and LLM-as-a-judge approaches | Judge models have position, verbosity, and self-enhancement biases |
| Truthfulness / hallucination | Truthfulness, groundedness, hallucination rate | TruthfulQA, HaluEval | One benchmark is never enough; use task-grounded checks too |
| Safety and resilience | Refusal accuracy, unsafe-response rate, jailbreak success, prompt-injection success, extraction / inference / leakage attempts | NIST-style red-team testing and security metrics | Safety behaviour is domain- and policy-specific |
| Efficiency / operability | Tokens/sec, GPU-hours, memory footprint, storage per task, rollback readiness | Experiment tracking and deployment telemetry | Efficiency regressions can offset modest quality gains |

Chatbot-style leaderboards should be treated carefully. The QLoRA paper warned that then-current chatbot benchmarks were not trustworthy enough to measure chatbot performance accurately, while MT-Bench showed that automated judging can be useful but also has identifiable biases. Use them as signals, not promotion gates on their own.
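Position bias in particular has a simple, commonly used mitigation: judge each pair twice with the answer order swapped and count a win only when both verdicts agree. The sketch below assumes a hypothetical judge callable that takes a prompt and two answers and returns "first", "second", or "tie". It stands in for whatever judge model or human process is actually used.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Judge a pair in both orders; return "a", "b", or "tie"."""
    verdict_ab = judge(prompt, answer_a, answer_b)
    verdict_ba = judge(prompt, answer_b, answer_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    # Inconsistent or tied verdicts are treated as a tie.
    return "tie"


def win_rate(verdicts, side="a"):
    """Share of all comparisons won by one side; ties count for neither."""
    return verdicts.count(side) / len(verdicts) if verdicts else 0.0


def always_first(prompt, x, y):
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(prompt, x, y):
    """Toy judge with a consistent (if questionable) preference."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


biased = debiased_preference(always_first, "q", "answer one", "answer two")
consistent = debiased_preference(prefers_longer, "q", "a long answer", "short")
```

The swap removes position bias only. The second toy judge shows that a consistent verbosity preference passes straight through, which is one reason judge scores should be signals rather than promotion gates on their own.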

Reproducibility and Model CI/CD

PyTorch documentation is blunt: fully reproducible results are not guaranteed across releases, commits, or platforms, and even CPU/GPU runs can diverge despite identical seeds. The practical playbook is still useful: seed relevant random number generators, disable nondeterministic paths where needed, and accept that determinism can reduce single-run performance while improving debugging and regression testing.
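A seeding helper along these lines is a common starting point. The sketch below is illustrative: the function name is hypothetical, NumPy and PyTorch are treated as optional and skipped if not installed, and enabling deterministic algorithms may slow training or raise errors for operations that have no deterministic implementation.

```python
import os
import random


def seed_everything(seed: int, deterministic: bool = False) -> None:
    """Seed the common random number generators used in a training run."""
    random.seed(seed)
    # Only affects Python processes started after this point (subprocesses).
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if deterministic:
            # Trades speed for repeatability; some ops may raise instead.
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass


seed_everything(123)
first_draw = random.random()
seed_everything(123)
second_draw = random.random()
```

Even with seeding in place, results should be compared within a fixed software and hardware environment, since identical seeds do not guarantee identical results across platforms or releases.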

For model CI/CD, three capabilities matter most. Run tracking records parameters, code versions, metrics, and output files for each run. Registry and rollback provide lineage, versioning, aliases, metadata, and explicit return to previous model versions. Lineage beyond models tracks datasets, jobs, and run metadata across schedulers, ETL jobs, and validation steps.

A mature fine-tuning CI/CD path uses version control for code and prompts, dataset versioning and dataset cards, automated training runs with fixed configs, post-train evaluation gates, model cards with base-model and dataset metadata, registry promotion to staged aliases, canary or shadow deployment, runtime monitoring for task, safety, and security metrics, and one-command rollback to the last known-good model or adapter. Hugging Face model cards support base-model relations for fine-tunes, adapters, and quantised derivatives, which is exactly the provenance metadata practitioners need for model lineage.
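The post-train evaluation gate in that path can be as simple as a function that compares a candidate with the current baseline and returns explicit reasons for any rejection. The sketch below is a hypothetical example, assuming every metric is expressed so that higher is better. The metric names and thresholds are placeholders, not recommendations.

```python
def promotion_decision(candidate, baseline, min_gain, max_regression):
    """Return (approved, reasons) for promoting a candidate model.

    min_gain: metrics that must improve by at least the given amount.
    max_regression: metrics that may drop by at most the given amount.
    """
    reasons = []
    for metric, required in min_gain.items():
        gain = candidate[metric] - baseline[metric]
        if gain < required:
            reasons.append(f"{metric} gained {gain:.3f}, needs {required:.3f}")
    for metric, allowed in max_regression.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            reasons.append(f"{metric} regressed {drop:.3f}, limit {allowed:.3f}")
    return (not reasons, reasons)


# Hypothetical evaluation results.
baseline = {"task_f1": 0.80, "safe_response_rate": 0.99}
candidate = {"task_f1": 0.84, "safe_response_rate": 0.97}

approved, reasons = promotion_decision(
    candidate,
    baseline,
    min_gain={"task_f1": 0.02},
    max_regression={"safe_response_rate": 0.01},
)
```

In this example the candidate improves the task metric but is still rejected because the safety metric regresses beyond its limit, which is the behaviour a gate should have when quality and safety pull in different directions.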

Governance, Data Governance, and Security

Lifecycle Risk Management

The most useful governance frame for fine-tuning is to treat it as a lifecycle risk-management process, not just an experiment. NIST’s Generative AI Profile states that organisations should govern, map, measure, and manage risks across design, development, deployment, operation, and decommissioning. Key GenAI considerations include governance, content provenance, pre-deployment testing, incident disclosure, third-party risk, and continuous monitoring.

Data Governance

Data governance starts with purpose, minimisation, and lawful basis. The GDPR states that personal data should be adequate, relevant, and limited to what is necessary. It also requires privacy-by-design measures such as pseudonymisation and data-minimisation safeguards, demonstrable consent where consent is the legal basis, withdrawal mechanisms, and heightened constraints for special-category data. These are exactly the questions raised by fine-tuning on support logs, chat transcripts, tickets, CRM text, or health, finance, or HR documents.

Operationally, each data source should have a documented reason for inclusion. Unnecessary PII should be removed or masked. Evidence of consent or another lawful basis should be retained where required. Retention windows should be bounded. Dataset cards should describe provenance, creation context, and known biases. Microsoft Presidio is a practical open-source option for PII identification and anonymisation, but its own documentation warns that automated detection is not guaranteed to find all sensitive information, so defence-in-depth remains necessary.
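To illustrate the masking step, and why defence-in-depth is needed, the sketch below replaces e-mail addresses and phone-like numbers with placeholders using two regular expressions. It is deliberately simplistic and is not Presidio: the patterns are illustrative assumptions that will miss names, addresses, identifiers, and many international formats.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text


masked = mask_pii("Contact jane.doe@example.com or +1 (555) 010-9999 today")
untouched = mask_pii("Order 12345 shipped")
```

Masking before tokenisation keeps the placeholder consistent across the dataset, and a sampled manual review of the masked output remains a sensible second control.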

NIST’s GenAI guidance also recommends detecting PII or sensitive data in generated outputs, allowing subjects to withdraw participation or revoke future use of their data, using anonymisation or differential-privacy-style privacy-enhancing techniques where suitable, and conducting diligence on training data usage to assess privacy and IP risk. These controls are especially important when fine-tuning on synthetic data or derivatives of third-party models and corpora, because provenance drift and rights drift are common failure modes.

Security Architecture and Controls

A fine-tuning stack should assume a threat model that includes data poisoning, prompt injection in evaluation or synthetic-data loops, membership inference, model extraction, model-weight theft, inference bypass, sponge-style resource exhaustion, and value-chain compromise through third-party libraries, APIs, adapters, or datasets. NIST names these attack families directly in its GenAI profile and recommends AI red-teaming against them.

The baseline control families are familiar but must be made concrete for model training: access control, incident response, system and communications protection, system and information integrity, risk assessment, and supply-chain risk management all appear explicitly in NIST SP 800-53 Rev. 5. In practice, that means role-based access to datasets, training jobs, model artefacts, and registries; short-lived credentials; secrets isolation; network segmentation; encrypted artifact stores; signed containers and model bundles; and policy gates for third-party dependencies and open-source checkpoints.
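One small building block for the integrity controls above is a digest manifest that is written when artefacts are produced and checked before they are loaded. The sketch below is an illustration using in-memory bytes and hypothetical artefact names. A digest alone is not a signature, so in practice the manifest itself would need to be signed or stored somewhere the training job cannot modify.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an artefact's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict) -> dict:
    """Map each artefact name to the digest of its contents."""
    return {name: sha256_hex(data) for name, data in artifacts.items()}


def verify_artifact(name: str, data: bytes, manifest: dict) -> bool:
    """True only if the artefact is listed and its digest matches."""
    expected = manifest.get(name)
    if expected is None:
        return False
    return hmac.compare_digest(expected, sha256_hex(data))


# Hypothetical artefacts produced by a training job.
artifacts = {
    "adapter_model.bin": b"adapter-weights",
    "adapter_config.json": b"{}",
}
manifest = build_manifest(artifacts)

intact = verify_artifact("adapter_model.bin", b"adapter-weights", manifest)
tampered = verify_artifact("adapter_model.bin", b"adapter-weights-modified", manifest)
unknown = verify_artifact("extra.bin", b"data", manifest)
```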

For highly sensitive workloads, confidential computing is no longer theoretical. AWS Nitro Enclaves provides isolated, hardened environments with no persistent storage, no interactive access, cryptographic attestation, and KMS integration, and positions the feature for sensitive data such as PII and private-key handling. Google Cloud Confidential Computing emphasises protection of data in use, inline memory encryption, confidentiality against privileged access, and trusted-execution-style support for AI/ML collaboration and training scenarios. Because the provider is unspecified, these should be understood as illustrative patterns, not vendor recommendations.

Monitoring, Rollback, and Incident Response

Monitoring should cover more than tokens per second. NIST recommends metrics for provenance effectiveness, unauthorised access attempts, inference and extraction incidents, and the rate at which lessons from security incidents are incorporated into the system. It also recommends continuous monitoring of third-party GenAI systems in deployment, explicit fallbacks, redundancy for model weights and system artefacts, and testing of rollover and fallback technologies.

For rollback, the best operational pattern is to treat models and adapters as promotable, versioned release units. MLflow’s registry model is useful because it combines lineage, versioning, aliases, metadata tags, and explicit rollback to earlier versions. For PEFT systems, rollback can be even faster if the base model remains frozen and only the active adapter alias changes. That is a strong practical argument for LoRA and related methods in production: they reduce not only training cost, but also the complexity of reversing a bad release.
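The alias-switching pattern can be sketched in a few lines. The class below is a hypothetical in-memory illustration, not the MLflow API: versions are immutable once registered, an alias such as "production" points at one version, and rollback moves the alias back to the previously promoted version. The adapter name and artefact paths are placeholders.

```python
class AdapterRegistry:
    """Toy registry: immutable versions plus movable aliases."""

    def __init__(self):
        self._versions = {}  # name -> list of URIs; version n is at index n - 1
        self._history = {}   # (name, alias) -> list of promoted version numbers

    def register(self, name: str, uri: str) -> int:
        self._versions.setdefault(name, []).append(uri)
        return len(self._versions[name])

    def promote(self, name: str, alias: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for {name}")
        self._history.setdefault((name, alias), []).append(version)

    def rollback(self, name: str, alias: str) -> int:
        history = self._history.get((name, alias), [])
        if len(history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        history.pop()
        return history[-1]

    def resolve(self, name: str, alias: str) -> str:
        version = self._history[(name, alias)][-1]
        return self._versions[name][version - 1]


registry = AdapterRegistry()
v1 = registry.register("support-lora", "artifacts/support-lora/v1")
v2 = registry.register("support-lora", "artifacts/support-lora/v2")
registry.promote("support-lora", "production", v1)
registry.promote("support-lora", "production", v2)
before = registry.resolve("support-lora", "production")
restored = registry.rollback("support-lora", "production")
after = registry.resolve("support-lora", "production")
```

Because the base model never changes, the rollback here is a metadata update. The serving layer only needs to reload a small adapter artefact, provided it also records which base-model revision each adapter version was trained against.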

Governance Control Questions

| Area | Minimum control questions |
|---|---|
| Purpose and scope | Is the intended use documented, bounded, and matched to the chosen fine-tuning method? Are unsupported domains explicitly listed? |
| Data provenance | Do you know where every training and preference example came from, under what licence or permission, and whether it can legally be used for this fine-tune? |
| PII and consent | Have you minimised personal data, documented lawful basis or consent where required, and provided withdrawal or revocation processes where applicable? |
| Run and model lineage | Can you trace the released model to exact code, dataset, prompt template, seed, hyperparameters, and evaluation results? |
| Security controls | Are access control, encryption, secrets handling, auditability, and third-party review in place for datasets, training clusters, registries, and serving layers? |
| Safety testing | Did red-teaming include prompt injection, jailbreaks, extraction, poisoning assumptions, and policy-violation probes? Did you verify fine-tuning did not weaken existing safety controls? |
| Deployment controls | Is there a staged rollout, monitored champion/challenger or alias-based promotion path, and a tested rollback plan? |
| Incident response | Are ownership, disclosure triggers, retrospective learning, and third-party escalation paths documented and rehearsed? |

Practical Recommendations

Default Stack

For a pragmatic default stack, use SFT plus LoRA first. It gives the best balance of quality, cost, storage efficiency, rollback simplicity, and tool maturity for most non-frontier use cases. Move to QLoRA when GPU memory is the binding constraint or when single-node iteration on larger base models is important. Add DPO only after the SFT model is already good and there is trustworthy preference data plus a robust evaluation harness.

When to Use Full Fine-Tuning

Choose full fine-tuning only when three conditions are true at the same time: there is enough proprietary data to justify it, the organisation has infrastructure to train, evaluate, secure, and version large checkpoints, and there is evidence that PEFT underfits the target behaviour. Without those conditions, full fine-tuning is often an expensive expression of ambition rather than a disciplined engineering choice.

When to Use Preference Optimisation

Prefer DPO over PPO-based RLHF for most production alignment work unless online RL dynamics or reward-model-driven optimisation are specifically needed. DPO is simpler to implement, lighter to train, and easier to operationalise. PPO-based RLHF remains strategically relevant, but it should be treated as an advanced option with higher operational cost and more failure modes.

Modular and Tiny-Parameter Methods

Use adapters or IA3 when a large portfolio of modular task-specific variants is needed, or when model multiplexing and quick rollback matter more than squeezing out the last few benchmark points. Use prompt tuning or prefix tuning when the base model is already very capable and the desired task artefacts should be extremely small. Keep BitFit as a baseline and ablation. If it works almost as well as a heavier method, that is valuable information about the true complexity of the adaptation problem.

Release Gates

No matter which tuning method is chosen, the release process should include the same gates: a dataset card, run tracking, a model card with base-model and dataset metadata, task, safety, and efficiency evaluation, registry promotion through aliases or versions, continuous monitoring, and a rehearsed rollback path.

Open Questions and Limitations

This report gives relative rather than absolute cost figures because actual GPU-hour and storage totals depend strongly on model size, sequence length, optimiser states, sharding strategy, and serving design. Official tooling documentation makes clear that AMP, FSDP, quantisation, and packing can shift memory and throughput substantially.

Evaluation recommendations are inherently time-sensitive. Benchmark ecosystems evolve, popular benchmarks saturate or develop contamination issues, and automated judges can introduce bias. That is why the report prioritises evaluation structure over any single leaderboard.

The governance section is cross-jurisdictional by design. If deployment touches regulated data or high-impact decisions, legal and compliance requirements should be mapped to operating jurisdictions and contract structure before training begins. The principles cited here are stable; the exact obligations are not.

References and Further Reading