24 minute read

Prompt Injection Threat Model Logo

Prompt Injection Threat Model


Introduction

Prompt injection has quickly become one of the defining security challenges for anyone building with large language models. MITRE ATLAS treats it as a first-class adversary technique — LLM Prompt Injection (AML.T0051) — with three sub-techniques: Direct (AML.T0051.000), Indirect (AML.T0051.001), and Triggered (AML.T0051.002). These reflect the different ways attackers can slip malicious instructions into a system — through the user interface, through external content the system ingests, or through delayed payloads that fire later when conditions are right.

What makes prompt injection especially dangerous in modern deployments is that it’s rarely the end goal. Think of it as an enabling primitive — a stepping stone that attackers chain into tool invocation, credential harvesting, data collection from RAG and enterprise sources, system prompt extraction, and exfiltration. MITRE’s OpenClaw investigation illustrates this pattern clearly: the recurring chains involve prompt injection, tool invocation abuse, and configuration modification working together to achieve persistent, wide-ranging impact.

A rigorous threat model therefore needs to treat the LLM and its orchestration layer as an untrusted interpreter of mixed code-and-data. The literature on indirect prompt injection highlights that LLM-integrated applications blur the line between “instructions” and “retrieved data,” and defensive work confirms that LLMs struggle to reliably distinguish where instructions are coming from within a single token stream.

Defence-in-depth is required because there is no universally foolproof prevention mechanism today, and defences must be evaluated against adaptive attackers and measured for utility degradation. The highest-leverage controls, mapped to ATLAS mitigations and modern agent architectures, are:

  • Constrain agency and tool impact — least privilege, delegated permissions, human approval for high-consequence actions, and restricting tool use when untrusted content is in context.
  • Instrument and monitor — AI telemetry logging of prompts, outputs, and tool decisions, with alerting on anomalous tool calls and exfiltration patterns.
  • Harden prompt and context handling — guardrails, validation, provenance marking and segmentation of untrusted retrieved content, and memory hardening.
  • Eliminate secrets-as-prompts — assume system prompts can be extracted, treat them as non-secret configuration, and enforce security outside the model.

Scope and Definitions

What this report covers

This threat model applies to two broad categories of system:

LLM-based systems — applications that incorporate an LLM into a workflow (chat, summarisation, coding, retrieval, classification, triage), potentially with retrieval augmentation (RAG), browsing, plugins, and/or function or tool calling.

Autonomous or semi-autonomous agents — LLM-orchestrated systems that can select actions and invoke tools (execute code, call APIs, access enterprise data sources) and may operate with reduced continuous human oversight. ATLAS explicitly models agent tool invocation and agent configuration discovery and modification as techniques.

Prompt injection defined

ATLAS defines LLM Prompt Injection (AML.T0051) as adversary-crafted prompts that cause the LLM to ignore or override original instructions and act in unintended ways, potentially serving as an initial access vector to achieve a foothold that persists within an interactive session. OWASP’s definition is closely aligned: prompts that alter model behaviour in unintended ways, including non-human-visible injections, with a note that RAG and fine-tuning do not fully mitigate the vulnerability class.

The three sub-techniques

Direct injection (AML.T0051.000) — the attacker supplies malicious instructions as a direct user of the system.

Indirect injection (AML.T0051.001) — malicious instructions are embedded in external content the system ingests: web pages, documents, emails, database records, or tool outputs.

Triggered injection (AML.T0051.002) — malicious instructions that activate after a user action or system event, often paired with delayed-execution or activation-trigger discovery in agent workflows.

A few related ATLAS techniques show up repeatedly throughout this report:

  • Prompt infiltration and smuggling via public-facing apps (AML.T0093) — placing prompt payloads into channels the system ingests (shared docs, web search results, messages, third-party skills), bridging untrusted content into the LLM context.
  • RAG poisoning (AML.T0070) — injecting malicious content into indexed retrieval sources so it surfaces for future queries and contaminates downstream context.
  • Context and memory poisoning (AML.T0080) — persistence via poisoned agent memory or conversation thread state, potentially across sessions.
  • System prompt extraction (AML.T0056) — extracting system prompts via injection or configuration access. OWASP warns that system prompts should not be treated as secrets or security controls and should not contain credentials.
  • Tool invocation abuse (AML.T0053) — using access to an AI agent to invoke tools and thereby reach data or actions not directly accessible to the attacker.
  • Chain-of-thought leakage and hijacking — reasoning traces can leak sensitive information (PII) even when final answers are sanitised, and long reasoning sequences can be exploited to weaken safety and refusal and enable jailbreak-style outcomes.

Assumptions

No deployment-specific details are assumed beyond the fact that an LLM is used, the system has at least one input channel, and the system may incorporate retrieval, plugins or tools, and memory. This is a risk assessment of realistic enterprise patterns, not a product vulnerability disclosure.


Mapping ATLAS Tactics and Techniques to Prompt Injection

The ATLAS dataset (v5.4.0) encodes a matrix with 16 tactics spanning reconnaissance through impact, with technique definitions and mitigations used for mapping and control selection. MITRE’s SAFE-AI report characterises ATLAS as a framework based on real-world attack observations and AI red team demonstrations, organised similarly to ATT&CK but focused on AI-enabled systems.

Tactic-to-scenario mapping

The table below highlights ATLAS techniques that most often appear in prompt-injection-driven chains — especially in RAG plus tool-using agent stacks — and how they map to practical scenarios.

ATLAS Tactic Technique (ID) Why It Matters Representative Scenarios
Reconnaissance / Discovery Discover LLM System Information (AML.T0069) Finds system prompts, special tokens/delimiters, and instruction keywords to craft higher-success payloads Token and control-scheme discovery for agent jailbreak or prompt smuggling
Reconnaissance Gather RAG-Indexed Targets (AML.T0064) Identifies which sources are indexed, enabling targeted RAG poisoning Targeting an enterprise index via email, docs, or repo content
Resource Development LLM Prompt Crafting (AML.T0065) Develops working malicious prompts tailored to system behaviour Payload design for exfil, tool calls, or memory poisoning
Resource Development LLM Prompt Obfuscation (AML.T0068) Evades humans and filters via hidden text, encoding, or UI tricks Hidden Unicode, tiny font, base64 or rot13 patterns
Initial Access Prompt Infiltration via Public-Facing App (AML.T0093) Gets malicious instructions into the context via shared docs, web, or messages Shared Google Docs, web content ingestion, agent skills
Initial Access Drive-by Compromise (AML.T0078) Triggers ingestion of malicious content via browsing or previewing workflows Auto-fetch summarisation, plugin retrieval
Execution LLM Prompt Injection (AML.T0051 + subs) The core mechanism: override intended behaviour, bypass guardrails, invoke privileged actions Direct, indirect, and triggered injections across channels
Execution AI Agent Tool Invocation (AML.T0053) Turns an LLM compromise into a system compromise by invoking tools, APIs, or code execution “Living off AI”: email, file I/O, terminals, enterprise APIs
Persistence AI Agent Context Poisoning (AML.T0080) Persists malicious instructions in memory or threads Cross-session manipulation via poisoned memory
Persistence / Defence Evasion Modify AI Agent Configuration (AML.T0081) Alters agent policies, confirmations, or tool settings for durability Disabling confirmations, changing rules files or configs
Credential Access / Collection RAG Credential Harvesting (AML.T0082) Uses RAG-access to retrieve secrets (API keys, tokens, creds) from indexed content Prompting enterprise assistant to surface keys from private channels
Exfiltration Extract LLM System Prompt (AML.T0056) Reveals system prompt for targeted follow-on attacks Meta-prompt extraction via direct prompting or config access
Exfiltration Exfiltration via AI Agent Tool Invocation (AML.T0086) Encodes sensitive data into tool parameters or outputs and exfiltrates via legitimate channels Emailing results, posting to tickets, curl via shell tools
Impact Data Destruction via AI Agent Tool Invocation (AML.T0101) Converts agent tool abuse into destructive outcomes Deleting files or resources, sabotaging workflows

Why these mappings are stable. The OpenClaw investigation explicitly frames the agentic exploit class as abuses of trust, configuration, and autonomy — not purely low-level bugs — highlighting prompt injection plus tool invocation plus agent config modification as recurring, high-risk chains. Academic work on indirect prompt injection also emphasises that when retrieved prompts are processed as instructions, attackers can achieve outcomes analogous to code execution and can influence whether downstream APIs are called.


Threat Model Components

System model and trust boundaries

At a high level, prompt injection becomes feasible when the system constructs a single model context that mixes higher-priority instructions (system or developer prompts), user intent, and external content (retrieved documents, web pages, tool outputs, memory) — without robust provenance controls and deterministic enforcement outside the LLM.

The following diagram shows a conceptual entity-relationship view of a typical tool-using LLM application and where trust boundaries sit.

Tool-Using LLM Application — Entity Relationship & Trust Boundaries TRUSTED ZONE — Developer-controlled System Prompt Context Builder LLM Policy Engine Tool Router Secrets Vault Audit Log UNTRUSTED / MIXED ZONE — User & external content User Input Retrieved Content Memory Items Tool Outputs Adversary EXTERNAL — Services, data stores, vector DBs Vector DB Indexed Docs External APIs Enterprise Data Retrieval Pipeline Trusted data flow Untrusted data flow (crosses trust boundary) Adversary injection path Key insight: prompt injection exploits the mixing of trusted instructions and untrusted content in the Context Builder.

Figure 1 — Entity-relationship view of a tool-using LLM application, showing trust boundaries and adversary injection paths.

Adversary goals

In ATLAS terms, prompt injection is often a gateway to goals across Credential Access, Collection, Exfiltration, and Impact. Typical goals include:

  • Data theft — user conversations, proprietary docs, credentials, API keys, system prompts, or sensitive enterprise records (often via RAG or tool access).
  • Privilege amplification through tools — using the agent to call tools or APIs not normally accessible to the attacker, effectively “borrowing” the agent’s permissions.
  • Persistence and long-lived manipulation — poisoning memory or threads, or modifying agent configuration so malicious behaviour triggers later.
  • Integrity and availability damage — manipulating outputs, eroding model integrity, or destructive tool invocation.

Adversary capabilities and knowledge

Capabilities typically required, in increasing sophistication:

  • Prompt engineering and iterative testing — prompt crafting, obfuscation, and delayed execution.
  • Control-scheme discovery — special tokens, keywords, and system prompt structure to bypass guardrails or spoof control channels in agent frameworks.
  • Placement capability for indirect injections — ability to modify or introduce content where the system retrieves it (web pages, shared docs, indexed channels, repositories).
  • Supply chain manipulation — for agent plugins, skills, rules, or config packages.

Access vectors and prerequisites

Common access vectors include direct user input (chat UI or API), indirect channels (web pages, documents, emails, Slack or Teams channels, ticketing systems, knowledge bases), tool outputs and plugin responses treated as trusted context, and agent marketplace or shared configuration artifacts (skills, rules files).

For the chain to matter, three prerequisites typically hold: the application passes untrusted content into model context with insufficient provenance or segmentation; the system has excessive agency relative to safety controls and approval gates; and secrets or privileged details are stored in accessible channels (system prompts, RAG index, config files, memory).

Assets at risk

Key asset classes, aligned to ATLAS technique families:

  • System prompts and internal configuration — often prerequisites for targeted injections.
  • Credentials and tokens — in RAG sources, agent configuration files, or tool configs.
  • Enterprise data accessible via tools or RAG — docs, CRM records, tickets.
  • User privacy and conversations — conversation exfiltration and PII extraction.
  • Availability and integrity — of systems affected by tools (file deletion, sabotage, fraudulent actions).

Concrete Attack Patterns and Kill Chains

Scenario overview

The following scenarios are selected to cover web apps, APIs, plugins, tool-using agents, chain-of-thought leakage, and data exfiltration, each with explicit mapping to ATLAS IDs.

Scenario Environment Primary Goal Key ATLAS Techniques
Web app injection to code execution LLM web app that executes LLM-generated code Execute commands / steal API keys AML.T0051.000, AML.T0053, AML.T0093
API-driven system prompt extraction LLM inference API or internal service Extract system prompt for follow-on attacks AML.T0056, AML.T0069, AML.T0051
Plugin/browsing indirect injection Chat assistant with web retrieval or plugin execution Exfiltrate conversation or history AML.T0051.001, AML.T0078, AML.T0077, AML.T0053
RAG poisoning + credential harvesting Enterprise assistant indexing messages Steal API key from private content AML.T0070, AML.T0051.001, AML.T0082, AML.T0077
Memory poisoning persistence Assistant with cross-session memory Persistently bias or steer future sessions AML.T0080.000, AML.T0051.001, AML.T0068
Supply chain injection via skills/rules Agent marketplace / shared config Execute tools or insert backdoors AML.T0104, AML.T0051.000, AML.T0081
Agent becomes command-and-control Autonomous agent with browsing + shell tool Persistent remote control AML.T0051.001, AML.T0053, AML.T0081, AML.T0080.001
Chain-of-thought leakage and hijacking Reasoning model producing explicit traces Leak PII or weaken refusal Injection + exfil/impact outcomes

Kill chain: Indirect injection through RAG leading to credential exfiltration

This pattern is explicitly demonstrated in ATLAS case studies where a malicious message is indexed into a RAG database and later retrieved and executed — leading to API key harvesting and exfiltration through rendered output.

Kill Chain — RAG Poisoning to Credential Exfiltration 1. Craft retrieval-targeted payload AML.T0066 / AML.T0065 2. Place payload into indexed channel RAG poisoning — AML.T0070 3. Victim queries enterprise assistant RAG retrieves malicious content into context 4. Indirect injection executes AML.T0051.001 5. Retrieve secret from private sources RAG credential harvesting — AML.T0082 6. Render output as lure or link LLM response rendering — AML.T0077 7. Secret transmitted to adversary endpoint User tricked into click-through — exfiltration complete SETUP TRIGGER EXPLOIT

Figure 2 — Kill chain for indirect prompt injection through RAG leading to credential exfiltration.

Kill chain: Agentic injection leading to tool invocation, persistence, and remote control

MITRE’s OpenClaw investigation and the corresponding ATLAS case study describe this as a key emerging pattern: indirect prompt injection plus tool invocation plus configuration modification produces a persistent “agent implant” capable of ongoing malicious activity.

Kill Chain — Agentic Injection to Persistent Remote Control 1. Host malicious web content Masquerade as trusted resource — Acquire infrastructure 2. User asks agent to open/summarise link Drive-by compromise — AML.T0078 3. Agent fetches content; injection fires Indirect injection — AML.T0051.001 4. Agent silently invokes execution tool Tool invocation — AML.T0053 5. Script modifies agent config/prompt Modify AI Agent Configuration — AML.T0081 6. Poisoned thread/memory persists Context poisoning — AML.T0080 7. Agent acts as C2 channel Remote commands and follow-on actions — persistent agent implant SETUP TRIGGER EXPLOIT & PERSIST

Figure 3 — Kill chain for agentic prompt injection leading to persistent remote control.

Step-by-step scenario breakdowns

Web app prompt injection enabling code execution and secret theft. A documented exercise against an LLM-backed web app (MathGPT) shows how prompt injection can induce an app that executes LLM-generated code to run attacker-influenced code paths, read environment variables, and obtain API keys (with follow-on denial-of-service potential). The kill chain: the attacker iteratively probes prompt behaviour and crafts a payload that causes the model to output code aligned to attacker objectives, the application executes the generated code (this is the critical design hazard — the LLM output becomes an execution substrate), and the attacker coerces access to sensitive runtime data or triggers resource exhaustion.

API-driven system prompt extraction for targeted follow-on attacks. ATLAS notes adversaries may attempt to extract system prompts via prompt injection or by obtaining configuration files. Research on prompt extraction reinforces that simple text-based attacks can reveal prompts with high probability across many models, and prompt extraction from real systems suggests that defences are often insufficient. The attacker uses an inference API and performs model behaviour discovery, then attempts system prompt extraction with obfuscated reformulations and iterative refinement. Extracted content is used to design higher-success indirect injections using compatible delimiters, tool-call formats, and refusal bypass patterns.

Plugin and browsing indirect injection to exfiltrate private conversation. An ATLAS case study demonstrates that if a chat system retrieves a malicious webpage via a plugin, the injected content can cause the assistant to output a crafted element that exfiltrates the private conversation when the client fetches it — and can also induce additional plugin actions. The attacker publishes malicious content in a location likely to be fetched, the user triggers retrieval, indirect injection executes inside the user’s session context (the attacker’s instructions run with the user’s credentials), and exfiltration occurs through output rendering mechanics.

RAG poisoning and credential harvesting in an enterprise assistant. A case study shows how an attacker can seed a malicious payload into an indexed channel so that later user queries retrieve it. The payload drives the assistant to retrieve an API key from a private channel and present it as a clickable lure that transmits the secret to the attacker.

Memory poisoning for persistence across sessions. An ATLAS case study demonstrates that hidden prompt injection in shared content can lead to memory store poisoning, so future chats incorporate attacker-inserted “facts” or instructions — even after the initial session ends. The adversary hides injection content via obfuscation (invisible text) inside a shared resource, the system ingests it, indirect injection executes and writes to memory store, and future sessions inherit the poisoned memory unless hardening and remediation processes exist.

Supply chain prompt injection via skills, rules, and config packages. ATLAS documents scalable supply chain patterns where rules files or agent skills contain hidden or obfuscated prompt payloads that manipulate coding assistants or agents to insert backdoors or execute commands. Where the artifact is treated as trusted configuration, the payload executes as a system-level instruction, and the agent invokes tools or generates compromised outputs with downstream impact.

Agent becomes command-and-control via prompt injection. An ATLAS case study describes a chain where a malicious webpage’s indirect prompt injection abuses control tokens to invoke an unrestricted execution tool. A downloaded script then plants persistent malicious instructions into future system prompts, enabling remote command issuance. The result is an “agent implant” that continues communicating with attacker infrastructure.

Chain-of-thought leakage and hijacking in reasoning models. Two distinct but related risks: intermediate reasoning can leak PII even when final answers are sanitised, and long reasoning sequences can be exploited to weaken safety signals and achieve high attack success rates across multiple reasoning models. The key takeaway for threat modelling is that any system that logs, displays, or stores reasoning traces increases the sensitive-information attack surface and may create new safety bypass avenues.


Detection and Monitoring Strategies

What to log and why

ATLAS mitigation AI Telemetry Logging (AML.M0024) recommends logging inputs and outputs of deployed models and, for agents, the intermediate steps of agentic actions, tool use, and decisions. This is particularly critical because prompt injection can be invisible or obfuscated (tiny font, hidden Unicode, encoding).

A monitoring baseline for prompt injection threat detection should include:

  • Prompt provenance metadata — user input versus retrieved content versus tool output versus memory (source IDs, trust levels, timestamps). Provenance signalling is a core idea behind spotlighting-style defences to help separate instructions from untrusted data.
  • Tool invocation logs — tool name, parameters (with sensitive redaction), caller identity, permission context, and whether untrusted data was present in the context at invocation time.
  • Output safety signals — DLP findings (PII and keys), suspicious formatting (URLs with long query strings), embedded payload patterns, and prompt leakage patterns.

Detection analytics and alert content

High-yield detections tend to be behavioural rather than purely signature-based, because attackers can obfuscate payload strings. Recommended detection rules (generic, not product-specific):

  • Context boundary violations — untrusted retrieved content contains instruction-like patterns and coincides with changes in tool invocation frequency or privilege.
  • Suspicious tool-call sequences — browsing → parsing → shell/email/ticket-posting within the same session, or tool calls that include unusually encoded or long parameters (possible exfil payload).
  • Prompt leakage anomalies — outputs with large verbatim blocks matching system prompt templates, references to hidden policies, or repeated internal configuration content.
  • RAG targeted-retrieval patterns — repeated queries that look like credential hunts, or sudden retrieval of secrets from private channels.

Incident response considerations

Because prompt injection can persist through memory and context poisoning, detection must include retroactive analysis: which sources entered context, what was written to memory or config, and which downstream actions were taken.


Mitigations, Hardening Controls, and ATLAS Mapping

Control families

The mitigation landscape can be modelled as layered controls, with ATLAS mitigations providing a structured anchor.

Input validation and output validation. ATLAS mitigation Input and Output Validation for AI Agent Components (AML.M0033) calls for validation of inputs and outputs for tools and data sources used by agents — enforcing schemas and preventing unsafe agentic workflows.

Sandboxing and segmentation. ATLAS mitigation Segmentation of AI Agent Components (AML.M0032) recommends defining security boundaries around tools and data sources using API access controls, container isolation, code execution sandboxing, and rate limiting of tool invocation.

Instruction-level controls (guardrails, guidelines, alignment). ATLAS mitigation Generative AI Guardrails (AML.M0020) describes guardrails as validators, filters, rules, and classifiers that evaluate prompt and response safety, including domains such as jailbreaks, code exploits, and leakage. Generative AI Model Alignment (AML.M0022) highlights that fine-tuning can remove safety mechanisms, and that supervised fine-tuning, RLHF/RLAIF, and targeted safety distillation can improve alignment against unsafe prompts and responses.

Provenance and supply chain controls. ATLAS mitigation AI Bill of Materials (AML.M0023) supports supply chain risk mitigation by listing artifacts and resources used to build the AI, enabling rapid response to vulnerabilities.

Rate limits and access control. ATLAS includes Restrict Number of AI Model Queries (AML.M0004) for limiting total and rate of queries to hinder attack iteration, and access-control mitigations for production model access.

System prompt protections (realistic posture). OWASP explicitly states system prompts should not be treated as secrets or security controls and should not contain credentials. ATLAS documents prompt extraction as a dedicated exfiltration technique.

Tool access controls. ATLAS includes several agent permission configuration mitigations: Privileged AI Agent Permissions Configuration (AML.M0026) (RBAC/ABAC and least privilege for privileged agents), Single-User AI Agent Permissions Configuration (AML.M0027) (agents inherit user permissions and lifecycle management), and AI Agent Tools Permissions Configuration (AML.M0028) (delegated access; tool permissions inherited from invoking agent or user; applicable to MCP-style tool servers).

Technique-to-mitigation mapping

ATLAS Technique Key ATLAS Mitigations Engineering Interpretation
LLM Prompt Injection (AML.T0051) Guardrails (M0020); Guidelines (M0021); Alignment (M0022); Telemetry (M0024); I/O Validation (M0033) Multi-layer prompt+output filters; strict schema validation; robust logging; improve model resistance via alignment where appropriate
Extract LLM System Prompt (AML.T0056) Guardrails (M0020); Guidelines (M0021); Alignment (M0022) Assume prompt leakage; prevent secrets in prompts; add output-side leakage detection; consider architectures that don’t expose system text directly
AI Agent Tool Invocation (AML.T0053) Privileged Permissions (M0026); Single-User Permissions (M0027); Tools Permissions (M0028); Restrict on Untrusted Data (M0030); Segmentation (M0032); Human-in-the-loop (M0029); Telemetry (M0024) Least privilege + delegated auth; approval gates for high-risk actions; deny or require confirmation if untrusted content in context; sandbox tools; log every tool call
Exfiltration via Tool Invocation (AML.T0086) Restrict on Untrusted Data (M0030); Segmentation (M0032); Telemetry (M0024); Permissions (M0026/27/28) Block auto-write tools when untrusted context present; confine write-path tools; monitor for exfil patterns in tool params; enforce delegated permissions
AI Agent Context Poisoning (AML.T0080) Memory Hardening (M0031); Segmentation (M0032); Telemetry (M0024) Treat memory as a controlled datastore with trust levels, validation, remediation, and provenance; prevent untrusted sources from writing durable memory without review
RAG Poisoning / Credential Harvesting (AML.T0070 / AML.T0082) Permissions (M0026/27); Segmentation (M0032); Telemetry (M0024) Restrict retrieval scopes; keep secrets out of indexed corpora; monitor retrieval queries and sensitive outputs; isolate high-sensitivity document stores
LLM Prompt Obfuscation (AML.T0068) Guardrails (M0020); Telemetry (M0024) Detect hidden/encoded instruction patterns; normalise inputs (strip invisible chars); log raw+normalised representations for forensics
Data Destruction via Tool Invocation (AML.T0101) Restrict on Untrusted Data (M0030); Permissions (M0026/27/28); Human-in-the-loop (M0029) Always gate destructive actions; require human confirmation; enforce tool-level allow/deny lists; isolate destructive tools in hardened sandboxes

Design hardening patterns

The following patterns are consistently supported by the above mappings and by the literature on indirect prompt injection and provenance marking:

  • Provenance-aware context assembly (“separate code from data”) — implement explicit boundaries and provenance signals for external content (delimiting, encoding, or marking retrieved data) to reduce the chance of it being interpreted as instruction. Spotlighting-style techniques report large reductions in attack success rate with minimal task impact in experiments.
  • Tool firewall and policy engine — treat tool calls as untrusted requests requiring deterministic authorisation, not as model outputs to execute. OWASP stresses that security controls (authZ, privilege bounds) must not be delegated to the LLM.
  • “Untrusted-context mode” — when any untrusted retrieved content enters context, automatically degrade agent capabilities: block high-impact tools, require confirmation, and shorten memory writes. This aligns directly with AML.M0030.
  • Secrets governance for RAG and prompts — prevent credentials from appearing in RAG-indexed sources, prevent secrets in system prompts, and assume prompt extraction is feasible.

Evaluation, Testing, Residual Risks, and Roadmap

Evaluation and testing methods

Threat-model-based red teaming. Use ATLAS technique IDs as a test-plan backbone: for each high-risk workflow, design adversary simulations that attempt AML.T0051 (direct/indirect/triggered), tool invocation abuse, RAG poisoning, memory poisoning, and exfiltration via tools and output rendering. MITRE’s SAFE-AI framing of security control assessment for AI-enabled systems supports this approach: select controls that address AI-specific threats, not only traditional IT threats.

Unit tests and regression tests. Create deterministic unit tests around context construction (ensuring untrusted segments are marked and segmented correctly), tool policy enforcement (ensuring tool calls cannot execute without policy checks and required approvals), and memory writes and retrieval validation (preventing persistent poisoning).

Metrics and benchmarks. Use dual-axis evaluation: effectiveness against adaptive prompt injection attacks and diverse prompts, and general-purpose utility to quantify trade-offs (latency, refusal rates, task success). Where applicable, adopt metrics from the research ecosystem — attack success rate (ASR) is commonly used in prompt-injection defence evaluation, SPE-LLM proposes evaluation metrics and benchmark-driven validation for system prompt extraction, and Imprompter-style evaluations focus on end-to-end confidentiality and integrity compromise via improper tool use.

Operational validation. NIST AI RMF emphasises applying risk management functions (govern/map/measure/manage) in a lifecycle manner. For prompt injection, this encourages continuous measurement and iterative control improvement rather than one-time “prompt patching.”

Residual risk and trade-offs

Residual risk remains even after strong controls, for well-documented reasons:

  • No foolproof prevention — OWASP explicitly notes the stochastic nature of generative AI and the lack of guaranteed prevention; systems should instead mitigate impact.
  • Defence efficacy versus utility — strong guardrails and heavy filtering can degrade helpfulness and increase false positives, and academic evaluation stresses measuring both dimensions.
  • Agentic power concentrates risk — as workflows add browsing, plugins, and tools, indirect prompt injection becomes more severe because attacker instructions can run in the user’s credential context and drive real actions.
  • Persistence surfaces (memory, shared threads, config) mean that containment must include remediation processes, not only input filters.
  • Reasoning trace exposure can create additional leakage and safety-bypass risk in reasoning models.

Prioritised implementation roadmap

The roadmap below is deployment-agnostic and prioritised by expected risk reduction against documented ATLAS chains, feasibility, and blast-radius containment.

Prioritised Implementation Roadmap Phase 1 — Immediate Hardening (days to weeks) • AI telemetry logging (AML.M0024) for prompts, outputs, retrieval sources, memory, and tool invocations • Tool permission controls: delegated access, least privilege (AML.M0026/27/28), block high-risk tools by default • "Untrusted-context mode" (AML.M0030): disallow auto tool execution when external content is in context • Remove secrets from system prompts; treat prompts as non-secret; enforce authZ/authN outside the model Phase 2 — Core Resilience Build-Out (weeks to months) • Guardrails (AML.M0020) and I/O validation (AML.M0033) with safe-schema enforcement and DLP scanning • Segmentation/sandboxing (AML.M0032): isolate tools in hardened containers; separate read vs write tools • Memory hardening (AML.M0031): provenance-based trust for memory writes, validation, remediation playbooks • Provenance signalling for retrieved content (spotlighting-style marking) to reduce indirect injection effectiveness Phase 3 — Strategic Improvements (quarterly and ongoing) • Continuous red-team programme mapped to ATLAS techniques (AML.T0051/T0053/T0086/T0080) with regression tests • Model-side hardening: alignment (AML.M0022), targeted fine-tuning (RLHF/RLAIF) as complement to external controls • AI BOM / provenance tooling (AML.M0023) for third-party models, skills, and config artifacts • For reasoning models: avoid exposing raw chain-of-thought; treat reasoning traces as sensitive data • Dual-axis evaluation metrics: effectiveness against adaptive attacks + general-purpose utility measurement

Figure 4 — Three-phase implementation roadmap prioritised by risk reduction, feasibility, and blast-radius containment.


Conclusion

Prompt injection is not a single vulnerability — it is an attack primitive that enables entire kill chains. MITRE ATLAS provides the structured, adversary-centric vocabulary we need to reason about these threats systematically, from the reconnaissance and resource development phases through execution, persistence, and impact.

The most important takeaway from this threat model is that no single defence layer is sufficient. The recurring patterns documented in ATLAS case studies — prompt injection chained with tool invocation, configuration modification, and memory poisoning — demonstrate that attackers will combine traditional and AI-specific techniques in the same campaign. Effective defence requires constraining agency and tool impact, instrumenting everything, hardening prompt and context handling, and eliminating secrets from the model’s reach.

For teams getting started, the immediate hardening actions (telemetry logging, tool permission controls, untrusted-context mode, and removing secrets from prompts) deliver the highest risk reduction per unit of effort. For teams already down that path, investing in provenance-aware context assembly, continuous red teaming mapped to ATLAS technique IDs, and dual-axis evaluation metrics (effectiveness plus utility) will build durable, measurable resilience.


See also: MITRE ATLAS Deep Dive: Threat Intelligence for AI Systems in 2026 for a comprehensive walkthrough of the full ATLAS framework, and OWASP Top 10 for LLM Applications: An In-Depth Guide for 2026 for a complementary vulnerability-focused perspective.