

LLM Agents in Production


Introduction

Over the past two years, large language models have evolved from stateless question-answer engines into agents that perceive their environment, plan multi-step strategies, execute tool calls, and persist context across sessions lasting dozens of hours. By early 2026 the agent market has reached an estimated $7.63 billion, with projections of $50.31 billion by 2030 at a 45.8 % CAGR. According to LangChain’s State of Agent Engineering survey (December 2025), 57 % of organisations now have agents running in production, up from 51 % the year before. Customer service leads adoption at 26.5 % of deployments, with research and data analysis close behind at 24.4 %.

Yet building a reliable agent is far harder than building a demo. An empirical study of twenty production deployments found that 85 % of teams forgo third-party agent frameworks entirely, building custom orchestration from scratch, and that 68 % of deployed agents execute at most ten steps before requiring human intervention. The gap between prototype and production demands careful attention to architecture, workflow design, termination logic, framework selection, security, and deployment strategy.

This post provides a practitioner-oriented guide across all of these dimensions. It expands on the foundational survey in the companion report by adding a general architecture overview, concrete workflow patterns, a head-to-head framework comparison spanning open-source projects and hyperscaler platforms, a layered security model, deployment architectures, and lessons from companies that have successfully shipped agents to production.


1. General Architecture of LLM Agents

1.1 What makes a system an agent

An LLM agent is more than an LLM with a system prompt. The defining characteristics are autonomy (the ability to decide which actions to take without explicit per-step instructions), tool use (interaction with external systems — APIs, databases, file systems, browsers), memory (retention of context within and across sessions), and goal-directed behaviour (persistent pursuit of an objective over multiple reasoning-action cycles). A useful shorthand from the practitioner community: Agent = LLM + Tools + Memory + Planning / Control.

A plain LLM call is reactive: given input, produce output. A workflow adds developer-defined control flow — a sequence of LLM calls, conditionals, and tool invocations in a pre-determined order. An agent adds dynamic decision-making: the LLM itself decides which tool to call, when to call it, and when the task is done. Most production systems sit on a spectrum between rigid workflows and fully autonomous agents, borrowing elements from both.

1.2 The agentic loop

Most frameworks decompose agent cognition into four core stages: perception, planning, action, and memory. Perception interprets user input and environmental observations. Planning decides what to do next — which tool to call, which sub-task to address. Action executes the chosen tool call or generates a response. Memory stores the result for future iterations. This cycle repeats until a termination condition is met.


Figure 1 — The agentic loop. Perception feeds planning, which drives action, which updates memory. The cycle continues until a termination condition is satisfied.

A unified model of this loop, often termed the LLM-Agent Universal Model Framework (UMF), maps these modules to equivalent computer-system components: the LLM is the CPU, memory tiers are RAM and disk, tools are I/O peripherals, and the orchestration layer is the operating system scheduler. This analogy underscores the importance of modular design and explicit interfaces — the same principles that make traditional systems reliable.
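
Stripped of any framework, the loop above is a bounded iteration around the model's next-action choice. A minimal sketch, with `plan_step` and `execute` as stand-ins for the real LLM call and tool executor:

```python
def run_agent(goal, plan_step, execute, max_steps=10):
    """Minimal perception-planning-action-memory loop (illustrative sketch)."""
    memory = []  # short-term memory: (action, observation) pairs
    for _ in range(max_steps):
        action = plan_step(goal, memory)      # planning: pick the next action
        if action is None:                    # stop signal: no further tool call
            return memory, "goal_met"
        observation = execute(action)         # action: run the chosen tool
        memory.append((action, observation))  # memory: persist for the next cycle
    return memory, "max_steps"                # hard budget cap as a safety net
```

Every real framework elaborates on this skeleton, but the control flow (and the need for an explicit step budget) is the same.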

1.3 Components and sub-elements

An alternative practitioner-oriented breakdown emphasises four functional layers:

| Layer | Sub-elements | Role |
|---|---|---|
| Planning | Prompting techniques, task decomposition, logical reasoning | Break a high-level goal into sub-tasks using strategies such as chain-of-thought, ReAct, Self-Refine, Reflexion, and tree search. |
| Execution | Tools, sub-agents, guardrails, error handling | Provide access to web search, vector stores, databases, code interpreters, or ML models. Enforce guardrails via validation and fallback strategies. Autonomously generate new tools when none exist (LATM). |
| Refinement | Memory, human-in-the-loop, LLM judging | Use short-term memory (context window) and long-term memory (vector stores, key/value stores, knowledge graphs). Agents can write notes, summarise sessions, or use persistent memory systems like MemGPT. Human reviewers evaluate intermediate outputs and guide error recovery. |
| Interface | Human–agent and agent–agent communication | Implement user interfaces (chat, voice, GUI) and protocols for agent-to-agent messaging. Provide observability tools (tracing, replay, metrics) and version control for prompts and models. |

1.4 Memory mechanisms

Memory is critical for long-horizon tasks yet remains a bottleneck. Even models with million-token context windows suffer context rot at approximately 100 k tokens, with performance dropping by over 50 %. The “lost in the middle” phenomenon means accuracy peaks for information at the beginning and end of the context but drops for content in the middle 40–60 % of the window.

Advanced memory systems address this through several strategies. MemGPT introduces hierarchical memory tiers and interrupts to swap data between short-term and long-term stores. A-Mem uses Zettelkasten-inspired interconnected notes. Amazon Bedrock AgentCore now offers episodic memory that enables agents to learn and adapt from past interactions. Selective addition and deletion are essential — indiscriminate storage propagates errors, while utility-based deletion yields up to 10 % improvement.
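
Selective storage can be illustrated without committing to any particular system. The toy class below (not MemGPT's or A-Mem's actual API) scores each memory on write, skips low-value items, and evicts the lowest-utility entry once capacity is exceeded:

```python
import heapq

class ScoredMemory:
    """Toy long-term memory with utility-based eviction (illustrative only)."""

    def __init__(self, capacity=3, floor=0.1):
        self.capacity = capacity
        self.floor = floor
        self.items = []       # min-heap of (utility, insertion_order, text)
        self._counter = 0

    def add(self, text, utility):
        # Selective addition: indiscriminate storage propagates errors,
        # so items below the utility floor are never stored.
        if utility < self.floor:
            return
        heapq.heappush(self.items, (utility, self._counter, text))
        self._counter += 1
        # Utility-based deletion: evict the lowest-scoring item when full.
        while len(self.items) > self.capacity:
            heapq.heappop(self.items)

    def recall(self):
        # Return memories ordered from highest to lowest utility.
        return [text for _, _, text in sorted(self.items, reverse=True)]
```

Real systems replace the scalar utility score with recency, relevance, and learned signals, but the add/evict discipline is the part that prevents memory rot.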

1.5 The brain, body, and eyes mental model

A useful practitioner mental model frames agent design as choosing three components: a brain (the LLM), a body (the orchestration framework), and eyes (the observability stack). Getting any one wrong can doom the agent in production. Developers report frustration when frameworks add excessive abstraction between the developer and the model — a phenomenon sometimes called “abstraction tax.” The trend in 2025–2026 has been towards lighter-weight, more explicit frameworks that make the orchestration logic transparent.


2. Agent Workflows and Termination

2.1 Core workflow patterns

Six recurring design patterns dominate production agent systems. These patterns can be composed — a production agent often nests several together.


Figure 2 — Six core workflow patterns with their termination strategies. Production agents typically compose multiple patterns together.

Sequential chaining pipelines a task through a series of specialised LLM calls. Each step receives only the context it needs, and conditional gates between steps validate outputs, route branches, or terminate the chain early. This pattern dominates document-processing pipelines.
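
A minimal sketch of the pattern, with each `step` and `gate` as stand-in callables for an LLM call and its validator:

```python
def run_chain(steps, gates, data):
    """Pipeline data through steps; a failing gate terminates the chain early."""
    for step, gate in zip(steps, gates):
        data = step(data)
        if not gate(data):                    # validation gate between steps
            return data, "terminated_at_gate"
    return data, "completed"
```

The gates are what distinguish a production chain from a naive pipeline: bad intermediate output stops the run instead of contaminating every downstream step.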

Parallel fan-out / fan-in breaks a task into independent sub-tasks processed concurrently before an aggregation step synthesises the results. Google ADK provides a built-in ParallelAgent for this pattern; Amazon Bedrock Agents implements a similar capability through its multi-agent collaboration feature.
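
A thread-pool sketch of fan-out / fan-in, with `dispatch`, the `workers`, and `aggregate` as stand-ins for the real decomposition, sub-agents, and synthesis step:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_fan_in(dispatch, workers, aggregate, task):
    """Split a task, run sub-tasks concurrently, then synthesise the results."""
    sub_tasks = dispatch(task)
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        # Fan-out: each worker handles its own sub-task concurrently.
        futures = [pool.submit(w, s) for w, s in zip(workers, sub_tasks)]
        results = [f.result() for f in futures]
    return aggregate(results)                 # fan-in: synthesise
```

Threads suffice here because agent workers are I/O-bound (waiting on LLM and tool APIs), not CPU-bound.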

Routing classifies an incoming request and dispatches it to the appropriate specialised handler. E-commerce customer-service agents frequently use routing to direct different query types to different downstream flows.
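
In its simplest form, routing is a classifier plus a dispatch table, with a default handler for unrecognised query types (all callables here are stand-ins):

```python
def route(classify, handlers, default, request):
    """Classify the request and dispatch it to a specialised handler."""
    label = classify(request)                  # e.g. an LLM or a small classifier
    return handlers.get(label, default)(request)
```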

Evaluator-optimizer loops implement the self-refine pattern: a generator produces output, an evaluator grades it, and the generator revises until the evaluator approves or a maximum iteration count is reached. Google ADK’s LoopAgent and LangGraph’s cycle support enable this natively.
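
The loop reduces to a generate-grade-revise cycle with two exits: evaluator approval or an iteration cap. `generate`, `evaluate`, and `revise` below are stand-ins for LLM calls:

```python
def refine(generate, evaluate, revise, prompt, max_iters=3):
    """Generate, grade, revise until the evaluator approves or budget runs out."""
    draft = generate(prompt)
    for _ in range(max_iters):
        verdict = evaluate(draft)              # evaluator grades the draft
        if verdict["accepted"]:
            return draft, "accepted"
        draft = revise(draft, verdict["feedback"])  # generator revises
    return draft, "max_iterations"             # quality gate did not pass in budget
```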

Orchestrator-worker systems use a meta-agent to decompose a goal, delegate sub-tasks to specialist workers, and synthesise results. CrewAI’s role-based design and Amazon Bedrock’s supervisor-mode multi-agent collaboration are built explicitly around this pattern.

Handoff (agent transfer) allows an agent to delegate control to another mid-conversation. The OpenAI Agents SDK formalises this as a first-class primitive; Microsoft Agent Framework implements it through its workflow graph abstraction.

2.2 Termination conditions in depth

Reliable termination is one of the hardest practical problems. Research on multi-agent system failures identifies premature termination as a top failure mode — AppWorld frequently suffers from agents stopping before the task is complete, while OpenManus exhibits step repetition where agents loop without progress.

Production agents layer multiple termination strategies: goal completion (the LLM generates a final response without requesting another tool call), max-step or budget limits (a hard cap on iterations, tokens, or wall-clock time), error and confidence thresholds (abort on repeated failures or repeated identical actions), human override (products like Operator implement a “Takeover” mode), quality gates (an evaluator agent or LLM-as-judge approves the output — used by 52 % of production teams), and LLM stop signal (the default in most frameworks). Well-designed agents combine at least three of these strategies.
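
Layering these checks is straightforward: evaluate each condition in turn and stop on the first that trips. A sketch with hypothetical state fields and thresholds:

```python
import time

def should_terminate(state, max_steps=10, max_seconds=60.0,
                     max_errors=3, repeat_window=3):
    """Layered termination: any one tripped condition stops the agent."""
    if state.get("goal_met"):
        return "goal_met"                      # LLM stopped requesting tools
    if state["steps"] >= max_steps:
        return "max_steps"                     # hard iteration budget
    if time.monotonic() - state["started_at"] > max_seconds:
        return "timeout"                       # wall-clock budget
    if state["errors"] >= max_errors:
        return "error_threshold"               # repeated failures
    recent = state["actions"][-repeat_window:]
    if len(recent) == repeat_window and len(set(recent)) == 1:
        return "step_repetition"               # looping without progress
    return None                                # keep going
```

The repetition check catches the OpenManus-style failure mode directly: an agent issuing the same action three times in a row is almost never making progress.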


3. Comparison of Prominent Agent Frameworks

3.1 The framework landscape in 2026

The agent framework ecosystem has matured rapidly. Open-source libraries like LangGraph, CrewAI, and LlamaIndex coexist with vendor-specific platforms from each major hyperscaler and model provider. The industry experienced a reckoning in 2025 when 45 % of developers who tried LangChain never deployed it, and 23 % eventually removed it, citing abstraction tax and API instability. This backlash accelerated both the rise of lighter-weight open-source frameworks and the development of integrated enterprise platforms.

Microsoft consolidated its two agent projects — Semantic Kernel and AutoGen — into the unified Microsoft Agent Framework (announced October 2025, currently at Release Candidate status heading towards GA). Google open-sourced the Agent Development Kit (ADK) at Cloud NEXT in April 2025. Amazon launched Bedrock AgentCore as a fully managed platform, reaching GA in mid-2025 and receiving a major capabilities update at re:Invent in December 2025 with policy enforcement, episodic memory, and built-in evaluation. OpenAI shipped its Agents SDK in March 2025 as a deliberately minimal alternative. Meanwhile, OpenAI and Amazon jointly announced the Stateful Runtime Environment in February 2026, providing persistent agent state natively within Bedrock.

3.2 Open-source frameworks

| Dimension | LangGraph | CrewAI | OpenAI Agents SDK | LlamaIndex Workflows |
|---|---|---|---|---|
| Publisher | LangChain, Inc. | CrewAI, Inc. | OpenAI | LlamaIndex, Inc. |
| First release | 2024 | 2024 | March 2025 | 2023 (agents: 2025) |
| Design paradigm | Graph-based state machine — nodes are agents or functions, edges can be conditional or cyclic | Role-based agent teams — each agent has a role (Planner, Researcher, Writer) | Minimal agent + handoffs — any Python function becomes a tool via automatic schema generation | Data-centric agent workflows over document indices and retrieval pipelines |
| Orchestration | Directed graph (DAG + cycles); conditional edges; supervisor patterns; explicit state via annotated TypedDict; checkpointing and persistence | Crew of agents with sequential or parallel task execution; shared memory via retrieval | Agents with tools + guardrails; handoffs transfer control between specialist agents; sessions manage conversation history | AgentWorkflow + AgentRunner; event-driven steps; Agentic Document Workflows for end-to-end knowledge work |
| Multi-agent | Native — agents are graph nodes | Core design — role-based teams | Via handoffs | Via multi-agent workflow orchestration |
| Observability | LangSmith (trace, replay, metrics) | Basic logging; LangSmith compatible | Built-in tracing (one-line enablement) | LlamaTrace; callback handlers |
| Model support | Model-agnostic | Model-agnostic (GPT, Claude, Gemini, Llama, Mistral) | OpenAI models only | Model-agnostic |
| Languages | Python, JS/TS | Python | Python, Node.js | Python, TypeScript |
| GitHub stars | ~30 k (LangChain org) | ~44 k | ~5 k | ~40 k (LlamaIndex org) |
| Maturity | Production-proven (Klarna, Replit, Elastic) | Growing production adoption | Production-ready; minimal surface area | Production-ready for document-heavy use cases |
| Best for | Complex stateful workflows with conditional logic and cycles | Teams that think in roles and business processes | Developers wanting minimal abstractions and OpenAI-native tooling | RAG-heavy and document-centric agent systems |

3.3 Hyperscaler and vendor platforms

| Dimension | Google ADK | Microsoft Agent Framework | Amazon Bedrock AgentCore |
|---|---|---|---|
| Publisher | Google (open-source) | Microsoft (open-source) | Amazon Web Services (managed service) |
| Announced | April 2025 (Cloud NEXT) | October 2025 (merging Semantic Kernel + AutoGen) | June 2025 GA; major update December 2025 (re:Invent) |
| Design paradigm | Event-driven runtime — LlmAgent for dynamic reasoning; SequentialAgent, ParallelAgent, LoopAgent for structured workflows; hierarchical agent composition | Unified framework combining AutoGen’s simple agent abstractions with Semantic Kernel’s enterprise features — graph-based workflows for explicit multi-agent orchestration | Fully managed agentic platform — agents built via natural language instructions, with supervisor-mode multi-agent collaboration and policy enforcement |
| Key differentiators | Multi-language (Python, TypeScript, Go, Java); built-in visual Web UI for step-by-step debugging; native bidirectional audio/video streaming; MCP tool support; built-in evaluation | Session-based state management; type-safe middleware and filters; telemetry; checkpointing for human-in-the-loop; cross-cloud connectors (Azure, AWS, GCP); .NET and Python at parity | Zero-infrastructure deployment; Policy (natural-language rules auto-converted to Cedar policy language) intercepting every tool call in real time; Guardrails blocking up to 88 % of harmful content; Evaluations with 13 built-in evaluators; episodic memory; bidirectional streaming; complete session isolation |
| Multi-agent | Native hierarchy — specialised agents composed as tools or sub-agents | Graph-based multi-agent workflows with persistent state and error recovery | Supervisor-mode multi-agent collaboration with automatic task delegation to specialist sub-agents |
| Observability | CLI + visual Web UI; built-in evaluation framework | OpenTelemetry integration; Azure Monitor; built-in telemetry | Amazon CloudWatch dashboard; AgentCore Evaluations |
| Model support | Optimised for Gemini; model-agnostic via LiteLLM | Azure OpenAI plus any model via connectors; bring-your-own-model through API gateways | Hundreds of FMs via Bedrock (Anthropic, Meta, Mistral, Cohere, Amazon Nova, etc.); Intelligent Prompt Routing can cut costs by up to 30 % |
| Cloud integration | Vertex AI Agent Engine for managed deployment; Cloud Run for containers | Azure AI Foundry Agent Service for hosted deployment; Microsoft 365 integration; Copilot Studio interop | Native AWS — VPC connectivity, PrivateLink, IAM, encryption at rest/in transit; Lambda and MCP tool integration; AgentCore Gateway for tool connectivity |
| Languages | Python, TypeScript, Go, Java | C#/.NET, Python | Python (primary); any language via API; any framework via AgentCore Runtime |
| Maturity | GA; powers Google Agentspace and Customer Engagement Suite | Public preview → Release Candidate (February 2026); successor to two battle-tested frameworks | GA; used by 100 k+ organisations including Robinhood, Epsilon, PGA Tour, Ericsson |
| Best for | Google Cloud / Gemini-centric teams; polyglot environments | Microsoft / Azure / .NET shops; enterprises needing cross-cloud flexibility | Enterprises wanting fully managed agent infrastructure with built-in governance, security, and compliance |

3.4 Platform deep-dives

Amazon Bedrock AgentCore

Amazon’s agent platform has become a layered stack: Bedrock Agents provides guided agent building via natural-language instructions and automatic task decomposition, while Bedrock AgentCore sits beneath it as a framework-agnostic infrastructure layer for building, deploying, and operating agents at scale. AgentCore includes a Gateway that converts APIs and Lambda functions into agent-compatible tools and connects to MCP servers with semantic search for intelligent tool discovery. The Policy system (December 2025) uses natural language translated into AWS’s open-source Cedar policy language to enforce boundaries on every tool call. AgentCore Runtime supports workloads ranging from low-latency conversations to 8-hour asynchronous tasks, with complete session isolation and container-based deployment. The February 2026 partnership with OpenAI introduced the Stateful Runtime Environment, providing persistent working context across multi-step tasks without custom orchestration infrastructure.

Microsoft Agent Framework

Microsoft’s approach unifies the best of two lineages. From Semantic Kernel comes enterprise-grade stability: session-based state management, type safety, middleware filters, telemetry via OpenTelemetry, and deep Azure integration. From AutoGen comes pioneering multi-agent orchestration and developer experience (including AutoGen Studio). Agent Framework adds graph-based workflows for explicit control over multi-agent execution paths, a robust state management system for long-running and human-in-the-loop scenarios, and MCP client support for tool integration. The framework deploys to Azure AI Foundry Agent Service as hosted agents with no container images or Kubernetes clusters required. Foundry Agent Service supports agents built with Agent Framework, LangGraph, CrewAI, or other open-source frameworks, providing enterprise-grade identity, observability, governance, and autoscaling.

Google Agent Development Kit (ADK)

Google’s ADK distinguishes itself through its multi-language support (Python, TypeScript, Go, Java — the broadest of any agent framework), its hierarchical agent composition (any agent can use another agent as a tool), and its built-in visual Web UI for step-by-step debugging. ADK provides three structured workflow agents — SequentialAgent, ParallelAgent, LoopAgent — alongside LlmAgent for dynamic LLM-driven reasoning. The framework supports bidirectional audio and video streaming for multimodal agent experiences. In production, ADK powers Google’s Agentspace (enterprise search and agent platform) and the Customer Engagement Suite (contact centre AI). ADK agents can be deployed on Vertex AI Agent Engine for fully managed hosting or on Cloud Run for custom container deployments.

3.5 Selection guidance

Complexity vs. control — LangGraph and Microsoft Agent Framework offer the most granular control through graph abstractions. OpenAI Agents SDK and Google ADK optimise for simplicity. Amazon Bedrock AgentCore minimises infrastructure burden at the cost of platform lock-in.

Model lock-in — OpenAI Agents SDK is tightly coupled to OpenAI models. Amazon Bedrock AgentCore supports hundreds of models across providers. Google ADK defaults to Gemini but supports others via LiteLLM. LangGraph, CrewAI, Microsoft Agent Framework, and LlamaIndex are all model-agnostic.

Ecosystem fit — Choose the framework that matches your existing infrastructure: LangGraph if you’re invested in LangSmith; Google ADK for Vertex AI; Microsoft Agent Framework for Azure AI Foundry; Amazon Bedrock AgentCore for AWS-native workloads; CrewAI or LlamaIndex for framework-agnostic, open-source-first approaches.

Build vs. buy — 85 % of detailed production case studies forgo third-party frameworks entirely, building custom orchestration. This suggests the value of a framework lies more in its patterns and abstractions than in its runtime. That said, the managed deployment options from hyperscalers (Foundry Agent Service, AgentCore Runtime, Vertex AI Agent Engine) reduce the infrastructure burden substantially.


4. Design Process and Deployment Models

4.1 The agent design lifecycle

Define scope and role. Start with a single, well-defined use case. AtlantiCare tested an agentic clinical assistant with just fifty providers, achieving 80 % adoption and a 42 % reduction in documentation time before expanding. Monolithic “super-agents” suffer from instruction overload, inaccurate outputs, and brittle scaling.

Select brain, body, and eyes. Cost, latency, and reasoning depth drive model selection. The framework comparison in §3 guides body selection. Tools like LangSmith, Arize Phoenix, Google ADK’s Web UI, or Amazon CloudWatch address observability.

Design the workflow. Map the task to one or more patterns from §2.1. Start with the simplest pattern that works — many production agents begin as sequential chains and only evolve into more complex architectures when demanded.

Implement guardrails and termination. Define input validation, output sanitisation, permission scoping, and at least three termination strategies. Guardrails are a core architectural component, not an optional add-on.

Evaluate iteratively. 52 % of organisations run offline evaluations on test sets; 37 % add online evaluations in production. Build evaluation into the development loop, not as an afterthought. Amazon Bedrock AgentCore provides 13 built-in evaluators and a CloudWatch dashboard; Microsoft Agent Framework ships evaluation capabilities via Azure AI Foundry.

Deploy incrementally. Start with human-in-the-loop approval for every action, then selectively relax oversight as confidence grows.

4.2 Deployment models


Figure 3 — Three deployment architectures for LLM agents, from embedded single-agent to multi-agent platforms.

Embedded agent — the agent is a library within the application. Simplest to deploy and most common in practice. Agent-as-a-service — the agent runs as an independent, containerised microservice behind an API gateway, enabling independent scaling, versioning, and rollback. Stripe uses this model with an internal LLM proxy over Amazon Bedrock. Multi-agent platform — multiple specialised agents run on a shared orchestration layer. Google ADK, Amazon Bedrock multi-agent collaboration, Microsoft Foundry Agent Service, and AWS AgentCore Runtime all provide managed hosting for this model.

Key selection factors include the autonomy budget (how many steps before human intervention), state requirements (stateless agents can run in serverless containers; stateful agents need managed session services like the OpenAI–Amazon Stateful Runtime), and multi-model orchestration (59 % of deployed agents use two or more LLMs, routing simpler sub-tasks to smaller, cheaper models — Amazon Bedrock’s Intelligent Prompt Routing automates this).
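
The multi-model routing decision itself can be as simple as a difficulty heuristic. The sketch below is a crude illustration (managed routers such as Intelligent Prompt Routing use learned quality predictions, not a length check):

```python
def pick_model(sub_task, models):
    """Route simpler sub-tasks to a smaller, cheaper model (heuristic sketch)."""
    # Crude difficulty proxy: long inputs or flagged reasoning need the big model.
    hard = sub_task.get("needs_reasoning", False) or len(sub_task["text"]) > 2000
    return models["large" if hard else "small"]
```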


5. Security Model


Figure 4 — Threat model for LLM agents and the defence-in-depth layers that mitigate each threat category.

5.1 The agent attack surface

LLM agents introduce a fundamentally new attack surface. The combination of tool access, autonomous decision-making, and exposure to untrusted input creates the confused deputy problem: the agent trusts anything that can send it convincing-sounding tokens, making it an ideal target for prompt injection. The OWASP Top 10 for LLM Applications (2025) and the OWASP Top 10 for Agentic Applications (December 2025) provide the canonical vulnerability taxonomy.

Direct prompt injection — crafted input that overrides the agent’s system prompt. Appears in over 73 % of production AI deployments assessed during audits. Attack success rates exceed 84 % in benchmarked agentic systems.

Indirect prompt injection — malicious instructions embedded in external data (documents, emails, web pages, RAG knowledge bases). In early 2025, GitHub Copilot suffered CVE-2025-53773, allowing remote code execution through prompt injection hidden in repository code comments. In February 2026, OpenAI launched Lockdown Mode for ChatGPT, acknowledging that prompt injection in AI browsers “may never be fully patched.”

Tool and API misuse (excessive agency) — overly broad permissions enabling attackers to trigger data exfiltration, purchases, or system modifications. Amazon Bedrock AgentCore’s Policy system addresses this by intercepting every tool call in real time and enforcing Cedar-based access rules. Microsoft Agent Framework provides middleware filters for the same purpose.
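
Default-deny interception of tool calls can be sketched in a few lines; this illustrates the idea, not Cedar's actual semantics:

```python
def enforce(policy, tool_call):
    """Allow a tool call only if an explicit rule permits it (default deny)."""
    for rule in policy:
        if rule["tool"] == tool_call["tool"] and rule["action"] == tool_call["action"]:
            return rule["effect"] == "allow"
    return False  # no matching rule: the call is blocked
```

The important property is the last line: an agent granted a new tool gains no access until someone writes a rule for it.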

Supply chain attacks — compromised models, poisoned training data, or malicious plugins. Research demonstrates that just five poisoned documents among millions can achieve 90 % attack success rates in enterprise RAG deployments.

Memory poisoning — attackers corrupt the agent’s persistent context to influence future interactions. The ICLR 2025 Agent Security Bench shows this as a high-success-rate attack vector.

Output exploitation — LLM output consumed by downstream systems without sanitisation enables SQL injection, XSS, or command injection.
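
The classic mitigation applies unchanged to agents: bind model output as data, never splice it into executable strings. A sqlite3 sketch (`lookup_user` and the schema are illustrative):

```python
import sqlite3

def lookup_user(conn, llm_supplied_name):
    """Treat LLM output as untrusted: bind it as a parameter, never interpolate."""
    # Unsafe alternative: f"SELECT id FROM users WHERE name = '{llm_supplied_name}'"
    cur = conn.execute("SELECT id FROM users WHERE name = ?", (llm_supplied_name,))
    return cur.fetchall()
```

With parameter binding, a payload like `' OR '1'='1` is matched as a literal name and returns nothing instead of dumping the table.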

5.2 Defence-in-depth architecture

No single defence addresses all threats. The security model requires layered controls:

Input filtering — prompt shields such as Microsoft Prompt Shields; Amazon Bedrock Guardrails blocks up to 88 % of harmful content, with 99 % accuracy for hallucination detection using Automated Reasoning checks.

Least-privilege tool scoping — Amazon Bedrock AgentCore Policy, Microsoft Agent Framework middleware, and Google ADK’s tool-as-agent composition for isolation.

Output sanitisation — treat all LLM output as untrusted.

Human-in-the-loop — require explicit user confirmation for high-impact actions.

Sandboxing — isolate code execution in containers or managed sandboxes; AgentCore Runtime provides complete session isolation.

Observability — trace every agent step for post-incident analysis.

Supply chain audit — maintain an AI Bill of Materials, use model registries, and align with MITRE ATLAS and the NIST AI RMF.

5.3 Mapping to security frameworks

| Security Concern | Applicable Framework(s) |
|---|---|
| Prompt injection detection and prevention | OWASP Top 10 for LLMs (LLM01), OWASP Agentic Applications Top 10 |
| Adversarial threat modelling for agents | MITRE ATLAS (14 agentic AI techniques added October 2025) |
| Tool permission governance | NIST AI RMF (Govern, Manage functions), CSA AI Controls Matrix |
| Supply chain integrity | Google SAIF, ISO/IEC 42001, NIST CSF Profile for AI |
| Regulatory compliance (EU) | EU AI Act (high-risk classification for autonomous decision-making) |
| Certifiable management systems | ISO/IEC 42001 |
| Cloud-AI operational controls | CSA AICM (243 control objectives across 18 domains) |

6. Successful Production Deployments

6.1 What production looks like

An analysis of over 1,200 production deployments reveals a consistent pattern: the teams shipping reliable systems are distinguished less by AI research credentials than by their software engineering fundamentals. Customer service is the leading use case at 26.5 % of deployments, followed by research and data analysis at 24.4 %.

6.2 Case studies

Klarna — customer service at scale. Klarna deployed an LLM-powered assistant handling millions of conversations monthly across multiple markets and languages. The system uses LangGraph for orchestration and has become a flagship reference customer for the framework. Lesson: start with a high-volume, well-understood workflow, constrain scope tightly, and invest in observability.

Stripe — compliance investigation. Stripe built an agent for internal compliance using Amazon Bedrock, chosen for its unified security vetting across model providers and prompt caching to address the quadratic cost problem in iterative loops. Stripe built an internal LLM proxy on top for traffic management, model fallback, and bandwidth allocation. Lesson: the differentiation in production happens in the operational layer above the model platform.

Robinhood — financial services at scale. Robinhood scaled from 500 million to 5 billion tokens daily in six months using Amazon Bedrock, while cutting AI costs by 80 % and development time in half. Bedrock’s model diversity, security, and compliance features were critical for their regulated environment. Lesson: in regulated industries, the compliance and security story of the platform matters as much as the model capabilities.

Epsilon — marketing automation. Epsilon used Amazon Bedrock AgentCore to transform campaign operations, enabling intelligent agents to automate complex campaign workflows while maintaining enterprise-grade security and compliance. The results: campaign setup time cut by 30 %, personalisation increased by 20 %, and 8 hours saved per team per week. Lesson: agent platforms that handle security and compliance natively free teams to focus on business logic.

Ericsson — R&D at infrastructure scale. Ericsson uses Amazon Bedrock AgentCore across 3G/4G/5G/6G systems spanning millions of lines of code, reporting double-digit productivity gains across a workforce in the tens of thousands. AgentCore’s support for any agent framework was critical for scaling across many teams. Lesson: framework-agnostic platforms enable adoption across diverse engineering organisations.

Prosus — enterprise assistant. Prosus deployed “Toan,” a RAG-based assistant on Amazon Bedrock serving 15,000+ employees across 24 companies, reducing hallucination rates to below 2 % through iterative optimisation. Lesson: reliability at enterprise scale requires systematic evaluation loops.

AtlantiCare — clinical documentation. Fifty healthcare providers piloted an agentic clinical assistant, achieving 80 % adoption and 42 % reduction in documentation time (approximately 66 minutes per provider per day). Lesson: incremental deployment in controlled environments builds confidence for broader rollout.

Mercado Libre — developer productivity. Latin America’s largest e-commerce platform deployed an AI coding assistant fine-tuned on internal codebases to accelerate development across engineering teams. Lesson: domain-specific fine-tuning on proprietary code improves agent accuracy for specialised environments.

6.3 Patterns across successful deployments

Constrain autonomy deliberately. Production agents that succeed are tightly scoped. Adding more tools often degrades performance — one code-review agent saw quality collapse when given too many tools.
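One simple way to enforce this scoping is to expose only an explicit allowlist of tools to the agent, rather than every tool the host system happens to have. A minimal sketch — the tool names and the dict-of-callables representation are illustrative, not any framework's actual registry API:

```python
# Deliberately small, explicit allowlist of tool names (hypothetical).
ALLOWED_TOOLS = {"search_docs", "read_file"}

def select_tools(available_tools: dict) -> dict:
    """Return only the allowlisted subset of available tools.

    Everything not explicitly permitted is withheld from the agent,
    so adding a new tool to production is a deliberate decision.
    """
    return {name: fn for name, fn in available_tools.items()
            if name in ALLOWED_TOOLS}
```

The same inversion — default-deny rather than default-allow — applies whatever framework you use to register tools.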

Invest in the operational layer. The model is rarely the bottleneck. Reliable agents require custom proxy services, fallback logic, cost management, rate limiting, and authentication. Both Stripe’s LLM proxy and Amazon’s Intelligent Prompt Routing illustrate this principle.
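The shape of such an operational layer can be sketched in a few lines. This is a minimal illustration, not Stripe's actual proxy: the `client` callable, model names, and naive rate limiter are all assumptions standing in for real vendor SDKs and infrastructure:

```python
import time

class ModelProxy:
    """Toy internal LLM proxy: fallback chain plus basic rate limiting.

    `client(model, prompt)` is a placeholder for a real provider call;
    `models` is an ordered preference list.
    """

    def __init__(self, client, models, max_requests_per_sec=5):
        self.client = client
        self.models = models
        self.min_interval = 1.0 / max_requests_per_sec
        self._last_call = 0.0

    def complete(self, prompt: str) -> str:
        # Naive rate limiting: space calls at least min_interval apart.
        wait = self.min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        errors = []
        for model in self.models:  # try each model in preference order
            try:
                self._last_call = time.monotonic()
                return self.client(model, prompt)
            except Exception as exc:  # narrow the exception type in real code
                errors.append((model, exc))
        raise RuntimeError(f"all models failed: {errors}")
```

Production versions add authentication, per-team budgets, and observability hooks, but the fallback-chain core stays the same.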

Favour manual prompt engineering. 79 % of deployed agents rely on manual prompt construction, with prompts often exceeding 10,000 tokens.

Use multiple models. 59 % of deployed agents use two or more LLMs, routing simpler sub-tasks to smaller, cheaper models. Amazon Bedrock’s model distillation runs up to 500 % faster and costs up to 75 % less with minimal accuracy impact.
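A router can start as a crude heuristic and graduate to a learned classifier later. The sketch below uses prompt length and a keyword as the routing signal purely for illustration; the model names are placeholders, and real routers (e.g. Bedrock's Intelligent Prompt Routing) use far richer signals:

```python
def route_model(task_prompt: str, threshold: int = 400) -> str:
    """Illustrative heuristic router for multi-model setups.

    Short, simple sub-tasks go to a cheap model; longer or
    analysis-heavy ones go to a stronger model. Both model
    names are hypothetical placeholders.
    """
    if len(task_prompt) < threshold and "analyze" not in task_prompt.lower():
        return "small-cheap-model"
    return "large-capable-model"
```

Even a heuristic this blunt captures the economic point: most sub-tasks in an agent loop do not need the most capable model.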

Build evaluation into the workflow. Successful teams combine offline evaluation, online monitoring, and human review. Nearly a quarter of organisations combine all three approaches. Amazon Bedrock AgentCore’s 13 built-in evaluators and Microsoft Agent Framework’s evaluation integration lower the barrier to systematic testing.
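The offline leg of that loop can be as simple as replaying a fixed case set against the agent and tracking pass rate over time. A minimal harness sketch — `agent_fn` and the per-case checker functions stand in for your own agent and evaluators:

```python
def run_offline_eval(agent_fn, cases):
    """Run the agent over (input, checker) pairs and report pass rate.

    `agent_fn` maps an input to an output; each `check` callable
    returns True if the output is acceptable. Both are placeholders
    for your own agent and evaluation logic.
    """
    results = []
    for inp, check in cases:
        out = agent_fn(inp)
        results.append({"input": inp, "output": out, "passed": bool(check(out))})
    passed = sum(r["passed"] for r in results)
    return {"pass_rate": passed / len(cases), "results": results}
```

Running this on every prompt or tool change, before online monitoring and human review, is the cheapest of the three evaluation layers.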


Conclusion

LLM agents have moved from research curiosities to production infrastructure. The field in 2026 is characterised by maturing workflow design patterns, a consolidation of frameworks around practical production concerns (with hyperscalers competing on managed platforms alongside a healthy open-source ecosystem), and an emerging security discipline that recognises agents as a fundamentally new attack surface.

The most important insight from production deployments is that building reliable agents is a software engineering problem, not an AI research problem. The teams that succeed treat their agents as probabilistic components within well-architected systems — constraining autonomy, layering defences, investing in observability, and deploying incrementally.

The framework landscape has split into two clear tiers. Open-source libraries like LangGraph, CrewAI, OpenAI Agents SDK, and LlamaIndex provide patterns and abstractions for developers who want full control. Managed platforms from Amazon (Bedrock AgentCore), Microsoft (Agent Framework + Foundry), and Google (ADK + Vertex AI) provide infrastructure-level capabilities — policy enforcement, episodic memory, built-in evaluation, and zero-infrastructure deployment — that reduce operational burden for enterprises. Choosing between them is less about technical superiority and more about ecosystem alignment, compliance requirements, and the build-versus-buy decision for your operational layer.

For practitioners getting started, the path is clear: define a single, well-scoped use case; select a framework that matches your ecosystem and complexity needs; implement defence-in-depth security from day one; deploy with human-in-the-loop oversight; and expand scope only after evaluation confirms reliability. The frameworks, patterns, and lessons described in this post provide the map. The engineering work of building the road remains with you.