An LLM agent analyzes quarterly sales data. It misinterprets Q2 growth as 15% instead of 5%. This becomes the baseline for Q3 projections. The agent then builds a hiring plan based on the inflated projections. By step 5, it's recommending the company triple its workforce.
Both scenarios showcase the same fundamental problem: error propagation. Yet while computer science theory predicted it decades ago, and robotics engineers spent those same decades building sophisticated error-correction mechanisms, the AI community is deploying multi-step LLM agents with barely a whisper about compound failures.
The Computer Science Foundation We're Ignoring
Long before robots or LLMs existed, computer science established the mathematical foundations of error propagation. Wilkinson (1963) in "Rounding Errors in Algebraic Processes" proved that numerical errors compound predictably in sequential computations. His work on condition numbers showed exactly how input uncertainties amplify through algorithmic chains.
Goldberg (1991) in "What Every Computer Scientist Should Know About Floating-Point Arithmetic" demonstrated that even simple arithmetic operations suffer from cumulative precision loss. The IEEE 754 standard exists precisely because early computer scientists recognized that ignoring error propagation leads to catastrophic failures in computational systems.
The theoretical framework was clear: any sequential system without error correction accumulates error with every operation it performs, and its reliability degrades as the chain grows. This isn't just theory—it's why financial systems use decimal arithmetic instead of floating-point, and why NASA's flight computers employ triple redundancy.
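The effect is easy to demonstrate. A minimal Python sketch (my illustration, not taken from the cited works) shows binary floating-point drifting under repeated addition while decimal arithmetic, the kind financial systems rely on, stays exact:

```python
from decimal import Decimal

# Add one cent a million times with binary floating-point.
total_float = 0.0
for _ in range(1_000_000):
    total_float += 0.01

# The same sum with decimal arithmetic, as financial systems use.
total_decimal = Decimal("0.00")
for _ in range(1_000_000):
    total_decimal += Decimal("0.01")

print(total_float)    # slightly off from 10000.0: a million tiny roundings have compounded
print(total_decimal)  # 10000.00 exactly
```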
The Robotics Response: Engineering for Reality
The robotics community didn't just acknowledge these mathematical realities—they engineered solutions. The transition from theoretical computer science to physical systems revealed new dimensions of the error propagation problem.
Chatila and Laumond (1985) in "Position Referencing and Consistent World Modeling for Mobile Robots" showed that sensor noise compounds quadratically with the number of observations. This led to the development of simultaneous localization and mapping (SLAM) algorithms that explicitly model and correct for cumulative uncertainty.
LaValle (2006) in "Planning Algorithms" formalized the concept of configuration space obstacles created by uncertainty propagation. His work showed that without explicit error modeling, path planning algorithms become unreliable after just a few waypoints.
The robotics solution was systematic:
- Model uncertainty explicitly at every step
- Implement closed-loop feedback to correct accumulated errors
- Use probabilistic frameworks (Kalman filters, particle filters) to track confidence; a toy filter follows this list
- Design for graceful degradation when uncertainty exceeds acceptable bounds
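To make the probabilistic machinery concrete, here is a toy one-dimensional Kalman-style filter (a sketch of mine, not lifted from the cited texts). The predict step grows the uncertainty; each measurement update shrinks it back, so the variance stays bounded instead of compounding:

```python
def predict(x, p, motion, motion_var):
    """Open-loop step: move the estimate and inflate its variance."""
    return x + motion, p + motion_var

def update(x, p, measurement, measurement_var):
    """Closed-loop step: fuse a noisy measurement, shrinking the variance."""
    k = p / (p + measurement_var)                 # Kalman gain
    return x + k * (measurement - x), (1 - k) * p

x, p = 0.0, 1.0                                   # initial estimate and variance
for z in [1.2, 2.1, 2.9, 4.2]:                    # noisy position readings
    x, p = predict(x, p, motion=1.0, motion_var=0.5)
    x, p = update(x, p, measurement=z, measurement_var=0.3)
    print(f"estimate={x:.2f}  variance={p:.2f}")  # variance settles rather than growing
```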
LLMs: A New Class of Sequential System
LLM agents represent a fascinating convergence of computer science theory and robotics practice, but they operate in the space of semantic computation rather than numerical calculation or physical manipulation.
The Computer Science Parallel: Like floating-point arithmetic, each LLM inference introduces uncertainty. Bengio et al. (2013) in "Representation Learning: A Review and New Perspectives" showed that deep networks accumulate representational errors through their layers. LLM agents simply extend this to the temporal dimension—errors accumulate across reasoning steps rather than network layers.
The Robotics Parallel: Like sensor fusion, LLM agents must integrate information from multiple sources (context, tools, memory) while maintaining coherent world models. Thrun et al. (2005) in "Probabilistic Robotics" demonstrated that without explicit uncertainty tracking, integrated information becomes unreliable exponentially fast.
The Unique Challenge: Unlike numerical computation (where errors are well-defined) or robotics (where errors are measurable), LLM semantic errors are often undetectable until propagation makes them catastrophic. A hallucinated fact looks identical to a real fact until it causes downstream failures.
The Mathematical Reality: Why This Was Predictable
The error propagation in LLM agents follows well-established mathematical principles, but manifests in ways that make traditional solutions challenging:
From Numerical Analysis: Higham (2002) in "Accuracy and Stability of Numerical Algorithms" proved that error propagation follows condition number mathematics. For LLM agents, the "condition number" is effectively the semantic sensitivity of each reasoning step to input uncertainty.
From Information Theory: Shannon (1948) established that information transmission through noisy channels degrades predictably. LLM reasoning chains are essentially semantic channels where each step introduces noise, but unlike digital channels, we lack error-correcting codes for meaning.
From Control Theory: Åström and Murray (2021) in "Feedback Systems: An Introduction for Scientists and Engineers" showed that open-loop systems (like current LLM agents) have no way to compensate for accumulating disturbances, while closed-loop systems with feedback can maintain stability over many iterations.
The mathematics predicted exactly what we're observing: sequential systems without error correction mechanisms will fail predictably as chain length increases.
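Stated as a formula: under the idealized assumption that each step succeeds independently with probability p, an n-step chain succeeds with probability

```latex
R(n) = p^{n}, \qquad \text{e.g. } R(20)\big\rvert_{p=0.95} = 0.95^{20} \approx 0.36
```

Even 95% per-step accuracy leaves a 20-step chain succeeding barely a third of the time; real agents, whose steps are not independent, can do better or considerably worse.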
The Counterargument: Why LLMs Might Be Different
Before accepting the doom-and-gloom narrative, we should examine why some researchers believe LLMs might escape the traditional error propagation trap:
Emergent Error Correction: Brown et al. (2020) in "Language Models are Few-Shot Learners" (the GPT-3 paper) suggested that large language models exhibit "emergent capabilities" that might include self-correction. Some researchers argue that sufficiently large models can detect and correct their own errors through their training on self-consistent text.
Semantic Robustness: Hendrycks et al. (2021) in "Measuring Massive Multitask Language Understanding" found that large models show surprising robustness to input perturbations. This suggests that semantic reasoning might be more fault-tolerant than numerical computation.
Context-Driven Recovery: Wei et al. (2022) in "Chain-of-Thought Prompting" showed that models can sometimes recover from early errors when provided with sufficient context. The argument: semantic systems might have self-healing properties that numerical systems lack.
The Scale Hypothesis: Kaplan et al. (2020) in "Scaling Laws for Neural Language Models" suggested that error rates decrease predictably with model size. If true, sufficiently large models might achieve error rates low enough to make propagation manageable.
Why the Counterargument Falls Short
However, empirical evidence suggests these optimistic views don't hold under systematic analysis:
Emergent Correction is Inconsistent: Kadavath et al. (2022) in "Language Models (Mostly) Know What They Know" found that while models can sometimes self-correct, this ability is unpredictable and doesn't scale systematically with task complexity.
Semantic Robustness Has Limits: Ribeiro et al. (2020) in "Beyond Accuracy: Behavioral Testing of NLP Models" showed that apparent robustness often masks brittleness to specific types of semantic perturbations—exactly the kind that propagate through reasoning chains.
Context Recovery Requires Perfect Context: The self-healing properties depend on maintaining perfect contextual information, but Liu et al. (2023) in "Lost in the Middle" demonstrated that long-context reasoning degrades significantly as context length increases.
Scale Doesn't Solve Systemic Issues: Ganguli et al. (2022) in "Predictability and Surprise in Large Generative Models" found that while individual error rates decrease with scale, systemic issues like hallucination and reasoning failures persist even in the largest models.
The Empirical Evidence: Measuring the Invisible Failure
The theoretical debates matter less than empirical evidence. Recent systematic studies provide clear data on error propagation in LLM systems:
Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" found that reasoning accuracy degrades significantly in multi-step problems. Their analysis of GPT-3 on arithmetic word problems showed:
- 2-step problems: 78% accuracy
- 4-step problems: 58% accuracy
- 8-step problems: 31% accuracy
This follows the exponential decay predicted by classical error propagation theory.
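A quick sanity check (my arithmetic, not the paper's) shows how closely those three figures track a single per-step accuracy, which is exactly what geometric decay looks like:

```python
# Multi-step accuracies as quoted above.
reported = {2: 0.78, 4: 0.58, 8: 0.31}

# If accuracy decays geometrically, acc(n) ~= p ** n for some constant per-step p.
for steps, acc in reported.items():
    implied_p = acc ** (1 / steps)
    print(f"{steps}-step accuracy {acc:.2f} -> implied per-step accuracy {implied_p:.2f}")
# Prints roughly 0.88, 0.87, 0.86: close to one constant rate per step.
```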
Press et al. (2023) in "Measuring and Narrowing the Compositionality Gap in Language Models" provided the most comprehensive analysis to date. Their key finding: "Performance degradation follows a power law as the number of composition steps increases." This is the same qualitative pattern Wilkinson (1963) described for sequential computation: the longer the chain, the less of the original accuracy survives.
Huang et al. (2023) in "A Survey on Hallucination in Large Language Models" documented the mechanism: factual errors propagate through reasoning chains with 73% probability of causing downstream failures. Critically, they found that error detection decreases as chain length increases—the system becomes less capable of recognizing its own mistakes precisely when it's making more of them.
Three Perspectives, One Conclusion
The convergence is striking:
Computer Science Theory (1960s-1990s): Sequential computation without error correction is inherently unstable. Mathematical proof exists.
Robotics Practice (1980s-2010s): Physical systems confirm the theory. Engineering solutions developed through necessity.
LLM Empirics (2020s): Semantic reasoning systems exhibit identical patterns. The mathematics still holds.
All three fields arrived at the same conclusion through different paths: systems that chain operations without explicit error correction will fail predictably as chain length increases.
The Compound Interest of Being Wrong
The mathematics of error propagation are well-established in control theory. Kalman (1960) laid the groundwork in "A New Approach to Linear Filtering and Prediction Problems," showing how errors accumulate in dynamic systems.
For LLM agents, Dziri et al. (2023) in "Faith and Fate: Limits of Transformers on Compositionality" provided concrete measurements of how accuracy collapses with compositional depth. The arithmetic behind that collapse is unforgiving: if each step in an agent workflow is 90% accurate, a 10-step process is only about 35% reliable (0.9^10 ≈ 0.35).
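That arithmetic is easy to reproduce and, under the same independence assumption, easy to extend to other per-step accuracies (a quick sketch of mine, not code from the paper):

```python
# Chain reliability under an independent per-step accuracy p: R(n) = p ** n.
for p in (0.99, 0.95, 0.90):
    for n in (5, 10, 20):
        print(f"per-step {p:.2f}, {n:2d} steps -> chain reliability {p ** n:.2f}")
# Even 99% per step leaves a 20-step chain at ~82%;
# 90% per step over 10 steps gives the ~35% figure above.
```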
More concerning, Gao et al. (2023) in "Retrieval-Augmented Generation for AI-Generated Content: A Survey" showed that errors don't just multiply—they accelerate. In their study of multi-step reasoning:
- Steps 1-3: Linear error accumulation
- Steps 4-7: Exponential error growth
- Steps 8+: Catastrophic failure rates above 80%
This matches exactly what robotics engineers discovered in the 1980s with manipulator arms and mobile robots.
This isn't theoretical. Shinn et al. (2023) in "Reflexion: Language Agents with Verbal Reinforcement Learning" documented systematic failures in production-like environments:
- WebShop task: Agents successfully completed simple purchases but failed 68% of multi-step transactions due to context loss and error accumulation
- HotPotQA: Multi-hop question answering saw accuracy drop from 82% (single-hop) to 34% (four-hop) as reasoning chains grew longer
- Programming tasks: Code generation agents produced syntactically correct but functionally broken applications when initial architectural decisions were flawed
Yao et al. (2023) in "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" showed similar patterns across multiple domains, concluding that "the reliability of multi-step reasoning degrades faster than previously assumed."
Why the Silence? Understanding the Resistance
The reluctance to address error propagation in LLM agents stems from several sources:
Historical Perspective: Computer science developed error analysis because early systems failed catastrophically without it. Robotics adopted these principles because physical failures are impossible to ignore. LLM agents produce plausible-sounding failures that can be dismissed as "edge cases" or "prompt engineering problems."
Economic Incentives: The current AI boom rewards rapid deployment over systematic reliability. Christensen (1997) in "The Innovator's Dilemma" predicted this pattern: disruptive technologies initially prioritize capability over reliability, often to their eventual detriment.
Complexity Illusion: The sophistication of modern LLMs masks the brittleness of multi-step systems built on top of them. This is analogous to what Perrow (1984) called "normal accidents" in "Normal Accidents: Living with High-Risk Technologies"—complex systems fail in ways that seem impossible until they happen.
Domain Transfer Resistance: Each field (computer science, robotics, AI) tends to believe its problems are unique. The mathematical foundations are identical, but the surface differences create cognitive barriers to knowledge transfer.
Learning from All Three Domains: The Path Forward
The solution isn't to abandon LLM agents, but to apply the hard-won lessons from computer science theory and robotics practice:
From Numerical Analysis: Implement semantic condition numbers—metrics that quantify how sensitive each reasoning step is to input uncertainty. Demmel (1997) showed how to compute these for numerical algorithms; we need equivalent measures for semantic reasoning.
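As a rough illustration of what such a metric could look like (purely a sketch; run_step and embed are hypothetical hooks standing in for whatever model call and embedding API you use, not an established interface), one could paraphrase a step's input and measure how disproportionately its output moves:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def semantic_condition_number(run_step, embed, prompt, paraphrases):
    """Average ratio of output movement to input movement over small
    paraphrases of the prompt, by analogy with a numerical condition number.
    run_step(text) -> text and embed(text) -> list[float] are hypothetical hooks."""
    base_in = embed(prompt)
    base_out = embed(run_step(prompt))
    ratios = []
    for p in paraphrases:                         # lightly reworded versions of the prompt
        d_in = cosine_distance(base_in, embed(p))
        d_out = cosine_distance(base_out, embed(run_step(p)))
        if d_in > 0:
            ratios.append(d_out / d_in)           # large ratio = ill-conditioned reasoning step
    return sum(ratios) / len(ratios) if ratios else float("inf")
```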
From Robotics: Deploy closed-loop verification systems. Instead of open-loop agent workflows, implement verification steps that validate outputs before proceeding. Siciliano and Khatib (2016) in "Springer Handbook of Robotics" provide extensive frameworks for fault-tolerant control that could be adapted to semantic reasoning.
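A minimal version of that loop might look like the following (again a sketch; propose, verify, and escalate are hypothetical callables, not an established framework):

```python
def closed_loop_step(propose, verify, escalate, task, max_retries=3):
    """Run one agent step with feedback: nothing propagates until it passes
    an independent check, and repeated failure hands off to a human."""
    feedback = None
    for _ in range(max_retries):
        output = propose(task, feedback)      # draft an answer for this step
        ok, feedback = verify(task, output)   # independent check of the draft
        if ok:
            return output                     # only verified output moves downstream
    return escalate(task)                     # graceful degradation: defer to a person
```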
From Information Theory: Develop semantic error-correcting codes. MacKay (2003) in "Information Theory, Inference, and Learning Algorithms" showed how redundancy can correct transmission errors. LLM agents need similar redundancy mechanisms for reasoning errors.
Specific Engineering Solutions:
- Confidence Tracking: Implement explicit uncertainty quantification at each step, following Gal and Ghahramani (2016) on Bayesian deep learning
- Redundant Reasoning: Use multiple independent reasoning paths and voting mechanisms, inspired by fault-tolerant computing principles from Pradhan (1996); a minimal voting sketch follows this list
- Semantic Checksums: Develop verification procedures that can detect reasoning errors, analogous to CRC checks in digital communication
- Graceful Degradation: Design systems that recognize when uncertainty exceeds acceptable bounds and hand off to human operators
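For the redundant-reasoning item above, a minimal sketch (mine; run_once is a hypothetical hook that samples one independent reasoning path) combines majority voting with a confidence floor and a human hand-off:

```python
from collections import Counter

def redundant_answer(run_once, task, n_paths=5, min_agreement=0.6):
    """Sample several independent reasoning paths and vote on the answer.
    If agreement falls below the floor, defer rather than guess."""
    answers = [run_once(task) for _ in range(n_paths)]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / n_paths               # crude confidence signal
    if agreement < min_agreement:
        return None, agreement                # graceful degradation: defer to a human
    return best, agreement
```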
The False Dichotomy: The current debate often presents a false choice between "LLM agents are magic and will solve everything" versus "LLM agents are hopeless and will always fail." The reality is that they're engineering systems subject to well-understood mathematical constraints. We can build reliable systems if we apply the same rigor that computer science and robotics developed over decades.
The Stakes Are Rising
As LLM agents move from demos to production systems handling financial transactions, medical decisions, and infrastructure management, the cost of compound errors climbs with every step we automate. We're not just building cool demos—we're building systems that could cause real harm when they fail.
The robotics community learned to design for reliability because physical failures were impossible to ignore. The AI community needs to learn the same lesson before our invisible failures become visible disasters.
Time to Build Better
Error propagation isn't an unsolvable problem—it's an engineering challenge that robotics has already addressed. The question is whether the AI community will learn from these lessons or repeat the same mistakes at scale.
The next time you see a demo of an LLM agent completing a complex multi-step task, ask the hard question: "What's the failure rate when this runs 1,000 times in production?"
Because in the world of compound errors, being impressive once means nothing if you're unreliable twice.