
The ReAct Loop Unpacked: Reasoning + Acting in Practice

A rigorous treatment of the Thought → Action → Observation cycle — how it works at the execution level, where it breaks in production, and which alternatives exist and why.

AgentEngineering Editorial · 14 min read

If you've built agents with tool use or worked through multi-agent orchestration, you've already used ReAct — almost certainly without calling it that. Every agent framework's default loop, every "I need to search for X, then check Y" trace you've seen in a run log, every structured Thought → Action → Observation block in a prompt template: that's ReAct.

The glossary definition — "interleave reasoning and acting" — is accurate but useless for production work. What you need is the execution-level model: exactly how the loop runs, precisely why it beats chain-of-thought, and specifically where it hits walls. Then you can make an informed choice about when to reach past it.

Important

This article assumes you understand tool calling and multi-agent coordination. If those are unfamiliar, start with Tool Use in LLM Agents first. ReAct's value only becomes clear in the context of what it replaced.


The Execution Model

ReAct (Yao et al., ICLR 2023) defines an agent loop with three step types interleaved in a single prompt context:

  • Thought — the model's internal monologue. Decomposes the goal, tracks progress, plans the next action.
  • Action — a structured call to a tool from a predefined set (e.g., Search[query], Lookup[keyword], Finish[answer]).
  • Observation — the tool's returned output, appended to the context verbatim.

The loop continues until the agent emits a terminal action (Finish) or hits a budget constraint. Here is a real trace from the HotpotQA experiments in the original paper:

Question: Aside from the Apple Remote, what other devices can control
the program Apple Remote was originally designed to interact with?

Thought 1: I need to search Apple Remote and find the program it was
originally designed to interact with.
Action 1: Search[Apple Remote]
Observation 1: The Apple Remote is a remote control device originally
designed to interact with the Front Row media center program...

Thought 2: The program is Front Row. I need to search Front Row and
find what other devices can control it.
Action 2: Search[Front Row (software)]
Observation 2: Front Row is controlled by the Apple Remote or the
keyboard function keys.

Thought 3: Front Row can be controlled by the Apple Remote or keyboard
function keys. So the answer is keyboard function keys.
Action 3: Finish[keyboard function keys]

Notice what the Thought steps are doing. They are not decorative. Thought 1 decomposes a two-hop question into a concrete first retrieval target. Thought 2 synthesizes what was just learned and formulates the second sub-query. Thought 3 draws the conclusion from both observations. This is the "grounding" mechanism: each thought is constrained by what the previous observations actually said, not by what the model believes from pretraining.
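
Mechanically, the loop is plain string assembly around an action parser. The sketch below shows the control flow under two assumptions: complete is a caller-supplied function that wraps your LLM client, and tools maps action names to Python callables. Neither name comes from the paper or from any framework.

```python
import re

ACTION_RE = re.compile(r"Action \d+: (\w+)\[(.*?)\]")

def react_loop(question, complete, tools, max_steps=8):
    """Minimal ReAct sketch. complete(prompt) -> str wraps the LLM;
    tools maps action names to callables. Both are caller-supplied."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        block = complete(context)              # emits "Thought N: ... Action N: Tool[arg]"
        context += block + "\n"
        match = ACTION_RE.search(block)
        if match is None:
            continue                           # unparseable step; let the model retry
        tool, arg = match.groups()
        if tool == "Finish":
            return arg                         # terminal action: the answer
        handler = tools.get(tool)
        result = handler(arg) if handler else f"Unknown tool: {tool}"
        context += f"Observation: {result}\n"  # ground the next thought in tool output
    return None                                # step budget exhausted without Finish
```

Everything (thoughts, actions, observations) accumulates in context. That one design choice is what grounds the reasoning, and, as the failure modes below show, it is also what limits it.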

Why This Works

The mechanism of improvement over pure Chain-of-Thought (CoT) is precise. CoT generates reasoning traces from parametric knowledge alone — knowledge baked into model weights during training. That knowledge can be wrong, outdated, or hallucinated. On HotpotQA, CoT-only agents failed due to hallucination in 56% of cases. ReAct reduced that to near-zero by inserting external, verified data as observations between reasoning steps.

The mechanism of improvement over Act-only (no reasoning traces) is equally precise. Without thoughts, the agent lacks a mechanism to synthesize information across multiple observations. It cannot explain to itself why a particular search failed and adjust. A pure action sequence is a reflex; ReAct is a deliberate loop.

Decision-making tasks make this concrete. In the ALFWorld household-simulation benchmark, one task requires the agent to find a mug, then locate a lamp and use it to examine the mug. The ReAct agent searched two cabinets before finding the mug, then navigated to a desk to find the lamp:

Thought 1: I need to find a mug. I'll check cabinet 1.
Action 1: go to cabinet 1
Observation 1: On cabinet 1, you see a plate and a spoon.

Thought 2: No mug here. I will try cabinet 2.
Action 2: go to cabinet 2
Observation 2: On cabinet 2, you see a mug 1.

Action 3: take mug 1 from cabinet 2

Thought 4: Now I have the mug. I need to find a lamp and turn it on.
Action 4: go to desk 1
Observation 4: On desk 1, you see a desk lamp 1.

Action 5: use desk lamp 1

Thought 2 is the critical step: the agent registers failure, updates its world model ("no mug in cabinet 1"), and plans a recovery. Act-only agents in this benchmark had no such mechanism — they achieved a 45% success rate. ReAct reached 71%, and it did so with just 2-shot prompting.

Empirical Baselines

The original paper evaluated on PaLM-540B across four benchmarks:

| Benchmark | ReAct | Best Baseline | Gap |
|---|---|---|---|
| HotpotQA (multi-hop QA) | 27.4 EM | 29.4 (CoT self-consistency) | −2.0 |
| FEVER (fact verification) | 60.9% | 56.3% (CoT) | +4.6 |
| ALFWorld (household sim) | 71% SR | 45% (Act-only) | +26 |
| WebShop (e-commerce nav) | 40% SR | 30.1% (Act-only) | +9.9 |

The HotpotQA deficit vs. CoT self-consistency is worth noting: for pure knowledge tasks with stable training-time facts, CoT with sampling can outperform ReAct. The advantage of ReAct is clearest where retrieval is mandatory (FEVER) and where environment interaction requires sequential adaptation (ALFWorld, WebShop).


Where ReAct Breaks

ReAct is a solid default. It is not a reliable foundation for every production deployment. The failure modes below are structural — they arise from the loop's design, not from model quality or prompt engineering.

1. Long-Horizon Drift

The ReAct loop appends every thought, action, and observation to the running context. On a 5-step task this is fine. On a 30-step task, you have accumulated thousands of tokens of intermediate history that the model must attend to on every new thought generation.

Two things happen. First, relevant information from early in the trace gets diluted by later content. Chroma Research (2025) measured this directly across 18 frontier models: GPT-4o accuracy dropped from 98.1% to 64.1% as the context window filled — "Context Rot." Second, the model's attention drifts toward recency. Early objectives, constraints, and tool results stop influencing generation as effectively as they should.

A compounding error rate of even 1% per step reaches ~60% success probability at step 50, ~36% at step 100. CORAL (2025) terms this “cognitive overload” — the agent’s working memory becomes cluttered with irrelevant intermediate reasoning, and planning coherence collapses. CMU’s TheAgentCompany benchmark (2025) found state-of-the-art agents failing ~70% of multi-step office simulation tasks, largely due to this drift.
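
The arithmetic behind those figures is plain geometric decay of per-step reliability:

```python
# Success over n independent steps at per-step reliability p is p ** n.
p = 0.99
for n in (10, 50, 100):
    print(f"{n} steps: {p ** n:.1%}")   # 10: 90.4%, 50: 60.5%, 100: 36.6%
```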

2. Irreversible Actions

The ReAct loop has no native concept of action reversibility. When the next action in the loop is send_email, delete_file, or execute_payment, the model generates it the same way it generates Search[query]. The loop does not pause to assess downstream consequences. There is no built-in rollback.

This is not hypothetical. Production incident reports include agents that deleted production databases, posted sensitive data to public forums, and executed financial transactions based on misunderstood commands. In each case, the action was irreversible and the exposure window was measured in minutes to hours.

The standard mitigation — gating high-stakes actions behind human approval — breaks the autonomous loop. That's a deliberate choice, but it must be made explicitly at design time, not discovered after an incident.
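
If you do gate, the gate belongs in the dispatch layer, not in the prompt. A minimal sketch, assuming tools are tagged as irreversible at registration time; the tag set and the confirm hook are illustrative, not from any framework:

```python
IRREVERSIBLE = {"send_email", "delete_file", "execute_payment"}

def confirm(tool, arg):
    """Illustrative approval hook: block the loop until an operator decides."""
    return input(f"Agent requests {tool}[{arg}]. Approve? [y/N] ").strip().lower() == "y"

def gated_dispatch(tool, arg, tools):
    if tool in IRREVERSIBLE and not confirm(tool, arg):
        # Return the denial as an observation so the agent replans instead of crashing.
        return f"Action {tool} was denied by a human operator."
    return tools[tool](arg)
```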

3. Thought-Action Divergence

The Thought token and the Action token are generated sequentially by the same model in the same pass. Ideally, the thought causally produces the action. In practice, fine-tuning on ReAct-formatted data can create a structural bias: the model generates the action it "expects" to come next in the pattern, regardless of what the thought actually concluded.

The observable symptom is a model that reasons correctly — "I should not delete this file" — and then generates delete_file as the next action. The format pulls the generation. Research on prompt injection (2025) found that once a malicious action is injected into an agent's thought process, the agent executes it 95% of the time without re-evaluating — it has learned to follow the format.

4. Error Recovery Limits

When a tool call fails — wrong parameters, rate limit, network error — the standard ReAct loop handles it via the Observation: the error message gets appended and the agent generates a new Thought. For transient failures this often works. For structural failures (hallucinated tool names, non-existent API endpoints), it does not.

A 2026 analysis found that 90.8% of retries in standard ReAct agents were spent on errors that could never succeed. The loop kept iterating until the token budget was exhausted. ReAct's reactive correction mechanism handles the cases where correction is possible; it has no way to classify the cases where it is not.
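
The missing piece is a triage step before any retry, implemented outside the model. A sketch under assumed error categories (the substring matching is deliberately crude; a real system would use typed exceptions):

```python
TRANSIENT = ("rate limit", "timeout", "connection reset")       # retrying can help
STRUCTURAL = ("unknown tool", "no such endpoint", "not found")  # retrying never helps

def triage(error_msg, retries_left):
    msg = error_msg.lower()
    if any(marker in msg for marker in STRUCTURAL):
        return "abort"    # no token budget fixes a tool that does not exist
    if any(marker in msg for marker in TRANSIENT) and retries_left > 0:
        return "retry"    # backoff elided for brevity
    return "replan"       # append the error as an Observation; let the agent adjust
```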


Alternatives Positioned by What They Fix

These are not replacements for ReAct. They are architectural responses to specific failure modes. You should understand which failure mode you are actually hitting before reaching for an alternative.

Reflexion: Self-Critique Across Episodes

Reflexion (Shinn et al., 2023) extends ReAct by adding a meta-loop around the intra-episode loop. When a full task attempt fails, a Self-Reflection model analyzes the failed trajectory and generates a verbal summary of what went wrong. This summary is stored in an episodic memory buffer and prepended to the next attempt.

The architecture has three components:

  • Actor — a standard ReAct agent
  • Evaluator — a success/failure signal (heuristic, unit tests, or LLM judge)
  • Self-Reflection model — an LLM that generates natural-language post-mortems

| Feature | ReAct | Reflexion |
|---|---|---|
| Loop type | Intra-episode: corrects the next step | Inter-episode: corrects the next attempt |
| Memory scope | Current trace only | Accumulated reflections across failures |
| Correction granularity | Local (last observation → next action) | Global (full trajectory → strategy revision) |

What it fixes: Accuracy-critical tasks where one failed attempt provides enough signal for a meaningful strategy change. On HumanEval (coding), Reflexion with GPT-4 achieved 91% pass@1 vs. 80% for the base model. On ALFWorld it improved on the ReAct baseline by 22 points (absolute).

What it does not fix: Context rot within a single episode. It also requires an accurate evaluator — if the evaluator produces false signals, the reflections are built on wrong premises. And self-critique from a weak model can actively hurt a stronger actor.

Cost profile: Token usage compounds with each failed attempt. A three-attempt Reflexion run can cost 3–5x a single ReAct run. For coding tasks the unit-test evaluator is free; for open-ended tasks the LLM-judge evaluator adds cost.
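
The meta-loop itself is small. A sketch with caller-supplied components matching the three roles above; the signatures are hypothetical, not the paper's reference implementation:

```python
def reflexion(task, actor, evaluator, reflect, max_attempts=3):
    """actor(task, reflections) -> trajectory; evaluator(trajectory) -> bool;
    reflect(trajectory) -> str. Each failed episode becomes a verbal lesson."""
    reflections = []                               # episodic memory buffer
    for _ in range(max_attempts):
        trajectory = actor(task, reflections)      # a full ReAct episode runs inside
        if evaluator(trajectory):                  # e.g. unit tests for coding tasks
            return trajectory
        reflections.append(reflect(trajectory))    # natural-language post-mortem
    return None                                    # every attempt failed
```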

Tree-of-Thought: Search Over Reasoning Paths

ToT (Yao et al., 2023) replaces the linear Thought → Action chain with a search tree. At each step the model generates multiple candidate thoughts (branches), a State Evaluator rates each branch's promise, and a search algorithm (BFS or DFS) selects which branches to expand.

The State Evaluator is the key structural addition. Instead of greedily following the most recently generated thought, ToT can prune "Impossible" branches early and backtrack from dead ends — something ReAct's linear loop cannot do.
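
The breadth-first variant amounts to beam search over partial reasoning states. A sketch assuming two LLM-backed helpers, propose(state) returning candidate next thoughts and score(state) returning the evaluator's rating; both names are illustrative:

```python
def tot_bfs(problem, propose, score, breadth=5, depth=3):
    """Beam search over reasoning states; propose() and score() wrap LLM calls."""
    frontier = [problem]
    for _ in range(depth):
        # Branch: every frontier state spawns several candidate next thoughts.
        candidates = [state + "\n" + thought
                      for state in frontier
                      for thought in propose(state)]
        # Prune: keep only the most promising branches; dead ends fall away.
        frontier = sorted(candidates, key=score, reverse=True)[:breadth]
    return frontier[0]   # highest-rated complete reasoning path
```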

When it outperforms ReAct: Tasks with combinatorial structure where greedy choices commit to wrong paths. On Game of 24 (use four numbers to reach 24), standard CoT achieves 4% success; ToT achieves 74%. On 5×5 mini crosswords, CoT approaches 0% on full solutions; ToT solves ~60% of letters. For knowledge-retrieval tasks, ToT's search overhead outweighs any benefit.

Cost profile: Multiple LLM calls per reasoning step. A ToT run is typically 5–10x more expensive in tokens and latency than an equivalent ReAct run. Graph-of-Thought (Besta et al., 2023) extends the tree into a directed acyclic graph, allowing merging of branches — 62% better sorting quality than ToT at 31% lower cost through node reuse.

The LATS hybrid (Zhou et al., 2023) combines ToT-style MCTS with ReAct's tool use, using tool call results as node evaluation signals. For tasks that require both search over reasoning paths and real environment interaction, LATS is the current state of the art, at the cost of significant implementation complexity.

Plan-and-Execute: Decouple Planning from Action

Plan-and-Execute architectures solve the token bloat problem structurally by separating the planning LLM call from the execution calls. A Planner generates a complete task graph upfront. Workers (often smaller, cheaper models) execute each step against only the sub-task context — not the full history.

ReWOO (Xu et al., 2023) makes this explicit: the Planner generates a blueprint with placeholders (#E1, #E2) for future tool outputs. The Worker executes all tool calls. The Solver receives the completed plan to generate the final answer. The LLM reasons exactly once; it does not re-attend to the full history on every step.
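
That reduces the Worker to placeholder substitution. A sketch of the idea, with a plan format that mirrors the paper's #E-notation; this is not ReWOO's actual code:

```python
def execute_plan(plan, tools):
    """plan: list of (slot, tool, arg) tuples; args may reference earlier slots.
    Runs every tool call with no further LLM reasoning in between."""
    evidence = {}
    for slot, tool, arg in plan:
        for ref, value in evidence.items():
            arg = arg.replace(ref, value)    # splice earlier outputs into this arg
        evidence[slot] = tools[tool](arg)
    return evidence                           # handed to the Solver in one final call

# An illustrative blueprint a Planner might emit:
plan = [
    ("#E1", "Search", "Apple Remote"),
    ("#E2", "Search", "devices that control the program named in #E1"),
]
```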

LLMCompiler (Kim et al., 2024) extends this to parallel execution: it generates a DAG of tasks, identifies independent nodes, and dispatches them to an executor simultaneously. On movie recommendation tasks requiring multiple independent database lookups, LLMCompiler showed 3.74x speedup and 6.73x cost reduction versus ReAct.
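
The scheduling idea is ordinary DAG execution. A sketch using a thread pool; the task format is illustrative, not LLMCompiler's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(dag):
    """dag: {node: (fn, [deps])}; fn receives a dict of dependency results.
    Assumes an acyclic graph. Each wave dispatches every node whose
    dependencies are satisfied, so independent lookups overlap in time."""
    done = {}
    with ThreadPoolExecutor() as pool:
        pending = dict(dag)
        while pending:
            ready = [node for node, (_, deps) in pending.items()
                     if all(d in done for d in deps)]
            futures = {node: pool.submit(pending[node][0],
                                         {d: done[d] for d in pending[node][1]})
                       for node in ready}
            for node in ready:
                del pending[node]
                done[node] = futures[node].result()
    return done
```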

| Metric | ReAct | ReWOO | LLMCompiler |
|---|---|---|---|
| Token usage | 1× (baseline) | ~5× reduction | Up to 6.7× reduction |
| Latency | Sequential | Parallelizable | Up to 3.7× speedup |
| Accuracy (HotpotQA) | Baseline | +4.4% | Up to +9% |

What it fixes: Long-horizon token accumulation and sequential bottlenecks. Zup's internal coding agent moved from ReAct to Plan-and-Execute and increased accuracy on multi-file edits from 40% to 75%, because isolating file context in sub-tasks eliminated cross-contamination from accumulated observations.

What it does not fix: Tasks where the next step genuinely depends on what the previous step returned. A plan generated without observations can be structurally wrong. Reflexion's inter-episode correction is better suited for iterative refinement; Plan-and-Execute is better suited for stable, repeatable pipelines.


Choosing in Practice

The decision is not about which pattern is "better" — it's about which failure mode you're actually encountering.

| Your situation | Architecture |
|---|---|
| Short task (< 5 steps), dynamic next steps, reversible actions | ReAct — start here |
| Task keeps drifting after step 10, context fills up | Plan-and-Execute (ReWOO / LLMCompiler) |
| Multi-attempt iterative task, good success signal available | Reflexion |
| Combinatorial search, strategy requires lookahead + backtracking | Tree-of-Thought (or LATS with tools) |
| High-stakes irreversible actions involved | Any architecture + gated HITL approval |

A few practical constraints that the decision matrix above doesn't capture:

Latency. ReAct returns a first response immediately. ToT and Reflexion require multiple LLM calls before producing an answer. If users are waiting, that matters.

Evaluator availability. Reflexion requires a reliable success signal. Unit tests and exact-match graders are good evaluators. "Did the user seem satisfied" is not.

Scope. The most common production pattern in 2025 is the agent-gated workflow: use deterministic code for 90% of the logic and invoke a ReAct agent only for the ambiguous sub-tasks where the path is genuinely unknown. This is more reliable than any pure-agent architecture because the deterministic wrapper constrains the search space, as the sketch below shows.
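
In code, the wrapper is ordinary routing logic with the agent as one branch. Every name here (handle_deterministically, run_react, SAFE_TOOLS) is hypothetical:

```python
KNOWN_CATEGORIES = {"refund", "password_reset"}     # covered by deterministic logic

def process_ticket(ticket):
    if ticket["category"] in KNOWN_CATEGORIES:
        return handle_deterministically(ticket)     # hypothetical: templates + rules
    # Only genuinely ambiguous tickets reach the agent, and the wrapper fixes
    # its tool set and step budget up front.
    return run_react(ticket["body"], tools=SAFE_TOOLS, max_steps=6)  # hypothetical runner
```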

Reasoning models change the equation. Models like o1 and o3 internalize the CoT loop. They function as implicit Plan-and-Execute systems — the reasoning trace is hidden, but the model is effectively planning before acting. For tasks where these models are available and cost-acceptable, they shift the tradeoff: you get long-horizon stability without implementing a separate planner.


What to Take Away

ReAct's core contribution is the mechanism that grounds reasoning in environmental feedback. Every time you see a thought that references what an observation actually said, that's the mechanism working. It is the single most important pattern in production agents precisely because it solves hallucination-in-reasoning without requiring retrieval infrastructure or multi-attempt loops.

The failure modes — long-horizon drift, irreversible actions, thought-action divergence, brittle error recovery — are structural. They won't be patched by a better prompt. Recognizing them early determines whether you reach for Reflexion, Plan-and-Execute, or ToT because the alternative fits your specific constraint, not because it's newer or has a better benchmark number.

The next piece covers the craft layer that sits on top of any of these patterns: system prompts that make agent behavior consistent across tasks and across iterations.


Cite this article

@article{agentengineering2026,
  title   = {The ReAct Loop Unpacked: Reasoning + Acting in Practice},
  author  = {AgentEngineering Editorial},
  journal = {AgentEngineering},
  year    = {2026},
  url     = {https://agentengineering.io/topics/articles/react-loop-unpacked}
}
