Reasoning & Planning
When building autonomous AI Agents, if the large language model (LLM) is the Agent's brain and tool use is its hands and feet, then Reasoning & Planning is the core engine that upgrades it from a simple Q&A machine to an autonomous problem solver.
Complex real-world tasks often cannot be completed in a single generation pass. AI needs the ability to decompose goals, perform logical reasoning, explore paths, self-correct, and orchestrate tools.
The following are the most widely used reasoning and planning frameworks in the industry today.
Chain of Thought (CoT)
Step-by-Step Reasoning Capability
Without explicit reasoning, an LLM tends to answer intuitively, in a single pass.
The core idea of Chain of Thought (CoT) is to have the model explicitly write out its intermediate reasoning steps before giving the final answer, either by showing worked examples (few-shot) or by appending a trigger phrase such as "Let's think step by step" (zero-shot). This can substantially improve performance on multi-step math, logical reasoning, and commonsense QA.
CoT gives the model more computation per answer (each generated token is another forward pass) and lets later tokens condition on the intermediate results already written out. The intermediate steps are not guaranteed to be correct, however, so an early mistake can propagate to the final answer.
Example: Few-shot CoT Prompt Design
# Provide worked examples that include the reasoning process, so the model imitates CoT reasoning
prompt = """
Question: Roger has 5 tennis balls. He then bought 2 cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer: Roger initially had 5 tennis balls. 2 cans of tennis balls, each with 3, totaling 2 * 3 = 6 tennis balls. 5 + 6 = 11. The answer is 11.
Question: The cafeteria has 23 apples. If they use 20 for lunch and buy 6 more, how many apples do they have now?
Answer: The cafeteria originally had 23 apples. After using 20, they have 23 - 20 = 3 apples left. They then bought 6 more, so they now have 3 + 6 = 9 apples. The answer is 9.
Question: {user_question}
Answer:"""
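The template above still has to be filled in and the final number pulled out of the model's reply. The following is a minimal sketch of that plumbing, assuming the reply ends with the "The answer is N" convention used in the examples. The names `FEW_SHOT_TEMPLATE`, `build_prompt`, and `extract_answer` are hypothetical helpers introduced for illustration, and the template is shortened to one worked example.

```python
import re
from typing import Optional

# Shortened, hypothetical version of the few-shot template (one worked example).
FEW_SHOT_TEMPLATE = (
    "Question: The cafeteria has 23 apples. If they use 20 for lunch and buy 6 more, "
    "how many apples do they have now?\n"
    "Answer: 23 - 20 = 3. 3 + 6 = 9. The answer is 9.\n"
    "Question: {user_question}\n"
    "Answer:"
)


def build_prompt(user_question: str) -> str:
    """Insert the user's question into the few-shot CoT template."""
    return FEW_SHOT_TEMPLATE.format(user_question=user_question)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number that follows 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a deliberate choice here: the reasoning text may restate intermediate values, and the final statement is the one that counts.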
ReAct Framework (Reasoning + Acting)
Reasoning + Action Loop
If CoT is just the model working internally behind closed doors, then ReAct (Reason + Act) lets the model open its eyes to the world. It interweaves internal logical reasoning (Thought) with external tool interaction (Action), forming a dynamic closed-loop feedback system.
Under the ReAct paradigm, the Agent follows the Thought -> Action -> Observation loop until it reaches a final conclusion.
Limitations: ReAct excels at short tasks with clear steps. However, because the entire history of thoughts, actions, and observations accumulates in the same context window, a long task chain can push the Agent into an infinite loop or make it lose track of the original goal as the context overflows.
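The Thought -> Action -> Observation loop can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: `scripted_llm` is a hypothetical stand-in for a real model call, `calculator` is a toy tool, and the `Action: tool[argument]` text format is an assumed convention. The `max_steps` cap is one simple guard against the looping problem described above.

```python
import operator
import re
from typing import Callable, Dict, Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def calculator(expression: str) -> str:
    """Toy tool: evaluates 'a op b' for integer a and b."""
    a, op, b = expression.split()
    return str(OPS[op](int(a), int(b)))


def scripted_llm(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: picks the next step from the transcript so far."""
    seen = transcript.count("Observation:")
    if seen == 0:
        return "Thought: Two cans of three balls each.\nAction: calculator[2 * 3]"
    if seen == 1:
        return "Thought: Add the original five balls.\nAction: calculator[5 + 6]"
    return "Thought: I have the total.\nFinal Answer: 11"


def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> Optional[str]:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                      # Thought (+ Action or Final Answer)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if match is None:
            transcript += "Observation: could not parse an action.\n"
            continue
        name, argument = match.groups()
        tool = tools.get(name)
        observation = tool(argument) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"  # fed back into the next Thought
    return None  # step budget exhausted without a final answer
```

Calling `react_loop("How many tennis balls does Roger have?", scripted_llm, {"calculator": calculator})` walks through two tool calls before the scripted model emits its final answer.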
Plan-and-Execute (Plan-First Execution Mode)
To address ReAct's weakness in long-horizon tasks, Plan-and-Execute decouples thinking and action, adopting a strategy similar to how humans approach large projects: first create a schedule, then tackle tasks one by one.
The system typically consists of two independent roles:
- Planner: Responsible for receiving the big goal and generating a detailed step-by-step list of sub-tasks.
- Executor: Responsible for executing these sub-tasks in sequence. The executor is usually a small ReAct Agent, focusing on completing only one small goal at a time.
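The split between the two roles can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_planner` and `toy_executor` are hypothetical stubs, where a real Planner would prompt an LLM for the step list and a real Executor would run a small ReAct Agent for each sub-task.

```python
from typing import Callable, List


def toy_planner(goal: str) -> List[str]:
    """Hypothetical planner stub: returns a fixed three-step plan for any goal."""
    return [f"research: {goal}", f"draft: {goal}", f"review: {goal}"]


def toy_executor(sub_task: str, previous_results: List[str]) -> str:
    """Hypothetical executor stub: pretends to complete one sub-task."""
    return f"done({sub_task})"


def plan_and_execute(goal: str,
                     planner: Callable[[str], List[str]],
                     executor: Callable[[str, List[str]], str]) -> List[str]:
    plan = planner(goal)  # planning happens once, up front
    results: List[str] = []
    for sub_task in plan:
        # The executor sees only the current sub-task and earlier results,
        # not the planner's full reasoning history.
        results.append(executor(sub_task, list(results)))
    return results
```

Passing each sub-task its own small context is what keeps the Executor's window short even when the overall plan is long.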
Tree of Thoughts (ToT)
Tree-based Multi-path Exploration
CoT and Plan-and-Execute both explore an essentially linear path. But when writing code, solving math problems, or doing creative writing, humans often sketch several candidate approaches, evaluate them, pick the best, and backtrack when they discover an error.
ToT (Tree of Thoughts) models the reasoning process as a tree in which each node is an intermediate thought state. At each branch point, the model generates multiple candidate Thoughts, then uses an internal Evaluator to score these nodes (e.g., feasible, potentially risky, infeasible). Combined with BFS (Breadth-First Search) or DFS (Depth-First Search), it decides whether to go deeper or to backtrack and retry.
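A toy version of the BFS variant helps make the propose/evaluate/prune cycle concrete. In this sketch the "thoughts" are arithmetic moves (+3 or *2) toward a target number, and both `propose` and `evaluate` are hypothetical stand-ins for what would be LLM calls in a real system. Keeping only the best `beam_width` nodes per level is an assumption of this sketch, not the only way to prune.

```python
from typing import List, Optional, Tuple

Path = Tuple[int, ...]  # values visited so far; the last entry is the current state


def propose(path: Path) -> List[Path]:
    """Generate candidate next thoughts (stand-in for an LLM proposal step)."""
    current = path[-1]
    return [path + (current + 3,), path + (current * 2,)]


def evaluate(path: Path, target: int) -> float:
    """Score a node (stand-in for an LLM evaluator); -inf marks it infeasible."""
    current = path[-1]
    if current > target:
        return float("-inf")
    return -float(target - current)  # closer to the target scores higher


def tot_bfs(start: int, target: int, beam_width: int = 2, max_depth: int = 6) -> Optional[Path]:
    frontier: List[Path] = [(start,)]
    for _ in range(max_depth):
        candidates = [child for path in frontier for child in propose(path)]
        for path in candidates:
            if path[-1] == target:
                return path
        scored = [(evaluate(p, target), p) for p in candidates]
        feasible = [sp for sp in scored if sp[0] != float("-inf")]
        feasible.sort(key=lambda sp: sp[0], reverse=True)
        frontier = [p for _, p in feasible[:beam_width]]  # prune to the best nodes
        if not frontier:
            return None
    return None
```

With `beam_width=2`, `tot_bfs(1, 14)` returns `(1, 4, 7, 14)`. With `beam_width=1` the search greedily follows the higher-scoring node and takes the longer route `(1, 4, 8, 11, 14)`, which illustrates why keeping several branches alive can pay off.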
Task Planning & MCTS (Monte Carlo Tree Search)
Complex Task Decomposition and Search
When the task involves strategic games or extremely difficult reasoning (such as cutting-edge mathematical verification or refactoring a complex code repository), plain ToT is still not efficient enough. The industry has begun combining LLMs with Monte Carlo Tree Search (MCTS), the search algorithm at the core of AlphaGo, which balances exploring new branches against exploiting promising ones.
- LLM as Policy Network: Provides heuristic suggestions for next steps, reducing meaningless branch expansion.
- LLM/Code Environment as Value Network: Estimates, typically through simulated rollouts, the final win rate or success probability of an action sequence.
- Advantage: In a huge solution space, it concentrates search effort on the branches most likely to lead to a globally good plan.
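The four MCTS phases (selection, expansion, simulation, backpropagation) can be sketched on a toy puzzle: reach `TARGET` from a starting number using +3 or *2. This is a sketch under stated assumptions, not an LLM integration. The fixed `ACTIONS` table stands in for the policy role (an LLM proposing next steps), random rollouts stand in for the value role, and every name here is hypothetical.

```python
import math
import random

TARGET = 14
ACTIONS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}


def is_terminal(state: int) -> bool:
    return state >= TARGET


def reward(state: int) -> float:
    return 1.0 if state == TARGET else 0.0


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rollout rewards seen through this node

    def untried_actions(self):
        if is_terminal(self.state):
            return []
        tried = {child.action for child in self.children}
        return [a for a in ACTIONS if a not in tried]


def uct_child(node: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing mean reward plus an exploration bonus (UCT)."""
    return max(node.children, key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))


def rollout(state: int, rng: random.Random) -> float:
    """Play random moves to the end and return the final reward."""
    while not is_terminal(state):
        state = ACTIONS[rng.choice(list(ACTIONS))](state)
    return reward(state)


def mcts(root_state: int, iterations: int = 500, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while not node.untried_actions() and node.children:   # 1. selection
            node = uct_child(node)
        untried = node.untried_actions()
        if untried:                                            # 2. expansion
            action = rng.choice(untried)
            child = Node(ACTIONS[action](node.state), parent=node, action=action)
            node.children.append(child)
            node = child
        result = rollout(node.state, rng)                      # 3. simulation
        while node is not None:                                # 4. backpropagation
            node.visits += 1
            node.value += result
            node = node.parent
    return root


def best_action(root: Node) -> str:
    """Recommend the most-visited first move."""
    return max(root.children, key=lambda ch: ch.visits).action
```

A call such as `best_action(mcts(1))` returns the first move the search visited most often. Choosing by visit count rather than mean value is a common convention because it is less sensitive to a few lucky rollouts.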
Reflexion: Self-Reflection and Error Correction
When humans execute tasks, if they fail the first time, they summarize lessons learned and avoid mistakes in the next attempt.
The Reflexion framework endows the Agent with similar capabilities.
In the Reflexion loop, when the Agent's output is judged a failure (e.g., failed test cases, an API error), a Reviewer mechanism is triggered.
The LLM is asked to write a natural-language (verbal) reflection based on its past actions and the failure feedback, for example: "I used the wrong API parameter format just now; next time I should read the documentation before passing JSON". This reflection is stored in Episodic Memory and supplied as a contextual hint for the next attempt, which improves the Agent's ability to recover from its own mistakes.
Example: Reflexion's Reflection Prompt Design
reflection_prompt = """
You are an AI assistant trying to write a Python crawler.
This is the code you just executed: {previous_code}
This is the error message returned by the runtime environment: {error_traceback}
Please reflect deeply:
1. What is the root cause of the error?
2. What is your specific modification strategy in the next attempt?
Please record the reflection to guide subsequent actions.
"""
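The surrounding loop (attempt, evaluate, reflect, retry) can be sketched as below. This is an illustrative sketch with hypothetical stubs: `scripted_actor` stands in for the LLM writing code, `scripted_reflector` stands in for the LLM answering a reflection prompt like the one above, and `run_tests` is a toy evaluator that uses `exec`, which is only acceptable for trusted toy code.

```python
from typing import Callable, List, Optional, Tuple


def run_tests(code: str) -> Tuple[bool, str]:
    """Toy evaluator: runs the code and checks add(2, 3) == 5. Trusted input only."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) did not return 5"
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def scripted_actor(task: str, reflections: List[str]) -> str:
    """Hypothetical LLM stand-in: fixes the bug only once a reflection is in memory."""
    if reflections:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def scripted_reflector(code: str, feedback: str) -> str:
    """Hypothetical LLM stand-in: turns failure feedback into a short lesson."""
    return f"Previous attempt failed with '{feedback}'. Next time, re-check the operator."


def reflexion_loop(task: str, actor: Callable, evaluator: Callable, reflector: Callable,
                   max_trials: int = 3) -> Tuple[Optional[str], List[str]]:
    memory: List[str] = []  # episodic memory of reflections across trials
    for _ in range(max_trials):
        code = actor(task, memory)
        ok, feedback = evaluator(code)
        if ok:
            return code, memory
        memory.append(reflector(code, feedback))  # store the lesson for the next trial
    return None, memory
```

Running `reflexion_loop("write add(a, b)", scripted_actor, run_tests, scripted_reflector)` fails once, records one reflection, and succeeds on the second trial.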
Task Decomposition Strategies and Engineering Practices
In production-grade AI Agent development, relying purely on an LLM's zero-shot output for complex planning is unstable. Commonly used hybrid intervention strategies include:
| Intervention Strategy | Core Approach | Applicable Scenarios |
|---|---|---|
| Sub-task Templating (SOP) | Instead of letting the LLM plan freely, pre-define Standard Operating Procedures (SOP) and let the LLM move through a fixed State Machine. | Customer service systems, standardized data cleaning pipelines. |
| HITL (Human-in-the-Loop) | After the Planner generates the task list, pause execution and require a human user to confirm, modify, or approve it before handing it to the Executor. | High-risk operations: deleting database records, sending mass emails, large fund transfers. |
| RLHF-Guided Planning | Use reinforcement learning and human preference feedback to fine-tune the model's planning capability, making it more inclined to generate safe, efficient step combinations. | Training phase of the underlying base LLM (for example, OpenAI reports training its o1 model with large-scale reinforcement learning to improve its reasoning). |
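The first two strategies in the table combine naturally: a fixed state machine restricts which steps the LLM may choose, and a human approval gate guards the risky ones. The sketch below is illustrative only. The `SOP` table, the `HIGH_RISK` set, and the customer-service states are hypothetical, `choose_next` stands in for an LLM decision, and `approve` stands in for a human reviewer.

```python
from typing import Callable, Dict, List

# Hypothetical SOP: each state lists the only transitions the LLM may pick from.
SOP: Dict[str, List[str]] = {
    "greet": ["collect_order_id"],
    "collect_order_id": ["lookup_order"],
    "lookup_order": ["offer_refund", "escalate"],
    "offer_refund": ["close"],
    "escalate": ["close"],
    "close": [],
}
HIGH_RISK = {"offer_refund"}  # states that require human approval before entry


def run_sop(choose_next: Callable[[str, List[str]], str],
            approve: Callable[[str], bool],
            start: str = "greet") -> List[str]:
    state, trace = start, [start]
    while SOP[state]:
        proposed = choose_next(state, SOP[state])
        if proposed not in SOP[state]:
            # The LLM tried to leave the procedure; reject instead of improvising.
            raise ValueError(f"{proposed!r} is not allowed from {state!r}")
        if proposed in HIGH_RISK and not approve(proposed):
            trace.append("halted_by_human")
            return trace
        state = proposed
        trace.append(state)
    return trace
```

For example, `run_sop(lambda state, allowed: allowed[0], lambda step: True)` walks the refund path to `close`, while an `approve` callback that returns `False` stops the run before the refund is offered.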
Framework Comparison Summary
| Pattern | Core Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| CoT | Step-by-step linear reasoning | Extremely simple to implement, significantly improves basic reasoning accuracy | Cannot call external tools, commits to a single reasoning path with no way to recover from an early mistake |
| ReAct | Alternating thinking and action loop | Dynamically adapts to environment, can adjust in real-time through observation | Context easily explodes with accumulated steps, loses original intent |
| Plan-and-Execute | First decompose into sub-tasks, then execute in isolation | Extremely suitable for long-horizon complex tasks, clear context | Not flexible enough when facing unexpected changes (when planning itself is wrong) |
| ToT / MCTS | Tree search, evaluation and backtracking | Can solve the most difficult complex logic problems | Extremely high computational cost; token consumption can grow exponentially with search depth unless branches are pruned |
| Reflexion | Generate reflection memory based on failure feedback | Has self-correction and continuous evolution capability | Depends on clear feedback signals (like code compiler errors) |