- Authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
- Source: Excerpts from “2402.01030.pdf”
Summary
This briefing document summarizes the key findings and contributions of the paper “Executable Code Actions Elicit Better LLM Agents.” The paper introduces CodeAct, a novel approach that consolidates Large Language Model (LLM) agent actions into a unified action space using executable Python code. By integrating LLMs with a Python interpreter, CodeAct allows for dynamic action revision and the emission of new actions based on real-time feedback from the environment. The authors demonstrate through extensive experimentation on existing and newly curated benchmarks that CodeAct significantly outperforms traditional action formats (JSON or text), leading to higher success rates in complex, multi-turn tasks. Furthermore, the paper presents CodeActInstruct, an instruction-tuning dataset designed to enhance open-source LLMs’ ability to interact with environments through executable code. The fine-tuned models, CodeActAgent (based on Llama2 and Mistral), showcase improved performance in agent-oriented tasks while maintaining general capabilities, and exhibit autonomous self-debugging.
Main Themes and Important Ideas
Limitations of Traditional LLM Agent Action Formats
The paper highlights that current LLM agents typically use pre-defined formats like JSON or text to specify actions (a JSON-style call is contrasted with a code action in the sketch after this list). These formats suffer from:
- Constrained Action Space: Limited by the scope of pre-defined tools.
- Restricted Flexibility: Inability to easily compose multiple tools for complex tasks.
- Lack of Dynamic Adaptation: Difficulty in revising actions based on environmental feedback in multi-turn interactions.
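To make the contrast concrete, the sketch below compares a single JSON-style tool call with a CodeAct-style action that composes several steps in one turn. The tool names (search_flights, book_flight) and the JSON schema are hypothetical illustrations, not tools from the paper or its benchmarks.

```python
import json

# A JSON-style action: one call to one pre-defined tool with a fixed schema;
# there is no way to branch on the result or chain another tool in the same turn.
json_action = json.dumps({
    "tool": "search_flights",                      # hypothetical tool name
    "arguments": {"origin": "SFO", "dest": "JFK"}  # hypothetical arguments
})

# A CodeAct-style action: ordinary Python that can call several tools,
# use control flow, and decide what to do based on intermediate results.
code_action = """
flights = search_flights(origin="SFO", dest="JFK")   # hypothetical tool
cheapest = min(flights, key=lambda f: f["price"])
if cheapest["price"] < 300:
    book_flight(cheapest["id"])                       # hypothetical tool
else:
    print(f"Cheapest option costs {cheapest['price']} USD; asking the user first.")
"""
```

Because the code action is ordinary Python, the agent can chain tools, branch on intermediate results, and reuse standard library functions without any change to the action format.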
Introduction of CodeAct: A Unified Action Space
To overcome these limitations, the authors propose CodeAct:
- Unified Action Space: Employs Python code as the single format for all agent-environment interactions.
- Executable Actions: Integrates a Python interpreter, allowing LLMs to execute generated code.
- Dynamic Interaction: Enables agents to receive code execution outputs (results, errors) as observations, leading to potential action revisions or new action generation in subsequent turns.
- Interpretability: Actions are expressed as interpretable Python code.
The paper provides an illustrative example where a user asks for the sum of the reciprocals of the roots of a quadratic equation. The LLM utilizes CodeAct to generate and execute Python code using the sympy library to solve the problem:
import sympy
x = sympy.Symbol('x')
roots = sympy.solve(x**2 - 13*x + 4)
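# By Vieta's formulas, 1/r1 + 1/r2 = (r1 + r2) / (r1 * r2) = 13/4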
print(1/roots[0] + 1/roots[1])
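As a minimal sketch of the execute-and-observe loop this relies on (not the paper's implementation), the snippet below runs each code action in a persistent namespace, so variables defined in one turn stay available in later turns, and returns captured stdout or the traceback as the observation. Function and variable names are illustrative.

```python
import io
import traceback
from contextlib import redirect_stdout

# Persistent namespace shared across turns, so a later code action can
# reuse variables (e.g. `roots`) defined by an earlier one.
namespace: dict = {}

def run_code_action(code: str) -> str:
    """Execute one code action and return the observation string."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)          # run the agent's code action
        output = buffer.getvalue()
        return output if output else "[No output]"
    except Exception:
        # Return the error text so the agent can revise its next action.
        return buffer.getvalue() + traceback.format_exc()

# Example: the sympy action above, issued as a single turn.
observation = run_code_action(
    "import sympy\n"
    "x = sympy.Symbol('x')\n"
    "roots = sympy.solve(x**2 - 13*x + 4)\n"
    "print(1/roots[0] + 1/roots[1])\n"
)
print(observation)  # the captured output is fed back to the LLM as the observation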
Performance Advantages of CodeAct
Extensive evaluations on API-Bank and a newly created benchmark, M3ToolEval, demonstrate the superiority of CodeAct:
- Higher Success Rates: CodeAct achieved up to 20% higher success rates compared to JSON and text-based action formats across 17 LLMs.
- Effectiveness on Complex Tasks: M3ToolEval, designed to evaluate complex multi-tool composition, showed that CodeAct enables LLMs to “get more done with fewer interactions” (Section 2.3).
- Self-Debugging Capability: The interactive nature of code execution allows for automated error messages, which in turn help LLM agents “self-debug their actions in a multi-turn interaction and eventually complete the human user’s request correctly” (a minimal driver loop illustrating this is sketched after this list).
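The sketch below illustrates such a self-debugging loop under simplifying assumptions: the LLM is stubbed with two canned actions, the first of which fails, so the resulting error text becomes the observation that precedes the corrected second action. In a real agent, llm_generate would be an actual model call; none of this is the paper's implementation.

```python
import subprocess
import sys

def execute(code: str) -> str:
    """Run one code action in a fresh interpreter; return stdout or the error text."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=30
    )
    return result.stdout if result.returncode == 0 else result.stderr

# Stand-in for an LLM: the first action has a bug, the second fixes it
# after the traceback appears in the observation.
canned_actions = iter([
    "print(sum(1 / x for x in [0, 1, 2, 4]))",   # ZeroDivisionError
    "print(sum(1 / x for x in [1, 2, 4]))",      # corrected action
])

def llm_generate(history):                        # hypothetical model call
    return next(canned_actions)

history = [("user", "Sum the reciprocals of 1, 2 and 4.")]
for _ in range(3):                                # bounded number of turns
    action = llm_generate(history)
    observation = execute(action)
    history.append(("assistant", action))
    history.append(("observation", observation))
    if "Error" not in observation:
        print("Final answer:", observation.strip())   # 1.75
        break
```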
Table 3 in the paper showcases the success rates and average turns required on M3ToolEval for various LLMs using CodeAct, JSON, and Text action formats. For instance, on the open-source LLM Llama-2-7b-chat-hf, CodeAct achieved a success rate of 28.8% compared to 11.3% for JSON and 25.8% for Text.
CodeActInstruct: An Instruction-Tuning Dataset
To enhance the CodeAct capabilities of open-source LLMs, the authors curated CodeActInstruct:
- High-Quality Multi-Turn Interactions: Contains 7k multi-turn interaction trajectories using CodeAct.
- Focus on Agent-Environment Interaction: Covers use cases like information seeking, software package usage, external memory access (SQL, Pandas), and robot planning.
- Data Selection for Improved Interaction: Emphasizes trajectories where the model initially encounters errors but successfully rectifies them in subsequent interactions, promoting “self-debug” capabilities.
- Compatibility with Existing Data: Can be combined with general instruction-tuning data to improve agent performance without compromising general language understanding.
The data for CodeActInstruct was generated by repurposing existing datasets (HotpotQA, MATH, APPS, WikiTableQuestion, ALFWorld) and using strong LLMs like GPT-3.5-turbo-0613, Claude, and GPT-4-0613 for trajectory generation.
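For illustration only, the sketch below shows what one such trajectory might look like once flattened into a training record, together with a filter in the spirit of the data-selection criterion above. The actual CodeActInstruct schema and selection code are not described in this summary and may differ.

```python
# Hypothetical shape of one multi-turn CodeAct trajectory; the real
# CodeActInstruct format may differ.
trajectory = [
    {"role": "user", "content": "How many rows in data.csv have a missing price?"},
    {"role": "assistant", "content": "import pandas as pd\n"
                                     "df = pd.read_csv('data.csv')\n"
                                     "print(df['Price'].isna().sum())"},
    {"role": "observation", "content": "KeyError: 'Price'"},
    {"role": "assistant", "content": "print(df.columns.tolist())"},
    {"role": "observation", "content": "['item', 'price', 'qty']"},
    {"role": "assistant", "content": "print(df['price'].isna().sum())"},
    {"role": "observation", "content": "3"},
]

def shows_self_debugging(traj) -> bool:
    """Keep trajectories that hit an error early but recover by the end,
    mirroring the data-selection idea described above (illustrative only)."""
    observations = [t["content"] for t in traj if t["role"] == "observation"]
    had_error = any("Error" in o for o in observations)
    ended_clean = bool(observations) and "Error" not in observations[-1]
    return had_error and ended_clean

print(shows_self_debugging(trajectory))  # True for this example
```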
CodeActAgent: Fine-Tuned LLM Agents
The authors fine-tuned Llama-2 and Mistral models on a mixture of CodeActInstruct and general conversation data, resulting in CodeActAgent:
- Improved CodeAct Performance: Demonstrated enhanced ability to interact with environments using executable code.
- Generalization to Other Action Formats: Showed improvements on out-of-domain agent tasks even with text actions in pre-defined formats (evaluated on MiniWob++ and ScienceWorld).
- Preserved General Capabilities: Maintained strong performance on general LLM benchmarks (MMLU, HumanEval, GSM8K, MTBench).
- Autonomous Self-Debugging: Capable of using error feedback from the Python interpreter to correct its code and complete tasks.
Figure 3 in the paper provides an example of a multi-turn interaction with CodeActAgent (Mistral-7b) where it uses various Python libraries (pandas, scikit-learn, matplotlib), handles errors, answers follow-up questions, and performs self-debugging for data visualization, all through executable Python code.
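The snippet below is a hypothetical code action in the spirit of that example: it builds a small synthetic dataset with pandas, fits a scikit-learn model, and saves a matplotlib figure. It is not the code shown in Figure 3, and the column names and file path are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib
matplotlib.use("Agg")            # headless backend so the action can run anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in for the tabular data a user might provide.
rng = np.random.default_rng(0)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 50)})
df["sales"] = 3.0 * df["ad_spend"] + rng.normal(0, 10, 50)

# Fit a simple model and report its coefficient.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])
print(f"Estimated slope: {model.coef_[0]:.2f}")

# Visualize the fit and save it, so the path can be reported back to the user.
xs = df.sort_values("ad_spend")
plt.scatter(df["ad_spend"], df["sales"], label="data")
plt.plot(xs["ad_spend"], model.predict(xs[["ad_spend"]]), color="red", label="fit")
plt.xlabel("ad_spend")
plt.ylabel("sales")
plt.legend()
plt.savefig("fit.png")
print("Saved plot to fit.png")
```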
Relationship to Existing Work
The paper distinguishes CodeAct from prior work that uses code generation for problem-solving. While others have explored generating code for tasks like structured prediction, math reasoning, and robot control, CodeAct offers:
- Dynamic Re-adjustment of Atomic Actions: Unlike approaches that generate complete function definitions at once, CodeAct allows for executing small code snippets and dynamically adjusting subsequent actions based on immediate feedback. The paper contrasts this with Voyager, which generates entire function definitions as actions.
- Reduced Reliance on Prompt Engineering: CodeAct operates in a setting where the LLM’s context window primarily contains past actions and observations, minimizing the need for extensive human-engineered prompts to provide relevant information for revision.
The paper also acknowledges concurrent work like TaskWeaver, which similarly endorses the use of code for LLM agents, but highlights principal distinctions in their approaches (detailed in Section B of the paper).
Key Quotes
- “This work proposes to use executable Python code to consolidate LLM agents’ actions into a unified action space (CodeAct).”
- “Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions.”
- “Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate).”
- “The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language.”
- “CodeAct employs Python code to consolidate all actions for agent-environment interaction. In CodeAct, each emitted action to the environment is a piece of Python code, and the agent will receive outputs of code execution (e.g., results, errors) as observation.”
- “Furthermore, using the interactive Python interpreter for code execution allows automated error messages that help the LLM agent ‘self-debug’ their actions in a multi-turn interaction and eventually complete the human user’s request correctly.”
- “To improve open-source LLMs’ CodeAct capability, in §3.1, we introduce CodeActInstruct, an instruction finetuning dataset that contains agent-environment interaction trajectories.”
Potential Implications
- Advancement of LLM Agents: CodeAct provides a more powerful and flexible paradigm for building LLM agents capable of tackling complex real-world tasks.
- Improved Open-Source Agent Capabilities: CodeActInstruct and CodeActAgent offer valuable resources and models for the open-source community to develop more sophisticated agents.
- Enhanced Tool Usage and Composition: The ability to execute arbitrary Python code allows agents to leverage a vast ecosystem of existing libraries and tools, and to compose them in intricate ways.
- More Natural Multi-Turn Interactions: The dynamic nature of CodeAct facilitates more adaptive and effective multi-turn interactions between users and agents.
- Safety Considerations: The paper acknowledges the potential risks associated with granting LLM agents direct access to code execution and highlights the need for future work on safety mechanisms.
Further Research Directions
The paper suggests several avenues for future research, including:
- Designing more robust safety mechanisms for autonomous agents with code execution capabilities.
- Exploring the scalability and applicability of CodeAct in even more complex and dynamic environments.
- Investigating methods for further improving the self-debugging and reasoning abilities of CodeAct-enabled agents.
- Developing more comprehensive benchmarks for evaluating LLM agents with executable action spaces.
Conclusion
The “Executable Code Actions Elicit Better LLM Agents” paper presents a compelling case for using executable Python code as a unified action space for LLM agents. The proposed CodeAct framework, along with the CodeActInstruct dataset and the fine-tuned CodeActAgent models, demonstrates significant improvements in performance and flexibility compared to traditional action formats. This work represents a notable step forward in the development of more capable and adaptable LLM agents for real-world applications.