
Executable Code Actions Elicit Better LLM Agents

Content has been generated from NotebookLM

Summary

This briefing document summarizes the key findings and contributions of the paper “Executable Code Actions Elicit Better LLM Agents.” The paper introduces CodeAct, a novel approach that consolidates Large Language Model (LLM) agent actions into a unified action space using executable Python code. By integrating LLMs with a Python interpreter, CodeAct allows for dynamic action revision and the emission of new actions based on real-time feedback from the environment. The authors demonstrate through extensive experimentation on existing and newly curated benchmarks that CodeAct significantly outperforms traditional action formats (JSON or text), leading to higher success rates in complex, multi-turn tasks. Furthermore, the paper presents CodeActInstruct, an instruction-tuning dataset designed to enhance open-source LLMs’ ability to interact with environments through executable code. The fine-tuned models, CodeActAgent (based on Llama2 and Mistral), showcase improved performance in agent-oriented tasks while maintaining general capabilities, and exhibit autonomous self-debugging.

Main Themes and Important Ideas

Limitations of Traditional LLM Agent Action Formats

The paper highlights that current LLM agents typically specify actions in pre-defined formats such as JSON or plain text. These formats constrain the agent to a fixed set of pre-defined tools and limit its flexibility, for example its ability to compose multiple tools, reuse existing software packages, or express control and data flow within a single action.
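
As a concrete illustration (not drawn from the paper's prompts), the sketch below contrasts a JSON-style action with an equivalent code action. The tools search_flights and book_flight are hypothetical stubs introduced here so the example runs; with a JSON format each turn can invoke only one pre-defined tool, whereas the code action composes both tools with ordinary control and data flow in a single turn.

def search_flights(origin, dest):
    # Stub standing in for a real flight-search tool (hypothetical).
    return [{"id": "F1", "price": 320}, {"id": "F2", "price": 280}]

def book_flight(flight_id):
    # Stub standing in for a real booking tool (hypothetical).
    return f"booked {flight_id}"

# JSON-style action: one pre-defined tool call per turn, no composition.
json_action = {"tool": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}}

# Code-style action: compose tools with control and data flow in a single turn.
flights = search_flights("SFO", "JFK")
cheapest = min(flights, key=lambda f: f["price"])
print(book_flight(cheapest["id"]))  # interpreter output is fed back to the agent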

Introduction of CodeAct: A Unified Action Space

To overcome these limitations, the authors propose CodeAct, which uses executable Python code as the agent's unified action space. Each action is a Python snippet executed by an interpreter, allowing the agent to leverage existing Python packages, compose multiple operations with control and data flow, and revise earlier actions based on the interpreter's output or error messages.

The paper provides an illustrative example where a user asks for the sum of the reciprocals of the roots of a quadratic equation. The LLM utilizes CodeAct to generate and execute Python code using the sympy library to solve the problem:

import sympy

# Solve x^2 - 13x + 4 = 0 and sum the reciprocals of its roots.
x = sympy.Symbol('x')
roots = sympy.solve(x**2 - 13*x + 4)
print(1/roots[0] + 1/roots[1])  # equal to 13/4, since (r1 + r2) / (r1 * r2) = 13/4

Performance Advantages of CodeAct

Extensive evaluations on API-Bank and a newly created benchmark, M3ToolEval, demonstrate the superiority of CodeAct over the JSON and text action formats, particularly on tasks that require composing multiple tools across turns.

Table 3 in the paper showcases the success rates and average turns required on M3ToolEval for various LLMs using CodeAct, JSON, and Text action formats. For instance, on the open-source LLM Llama-2-7b-chat-hf, CodeAct achieved a success rate of 28.8% compared to 11.3% for JSON and 25.8% for Text.

CodeActInstruct: An Instruction-Tuning Dataset

To enhance the CodeAct capabilities of open-source LLMs, the authors curated CodeActInstruct, an instruction-tuning dataset of multi-turn agent-environment interaction trajectories in which the agent acts by writing and executing code.

The data for CodeActInstruct was generated by repurposing existing datasets (HotpotQA, MATH, APPS, WikiTableQuestion, ALFWorld) and using strong LLMs like GPT-3.5-turbo-0613, Claude, and GPT-4-0613 for trajectory generation.
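
The paper's exact data format is not reproduced in this summary; the sketch below shows one hypothetical way such a multi-turn trajectory could be represented, with assistant turns containing executable code and a subsequent turn carrying the interpreter's output. The field names and the toy task are illustrative assumptions, not the dataset's actual schema.

# Hypothetical sketch of a CodeActInstruct-style trajectory; field names
# and the toy task are illustrative, not the dataset's actual schema.
trajectory = [
    {"role": "user", "content": "What is the population standard deviation of [2, 4, 4, 4, 5, 5, 7, 9]?"},
    {"role": "assistant", "content": "import statistics\nprint(statistics.pstdev([2, 4, 4, 4, 5, 5, 7, 9]))"},
    {"role": "user", "content": "Execution output: 2.0"},  # feedback from the interpreter
    {"role": "assistant", "content": "The population standard deviation is 2.0."},
]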

CodeActAgent: Fine-Tuned LLM Agents

The authors fine-tuned Llama-2 and Mistral models on a mixture of CodeActInstruct and general conversation data, resulting in CodeActAgent. The fine-tuned models improve on agent-oriented tasks while maintaining general capabilities, and they can self-debug by revising their code in response to error messages.

Figure 3 in the paper provides an example of a multi-turn interaction with CodeActAgent (Mistral-7b) where it uses various Python libraries (pandas, scikit-learn, matplotlib), handles errors, answers follow-up questions, and performs self-debugging for data visualization, all through executable Python code.
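
Figure 3 itself is not reproduced here. As a rough, hypothetical reconstruction of the kind of code action such an agent might emit for a data-analysis request, consider the snippet below; the file name data.csv, its columns, and the choice of a linear model are illustrative assumptions, not details from the paper.

# Hypothetical CodeAct-style data-analysis action; "data.csv", its columns,
# and the linear model are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                      # load the user's dataset
X, y = df[["feature"]], df["target"]

model = LinearRegression().fit(X, y)              # fit a simple baseline model
print(f"R^2 on training data: {model.score(X, y):.3f}")

plt.scatter(df["feature"], y, label="data")       # visualize the fit against the data
plt.plot(df["feature"], model.predict(X), label="fit")
plt.legend()
plt.savefig("fit.png")                            # the saved figure is reported back to the user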

Relationship to Existing Work

The paper distinguishes CodeAct from prior work that uses code generation for problem-solving. While others have explored generating code for tasks like structured prediction, math reasoning, and robot control, CodeAct offers a general-purpose, unified action space and a multi-turn interaction loop in which actions are dynamically revised based on execution feedback, rather than single-turn code generation tailored to a specific task (a minimal sketch of this loop appears at the end of this section).

The paper also acknowledges concurrent work like TaskWeaver, which similarly endorses the use of code for LLM agents, but highlights principal distinctions in their approaches (detailed in Section B of the paper).
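
To make the multi-turn mechanism concrete, here is a minimal sketch of the interaction loop described above: the model emits Python code as its action, the code is executed, and the interpreter's stdout or error traceback is appended to the conversation so the model can revise its next action. The helpers call_llm and run_code are placeholders, not APIs from the paper or its released code.

import io
import traceback
from contextlib import redirect_stdout

def run_code(code, env):
    """Execute a code action and return its stdout, or the error traceback."""
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            exec(code, env)                  # shared env keeps state across turns
        return buf.getvalue() or "(no output)"
    except Exception:
        return traceback.format_exc()        # errors become feedback for self-debugging

def codeact_loop(task, call_llm, max_turns=5):
    """Minimal CodeAct-style loop; call_llm is a placeholder LLM client."""
    messages = [{"role": "user", "content": task}]
    env = {}
    for _ in range(max_turns):
        code = call_llm(messages)            # the model's action is executable Python
        messages.append({"role": "assistant", "content": code})
        observation = run_code(code, env)
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return messages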

Key Quotes

Potential Implications

Further Research Directions

The paper suggests several avenues for future research.

Conclusion

The “Executable Code Actions Elicit Better LLM Agents” paper presents a compelling case for using executable Python code as a unified action space for LLM agents. The proposed CodeAct framework, along with the CodeActInstruct dataset and the fine-tuned CodeActAgent models, demonstrates significant improvements in performance and flexibility compared to traditional action formats. This work represents a notable step forward in the development of more capable and adaptable LLM agents for real-world applications.

