- Authors: Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
- Source: Excerpts from “2502.13130.pdf”
Executive Summary
Magma is a new foundation model designed to function as a multimodal AI agent. It extends traditional vision-language (VL) models by adding the ability to plan and act within visual and spatial environments (both digital and physical). Magma is pre-trained on a wide range of datasets including images, videos, and robotics data. A key aspect of Magma’s training involves “Set-of-Mark” (SoM) and “Trace-of-Mark” (ToM) techniques for action grounding and planning. Experimental results demonstrate that Magma exhibits strong spatial-temporal reasoning skills and outperforms existing models in UI navigation and robotic manipulation tasks.
Key Themes and Concepts
- Multimodal Understanding and Action: Magma aims to understand multimodal input (visual, linguistic) from various domains semantically, spatially, and temporally, and to break down long-horizon tasks into accurate action sequences. The goal is for the agent system to be driven by external, human-specified goals.
- Foundation Model for AI Agents: Magma is presented as a foundation model specifically designed for AI agents, capable of performing tasks in both digital and physical environments. This differs from many existing models which are often trained separately for 2D digital and 3D physical worlds. Magma strives for generalizability across tasks and domains.
- Set-of-Mark (SoM) for Action Grounding: SoM is a technique in which actionable points/regions in an image are marked with numerical labels. The model learns to select the appropriate mark, along with its corresponding coordinates, for a given task, which makes action grounding easier: “Given the prompted image IMt in an atomic action step, the model needs to select the candidate marks along with the original coordinates, significantly easing the action grounding for the agentic model.”
- SoM prompting is applied to UI screenshots, robotics images, and instructional videos (a minimal overlay sketch follows).
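Below is a minimal sketch of SoM-style mark overlay, assuming candidate regions have already been produced by an upstream proposal step; the hard-coded boxes and the `overlay_marks` helper are illustrative, not the paper's pipeline.

```python
# Minimal Set-of-Mark (SoM) overlay sketch (illustrative, not the paper's code).
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw numeric marks (1, 2, ...) on candidate actionable regions."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")
    return marked

# The model is then prompted with the marked image and asked to output a
# mark id (and its coordinates) rather than free-form pixel values.
screenshot = Image.new("RGB", (640, 480), "white")      # stand-in for a UI screenshot
candidates = [(40, 40, 120, 80), (200, 150, 320, 190)]  # stand-in region proposals
marked = overlay_marks(screenshot, candidates)
```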
- Trace-of-Mark (ToM) for Action Planning: ToM extends SoM to dynamic videos. The positions of the overlaid marks are tracked over time to form traces, and the model is trained to predict the future trajectories of these marks. This helps the model understand temporal dynamics and “look ahead of time.” “Unlike predicting next frames…predicting traces uses much fewer tokens to capture much longer temporal horizon and action-related object dynamics, while disregarding ambient contents.”
- ToM is applied to robotics and instructional video data (a sketch of trace-target construction follows).
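A minimal sketch of how ToM prediction targets could be built from tracked mark positions; the array shapes and the `build_tom_targets` helper are assumptions for illustration, not the paper's code.

```python
# Minimal Trace-of-Mark (ToM) target-construction sketch (illustrative).
import numpy as np

def build_tom_targets(tracks, context):
    """Split mark tracks into an observed prefix and a future trace to predict.

    tracks: array of shape (num_frames, num_marks, 2) holding (x, y) positions,
            e.g. produced by an off-the-shelf point tracker such as CoTracker.
    context: number of frames the model observes; the remainder is the target.
    """
    observed = tracks[:context]       # mark positions shown to the model
    future_trace = tracks[context:]   # positions the model must predict
    return observed, future_trace

# A trace is only a few (x, y) values per mark per frame, which is why it is
# far cheaper than predicting full future frames.
tracks = np.random.rand(16, 5, 2)     # 16 frames, 5 marks (stand-in data)
observed, target = build_tom_targets(tracks, context=4)
```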
- Architecture and Pre-training: Magma uses a ConvNeXt vision encoder to encode visual input and a decoder-only LLM to process language and visual tokens jointly. The model is pre-trained on a comprehensive dataset spanning the image, video, and robotics domains.
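A minimal PyTorch sketch of this encoder-plus-decoder wiring; the module names, dimensions, and linear projector below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a vision-encoder + decoder-only-LLM agent model (illustrative).
import torch
import torch.nn as nn

class MagmaLikeModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a ConvNeXt backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # map visual features into LLM space
        self.llm = llm                                    # decoder-only language model

    def forward(self, images, text_embeds):
        feats = self.vision_encoder(images)       # assumed (B, N, vision_dim) features
        visual_tokens = self.projector(feats)     # (B, N, llm_dim)
        # Visual tokens are concatenated with language tokens and decoded jointly,
        # so the same head can emit text, mark selections, and trace coordinates.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```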
- Downstream Tasks: Magma can be fine-tuned for:
- Image Captioning and QA: Achieving competitive performance with better spatial understanding.
- Video Captioning and QA: Achieving competitive performance with better temporal understanding.
Key Capabilities and Performance
- Zero-Shot Performance: Magma demonstrates strong zero-shot performance on various benchmarks related to agentic intelligence, without task-specific fine-tuning. It can perform across the full task spectrum, unlike other models tested.
- UI Navigation: Magma achieves high accuracy in UI navigation tasks on both web and mobile platforms (Mind2Web, AITW), outperforming existing models in element-selection accuracy, operation F1, and step-wise success rate. The excerpted Mind2Web row for Magma-8B (LLaMA3 backbone [92]) unpacks as:

| Mind2Web split | Ele. Acc | Op. F1 | Step SR |
| --- | --- | --- | --- |
| Cross-task | 57.2 | 76.9 | 45.4 |
| Cross-website | 54.8 | 79.7 | 43.4 |
| Cross-domain | 55.7 | 80.6 | 47.3 |
- Robotic Manipulation: Magma performs strongly on robotic manipulation tasks, as demonstrated on a WidowX robot and in the LIBERO simulation benchmark, and few-shot fine-tuning further improves its performance. Removing SoM/ToM during pre-training degrades performance, highlighting the effectiveness of these techniques.
- In one experiment, it successfully completes tasks such as “Put the sausage to hotdog,” while a baseline model (OpenVLA) fails.
- Spatial Reasoning: Magma shows strong spatial reasoning ability, answering spatial reasoning questions well and even outperforming GPT-4o in some instances despite using less pre-training data.
- Video QA: Magma performs competitively and outperforms some state-of-the-art video LMMs on zero-shot Video QA benchmarks.
Technical Details
- Vision Encoder: A ConvNeXt convolutional network is used for image and video encoding.
- Training Data: Curated dataset including robotics manipulation data (Open-X-Embodiment), UI navigation data (SeeClick, Vision2UI), instructional videos, Ego4D, and synthetic image-text pairs (ShareGPT4V, LLaVA-1.5).
- SoM Implementation: Bounding boxes or points are marked with numerical labels in the image. Various proposal networks can be used to obtain candidate regions.
- ToM Implementation: Point tracking models (e.g., Co-Tracker) are used to track keypoints in video segments. Homography transformation is applied to remove global camera motion.
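A minimal OpenCV sketch of removing global camera motion with a homography; the point correspondences (in practice drawn from background/static tracks) and the `stabilize_tracks` helper are assumptions for illustration.

```python
# Minimal camera-motion-removal sketch using a homography (illustrative).
import cv2
import numpy as np

def stabilize_tracks(pts_prev, pts_curr):
    """Warp current-frame points back into the previous frame's coordinates.

    pts_prev, pts_curr: (N, 2) float32 arrays of corresponding point positions,
    in practice estimated from background points so that foreground object
    motion is preserved once the global (camera) motion is removed.
    """
    H, _ = cv2.findHomography(pts_curr, pts_prev, cv2.RANSAC, 3.0)
    if H is None:
        return pts_curr  # fall back: no reliable homography found
    pts = pts_curr.reshape(-1, 1, 2).astype(np.float32)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```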
Datasets Utilized
The following datasets are used in pre-training:
- Robotics Manipulation: Open-X-Embodiment
- UI Navigation: SeeClick, Vision2UI
- Multimodal Understanding: ShareGPT4V, LLaVA 1.5
- Videos: Ego4D, instructional videos, Something-Something V2, Epic-Kitchens
The following datasets are used in fine-tuning/evaluation:
- ScreenSpot
- Mind2Web
- AITW
- LIBERO
Ablation Studies
- Combining UI and robotics data without SoM/ToM doesn’t yield performance gains.
- Removing SoM/ToM negatively impacts spatial reasoning and robot manipulation performance, demonstrating the effectiveness of the pre-training method.
Limitations and Social Impacts
The paper briefly acknowledges social impacts and limitations without detailing them.
Quotes
- “We introduce Magma, the first foundation model that is capable of interpreting and grounding multimodal inputs within its environment. Given a described goal, Magma is able to formulate plans and execute actions to achieve it.”
- “Such an agent system should be driven by external goals specified by human commands as shown in Fig. 2.”
- “Unlike predicting next frames as used in [77], predicting traces uses much fewer tokens to capture much longer temporal horizon and action-related object dynamics, while disregarding ambient contents.”
Conclusion
Magma represents a significant step towards building versatile multimodal AI agents. Its ability to understand and act in both digital and physical environments, combined with its strong spatial-temporal reasoning, makes it a promising foundation model for a wide range of applications. The use of SoM and ToM techniques appears crucial to its success.