**💻📰 Magma: A foundation model for multimodal AI agents**

Magma is a foundation model for multimodal AI agents, trained with a pipeline that handles text, images, and videos. Text is tokenized, while visual inputs are encoded by a shared vision encoder; the combined token sequence is fed to a large language model (LLM), which generates outputs in verbal, spatial, and action formats.
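As a rough illustration, here is a minimal sketch of that pipeline in PyTorch. All names and dimensions are hypothetical, and a small Transformer encoder stands in for the LLM backbone; the point is only the data flow: visual tokens and text tokens are concatenated and decoded together.

```python
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Shared encoder: maps image/video frames to a sequence of visual tokens."""
    def __init__(self, patch=16, dim=512):
        super().__init__()
        # Patchify with a strided conv, projecting into the LLM's token space.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, frames):               # frames: (B, 3, H, W)
        x = self.proj(frames)                # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class MultimodalAgent(nn.Module):
    """Concatenate text and visual tokens, then decode with a language model."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)   # tokenized text -> embeddings
        self.vision = VisionEncoder(dim=dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=4)  # LLM stand-in
        self.head = nn.Linear(dim, vocab)       # verbal/spatial/action tokens

    def forward(self, text_ids, frames):
        tokens = torch.cat([self.vision(frames), self.embed(text_ids)], dim=1)
        return self.head(self.llm(tokens))      # next-token logits

logits = MultimodalAgent()(torch.randint(0, 32000, (1, 8)),
                           torch.randn(1, 3, 224, 224))
print(logits.shape)  # (1, 204, 32000): 196 visual + 8 text tokens
```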

The model incorporates two key mechanisms: Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning. SoM grounds actions in images by predicting coordinates for interactions such as button clicks or robot-arm movements. ToM improves action planning by predicting the future trajectories of marks in videos, capturing the relevant action dynamics over longer temporal horizons while requiring far fewer tokens than traditional next-frame prediction, which helps the model understand temporal dynamics and anticipate future states.
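To make the two mechanisms concrete, here is a toy sketch of what SoM and ToM supervision targets might look like. The mark names, coordinate layout, and horizon are assumptions for illustration; in the model these targets are serialized as tokens for the LLM to predict.

```python
# Hypothetical SoM/ToM target construction, for illustration only.

def som_targets(regions):
    """Set-of-Mark: number candidate regions so an action grounds to a mark."""
    # regions: (x, y) centers of actionable elements, e.g. from a UI detector
    return {f"mark_{i}": (x, y) for i, (x, y) in enumerate(regions, start=1)}

def tom_targets(tracks, horizon=4):
    """Trace-of-Mark: future (x, y) positions of each mark over `horizon` frames.
    Supervising short traces needs far fewer tokens than full next-frame prediction."""
    return {mark: path[1:1 + horizon] for mark, path in tracks.items()}

# A button at (120, 40), and a robot gripper tracked across five frames:
print(som_targets([(120, 40), (300, 200)]))
print(tom_targets({"mark_1": [(10, 10), (12, 11), (15, 13), (19, 16), (24, 20)]}))
```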

[Read More](https://microsoft.github.io/Magma/)

💬 [HN Comments](https://news.ycombinator.com/item?id=43110265) (60)
