Archives

All the articles I've archived.

2025 ¹⁶

June ¹

How much do language models memorize?
Published:Jun 14, 2025
Explores the concept of memorization in large language models (LLMs), introducing a novel method to quantify it and distinguish it from generalization. The authors define model capacity and investigate its relationship with dataset size, training dynamics, and membership inference.

March ⁵

The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers
Published:Mar 30, 2025
This paper investigates the impact of Generative AI (GenAI) tools on critical thinking skills and practices among knowledge workers. Through a survey of 319 participants who shared 936 real-world examples of using GenAI in their work, the study explores when and how critical thinking is enacted and how GenAI affects the effort involved.
Executable Code Actions Elicit Better LLM Agents
Published:Mar 23, 2025
This briefing document summarizes the key findings and contributions of the paper "Executable Code Actions Elicit Better LLM Agents." The paper introduces CodeAct, a novel approach that consolidates Large Language Model (LLM) agent actions into a unified action space using executable Python code. By integrating LLMs with a Python interpreter, CodeAct allows for dynamic action revision and the emission of new actions based on real-time feedback from the environment.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Published:Mar 15, 2025
This paper addresses a critical vulnerability in modern Large Language Models (LLMs): their susceptibility to prompt injection attacks, jailbreaks, and system prompt extractions. The authors argue that this stems from the lack of a clear instruction hierarchy, where LLMs treat instructions from application developers (system messages) with the same priority as those from potentially malicious users or third-party sources.
LLMs Can Teach Themselves to Better Predict the Future
Published:Mar 9, 2025
This paper introduces a novel framework for improving the forecasting capabilities of Large Language Models (LLMs) through outcome-driven fine-tuning. The method leverages model self-play to generate diverse reasoning trajectories and probabilistic forecasts for future events. These forecasts are then ranked based on their accuracy compared to actual outcomes, and the model is fine-tuned using Direct Preference Optimization (DPO). The results demonstrate significant accuracy improvements (7-10%) on Phi-4 14B and DeepSeek-R1 14B models, bringing their performance on par with much larger models like GPT-4o, without relying on human-curated reasoning samples. This approach has implications for decision-making across various sectors like finance, policy, and law.
Introducing the Model Context Protocol
Published:Mar 1, 2025
The Model Context Protocol (MCP) is an open standard developed by Anthropic to facilitate seamless and secure integration between AI applications/agents and external data sources, tools, and systems. It aims to address the problem of fragmented integrations and data silos that limit the effectiveness of AI assistants. MCP provides a universal protocol for connecting AI systems with data, promoting a more scalable and reliable way to provide AI systems with the necessary context. The core principle is that "models are only as good as the context we provide to them."

February ⁵

Magma: A Foundation Model for Multimodal AI Agents
Published:Feb 25, 2025
Magma is a multimodal agentic AI model that can generate text based on the input text and image. The model is designed for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, in particular the multimodal agentic AI. The main innovation of this model lies on the introduction of two technical innovations: Set-of-Mark and Trace-of-Mark, and the leverage of a large amount of unlabeled video data to learn the spatial-temporal grounding and planning.
Retrieval Augmented Generation or Long-Context LLMs
Published:Feb 22, 2025
This document summarizes the findings of a comprehensive study comparing Retrieval Augmented Generation (RAG) and Long-Context (LC) Large Language Models (LLMs) for processing lengthy contexts. The study benchmarks both approaches across various public datasets using recent LLMs (Gemini-1.5-Pro, GPT-4O, and GPT-3.5-Turbo). The key finding is that LC models, when resourced sufficiently, generally outperform RAG in terms of average performance. However, RAG maintains a significant cost advantage due to the reduced input length to the LLM. Based on these observations, the study introduces SELF-ROUTE, a method that intelligently routes queries to either RAG or LC based on model self-reflection, significantly reducing computational costs while maintaining performance comparable to LC. The findings provide guidance for building long-context applications utilizing both RAG and LC.
International AI Safety Report 2025
Published:Feb 16, 2025
The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks – since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Published:Feb 8, 2025
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Published:Feb 1, 2025
IntellAgent, an open-source multi-agent framework, is presented as a novel solution for comprehensively evaluating conversational AI systems. It addresses limitations of existing methods by automating the generation of diverse, realistic, policy-driven scenarios using a graph-based policy model. The framework simulates interactions between user and chatbot agents, providing fine-grained performance diagnostics and actionable insights for optimization. IntellAgent's modular design promotes reproducibility and collaboration, bridging the gap between research and deployment. Its effectiveness is demonstrated through experiments comparing its results to those of established benchmarks like τ-bench.

January ⁵

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts
Published:Jan 29, 2025
This document reviews the key concepts and findings from two sources related to PromptWizard, a prompt optimization framework developed by Microsoft Research. These sources highlight the limitations of existing prompt optimization techniques, particularly for closed-source Large Language Models (LLMs), and introduce PromptWizard as a novel, iterative approach that leverages feedback and iterative refinement.
Automatic Prompt Engineering with Large Language Models
Published:Jan 26, 2025
This research paper introduces Automatic Prompt Engineer (APE), an algorithm that uses large language models (LLMs) to automatically generate and select optimal prompts for various tasks. APE surpasses human performance in prompt engineering by treating instructions as "programs" and optimizing them through a search process guided by LLMs.
VOYAGER: An Open-Ended Embodied Agent with Large Language Models
Published:Jan 21, 2025
This paper introduces VOYAGER, a novel AI agent powered by Large Language Models (LLMs) that demonstrates lifelong learning capabilities within the Minecraft environment.
StructRAG: Retrieval-Augmented Generation via Hybrid Information Structurization
Published:Jan 13, 2025
This comprehensive study explores the burgeoning field of prompt engineering, encompassing a wide array of techniques used to elicit desired outputs from Generative AI (GenAI) models, particularly focusing on large language models (LLMs).
Prompt Engineering for Large Language Models
Published:Jan 6, 2025
This comprehensive study explores the burgeoning field of prompt engineering, encompassing a wide array of techniques used to elicit desired outputs from Generative AI (GenAI) models, particularly focusing on large language models (LLMs).

Archives

How much do language models memorize?

The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers

Executable Code Actions Elicit Better LLM Agents

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

LLMs Can Teach Themselves to Better Predict the Future

Introducing the Model Context Protocol

Magma: A Foundation Model for Multimodal AI Agents

Retrieval Augmented Generation or Long-Context LLMs

International AI Safety Report 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

Automatic Prompt Engineering with Large Language Models

VOYAGER: An Open-Ended Embodied Agent with Large Language Models

StructRAG: Retrieval-Augmented Generation via Hybrid Information Structurization

Prompt Engineering for Large Language Models