The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Content has been generated from NotebookLM

Authors: Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel
Source: Excerpts from “2404.13208”

Summary

This paper addresses a critical vulnerability in modern Large Language Models (LLMs): their susceptibility to prompt injection attacks, jailbreaks, and system prompt extractions. The authors argue that this stems from the lack of a clear instruction hierarchy, where LLMs treat instructions from application developers (system messages) with the same priority as those from potentially malicious users or third-party sources. To counter this, the paper proposes and demonstrates an “instruction hierarchy” that explicitly defines how LLMs should prioritize and handle conflicting instructions based on their origin and privilege level. By developing an automated data generation method to train LLMs on this hierarchy, the authors demonstrate a significant increase in robustness against various attack types, including those not seen during training, with minimal impact on standard LLM capabilities.

Main Themes and Important Ideas

Vulnerability of Current LLMs
- LLMs are vulnerable to attacks like prompt injections, jailbreaks, and system prompt extractions.
- These attacks can lead to unsafe actions, data exfiltration, bypassing restrictions, and revealing sensitive information.
- The core issue is that current LLMs lack a mechanism to differentiate and prioritize instructions from different sources.
- As the paper states, “one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties.”
The Instruction Hierarchy Concept
- The paper proposes an explicit “instruction hierarchy” to define how LLMs should behave when instructions of different priorities conflict.
- This hierarchy assigns different privilege levels to various message types:
  - Highest Privilege: System Message (from application developers)
  - Medium Privilege: User Message (from end users)
  - Lower Privilege: Tool Outputs (e.g., web search results)
- The principle is that higher-privileged instructions should take precedence over lower-privileged ones in case of conflict.
- The authors use an analogy: “Using this analogy, the current state of affairs is that every instruction is executed as if it was in kernel mode, i.e., untrusted third-parties can run arbitrary code with access to private data and functions.” The instruction hierarchy aims to establish privilege separation.
Aligned vs. Misaligned Instructions
- The paper distinguishes between “aligned” and “misaligned” lower-privileged instructions in relation to higher-privileged ones.
- Aligned Instructions: These have the same goals or constraints as higher-level instructions and should be followed (e.g., a user asking a car salesman bot to speak Spanish).
- Misaligned Instructions: These conflict with or contradict higher-level instructions and should be ignored or lead to refusal (e.g., a web search result instructing the LLM to exfiltrate user data).
- The goal is for models to “conditionally follow lower-level instructions based on their alignment with higher-level instructions.”
Automated Data Generation for Training
- To instill the instruction hierarchy in LLMs, the authors developed an automated data generation method based on two principles:
  - Context Synthesis (for Aligned Instructions): Compositional requests are decomposed into smaller pieces and placed at different hierarchy levels, training the model to predict the original response.
  - Context Ignorance (for Misaligned Instructions): Models are trained to ignore lower-level misaligned instructions and predict the same output they would have generated without them, or to refuse to comply if necessary.
- This method was applied to different attack scenarios like direct and indirect prompt injections, and system message extraction.
- For closed-domain prompt injections, the focus is on training models to treat user inputs as data, not instructions. The paper notes, “we argue that there are no Aligned instructions for closed-domain tasks, e.g., if a developer puts in an instruction such as ‘Summarize the below text’, the model should summarize the text no matter what the user inserts.”
Evaluation and Results
- The method was applied to fine-tune GPT-3.5 Turbo using supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF).
- The resulting model was evaluated on open-source and novel benchmarks, including in-domain attacks and those designed to test generalization.
- The results show a “dramatically improved robustness” across all evaluations compared to a baseline model without the instruction hierarchy training.
- For example, “defense against system prompt extraction is improved by 63%,” and “jailbreak robustness increases by over 30%,” even though no jailbreak data was used during training (demonstrating generalization).
- Qualitative examples illustrate how the trained model correctly ignores malicious instructions while still following legitimate ones. One example shows the model treating an instruction within the user message as part of the data to be translated, not as a new instruction to execute.
Over-Refusal Considerations
- The authors acknowledge the risk of “over-refusals,” where the model might incorrectly ignore benign lower-priority instructions.
- Evaluations were conducted to assess this, showing that while some regressions were observed on specific adversarial over-refusal benchmarks, the model generally performs well on benign instructions.
- They state, “on typical real-world usages, we do not expect the instruction hierarchy to cause noticeable degradations in model behavior.”
Comparison to Other Defenses
- The paper positions the instruction hierarchy approach in relation to other defense mechanisms, particularly for prompt injection in closed-domain tasks.
- While some prior work focused on ignoring all user instructions in such contexts, this work emphasizes a multi-level hierarchy and the conditional following of aligned lower-level instructions.
- The authors also note that their model-based approach is complementary to system-level guardrails.

Future Work

The paper outlines several avenues for future research, including:

Refining how models handle conflicting instructions.
Extending the hierarchy to multi-modal inputs (images, audio).
Exploring model architecture changes to better instill the hierarchy.
Conducting more explicit adversarial training to further improve robustness.
Investigating the applicability of this approach to high-stakes agentic applications.

Key Quotes

Today’s LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model’s original instructions with their own malicious prompts.
In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties.
We thus propose to instill such a hierarchy into LLMs, where system messages take precedence over user messages, and user messages take precedence over third-party content…
Our approach yields dramatically improved robustness across all evaluations… even increasing robustness by up to 63%.
The instruction hierarchy also exhibits generalization to each of the evaluation criteria that we explicitly excluded from training… suggesting that the LLM has learned to internalize the instruction hierarchy.

Conclusion

This paper presents a compelling solution to a fundamental vulnerability in modern LLMs by introducing and demonstrating the effectiveness of an instruction hierarchy. By training models to prioritize instructions based on their source, the authors achieve significant improvements in robustness against various attack vectors, showcasing the potential for building more secure and reliable LLM-powered applications. The generalization of the learned hierarchy to unseen attacks and the minimal degradation in standard capabilities highlight the promise of this approach. Future work will likely focus on further refining and expanding this framework to address the evolving landscape of LLM security challenges.