- Authors: Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
- Source: Excerpts from “2407.16833.pdf” (arXiv:2407.16833)
Key Themes and Ideas:
- RAG vs. LC Performance Trade-offs
- LC Superior Performance: The research indicates that recent long-context LLMs (such as Gemini-1.5 and GPT-4o) generally outperform RAG when given sufficient resources. The authors state, “LC consistently outperforms RAG in almost all settings (when resourced sufficiently). This demonstrates the superior progress of recent LLMs in long-context understanding.”
- RAG Cost Efficiency: RAG’s primary advantage is its significantly lower computational cost, which keeps it relevant despite LC’s stronger results: “In contrast to LC, RAG significantly decreases the input length to LLMs, leading to reduced costs, as LLM API pricing is typically based on the number of input tokens.” (A toy cost comparison appears after this list.)
- Exception: RAG can outperform LC when the input text significantly exceeds the LLM’s context window size. This was observed with GPT-3.5-Turbo on the longer datasets from ∞Bench.
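To make the cost argument concrete, here is a back-of-the-envelope sketch of per-query cost under per-input-token pricing. The price and token counts below are illustrative assumptions, not figures from the paper.

```python
# Toy cost comparison for a single query, billed on input tokens only.
# The price and token counts are illustrative assumptions, not paper figures.

PRICE_PER_1K_INPUT_TOKENS = 0.005  # assumed API price in dollars

def query_cost(input_tokens: int) -> float:
    """Cost of one LLM call under per-input-token pricing."""
    return input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

lc_tokens = 100_000   # full long context fed to the LLM
rag_tokens = 3_000    # top-k retrieved chunks plus the query

print(f"LC  cost: ${query_cost(lc_tokens):.4f}")   # $0.5000
print(f"RAG cost: ${query_cost(rag_tokens):.4f}")  # $0.0150
print(f"RAG uses {rag_tokens / lc_tokens:.1%} of the input tokens")
```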
- The SELF-ROUTE Approach
- Motivation: The core idea behind SELF-ROUTE stems from the observation that RAG and LC often produce identical predictions: “for 63% queries, the model predictions are exactly identical,” and “for these queries, RAG can reduce cost without sacrificing performance.” This suggests RAG can serve the “majority of queries, reserving computationally more expensive LC for a small subset of queries where it truly excels.”
- Method: SELF-ROUTE leverages the LLM itself to determine whether a query can be answered from the retrieved context provided by RAG. If the LLM deems the query “unanswerable” given the retrieved chunks, the query is re-processed with the full long context (LC). “SELF-ROUTE utilizes LLM itself to route queries based on self-reflection, under the assumption that LLMs are well-calibrated in predicting whether a query is answerable given provided context.” (A minimal sketch of this routing follows this list.)
- Results: SELF-ROUTE achieves performance comparable to LC while significantly reducing cost. “With SELF-ROUTE, we significantly reduce the cost while achieving overall performance comparable to LC.” For instance, cost is reduced by 65% for Gemini-1.5-Pro and by 39% for GPT-4o.
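A minimal sketch of the two-step routing described above, assuming a generic LLM API and retriever. `call_llm` and `retrieve_top_k` are hypothetical stand-ins, and the prompt wording is illustrative; the paper’s exact prompts are not reproduced here.

```python
# Minimal sketch of SELF-ROUTE's two-step routing. `call_llm` and
# `retrieve_top_k` are hypothetical stand-ins for an LLM API and a
# retriever (e.g. Contriever or Dragon); the prompt wording is illustrative.

UNANSWERABLE = "unanswerable"

def self_route(query: str, full_context: str, call_llm, retrieve_top_k) -> str:
    # Step 1: cheap RAG pass. The LLM may decline to answer if the
    # retrieved chunks are insufficient.
    chunks = retrieve_top_k(query, full_context)
    rag_prompt = (
        "Answer the question based only on the provided passage. "
        f"If it is not answerable from the passage, reply '{UNANSWERABLE}'.\n"
        f"Passage: {' '.join(chunks)}\nQuestion: {query}"
    )
    answer = call_llm(rag_prompt)

    # Step 2: fall back to the full long context only when the model
    # judged the query unanswerable from the retrieved chunks.
    if UNANSWERABLE in answer.lower():
        lc_prompt = (
            "Answer the question based only on the provided passage.\n"
            f"Passage: {full_context}\nQuestion: {query}"
        )
        answer = call_llm(lc_prompt)
    return answer
```

The design mirrors the paper’s assumption: the LLM’s own “unanswerable” self-reflection is the routing signal, so no separate router model needs to be trained.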
- Failure Analysis of RAG
- The study identifies several reasons why RAG might fail to answer a query correctly:
- Multi-Step Reasoning: The query requires combining information retrieved in multiple steps.
- General Query: The query is too broad for the retriever to find specific relevant information.
- Complex Query: The query is long and complex, making it difficult for the retriever to understand.
- Implicit Query: The query requires understanding the entire context to infer the answer.
- The authors suggest that techniques like chain-of-thought prompting and query expansion could potentially address some of these failure modes; one possible form of query expansion is sketched below.
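One plausible realization of query expansion for the “general query” and “implicit query” failure modes, with `call_llm` again a hypothetical stand-in; the paper suggests the technique but does not prescribe an implementation.

```python
# Sketch of LLM-based query expansion: rewrite a broad or implicit query
# into more specific sub-queries that a retriever can match against chunks.
# `call_llm` is a hypothetical stand-in for an LLM API.

def expand_query(query: str, call_llm, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following question as {n} specific search queries, "
        f"one per line:\n{query}"
    )
    rewrites = [
        line.strip() for line in call_llm(prompt).splitlines() if line.strip()
    ]
    # Retrieve with the original query plus each rewrite.
    return [query] + rewrites[:n]
```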
- Importance of Evaluation Datasets
- The research emphasizes the importance of using real-world datasets for evaluating long-context models, as synthetic datasets can introduce artifacts that skew the comparison between RAG and LC: “results on synthetic data, which are artificially created by researchers, may [be] subject to dataset artifacts.”
- The authors also acknowledge the potential issue of data leakage, where LLMs may have been pre-trained on the evaluation datasets: “In our experiment, we try mitigating this issue by prompting the model to answer ‘based only on the provided passage’ for both RAG and LC.”
Supporting Details
- Datasets: The study uses a subset of datasets from LongBench and ∞Bench, focusing on English, real-world, and query-based tasks. Datasets include NarrativeQA, Qasper, MultiFieldQA, HotpotQA, 2WikiMultihopQA, MuSiQue, QMSum, En.QA, and En.MC.
- Models: The evaluated LLMs are Gemini-1.5-Pro, GPT-4o, and GPT-3.5-Turbo.
- Retrievers: The study uses two retrievers: Contriever and Dragon.
- Metrics: F1 scores, accuracy, and ROUGE scores are used for evaluation.
- Ablation Studies: The study ablates the number of retrieved chunks (k) to analyze the trade-off between performance and cost. Increasing k improves both RAG and SELF-ROUTE performance, but SELF-ROUTE is most effective at smaller k values. The cost of SELF-ROUTE is also not monotonic in k; it tends to reach a minimum at k = 5. (The retrieval sketch below shows where k enters the pipeline.)
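The following sketch shows the top-k chunk retrieval that the ablation varies. `embed` stands in for a dense retriever encoder such as Contriever or Dragon; a toy bag-of-words embedding keeps the example self-contained and runnable.

```python
# Top-k chunk retrieval by cosine similarity. `embed` is a stand-in for a
# dense retriever encoder (e.g. Contriever or Dragon); the bag-of-words
# embedding here is a toy substitute so the example runs on its own.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    # Larger k feeds the LLM more context, raising cost along with recall.
    return ranked[:k]
```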
Conclusion
This study provides valuable insights into the strengths and weaknesses of RAG and long-context LLMs. The SELF-ROUTE approach offers a practical solution for leveraging the benefits of both, achieving high performance at a reduced cost. The failure analysis of RAG highlights areas for future research and improvement. The work emphasizes the importance of careful dataset selection and mitigation of data leakage when evaluating LLMs.