Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

ICLR · 2024

Tags: self-rag, reflection, retrieval, critique

TL;DR

Trains LLMs to adaptively retrieve information and self-critique outputs using special reflection tokens, improving both accuracy and attribution.

Key Contribution

Self-RAG trains language models to:

  • Decide when retrieval is needed
  • Evaluate the relevance of retrieved passages
  • Assess whether generations are supported by evidence
  • Critique overall output quality

All of this is learned through special reflection tokens during training.

Reflection Tokens

Retrieve Token

Determines whether retrieval is needed:

  • `[Retrieve=Yes]`: should retrieve for this segment
  • `[Retrieve=No]`: can generate without retrieval

Relevance Token

Evaluates each retrieved passage:

  • `[Relevant]`: passage is relevant to the query
  • `[Irrelevant]`: passage is not useful

Support Token

Checks whether the generation is grounded:

  • `[Fully Supported]`: all claims are supported by the evidence
  • `[Partially Supported]`: some claims are supported
  • `[No Support]`: claims do not appear in the evidence

Utility Token

Assesses overall response quality:

  • `[Utility: 5]` (best) through `[Utility: 1]` (worst)
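
Taken together, the vocabulary is small and discrete. A minimal sketch in Python, using the token strings from this note (the released model's special tokens are spelled slightly differently):

```python
# Illustrative listing of the four reflection-token groups. In the
# trained model these are special tokens added to the tokenizer;
# the exact strings here follow this note, not the released vocabulary.
REFLECTION_TOKENS = {
    "retrieve":  ["[Retrieve=Yes]", "[Retrieve=No]"],
    "relevance": ["[Relevant]", "[Irrelevant]"],
    "support":   ["[Fully Supported]", "[Partially Supported]", "[No Support]"],
    "utility":   [f"[Utility: {u}]" for u in range(5, 0, -1)],  # 5 = best
}
```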
Architecture

Training

  • Use GPT-4 to generate reflection-token annotations, then distill them into a critic model
  • The critic labels the generator's training corpus offline
  • Train the generator with standard next-token prediction over text interleaved with reflection tokens
  • The resulting model can critique its own generations
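
As a rough illustration of what the generator sees during training (the field names and serialization below are invented for clarity, not the paper's released format):

```python
# Hypothetical shape of one training example: reflection tokens are
# interleaved with the target text so the LM learns to emit them as
# ordinary next tokens. The <paragraph> markers and field names are
# illustrative.
example = {
    "input": "Question: When was the Eiffel Tower built?",
    "target": (
        "[Retrieve=Yes] "
        "<paragraph>...retrieved passage text...</paragraph> "
        "[Relevant] "
        "Construction ran from 1887 to 1889. "
        "[Fully Supported] [Utility: 5]"
    ),
}
# Loss is standard next-token prediction over `target`; the paper
# masks the retrieved passage text out of the loss.
```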
Inference

    Input Query
      │
      ▼
    Generate [Retrieve=?]
      ├─► [Retrieve=No]  ─► Generate response
      └─► [Retrieve=Yes] ─► Retrieve passages
                            For each passage:
                              - Generate [Relevant=?]
                              - If relevant, generate response
                              - Generate [Support=?]
                              - Generate [Utility=?]
                            Select best response
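
The flow above reads as a short decoding loop. A minimal sketch, assuming stand-in `lm` and `retriever` objects with hypothetical `next_token_prob`, `generate`, and `search` methods (none of these are a real API):

```python
# Schematic Self-RAG inference. All objects and method names are
# stand-ins; the thresholds are placeholders.
def self_rag_generate(lm, retriever, query, k=5):
    # Adaptive retrieval: only retrieve if the model predicts it helps.
    if lm.next_token_prob(query, "[Retrieve=Yes]") < 0.5:
        return lm.generate(query)
    candidates = []
    for passage in retriever.search(query, k=k):
        ctx = f"{query}\n<paragraph>{passage}</paragraph>"
        # Skip passages the model judges irrelevant.
        if lm.next_token_prob(ctx, "[Relevant]") < 0.5:
            continue
        response = lm.generate(ctx)
        # Rank candidates by the support critique (see the weighted
        # scoring sketch under Self-Critique below).
        score = lm.next_token_prob(ctx + response, "[Fully Supported]")
        candidates.append((score, response))
    if not candidates:
        return lm.generate(query)  # fall back to non-retrieval generation
    return max(candidates, key=lambda c: c[0])[1]
```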

Key Innovations

Adaptive Retrieval

The model learns when retrieval helps:

  • Skips retrieval for simple factual recall
  • Retrieves for complex or uncertain queries
  • Saves computation when retrieval is unnecessary (the trigger rule is sketched below)
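
The paper implements this as a soft constraint: retrieval fires when the probability mass on the retrieve token, normalized over the retrieve/no-retrieve pair, exceeds a tunable threshold. A sketch (token names follow this note):

```python
def should_retrieve(p_yes: float, p_no: float, delta: float = 0.2) -> bool:
    """Trigger retrieval when the normalized probability of
    [Retrieve=Yes] exceeds threshold delta. Raising delta makes the
    model retrieve less often; the default here is arbitrary."""
    return p_yes / (p_yes + p_no) > delta
```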
Self-Critique

The model evaluates its own outputs:

  • Catches unsupported claims
  • Identifies irrelevant retrievals
  • Ranks multiple candidate responses (scoring sketched below)
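
Ranking follows the paper's critique-guided decoding: each candidate segment's LM log-probability is combined with a weighted sum of normalized critique-token probabilities. A minimal sketch, with hypothetical group names and weight values:

```python
# Segment score = LM log-probability + weighted critique scores.
# Group names and weight values are illustrative, not the paper's.
def segment_score(logprob: float, critique: dict, weights: dict) -> float:
    """`critique` maps each group (e.g. 'support') to the normalized
    probability of its most desirable token; `weights` sets how much
    each critique group influences ranking."""
    return logprob + sum(weights[g] * p for g, p in critique.items())

score = segment_score(
    logprob=-4.2,
    critique={"relevance": 0.9, "support": 0.7, "utility": 0.6},
    weights={"relevance": 1.0, "support": 1.0, "utility": 0.5},
)
```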
Controllable Generation

At inference time, without retraining, one can adjust:

  • Retrieval frequency
  • Support threshold
  • Utility requirements
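
These knobs amount to a handful of scalars. One way to package them (names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class DecodeConfig:
    # Illustrative inference-time knobs; the same trained model
    # behaves differently as these change, with no retraining.
    retrieve_threshold: float = 0.2  # lower => retrieve more often
    support_weight: float = 1.0      # penalize unsupported segments harder
    utility_weight: float = 0.5      # weight of the overall-quality critique
```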
Evaluation

Benchmarks

  • Open-domain QA (PopQA, TriviaQA)
  • Closed-set tasks: fact verification (PubHealth) and multiple-choice reasoning (ARC-Challenge)
  • Long-form generation (ASQA, biography generation)
Results

  • Outperforms standard RAG baselines
  • Better citation/attribution accuracy
  • Reduced hallucination
  • More efficient (fewer unnecessary retrievals)
Relevance to Agent Memory

Memory Retrieval Decisions

Like Self-RAG, agents should (see the sketch below):

  • Decide when to access memory
  • Evaluate whether retrieved memories are relevant
  • Check that responses are grounded in memory
  • Self-assess output quality
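
A hypothetical translation of the loop into an agent's memory layer; every object and method here is invented to illustrate the analogy:

```python
# Hypothetical agent-memory analogue of Self-RAG. The `agent` and
# `memory` objects and all their methods are illustrative stand-ins.
def answer_with_memory(agent, memory, query):
    if not agent.needs_memory(query):            # analogue of [Retrieve=?]
        return agent.respond(query)
    memories = [m for m in memory.search(query)
                if agent.is_relevant(query, m)]  # analogue of [Relevant]
    draft = agent.respond(query, context=memories)
    if not agent.is_grounded(draft, memories):   # analogue of [Support=?]
        return agent.respond(query)              # fall back rather than assert
    return draft
```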
Trust Calibration

Reflection tokens enable:

  • Knowing when the agent is uncertain
  • Identifying when memory is insufficient
  • Flagging potential hallucinations (one possible rule is sketched below)
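
Because the critique tokens come with probabilities, they can double as a confidence signal. A toy calibration rule, with arbitrary placeholder thresholds:

```python
def confidence_flag(p_fully_supported: float, p_no_support: float) -> str:
    # Arbitrary placeholder thresholds: map the support-token
    # distribution to a coarse trust label for downstream handling.
    if p_no_support > 0.5:
        return "possible hallucination"
    return "grounded" if p_fully_supported > 0.7 else "uncertain"
```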
Implementation Considerations

Training Requirements

  • Needs reflection-token annotations
  • Significant training compute
  • Annotation quality matters

Inference Trade-offs

  • Multiple generation passes increase latency
  • Critique adds compute cost
  • But quality and attribution improve
Limitations

  • Requires model fine-tuning
  • Reflection quality depends on the training data
  • May over- or under-retrieve in edge cases
  • Binary relevance judgments are simplistic

Citation

```bibtex
@inproceedings{asai2024selfrag,
  title={Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection},
  author={Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh},
  booktitle={International Conference on Learning Representations},
  year={2024}
}
```