## Key Contribution

Self-RAG trains language models to:

- Decide when retrieval is needed
- Evaluate the relevance of retrieved passages
- Assess whether generations are supported by the retrieved evidence
- Critique the quality of their own output

All of this is handled through special reflection tokens learned during training.
## Reflection Tokens

### Retrieve Token

Determines whether retrieval is needed:

- `[Retrieve=Yes]`: retrieve before generating this segment
- `[Retrieve=No]`: generate without retrieval

### Relevance Token

Evaluates each retrieved passage:

- `[Relevant]`: the passage is relevant to the query
- `[Irrelevant]`: the passage is not useful

### Support Token

Checks whether the generation is grounded in the retrieved evidence:

- `[Fully Supported]`: all claims are supported by the evidence
- `[Partially Supported]`: some claims are supported
- `[No Support]`: the claims do not appear in the evidence

### Utility Token

Rates overall response quality on a five-point scale, from `[Utility: 1]` (lowest) to `[Utility: 5]` (highest).
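The four token families can be pictured as a small vocabulary added to the model. Below is a minimal Python sketch; the class names and the `REFLECTION_TOKENS` list are illustrative conveniences, not the paper's released vocabulary definition.

```python
from enum import Enum

class Retrieve(Enum):
    YES = "[Retrieve=Yes]"
    NO = "[Retrieve=No]"

class Relevance(Enum):
    RELEVANT = "[Relevant]"
    IRRELEVANT = "[Irrelevant]"

class Support(Enum):
    FULL = "[Fully Supported]"
    PARTIAL = "[Partially Supported]"
    NONE = "[No Support]"

# Utility is an ordinal 1-5 rating rather than a binary judgment.
UTILITY_TOKENS = [f"[Utility: {k}]" for k in range(1, 6)]

# All reflection tokens are added to the tokenizer vocabulary so the
# generator can emit them inline with ordinary text.
REFLECTION_TOKENS = (
    [t.value for t in Retrieve]
    + [t.value for t in Relevance]
    + [t.value for t in Support]
    + UTILITY_TOKENS
)
```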
## Architecture

### Training

- Use GPT-4 to generate reflection-token annotations, which are distilled into a critic model
- The critic labels the training corpus offline with reflection tokens
- The generator LM is trained with a standard next-token objective over text augmented with those reflection tokens
- The resulting model can critique its own generations at inference time without a separate critic
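One way to picture a training example: reflection tokens are interleaved with the original instruction data, and the generator simply learns to predict the whole augmented sequence. The helper and formatting below (including the `<paragraph>` tags) are an illustrative reconstruction, not the paper's exact prompt format.

```python
def build_training_sequence(query, passage, answer,
                            retrieve_label, relevance_label,
                            support_label, utility_label):
    """Assemble one augmented training example.

    The labels are produced offline by the critic model; the generator
    then learns to emit them as ordinary tokens via next-token prediction.
    """
    parts = [query, retrieve_label]
    if retrieve_label == "[Retrieve=Yes]":
        parts += [f"<paragraph>{passage}</paragraph>", relevance_label]
    parts += [answer, support_label, utility_label]
    return " ".join(parts)

example = build_training_sequence(
    query="Who wrote 'The Left Hand of Darkness'?",
    passage="The Left Hand of Darkness is a 1969 novel by Ursula K. Le Guin.",
    answer="Ursula K. Le Guin wrote 'The Left Hand of Darkness'.",
    retrieve_label="[Retrieve=Yes]",
    relevance_label="[Relevant]",
    support_label="[Fully Supported]",
    utility_label="[Utility: 5]",
)
```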
### Inference

    Input Query
         │
         ▼
    Generate [Retrieve=?]
         │
         ├─► [Retrieve=No]  ─► Generate response
         │
         └─► [Retrieve=Yes] ─► Retrieve passages
                                   │
                                   ▼
                               For each passage:
                                 - Generate [Relevant=?]
                                 - If relevant, generate response
                                 - Generate [Support=?]
                                 - Generate [Utility=?]
                                   │
                                   ▼
                               Select best response
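In code, the loop above might look like the following sketch. `lm`, `retriever`, and the probability helpers (`predict_token`, `token_prob`, `expected_utility`) are assumed interfaces for illustration, not the authors' released API, and the scoring is deliberately simplified.

```python
def self_rag_generate(query, lm, retriever, top_k=5):
    """Sketch of Self-RAG inference: retrieve adaptively, critique each
    candidate continuation, and keep the best-scoring one."""
    # 1. Ask the model whether retrieval is needed for this segment.
    decision = lm.predict_token(query, choices=["[Retrieve=Yes]", "[Retrieve=No]"])
    if decision == "[Retrieve=No]":
        return lm.generate(query)

    # 2. Retrieve passages and generate one candidate per relevant passage.
    candidates = []
    for passage in retriever.search(query, k=top_k):
        relevance = lm.predict_token(query, passage,
                                     choices=["[Relevant]", "[Irrelevant]"])
        if relevance == "[Irrelevant]":
            continue
        response = lm.generate(query, passage)
        # 3. Critique the candidate: groundedness and overall utility.
        support = lm.token_prob(query, passage, response, "[Fully Supported]")
        utility = lm.expected_utility(query, response)  # expectation over [Utility: 1..5]
        candidates.append((support + utility, response))

    # 4. Select the best-scoring candidate; fall back to plain generation.
    if not candidates:
        return lm.generate(query)
    return max(candidates, key=lambda c: c[0])[1]
```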
## Key Innovations

### Adaptive Retrieval

The model learns when retrieval actually helps (see the sketch after this list):

- Skips retrieval for queries it can answer from parametric knowledge
- Retrieves for knowledge-intensive or uncertain queries
- Saves retrieval and generation compute when retrieval is not needed
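Adaptive retrieval can be implemented as a threshold on the probability the model assigns to the retrieve token. A minimal sketch; `token_prob` and the default threshold value are assumptions, not fixed values from the paper.

```python
def should_retrieve(lm, query, threshold=0.5):
    """Retrieve only when the model's own confidence that retrieval
    helps exceeds a tunable threshold.

    Raising `threshold` makes the system retrieve less often (cheaper,
    more reliant on parametric knowledge); lowering it retrieves more
    aggressively.
    """
    p_yes = lm.token_prob(query, "[Retrieve=Yes]")
    return p_yes >= threshold
```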
### Self-Critique

The model evaluates its own outputs:

- Catches claims not supported by the retrieved evidence
- Identifies irrelevant retrievals
- Ranks multiple candidate responses by their critique scores
### Controllable Generation

At inference time, the critique can be re-weighted without retraining (see the sketch below), adjusting:

- Retrieval frequency
- Support (grounding) threshold
- Utility requirements
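Controllability comes from the critique being decomposed into separate token probabilities, which can be recombined with different weights at decoding time. The weights, field names, and scoring formula below are illustrative assumptions rather than the paper's exact decoding equation.

```python
from dataclasses import dataclass

@dataclass
class CritiqueWeights:
    relevance: float = 1.0   # weight on the [Relevant] probability
    support: float = 1.0     # weight on the [Fully Supported] probability
    utility: float = 0.5     # weight on the normalized utility score

def score_candidate(probs: dict, w: CritiqueWeights) -> float:
    """Combine reflection-token probabilities into one segment score.

    `probs` holds p_relevant, p_supported, and utility_norm (expected
    utility rescaled to [0, 1]). Increasing w.support enforces stricter
    grounding; increasing w.relevance filters retrievals more harshly.
    """
    return (w.relevance * probs["p_relevant"]
            + w.support * probs["p_supported"]
            + w.utility * probs["utility_norm"])
```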
## Evaluation

### Benchmarks

- Short-form open-domain QA (PopQA, TriviaQA-unfiltered)
- Closed-set tasks: fact verification (PubHealth) and multiple-choice reasoning (ARC-Challenge)
- Long-form generation (ASQA, biography generation)

### Results

- Outperforms standard RAG baselines
- Better citation/attribution accuracy
- Reduced hallucination
- More efficient (fewer unnecessary retrievals)
## Relevance to Agent Memory

### Memory Retrieval Decisions

Like Self-RAG, agents should (see the sketch after this list):

- Decide when to access memory
- Evaluate whether retrieved memories are relevant
- Check that responses are grounded in the retrieved memories
- Self-assess output quality
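Transplanted to an agent-memory setting, the same four decisions map onto a memory controller. A hedged sketch: `memory_store`, `agent_lm`, and every method name below are assumptions chosen for illustration, not an existing agent framework's API.

```python
def answer_with_memory(query, agent_lm, memory_store, k=3):
    """Self-RAG-style control flow over an agent's memory store."""
    # Decide whether a memory lookup is worth the cost for this query.
    if not agent_lm.wants_retrieval(query):
        return agent_lm.respond(query), {"memory_used": False}

    candidates = []
    for memory in memory_store.search(query, k=k):
        # Filter memories the model itself judges irrelevant.
        if not agent_lm.judges_relevant(query, memory):
            continue
        draft = agent_lm.respond(query, context=memory)
        # Check that the draft is actually grounded in the memory.
        grounded = agent_lm.judges_supported(draft, memory)
        quality = agent_lm.rates_utility(query, draft)
        candidates.append((grounded, quality, draft, memory))

    if not candidates:
        return agent_lm.respond(query), {"memory_used": False}
    grounded, quality, draft, memory = max(candidates, key=lambda c: (c[0], c[1]))
    return draft, {"memory_used": True, "grounded": grounded, "source": memory}
```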
### Trust Calibration

Reflection tokens enable:

- Knowing when the agent is uncertain
- Identifying when memory is insufficient
- Flagging potential hallucinations (see the sketch below)
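The same critique scores double as a calibration signal: low support plus low utility suggests the output should be flagged rather than trusted. A minimal sketch, assuming the support probability and expected utility are available from the generation step; the floor values are arbitrary placeholders.

```python
def trust_flags(p_supported: float, expected_utility: float,
                support_floor: float = 0.6, utility_floor: float = 3.0) -> dict:
    """Turn reflection scores into simple trust flags for downstream use."""
    return {
        "possible_hallucination": p_supported < support_floor,
        "low_quality": expected_utility < utility_floor,
        "needs_more_memory": p_supported < support_floor and expected_utility < utility_floor,
    }
```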
## Implementation Considerations

### Training Requirements

- Requires reflection-token annotations for the training corpus
- Significant training compute
- Annotation quality directly affects the reliability of the learned critique

### Inference Trade-offs

- Generating and critiquing a candidate per retrieved passage increases latency
- The critique step adds compute cost
- But quality and attribution improve
## Limitations

- Requires fine-tuning the base model
- Reflection quality depends on the training annotations
- May over- or under-retrieve in edge cases
- Binary relevance judgments are simplistic
## Citation

    @inproceedings{asai2024selfrag,
      title={Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection},
      author={Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh},
      booktitle={International Conference on Learning Representations},
      year={2024}
    }