Key Contribution
RETRO (Retrieval-Enhanced Transformer) demonstrates that retrieval can substitute for model scale: a 7.5B-parameter model with retrieval matches the language-modeling performance of models roughly 25x larger (GPT-3 at 175B) that have no retrieval, suggesting that an explicit, queryable memory is a more efficient path to capability than raw parameter count.
Architecture
Chunked Cross-Attention
The input sequence is split into consecutive chunks of 64 tokens. For each chunk, the k nearest neighbors (each a 64-token chunk plus its 64-token continuation) are retrieved from the database and made available to the decoder through cross-attention, as sketched below.
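A minimal sketch of this per-chunk retrieval loop, using a mean-pooled embedding table in place of the frozen BERT embedder and brute-force cosine similarity in place of SCaNN; all names (`embed_chunk`, `retrieve_neighbors`, the toy database) are illustrative, not from the paper's codebase.

```python
import numpy as np

CHUNK_LEN = 64          # RETRO chunk size (tokens)
K = 2                   # neighbors retrieved per chunk
DIM = 128               # width of the (frozen) chunk embedder -- toy value
VOCAB = 1000

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB, DIM))  # stand-in for frozen BERT

def embed_chunk(chunk_tokens: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings -- a stand-in for the frozen BERT embedder."""
    vec = embedding_table[chunk_tokens].mean(axis=0)
    return vec / (np.linalg.norm(vec) + 1e-8)

def split_into_chunks(tokens: np.ndarray) -> list[np.ndarray]:
    """Split a token sequence into consecutive 64-token chunks."""
    return [tokens[i:i + CHUNK_LEN] for i in range(0, len(tokens), CHUNK_LEN)]

def retrieve_neighbors(chunk: np.ndarray, db_keys: np.ndarray, db_values: list[np.ndarray]):
    """Brute-force nearest-neighbor lookup (the paper uses SCaNN)."""
    query = embed_chunk(chunk)
    scores = db_keys @ query                 # cosine similarity (keys are unit-norm)
    top = np.argsort(-scores)[:K]
    return [db_values[i] for i in top]

# Toy database: each value is a [neighbor chunk ++ continuation] of 128 tokens.
db_values = [rng.integers(0, VOCAB, size=2 * CHUNK_LEN) for _ in range(500)]
db_keys = np.stack([embed_chunk(v[:CHUNK_LEN]) for v in db_values])

sequence = rng.integers(0, VOCAB, size=256)   # 4 chunks of 64 tokens
for u, chunk in enumerate(split_into_chunks(sequence)):
    neighbors = retrieve_neighbors(chunk, db_keys, db_values)
    print(f"chunk {u}: retrieved {len(neighbors)} neighbors of {len(neighbors[0])} tokens")
```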
Retrieval Database
The retrieval database is built offline from roughly 2 trillion tokens. Keys are frozen BERT embeddings of 64-token chunks, values are the chunk together with its continuation, and nearest neighbors are found with approximate nearest-neighbor search (SCaNN). Nothing in the index is learned, so it never has to be rebuilt during training.
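Rough back-of-the-envelope arithmetic on the index size, assuming 64-token chunks and BERT-base-width (768-d) keys stored in fp16; the constants are assumptions for intuition, not figures reported in the paper.

```python
# Approximate scale of a RETRO-style retrieval index.
TOKENS_IN_DB = 2_000_000_000_000   # ~2 trillion tokens
CHUNK_LEN = 64                     # tokens per chunk
KEY_DIM = 768                      # assumed key width (BERT-base)
BYTES_PER_FLOAT16 = 2

n_chunks = TOKENS_IN_DB // CHUNK_LEN                      # ~31 billion keys
key_storage_bytes = n_chunks * KEY_DIM * BYTES_PER_FLOAT16

print(f"chunks (keys) in the index : {n_chunks:,}")
print(f"raw key storage (fp16)     : {key_storage_bytes / 1e12:.0f} TB")
```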
Encoder-Decoder Fusion
Retrieved neighbors are processed by a bidirectional Transformer encoder, and the decoder attends to the encoder outputs through chunked cross-attention (CCA) layers interleaved with its standard self-attention layers.
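A toy single-head version of the fusion step, showing one decoder chunk attending to the encoded activations of its k retrieved neighbors; real RETRO uses multi-head attention, residual connections, and the chunk offset described under Architectural Details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_cross_attention(chunk_hidden, neighbor_hidden, Wq, Wk, Wv):
    """Decoder tokens of one chunk attend to the encoded retrieved neighbors.

    chunk_hidden    : (chunk_len, d)         decoder hidden states for one chunk
    neighbor_hidden : (k * retrieved_len, d) encoder outputs for its k neighbors
    """
    q = chunk_hidden @ Wq
    k = neighbor_hidden @ Wk
    v = neighbor_hidden @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product attention
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 32
chunk_hidden = rng.standard_normal((64, d))          # one 64-token chunk
neighbor_hidden = rng.standard_normal((2 * 128, d))  # k=2 neighbors, 128 tokens each
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = chunked_cross_attention(chunk_hidden, neighbor_hidden, Wq, Wk, Wv)
print(out.shape)   # (64, 32) -- same shape as the chunk, added residually in RETRO
```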
Training
Pre-training with Retrieval
RETRO is trained from scratch with retrieval active: neighbors for the training data are precomputed offline, so every training sequence already carries its retrieved chunks. The paper also shows that an existing baseline can be "RETROfitted" by freezing its weights and training only the neighbor encoder and cross-attention layers.
Frozen Retriever
Unlike REALM, which trains its retriever end-to-end and must periodically re-embed the index, RETRO keeps the retriever frozen: chunk embeddings come from a pre-trained BERT that is never updated, so the index is computed once and reused throughout training.
Computational Efficiency
Retrieval adds only modest overhead: the index is queried once per 64-token chunk rather than once per token, neighbors for the training corpus are precomputed offline, and the neighbor encoder is small relative to the decoder.
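A worked count of lookups under assumed settings (a 2048-token training sequence, chosen here only for illustration), to make the per-chunk vs. per-token difference concrete.

```python
# Retrieval happens once per 64-token chunk, not once per token.
SEQ_LEN = 2048          # example training sequence length (assumed)
CHUNK_LEN = 64

per_chunk_lookups = SEQ_LEN // CHUNK_LEN   # 32 index queries per sequence
per_token_lookups = SEQ_LEN                # what a per-token scheme would need

print(f"per-chunk retrieval: {per_chunk_lookups} lookups per sequence "
      f"({per_token_lookups // per_chunk_lookups}x fewer than per-token)")
```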
Results
Scaling Comparison
| Model | Parameters | Perplexity |
|-------|------------|------------|
| Baseline | 7.5B | 2.96 |
| RETRO | 7.5B | 2.47 |
| Baseline | 175B | ~2.5 |
RETRO 7.5B matches much larger models!
Knowledge-Intensive Tasks
Strong performance on language modeling over Wikitext103 and The Pile, and competitive (though not state-of-the-art) results on open-domain question answering (Natural Questions) after fine-tuning.
When Retrieval Helps Most
Gains are largest when the evaluation text is well covered by the retrieval database (fact-heavy, encyclopedic text, or text with partial overlap with the training distribution) and smaller on text that demands reasoning rather than recall. The paper explicitly measures how much of the improvement comes from near-duplicate overlap between test data and the database.
Architectural Details
Chunked Attention
Chunked cross-attention (CCA) layers are interleaved with ordinary decoder layers; in the paper they appear every third layer, starting at layer 6.
For a chunk C_u, its retrieved neighbors can be attended to only by the last token of C_u and by the tokens of the following chunk C_{u+1}. This one-chunk offset keeps the model autoregressive: no token ever attends to neighbors retrieved using tokens to its right. A small sketch of the rule follows.
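A small helper, assuming the 64-token chunking above, that reproduces this offset rule; `neighbor_chunk_for` is an illustrative name, not an API from the paper.

```python
CHUNK_LEN = 64

def neighbor_chunk_for(position: int) -> int | None:
    """Index of the chunk whose retrieved neighbors this token may attend to.

    To keep the model autoregressive, a token only sees neighbors retrieved
    using context entirely to its left: positions are shifted back by
    CHUNK_LEN - 1 before picking the source chunk.
    """
    shifted = position - (CHUNK_LEN - 1)
    if shifted < 0:
        return None          # tokens early in the first chunk attend to no neighbors
    return shifted // CHUNK_LEN

# Position 63 (last token of chunk 0) is the first to see chunk 0's neighbors;
# positions 64..126 (most of chunk 1) still attend to chunk 0's neighbors;
# position 127 (last token of chunk 1) moves on to chunk 1's neighbors.
for pos in (0, 62, 63, 64, 126, 127, 128):
    print(pos, "->", neighbor_chunk_for(pos))
```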
Neighbor Encoding
A single bidirectional encoder is shared across all chunks and all k neighbors: each retrieved [neighbor, continuation] pair is encoded independently, and the resulting activations are what the decoder's CCA layers attend to. Sharing the encoder means cost grows with the number of retrieved chunks, not with extra parameters.
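A sketch of why sharing the encoder is cheap: every (chunk, neighbor) pair is flattened into one batch and pushed through the same stand-in encoder. The tanh projection is a placeholder for the real bidirectional Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 32, 1000
embed = rng.standard_normal((VOCAB, D))
W_enc = rng.standard_normal((D, D))

def encode(neighbor_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the shared bidirectional neighbor encoder.

    neighbor_tokens: (batch, retrieved_len) -- every neighbor of every chunk
    goes through the same weights, so adding chunks or neighbors only grows
    the batch dimension, not the parameter count.
    """
    return np.tanh(embed[neighbor_tokens] @ W_enc)     # (batch, retrieved_len, D)

n_chunks, k, retrieved_len = 4, 2, 128
neighbors = rng.integers(0, VOCAB, size=(n_chunks, k, retrieved_len))

# Flatten (chunk, neighbor) into one batch, encode once, reshape back.
encoded = encode(neighbors.reshape(n_chunks * k, retrieved_len))
encoded = encoded.reshape(n_chunks, k, retrieved_len, D)
print(encoded.shape)   # (4, 2, 128, 32)
```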
Implications for Agent Memory
Memory at Scale
RETRO validates retrieval-augmented approaches: an explicit, queryable store can stand in for parameters, and capability can be scaled by growing the database rather than retraining a larger model.
Memory Architecture Design
Lessons for agents: retrieve at a fixed chunk granularity rather than over whole documents, keep the embedder frozen so the index stays valid as memory grows, and precompute embeddings so lookups are cheap at interaction time.
Practical Considerations
Building agent memory systems:
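A loose, RETRO-flavored sketch of such a memory (not the paper's system and not any particular agent framework): fixed-size text chunks, a frozen stand-in embedder, an append-only index, and top-k retrieval per query. The hashed bag-of-words embedder is a toy placeholder for a real sentence encoder.

```python
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedder (hashed bag of words) standing in for a
    real frozen encoder such as a sentence-embedding model."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ChunkMemory:
    """RETRO-style memory for an agent: fixed-size text chunks, a frozen
    embedder, an append-only index, top-k retrieval per query."""

    def __init__(self, chunk_words: int = 64, k: int = 2):
        self.chunk_words, self.k = chunk_words, k
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def add(self, text: str) -> None:
        words = text.split()
        for i in range(0, len(words), self.chunk_words):
            chunk = " ".join(words[i:i + self.chunk_words])
            self.keys.append(embed(chunk))
            self.values.append(chunk)

    def retrieve(self, query: str) -> list[str]:
        if not self.keys:
            return []
        scores = np.stack(self.keys) @ embed(query)
        top = np.argsort(-scores)[: self.k]
        return [self.values[i] for i in top]

memory = ChunkMemory()
memory.add("The deployment runbook says to roll back by re-tagging the previous image.")
memory.add("User prefers answers with code examples and no emojis.")
print(memory.retrieve("how do I roll back the deployment?"))
```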
Comparison to Other Work
vs. RAG: RAG retrieves documents conditioned on the input query and fine-tunes a pre-trained seq2seq model to condition on them; RETRO retrieves per 64-token chunk, fuses neighbors through cross-attention inside the decoder, and is pre-trained with retrieval from the start.
vs. REALM: REALM learns its retriever jointly with the language model and must refresh the index as the retriever changes; RETRO's retriever is frozen, trading adaptivity for the ability to scale the database to trillions of tokens.
vs. MemGPT: MemGPT manages memory at the prompt level, with the LLM itself paging information between its context window and external storage; RETRO builds retrieval into the architecture via cross-attention, so memory access is learned rather than orchestrated.
Limitations
The approach requires a very large pre-computed index that is expensive to store and serve; the frozen retriever cannot adapt to the downstream task; part of the measured gain comes from overlap between evaluation data and the database; and retrieval mainly improves factual recall rather than reasoning.
Citation
@inproceedings{borgeaud2022improving,
title={Improving Language Models by Retrieving from Trillions of Tokens},
author={Borgeaud, Sebastian and others},
booktitle={International Conference on Machine Learning},
year={2022}
}