Improving Language Models by Retrieving from Trillions of Tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre

ICML · 2022

retro · retrieval · scaling · efficient

TL;DR

Shows that retrieving from a massive corpus (2 trillion tokens) lets a language model match the performance of models 25× larger, demonstrating retrieval as an efficient alternative to scaling parameters.

Key Contribution

RETRO (Retrieval-Enhanced Transformer) demonstrates that retrieval can substitute for model scale: a 7.5B-parameter model with retrieval matches 175B+ parameter models without it, suggesting that memory and retrieval are a more efficient path to capability than raw parameter count.

Architecture

Chunked Cross-Attention

Process text in fixed-size chunks and retrieve for each (sketched below):

  • Split input into chunks (64 tokens)
  • Retrieve neighbors for each chunk
  • Cross-attend to retrieved content
  • Standard self-attention otherwise
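
A minimal sketch of this per-chunk flow, where `embed` and `index` are assumed stand-ins for the paper's frozen BERT embedder and nearest-neighbor index:

```python
# Sketch of RETRO-style per-chunk retrieval (illustrative, not the paper's code).
CHUNK_SIZE = 64  # tokens per chunk, as in the paper

def split_into_chunks(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def retrieve_for_sequence(tokens, embed, index, k=2):
    """For each chunk, fetch its k approximate nearest neighbors."""
    return [index.search(embed(chunk), k) for chunk in split_into_chunks(tokens)]
```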

Retrieval Database

Massive pre-computed index (index construction sketched below):

  • 2 trillion tokens of web text (MassiveText)
  • Chunked and embedded with frozen BERT embeddings
  • Approximate nearest neighbor search (SCaNN)
  • ~28B chunks indexed
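
The paper builds its index with SCaNN; a comparable small-scale index can be sketched with faiss, with random vectors standing in for real chunk embeddings:

```python
import numpy as np
import faiss  # used here as a stand-in for the SCaNN index from the paper

d, n_chunks = 768, 100_000  # toy scale; RETRO indexes ~28B chunks

# In RETRO these would be frozen BERT embeddings of 64-token chunks.
embeddings = np.random.rand(n_chunks, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # inverted-file approximate index
index.train(embeddings)                         # learn the coarse clustering
index.add(embeddings)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 2)         # 2 approximate nearest neighbors
```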

Encoder-Decoder Fusion

Retrieved neighbors are processed by an encoder:

  • Encode retrieved chunks
  • Cross-attention in decoder layers
  • Information flows from retrieval to generation

Training

Pre-training with Retrieval

  • Retrieval active during training
  • Model learns to use retrieved context
  • Same retrieval pipeline used at train and test time

Frozen Retriever

Unlike REALM:

  • Retriever (frozen, pre-trained BERT embeddings) not trained end-to-end
  • Reduces complexity
  • Still highly effective (frozen-embedding sketch below)
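
As an illustration of what "frozen" means here, a sketch of a time-averaged BERT embedder in the spirit of the paper's retriever (the model name and pooling details are assumptions, not the exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Frozen retriever sketch: a pre-trained BERT that is never fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()  # no gradients ever flow into the retriever
def embed_chunk(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    hidden = encoder(**inputs).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)          # time-averaged chunk embedding
```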

Computational Efficiency

Retrieval adds minimal overhead:

  • One-time index building
  • Fast approximate NN search
  • Chunked cross-attention added only at a subset of layers

Results

Scaling Comparison

| Model    | Parameters | Perplexity |
|----------|------------|------------|
| Baseline | 7.5B       | 2.96       |
| RETRO    | 7.5B       | 2.47       |
| Baseline | 175B       | ~2.5       |

RETRO 7.5B matches much larger models!

Knowledge-Intensive Tasks

Strong performance on:

  • Question answering
  • Fact verification
  • Knowledge probing

When Retrieval Helps Most

  • Rare/long-tail knowledge
  • Recent information
  • Factual queries
  • Less help for reasoning tasks

Architectural Details

Chunked Attention

Every few layers (every third layer from layer 6 in the paper), a chunked cross-attention block is inserted:

For chunk C at position i:

  • Retrieve k neighbors N_1...N_k
  • Encode neighbors: E = Encoder(N_1...N_k)
  • Cross-attend: Attend(C, E)
  • Continue with self-attention (single-head sketch below)
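
A single-head sketch of that cross-attention step (the paper's version is multi-headed and shifts queries so a token only attends to neighbors of preceding chunks; both details are omitted here):

```python
import torch
import torch.nn.functional as F

def chunked_cross_attention(chunk_h, neighbor_h, w_q, w_k, w_v):
    """Cross-attend one chunk's hidden states to its encoded neighbors.

    chunk_h:       [m, d]     hidden states of the m=64 tokens in chunk C
    neighbor_h:    [k, r, d]  encoder outputs E for the k retrieved neighbors
    w_q, w_k, w_v: [d, d]     projection matrices
    """
    k, r, d = neighbor_h.shape
    kv = neighbor_h.reshape(k * r, d)                   # attend over all neighbor tokens
    scores = (chunk_h @ w_q) @ (kv @ w_k).T / d**0.5    # [m, k*r] scaled dot-product
    return F.softmax(scores, dim=-1) @ (kv @ w_v)       # [m, d] fused retrieval info
```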

Neighbor Encoding

A shared encoder processes all neighbors (batching sketch below):

  • Efficient batch processing
  • Captures neighbor content
  • Enables comparison across neighbors
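
A sketch of that batching trick, assuming `encoder` maps token ids to hidden states (in the paper the encoder is additionally conditioned on the query chunk, omitted here):

```python
import torch

def encode_neighbors(encoder, neighbor_tokens):
    """Run the shared encoder once over every retrieved neighbor.

    neighbor_tokens: [batch, n_chunks, k, r] token ids, k neighbors per chunk
    """
    b, n, k, r = neighbor_tokens.shape
    flat = neighbor_tokens.reshape(b * n * k, r)  # fold all neighbors into one batch
    hidden = encoder(flat)                        # one forward pass: [b*n*k, r, d]
    return hidden.reshape(b, n, k, r, -1)         # restore per-chunk structure
```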

Implications for Agent Memory

Memory at Scale

RETRO validates retrieval-augmented approaches:

  • Retrieval can replace parameters
  • Massive memory stores are practical
  • Efficient at inference time

Memory Architecture Design

Lessons for agents:

  • Chunk-based storage works well
  • Cross-attention effective for fusion
  • Retrieval per-segment, not per-token
  • Frozen retriever is sufficient

Practical Considerations

Building agent memory systems (toy sketch below):

  • Index user interactions like RETRO indexes web
  • Retrieve relevant past for each turn
  • Cross-attend to memory content
  • Scale memory, not model
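
A toy, hypothetical illustration of these points (not from the paper), where `embed_fn` is any frozen text embedder returning unit-norm vectors:

```python
import numpy as np

class AgentMemory:
    """Hypothetical RETRO-style agent memory: index interactions, retrieve per turn."""

    def __init__(self, embed_fn, k=3):
        self.embed = embed_fn             # frozen embedder (unit-norm outputs assumed)
        self.k = k
        self.texts, self.vectors = [], []

    def add(self, text):
        """Store an interaction chunk, as RETRO stores web chunks."""
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def retrieve(self, query):
        """Return the k most similar stored chunks for the current turn."""
        if not self.texts:
            return []
        sims = np.stack(self.vectors) @ self.embed(query)  # cosine via dot product
        top = np.argsort(sims)[::-1][: self.k]
        return [self.texts[i] for i in top]
```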

Comparison to Other Work

vs. RAG

  • RETRO: retrieval integrated into the model architecture
  • RAG: retrieved passages supplied as input context
  • RETRO more parameter-efficient

vs. REALM

  • REALM: end-to-end retriever training
  • RETRO: frozen retriever
  • Both effective, RETRO simpler

vs. MemGPT

  • MemGPT: LLM controls memory
  • RETRO: automatic retrieval
  • Different levels of control

Limitations

  • Large index required
  • Retrieval latency at inference
  • Less interpretable than prompt-based retrieval
  • Requires architecture modification

Citation

```bibtex
@inproceedings{borgeaud2022improving,
  title={Improving Language Models by Retrieving from Trillions of Tokens},
  author={Borgeaud, Sebastian and others},
  booktitle={International Conference on Machine Learning},
  year={2022}
}
```