Improving Language Models by Retrieving from Trillions of Tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre

ICML · 2022

retro · retrieval · scaling · efficient

TL;DR

Shows that retrieving from a massive corpus (2 trillion tokens) lets a language model match the performance of models 25× larger, demonstrating retrieval as an efficient alternative to scaling parameters.

Key Contribution

RETRO (Retrieval-Enhanced Transformer) demonstrates that retrieval can substitute for model scale: a 7.5B-parameter model with retrieval matches 175B+ parameter models without it, suggesting that memory and retrieval are a more efficient path to capability than raw parameter count.

Architecture

Chunked Cross-Attention

Process text in fixed-size chunks and retrieve for each (sketched below):

  • Split input into chunks (64 tokens)
  • Retrieve neighbors for each chunk
  • Cross-attend to retrieved content
  • Standard self-attention otherwise
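
A minimal sketch of this per-chunk flow, where `embed` and `index` are assumed stand-ins for the paper's frozen BERT embedder and nearest-neighbor index:

```python
# Sketch of RETRO-style per-chunk retrieval (illustrative, not the paper's code).
CHUNK_SIZE = 64  # tokens per chunk, as in the paper

def split_into_chunks(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def retrieve_for_sequence(tokens, embed, index, k=2):
    """For each chunk, fetch its k approximate nearest neighbors."""
    return [index.search(embed(chunk), k) for chunk in split_into_chunks(tokens)]
```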

Retrieval Database

Massive pre-computed index (index construction sketched below):

  • 2 trillion tokens of web text (MassiveText)
  • Chunked and embedded with frozen BERT embeddings
  • Approximate nearest neighbor search (SCaNN)
  • ~28B chunks indexed
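
The paper builds its index with SCaNN; a comparable small-scale index can be sketched with faiss, with random vectors standing in for real chunk embeddings:

```python
import numpy as np
import faiss  # used here as a stand-in for the SCaNN index from the paper

d, n_chunks = 768, 100_000  # toy scale; RETRO indexes ~28B chunks

# In RETRO these would be frozen BERT embeddings of 64-token chunks.
embeddings = np.random.rand(n_chunks, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # inverted-file approximate index
index.train(embeddings)                         # learn the coarse clustering
index.add(embeddings)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 2)         # 2 approximate nearest neighbors
```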

Encoder-Decoder Fusion

Retrieved neighbors are processed by an encoder:

  • Encode retrieved chunks
  • Cross-attention in decoder layers
  • Information flows from retrieval to generation

Training

Pre-training with Retrieval

  • Retrieval active during training
  • Model learns to use retrieved context
  • Same retrieval pipeline used at train and test time

Frozen Retriever

Unlike REALM:

  • Retriever (frozen, pre-trained BERT embeddings) not trained end-to-end
  • Reduces complexity
  • Still highly effective (frozen-embedding sketch below)
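
As an illustration of what "frozen" means here, a sketch of a time-averaged BERT embedder in the spirit of the paper's retriever (the model name and pooling details are assumptions, not the exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Frozen retriever sketch: a pre-trained BERT that is never fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()  # no gradients ever flow into the retriever
def embed_chunk(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    hidden = encoder(**inputs).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)          # time-averaged chunk embedding
```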

Computational Efficiency

Retrieval adds minimal overhead:

  • One-time index building
  • Fast approximate NN search
  • Chunked cross-attention added only at a subset of layers

Results

Scaling Comparison

| Model    | Parameters | Perplexity |
|----------|------------|------------|
| Baseline | 7.5B       | 2.96       |
| RETRO    | 7.5B       | 2.47       |
| Baseline | 175B       | ~2.5       |

RETRO 7.5B matches much larger models!

Knowledge-Intensive Tasks

Strong performance on:

  • Question answering
  • Fact verification
  • Knowledge probing

When Retrieval Helps Most

  • Rare/long-tail knowledge
  • Recent information
  • Factual queries
  • Less help for reasoning tasks

Architectural Details

Chunked Attention

Every few layers (every third layer from layer 6 in the paper), a chunked cross-attention block is inserted:

For chunk C at position i:

  • Retrieve k neighbors N_1...N_k
  • Encode neighbors: E = Encoder(N_1...N_k)
  • Cross-attend: Attend(C, E)
  • Continue with self-attention (single-head sketch below)
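
A single-head sketch of that cross-attention step (the paper's version is multi-headed and shifts queries so a token only attends to neighbors of preceding chunks; both details are omitted here):

```python
import torch
import torch.nn.functional as F

def chunked_cross_attention(chunk_h, neighbor_h, w_q, w_k, w_v):
    """Cross-attend one chunk's hidden states to its encoded neighbors.

    chunk_h:       [m, d]     hidden states of the m=64 tokens in chunk C
    neighbor_h:    [k, r, d]  encoder outputs E for the k retrieved neighbors
    w_q, w_k, w_v: [d, d]     projection matrices
    """
    k, r, d = neighbor_h.shape
    kv = neighbor_h.reshape(k * r, d)                   # attend over all neighbor tokens
    scores = (chunk_h @ w_q) @ (kv @ w_k).T / d**0.5    # [m, k*r] scaled dot-product
    return F.softmax(scores, dim=-1) @ (kv @ w_v)       # [m, d] fused retrieval info
```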

Neighbor Encoding

A shared encoder processes all neighbors (batching sketch below):

  • Efficient batch processing
  • Captures neighbor content
  • Enables comparison across neighbors
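
A sketch of that batching trick, assuming `encoder` maps token ids to hidden states (in the paper the encoder is additionally conditioned on the query chunk, omitted here):

```python
import torch

def encode_neighbors(encoder, neighbor_tokens):
    """Run the shared encoder once over every retrieved neighbor.

    neighbor_tokens: [batch, n_chunks, k, r] token ids, k neighbors per chunk
    """
    b, n, k, r = neighbor_tokens.shape
    flat = neighbor_tokens.reshape(b * n * k, r)  # fold all neighbors into one batch
    hidden = encoder(flat)                        # one forward pass: [b*n*k, r, d]
    return hidden.reshape(b, n, k, r, -1)         # restore per-chunk structure
```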

Implications for Agent Memory

Memory at Scale

RETRO validates retrieval-augmented approaches:

  • Retrieval can replace parameters
  • Massive memory stores are practical
  • Efficient at inference time

Memory Architecture Design

Lessons for agents:

  • Chunk-based storage works well
  • Cross-attention effective for fusion
  • Retrieval per-segment, not per-token
  • Frozen retriever is sufficient

Practical Considerations

Building agent memory systems (toy sketch below):

  • Index user interactions like RETRO indexes web
  • Retrieve relevant past for each turn
  • Cross-attend to memory content
  • Scale memory, not model
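
A toy, hypothetical illustration of these points (not from the paper), where `embed_fn` is any frozen text embedder returning unit-norm vectors:

```python
import numpy as np

class AgentMemory:
    """Hypothetical RETRO-style agent memory: index interactions, retrieve per turn."""

    def __init__(self, embed_fn, k=3):
        self.embed = embed_fn             # frozen embedder (unit-norm outputs assumed)
        self.k = k
        self.texts, self.vectors = [], []

    def add(self, text):
        """Store an interaction chunk, as RETRO stores web chunks."""
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def retrieve(self, query):
        """Return the k most similar stored chunks for the current turn."""
        if not self.texts:
            return []
        sims = np.stack(self.vectors) @ self.embed(query)  # cosine via dot product
        top = np.argsort(sims)[::-1][: self.k]
        return [self.texts[i] for i in top]
```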

Comparison to Other Work

vs. RAG

  • RETRO: retrieval integrated into the model architecture
  • RAG: retrieved passages supplied as input context
  • RETRO more parameter-efficient

vs. REALM

  • REALM: end-to-end retriever training
  • RETRO: frozen retriever
  • Both effective, RETRO simpler

vs. MemGPT

  • MemGPT: LLM controls memory
  • RETRO: automatic retrieval
  • Different levels of control

Limitations

  • Large index required
  • Retrieval latency at inference
  • Less interpretable than prompt-based retrieval
  • Requires architecture modification

Citation

```bibtex
@inproceedings{borgeaud2022improving,
  title={Improving Language Models by Retrieving from Trillions of Tokens},
  author={Borgeaud, Sebastian and others},
  booktitle={International Conference on Machine Learning},
  year={2022}
}
```