Key Contribution
DPR demonstrates that a simple dual-encoder model trained on question-passage pairs can substantially outperform traditional sparse retrieval methods such as BM25, helping establish dense retrieval as a standard approach for semantic search.
Architecture
Dual Encoder
Two separate encoders:
Question Encoder: Maps questions to vectors
Passage Encoder: Maps passages to vectors
Both produce vectors in same embedding space
Similarity = dot product of vectors
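The dual-encoder interface can be sketched in a few lines of NumPy. The random projections below are toy stand-ins for the two BERT encoders DPR actually uses; only the structure (two encoders, one shared space, dot-product scoring) is from the paper:

```python
import numpy as np

# Toy stand-ins for the two encoders. In DPR each is a separate BERT model;
# fixed random projections here only illustrate the interface.
rng = np.random.default_rng(0)
D_IN, D_OUT = 32, 8
W_q = rng.normal(size=(D_IN, D_OUT))  # "question encoder" (hypothetical weights)
W_p = rng.normal(size=(D_IN, D_OUT))  # "passage encoder" (hypothetical weights)

def encode_question(x):
    return x @ W_q  # question vector in the shared embedding space

def encode_passage(x):
    return x @ W_p  # passage vector in the same space

question = rng.normal(size=(D_IN,))
passages = rng.normal(size=(5, D_IN))

# Relevance = dot product between question and passage vectors.
scores = encode_passage(passages) @ encode_question(question)
best = int(np.argmax(scores))
print(scores.shape, best)
```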
Training
Contrastive learning:
Positive pairs: question + answer passage
Negative pairs: question + random passages
In-batch negatives for efficiency
Hard negatives improve quality
Training Details
Data
Natural Questions, TriviaQA, etc.
~60k question-passage pairs
Passages from Wikipedia
Negative Sampling
Three types of negatives:
**Random**: Random passages from corpus
**BM25**: High BM25 but wrong passages
**In-batch**: Other questions' positives
Hard negatives (BM25) significantly improve results.
Loss Function
L = -log( exp(sim(q, p+)) / ( exp(sim(q, p+)) + Σ_j exp(sim(q, p_j-)) ) )
Where:
q = question embedding
p+ = positive passage embedding
p_j- = j-th negative passage embedding
sim = dot-product similarity
Evaluation
Datasets
Natural Questions
TriviaQA
WebQuestions
CuratedTREC
SQuAD
Results
Top-20 retrieval accuracy:
| Method | NQ | TriviaQA |
|--------|-----|----------|
| BM25 | 59.1 | 66.9 |
| DPR | **79.4** | **78.8** |
Large improvements over sparse baselines: roughly 20 points absolute in top-20 accuracy on NQ, and about 12 on TriviaQA.
Why Dense Works Better
Semantic Matching
Captures meaning, not just keywords
Handles synonyms and paraphrases
Better for natural language questions
Learned Relevance
Trained on what answers questions
Not just term overlap
Task-specific notion of relevance
Implementation
Index Building
Encode all passages with passage encoder
Build vector index (FAISS, etc.)
Index enables fast similarity search
Query Time
Encode question with question encoder
Find nearest neighbors in index
Return top-k passages
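The two stages above can be sketched with a brute-force NumPy index; in practice the passage matrix lives in an ANN index such as FAISS so search is sub-linear, but the interface is the same:

```python
import numpy as np

# Brute-force stand-in for the index/query pipeline.
rng = np.random.default_rng(0)
passage_vecs = rng.normal(size=(1000, 8))   # all passages encoded offline

def search(query_vec, k=5):
    scores = passage_vecs @ query_vec       # dot product with every passage
    top = np.argsort(-scores)[:k]           # exact top-k; ANN approximates this
    return top, scores[top]

query = rng.normal(size=(8,))               # output of the question encoder
ids, top_scores = search(query)
print(ids, top_scores)
```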
Efficiency
Sub-linear search with approximate NN
Single forward pass per query
Can search billions of passages
Impact on Agent Memory
Semantic Memory Search
DPR principles apply directly:
Encode memories with embedding model
Search by semantic similarity
Find relevant context even with different wording
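A minimal agent memory store following this pattern might look as below. The hash-based `embed` is a deterministic toy stand-in for a real sentence-embedding model, so matching here is word-level rather than truly semantic:

```python
import numpy as np

def embed(text):
    """Toy embedding: hash each word to a random vector and sum.
    A stand-in for a real sentence-embedding model."""
    v = np.zeros(16)
    for word in text.lower().split():
        seed = sum(ord(c) for c in word)          # deterministic per word
        v += np.random.default_rng(seed).normal(size=16)
    return v / (np.linalg.norm(v) + 1e-9)

class MemoryStore:
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text):
        self.texts.append(text)
        self.vecs.append(embed(text))

    def search(self, query, k=1):
        scores = np.stack(self.vecs) @ embed(query)  # cosine (unit vectors)
        return [self.texts[i] for i in np.argsort(-scores)[:k]]

store = MemoryStore()
store.add("the user prefers dark roast coffee")
store.add("the weekly sync moved to tuesday")
print(store.search("which coffee does the user like"))
```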
Memory Encoding
Considerations for agents:
What to use as "query" (current message? task?)
What to encode as "passage" (full memory? chunks?)
How to handle multiple memory types
Extensions
ColBERT
Late interaction for better accuracy:
Per-token embeddings instead of single vector
MaxSim operation for matching
Better quality, more compute
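The MaxSim operation itself is small enough to sketch; the toy token embeddings below are illustrative, not ColBERT's actual BERT outputs:

```python
import numpy as np

# ColBERT-style late interaction: keep one embedding per token and score with
# MaxSim: for each query token, take its best-matching passage token, then sum.
def maxsim(q_tokens, p_tokens):
    sims = q_tokens @ p_tokens.T       # (n_q, n_p) token-token dot products
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))            # 4 query-token embeddings
p = rng.normal(size=(20, 8))           # 20 passage-token embeddings
print(maxsim(q, p))
```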
Hybrid Search
Combine dense and sparse:
Dense for semantic matching
Sparse for exact keywords
Often outperforms either alone
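One common way to combine the two signals is score fusion after normalization. This is a generic sketch, not a method from the DPR paper, and the mixing weight `alpha` is an assumed tunable:

```python
import numpy as np

def hybrid_scores(dense, sparse, alpha=0.7):
    """Min-max normalize each score list to [0, 1], then mix.
    alpha is a tunable weight (an assumption, not from the paper)."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)

dense = [0.9, 0.2, 0.5]    # dense dot-product scores (toy values)
sparse = [1.0, 7.5, 3.0]   # BM25 scores on a different scale (toy values)
print(hybrid_scores(dense, sparse))
```

Normalizing first matters because dense dot products and BM25 scores live on incompatible scales.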
Limitations
Requires training data
Fixed embedding size limits capacity
Doesn't handle very long passages well
Domain shift can hurt performance
Citation
@inproceedings{karpukhin2020dense,
title={Dense Passage Retrieval for Open-Domain Question Answering},
author={Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
booktitle={Proceedings of EMNLP},
year={2020}
}