Dense Passage Retrieval for Open-Domain Question Answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih

EMNLP · 2020

dense-retrieval · dpr · embeddings · question-answering

TL;DR

Introduces Dense Passage Retrieval (DPR), showing that learned dense embeddings significantly outperform sparse methods like BM25 for open-domain QA retrieval.

Key Contribution

DPR demonstrates that simple dual-encoder models trained on question-passage pairs can dramatically outperform traditional sparse retrieval methods like BM25, helping establish dense retrieval as the standard approach for semantic search.

Architecture

Dual Encoder

Two separate encoders:

  • Question Encoder: Maps questions to vectors
  • Passage Encoder: Maps passages to vectors
  • Both produce vectors in the same embedding space
  • Similarity = dot product of the question and passage vectors (see the sketch below)
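
A minimal usage sketch of the dual encoder, using the authors' released HuggingFace checkpoints (the facebook/dpr-* model names are the real published weights; batching, truncation, and device placement are omitted):

```python
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Two separate encoders that share only the embedding space.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_emb = q_enc(**q_tok("Who wrote Hamlet?", return_tensors="pt")).pooler_output  # (1, 768)
p_emb = p_enc(**p_tok("Hamlet is a tragedy written by William Shakespeare.",
                      return_tensors="pt")).pooler_output                       # (1, 768)

score = (q_emb @ p_emb.T).item()  # relevance = dot product of the two vectors
```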

Training

Contrastive learning:

  • Positive pairs: question + answer passage
  • Negative pairs: question + non-answer passages
  • In-batch negatives for efficiency
  • Hard negatives improve quality
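
With in-batch negatives, every other passage in a batch acts as a negative for each question, and the contrastive loss reduces to a softmax cross-entropy over the batch similarity matrix. A minimal PyTorch sketch (tensor names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (B, d); row i of p_emb is the gold passage for question i."""
    scores = q_emb @ p_emb.T                            # (B, B) dot-product similarities
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)             # NLL of the diagonal (positives)
```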

Training Details

Data

  • Natural Questions, TriviaQA, etc.
  • ~60k question-passage pairs
  • Passages from Wikipedia

Negative Sampling

Three types of negatives:

  • **Random**: Random passages from corpus
  • **BM25**: passages that score high under BM25 but do not contain the answer
  • **In-batch**: positive passages of other questions in the same batch

Hard negatives mined with BM25 significantly improve results; a mining sketch follows.
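
One way to mine such negatives, sketched with the rank_bm25 package (an assumption for illustration; the paper's BM25 baseline is Lucene-based):

```python
from rank_bm25 import BM25Okapi

def mine_bm25_negatives(question, passages, gold_idx, n=1):
    """Indices of the top-n BM25-ranked passages other than the gold one.
    A fuller version would also drop passages that contain the answer string."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    return [i for i in ranked if i != gold_idx][:n]
```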

Loss Function

L(q, p⁺, p₁⁻, …, pₙ⁻) = −log( exp(sim(q, p⁺)) / ( exp(sim(q, p⁺)) + Σᵢ exp(sim(q, pᵢ⁻)) ) )

Where:

  • q = question embedding
  • p⁺ = positive passage embedding
  • pᵢ⁻ = negative passage embeddings
  • sim(·,·) = dot product similarity
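
Numerically this is just the negative log-softmax of the positive score. A tiny check with made-up similarity values:

```python
import torch

sim = torch.tensor([12.0, 10.5, 9.8])     # [positive, negative, negative], made-up scores
loss = -torch.log_softmax(sim, dim=0)[0]  # = -log(exp(s+) / Σ exp(s))
print(loss.item())                        # ≈ 0.29
```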

Evaluation

Datasets

  • Natural Questions
  • TriviaQA
  • WebQuestions
  • CuratedTREC
  • SQuAD

Results

Top-20 retrieval accuracy:

| Method | NQ   | TriviaQA |
|--------|------|----------|
| BM25   | 59.1 | 66.9     |
| DPR    | **79.4** | **78.8** |

Substantial improvements over the sparse baseline: roughly 12–20 points absolute in top-20 accuracy.

Why Dense Works Better

Semantic Matching

  • Captures meaning, not just keywords
  • Handles synonyms and paraphrases
  • Better for natural language questions

Learned Relevance

  • Trained on which passages actually answer questions
  • Not just term overlap
  • Task-specific notion of relevance

Implementation

Index Building

  • Encode all passages with passage encoder
  • Build vector index (FAISS, etc.)
  • Index enables fast similarity search
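
A sketch of exact dot-product indexing with FAISS (random vectors stand in for real passage encodings; DPR vectors are 768-d):

```python
import faiss
import numpy as np

dim = 768                                                     # BERT-base embedding size
passage_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in for encoder output

index = faiss.IndexFlatIP(dim)   # exact inner-product (dot-product) search
index.add(passage_vecs)          # one vector per passage
```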

Query Time

  • Encode question with question encoder
  • Find nearest neighbors in index
  • Return top-k passages
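
Continuing the sketch above, query time is one encoder forward pass plus a nearest-neighbour lookup:

```python
query_vec = np.random.rand(1, dim).astype("float32")  # stand-in for the encoded question
scores, ids = index.search(query_vec, 20)             # top-20 passage ids and dot products
```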

Efficiency

  • Sub-linear search with approximate NN
  • Single forward pass per query
  • Can search billions of passages
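
Approximate search trades a little recall for sub-linear query time; one common FAISS option is an IVF index (the parameters below are illustrative, not tuned):

```python
nlist = 100                                    # number of coarse clusters
quantizer = faiss.IndexFlatIP(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(passage_vecs)                        # learn cluster centroids
ivf.add(passage_vecs)
ivf.nprobe = 8                                 # clusters probed per query (speed/recall knob)
scores, ids = ivf.search(query_vec, 20)
```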

Impact on Agent Memory

Semantic Memory Search

DPR principles apply directly:

  • Encode memories with embedding model
  • Search by semantic similarity
  • Find relevant context even with different wording
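
A toy memory store in this style (the class and the `embed` callable are hypothetical, not from the paper):

```python
import faiss
import numpy as np

class SemanticMemory:
    """Embed memories on write; retrieve by dot-product similarity on read."""

    def __init__(self, embed, dim):
        self.embed = embed                    # any text -> (dim,) float vector function
        self.index = faiss.IndexFlatIP(dim)
        self.texts = []

    def write(self, text):
        vec = np.asarray(self.embed(text), dtype="float32").reshape(1, -1)
        self.index.add(vec)
        self.texts.append(text)

    def read(self, query, k=5):
        vec = np.asarray(self.embed(query), dtype="float32").reshape(1, -1)
        scores, ids = self.index.search(vec, min(k, len(self.texts)))
        return [(self.texts[i], float(s)) for s, i in zip(scores[0], ids[0]) if i >= 0]
```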

Memory Encoding

Considerations for agents:

  • What to use as "query" (current message? task?)
  • What to encode as "passage" (full memory? chunks?)
  • How to handle multiple memory types

Extensions

ColBERT

Late interaction for better accuracy:

  • Per-token embeddings instead of single vector
  • MaxSim operation for matching
  • Better quality, more compute
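
The MaxSim operation at the heart of late interaction, as a sketch (per-token embeddings assumed L2-normalized, as in ColBERT):

```python
import torch

def maxsim_score(q_tokens: torch.Tensor, p_tokens: torch.Tensor) -> torch.Tensor:
    """q_tokens: (Lq, d), p_tokens: (Lp, d) per-token embeddings."""
    sim = q_tokens @ p_tokens.T          # (Lq, Lp) token-token similarities
    return sim.max(dim=1).values.sum()   # best passage match per query token, summed
```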

Hybrid Search

Combine dense and sparse:

  • Dense for semantic matching
  • Sparse for exact keywords
  • Often outperforms either alone
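
One simple fusion is min-max normalization plus linear interpolation (reciprocal rank fusion is another common choice; `alpha` is an illustrative knob, not from the paper):

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend dense and sparse scores for the same candidate passages."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo + 1e-9) for x in xs]
    return [alpha * d + (1 - alpha) * s for d, s in zip(norm(dense), norm(sparse))]
```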

Limitations

  • Requires training data
  • Fixed embedding size limits capacity
  • Doesn't handle very long passages well
  • Domain shift can hurt performance

Citation

    @inproceedings{karpukhin2020dense,
      title={Dense Passage Retrieval for Open-Domain Question Answering},
      author={Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
      booktitle={Proceedings of EMNLP},
      year={2020}
    }