Overview
This architecture combines traditional RAG (Retrieval-Augmented Generation) with persistent agent memory. While RAG provides access to a static knowledge base, memory adds dynamic, evolving context from interactions. Together, they create agents that are both knowledgeable and personalized.
The Two Memory Systems
Knowledge Base (RAG)
Static or slowly-changing information: product documentation, FAQs, and policies that are shared by all users and updated on their own release cycle.
Agent Memory
Dynamic, interaction-derived information: user preferences, facts learned mid-conversation, and summaries of past sessions, scoped to the individual user.
Architecture Components
```
    User Query
         │
         ▼
┌─────────────────┐
│  Query Router   │ ── Determine what context is needed
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌───────┐ ┌───────┐
│  RAG  │ │Memory │
│Search │ │Search │
└───┬───┘ └───┬───┘
    │         │
    └────┬────┘
         ▼
┌─────────────────┐
│Context Assembly │ ── Combine knowledge + memory
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLM + Prompt   │ ── Generate response
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Update   │ ── Store new learnings
└─────────────────┘
```
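Read top to bottom, the diagram is a single request path. The skeleton below is one way the stages might compose; every helper is an illustrative stub, fleshed out in the sections that follow, so the sketch runs as-is:

```python
# Skeleton of the flow above. Every helper is a stub standing in for the
# components described in the sections that follow; names are illustrative.

def route_query(query: str) -> set[str]:
    """Query Routing: decide which stores this query needs."""
    return {"knowledge", "memory"}

def search_knowledge(query: str) -> list[str]:
    """RAG search against the shared knowledge base."""
    return []

def search_memory(query: str, user_id: str) -> list[str]:
    """Vector search over this user's stored memories."""
    return []

def assemble_context(knowledge: list[str], memories: list[str]) -> str:
    """Context Assembly: combine knowledge + memory into one prompt block."""
    return "\n".join(knowledge + memories)

def generate(query: str, context: str) -> str:
    """LLM call (stubbed)."""
    return f"(answer to {query!r} using {len(context)} chars of context)"

def update_memory(user_id: str, query: str, response: str) -> None:
    """Memory Update: store new learnings from this exchange."""

def answer(query: str, user_id: str) -> str:
    route = route_query(query)
    knowledge = search_knowledge(query) if "knowledge" in route else []
    memories = search_memory(query, user_id) if "memory" in route else []
    context = assemble_context(knowledge, memories)
    response = generate(query, context)
    update_memory(user_id, query, response)
    return response
```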
Query Routing
Not every query needs both systems; a router can decide which stores to hit (a minimal sketch follows these examples):
Knowledge-Heavy Queries
"What's your return policy?" can be answered entirely from the knowledge base; memory adds nothing.
Memory-Heavy Queries
"What did we discuss last time?" can be answered only from memory; the knowledge base is irrelevant.
Hybrid Queries
"Based on my preferences, what do you recommend?" needs memory for the preferences and the knowledge base for what can be recommended.
Collection Strategy
Separate Collections
Keep RAG and memory in distinct collections so each can be filtered, updated, and scaled on its own (a Qdrant setup sketch follows these descriptions):
**Knowledge Collection**: shared by all users; updated only when the source content changes.
**Memory Collection**: scoped per user; every point carries a `user_id` payload so searches can be filtered to the current user.
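With Qdrant, the split might look like the sketch below. The collection names match the query snippet in the next subsection, while the vector size and the payload index are assumptions:

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(":memory:")  # in-process instance, for the sketch
EMBEDDING_DIM = 384                # assumes a 384-dimensional embedding model

# Shared knowledge base: one collection for all users.
qdrant.create_collection(
    collection_name="knowledge",
    vectors_config=models.VectorParams(size=EMBEDDING_DIM, distance=models.Distance.COSINE),
)

# Per-user memories: same shape, but every point carries a user_id payload.
qdrant.create_collection(
    collection_name="memories",
    vectors_config=models.VectorParams(size=EMBEDDING_DIM, distance=models.Distance.COSINE),
)
# Index user_id so per-user filtering stays fast as memories accumulate.
qdrant.create_payload_index(
    collection_name="memories",
    field_name="user_id",
    field_schema=models.PayloadSchemaType.KEYWORD,
)
```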
Query Both, Merge Results
```python
from qdrant_client import models

knowledge_results = qdrant.search(
    collection_name="knowledge",
    query_vector=query_embedding,
    limit=5,
)
memory_results = qdrant.search(
    collection_name="memories",
    query_vector=query_embedding,
    query_filter=models.Filter(must=[
        models.FieldCondition(key="user_id", match=models.MatchValue(value=current_user))
    ]),
    limit=5,
)
# merge_and_rank is sketched under Context Assembly below
context = merge_and_rank(knowledge_results, memory_results)
```
Context Assembly
Prioritization Logic
When the context window is limited, rank the combined results by relevance and include the highest-scoring items until the budget is spent.
Deduplication
Avoid redundancy: a memory that restates a knowledge-base chunk, or two near-identical memories, wastes tokens, so drop duplicates before assembling the prompt.
Source Attribution
Track where each piece of context came from: label knowledge-base chunks and memories in the prompt so the model (and your logs) can tell official information from learned context. A `merge_and_rank` sketch covering all three steps follows.
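Putting the three steps together, the `merge_and_rank` helper from the earlier snippet might look like this minimal sketch. The character budget and the `text` payload field are assumptions:

```python
# Sketch of merge_and_rank over Qdrant ScoredPoint results: prioritize by
# score, deduplicate, and tag each line with its source. The character
# budget and the "text" payload field are assumptions.

def merge_and_rank(knowledge_results, memory_results, max_chars: int = 4000) -> str:
    candidates = [("knowledge", hit) for hit in knowledge_results]
    candidates += [("memory", hit) for hit in memory_results]
    # Prioritization: highest similarity first, whichever system it came from.
    candidates.sort(key=lambda pair: pair[1].score, reverse=True)

    lines, seen, used = [], set(), 0
    for source, hit in candidates:
        text = hit.payload["text"]
        key = text.strip().lower()
        if key in seen:                   # Deduplication: skip repeated content
            continue
        if used + len(text) > max_chars:  # Prioritization: respect the budget
            break
        seen.add(key)
        used += len(text)
        lines.append(f"[{source}] {text}")  # Source attribution
    return "\n".join(lines)
```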
Memory Update Patterns
During Conversation
Extract and store new information as it surfaces: stated preferences, corrections, and decisions can be written to the memory collection mid-conversation (see the write-path sketch after this section).
Post-Conversation
Consolidation tasks: summarize the finished session, merge it with overlapping older memories, and prune anything superseded.
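As a minimal sketch of the during-conversation write path, reusing the `qdrant` client from the collection setup above. The keyword-based extraction and the dummy `embed` function are illustrative stand-ins; production systems usually extract memories with an LLM call and a real embedding model:

```python
import uuid
from qdrant_client import models

# Dummy embedder so the sketch runs end to end; swap in a real model.
def embed(text: str) -> list[float]:
    return [0.0] * 384

def store_memory(qdrant, user_id: str, text: str, embed) -> None:
    """Write one extracted memory into the per-user memory collection."""
    qdrant.upsert(
        collection_name="memories",
        points=[models.PointStruct(
            id=str(uuid.uuid4()),
            vector=embed(text),
            payload={"user_id": user_id, "text": text},
        )],
    )

# During the conversation: persist anything that looks like a stated
# preference. (A real extractor would be an LLM, not a keyword check.)
for turn in ["I prefer aisle seats", "What's your return policy?"]:
    if "i prefer" in turn.lower():
        store_memory(qdrant, user_id="u42", text=turn, embed=embed)
```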
Scaling Considerations
Knowledge Base
Grows with the volume of source content, not the user count, so it can be re-embedded and re-indexed on its own schedule.
Memory Store
Grows with every user and every interaction; per-user filtering (backed by the `user_id` payload index) keeps searches scoped, and periodic consolidation keeps old memories from bloating the collection.
When to Use This Pattern
Good fit: agents that serve returning users across sessions and need both a shared knowledge base and per-user context, such as customer-support or personal-assistant agents.
Consider alternatives if: interactions are one-shot or anonymous (plain RAG is simpler), or there is no shared knowledge base to retrieve from (memory alone may be enough).