Hierarchical Memory System

Multi-level memory architecture from working memory to long-term storage

advanced · memory-hierarchy · caching · consolidation · cognitive

Overview

Inspired by human memory systems, hierarchical memory organizes agent memory into multiple levels with different capacities, access speeds, and retention characteristics. This enables efficient handling of both immediate context and long-term knowledge.

Memory Levels

Working Memory (L1)

Immediate context:

  • Capacity: Small (fits in context window)
  • Duration: Current conversation/task
  • Access: Instant (in prompt)
  • Content: Active task state, recent exchanges
Short-term Memory (L2)

Recent history:

  • Capacity: Medium (hundreds of items)
  • Duration: Hours to days
  • Access: Fast vector search
  • Content: Recent conversations, temporary facts
Long-term Memory (L3)

Persistent knowledge:

  • Capacity: Large (unlimited with scaling)
  • Duration: Indefinite
  • Access: Standard vector search
  • Content: User profile, historical summaries, learned facts
Archival Memory (L4)

Cold storage:

  • Capacity: Very large
  • Duration: Permanent
  • Access: Slower, batch retrieval
  • Content: Old conversations, audit logs, rarely-accessed data
Architecture Flow

    ┌─────────────────────────────────────────────────────────────┐
    │ CURRENT CONVERSATION                                        │
    └──────────────────────────┬──────────────────────────────────┘
                               │
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L1: Working Memory (In Context)                             │
    │ ├── Current task state                                      │
    │ ├── Last N messages                                         │
    │ └── Retrieved context from lower levels                     │
    └──────────────────────────┬──────────────────────────────────┘
                               │ overflow / retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L2: Short-term Memory (Hot Storage)                         │
    │ ├── Today's conversations                                   │
    │ ├── Active project context                                  │
    │ └── Recently accessed memories                              │
    └──────────────────────────┬──────────────────────────────────┘
                               │ consolidation / retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L3: Long-term Memory (Warm Storage)                         │
    │ ├── User profile and preferences                            │
    │ ├── Conversation summaries                                  │
    │ └── Learned facts and patterns                              │
    └──────────────────────────┬──────────────────────────────────┘
                               │ archival / deep retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L4: Archival Memory (Cold Storage)                          │
    │ ├── Historical conversations (full text)                    │
    │ ├── Audit trails                                            │
    │ └── Rarely accessed data                                    │
    └─────────────────────────────────────────────────────────────┘

Movement Between Levels

Promotion (Cold → Hot)

When archived memory becomes relevant:

  • User asks about old conversation
  • Pattern matching finds historical relevance
  • Explicit user request to recall
Demotion (Hot → Cold)

As memories age or lose relevance:

  • Time-based aging
  • Access frequency tracking
  • Importance score decay
  • Consolidation into summaries
Consolidation

Transform detailed memories into summaries:

  • Multiple conversations → Single summary
  • Repeated facts → Confident knowledge
  • Episodes → Learned patterns
  • Reduce storage, preserve meaning
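A sketch of that transformation: fold a batch of detailed records into a single summary record, de-duplicating repeated facts along the way. The `summarize` callable stands in for whatever summarizer the system uses (often an LLM call); it and the record fields are placeholders, not a real API.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """Illustrative record shape."""
    text: str
    importance: float = 0.5
    sources: list = field(default_factory=list)


def consolidate(records, summarize):
    """Collapse many records into one summary, preserving the strongest signals."""
    # Repeated facts collapse into one entry (order-preserving de-dup)
    unique = list(dict.fromkeys(r.text for r in records))
    summary_text = summarize(unique)
    return MemoryRecord(
        text=summary_text,
        importance=max(r.importance for r in records),  # keep the strongest importance
        sources=[id(r) for r in records],               # provenance back-links (illustrative)
    )
```

Usage might look like `consolidate(todays_records, summarize=llm_summarizer)`: storage shrinks to one record while the key facts and their provenance survive.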
Retrieval Strategy

Query Planning

When a query arrives:

  • Check L1 (already in context)
  • Fast search L2 (recent, likely relevant)
  • Search L3 if needed (broader knowledge)
  • Deep search L4 only if explicitly needed
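The escalation order can be written as a tiny planner that emits which tiers to search, cheapest first; the actual search loop then stops early once results are sufficiently relevant. The recall cues below are toy heuristics for "explicitly needed", not a real intent classifier.

```python
def plan_levels(query: str, answered_in_context: bool = False) -> list[str]:
    """Return the ordered list of tiers to search for a query."""
    if answered_in_context:
        return []                            # L1: already in the prompt, nothing to fetch
    levels = ["L2", "L3"]                    # hot storage first, then warm
    # Toy heuristic: only deep-search the archive on an explicit recall request
    recall_cues = ("remember when", "last year", "old conversation")
    if any(cue in query.lower() for cue in recall_cues):
        levels.append("L4")
    return levels
```

Keeping L4 out of the default plan is what makes the common case fast: most queries never touch cold storage.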
Relevance Scoring

Combine multiple signals:

  • Semantic similarity to query
  • Recency (prefer recent for same relevance)
  • Access frequency (often-used = important)
  • Explicit importance tags
  • Source level (L2 > L3 for active context)
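A minimal scoring function combining those signals; the weights and the level boosts are illustrative knobs to tune per application, not canonical values.

```python
# Hypothetical boosts: prefer hot storage at equal relevance
LEVEL_BOOST = {"L2": 1.1, "L3": 1.0}


def relevance(similarity: float, age_days: float, access_count: int,
              pinned: bool, level: str) -> float:
    """Blend the signals above into one ranking score (weights are illustrative)."""
    recency = 1.0 / (1.0 + age_days)           # newer wins ties
    frequency = min(access_count, 10) / 10     # capped so frequency can't dominate
    score = 0.6 * similarity + 0.2 * recency + 0.1 * frequency
    if pinned:
        score += 0.1                           # explicit importance tag
    return score * LEVEL_BOOST.get(level, 1.0)
```

Because semantic similarity carries the largest weight, recency and frequency only break ties between comparably relevant memories rather than overriding relevance outright.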
Budget Allocation

Distribute context window budget:

  • Reserve space for L1 (essential current state)
  • Allocate to L2 (recent relevant context)
  • Fill remaining with L3 (background knowledge)
  • L4 only on specific request
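A sketch of that split as token counts; the 50/30/20 fractions are made-up defaults, and carving the L4 share out of L3's allocation is just one possible policy.

```python
def allocate_budget(total_tokens: int, include_archive: bool = False) -> dict:
    """Split the context window across tiers (fractions are illustrative)."""
    budget = {
        "L1": int(total_tokens * 0.50),  # reserved: current state + recent turns
        "L2": int(total_tokens * 0.30),  # recent relevant context
        "L3": int(total_tokens * 0.20),  # background knowledge
        "L4": 0,                         # archive excluded unless requested
    }
    if include_archive:
        budget["L4"] = budget["L3"] // 2  # carve the archive share out of L3
        budget["L3"] -= budget["L4"]
    return budget
```

Reserving L1's share first guarantees the essential current state always fits, even when retrieval returns more context than the window can hold.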
Consolidation Process

When to Consolidate

  • End of conversation → Summarize
  • End of day → Roll up short-term
  • Periodic maintenance → Compress long-term
  • Storage thresholds → Archive old data
What to Preserve

During consolidation, keep:

  • Key facts and decisions
  • Important preferences expressed
  • Significant events
  • Patterns and trends
What to Discard

Safe to compress or remove:

  • Routine exchanges
  • Superseded information
  • Duplicate facts
  • Low-importance details
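The preserve/discard rules above can be expressed as a triage function run over each record during consolidation. The record fields (`kind`, `importance`, `superseded`) and the 0.5 cutoff are hypothetical.

```python
def triage(record: dict) -> str:
    """Decide a record's fate during consolidation: keep, compress, or discard."""
    keep_kinds = {"fact", "decision", "preference", "event", "pattern"}
    if record.get("superseded"):
        return "discard"                   # replaced by newer information
    if record.get("kind") in keep_kinds and record.get("importance", 0) >= 0.5:
        return "keep"                      # key facts, decisions, preferences, events
    if record.get("kind") == "routine":
        return "discard"                   # routine exchanges add no lasting value
    return "compress"                      # low-importance details survive only in summary
```

Duplicate facts never reach this stage in the sketch above: de-duplication during consolidation removes them before triage runs.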
Implementation with Qdrant

Collection per Level

L2: Collection "memory_short_term"
- High-performance configuration
- Aggressive indexing
- Small payload limits

L3: Collection "memory_long_term"
- Balanced configuration
- Standard indexing
- Full payloads

L4: Collection "memory_archive"
- Cost-optimized configuration
- Minimal indexing
- Compressed storage
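That per-level tuning might translate into Qdrant collection settings roughly like these, expressed as REST-style request bodies. The embedding size and HNSW parameters are illustrative guesses; check them against the Qdrant documentation before use.

```python
EMBEDDING_DIM = 384  # assumed embedding size

QDRANT_COLLECTIONS = {
    # L2: hot storage - denser HNSW graph for fast, accurate search
    "memory_short_term": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine"},
        "hnsw_config": {"m": 32, "ef_construct": 256},
    },
    # L3: warm storage - balanced defaults
    "memory_long_term": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine"},
        "hnsw_config": {"m": 16, "ef_construct": 100},
    },
    # L4: cold storage - vectors on disk, int8 scalar quantization to cut cost
    "memory_archive": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine", "on_disk": True},
        "hnsw_config": {"m": 8, "ef_construct": 64},
        "quantization_config": {"scalar": {"type": "int8"}},
    },
}
```

With the Python client, each body would feed one `client.create_collection(...)` call, one collection per level.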

Tiered Search

    # Fast path - check hot memory first
    l2_results = search(collection="memory_short_term", query=query_vector, limit=5)
    if sufficient_relevance(l2_results):
        return l2_results

    # Slower path - check warm memory
    l3_results = search(collection="memory_long_term", query=query_vector, limit=5)
    return merge(l2_results, l3_results)

Benefits

  • Efficient context window usage
  • Fast retrieval for common cases
  • Cost-effective storage scaling
  • Natural memory lifecycle
  • Mirrors human intuition