Hierarchical Memory System

Multi-level memory architecture from working memory to long-term storage

advanced · memory-hierarchy · caching · consolidation · cognitive

Overview

Inspired by human memory systems, hierarchical memory organizes agent memory into multiple levels with different capacities, access speeds, and retention characteristics. This enables efficient handling of both immediate context and long-term knowledge.

Memory Levels

Working Memory (L1)

Immediate context:

  • Capacity: Small (fits in context window)
  • Duration: Current conversation/task
  • Access: Instant (in prompt)
  • Content: Active task state, recent exchanges
Short-term Memory (L2)

Recent history:

  • Capacity: Medium (hundreds of items)
  • Duration: Hours to days
  • Access: Fast vector search
  • Content: Recent conversations, temporary facts
Long-term Memory (L3)

Persistent knowledge:

  • Capacity: Large (unlimited with scaling)
  • Duration: Indefinite
  • Access: Standard vector search
  • Content: User profile, historical summaries, learned facts
Archival Memory (L4)

Cold storage:

  • Capacity: Very large
  • Duration: Permanent
  • Access: Slower, batch retrieval
  • Content: Old conversations, audit logs, rarely-accessed data
Architecture Flow

    ┌─────────────────────────────────────────────────────────────┐
    │ CURRENT CONVERSATION                                        │
    └──────────────────────────┬──────────────────────────────────┘
                               │
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L1: Working Memory (In Context)                             │
    │ ├── Current task state                                      │
    │ ├── Last N messages                                         │
    │ └── Retrieved context from lower levels                     │
    └──────────────────────────┬──────────────────────────────────┘
                               │ overflow / retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L2: Short-term Memory (Hot Storage)                         │
    │ ├── Today's conversations                                   │
    │ ├── Active project context                                  │
    │ └── Recently accessed memories                              │
    └──────────────────────────┬──────────────────────────────────┘
                               │ consolidation / retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L3: Long-term Memory (Warm Storage)                         │
    │ ├── User profile and preferences                            │
    │ ├── Conversation summaries                                  │
    │ └── Learned facts and patterns                              │
    └──────────────────────────┬──────────────────────────────────┘
                               │ archival / deep retrieval
    ┌──────────────────────────▼──────────────────────────────────┐
    │ L4: Archival Memory (Cold Storage)                          │
    │ ├── Historical conversations (full text)                    │
    │ ├── Audit trails                                            │
    │ └── Rarely accessed data                                    │
    └─────────────────────────────────────────────────────────────┘

Movement Between Levels

Promotion (Cold → Hot)

When archived memory becomes relevant:

  • User asks about old conversation
  • Pattern matching finds historical relevance
  • Explicit user request to recall
Demotion (Hot → Cold)

As memories age or lose relevance:

  • Time-based aging
  • Access frequency tracking
  • Importance score decay
  • Consolidation into summaries
Consolidation

Transform detailed memories into summaries:

  • Multiple conversations → Single summary
  • Repeated facts → Confident knowledge
  • Episodes → Learned patterns
  • Reduce storage, preserve meaning
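A sketch of that transformation: fold a batch of detailed records into a single summary record, de-duplicating repeated facts along the way. The `summarize` callable stands in for whatever summarizer the system uses (often an LLM call); it and the record fields are placeholders, not a real API.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """Illustrative record shape."""
    text: str
    importance: float = 0.5
    sources: list = field(default_factory=list)


def consolidate(records, summarize):
    """Collapse many records into one summary, preserving the strongest signals."""
    # Repeated facts collapse into one entry (order-preserving de-dup)
    unique = list(dict.fromkeys(r.text for r in records))
    summary_text = summarize(unique)
    return MemoryRecord(
        text=summary_text,
        importance=max(r.importance for r in records),  # keep the strongest importance
        sources=[id(r) for r in records],               # provenance back-links (illustrative)
    )
```

Usage might look like `consolidate(todays_records, summarize=llm_summarizer)`: storage shrinks to one record while the key facts and their provenance survive.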
Retrieval Strategy

Query Planning

When a query arrives:

  • Check L1 (already in context)
  • Fast search L2 (recent, likely relevant)
  • Search L3 if needed (broader knowledge)
  • Deep search L4 only if explicitly needed
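The escalation order can be written as a tiny planner that emits which tiers to search, cheapest first; the actual search loop then stops early once results are sufficiently relevant. The recall cues below are toy heuristics for "explicitly needed", not a real intent classifier.

```python
def plan_levels(query: str, answered_in_context: bool = False) -> list[str]:
    """Return the ordered list of tiers to search for a query."""
    if answered_in_context:
        return []                            # L1: already in the prompt, nothing to fetch
    levels = ["L2", "L3"]                    # hot storage first, then warm
    # Toy heuristic: only deep-search the archive on an explicit recall request
    recall_cues = ("remember when", "last year", "old conversation")
    if any(cue in query.lower() for cue in recall_cues):
        levels.append("L4")
    return levels
```

Keeping L4 out of the default plan is what makes the common case fast: most queries never touch cold storage.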
Relevance Scoring

Combine multiple signals:

  • Semantic similarity to query
  • Recency (prefer recent for same relevance)
  • Access frequency (often-used = important)
  • Explicit importance tags
  • Source level (L2 > L3 for active context)
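A minimal scoring function combining those signals; the weights and the level boosts are illustrative knobs to tune per application, not canonical values.

```python
# Hypothetical boosts: prefer hot storage at equal relevance
LEVEL_BOOST = {"L2": 1.1, "L3": 1.0}


def relevance(similarity: float, age_days: float, access_count: int,
              pinned: bool, level: str) -> float:
    """Blend the signals above into one ranking score (weights are illustrative)."""
    recency = 1.0 / (1.0 + age_days)           # newer wins ties
    frequency = min(access_count, 10) / 10     # capped so frequency can't dominate
    score = 0.6 * similarity + 0.2 * recency + 0.1 * frequency
    if pinned:
        score += 0.1                           # explicit importance tag
    return score * LEVEL_BOOST.get(level, 1.0)
```

Because semantic similarity carries the largest weight, recency and frequency only break ties between comparably relevant memories rather than overriding relevance outright.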
Budget Allocation

Distribute context window budget:

  • Reserve space for L1 (essential current state)
  • Allocate to L2 (recent relevant context)
  • Fill remaining with L3 (background knowledge)
  • L4 only on specific request
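A sketch of that split as token counts; the 50/30/20 fractions are made-up defaults, and carving the L4 share out of L3's allocation is just one possible policy.

```python
def allocate_budget(total_tokens: int, include_archive: bool = False) -> dict:
    """Split the context window across tiers (fractions are illustrative)."""
    budget = {
        "L1": int(total_tokens * 0.50),  # reserved: current state + recent turns
        "L2": int(total_tokens * 0.30),  # recent relevant context
        "L3": int(total_tokens * 0.20),  # background knowledge
        "L4": 0,                         # archive excluded unless requested
    }
    if include_archive:
        budget["L4"] = budget["L3"] // 2  # carve the archive share out of L3
        budget["L3"] -= budget["L4"]
    return budget
```

Reserving L1's share first guarantees the essential current state always fits, even when retrieval returns more context than the window can hold.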
Consolidation Process

When to Consolidate

  • End of conversation → Summarize
  • End of day → Roll up short-term
  • Periodic maintenance → Compress long-term
  • Storage thresholds → Archive old data
What to Preserve

During consolidation, keep:

  • Key facts and decisions
  • Important preferences expressed
  • Significant events
  • Patterns and trends
What to Discard

Safe to compress or remove:

  • Routine exchanges
  • Superseded information
  • Duplicate facts
  • Low-importance details
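The preserve/discard rules above can be expressed as a triage function run over each record during consolidation. The record fields (`kind`, `importance`, `superseded`) and the 0.5 cutoff are hypothetical.

```python
def triage(record: dict) -> str:
    """Decide a record's fate during consolidation: keep, compress, or discard."""
    keep_kinds = {"fact", "decision", "preference", "event", "pattern"}
    if record.get("superseded"):
        return "discard"                   # replaced by newer information
    if record.get("kind") in keep_kinds and record.get("importance", 0) >= 0.5:
        return "keep"                      # key facts, decisions, preferences, events
    if record.get("kind") == "routine":
        return "discard"                   # routine exchanges add no lasting value
    return "compress"                      # low-importance details survive only in summary
```

Duplicate facts never reach this stage in the sketch above: de-duplication during consolidation removes them before triage runs.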
Implementation with Qdrant

Collection per Level

L2: Collection "memory_short_term"
- High-performance configuration
- Aggressive indexing
- Small payload limits

L3: Collection "memory_long_term"
- Balanced configuration
- Standard indexing
- Full payloads

L4: Collection "memory_archive"
- Cost-optimized configuration
- Minimal indexing
- Compressed storage
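That per-level tuning might translate into Qdrant collection settings roughly like these, expressed as REST-style request bodies. The embedding size and HNSW parameters are illustrative guesses; check them against the Qdrant documentation before use.

```python
EMBEDDING_DIM = 384  # assumed embedding size

QDRANT_COLLECTIONS = {
    # L2: hot storage - denser HNSW graph for fast, accurate search
    "memory_short_term": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine"},
        "hnsw_config": {"m": 32, "ef_construct": 256},
    },
    # L3: warm storage - balanced defaults
    "memory_long_term": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine"},
        "hnsw_config": {"m": 16, "ef_construct": 100},
    },
    # L4: cold storage - vectors on disk, int8 scalar quantization to cut cost
    "memory_archive": {
        "vectors": {"size": EMBEDDING_DIM, "distance": "Cosine", "on_disk": True},
        "hnsw_config": {"m": 8, "ef_construct": 64},
        "quantization_config": {"scalar": {"type": "int8"}},
    },
}
```

With the Python client, each body would feed one `client.create_collection(...)` call, one collection per level.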

Tiered Search

    # Fast path - check hot memory first
    l2_results = search(collection="memory_short_term", query=query_vector, limit=5)
    if sufficient_relevance(l2_results):
        return l2_results

    # Slower path - check warm memory
    l3_results = search(collection="memory_long_term", query=query_vector, limit=5)
    return merge(l2_results, l3_results)

Benefits

  • Efficient context window usage
  • Fast retrieval for common cases
  • Cost-effective storage scaling
  • Natural memory lifecycle
  • Mirrors human intuition