## Overview
This architecture describes a production-ready SaaS platform where multiple users interact with AI agents that remember context across sessions. Each user has their own isolated memory space, enabling personalized experiences while maintaining data privacy and security.
The core value proposition: agents that truly know your users — learning their preferences, recalling past conversations, and building context over time.
## System Components

### User Layer

Users authenticate through your standard auth system (OAuth, email/password, SSO). Each user gets:

- A unique tenant ID for memory isolation
- Personal preference settings that guide agent behavior
- Access to their conversation history and stored memories
### Application Layer

The frontend provides interfaces for:

- **Chat Interface**: real-time conversation with the AI agent
- **Memory Browser**: lets users view, search, and manage what the agent remembers
- **Settings**: control over memory retention, agent personality, and privacy preferences

The backend handles:

- **Session Management**: tracking active conversations and context windows
- **Memory Operations**: storing, retrieving, and updating memories
- **Agent Orchestration**: managing the flow between user input, memory retrieval, LLM calls, and responses
### Memory Layer

The vector database (Qdrant) serves as the persistent memory store:

- **Collection per Tenant**: each user/organization gets isolated collections
- **Memory Types**: store different kinds of memories (facts, preferences, conversation summaries, entities)
- **Metadata Filtering**: query memories by type, timestamp, importance, or topic
### Agent Layer

The AI agent combines:

- **Context Assembly**: pull relevant memories before each LLM call
- **Response Generation**: the LLM produces responses informed by user history
- **Memory Extraction**: after each response, extract and store new memories
## Data Flow

```
   User Message
         │
         ▼
┌─────────────────┐
│ Auth & Session  │ ── Validate user, load session
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Retrieval│ ── Query Qdrant for relevant memories
└────────┬────────┘    (filter by user_id, recency, relevance)
         │
         ▼
┌─────────────────┐
│ Context Assembly│ ── Combine: system prompt + memories + conversation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LLM Call     │ ── Generate response with full context
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Storage  │ ── Extract and store new memories from conversation
└────────┬────────┘
         │
         ▼
   User Response
```
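The flow above can be sketched as a single request handler. This is a minimal, illustrative version: the stores are in-memory dicts, the "retrieval" is naive keyword overlap standing in for a Qdrant similarity search, and the LLM is a stub — none of these names come from a real framework.

```python
# Minimal sketch of the request pipeline: auth/session -> memory
# retrieval -> context assembly -> LLM call -> memory storage.
# All stores and the "LLM" are stubs; names are illustrative.

SESSIONS: dict[str, list[str]] = {}   # user_id -> conversation turns
MEMORIES: dict[str, list[str]] = {}   # user_id -> stored memory texts

def retrieve_memories(user_id: str, message: str, k: int = 5) -> list[str]:
    # Stand-in for a vector similarity search: rank by word overlap.
    words = set(message.lower().split())
    scored = [(len(words & set(m.lower().split())), m)
              for m in MEMORIES.get(user_id, [])]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:k] if score > 0]

def call_llm(prompt: str) -> str:
    # Stub: a real system would call the model provider here.
    return f"(model response to {len(prompt)} chars of context)"

def handle_message(user_id: str, message: str) -> str:
    session = SESSIONS.setdefault(user_id, [])         # Auth & Session
    memories = retrieve_memories(user_id, message)     # Memory Retrieval
    prompt = "\n".join(["You are a helpful agent.",    # Context Assembly
                        *memories, *session, message])
    response = call_llm(prompt)                        # LLM Call
    session.extend([message, response])
    MEMORIES.setdefault(user_id, []).append(message)   # Memory Storage
    return response
```

The key structural point is that retrieval happens before the LLM call and storage after it, so each turn both consumes and produces memories.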
## Memory Schema

Structure memories with rich metadata for effective retrieval:

```
Memory Object:
├── id: unique identifier
├── user_id: tenant isolation
├── content: the actual memory text
├── embedding: vector representation
├── type: fact | preference | summary | entity | event
├── importance: 0.0 - 1.0 score
├── source: conversation_id or "user_input"
├── created_at: timestamp
├── last_accessed: for recency weighting
└── metadata: flexible JSON for additional context
```
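One possible in-code shape for this object is a dataclass whose fields mirror the schema. The `Memory` class itself is an assumption of this sketch, not an existing library type; the embedding would come from your embedding model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid

@dataclass
class Memory:
    user_id: str                # tenant isolation
    content: str                # the actual memory text
    embedding: list[float]      # vector representation
    type: str = "fact"          # fact | preference | summary | entity | event
    importance: float = 0.5     # 0.0 - 1.0 score
    source: str = "user_input"  # conversation_id or "user_input"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    last_accessed: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict[str, Any] = field(default_factory=dict)  # flexible extras
```

When upserting into Qdrant, everything except `embedding` would travel in the point's payload, so the same fields stay available for metadata filtering.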
## Multi-Tenancy Strategy

**Collection-per-tenant** for strong isolation:

- Each user/organization gets its own Qdrant collection
- No risk of cross-tenant data leakage
- Easy to delete all of a user's data (GDPR compliance)
- Independent scaling per tenant

**Alternative: shared collection with filtering**

- A single collection with `user_id` in the payload
- More efficient for many small tenants
- Requires careful payload indexing
- Apply Qdrant's filtering on every query
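In the shared-collection variant, the easiest way to guarantee the `user_id` filter is never forgotten is to build it through one helper. The sketch below returns the JSON structure Qdrant's filter DSL uses (`must` is an AND of conditions); with the Python client you would build the equivalent `models.Filter` object instead. The `tenant_filter` name is ours, not a library function.

```python
from typing import Optional

def tenant_filter(user_id: str, memory_type: Optional[str] = None) -> dict:
    """Build a filter clause scoping every query to one tenant.

    Mirrors Qdrant's JSON filter format; pass the result as the
    query filter on every search against the shared collection.
    """
    must = [{"key": "user_id", "match": {"value": user_id}}]
    if memory_type is not None:
        # Optional extra condition, e.g. only "preference" memories.
        must.append({"key": "type", "match": {"value": memory_type}})
    return {"must": must}
```

Pair this with a payload index on `user_id` (and `type`) so filtered searches stay fast as the collection grows.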
## Key Design Decisions

### Memory Retrieval Strategy

Pull memories based on:

- **Semantic similarity** to the current message
- **Recency** weighting for fresh context
- **Importance** scores for critical facts
- **Type filtering** based on query intent
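One way to combine the first three signals is a weighted score with an exponential recency term. This is a hedged sketch: the weights and half-life below are illustrative starting points, not tuned values.

```python
import math
from datetime import datetime

def memory_score(similarity: float, importance: float,
                 last_accessed: datetime, now: datetime,
                 half_life_days: float = 30.0,
                 weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Blend semantic similarity, recency, and importance into one rank."""
    age_days = (now - last_accessed).total_seconds() / 86400
    # Recency is 1.0 for a just-touched memory, 0.5 at the half-life.
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    w_sim, w_rec, w_imp = weights
    return w_sim * similarity + w_rec * recency + w_imp * importance
```

Type filtering is handled separately, as a hard filter on the vector query rather than a score component, since a "preference" question should never surface "event" memories just because they score well.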
### Memory Lifecycle

- **Creation**: extract memories after meaningful exchanges
- **Consolidation**: periodically merge similar memories
- **Decay**: reduce the importance of unused memories over time
- **Deletion**: user-controlled removal, plus automatic cleanup of low-value memories
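The decay step can be as simple as an exponential discount applied by a periodic maintenance job; the daily rate and floor below are illustrative, not recommended constants.

```python
def decay_importance(importance: float, days_since_access: float,
                     daily_decay: float = 0.01, floor: float = 0.05) -> float:
    """Discount a memory's importance by its idle time.

    A floor keeps long-idle memories retrievable (and distinguishable
    from deleted ones); the cleanup pass can then drop anything sitting
    at the floor for too long.
    """
    return max(floor, importance * (1 - daily_decay) ** days_since_access)
```

Running decay before consolidation also helps: merged memories inherit scores that already reflect how recently their sources were used.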
### Context Window Management

With limited LLM context windows:

- Prioritize high-importance, high-relevance memories
- Summarize older conversations rather than including the full history
- Use hierarchical memory (recent details, older summaries)
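Prioritization under a budget can be done greedily: take memories in score order until the token budget for the memory section of the prompt is spent. The ~4 characters-per-token estimate here is a rough heuristic; a real system would use the model's tokenizer.

```python
def select_memories(scored: list[tuple[float, str]],
                    budget_tokens: int = 1000) -> list[str]:
    """Greedily fill a token budget with the highest-scoring memories."""
    selected, used = [], 0
    for score, text in sorted(scored, key=lambda p: p[0], reverse=True):
        cost = max(1, len(text) // 4)  # crude chars-to-tokens estimate
        if used + cost > budget_tokens:
            continue  # skip; a smaller memory later may still fit
        selected.append(text)
        used += cost
    return selected
```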
## Scaling Considerations

**Horizontal scaling**:

- Stateless application servers behind a load balancer
- A Qdrant cluster for vector storage
- Redis for session state and caching

**Performance optimizations**:

- Cache frequent memory queries
- Batch memory storage operations
- Async memory extraction (don't block the response)
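Async extraction means returning the response to the user immediately and running the extraction step as a background task. A bare-`asyncio` sketch of the pattern (in a real service you would likely use your web framework's background tasks or a task queue; the stubs and names here are illustrative):

```python
import asyncio

EXTRACTED: list[str] = []                      # stand-in for the memory store
BACKGROUND_TASKS: set[asyncio.Task] = set()    # keep refs so tasks aren't GC'd

async def extract_and_store(user_id: str, message: str, response: str) -> None:
    await asyncio.sleep(0)  # stand-in for extraction LLM call + vector upsert
    EXTRACTED.append(f"{user_id}: {message}")

async def handle_turn(user_id: str, message: str) -> str:
    response = f"(reply to: {message})"  # the only part the user waits on
    task = asyncio.create_task(extract_and_store(user_id, message, response))
    BACKGROUND_TASKS.add(task)
    task.add_done_callback(BACKGROUND_TASKS.discard)
    return response  # returned without awaiting the extraction task
```

Holding a reference to each task matters: the event loop only keeps weak references, so an un-referenced background task can be garbage-collected mid-flight.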
**Cost management**:

- Cap the number of memories per user
- Compress/summarize old memories
- Use tiered storage for archived memories
## Security & Privacy

- Encrypt memories at rest
- User-controlled data deletion
- Audit logs for memory access
- An option for users to disable memory entirely
- Clear data retention policies