## Overview
This architecture describes a production-ready SaaS platform where multiple users interact with AI agents that remember context across sessions. Each user has their own isolated memory space, enabling personalized experiences while maintaining data privacy and security.
The core value proposition: agents that truly know your users — learning their preferences, recalling past conversations, and building context over time.
## System Components

### User Layer

Users authenticate through your standard auth system (OAuth, email/password, SSO). Each user gets:

- A unique tenant ID for memory isolation
- Personal preference settings that guide agent behavior
- Access to their conversation history and stored memories
### Application Layer

The frontend provides interfaces for:

- **Chat Interface**: real-time conversation with the AI agent
- **Memory Browser**: lets users view, search, and manage what the agent remembers
- **Settings**: control over memory retention, agent personality, and privacy preferences

The backend handles:

- **Session Management**: tracking active conversations and context windows
- **Memory Operations**: storing, retrieving, and updating memories
- **Agent Orchestration**: managing the flow between user input, memory retrieval, LLM calls, and responses
### Memory Layer

The vector database (Qdrant) serves as the persistent memory store:

- **Collection per Tenant**: each user/organization gets isolated collections
- **Memory Types**: store different kinds of memories (facts, preferences, conversation summaries, entities)
- **Metadata Filtering**: query memories by type, timestamp, importance, or topic
### Agent Layer

The AI agent combines:

- **Context Assembly**: pull relevant memories before each LLM call
- **Response Generation**: the LLM produces responses informed by user history
- **Memory Extraction**: after each response, extract and store new memories
## Data Flow

```
   User Message
         │
         ▼
┌─────────────────┐
│ Auth & Session  │ ── Validate user, load session
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Retrieval│ ── Query Qdrant for relevant memories
└────────┬────────┘    (filter by user_id, recency, relevance)
         │
         ▼
┌─────────────────┐
│ Context Assembly│ ── Combine: system prompt + memories + conversation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LLM Call     │ ── Generate response with full context
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Storage  │ ── Extract and store new memories from conversation
└────────┬────────┘
         │
         ▼
   User Response
```
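The flow above can be sketched as a single request handler. This is a minimal, illustrative version: the stores are in-memory dicts, the "retrieval" is naive keyword overlap standing in for a Qdrant similarity search, and the LLM is a stub — none of these names come from a real framework.

```python
# Minimal sketch of the request pipeline: auth/session -> memory
# retrieval -> context assembly -> LLM call -> memory storage.
# All stores and the "LLM" are stubs; names are illustrative.

SESSIONS: dict[str, list[str]] = {}   # user_id -> conversation turns
MEMORIES: dict[str, list[str]] = {}   # user_id -> stored memory texts

def retrieve_memories(user_id: str, message: str, k: int = 5) -> list[str]:
    # Stand-in for a vector similarity search: rank by word overlap.
    words = set(message.lower().split())
    scored = [(len(words & set(m.lower().split())), m)
              for m in MEMORIES.get(user_id, [])]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:k] if score > 0]

def call_llm(prompt: str) -> str:
    # Stub: a real system would call the model provider here.
    return f"(model response to {len(prompt)} chars of context)"

def handle_message(user_id: str, message: str) -> str:
    session = SESSIONS.setdefault(user_id, [])         # Auth & Session
    memories = retrieve_memories(user_id, message)     # Memory Retrieval
    prompt = "\n".join(["You are a helpful agent.",    # Context Assembly
                        *memories, *session, message])
    response = call_llm(prompt)                        # LLM Call
    session.extend([message, response])
    MEMORIES.setdefault(user_id, []).append(message)   # Memory Storage
    return response
```

The key structural point is that retrieval happens before the LLM call and storage after it, so each turn both consumes and produces memories.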
## Memory Schema

Structure memories with rich metadata for effective retrieval:

```
Memory Object:
├── id: unique identifier
├── user_id: tenant isolation
├── content: the actual memory text
├── embedding: vector representation
├── type: fact | preference | summary | entity | event
├── importance: 0.0 - 1.0 score
├── source: conversation_id or "user_input"
├── created_at: timestamp
├── last_accessed: for recency weighting
└── metadata: flexible JSON for additional context
```
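One possible in-code shape for this object is a dataclass whose fields mirror the schema. The `Memory` class itself is an assumption of this sketch, not an existing library type; the embedding would come from your embedding model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid

@dataclass
class Memory:
    user_id: str                # tenant isolation
    content: str                # the actual memory text
    embedding: list[float]      # vector representation
    type: str = "fact"          # fact | preference | summary | entity | event
    importance: float = 0.5     # 0.0 - 1.0 score
    source: str = "user_input"  # conversation_id or "user_input"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    last_accessed: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict[str, Any] = field(default_factory=dict)  # flexible extras
```

When upserting into Qdrant, everything except `embedding` would travel in the point's payload, so the same fields stay available for metadata filtering.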
## Multi-Tenancy Strategy

**Collection-per-tenant** for strong isolation:

- Each user/organization gets its own Qdrant collection
- No risk of cross-tenant data leakage
- Easy to delete all of a user's data (GDPR compliance)
- Independent scaling per tenant

**Alternative: shared collection with filtering**

- A single collection with `user_id` in the payload
- More efficient for many small tenants
- Requires careful payload indexing
- Apply Qdrant's filtering on every query
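In the shared-collection variant, the easiest way to guarantee the `user_id` filter is never forgotten is to build it through one helper. The sketch below returns the JSON structure Qdrant's filter DSL uses (`must` is an AND of conditions); with the Python client you would build the equivalent `models.Filter` object instead. The `tenant_filter` name is ours, not a library function.

```python
from typing import Optional

def tenant_filter(user_id: str, memory_type: Optional[str] = None) -> dict:
    """Build a filter clause scoping every query to one tenant.

    Mirrors Qdrant's JSON filter format; pass the result as the
    query filter on every search against the shared collection.
    """
    must = [{"key": "user_id", "match": {"value": user_id}}]
    if memory_type is not None:
        # Optional extra condition, e.g. only "preference" memories.
        must.append({"key": "type", "match": {"value": memory_type}})
    return {"must": must}
```

Pair this with a payload index on `user_id` (and `type`) so filtered searches stay fast as the collection grows.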
## Key Design Decisions

### Memory Retrieval Strategy

Pull memories based on:

- **Semantic similarity** to the current message
- **Recency** weighting for fresh context
- **Importance** scores for critical facts
- **Type filtering** based on query intent
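One way to combine the first three signals is a weighted score with an exponential recency term. This is a hedged sketch: the weights and half-life below are illustrative starting points, not tuned values.

```python
import math
from datetime import datetime

def memory_score(similarity: float, importance: float,
                 last_accessed: datetime, now: datetime,
                 half_life_days: float = 30.0,
                 weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Blend semantic similarity, recency, and importance into one rank."""
    age_days = (now - last_accessed).total_seconds() / 86400
    # Recency is 1.0 for a just-touched memory, 0.5 at the half-life.
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    w_sim, w_rec, w_imp = weights
    return w_sim * similarity + w_rec * recency + w_imp * importance
```

Type filtering is handled separately, as a hard filter on the vector query rather than a score component, since a "preference" question should never surface "event" memories just because they score well.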
### Memory Lifecycle

- **Creation**: extract memories after meaningful exchanges
- **Consolidation**: periodically merge similar memories
- **Decay**: reduce the importance of unused memories over time
- **Deletion**: user-controlled removal, plus automatic cleanup of low-value memories
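The decay step can be as simple as an exponential discount applied by a periodic maintenance job; the daily rate and floor below are illustrative, not recommended constants.

```python
def decay_importance(importance: float, days_since_access: float,
                     daily_decay: float = 0.01, floor: float = 0.05) -> float:
    """Discount a memory's importance by its idle time.

    A floor keeps long-idle memories retrievable (and distinguishable
    from deleted ones); the cleanup pass can then drop anything sitting
    at the floor for too long.
    """
    return max(floor, importance * (1 - daily_decay) ** days_since_access)
```

Running decay before consolidation also helps: merged memories inherit scores that already reflect how recently their sources were used.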
### Context Window Management

With limited LLM context windows:

- Prioritize high-importance, high-relevance memories
- Summarize older conversations rather than including the full history
- Use hierarchical memory (recent details, older summaries)
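Prioritization under a budget can be done greedily: take memories in score order until the token budget for the memory section of the prompt is spent. The ~4 characters-per-token estimate here is a rough heuristic; a real system would use the model's tokenizer.

```python
def select_memories(scored: list[tuple[float, str]],
                    budget_tokens: int = 1000) -> list[str]:
    """Greedily fill a token budget with the highest-scoring memories."""
    selected, used = [], 0
    for score, text in sorted(scored, key=lambda p: p[0], reverse=True):
        cost = max(1, len(text) // 4)  # crude chars-to-tokens estimate
        if used + cost > budget_tokens:
            continue  # skip; a smaller memory later may still fit
        selected.append(text)
        used += cost
    return selected
```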
## Scaling Considerations

**Horizontal scaling**:

- Stateless application servers behind a load balancer
- A Qdrant cluster for vector storage
- Redis for session state and caching

**Performance optimizations**:

- Cache frequent memory queries
- Batch memory storage operations
- Async memory extraction (don't block the response)
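Async extraction means returning the response to the user immediately and running the extraction step as a background task. A bare-`asyncio` sketch of the pattern (in a real service you would likely use your web framework's background tasks or a task queue; the stubs and names here are illustrative):

```python
import asyncio

EXTRACTED: list[str] = []                      # stand-in for the memory store
BACKGROUND_TASKS: set[asyncio.Task] = set()    # keep refs so tasks aren't GC'd

async def extract_and_store(user_id: str, message: str, response: str) -> None:
    await asyncio.sleep(0)  # stand-in for extraction LLM call + vector upsert
    EXTRACTED.append(f"{user_id}: {message}")

async def handle_turn(user_id: str, message: str) -> str:
    response = f"(reply to: {message})"  # the only part the user waits on
    task = asyncio.create_task(extract_and_store(user_id, message, response))
    BACKGROUND_TASKS.add(task)
    task.add_done_callback(BACKGROUND_TASKS.discard)
    return response  # returned without awaiting the extraction task
```

Holding a reference to each task matters: the event loop only keeps weak references, so an un-referenced background task can be garbage-collected mid-flight.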
**Cost management**:

- Cap the number of memories per user
- Compress/summarize old memories
- Use tiered storage for archived memories
## Security & Privacy

- Encrypt memories at rest
- User-controlled data deletion
- Audit logs for memory access
- An option for users to disable memory entirely
- Clear data retention policies