# AI Agent SaaS with User Memory

Multi-tenant AI agent platform with persistent user memory

Tags: advanced, multi-tenant, user-memory, production, scalable

## Overview

This architecture describes a production-ready SaaS platform where multiple users interact with AI agents that remember context across sessions. Each user has their own isolated memory space, enabling personalized experiences while maintaining data privacy and security.

The core value proposition: agents that truly know your users, learning their preferences, recalling past conversations, and building context over time.

## System Components

### User Layer

Users authenticate through your standard auth system (OAuth, email/password, SSO). Each user gets:

- A unique tenant ID for memory isolation
- Personal preference settings that guide agent behavior
- Access to their conversation history and stored memories

### Application Layer

The frontend provides interfaces for:

- **Chat Interface**: Real-time conversation with the AI agent
- **Memory Browser**: Users can view, search, and manage what the agent remembers
- **Settings**: Control over memory retention, agent personality, privacy preferences

The backend handles:

- **Session Management**: Tracking active conversations and context windows
- **Memory Operations**: Storing, retrieving, and updating memories
- **Agent Orchestration**: Managing the flow between user input, memory retrieval, LLM calls, and responses
### Memory Layer

The vector database (Qdrant) serves as the persistent memory store:

- **Collection per Tenant**: Each user/organization gets isolated collections
- **Memory Types**: Store different kinds of memories (facts, preferences, conversation summaries, entities)
- **Metadata Filtering**: Query memories by type, timestamp, importance, topic

### Agent Layer

The AI agent combines:

- **Context Assembly**: Pull relevant memories before each LLM call
- **Response Generation**: LLM produces responses informed by user history
- **Memory Extraction**: After responses, extract and store new memories

## Data Flow

```
User Message
         │
         ▼
┌─────────────────┐
│ Auth & Session  │ ── Validate user, load session
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Retrieval│ ── Query Qdrant for relevant memories
└────────┬────────┘    (filter by user_id, recency, relevance)
         │
         ▼
┌─────────────────┐
│ Context Assembly│ ── Combine: system prompt + memories + conversation
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LLM Call     │ ── Generate response with full context
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Memory Storage  │ ── Extract and store new memories from conversation
└────────┬────────┘
         │
         ▼
User Response
```
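
This flow maps naturally onto a single request handler. Below is a minimal sketch of one retrieval-and-respond turn using the `qdrant-client` Python library, assuming the shared `memories` collection described later; `embed` and `call_llm` are injected stand-ins for your embedding model and LLM client, and the memory-storage step is deferred (see the async pattern under Scaling Considerations).

```python
from typing import Callable, List

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

SYSTEM_PROMPT = "You are a helpful assistant with long-term user memory."

def handle_message(
    client: QdrantClient,
    user_id: str,
    message: str,
    embed: Callable[[str], List[float]],  # stand-in for your embedding model
    call_llm: Callable[[str], str],       # stand-in for your LLM client
) -> str:
    """One conversational turn: retrieve memories, assemble context, respond."""
    # Memory Retrieval -- shared-collection variant, always filtered by user_id
    hits = client.search(
        collection_name="memories",
        query_vector=embed(message),
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=5,
    )

    # Context Assembly -- system prompt + retrieved memories + current message
    memory_block = "\n".join(f"- {h.payload['content']}" for h in hits)
    prompt = (
        f"{SYSTEM_PROMPT}\n\n"
        f"Known about this user:\n{memory_block}\n\n"
        f"User: {message}"
    )

    # LLM Call -- memory extraction and storage would follow asynchronously
    return call_llm(prompt)
```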

## Memory Schema

Structure memories with rich metadata for effective retrieval:

```
Memory Object:
├── id: unique identifier
├── user_id: tenant isolation
├── content: the actual memory text
├── embedding: vector representation
├── type: fact | preference | summary | entity | event
├── importance: 0.0 - 1.0 score
├── source: conversation_id or "user_input"
├── created_at: timestamp
├── last_accessed: for recency weighting
└── metadata: flexible JSON for additional context
```
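
In code, this schema can be mirrored by a small dataclass that serializes into a Qdrant point, with everything except the vector stored as payload so it stays filterable. A minimal sketch; the field defaults are illustrative assumptions:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List

from qdrant_client.models import PointStruct

@dataclass
class Memory:
    user_id: str
    content: str
    embedding: List[float]
    type: str                   # fact | preference | summary | entity | event
    importance: float = 0.5     # 0.0 - 1.0 score
    source: str = "user_input"  # or a conversation_id
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_point(self) -> PointStruct:
        # Everything except id and embedding goes into the payload, so Qdrant
        # can filter on user_id, type, importance, and timestamps.
        payload = {k: v for k, v in vars(self).items() if k not in ("id", "embedding")}
        return PointStruct(id=self.id, vector=self.embedding, payload=payload)
```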

## Multi-Tenancy Strategy

**Collection-per-tenant** for strong isolation:

- Each user/organization gets their own Qdrant collection
- No risk of cross-tenant data leakage
- Easy to delete all user data (GDPR compliance)
- Independent scaling per tenant
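
Provisioning under this model is one collection per tenant at signup. A sketch, assuming 1536-dimensional embeddings and a `memories_<tenant>` naming convention (both assumptions, not prescriptions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed deployment URL

def create_tenant_collection(tenant_id: str, dim: int = 1536) -> None:
    """Provision an isolated collection for a new tenant."""
    client.create_collection(
        collection_name=f"memories_{tenant_id}",
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )
```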
**Alternative: Shared collection with filtering**

- Single collection with `user_id` in payload
- More efficient for many small tenants
- Requires careful payload indexing
- Use Qdrant's filtering on every query, as in the sketch below
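
Under the shared model, the `user_id` filter is the isolation boundary, so index the field and route every search through a helper that applies it. A sketch with `qdrant-client`:

```python
from typing import List

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Index the tenant field once so filtered searches stay fast as data grows.
client.create_payload_index(
    collection_name="memories",
    field_name="user_id",
    field_schema="keyword",
)

def search_tenant(query_vector: List[float], user_id: str, limit: int = 10):
    """Every search carries the tenant filter -- never expose an unfiltered path."""
    return client.search(
        collection_name="memories",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=limit,
    )
```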
## Key Design Decisions

### Memory Retrieval Strategy

Pull memories based on several signals (a combined-scoring sketch follows the list):

- **Semantic similarity** to the current message
- **Recency** weighting for fresh context
- **Importance** scores for critical facts
- **Type filtering** based on query intent
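
One way to blend the first three signals is to over-fetch by similarity (say, the top 20 from Qdrant) and re-rank with a weighted score before keeping the top few. The weights and half-life below are illustrative assumptions to tune:

```python
import math
import time

def combined_score(
    similarity: float,      # cosine score returned by the vector search
    importance: float,      # 0.0 - 1.0 from the memory payload
    last_accessed: float,   # unix timestamp from the memory payload
    half_life_days: float = 30.0,
    w_sim: float = 0.6,
    w_imp: float = 0.25,
    w_rec: float = 0.15,
) -> float:
    """Weighted blend: recency decays exponentially, halving every half_life_days."""
    age_days = (time.time() - last_accessed) / 86_400
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return w_sim * similarity + w_imp * importance + w_rec * recency
```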
### Memory Lifecycle

- **Creation**: Extract memories after meaningful exchanges
- **Consolidation**: Periodically merge similar memories
- **Decay**: Reduce importance of unused memories over time (see the sketch below)
- **Deletion**: User-controlled removal, automatic cleanup of low-value memories
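
Decay can be as simple as a periodic job that scans each tenant's memories and writes back a reduced importance, flagging anything at the floor as a cleanup candidate. A sketch of the core function; the rate and floor are assumptions:

```python
import math

def decay_importance(
    importance: float,
    days_since_access: float,
    decay_rate: float = 0.01,  # per-day decay constant (assumption)
    floor: float = 0.05,       # at or below this, the memory is a deletion candidate
) -> float:
    """Exponentially reduce the importance of memories that go unused."""
    return max(importance * math.exp(-decay_rate * days_since_access), floor)
```

In Qdrant, the write-back step can use `set_payload` on the affected points.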
### Context Window Management

With limited LLM context windows:

- Prioritize high-importance, high-relevance memories (a budget-packing sketch follows this list)
- Summarize older conversations rather than including full history
- Use hierarchical memory (recent details, older summaries)
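
Prioritization then becomes a packing problem: given memories already ranked by combined score, greedily fill a fixed token budget. A minimal sketch; `count_tokens` is a stand-in for your tokenizer (e.g. tiktoken):

```python
from typing import Callable, Dict, List

def pack_context(
    ranked_memories: List[Dict],         # pre-sorted, best first
    token_budget: int,
    count_tokens: Callable[[str], int],  # stand-in for your tokenizer
) -> List[Dict]:
    """Greedy fill: keep the best memories that fit the remaining budget."""
    picked, used = [], 0
    for memory in ranked_memories:
        cost = count_tokens(memory["content"])
        if used + cost > token_budget:
            continue  # skip; a shorter, lower-ranked memory may still fit
        picked.append(memory)
        used += cost
    return picked
```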
## Scaling Considerations

**Horizontal scaling**:

- Stateless application servers behind a load balancer
- Qdrant cluster for vector storage
- Redis for session state and caching

**Performance optimizations**:

- Cache frequent memory queries
- Batch memory storage operations
- Async memory extraction (don't block the response; see the sketch below)
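
The async point deserves a sketch: return the reply as soon as the LLM produces it and schedule extraction separately. The two stub coroutines are hypothetical stand-ins for your pipeline; in production a task queue (e.g. Celery) is sturdier than in-process fire-and-forget:

```python
import asyncio

_background_tasks: set = set()  # retain refs so pending tasks aren't garbage-collected

async def generate_reply(user_id: str, message: str) -> str:
    ...  # hypothetical stand-in: retrieval + context assembly + LLM call

async def extract_and_store(user_id: str, message: str, reply: str) -> None:
    ...  # hypothetical stand-in: LLM-based extraction + Qdrant upsert

async def respond(user_id: str, message: str) -> str:
    reply = await generate_reply(user_id, message)
    # Fire-and-forget: the user never waits on memory extraction.
    task = asyncio.create_task(extract_and_store(user_id, message, reply))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return reply
```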
**Cost management**:

- Limit memories per user
- Compress/summarize old memories
- Tiered storage for archived memories

## Security & Privacy

- Encrypt memories at rest
- User-controlled data deletion (a deletion sketch follows this list)
- Audit logs for memory access
- Option for users to disable memory entirely
- Clear data retention policies
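
User-controlled deletion is where the multi-tenancy choice pays off. A sketch of both variants with `qdrant-client`, reusing the collection naming assumed earlier:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchValue

client = QdrantClient(url="http://localhost:6333")

def delete_user_data(user_id: str) -> None:
    """Collection-per-tenant: dropping the collection removes everything."""
    client.delete_collection(collection_name=f"memories_{user_id}")

def delete_user_data_shared(user_id: str) -> None:
    """Shared collection: delete every point carrying this tenant's id."""
    client.delete(
        collection_name="memories",
        points_selector=FilterSelector(
            filter=Filter(
                must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
            )
        ),
    )
```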