Memory Patterns for Multi-Session Agents: File-Backed Context
Learn file-backed memory patterns for multi-session AI agents: indexing, summaries, eviction. Build agents that remember across user sessions.
Table of Contents
- Why Multi-Session Memory Matters
- The Core Challenge: Token Limits and Persistent State
- File-Backed Memory Architecture
- Indexing Strategies for Fast Retrieval
- Summary Rotation and Compression
- Stale-Data Eviction Patterns
- Implementation Patterns and Code Examples
- Scaling Multi-Session Memory
- Security and Compliance Considerations
- Measuring and Optimising Memory Performance
- Next Steps: Building Production-Ready Agents
Why Multi-Session Memory Matters
AI agents that span multiple user sessions face a fundamental problem: they need to remember context, decisions, and outcomes across disconnected interactions without exploding token budgets or losing critical information. A customer service agent handling a ticket over three days. A financial analysis agent running queries across weeks of data. A content moderation system that needs to flag repeat offenders. None of these work well if the agent resets its memory after each session.
The difference between a frustrating agent and a useful one often comes down to memory. When an agent remembers that a user prefers email over Slack, that they’ve already provided their account number, or that their last issue was similar to this one, interactions become faster, cheaper, and more human-feeling. That’s not magic—it’s engineering.
At PADISO, we’ve built dozens of agentic AI systems that need to maintain context across sessions whilst operating within strict token budgets. The patterns we’ve learned—indexing, summary rotation, and intelligent eviction—apply whether you’re building a customer support agent, a data analysis system, or an autonomous workflow orchestrator.
This guide walks through the practical patterns, trade-offs, and implementation strategies that let you build agents that genuinely remember.
The Core Challenge: Token Limits and Persistent State
Why File-Backed Memory Instead of Context Windows?
Large language models have fixed context windows. GPT-4 Turbo and GPT-4o offer 128K tokens; Claude 3.5 Sonnet offers 200K. That sounds generous until you realise that a typical system prompt, user conversation, retrieval results, tool definitions, and reasoning chain can easily consume 30–50K tokens. Once you’re past the halfway point, adding historical context becomes expensive.
File-backed memory solves this by keeping most historical data on disk (or in a database), retrieving only what’s relevant for the current session. Instead of stuffing every previous interaction into the context window, you:
- Index interactions so you can search them efficiently
- Summarise old sessions to preserve decision logic without storing raw transcripts
- Evict stale or irrelevant data to keep the working set small
This isn’t new. AI Agent Memory Architecture describes four typed memory layers—in-context, persisted state, external, and episodic—each with distinct lifetimes and access patterns. The trick is implementing them in a way that’s fast, cheap, and doesn’t require constant re-training.
The Token Economics
Let’s do the maths. Suppose you run 100 agent sessions per day, each lasting 10 turns (user message + agent response). That’s 1,000 interactions daily. If you naively included the full history in every subsequent session:
- A 10-turn session history = ~2,000 tokens
- Multiply by 100 sessions = 200,000 tokens per day just for history
- At $0.003 per 1K input tokens (typical LLM pricing), that’s roughly $0.60 per day, or about $18 a month, in history overhead alone, and the figure scales linearly with session volume and with how much history each session drags along
With file-backed memory and smart retrieval, you might retrieve only the 3–5 most relevant prior interactions (~500 tokens) plus a summary of the rest (~300 tokens). Same quality, roughly 60% less history per call, and the saving widens as users accumulate longer histories. For operators running high-volume agents, this compounds quickly.
File-Backed Memory Architecture
The Three-Layer Model
A production agent memory system typically has three layers:
Layer 1: Session Buffer (In-Memory) The current conversation lives in memory. This is your context window. It includes the system prompt, current turn, any tool outputs, and a pointer to external memory.
Layer 2: Persistent Store (File/Database) All completed sessions, interactions, and derived summaries live here. This could be SQLite, PostgreSQL, MongoDB, or even S3 + DynamoDB. The key is that it’s durable and queryable.
Layer 3: Index/Cache (Retrieval Layer) A fast lookup mechanism—vector embeddings, BM25 search, or keyword indices—that lets you find relevant prior context without scanning the entire history.
A Concrete Schema
Here’s a minimal but functional schema for multi-session agent memory:
Sessions Table:
- session_id (UUID, primary key)
- user_id (string)
- agent_id (string)
- created_at (timestamp)
- closed_at (timestamp)
- summary (text, auto-generated after session ends)
- metadata (JSON: tags, outcome, cost, duration)
Interactions Table:
- interaction_id (UUID)
- session_id (foreign key)
- turn_number (int)
- user_input (text)
- agent_response (text)
- tools_called (JSON array)
- created_at (timestamp)
- embedding (vector, for semantic search)
- is_summarised (boolean, for eviction tracking)
- is_archived (boolean, soft-delete flag set at eviction)
Memory Index Table:
- index_id (UUID)
- session_id (foreign key)
- index_type (enum: 'keyword', 'entity', 'decision')
- key (string)
- value (text or JSON)
- relevance_score (float, for ranking)
- created_at (timestamp)
- last_accessed_at (timestamp, for eviction)
This isn’t fancy, but it’s battle-tested. The key insight: separate raw interactions from derived indices. Raw data is immutable; indices are rebuilt as needed.
Why File-Backed Over Vector-Only?
You might ask: why not just embed everything and use vector search? Vector search is powerful for semantic retrieval, but it has weaknesses:
- Embedding drift: An embedding of “user wants refund” from six months ago might not rank high when the current query is “has this user asked for refunds before?”
- Temporal reasoning: Vector similarity doesn’t understand time. A recent interaction should often rank higher than an old one, even if both are semantically similar.
- Exact matching: If the agent needs to know “did user provide their account number on 2024-11-15?”, a keyword or SQL query is faster and more reliable than embedding search.
The best systems use hybrid retrieval: keyword indices for exact facts, embeddings for semantic similarity, and metadata filters (date ranges, user tags) for temporal reasoning. File-backed storage makes all three possible.
Indexing Strategies for Fast Retrieval
Keyword Indexing: The Fast Path
For any agent that needs to answer “has the user mentioned X?” or “what was the outcome of task Y?”, keyword indexing is essential. This isn’t full-text search; it’s selective extraction of key entities and decisions.
After each session, run a lightweight extraction step:
For each interaction:
1. Extract entities (names, dates, account numbers, product IDs)
2. Extract decisions ("user approved", "waiting for info", "issue resolved")
3. Extract topics ("billing", "technical support", "refund")
4. Store in Memory Index table with relevance_score
When a new session starts, you can query: “What topics has this user discussed?” or “Has this user been marked as high-risk?” in milliseconds, without touching embeddings or LLM calls.
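Here is a minimal sketch of that extraction step; the regexes, phrase checks, and topic keywords are illustrative placeholders, and extract_indices is a hypothetical helper that feeds the index_session method shown later in Pattern 1:

import re

# Illustrative post-session extraction; tune the patterns and keyword lists for your domain.
TOPIC_KEYWORDS = {
    "billing": ["invoice", "charge", "billing"],
    "refund": ["refund", "money back"],
    "technical support": ["error", "bug", "crash"],
}

def extract_indices(transcript: str) -> tuple[dict, dict]:
    entities, decisions = {}, {}
    text = transcript.lower()
    # 1. Entities: account numbers, product IDs (rough patterns)
    account = re.search(r"\bACC-\d+\b", transcript)
    if account:
        entities["account_id"] = account.group(0)
    # 2. Decisions: simple phrase matching
    if "issue resolved" in text:
        decisions["resolution"] = "issue resolved"
    if "waiting for info" in text:
        decisions["status"] = "waiting for info"
    # 3. Topics
    topics = [t for t, words in TOPIC_KEYWORDS.items() if any(w in text for w in words)]
    if topics:
        entities["topics"] = ", ".join(topics)
    return entities, decisions

# Usage: after a session closes, store the indices for fast lookup later,
# e.g. with the AgentMemory.index_session method shown later in Pattern 1:
# entities, decisions = extract_indices(full_transcript)
# memory.index_session(session_id, entities, decisions)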
Semantic Indexing with Embeddings
For queries that require understanding intent—“What was this user frustrated about?” or “Has the agent solved a similar problem before?”—embeddings shine.
The pattern:
- Embed key turns: Not every interaction needs embedding. Embed the agent’s summary of each turn, or only turns that changed state (decisions, escalations, resolutions).
- Store with metadata: Each embedding should carry a session_id, timestamp, and a human-readable label (“user frustrated”, “issue resolved”, etc.).
- Retrieve with filters: When you need prior context, search embeddings but filter by date range, user_id, or topic tag first. This reduces the search space and improves ranking.
For example, a support agent handling a new ticket might query: “Find 5 most similar issues this user reported in the last 90 days.” This combines semantic search (embeddings) with temporal filtering (date range).
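A rough sketch of that filtered retrieval, in plain Python so it stays self-contained; embedding generation is out of scope here, and the record fields (user_id, created_at, embedding) mirror the metadata described above:

from datetime import datetime, timedelta
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def filtered_semantic_search(records, query_embedding, user_id, days=90, top_k=5):
    """records: iterable of dicts with user_id, created_at, embedding, label keys."""
    cutoff = datetime.now() - timedelta(days=days)
    # Step 1: metadata filters shrink the search space before any similarity maths
    candidates = [r for r in records
                  if r["user_id"] == user_id and r["created_at"] >= cutoff]
    # Step 2: rank the survivors by cosine similarity to the query
    ranked = sorted(candidates,
                    key=lambda r: cosine(r["embedding"], query_embedding),
                    reverse=True)
    return ranked[:top_k]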
Entity and Decision Graphs
For complex workflows, a lightweight graph can track relationships. If your agent is managing a deal, it might track:
Entities:
- deal_id: "D-2024-001"
- account_id: "ACC-123"
- contact_name: "Alice"
Relationships:
- deal_id --[owner]--> contact_name
- deal_id --[status]--> "negotiation"
- deal_id --[last_update]--> 2024-11-20
Decisions:
- deal_id --[next_action]--> "send contract"
- deal_id --[deadline]--> 2024-11-27
When the agent resumes this deal in a later session, it can instantly know the state, owner, and next step without re-reading the entire history. This is especially powerful for AI & Agents Automation systems that orchestrate long-running workflows.
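You don’t need a graph database for this; a minimal in-memory triple store like the sketch below (class and field names are illustrative) is often enough to resume a workflow:

from collections import defaultdict

class TripleStore:
    """Tiny in-memory (subject, predicate, object) store for workflow state."""
    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._by_subject[subject].append((predicate, obj))

    def state_of(self, subject: str) -> dict:
        # Latest value wins for each predicate, so re-adding a fact updates the state
        return {p: o for p, o in self._by_subject[subject]}

store = TripleStore()
store.add("D-2024-001", "owner", "Alice")
store.add("D-2024-001", "status", "negotiation")
store.add("D-2024-001", "next_action", "send contract")
store.add("D-2024-001", "deadline", "2024-11-27")

# When the agent resumes the deal, one lookup restores the working state
print(store.state_of("D-2024-001"))
# {'owner': 'Alice', 'status': 'negotiation', 'next_action': 'send contract', 'deadline': '2024-11-27'}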
Summary Rotation and Compression
Why Summarise?
A session that lasted 20 turns might contain 5,000 tokens of raw conversation. A well-written summary might capture the essence in 300 tokens. Over time, as sessions age, you want to compress them further to keep the working set small.
The trick is doing this without losing critical information. A summary of a customer support ticket needs to preserve:
- The problem (what the user asked for)
- The solution (what was done)
- The outcome (was it resolved? is follow-up needed?)
- Decisions (escalations, refunds, special handling)
- Entities (account number, product, date of issue)
Single-Session Summaries
After a session ends, generate a summary immediately. This is fast and cheap:
Prompt:
"Summarise this support session in 150 tokens. Include:
- User's issue
- Actions taken
- Resolution status
- Any follow-up needed
Session:
[raw interaction history]
Summary:
[LLM generates summary]
Store summary in sessions table.
Cost: roughly $0.001 per session with a small summarisation model (the session transcript in, a few hundred tokens out). Benefit: future sessions can retrieve the summary instead of the raw history.
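Here is a minimal sketch of that step using the OpenAI chat completions API; the model name (gpt-4o-mini), token budget, and the summarise_session name are assumptions rather than recommendations:

from openai import OpenAI

client = OpenAI()

def summarise_session(raw_history: str) -> str:
    """Generate a short summary of a finished session; the prompt mirrors the template above."""
    prompt = (
        "Summarise this support session in 150 tokens. Include:\n"
        "- User's issue\n- Actions taken\n- Resolution status\n- Any follow-up needed\n\n"
        f"Session:\n{raw_history}\n\nSummary:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any small, cheap model works
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return response.choices[0].message.content.strip()

# Store the result in the sessions table, e.g. with the AgentMemory class shown later:
# memory.cursor.execute("UPDATE sessions SET summary = ? WHERE session_id = ?",
#                       (summarise_session(history), session_id))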
Rolling Summaries: Compressing Time
As sessions age, you can compress further. After 30 days, summarise the summaries:
Day 1: Session 1 (20 turns) -> Summary (300 tokens)
Day 2: Session 2 (15 turns) -> Summary (250 tokens)
...
Day 30: Summarise all 30 summaries into 1 monthly summary (500 tokens)
Now a year of interactions with a user is stored as 12 monthly summaries (~6K tokens total) instead of 365 raw sessions (potentially 100K+ tokens). When the agent needs historical context, it retrieves relevant monthly summaries plus the last 2–3 raw sessions.
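As a sketch, the roll-up can be as simple as grouping session summaries by month and compressing each group with the same kind of summarise call used above; the grouping key and token budgets are assumptions:

from collections import defaultdict

def roll_up_monthly(session_summaries, summarise):
    """
    session_summaries: list of (created_at: datetime, summary: str) tuples.
    summarise: callable that compresses a block of text, e.g. an LLM call.
    Returns {(year, month): monthly_summary}.
    """
    by_month = defaultdict(list)
    for created_at, summary in session_summaries:
        by_month[(created_at.year, created_at.month)].append(summary)

    monthly = {}
    for month, summaries in by_month.items():
        joined = "\n\n".join(summaries)
        # Compress ~30 session summaries into one ~500-token monthly summary
        monthly[month] = summarise(joined)
    return monthly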
Hierarchical Summarisation
For high-volume agents, go deeper:
Level 1: Raw interactions (current session, live)
Level 2: Session summaries (last 7 days, ~300 tokens each)
Level 3: Weekly summaries (last 90 days, ~500 tokens each)
Level 4: Monthly summaries (older than 90 days, ~600 tokens each)
When retrieving context for a new session, the agent pulls:
- Level 1: Current session (always)
- Level 2: Most relevant weekly summaries (semantic search)
- Level 3: Monthly summaries if needed (temporal context)
This keeps the active context window small whilst preserving long-term memory.
Stale-Data Eviction Patterns
The Eviction Problem
Your file-backed memory grows indefinitely unless you actively remove or archive old data. But you can’t just delete everything older than 90 days—some data is timeless (a user’s account number, a resolved technical issue that might recur).
Eviction strategies need to balance:
- Storage cost: Keep less, spend less.
- Retrieval quality: Evict the right data, not the useful data.
- Compliance: Some data must be retained for audit trails (see Security and Compliance Considerations).
Time-Based Eviction with Relevance Scoring
A simple but effective pattern:
For each interaction:
1. Assign a relevance_score based on:
- Recency (newer = higher)
- Frequency (how often was this user/topic queried?)
- Outcome (did this lead to a resolution? escalation?)
- User segment (VIP users' data kept longer)
2. Every 30 days, evict interactions where:
- age > 180 days AND relevance_score < 0.3
- OR age > 365 days AND relevance_score < 0.6
3. Before eviction, ensure the interaction is summarised.
This keeps high-value interactions (frequent issues, important users, recent data) whilst pruning low-value noise.
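Here is one way the scoring step might look; the weights, half-life, and field names are illustrative defaults, not tuned values:

from datetime import datetime

def relevance_score(interaction, now=None,
                    recency_half_life_days=30.0,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """
    Combine recency, query frequency, outcome, and user segment into one score in [0, 1].
    interaction: dict with created_at, access_count, led_to_resolution, user_segment keys.
    """
    now = now or datetime.now()
    age_days = (now - interaction["created_at"]).days
    recency = 0.5 ** (age_days / recency_half_life_days)              # exponential decay
    frequency = min(interaction.get("access_count", 0) / 10.0, 1.0)   # cap at 10 lookups
    outcome = 1.0 if interaction.get("led_to_resolution") else 0.3
    segment = 1.0 if interaction.get("user_segment") == "vip" else 0.5
    w_r, w_f, w_o, w_s = weights
    return w_r * recency + w_f * frequency + w_o * outcome + w_s * segment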
Summarise Before Eviction
Never evict raw data without preserving its essence:
Eviction Workflow:
1. Identify candidates for eviction (old, low relevance)
2. Check: is this interaction already summarised?
3. If not, generate a summary and store it
4. Mark raw interaction as "archived" (soft delete)
5. After 30 days, hard delete if no queries reference it
Soft deletes are safer than hard deletes—they let you recover data if needed and satisfy audit requirements.
User-Specific Retention Policies
Different users have different retention needs:
Policy for Free-Tier Users:
- Keep raw interactions for 30 days
- Keep summaries for 90 days
- Hard delete after 180 days
Policy for Paying Customers:
- Keep raw interactions for 90 days
- Keep summaries for 1 year
- Hard delete after 2 years (or per contract)
Policy for High-Risk Users (fraud, compliance):
- Keep everything for 3 years (regulatory requirement)
- Tag for audit review
This is especially important if you’re operating in a regulated industry or handling sensitive data. It also lets you optimise storage costs by tiering retention by user segment.
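These tiers are easier to audit when they live in one configuration object rather than scattered conditionals; a minimal sketch, with tier names and durations mirroring the examples above:

RETENTION_POLICIES = {
    "free":      {"raw_days": 30,   "summary_days": 90,   "hard_delete_days": 180},
    "paying":    {"raw_days": 90,   "summary_days": 365,  "hard_delete_days": 730},
    "high_risk": {"raw_days": 1095, "summary_days": 1095, "hard_delete_days": None},  # keep 3 years, tag for audit
}

def retention_for(user_segment: str) -> dict:
    # Fall back to the most conservative policy when the segment is unknown
    return RETENTION_POLICIES.get(user_segment, RETENTION_POLICIES["high_risk"])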
Monitoring Eviction Health
Track these metrics:
- Eviction rate: How much data are you removing per week?
- Resurrection rate: How often do users ask about evicted data? (High = too aggressive eviction)
- Storage growth: Is your total storage still growing despite eviction? (Indicates new data is arriving faster than old data is leaving)
- Retrieval latency: Does eviction improve query speed? (It should, if you’re removing junk)
Implementation Patterns and Code Examples
Pattern 1: SQLite + JSON for Small Teams
If you’re starting out or running a low-volume agent, SQLite with JSON columns is simple and effective:
import sqlite3
import json
from datetime import datetime, timedelta
class AgentMemory:
    def __init__(self, db_path="agent_memory.db"):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._init_schema()

    def _init_schema(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS sessions (
                session_id TEXT PRIMARY KEY,
                user_id TEXT,
                agent_id TEXT,
                created_at TIMESTAMP,
                closed_at TIMESTAMP,
                summary TEXT,
                metadata JSON
            )
        """)
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS interactions (
                interaction_id TEXT PRIMARY KEY,
                session_id TEXT,
                turn_number INTEGER,
                user_input TEXT,
                agent_response TEXT,
                tools_called JSON,
                created_at TIMESTAMP,
                is_summarised BOOLEAN DEFAULT 0,
                is_archived BOOLEAN DEFAULT 0,
                FOREIGN KEY (session_id) REFERENCES sessions(session_id)
            )
        """)
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS memory_index (
                index_id TEXT PRIMARY KEY,
                session_id TEXT,
                index_type TEXT,
                key TEXT,
                value TEXT,
                relevance_score REAL,
                created_at TIMESTAMP,
                last_accessed_at TIMESTAMP,
                FOREIGN KEY (session_id) REFERENCES sessions(session_id)
            )
        """)
        self.conn.commit()

    def store_interaction(self, session_id, turn_number, user_input, agent_response, tools_called=None):
        interaction_id = f"{session_id}-{turn_number}"
        self.cursor.execute("""
            INSERT INTO interactions
            (interaction_id, session_id, turn_number, user_input, agent_response, tools_called, created_at)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (
            interaction_id, session_id, turn_number, user_input, agent_response,
            json.dumps(tools_called or []), datetime.now()
        ))
        self.conn.commit()

    def retrieve_session_context(self, session_id, max_tokens=2000):
        """Retrieve relevant context for a session, respecting token budget."""
        # Get raw interactions from current session
        self.cursor.execute("""
            SELECT user_input, agent_response FROM interactions
            WHERE session_id = ?
            ORDER BY turn_number DESC
            LIMIT 10
        """, (session_id,))
        raw_interactions = self.cursor.fetchall()
        # Get summaries from prior sessions with same user
        self.cursor.execute("""
            SELECT summary, created_at FROM sessions
            WHERE user_id = (SELECT user_id FROM sessions WHERE session_id = ?)
            AND session_id != ?
            ORDER BY created_at DESC
            LIMIT 5
        """, (session_id, session_id))
        prior_summaries = self.cursor.fetchall()
        return {
            'raw_interactions': raw_interactions,
            'prior_summaries': prior_summaries
        }

    def index_session(self, session_id, entities, decisions):
        """Create keyword indices for fast retrieval."""
        for entity_key, entity_value in entities.items():
            index_id = f"{session_id}-entity-{entity_key}"
            self.cursor.execute("""
                INSERT INTO memory_index
                (index_id, session_id, index_type, key, value, relevance_score, created_at, last_accessed_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                index_id, session_id, 'entity', entity_key, entity_value, 0.8, datetime.now(), datetime.now()
            ))
        for decision_key, decision_value in decisions.items():
            index_id = f"{session_id}-decision-{decision_key}"
            self.cursor.execute("""
                INSERT INTO memory_index
                (index_id, session_id, index_type, key, value, relevance_score, created_at, last_accessed_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                index_id, session_id, 'decision', decision_key, decision_value, 0.9, datetime.now(), datetime.now()
            ))
        self.conn.commit()

    def evict_stale_data(self, days_old=180, relevance_threshold=0.3):
        """Soft-delete old, low-relevance interactions that have already been summarised."""
        cutoff_date = datetime.now() - timedelta(days=days_old)
        # Find candidates: old, already summarised, not yet archived, and whose
        # session never produced an index entry above the relevance threshold
        self.cursor.execute("""
            SELECT i.interaction_id FROM interactions i
            WHERE i.created_at < ?
              AND i.is_summarised = 1
              AND i.is_archived = 0
              AND COALESCE((SELECT MAX(m.relevance_score) FROM memory_index m
                            WHERE m.session_id = i.session_id), 0) < ?
        """, (cutoff_date, relevance_threshold))
        candidates = self.cursor.fetchall()
        for (interaction_id,) in candidates:
            # Soft delete: mark as archived rather than removing the row
            self.cursor.execute(
                "UPDATE interactions SET is_archived = 1 WHERE interaction_id = ?",
                (interaction_id,)
            )
        self.conn.commit()
        return len(candidates)
This gives you a working memory system in under 200 lines of Python. For production, add error handling, logging, and batch operations.
Pattern 2: PostgreSQL + pgvector for Semantic Search
If you need semantic search and scale, PostgreSQL with the pgvector extension is powerful:
import psycopg2
import json
from datetime import datetime
from openai import OpenAI
class SemanticAgentMemory:
    def __init__(self, db_url):
        self.conn = psycopg2.connect(db_url)
        self.cursor = self.conn.cursor()
        self.client = OpenAI()
        self._init_schema()

    def _init_schema(self):
        self.cursor.execute("""
            CREATE EXTENSION IF NOT EXISTS vector;
            CREATE TABLE IF NOT EXISTS sessions (
                session_id UUID PRIMARY KEY,
                user_id TEXT,
                created_at TIMESTAMP,
                summary TEXT
            );
            CREATE TABLE IF NOT EXISTS interactions (
                -- gen_random_uuid() needs PostgreSQL 13+ (or the pgcrypto extension)
                interaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
                session_id UUID REFERENCES sessions(session_id),
                user_input TEXT,
                agent_response TEXT,
                embedding vector(1536),
                created_at TIMESTAMP
            );
            CREATE INDEX IF NOT EXISTS interactions_embedding_idx
                ON interactions USING ivfflat (embedding vector_cosine_ops);
        """)
        self.conn.commit()

    @staticmethod
    def _to_pgvector(embedding):
        # pgvector accepts vectors as text like '[0.1,0.2,...]'; alternatively,
        # register an adapter with pgvector.psycopg2.register_vector
        return "[" + ",".join(str(x) for x in embedding) + "]"

    def store_interaction_with_embedding(self, session_id, user_input, agent_response):
        # Generate embedding
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=f"{user_input} {agent_response}"
        )
        embedding = response.data[0].embedding
        # Store
        self.cursor.execute("""
            INSERT INTO interactions (session_id, user_input, agent_response, embedding, created_at)
            VALUES (%s, %s, %s, %s, %s)
        """, (session_id, user_input, agent_response, self._to_pgvector(embedding), datetime.now()))
        self.conn.commit()

    def retrieve_similar_interactions(self, user_id, query, limit=5):
        # Embed the query
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        )
        query_embedding = self._to_pgvector(response.data[0].embedding)
        # Find similar interactions for this user, ranked by cosine similarity
        self.cursor.execute("""
            SELECT i.user_input, i.agent_response, i.created_at,
                   1 - (i.embedding <=> %s::vector) AS similarity
            FROM interactions i
            JOIN sessions s ON i.session_id = s.session_id
            WHERE s.user_id = %s
            ORDER BY similarity DESC
            LIMIT %s
        """, (query_embedding, user_id, limit))
        return self.cursor.fetchall()
This pattern is more powerful but requires PostgreSQL and embedding generation. Use it when you have 10K+ interactions and need semantic search quality.
Pattern 3: Hybrid Retrieval (Keyword + Semantic)
For best results, combine both:
def hybrid_retrieve(self, session_id, user_id, query, max_tokens=2000):
    """
    Retrieve context using both keyword and semantic search.
    Assumes keyword_search, semantic_search, get_current_session and
    _truncate_to_tokens are implemented on the memory class.
    """
    results = []
    # Step 1: Keyword search (fast, exact)
    keyword_results = self.keyword_search(user_id, query)
    results.extend(keyword_results[:3])  # Top 3 keyword matches
    # Step 2: Semantic search (slower, fuzzy)
    semantic_results = self.semantic_search(user_id, query)
    for result in semantic_results[:5]:
        if result not in results:  # Avoid duplicates
            results.append(result)
    # Step 3: Rank prior context by relevance and recency
    results.sort(key=lambda x: (x['relevance_score'], x['created_at']), reverse=True)
    # Step 4: Current session context always goes first, ahead of ranked history
    current_session = self.get_current_session(session_id)
    results.insert(0, current_session)
    # Step 5: Truncate to token budget
    context = self._truncate_to_tokens(results, max_tokens)
    return context
This gives you fast exact matches (keywords) plus semantic understanding (embeddings) without bloating your context window.
Scaling Multi-Session Memory
From Hundreds to Millions of Sessions
As your agent scales, memory patterns must evolve:
Stage 1: Single SQLite File (0–10K sessions) Works fine. Keep it simple. Add basic eviction after 90 days.
Stage 2: PostgreSQL + Sharding (10K–1M sessions) Shard by user_id or date. Each shard is a separate database. Queries route to the right shard. Eviction runs per-shard in parallel.
Stage 3: Vector Database + Cache (1M+ sessions) Use a dedicated vector database (Pinecone, Weaviate, or Milvus) for embeddings. Use Redis or Memcached for hot summaries. Archive old data to S3 or cold storage.
Distributed Eviction
At scale, eviction becomes a background job:
# Background job, runs hourly; get_all_shards() and chunks() are assumed helpers
import time

def eviction_job():
    shards = get_all_shards()
    for shard in shards:
        candidates = shard.query("""
            SELECT interaction_id FROM interactions
            WHERE created_at < now() - interval '180 days'
            AND relevance_score < 0.3
            LIMIT 10000
        """)
        for batch in chunks(candidates, 1000):
            shard.soft_delete(batch)
            time.sleep(0.1)  # Rate limit to avoid DB overload
This prevents eviction from blocking live queries.
Caching Hot Data
For frequently-accessed summaries, use a cache:
import json
import redis

class CachedMemory:
    def __init__(self, db, cache_ttl_seconds=3600):
        self.db = db
        self.cache = redis.Redis(host='localhost', port=6379)
        self.ttl = cache_ttl_seconds

    def get_session_summary(self, session_id):
        # Try cache first
        cached = self.cache.get(f"summary:{session_id}")
        if cached:
            return json.loads(cached)
        # Fall back to DB
        summary = self.db.get_summary(session_id)
        self.cache.setex(f"summary:{session_id}", self.ttl, json.dumps(summary))
        return summary
This reduces database load by 70–80% for high-traffic agents.
Security and Compliance Considerations
Encryption at Rest and in Transit
Agent memory often contains sensitive data: user account numbers, conversation history, decisions. Encrypt it:
from cryptography.fernet import Fernet
class EncryptedMemory:
    def __init__(self, db, key):
        self.db = db  # underlying storage backend (e.g. AgentMemory)
        self.cipher = Fernet(key)

    def store_encrypted(self, session_id, user_input):
        encrypted = self.cipher.encrypt(user_input.encode())
        self.db.store(session_id, encrypted)

    def retrieve_decrypted(self, session_id):
        encrypted = self.db.retrieve(session_id)
        return self.cipher.decrypt(encrypted).decode()
For compliance (SOC 2, ISO 27001), encryption is non-negotiable. If you’re handling data that requires audit trails, tools like Vanta can help automate compliance monitoring.
Data Retention and Right to Deletion
Under GDPR, users have the right to deletion. Your memory system must support it:
def delete_user_data(self, user_id):
    """
    Hard delete all data for a user. Irreversible.
    """
    # Find all sessions
    sessions = self.db.query("SELECT session_id FROM sessions WHERE user_id = ?", (user_id,))
    for (session_id,) in sessions:
        # Delete interactions
        self.db.execute("DELETE FROM interactions WHERE session_id = ?", (session_id,))
        # Delete indices
        self.db.execute("DELETE FROM memory_index WHERE session_id = ?", (session_id,))
    # Delete sessions
    self.db.execute("DELETE FROM sessions WHERE user_id = ?", (user_id,))
    # Log for audit
    self.audit_log.write(f"Deleted all data for user {user_id} at {datetime.now()}")
For agents handling health, financial, or legal data, retention policies are often mandated by regulation. Document them clearly and enforce them automatically.
Access Control and Audit Logging
Who can read agent memory? Implement role-based access:
class AuditedMemory:
    def retrieve(self, user_id, session_id, requesting_user, requesting_role):
        # Check permissions
        if requesting_role == 'admin':
            allowed = True
        elif requesting_role == 'agent':
            allowed = (requesting_user == user_id)  # Agents see own user's data
        else:
            allowed = False
        if not allowed:
            self.audit_log.write(f"DENIED: {requesting_user} tried to access {session_id}")
            raise PermissionError()
        # Log successful access
        self.audit_log.write(f"ALLOWED: {requesting_user} accessed {session_id}")
        return self.db.retrieve(session_id)
For Security Audit (SOC 2 / ISO 27001), audit logs are essential. Store them separately and immutably (append-only).
Measuring and Optimising Memory Performance
Key Metrics
Track these to understand if your memory system is working:
Retrieval Latency
- How long does it take to fetch relevant context? Target: <100ms for keyword search, <500ms for semantic search.
- If it’s slow, add caching or indices.
Context Quality
- When the agent retrieves prior context, does it actually help? Measure by comparing agent performance with and without memory.
- A simple metric: “Did the agent answer this question correctly?” Score: 1 if yes, 0 if no. Average across 100 sessions.
Storage Efficiency
- How many bytes per session? Aim for <50KB raw + <10KB summary.
- If it’s bloated, your eviction or summarisation isn’t working.
Eviction Accuracy
- What percentage of evicted data is never requested again? Aim for >90%.
- If users frequently ask for deleted data, your eviction is too aggressive.
Token Efficiency
- How many tokens does a typical context window use? Aim for <20% of total budget.
- If you’re using 50%+ on history, tighten your retrieval or summarisation.
Monitoring in Production
import time
from prometheus_client import Histogram, Counter

retrieval_latency = Histogram('memory_retrieval_latency_ms', 'Time to retrieve context')
retrieval_hits = Counter('memory_retrieval_hits', 'Successful retrievals')
retrieval_misses = Counter('memory_retrieval_misses', 'Failed retrievals')

# Method on the memory class; wraps the underlying retrieve() call with metrics
def retrieve_with_metrics(self, session_id):
    start = time.time()
    result = self.retrieve(session_id)
    latency = (time.time() - start) * 1000
    retrieval_latency.observe(latency)
    if result:
        retrieval_hits.inc()
    else:
        retrieval_misses.inc()
    return result
Export metrics to Prometheus or Datadog. Set up alerts: if retrieval latency >500ms or eviction rate >1000 items/hour, page on-call.
Optimisation Checklist
- Keyword indices built for high-cardinality fields (user_id, date, topic)
- Embeddings computed only for significant turns (decisions, escalations)
- Cache layer for top 20% of frequently-accessed summaries
- Eviction runs in background, not on critical path
- Soft deletes used; hard deletes only after 30+ days
- Retrieval queries include date filters and user_id filters (to reduce search space)
- Summaries generated immediately after session close (not lazily)
- Token budget enforced: context window capped at 20–30% of LLM limit
Next Steps: Building Production-Ready Agents
File-backed memory is foundational, but it’s just one piece of a production agent. Here’s what else you need:
1. Tool Integration and State Management
Your agent needs to call external tools (APIs, databases, third-party services) and track their outcomes. Memory should record not just what the agent said, but what it did.
When building agentic AI systems, ensure your memory stores:
- Tool calls (which tool, what parameters)
- Tool outcomes (success/failure, returned data)
- Side effects (database rows updated, emails sent)
This lets the agent reason about causality: “I called the refund API on 2024-11-15, and the refund was processed. If the user asks again, I can say ‘I already processed this on…’.”
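Concretely, a stored tool-call record might look something like this sketch (field names and values are illustrative):

from datetime import datetime, timezone

# Illustrative tool-call record; store it alongside the interaction it belongs to
tool_call_record = {
    "interaction_id": "sess-42-7",
    "tool": "refund_api",
    "parameters": {"order_id": "ORD-991", "amount": 49.00},
    "outcome": {"status": "success", "refund_id": "RF-123"},
    "side_effects": ["refund issued", "confirmation email sent"],
    "called_at": datetime(2024, 11, 15, tzinfo=timezone.utc).isoformat(),
}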
2. Multi-Agent Coordination
If you’re running multiple agents (one for support, one for billing, one for technical issues), they need shared memory. A support agent should be able to ask: “Has the billing agent already issued a refund for this user?”
This requires a shared memory layer accessible to all agents, with role-based access control. Intrinsic Memory Agents research shows that heterogeneous multi-agent systems benefit from structured, shared memory spaces.
3. Feedback Loops and Learning
Memory isn’t static. As your agent runs, it should learn:
- Which retrieval patterns work (which summaries lead to good outcomes?)
- Which eviction policies are optimal (what data is actually useful?)
- Which users have similar patterns (can we cluster them?)
Build feedback mechanisms: after each session, log whether the agent’s retrieved context actually helped. Use this to retrain your retrieval model or adjust eviction policies.
4. Observability and Debugging
When an agent makes a mistake, you need to trace it back to memory. Did it forget relevant context? Did it retrieve the wrong prior session? Did it misinterpret the summary?
Build dashboards that show:
- What context was retrieved for this session?
- Which prior interactions were considered?
- Why was this particular summary chosen?
This is especially important for compliance and audit. If you need to explain a decision to a regulator or customer, you need a full audit trail.
5. Continuous Improvement
Once you have a working memory system, measure it:
- A/B test retrieval strategies (keyword vs. semantic vs. hybrid)
- A/B test summarisation approaches (how much compression hurts quality?)
- A/B test eviction policies (are we deleting too aggressively?)
Run these experiments on 10% of traffic, measure impact, and roll out winners.
Getting Help
Building production agentic AI systems is complex. If you’re a founder or operator scaling an agent, consider working with a partner who’s done this before. At PADISO, we’ve built AI & Agents Automation systems for Sydney startups and enterprises, and we can help you design and implement memory patterns that work for your specific use case.
We also offer fractional CTO services and AI Strategy & Readiness assessments if you’re planning a larger AI transformation. Whether you’re building a chatbot, a workflow orchestrator, or a complex multi-agent system, the memory patterns in this guide apply.
Summary
Multi-session agent memory is not optional—it’s the difference between a toy and a useful system. By implementing file-backed memory with smart indexing, summary rotation, and eviction, you can:
- Cut costs: 80% reduction in token spend by retrieving only relevant context
- Improve quality: Agents that remember previous interactions provide better, faster answers
- Scale confidently: Patterns that work for 100 sessions scale to millions
- Stay compliant: Audit trails and encryption built in from the start
Start with the simplest pattern that fits your scale: SQLite for small teams, PostgreSQL for mid-market, vector databases for high-volume. Build monitoring and eviction from day one. Test your retrieval quality obsessively.
The agents that win aren’t the ones with the biggest models—they’re the ones with the best memory. Build yours right, and you’ll build something your users actually want to use.
Additional Resources
For deeper dives into agent memory architecture, check out these references:
AI Agent Memory Architecture explains the four-layer model (in-context, persisted, external, episodic) that underpins production systems. I Gave My AI Agent Memory Across Sessions: Here’s the Schema walks through a concrete SQLite + knowledge graph implementation. Agents that Remember: Introducing Agent Memory shows how managed services approach the problem. Why Multi-Agent Systems Need Memory Engineering covers state management at scale.
For session-specific patterns, Basics of Context Engineering: Session and Memory and Context Engineering Cookbook: Session Memory provide practical examples. The OpenAI Agents Python SDK: Session Memory documentation is essential if you’re building with OpenAI models.
For multi-agent systems, Intrinsic Memory Agents research offers theoretical grounding on how heterogeneous agents can share and update memory in common spaces.
If you’re building AI automation for customer service, memory patterns are critical—your agents need to remember customer history, preferences, and prior issues. The same principles apply whether you’re building a support bot, a financial advisor, or a technical troubleshooter.