Overview
This talk explains how to give LLM agents effective “memory” through context engineering, sessions, and long-term memory, drawing on a Google × Kaggle white paper.
Core Concepts
- Context engineering dynamically assembles all needed inputs per turn to overcome LLM statelessness.
- Sessions manage immediate, per-conversation state and history for one user interaction.
- Memory provides long-term, cross-session personalization via an LLM-driven ETL pipeline.
Context Engineering
- LLMs are stateless; statefulness requires dynamic, per-call context packaging.
- Goes beyond static prompt engineering; continuously adapts inputs to current turn.
- Inputs include system instructions, tool definitions, few-shot examples, external data, and dialogue history.
- Manages function outputs, scratchpads, and the latest user message for precise responses; a minimal assembly sketch follows this list.
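A minimal sketch of per-turn assembly, assuming a plain-string prompt format and illustrative section names (a real system would use the model's own message schema):

```python
from dataclasses import dataclass

@dataclass
class TurnInputs:
    """Everything gathered for one turn of a stateless LLM call."""
    system_instructions: str
    tool_definitions: list[str]
    few_shot_examples: list[str]
    retrieved_documents: list[str]  # RAG results, fetched memories, etc.
    dialogue_history: list[str]     # prior user/agent turns
    latest_user_message: str

def assemble_context(t: TurnInputs) -> str:
    """Dynamically package all inputs into a single prompt for this turn."""
    sections = [
        ("System", t.system_instructions),
        ("Tools", "\n".join(t.tool_definitions)),
        ("Examples", "\n".join(t.few_shot_examples)),
        ("Retrieved context", "\n".join(t.retrieved_documents)),
        ("Conversation so far", "\n".join(t.dialogue_history)),
        ("User", t.latest_user_message),
    ]
    # Empty sections are skipped so noise does not accumulate (context rot).
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)
```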
Context Rot and Compaction
- Context rot: oversized, noisy context degrades attention and reasoning quality.
- Compaction trims context using summarization and pruning to preserve key signal.
- Must run asynchronously to avoid blocking user responses and adding latency.
Context Management Cycle
- Fetch context: retrieve relevant memories, RAG documents, and required data.
- Prepare context: assemble full prompt string on the hot path before inference.
- Invoke LLM and tools: send prompt, run functions, collect outputs as needed.
- Upload context: persist new insights to storage asynchronously after the response; the full cycle is sketched below.
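The cycle in code, as a rough sketch: `session`, `memory_store`, and `llm` are assumed interfaces, not a specific framework's API. Note that only step 4 leaves the hot path.

```python
import asyncio

async def handle_turn(user_message: str, session, memory_store, llm) -> str:
    """One pass through fetch -> prepare -> invoke -> upload."""
    # 1. Fetch: retrieve memories (and RAG documents) relevant to this turn.
    memories = await memory_store.search(user_message, top_k=5)
    # 2. Prepare: assemble the full prompt on the hot path, just before inference.
    history = "\n".join(f"{role}: {text}" for role, text in session.events)
    prompt = f"Known about the user:\n{memories}\n\n{history}\nuser: {user_message}"
    # 3. Invoke: call the model; tool calls would also execute here.
    response = await llm.generate(prompt)
    session.events.extend([("user", user_message), ("agent", response)])
    # 4. Upload: persist new insights asynchronously so the reply is not blocked.
    asyncio.create_task(memory_store.extract_and_store(session.events))
    return response
```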
Sessions: Structure and Frameworks
- A session is a self-contained container for one continuous user conversation.
- Events: chronological log of user/agent messages and tool calls in order.
- State: structured working memory like cart items or workflow step progress.
- Frameworks differ: ADK separates the event log from state, while LangGraph centers on a mutable state object (see the sketch below).
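A sketch of the ADK-style split between an append-only event log and structured working state (field names are illustrative, not the actual ADK API):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Session:
    """One self-contained conversation: event log plus working state."""
    session_id: str
    events: list[dict] = field(default_factory=list)     # chronological log
    state: dict[str, Any] = field(default_factory=dict)  # structured memory

    def add_event(self, author: str, content: str) -> None:
        # Append-only writes keep event ordering deterministic.
        self.events.append({"author": author, "content": content})

# Usage: track a shopping workflow alongside the dialogue.
s = Session("sess-42")
s.add_event("user", "Add the blue mug to my cart")
s.state["cart"] = ["blue mug"]
s.state["workflow_step"] = "awaiting_checkout"
```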
Multi-Agent Systems (MAS)
- Shared unified history: all agents read/write one central log; high visibility, clutter risk.
- Separate individual histories: agents communicate via messages; autonomy but less shared context.
- Requires an abstract, framework-agnostic memory layer to share synthesized knowledge.
Security, Privacy, and Performance
- Strict isolation and ACLs are baseline; enforce user-by-user separation.
- PII reduction must occur before storage for compliance (e.g., GDPR, CCPA).
- Data hygiene: TTL policies and deterministic event ordering maintain integrity; a PII/TTL sketch follows this list.
- Performance: minimize hot-path session size; compaction reduces cost and latency.
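A toy illustration of PII reduction before storage plus a TTL check; the regexes are deliberately naive, and a production system would use a dedicated PII-detection service:

```python
import re
import time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Reduce PII before anything reaches long-term storage."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def expired(created_at: float, ttl_seconds: float) -> bool:
    """TTL policy: records past their time-to-live get purged."""
    return time.time() - created_at > ttl_seconds

print(redact_pii("Reach me at jane@example.com or +1 555 010 9999"))
# -> Reach me at [EMAIL] or [PHONE]
```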
Compaction Strategies
- Sliding window: keep last N turns to limit context growth per conversation.
- Token-based truncation: cut oldest content once token budget is reached.
- Recursive summarization: periodically replace older chunks with concise summaries.
- Triggers: turn count, inactivity periods, or task-completion events; sketches of these strategies follow.
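Sketches of the first two strategies plus a combined trigger check. Thresholds are illustrative, real token counting would use the model's tokenizer, and recursive summarization is omitted because it needs an LLM call:

```python
def sliding_window(turns: list[str], max_turns: int = 20) -> list[str]:
    """Keep only the last N turns of the conversation."""
    return turns[-max_turns:]

def truncate_to_budget(turns: list[str], max_tokens: int = 4000) -> list[str]:
    """Drop the oldest turns once the token budget is exceeded
    (tokens approximated here as ~4 characters each)."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = max(1, len(turn) // 4)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

def should_compact(turn_count: int, idle_seconds: float, task_done: bool) -> bool:
    """Fire compaction on turn count, inactivity, or task completion."""
    return turn_count >= 50 or idle_seconds > 600 or task_done
```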
Memory vs RAG
- RAG: static, shared knowledge; “research librarian” for world facts.
- Memory: dynamic, user-specific knowledge; “personal assistant” for personalization.
- Complementary, not competing: each serves a distinct role in agent intelligence.
Types and Organization of Memory
- Declarative memory: facts and events (e.g., favorite team, upcoming destination).
- Procedural memory: skills and workflows (e.g., tool call sequences for tasks).
- Organization: per-user collections, topic collections, structured profiles, rolling summaries.
- Storage: vector databases for semantic search; knowledge graphs for relations.
- Scope: user-level (across sessions), session-level (temporary), application-level (global); an illustrative record schema follows.
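One way to tie the type, scope, and organization axes together in a single record; the names and fields below are assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MemoryType(Enum):
    DECLARATIVE = "declarative"   # knowing what: facts and events
    PROCEDURAL = "procedural"     # knowing how: skills and workflows

class Scope(Enum):
    USER = "user"                 # persists across sessions
    SESSION = "session"           # temporary, bound to one conversation
    APPLICATION = "application"   # global, shared by all users

@dataclass
class MemoryRecord:
    content: str                  # stored as text, even for multimodal sources
    kind: MemoryType
    scope: Scope
    topic: str                    # enables per-topic collections
    embedding: Optional[list[float]] = None  # for vector-DB semantic search
```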
Multimodal Considerations
- Sources may be images or audio; store extracted key facts as text for processing.
- Text remains the primary representation for LLM search and reasoning.
Memory Generation: LLM-Driven ETL
- Extraction: targeted filtering of meaningful details based on agent purpose.
- Consolidation: compare with existing memories; create, update, delete, or invalidate.
- Provenance and confidence: source, age, explicitness, and reinforcement drive trust.
- Relevance decay: reduce importance over time without reinforcement to mimic forgetting (sketched after this list).
- Asynchronous processing: run ETL in background to prevent response latency spikes.
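Relevance decay could be modeled as an exponential half-life with a reinforcement boost; the 30-day half-life and the boost factor below are illustrative choices, not values from the white paper:

```python
import time

def decayed_importance(base: float, created_at: float,
                       reinforcements: int,
                       half_life_days: float = 30.0) -> float:
    """Importance halves every `half_life_days` unless the memory is
    reinforced by being reused or restated by the user."""
    age_days = (time.time() - created_at) / 86_400
    decay = 0.5 ** (age_days / half_life_days)
    boost = 1.0 + 0.1 * reinforcements    # reinforcement slows forgetting
    return base * decay * boost
```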
Memory as a Tool
- Provide agent tools like create_memory and query_memory for autonomous management.
- The agent decides when to save or retrieve based on conversational needs and goals; a sketch of both tools follows.
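A sketch of the two tools as plain Python functions that a framework could wrap into tool schemas; `store` is an assumed interface with `upsert` and `search` methods:

```python
def create_memory(store, content: str, topic: str) -> str:
    """Save a durable fact; the agent calls this when the conversation
    surfaces something worth remembering."""
    store.upsert(content=content, topic=topic)
    return f"Saved memory under topic '{topic}'."

def query_memory(store, query: str, top_k: int = 5) -> list[str]:
    """Retrieve memories relevant to the current conversational need."""
    return store.search(query, top_k=top_k)
```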
Retrieval and Scoring
- Blend scores: combine relevance (similarity), recency, and importance into a single ranking (see the sketch after this list).
- Proactive retrieval: fetch likely memories each turn; simple but may add latency.
- Reactive retrieval: agent queries memory on demand; efficient but requires smarter control.
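A blended-scoring sketch; the weights and the recency time constant are illustrative, and all three signals are assumed pre-normalized to [0, 1]:

```python
import math
import time

def blended_score(similarity: float, last_used_at: float, importance: float,
                  w_rel: float = 0.6, w_rec: float = 0.2,
                  w_imp: float = 0.2) -> float:
    """Weighted blend of relevance, recency, and importance."""
    age_hours = (time.time() - last_used_at) / 3600
    recency = math.exp(-age_hours / 72)   # fades over roughly three days
    return w_rel * similarity + w_rec * recency + w_imp * importance

# Rank candidates and keep the top K for the prompt, e.g.:
# top_k = sorted(candidates, key=lambda m: blended_score(
#     m.similarity, m.last_used_at, m.importance), reverse=True)[:5]
```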
Inference and Prompt Placement
- Placement signals authority: memories in the system message carry strong weight but can bias the model when they are wrong.
- Injecting memories into conversation history can confuse speaker roles and dilute dialogue clarity.
- Choose placement deliberately to balance stability against the risk of imperfect memories; both options are sketched below.
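Both placements, sketched with generic chat-style message dicts rather than a specific provider's API:

```python
def place_in_system(memories: list[str], base_system: str) -> dict:
    """System placement: memories read as authoritative instructions.
    Stable, but a wrong memory becomes hard for the model to override."""
    notes = "\n".join(f"- {m}" for m in memories)
    return {"role": "system",
            "content": f"{base_system}\n\nKnown about this user (may be outdated):\n{notes}"}

def place_in_history(memories: list[str]) -> dict:
    """History placement: memories appear as one more conversational turn.
    Lower authority, and it can blur who actually said what."""
    return {"role": "user",
            "content": "Context recalled from past sessions: " + "; ".join(memories)}
```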
Evaluation and Testing
- Generation metrics: precision and recall of extracted memories against expected facts.
- Retrieval metrics: recall@K checks whether the correct memories appear in the top-K results (computed as in the sketch after this list).
- Latency targets: memory lookups ideally under 200 milliseconds for responsiveness.
- End-to-end success: use LLM judges across test cases to score task completion gains.
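recall@K in a few lines, with a made-up example:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of ground-truth memories that appear in the top-K results."""
    hits = sum(1 for m in retrieved[:k] if m in relevant)
    return hits / len(relevant) if relevant else 0.0

# 2 of 3 expected memories surfaced in the top 5 -> recall@5 = 0.67
print(recall_at_k(["fav team: Ajax", "diet: vegan", "likes jazz", "x", "y"],
                  {"fav team: Ajax", "diet: vegan", "lives in Oslo"}, k=5))
```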
Summary Table: Architecture Elements
| Component | Purpose | Key Techniques | Performance Notes |
|---|---|---|---|
| Context Engineering | Build per-turn statefulness | Fetch/prepare/invoke/upload; compaction | Hot path; minimize latency and cost |
| Sessions | Immediate conversation state | Events log; mutable state; TTL | Deterministic ordering; framework differences |
| Memory | Long-term personalization | LLM ETL; provenance; decay | Asynchronous processing required |
| Retrieval | Bring memories into context | Blended scoring; proactive/reactive | Aim for <200 ms lookup times |
| Storage | Persist and query knowledge | Vector DB + knowledge graph | Hybrid supports semantic and relational queries |
Key Terms & Definitions
- Context engineering: Dynamic assembly of all inputs per turn to overcome statelessness.
- Context rot: Quality drop when context becomes too large or noisy.
- Compaction: Techniques to reduce context size while preserving essential information.
- Session: Self-contained container of one conversation’s events and working state.
- Declarative memory: “Knowing what” facts and events about the user.
- Procedural memory: “Knowing how” processes and tool-use sequences.
- Provenance: Origin and characteristics of a memory used to judge reliability.
- Relevance decay: Scheduled reduction in memory importance without reinforcement.
- Blended scoring: Combining relevance, recency, and importance for retrieval ranking.
Action Items / Next Steps
- Implement session compaction with sliding window and token-based truncation first.
- Add asynchronous recursive summarization triggered by turn count or inactivity.
- Design and deploy an LLM-driven ETL for memory with provenance and decay.
- Expose memory tools to the agent for create/query within conversations.
- Adopt blended retrieval scoring and evaluate with recall@K and latency targets.