Memory in LLM Agents, Explained: From Stateless Calls to Long-Term Memory

Fullstack Software Engineer specialising in cloud architecture, system design, and AI. I write deep dives on building production systems — the decisions, the tradeoffs, and the reasoning behind them. No fluff, no tutorials you've seen a hundred times.

Introduction

Most explanations of agent memory start at the wrong altitude. They jump straight to "use a vector database" without explaining what problem that solves, or why a vector database is sometimes the wrong tool entirely.

This post walks through memory the way I built it: as a progression of seven small, runnable scripts, starting from a completely stateless LLM call and ending at a two-layer system with separate short-term and long-term memory. Each step introduces exactly one new concept. By the end, the line between short-term and long-term memory stops being fuzzy and becomes a concrete architectural decision: what is scoped to a thread, and what is scoped to a user.

All code referenced here is in the open-source repo at the end of this post. Every script is self-contained and runnable on its own.


1. Stateless: the baseline problem

The simplest possible setup is a single call to an LLM with no history attached. Tell it your name, ask it your name in the next message, and it has no idea. Every call starts from zero.

This is not a bug. It is the literal absence of memory, and it is the baseline every other pattern in this guide is solving for.
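To make the baseline concrete, here is a minimal sketch that uses a stand-in instead of a real model. `fake_llm` is a hypothetical function I am introducing for illustration: it can only answer from the messages handed to it in that one call, which is the only property that matters here.

```python
def fake_llm(messages):
    """Hypothetical stand-in for an LLM: answers only from the messages in this call."""
    name = None
    for msg in messages:
        if msg["role"] == "user" and msg["content"].startswith("My name is "):
            name = msg["content"][len("My name is "):].rstrip(".")
    if messages[-1]["content"] == "What is my name?":
        return f"Your name is {name}." if name else "I don't know your name."
    return "Noted."


# Two independent calls: nothing from the first call reaches the second.
first = fake_llm([{"role": "user", "content": "My name is Ada."}])
second = fake_llm([{"role": "user", "content": "What is my name?"}])

print(first)   # Noted.
print(second)  # I don't know your name.
```

The second call fails for the same reason a real stateless call does: the name was never part of its input.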


2. Short-term memory in RAM

The most direct fix: keep the full conversation in a Python list, and send that list with every call to the model.

# Append the new user message, send the whole history, then record the reply.
history.append({"role": "user", "content": user_input})
response = llm.invoke(history)
history.append({"role": "assistant", "content": response.content})

Now the model can recall earlier turns, because it is literally being shown them every time. The catch: this history lives in process memory. Restart the script and it is gone. This is memory only in the loosest sense: useful for a single uninterrupted session, useless across sessions.
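The loop can be run end to end without an API key by swapping in a stand-in model. This is a sketch under assumptions: `fake_llm` and `chat` are hypothetical names, and clearing the list stands in for a process restart.

```python
def fake_llm(messages):
    """Hypothetical stand-in for an LLM: answers only from the messages in this call."""
    name = None
    for msg in messages:
        if msg["role"] == "user" and msg["content"].startswith("My name is "):
            name = msg["content"][len("My name is "):].rstrip(".")
    if messages[-1]["content"] == "What is my name?":
        return f"Your name is {name}." if name else "I don't know your name."
    return "Noted."


history = []  # lives only in this process


def chat(user_input):
    """One turn: record the user message, send the full history, record the reply."""
    history.append({"role": "user", "content": user_input})
    reply = fake_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply


chat("My name is Ada.")
same_session = chat("What is my name?")

history.clear()  # stands in for a restart: the in-process list is lost
after_restart = chat("What is my name?")

print(same_session)   # Your name is Ada.
print(after_restart)  # I don't know your name.
```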


3. Persistence and threads

The next step is durability. Write the conversation history to SQLite after every turn, and reload it on startup.

This alone solves the restart problem. The second piece is the thread_id: rather than one global history, each thread_id identifies its own isolated conversation. Multiple conversations can live in the same store without bleeding into each other, and you can switch between them on demand.

At this point memory has gone from "exists only while the process is running" to "exists independently of the process, scoped to a conversation."
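A minimal sketch of thread-scoped persistence using only the standard library's sqlite3 module. The table layout and the helper names (`open_store`, `append_message`, `load_history`) are my own assumptions for illustration; this is not how LangGraph's checkpointer stores its data.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) the database and make sure the messages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "thread_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return conn


def append_message(conn, thread_id, role, content):
    """Write one message to disk immediately, tagged with its thread."""
    with conn:  # commits the transaction on success
        conn.execute(
            "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
            (thread_id, role, content),
        )


def load_history(conn, thread_id):
    """Return one thread's messages in insertion order."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY id",
        (thread_id,),
    )
    return [{"role": role, "content": content} for role, content in rows]


db_path = os.path.join(tempfile.mkdtemp(), "chat.db")

conn = open_store(db_path)
append_message(conn, "thread-a", "user", "My name is Ada.")
append_message(conn, "thread-b", "user", "Plan a trip to Oslo.")
conn.close()  # stands in for the process exiting

conn = open_store(db_path)  # stands in for a restart
thread_a = load_history(conn, "thread-a")
thread_b = load_history(conn, "thread-b")
conn.close()

print(thread_a)  # only thread-a's message, reloaded from disk
```

Filtering on thread_id in the query is what keeps the two conversations from bleeding into each other.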


4. Sliding window

A real conversation grows without bound, and sending the full history on every call gets expensive fast, both in tokens and in latency. The sliding window pattern fixes the cost problem directly: a @before_model middleware hook trims the stored history down to the last N messages right before every call to the model.

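WINDOW = 4  # example value: number of most recent messages to keep; must be > 0

@before_model
def trim_to_window(messages):
    # Simplified for illustration: only the list slicing matters here.
    return messages[-WINDOW:]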

Token cost is now bounded, regardless of how long the conversation runs. The tradeoff is blunt: anything older than the last N messages is invisible to the model. If the user mentioned an important constraint 10 messages ago and the window is 4, the model never sees that constraint again.

This is the right tool when recent context is what matters and the conversation is expected to be short or self-contained.
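The tradeoff is easy to see in a small runnable sketch. This version is a plain function rather than a middleware hook, and it takes the window size as a parameter (an assumption made here so the example is self-contained). It also guards against a window of zero, because `messages[-0:]` returns the whole list rather than an empty one.

```python
def trim_to_window(messages, window):
    """Return the last `window` messages (plain function, not the middleware hook)."""
    if window <= 0:
        return []  # messages[-0:] would return everything, not nothing
    return messages[-window:]


# One early constraint followed by nine follow-up messages.
messages = [{"role": "user", "content": "Budget is capped at $500."}]
for i in range(1, 10):
    messages.append({"role": "user", "content": f"Follow-up question {i}"})

visible = trim_to_window(messages, window=4)
constraint_visible = any("Budget" in m["content"] for m in visible)

print([m["content"] for m in visible])  # follow-up questions 6 through 9
print(constraint_visible)               # False
```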


5. Summarization

Summarization is the more graceful version of the same idea. Instead of discarding old messages outright, a @before_model middleware hook compresses them into a short summary once the stored message count crosses a threshold, while keeping the most recent few messages verbatim alongside that summary.

@before_model
def summarize_if_needed(messages):
    if len(messages) > SUMMARIZE_AFTER:
        old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
        summary = llm.invoke([SUMMARY_PROMPT, *old])
        return [summary_message(summary), *recent]
    return messages

This retains the gist of everything that happened earlier in the conversation, at the cost of losing fine-grained detail. A sliding window keeps the literal text of the last N messages and nothing earlier; summarization keeps a compressed version of everything earlier, plus the literal text of the most recent few messages.

Neither pattern is strictly better. Sliding window is cheaper and simpler. Summarization preserves more of the conversation's shape, at the cost of an extra LLM call to generate the summary.
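Here is a runnable sketch of the same shape with the LLM call replaced by a stand-in. `fake_summarize` is hypothetical: it keeps the first three words of each old message, which is enough to show what survives and what is lost. The threshold values are arbitrary examples.

```python
SUMMARIZE_AFTER = 6  # example threshold: summarize once there are more messages than this
KEEP_RECENT = 2      # example value: messages kept verbatim


def fake_summarize(messages):
    """Stand-in for the summarization LLM call: first three words of each message."""
    gist = "; ".join(" ".join(m["content"].split()[:3]) for m in messages)
    return {"role": "system", "content": "Summary of earlier conversation: " + gist}


def summarize_if_needed(messages):
    """Compress everything except the most recent messages once past the threshold."""
    if len(messages) <= SUMMARIZE_AFTER:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    return [fake_summarize(old), *recent]


messages = [
    {"role": "user", "content": f"Message number {i} with extra detail"}
    for i in range(1, 9)
]
compacted = summarize_if_needed(messages)

print(len(compacted))           # 3
print(compacted[0]["content"])  # the gist of messages 1 to 6, without the detail
```

The output has one summary message plus the two most recent messages verbatim. The words "extra detail" are gone from the summarized portion, which is the fine-grained loss described above.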


6. Long-term memory: the user profile

Every pattern so far is still short-term memory. It lives within one conversation thread and, even with persistence, none of it is shared if the user starts a different thread.

Long-term memory is a different scope entirely: durable facts about a user that should be available no matter which conversation they are currently having. The pattern here uses a second LLM call after every turn to extract durable facts (name, job, stated preferences) and write them to a key-value store (LangGraph's SqliteStore):

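# Second LLM call: extract durable facts from the latest turns.
# The stored value should be a plain dict, e.g. {"name": ..., "job": ...}.
facts = extract_facts_llm.invoke([extraction_prompt, *recent_turns])
store.put(namespace=("user", user_id), key="profile", value=facts)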

On the next session, even a completely separate thread, those facts are loaded from the store and injected into the system prompt. This is the first pattern in the series where memory survives across sessions that have no shared conversation history at all.
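The extract, store, and inject cycle can be sketched without any dependencies. Everything named here is an assumption for illustration: a dict stands in for the key-value store, and `extract_facts` uses regexes where the real pattern uses an LLM call.

```python
import re

# Hypothetical stand-in for a key-value store: {(namespace, key): value}
profile_store = {}


def extract_facts(turns):
    """Stand-in for the extraction LLM call: pulls a name and a job with regexes."""
    facts = {}
    for turn in turns:
        if turn["role"] != "user":
            continue
        if m := re.search(r"My name is (\w+)", turn["content"]):
            facts["name"] = m.group(1)
        if m := re.search(r"I work as an? ([\w ]+)", turn["content"]):
            facts["job"] = m.group(1).strip()
    return facts


def remember(user_id, turns):
    """Merge newly extracted facts into the user's stored profile."""
    key = (("user", user_id), "profile")
    profile_store[key] = {**profile_store.get(key, {}), **extract_facts(turns)}


def build_system_prompt(user_id):
    """Inject the stored profile, if any, into the system prompt."""
    facts = profile_store.get((("user", user_id), "profile"), {})
    if not facts:
        return "You are a helpful assistant."
    known = ", ".join(f"{k}: {v}" for k, v in sorted(facts.items()))
    return f"You are a helpful assistant. Known facts about the user: {known}."


remember("u1", [{"role": "user", "content": "My name is Ada. I work as a data engineer."}])

prompt_new_thread = build_system_prompt("u1")  # no conversation history needed
prompt_other_user = build_system_prompt("u2")

print(prompt_new_thread)
print(prompt_other_user)
```

The lookup key contains the user id and nothing about the thread, which is why the profile is available in a conversation that shares no history with the one where the facts were stated.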


7. Combining short-term and long-term memory

The full architecture uses both layers together, each with a different scope:

  • Short-term memory, via SqliteSaver: conversation history, scoped to a thread_id. Restored automatically on reconnect to the same thread.

  • Long-term memory, via SqliteStore: durable behavioral rules, scoped to a user_id. The agent saves rules autonomously via a save_rule tool, and those rules apply to every future session for that user, regardless of thread.

The clarifying detail: open a new thread for the same user, and the conversation history resets, but the rules that user taught the agent in a previous thread are already active. Switch to a different user entirely, and that user inherits none of the first user's rules. Each user's long-term memory is fully isolated.

This is the insight that took the fuzziness out of "short-term vs long-term memory" for me. They are not the same kind of memory at different durations. They are scoped along different dimensions: short-term memory (STM) resets per thread, while long-term memory (LTM) persists per user across every thread that user ever opens.
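The two scopes can be sketched as two lookups with different keys. This is an in-memory illustration under assumptions: the dicts stand in for the two SQLite-backed stores, `save_rule` is a plain function rather than an agent tool, and `build_context` is a hypothetical helper.

```python
from collections import defaultdict

thread_history = defaultdict(list)  # short-term: keyed by thread_id
user_rules = defaultdict(list)      # long-term: keyed by user_id


def save_rule(user_id, rule):
    """Hypothetical stand-in for the agent's save_rule tool."""
    if rule not in user_rules[user_id]:
        user_rules[user_id].append(rule)


def build_context(user_id, thread_id):
    """Assemble what the model would see for this user in this thread."""
    rules = user_rules.get(user_id, [])
    system = ("Rules: " + "; ".join(rules)) if rules else "Rules: none"
    return [{"role": "system", "content": system}, *thread_history.get(thread_id, [])]


# Alice teaches the agent a rule in thread t1.
thread_history["t1"].append({"role": "user", "content": "Always answer in bullet points."})
save_rule("alice", "Answer in bullet points")

same_user_new_thread = build_context("alice", "t2")  # rules carry over, history does not
other_user = build_context("bob", "t3")              # neither carries over

print(same_user_new_thread)
print(other_user)
```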


Choosing a pattern

Situation | Pattern
--- | ---
Single-session prototype, no persistence needed | STM in RAM
Conversation needs to survive restarts | STM plus SQLite with threads
Long conversations, recent context matters most, cost-sensitive | Sliding window
Long conversations, full context shape matters | Summarization
Facts about the user should persist across sessions | Long-term memory, user profile
Production agent with both ongoing conversations and persistent user behavior | STM and LTM combined

Further reading

This series draws directly on LangChain and LangGraph's memory primitives. For the underlying API reference:


Conclusion

Memory in LLM applications is not one feature; it is a set of independent tradeoffs: durability versus simplicity, recall versus token cost, thread-scoped versus user-scoped. Building each pattern in isolation, in order, makes those tradeoffs visible rather than hidden behind a single "add memory" abstraction.

Full source, all 7 scripts, self-contained and runnable: https://github.com/f2015537/llm-memory-patterns