Better context management for AI chat
When you’re building a chatbot or agent that needs continuity over long conversations, you hit a real wall pretty quickly. Context windows are finite, and the usual move is obvious: when you run out of space, summarize the old conversation and replace it with a compact version. It sounds reasonable on the surface, but that’s where the problems start.
The sawtooth problem
Every time you compress context into a summary, something gets lost in the translation. Details blur together, nuance evaporates completely, and the agent loses the thread of what mattered most. The immersion breaks, and your user notices immediately. Trust tanks fast.
I ran into this exact wall building Arland, a chatbot powered by Claude. After every compression, Arland would lose context, repeat himself, or suddenly not understand why we were talking about something specific. That’s a serious problem, because the entire app depends on the user wanting to chat with it about their goals and progress. Having to repeat something you said a few minutes ago once is bad. Having to do it regularly because the compression model keeps losing small details is tedious. It’s not something you’d want to chat with every day.
The summaries were stored as nicely formatted markdown notes that Arland could refer back to, but generating them with a separate prompt meant the summarizer injected its own personality into the mix and, worse, lost vital information in the process. Summarization is also expensive. You’re running separate inference passes on top of your main conversation loop, which means you keep the context window uncomfortably small to avoid those costly compressions. Small context means you compress more often, which compounds everything.
I found another way
You don’t actually need to compress context. You can prune it instead. Since developers are responsible for maintaining conversational context for agents, every prompt we send is an opportunity to frame the important bits of the conversation. And most human conversation is noise. If we can identify it, we can delete it.
The three-stage system
I ended up with a memory architecture that mimics how human memory works. No sawtooth effect. It helps make ongoing conversations smooth and bump-free. Plus it saves a lot on token spend, something that gets more important every day. The system has three parts.
Short-term memory
The last few conversation turns stay verbatim: no compression, no pruning, full fidelity. Think of this as active awareness, the thing you’re paying attention to right now.
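Here is a minimal sketch of how a short-term buffer like this could look in Python. The names Turn and ShortTermMemory and the default capacity of 3 are assumptions for illustration, not Arland’s actual code. The one design point it shows is that a turn pushed out of the buffer is handed back to the caller, so it can move on to the next tier instead of being dropped.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str  # stored verbatim, never rewritten


class ShortTermMemory:
    """Keeps the most recent turns at full fidelity (hypothetical helper)."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.turns: deque = deque()

    def add(self, turn: Turn) -> Optional[Turn]:
        """Append a turn and return the one that overflowed, if any."""
        self.turns.append(turn)
        if len(self.turns) > self.capacity:
            # The oldest turn leaves short-term memory; the caller can
            # pass it on to working memory.
            return self.turns.popleft()
        return None
```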
Working memory
The next batch of turns, say 10 or so, stays in pruned form. As each turn comes in, Claude scores it on relevance using criteria that matter for your specific agent. That scoring happens inline in the same prompt, so it’s cheap token-wise and adds almost no overhead. When working memory hits capacity, the lowest-scoring turn gets removed (the oldest one, if there’s a tie), and everything else ages down. Scores decay over time, so even highly relevant turns fade as they get older. It’s like remembering yesterday better than last month.
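The eviction and decay logic can be sketched roughly like this. It assumes the relevance score arrives as a number between 0 and 1, and it uses exponential decay per turn of age; both are illustrative choices, as are the names ScoredTurn and WorkingMemory and the decay factor of 0.9. The post doesn’t pin down a specific decay curve.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredTurn:
    text: str
    score: float  # relevance assigned by the model; assumed range 0.0 to 1.0
    age: int = 0  # number of newer turns added since this one


class WorkingMemory:
    """Pruned middle tier (hypothetical helper)."""

    def __init__(self, capacity: int = 10, decay: float = 0.9):
        self.capacity = capacity
        self.decay = decay  # assumed per-turn exponential decay factor
        self.turns: List[ScoredTurn] = []

    def effective(self, turn: ScoredTurn) -> float:
        """Score after aging: older turns count for less."""
        return turn.score * self.decay ** turn.age

    def add(self, turn: ScoredTurn) -> Optional[ScoredTurn]:
        """Age existing turns, add the new one, evict one turn if over capacity."""
        for existing in self.turns:
            existing.age += 1
        self.turns.append(turn)
        if len(self.turns) <= self.capacity:
            return None
        # Lowest effective score goes first; ties are broken by evicting the oldest.
        idx = min(
            range(len(self.turns)),
            key=lambda i: (self.effective(self.turns[i]), -self.turns[i].age),
        )
        return self.turns.pop(idx)
```

With a capacity of 2 and a decay of 0.5, adding turns scored 1.0, 0.2, and 0.9 in that order evicts the 0.2 turn: its decayed score (0.1) is lower than the older but more relevant first turn (0.25).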
Long-term memory
Persistent facts and context stay around indefinitely. Instead of having a separate summarization process, the agent updates its own long-term notes directly using tools, deciding what’s important enough to remember rather than you guessing through a summarization prompt. This keeps the agent in the driver’s seat.
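As a rough illustration, a note-updating tool could be declared and handled like this. The dictionary follows the name / description / input_schema shape that Anthropic’s Messages API uses for tool definitions, but the tool name update_long_term_notes, its fields, and the apply_note_update handler are all hypothetical. Persistence is left out; the notes here are just an in-memory dict.

```python
from typing import Dict

# Hypothetical tool definition the agent can call to edit its own notes.
UPDATE_NOTES_TOOL = {
    "name": "update_long_term_notes",
    "description": (
        "Save or update a persistent fact about the user. "
        "Pass an empty value to forget a note."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "key": {"type": "string", "description": "Short label for the fact"},
            "value": {"type": "string", "description": "The fact to remember"},
        },
        "required": ["key", "value"],
    },
}


def apply_note_update(notes: Dict[str, str], tool_input: Dict[str, str]) -> Dict[str, str]:
    """Apply one tool call from the agent to the notes store."""
    key = tool_input["key"].strip()
    value = tool_input["value"].strip()
    if value:
        notes[key] = value    # create or overwrite the note
    else:
        notes.pop(key, None)  # empty value means "forget this"
    return notes
```

The point of the design is that the agent decides when to call the tool and what to write, so nothing passes through a second model’s paraphrase.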
Why this works
You’re not fighting the agent’s context window. You’re working with how attention and memory actually function in humans. The agent stays coherent because you preserve the signal and trim the noise: no sawtooth, no sudden context loss. And because scoring happens inline, you’re not burning tokens on separate summarization passes, so the system is leaner and more responsive overall.
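To make the overall flow concrete, here is one way the three tiers might be assembled into a single request. This is a sketch under assumptions: build_context is a made-up helper, messages are plain role/content dicts as used by most chat-style APIs, and putting long-term notes in the system text is just one reasonable placement.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]  # {"role": ..., "content": ...}


def build_context(
    notes: Dict[str, str],
    working: List[Message],
    recent: List[Message],
) -> Tuple[str, List[Message]]:
    """Combine the three memory tiers into (system_text, messages)."""
    # Long-term notes go up front as stable background.
    note_lines = [f"- {key}: {value}" for key, value in notes.items()]
    system = "Long-term notes:\n" + "\n".join(note_lines)
    # Pruned older turns come first, then the verbatim recent turns,
    # so the conversation still reads in chronological order.
    messages = list(working) + list(recent)
    return system, messages
```

Because pruning deletes whole turns rather than rewriting them, every message the model sees is something that was actually said. Depending on the API, gaps left by pruning may need extra handling if strict user/assistant alternation is expected.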
It beats the old summarization-based compression system in ways that matter: immersion stays high, consistency improves noticeably, and costs have dropped. I even tested successfully with what seems like a very small context window (3 short-term turns, 6 working memory turns) and Arland keeps conversation flow better than most people. Your mileage may vary, of course, but it’s paid off for my product.