Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Chapter 3: The Economics of Verbatim Storage

Positioning: This chapter uses hard data to prove that “store everything” is entirely feasible economically, shifting the focal point from “whether to store” to “how to organize,” setting the stage for the solution introduced in subsequent chapters.


An Inverted Equation

In the previous two chapters, we established two arguments:

  1. Millions of tokens’ worth of decision records evaporate daily in AI conversations (Chapter 1).
  2. Letting the LLM decide what is worth remembering is fundamentally wrong (Chapter 2).

The logical intersection of these two arguments points to a seemingly radical conclusion: store everything. No extraction, no compression, no filtering — preserve every token of every conversation verbatim.

Most people’s first reaction to this approach is: “that’s too expensive” or “that’s not realistic.”

This reaction is wrong. It comes from an inverted equation: people overestimate the cost of storage and underestimate the difficulty of retrieval.

Let the numbers speak.


Storage Cost: Approaching Zero

First, let us establish a baseline. In Chapter 1, we estimated the data volume a moderate-intensity AI user generates over 6 months:

180 days x 3 hours/day x 10,000 tokens/hour = approximately 19.5 million tokens

What does 19.5 million tokens convert to in raw text?

One token is approximately 4 characters (English) or 1.5 characters (mixed Chinese-English scenarios). Using a conservative estimate, 19.5 million tokens equals roughly 60 million characters, or 60MB of plain text. Adding metadata (timestamps, session IDs, project tags) brings it to about 100MB.

100MB. Six months of all AI conversation records. 100MB.

What does this number mean in 2026?

  • Local storage: A 1TB SSD costs roughly $50. 100MB occupies 0.01%. A single photo on your phone might be larger.
  • Cloud storage: AWS S3 standard storage costs $0.023/GB/month. The annual storage cost for 100MB is $0.028 — less than 3 cents.

Storage cost, in any reasonable discussion, can be rounded to zero.

But this is only raw text storage. AI memory systems typically also need vector indexes to support semantic retrieval. Vector index storage overhead is roughly 3–5x the raw text (depending on vector dimensions and index structure). Even so, a 500MB vector database runs perfectly fine locally — ChromaDB at this scale has query latencies in the millisecond range.

So the first conclusion is clear: “can’t afford to store” is a non-problem. The cost of completely storing all AI conversations is negligible.


The Real Cost: Not in Storage, but in Usage

Storage is cheap, but usage is not. Specifically, when you need to load memories into an LLM’s context window, every token has a cost — because LLM APIs charge by the token.

This is the real economic decision point. Let us compare the usage costs of four approaches.

Approach One: Full Paste

The most naive approach: stuff all historical conversations into the LLM’s context at once.

Token consumption: 19,500,000 tokens

This approach is physically impossible. As of early 2026, the maximum commercial LLM context window is approximately 200K–1M tokens. 19.5 million tokens exceeds any existing model’s context window. Even if future context windows expand to tens of millions, the loading cost would be astronomical.

Conclusion: Not possible.

Approach Two: LLM Summary

This is the approach adopted by Mem0, Zep, and similar systems. Use an LLM to compress 19.5 million tokens of original conversations into summaries, and load summaries when needed.

Assuming a 30:1 compression ratio (already quite aggressive), the total summary volume is approximately 650,000 tokens.

How many summary tokens need to be loaded per session? This depends on the scenario, but a reasonable estimate is roughly 5,000–10,000 tokens of relevant memory per session.

Per-session loading: ~7,500 tokens (summaries)
Daily sessions: ~5
Annual token consumption: 7,500 x 5 x 365 = 13,687,500 tokens

At Claude Sonnet’s input price of $3/million tokens (early 2026 reference price):

Annual cost = 13,687,500 x $3 / 1,000,000 = ~$41

But this is only the loading cost. Generating the summaries themselves also requires LLM calls — you need the LLM to process 19.5 million tokens of original conversations to produce summaries. At input pricing:

Summary generation cost = 19,500,000 x $3 / 1,000,000 = $58.5 (one-time)

Adding ongoing incremental updates (new daily conversations needing summarization), the actual annual total cost is roughly $200–500, depending on usage intensity and model choice.

Taking a middle estimate: ~$507/year.

This cost is not high — entirely acceptable for professional users. But the problem is not cost but quality: as argued in the previous chapter, these summaries have an accuracy rate of approximately 85%. You are spending $507 per year on a memory system that is 85% accurate.

Approach Three: MemPalace Wake-Up

MemPalace’s design employs an entirely different strategy. Instead of loading large quantities of memory summaries at each session, it loads a very small “identity layer” — who you are, who your team is, what your projects are — and then performs precise retrieval only when needed.

By the current source baseline, wake-up loads L0 (identity) and L1 (key story / key facts), typically totaling ~600-900 tokens. The more aggressive ~170 token / ~$0.70 figure often cited in the README corresponds to a later target path: rewrite L1 into AAAK and use that smaller representation for wake-up.

Per-session loading: ~600-900 tokens
Daily sessions: ~5
Annual token consumption: 600-900 x 5 x 365 = 1,095,000-1,642,500 tokens
Annual cost = ~$3.3-$4.9

Still only single-digit dollars. ~$3-$5/year. The README’s $0.70 figure belongs to the future AAAK wake-up path; for the current default CLI, the correct order of magnitude is “a few dollars,” not “a few hundred dollars.”

$5. Not $500, not $50. Single-digit dollars per year.

To be fair, though, 600-900 tokens still contain only your identity and the most important story layer — not every specific historical decision. When you need to look up “why we chose Postgres,” you still need retrieval.

Approach Four: MemPalace Wake-Up + On-Demand Retrieval

In actual use, MemPalace’s workflow is: first load 600-900 tokens of wake-up information, then perform semantic retrieval as the conversation requires. Each retrieval returns approximately 2,500 tokens of relevant content (including original conversation fragments).

Assuming an average of 5 retrievals per session:

Per-session loading: 600-900 + (5 x 2,500) = 13,100-13,400 tokens
Daily sessions: ~5
Annual token consumption: 23,907,500-24,455,000 tokens
Annual cost = ~$72-$73

Wait — this number looks higher than the summary approach? Let us correct the calculation.

In practice, 5 retrievals/session is a high estimate. Most sessions require only 0–2 retrievals — memory queries are needed only when the conversation involves historical decisions. A more realistic estimate is an average of 1 retrieval per session:

Per-session loading: 600-900 + (1 x 2,500) = 3,100-3,400 tokens
Daily sessions: ~5
Annual token consumption: 5,657,500-6,205,000 tokens
Annual cost = ~$17-$19

The lower numbers quoted in the README still correspond to the future AAAK wake-up path. The key point is not the exact figure but the order of magnitude: under the current default implementation, MemPalace’s annual usage cost is roughly $3-$20 in common use (0-1 retrieval per session) and $70+ in heavier use, while the summary approach sits in the $200-$500 range.

Cost Comparison Summary Table

ApproachPer-load tokensAnnual costAccuracy
Full paste19,500,000 (exceeds context window)Not possibleN/A
LLM summary~7,500~$507/year~85%
MemPalace wake-up~600-900~$3-$5/yearN/A (identity layer only)
MemPalace + on-demand retrieval~3,100–13,400~$17-$73/year96.6%

The last column is key: MemPalace is not only 50x cheaper but also 12 percentage points more accurate. This is not a “cheaper but slightly worse” approach — it is superior on both dimensions simultaneously.


Why Retrieval Is Hard

At this point, you might think: if storage is cheap and usage cost is low, is the problem solved?

No. The calculations above have an implicit assumption: retrieval must be accurate. If retrieval returns content other than what you are looking for, no amount of low cost matters — you have spent tokens loading irrelevant information.

The difficulty of retrieval comes from three levels.

Semantic Gap

There is a semantic gap between the user’s query and the memory’s content.

The user asks: “Why did we choose Clerk?” The original conversation’s phrasing might be: “OAuth provider evaluation conclusion — Auth0’s enterprise pricing triples after 10K MAU, Clerk’s pricing is more linear, and the Next.js SDK works out of the box.”

“Clerk” appears on both sides, but the semantic correspondence between “chose” and “evaluation conclusion,” and the causal correspondence between “why” and the pricing/SDK comparison, both require semantic understanding to establish.

Simple keyword matching would miss this correspondence. Vector retrieval (semantic similarity) can capture part of it, but in large memory stores, semantically similar but irrelevant results (false positives) increase significantly.

Scale Dilemma

A memory store of 19.5 million tokens, segmented into conversation fragments, yields tens of thousands to hundreds of thousands of document chunks. At this scale, the probability of irrelevant content appearing in the top-5 or top-10 vector retrieval results is high.

This is a classic information retrieval problem: as the corpus grows, maintaining both high precision and high recall simultaneously becomes difficult. Increasing recall (not missing relevant results) decreases precision (admitting more irrelevant results), and vice versa.

Missing Context

Pure vector retrieval lacks structural context. It knows “this document chunk is semantically similar to the query” but does not know “this document chunk belongs to which project, involves which people, or was produced at what project phase.”

Without this structural context, the retrieval system cannot answer queries like:

  • “Kai’s advice about databases” — needs to know who Kai is and which conversations involved Kai.
  • “Driftwood project decisions last month” — needs to know which project Driftwood is and requires time filtering.
  • “Pitfalls we hit during the auth migration” — needs to know that the auth migration is a topic spanning multiple conversations.

What Makes Retrieval Feasible

The three retrieval difficulties — semantic gap, scale dilemma, missing context — are not unsolvable. Each has known solutions; these solutions simply need to be combined.

Solution One: Shrink the Search Space

The most effective counter to the scale dilemma is not a better search algorithm but a smaller search space.

If you can know approximately which region the answer is in before searching, you can shrink the search scope from tens of thousands of document chunks to a few hundred. Both precision and recall are easy to maintain at high levels on a small corpus.

But “knowing which region the answer is in” requires the memory store to have structure. Not a flat vector database, but an organizational system with hierarchy, classification, and associations.

This observation itself is not novel — library science has studied it for centuries. The core insight of the Dewey Decimal Classification is: classify first, then search. Classification reduces an O(N) search problem to an O(N/K) search problem, where K is the number of categories.

For AI memory, the classification dimensions can be:

  • Who — Whose memories?
  • What project — Which project does it belong to?
  • What type — Is it a decision, an event, a preference, or a recommendation?
  • What topic — Specifically about auth, databases, deployment, or frontend?

If a query can be routed to the correct classification combination (e.g., “Driftwood project database decisions”), the search space can shrink from tens of thousands to tens — dramatically improving both precision and recall.

The data quantifies this effect. On a test set of 22,000+ memories, simply narrowing the search scope step by step by “who,” “type,” and “topic,” R@10 improved from 60.9% to 94.8% — structure alone produced a 34-percentage-point retrieval improvement, requiring no better vector model, no LLM reranking, no additional computational cost. Purely a change in how data is organized (see Chapter 7 for the layer-by-layer benchmark data).

This is a profound result. It demonstrates that the greatest lever for retrieval efficiency is not at the algorithm level (better embeddings, more precise similarity calculations) but at the data organization level — how information is placed into the right drawers.

Solution Two: Layered Loading

Not all memories need to be loaded in every session. Human memory is layered too — your name, your job, your team are “always online” memories; what you had for lunch last Tuesday is an “on-demand retrieval” memory.

AI memory can adopt the same layered strategy:

LayerContentSizeLoading Timing
L0Identity — who this AI is~100 tokensAlways loaded
L1Key story / key facts — high-weight and recent memories~500-800 tokensAlways loaded
L2Topic memory — room-scoped recallOn-demandOn explicit recall / lightweight filtering
L3Deep retrieval — full semantic searchOn-demandWhen explicitly needed

In the current implementation, L0 + L1 totals roughly 600-900 tokens. The README’s more aggressive 170-token figure belongs to a later stage in which L1 is fully AAAK-ified. Even at the current size, though, this is still a relatively cheap persistent context layer, and more importantly it tells the AI who you are, what your team structure is, and what project you are working on so later retrieval can start from the right frame.

Solution Three: Compression, Not Summarization

A critical distinction must be made here: compression and summarization are not the same thing.

Summarization is lossy — it discards “unimportant” information (but who defines “unimportant”?). Compression is ideally lossless (or near-lossless) — it aims to preserve the same factual assertions in a more compact representation.

Lossless text compression is generally believed to have limits — natural language has finite redundancy. But what if the compression target is not for humans to read, but for AI to read?

AI and humans process text differently. Humans need complete grammar, punctuation, and connectives to understand meaning. AI can recover full semantics from highly compressed structured text.

An AI-oriented compression dialect can, in principle, achieve very high compression while preserving the factual structure of a short, structured example. For example:

Original (~1000 tokens):

Priya manages the Driftwood team. Team members include: Kai (backend dev, 3 years experience),
Soren (frontend dev), Maya (infrastructure), Leo (junior dev, joined last month).
They are developing a SaaS analytics platform. The current sprint's task is migrating
the authentication system to Clerk. Kai recommended Clerk over Auth0 based on pricing
and developer experience.

Compressed (~120 tokens):

TEAM: PRI(lead) | KAI(backend,3yr) SOR(frontend) MAY(infra) LEO(junior,new)
PROJ: DRIFTWOOD(saas.analytics) | SPRINT: auth.migration→clerk
DECISION: KAI.rec:clerk>auth0(pricing+dx)

In this example, the factual assertions are preserved and the representation is about 8x shorter. But the current open-source dialect.compress() path should not be confused with a universal strict-lossless guarantee: its real behavior includes key-sentence extraction, topic selection, and truncation of entities / emotions / flags. In the current repository, AAAK functions more like a high-compression index layered on top of raw Drawer storage than like the only copy of the memory.

The ideal difference between AAAK-style compression and LLM summarization is: the former tries to preserve factual structure while changing representation, whereas the latter decides what to keep and what to discard. In the current implementation, the raw text remains preserved in Drawers, and the AAAK-like layer is best read as a compact navigational representation rather than a perfect substitute for the original.


The Inverted Equation, Corrected

Let us return to the equation from the beginning of this chapter.

Traditional thinking: Storage is expensive, so compression (via summarization) is needed to reduce storage and usage costs.

Reality: Storage is free, usage cost depends on retrieval efficiency, and retrieval efficiency depends on how data is organized.

Correcting this equation means:

  1. Do not optimize at the storage end. The cost of storing all raw data is near zero. Any “optimization” done at the storage end (such as summarization extraction) merely adds risk (information loss) without reducing cost.

  2. Optimize at the retrieval end. The real cost savings come from loading fewer but more accurate tokens into the LLM context. Today’s 600-900-token wake-up versus 7,500 tokens of summaries already demonstrates this advantage; if the README’s AAAK wake-up path is fully connected later, that gap grows even further.

  3. Invest at the organization end. The 34% retrieval improvement comes from how data is organized — this is the highest-ROI component of the entire approach. Good data organization can make simple retrieval algorithms match the performance of complex algorithms on unorganized data.


Transition: From “Why” to “How”

Three chapters in, we have completed the full description of the problem space:

  • Chapter 1 answered “what is happening” — technical decisions have migrated into AI conversations, producing large quantities of irreplaceable knowledge assets daily that evaporate after sessions end.
  • Chapter 2 answered “why existing approaches don’t work” — the assumption of having LLMs extract key information is fundamentally wrong; the false memories it produces are more dangerous than no memory at all.
  • Chapter 3 answered “what is the right direction” — store everything (at near-zero cost), then make retrieval efficient through data organization (34% improvement from structure).

The reader should now have a clear understanding of the following points:

  • The problem is real and urgent.
  • The mainstream existing approach (LLM extraction) has a fundamental flaw in its core assumption.
  • The right direction is “complete storage + structured retrieval,” not “intelligent extraction + flat storage.”
  • Storage is not the bottleneck; organizational method is the key.

But we have not yet answered the specific “how”:

  • What data organization structure produces that 34% retrieval improvement?
  • What exactly does the current ~600-900-token wake-up contain, and what would the README’s ~170-token AAAK path change?
  • How does the on-demand semantic search work?
  • How is 30x lossless compression achieved?

These questions will be addressed one by one in Part 2 of this book — the solution space. There, we will see how an ancient memory technique — the memory palace — was reinvented as a knowledge architecture for the AI era.