48 Days of AI Memory: What the Productivity Data Actually Shows

Most organizations measure AI adoption by counting licenses, API calls, or chatbot sessions. These are activity metrics. They tell you tools are being used. They do not tell you whether the work is getting better.

We tracked something different. Over 48 operational days, we measured what happens when AI systems are given structured memory — the ability to log their own work, search prior context, and reinforce what was useful. The results are drawn from a live production environment, not a sandbox.

The Setup

The system under observation is a local-first AI memory protocol running on commodity hardware. No cloud dependency, no vendor lock-in, no fine-tuning. The protocol gives AI models access to a shared knowledge base and a structured logging bus. Models can search what prior sessions produced, confirm which results were useful, and log new decisions. The knowledge base learns what matters through use — frequently accessed documents are promoted, unused ones naturally decay.
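A minimal sketch of that tool surface, assuming a simple in-memory store. The `MemoryBus` class and its method names are hypothetical illustrations of the search/log/confirm pattern described above, not the protocol's actual API:

```python
# Hypothetical sketch of the memory protocol's tool surface.
# Names (MemoryBus, log, search, confirm) are illustrative, not the real API.

class MemoryBus:
    def __init__(self):
        self.documents = {}   # doc_id -> {"text": str, "hits": int, "confirms": int}
        self.next_id = 0

    def log(self, text):
        """Record a decision, finding, or checkpoint for future sessions."""
        doc_id = self.next_id
        self.documents[doc_id] = {"text": text, "hits": 0, "confirms": 0}
        self.next_id += 1
        return doc_id

    def search(self, query):
        """Return prior artifacts matching the query; each hit counts as a recall."""
        results = []
        for doc_id, doc in self.documents.items():
            if query.lower() in doc["text"].lower():
                doc["hits"] += 1
                results.append((doc_id, doc["text"]))
        return results

    def confirm(self, doc_id):
        """Mark a retrieved artifact as genuinely useful for the current task."""
        self.documents[doc_id]["confirms"] += 1

bus = MemoryBus()
bus.log("Chose SQLite for the session store; Postgres was overkill for one node.")
hits = bus.search("sqlite")
bus.confirm(hits[0][0])   # reinforce: this retrieval helped
```

The essential design choice is that every retrieval and confirmation is itself recorded, which is what makes the usage statistics in the next section measurable at all.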

Ten distinct AI model families participated during the measurement window, including models from OpenAI, Anthropic, and locally hosted open-source models. No model was hard-coded to use the memory system; the protocol was surfaced through documentation and system prompts, and models adopted it because the infrastructure made it the path of least resistance.

48 Days by the Numbers

At a glance: 3,160 tool calls · 1,410 logged artifacts · 58.2% confirm rate · 373 tasks completed.
  • 3,160 tool calls — AI models invoking structured tools (search, log, confirm, index) rather than generating responses from weights alone
  • 1,410 logged artifacts across 609 work sessions — decisions, findings, checkpoints, and implementation records, all queryable by future sessions
  • 371 memory recalls — explicit queries against prior work. 216 of those (58%) were confirmed as useful by the model that retrieved them
  • 1,812 indexed documents with automatic temperature tiering — 45 “hot” (frequently accessed), 836 “warm,” 931 “cold” (naturally decayed)
  • 373 tasks completed across 29 categories — administration, development, finance, infrastructure, personal operations
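The hot/warm/cold split above can be illustrated with a simple classifier over access counts and recency. The thresholds below are invented for illustration; the post does not state the protocol's actual tiering rules.

```python
# Hypothetical temperature tiering by access count and recency.
# The cutoffs (hot_min=10, cold_after_days=14) are illustrative only.

def temperature(access_count, days_since_last_access, hot_min=10, cold_after_days=14):
    """Classify a document as hot, warm, or cold."""
    if days_since_last_access > cold_after_days:
        return "cold"    # unused documents naturally decay
    if access_count >= hot_min:
        return "hot"     # frequently accessed, promoted
    return "warm"

docs = [(25, 1), (4, 3), (0, 40)]   # (access_count, days_since_last_access)
tiers = [temperature(c, d) for c, d in docs]
print(tiers)   # ['hot', 'warm', 'cold']
```

A rule this simple is enough to produce the 45 / 836 / 931 distribution reported above, because most documents are written once and never read again.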

What the Confirmation Rate Reveals

The most telling metric is not the volume of tool calls. It is the confirmation rate — the percentage of memory retrievals that the AI model found genuinely useful for the current task.

Figure: Confirmation Rate Over Time (Feb 8 to Apr 14). Percentage of memory retrievals confirmed as useful. Early queries returned weak results; as the knowledge base matured, retrieval quality climbed past 95% and then sustained 100% on most days.

In the first week, the confirmation rate was approximately 35%. The knowledge base was sparse, and retrieved results were often tangential. By week three, the rate had climbed above 90%. By week seven, it was sustaining 100% on most days.

This is not the AI getting smarter. The model weights did not change. What changed was the knowledge surface — the accumulated record of prior decisions, corrections, and context that the system could draw from. The system learned what was relevant through the reinforcement loop: search, use, confirm. Documents that were confirmed rose in priority. Documents that were never accessed naturally decayed.
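The reinforcement loop lends itself to a compact model: priority decays every cycle and is boosted only on confirmed use. The decay and boost constants below are hypothetical; the post does not publish the protocol's actual update rule.

```python
# Sketch of confirmation-driven priority with decay. The decay rate and
# confirmation boost are illustrative; the real parameters are unknown.

def update_priority(priority, confirmed, decay=0.95, boost=1.0):
    """Decay every cycle; add a boost only when a retrieval was confirmed useful."""
    priority *= decay
    if confirmed:
        priority += boost
    return priority

# A document confirmed on most cycles climbs toward an equilibrium;
# one that is never confirmed decays toward zero.
useful, ignored = 0.0, 5.0
for cycle in range(30):
    useful = update_priority(useful, confirmed=True)
    ignored = update_priority(ignored, confirmed=False)

print(round(useful, 2), round(ignored, 2))
```

Under these constants a consistently confirmed document converges toward boost / (1 - decay), so relevance is an equilibrium of use, not a one-time label.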

This is the difference between AI adoption and AI maturity. Adoption is deploying the tool. Maturity is when the tool’s environment improves through use.

Who Is Doing the Work

Model Participation
Artifacts Written by Model Family
Multiple model families contribute to the shared memory bus. Local models (Qwen) produced more artifacts than any cloud model — fully autonomous, zero human prompting.
  • Qwen (local): 426
  • GPT-5 / 5.2: 384
  • Watcher: 169
  • Claude Opus: 97
  • Codex: 68
  • Claude Sonnet: 51
  • ChatGPT: 21

The most prolific contributor was not a frontier cloud model. A locally hosted 3B-parameter model produced 426 artifacts, more than any single cloud model and more than all Claude variants combined. These were capsule summaries generated autonomously by a background process, with no human prompting; the model adopted the logging protocol simply because the infrastructure made logging easier than skipping it.

The Productivity Signal

Task Completion
Tasks Completed by Category
373 real operational tasks completed across 29 categories. The AI augmented a single operator’s capacity by maintaining context across sessions and reducing decision re-derivation.
Tasks by category (completed / total where tasks remain):
  • Admin: 147 / 259
  • Work: 46
  • Finance: 39 / 51
  • Homelab: 35 / 36
  • Development: 21
  • Travel: 20
  • Personal: 15
  • Docs: 7
  • House: 3 / 13

These 373 tasks were completed in the 48-day window since the protocol was activated. They are not synthetic benchmarks; they span real operational categories including document processing, feature implementation, financial analysis, and infrastructure maintenance.

The AI did not complete these tasks autonomously. It augmented a single operator’s capacity by maintaining context across sessions, surfacing relevant prior work, and reducing the time spent re-deriving decisions that had already been made.

This is the practical definition of epistemic instrumentation: making knowledge work rather than simply making it available.

Portability

On day 48, the protocol was installed on a second machine — a different operating system, different network, different security policies. Five platform-level issues were encountered, all resolved in under an hour. Zero changes were required to the protocol’s core logic. The same reinforcement loop that works on the origin machine works identically on a new node.

This matters for organizations considering AI memory systems. The question is not whether the system works on the original developer’s machine. The question is whether the pattern survives transplant. In this case, it does.

What This Means for Organizations

Most organizations have a version of this problem. They deploy AI tools — copilots, chatbots, automation platforms — but the tools start fresh every session. There is no accumulation. No reinforcement. No mechanism for the system to learn which knowledge is valuable and which is noise.

The result is epistemic debt: the gap between what an organization knows and what its systems can actually activate when it matters. AI tools widen this gap when they generate confident outputs without consulting available evidence. They narrow it when they are given structured access to prior decisions and a feedback loop that promotes what works.

The pattern is general, even if this implementation is specific:

  1. Give AI systems access to structured prior context — not just documents, but logged decisions, confirmed retrievals, and session history
  2. Measure what gets used, not just what gets stored — temperature tiering, access counts, and confirmation rates reveal whether knowledge is active or dormant
  3. Let the system decay what is not useful — automated lifecycle management prevents knowledge bases from becoming graveyards of outdated information
  4. Track confirmation, not just retrieval — the difference between “the AI searched” and “the AI found something useful” is the difference between activity and productivity
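The distinction in point 4 reduces to a one-line metric, shown here with the recall figures reported earlier in this post (371 recalls, 216 confirmed):

```python
# Confirmation rate from the post's own numbers: 371 recalls, 216 confirmed.
recalls, confirmed = 371, 216
confirmation_rate = confirmed / recalls
print(f"{confirmation_rate:.1%}")   # 58.2%
```

Tracking this ratio over time, rather than raw retrieval counts, is what separates an activity dashboard from a productivity signal.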

The Bottom Line

48 days. 3,160 tool calls. 373 tasks completed. A confirmation rate that climbed from 35% to 100%. A protocol portable enough to install on a second machine without modifying core logic.

These are not projections. They are operational records from a system that has been running continuously since February 2026. The research is published on OSF, and the live data dashboard is at wanatux.net/sigil.

AI productivity is not about the model. It is about what the model can remember.

Exploring similar questions?

WBA works with organizations navigating operational complexity. If this analysis resonates with challenges you're facing, let's start a conversation.

Start a Research Inquiry →