Case Study

Ownoir

A reasoning graph system used to preserve decisions, constraints, and implementation context from AI-assisted design conversations.

AI Workflows · Reasoning Graphs · Developer Tools · Project Knowledge

Challenge

In our work as AI engineers, we observed that AI models tend to lose track of information as a conversation grows, and that context is difficult to transfer from one chat to another. The challenge was to preserve the reasoning behind AI-assisted work so that future AI sessions could build on it with continuity instead of rediscovering context every time.

System

Ownoir captured AI-assisted work sessions, converted the conversation into a structured reasoning graph, and made that graph available through a web client and agent tools.

Outcome

Ownoir demonstrated an improvement in how closely AI agents followed the original design. In a Hakai Lab benchmark averaged across 10 runs, the Ownoir graph condition fully implemented about 70 of 72 requirements, produced no failed requirements, and used about 5.8 million fewer input tokens than the full-transcript condition, a 64.9% reduction.

Background

AI chat conversations are increasingly used to design, plan, and discuss systems. A software architect might use an AI assistant to identify requirements, compare implementation paths, define constraints, and decide how a system should be built.

These conversations usually span many turns. Early prompts may describe the goal, while later turns may introduce constraints, reject approaches, refine requirements, and settle on implementation details. By the end of the conversation, the final plan is shaped by reasoning that is distributed across the exchange.

That creates a handoff problem. The architect may write a document for the development team, or ask the AI assistant to generate one from inside the conversation. Either way, the document can describe the final architecture while losing the reasoning that made the architecture specific.

For agentic software development, this matters because the agent may need to know that a requirement came from an operational constraint, that a security detail was chosen for rotation support, or that a deferred feature should not be built yet. Those details can be present in the conversation and still fail to survive the handoff.

Problem

The main problem is how to pass context to agents so that they stay within the constraints of the design. There are three common options: pass a summary, which can lose important context; pass the entire transcript, which may fill the model's working context and produce inconsistent results; or write manual notes, which takes time and depends on the judgment of the person doing the handoff.

Hakai Lab observed that a transcript preserved the full exchange, but it forced the next agent to search through a long chronological record. That created inconsistent results because an agent did not always follow the same search path through the conversation. A generated spec was easier to read, but it could turn reasoning into final instructions and drop the path that led there. A loose brief was fastest to use, but it gave the agent the least protection against missing a hidden requirement.

Hakai Lab designed a benchmark to test this failure mode directly. The task was a signup webhook service. The agent had to build a standalone local Python system that fired an HTTP POST when a user signed up.

The benchmark task included a specific technical shape. The service used FastAPI, PostgreSQL, SQLAlchemy with asyncpg, Alembic migrations, Pydantic settings, httpx for async outbound delivery, tenacity for retry behavior, ULIDs for event identifiers, and Google-style Python conventions.

The webhook payload also had specific requirements. It needed an event_id, an event_type such as user.signup, an ISO 8601 UTC timestamp, and a nested data object containing user_id, email, and subscription_tier. The outbound request needed HMAC-SHA256 signing over the timestamp and raw body, with signature and timestamp headers sent to the receiver.
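The payload and signing requirements above can be sketched with the Python standard library. This is a minimal illustration, not the benchmark's implementation: the header names, the `<timestamp>.<body>` joining format, the use of Unix seconds for the header timestamp, and the function names are all assumptions made for the example.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def build_payload(event_id: str, user_id: str, email: str, tier: str) -> bytes:
    """Serialize the event once so the same bytes are both signed and sent."""
    payload = {
        "event_id": event_id,  # a ULID in the benchmark; any string works here
        "event_type": "user.signup",
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "data": {
            "user_id": user_id,
            "email": email,
            "subscription_tier": tier,
        },
    }
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def sign(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    """HMAC-SHA256 over the timestamp and raw body (joining format assumed)."""
    message = timestamp.encode("utf-8") + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def signed_headers(secret: bytes, raw_body: bytes, now: int) -> dict[str, str]:
    """Build the outbound headers; both header names are hypothetical."""
    timestamp = str(now)  # Unix seconds, an assumption for this sketch
    return {
        "Content-Type": "application/json",
        "X-Webhook-Timestamp": timestamp,
        "X-Webhook-Signature": sign(secret, timestamp, raw_body),
    }
```

Signing the raw bytes rather than a re-serialized object matters because the receiver can only reproduce the signature if it hashes exactly what was sent.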

The receiver had its own requirements. It needed to verify signatures, reject stale timestamps, log received requests, detect duplicate event IDs, simulate failure modes, delay responses, and fail the first N requests before recovering. The test plan also required implemented happy-path and security tests, plus skipped tests with rationale for deferred failure scenarios.
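A few of the receiver checks listed above can be sketched as an in-memory class. This is an illustrative sketch under stated assumptions: the class name, the status codes, the 300-second freshness window, and the `<timestamp>.<body>` signing format are hypothetical, and request logging and response delays are left out.

```python
import hashlib
import hmac


class ReceiverChecks:
    """In-memory stand-in for the receiver's verification logic (hypothetical)."""

    def __init__(self, secret: bytes, max_age_seconds: int = 300,
                 fail_first_n: int = 0) -> None:
        self.secret = secret
        self.max_age_seconds = max_age_seconds  # assumed freshness window
        self.fail_first_n = fail_first_n
        self.request_count = 0
        self.seen_event_ids: set[str] = set()

    def handle(self, event_id: str, timestamp: str, signature: str,
               raw_body: bytes, now: int) -> tuple[int, str]:
        """Return an (HTTP status, reason) pair; the codes are assumptions."""
        self.request_count += 1
        # Fail the first N requests before recovering.
        if self.request_count <= self.fail_first_n:
            return 503, "simulated failure"
        # Reject timestamps that are malformed or outside the window.
        try:
            sent_at = int(timestamp)
        except ValueError:
            return 400, "stale timestamp"
        if abs(now - sent_at) > self.max_age_seconds:
            return 400, "stale timestamp"
        # Recompute the signature over the timestamp and raw body.
        message = timestamp.encode("utf-8") + b"." + raw_body
        expected = hmac.new(self.secret, message, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return 401, "bad signature"
        # Detect duplicate event IDs without treating them as errors.
        if event_id in self.seen_event_ids:
            return 200, "duplicate"
        self.seen_event_ids.add(event_id)
        return 200, "accepted"
```

Passing `now` in explicitly keeps the stale-timestamp check deterministic in tests, which fits a test plan that separates implemented security tests from deferred failure scenarios.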

This benchmark made the continuity problem concrete. Hakai Lab's question was whether the agent could preserve the specific design choices that had been established during the original conversation.

How it was solved

Ownoir was built to preserve AI-assisted design work as structured project knowledge.

The system captured the work session through local hooks, including prompts, assistant responses, tool use, session events, and completed turns. It then processed the conversation into a reasoning graph.

In that graph, important design information became nodes. A node could represent a requirement, decision, constraint, explanation, problem, solution, task, reference, or pattern. Relationships between nodes showed how the design fit together, such as one decision depending on a constraint or one implementation detail refining a requirement.
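One way to picture this structure is as typed nodes joined by labeled edges. The sketch below is an assumption for illustration only, not Ownoir's actual data model: the class names, the relation labels (`depends_on`), and the node texts are hypothetical, while the nine node kinds come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    REQUIREMENT = "requirement"
    DECISION = "decision"
    CONSTRAINT = "constraint"
    EXPLANATION = "explanation"
    PROBLEM = "problem"
    SOLUTION = "solution"
    TASK = "task"
    REFERENCE = "reference"
    PATTERN = "pattern"


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: NodeKind
    text: str


@dataclass
class ReasoningGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Each edge is (source id, relation label, target id).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, source: str, relation: str, target: str) -> None:
        """Link two existing nodes; unknown ids raise KeyError."""
        for node_id in (source, target):
            if node_id not in self.nodes:
                raise KeyError(node_id)
        self.edges.append((source, relation, target))

    def related(self, node_id: str, relation: str) -> list[Node]:
        """Return the targets reachable from node_id via one relation."""
        return [self.nodes[t] for s, r, t in self.edges
                if s == node_id and r == relation]


# Illustrative content: a decision that depends on a constraint.
graph = ReasoningGraph()
graph.add(Node("c1", NodeKind.CONSTRAINT, "Signing secrets must support rotation"))
graph.add(Node("d1", NodeKind.DECISION, "Accept a primary and a secondary secret"))
graph.relate("d1", "depends_on", "c1")
```

With this shape, an agent that retrieves the decision can also ask why it exists: `graph.related("d1", "depends_on")` returns the constraint node rather than leaving the rationale buried in a transcript.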

This gave future agents a more usable context source. Instead of asking the agent to read the full transcript or rely on a generated spec, Ownoir exposed the graph through tools the agent could query during implementation.

For the signup webhook benchmark, the graph preserved the decisions and constraints needed to build the service according to the original design. The agent could retrieve details about the stack, payload shape, signing method, retry behavior, local receiver behavior, testing expectations, and deferred production features.

Benchmark design

Hakai Lab designed the benchmark to compare how different forms of context affected implementation fidelity. The benchmark was run 10 times, and the results below represent the average performance across those runs.

The same source design conversation was used to create four implementation conditions. One agent received access to Ownoir’s extracted reasoning graph through agent tools. One received the raw transcript. One received a generated specification. One received only a loose feature brief.

Each agent was asked to build the signup webhook service from its assigned context. The outputs were then audited against 72 requirements. The audit checked whether the implementation preserved the original design, such as stack choices, payload structure, signing behavior, replay protection, database schema, retry configuration, receiver behavior, and test coverage.

Across the 10-run average, the graph condition fully implemented about 70 requirements, partially implemented about 2, and had no failed requirements. The transcript condition fully implemented about 65 requirements, partially implemented about 6, and had about 1 failed requirement. The generated spec condition fully implemented about 69 requirements, partially implemented about 2, and had about 1 failed requirement. The loose brief condition fully implemented about 67 requirements, partially implemented about 3, and had about 2 failed requirements.

The token comparison also showed why the graph was useful as a working memory format. Across the 10-run average, the graph condition used about 3.1 million input tokens, while the full-transcript condition used about 8.9 million input tokens. That means Ownoir used about 5.8 million fewer input tokens than the full-transcript condition, a 64.9% reduction, while producing the strongest implementation result.
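The reduction can be checked against the rounded figures quoted above. The helper name below is hypothetical. The rounded inputs give roughly 65%, close to the reported 64.9%; the small gap is presumably because the reported percentage was computed from unrounded averages, which this case study does not list.

```python
def token_reduction(baseline: float, candidate: float) -> tuple[float, float]:
    """Return (tokens saved, fraction saved) relative to the baseline."""
    saved = baseline - candidate
    return saved, saved / baseline


# Rounded 10-run averages from the text: full transcript vs. Ownoir graph.
saved, fraction = token_reduction(baseline=8.9e6, candidate=3.1e6)
print(f"{saved / 1e6:.1f}M fewer input tokens, {fraction:.1%} reduction")
```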

One example of a requirement that was lost in the handoff was dual-secret rotation for webhook signing. The graph condition preserved support for both a primary and a secondary secret. The transcript, generated spec, and loose brief conditions all missed it.
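Dual-secret rotation usually means the verifier accepts a signature made with either of two secrets, so a secret can be replaced without rejecting requests that are already in flight. The sketch below shows that idea with the standard library; the function name and argument layout are assumptions, not the benchmark's code.

```python
import hashlib
import hmac
from typing import Optional


def verify_with_rotation(signature: str, message: bytes, primary: bytes,
                         secondary: Optional[bytes] = None) -> bool:
    """Accept a hex HMAC-SHA256 signature made with either configured secret."""
    for secret in (primary, secondary):
        if secret is None:
            continue  # no secondary secret configured
        expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return True
    return False
```

During a rotation the new secret becomes primary and the old one moves to secondary until senders have switched over, after which the secondary can be dropped.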

Impact

Ownoir showed that preserving reasoning as a graph could improve how closely an AI agent followed the original design while reducing the amount of context required to continue the work.

| Context given to agent | Fully implemented | Partially implemented | Failed requirements | Input tokens |
| --- | --- | --- | --- | --- |
| Ownoir graph | ~70 / 72 | ~2 | ~0 | ~3.1M |
| Full transcript | ~65 / 72 | ~6 | ~1 | ~8.9M |
| Generated spec | ~69 / 72 | ~2 | ~1 | ~4.6M |
| Loose brief | ~67 / 72 | ~3 | ~2 | ~3.6M |

Average results across 10 benchmark runs.

| Comparison | Fully implemented requirements | Failed requirements | Input token difference |
| --- | --- | --- | --- |
| Ownoir graph vs. full transcript | ~+5 | ~-1 | ~5.8M fewer input tokens, 64.9% fewer |
| Ownoir graph vs. generated spec | ~+1 | ~-1 | ~1.5M fewer input tokens, 32.0% fewer |
| Ownoir graph vs. loose brief | ~+3 | ~-2 | ~510K fewer input tokens, 14.1% fewer |

Comparisons are based on the 10-run average.

The strongest comparison was against the full-transcript condition. The Ownoir graph condition implemented more requirements, produced no failed requirements, and used substantially fewer input tokens than passing the entire prior conversation into the next AI session.

The improvement was most visible on requirements that depended on design memory. Common stack choices, such as Python, FastAPI, PostgreSQL, and HMAC signing, were often captured across conditions. More specific requirements, such as dual-secret rotation and configurable retry behavior, were easier to lose when the conversation was passed forward as a transcript, generated spec, or loose brief.

Beyond the benchmark numbers, clients that use Ownoir report that it is easier to find the context behind design decisions and that handoffs between team members are smoother.