Problem
The main problem is how to pass context to agents so that they stay within the constraints of the design. The options are passing a summary, which can lose important context, passing an entire transcript, which may fill the working context of a model and produce inconsistent results, or writing manual notes, which takes time and depends on the judgment of the person doing the handoff.
Hakai Lab observed that a transcript preserved the full exchange, but it forced the next agent to search through a long chronological record. That created inconsistent results because an agent may not always follow the same search path through the conversation. A generated spec was easier to read, but it could turn reasoning into final instructions and drop the path that led there. A loose brief was fastest to use, but it gave the agent the least protection against missing a hidden requirement.
Hakai Lab designed a benchmark to test this failure mode directly. The task was a signup webhook service. The agent had to build a standalone local Python system that fired an HTTP POST when a user signed up.
The benchmark task included a specific technical shape. The service used FastAPI, PostgreSQL, SQLAlchemy with asyncpg, Alembic migrations, Pydantic settings, httpx for async outbound delivery, tenacity for retry behavior, ULIDs for event identifiers, and Google-style Python conventions.
The webhook payload also had specific requirements. It needed an event_id, an event_type such as user.signup, an ISO 8601 UTC timestamp, and a nested data object containing user_id, email, and subscription_tier. The outbound request needed HMAC-SHA256 signing over the timestamp and raw body, with signature and timestamp headers sent to the receiver.
The receiver had its own requirements. It needed to verify signatures, reject stale timestamps, log received requests, detect duplicate event IDs, simulate failure modes, delay responses, and fail the first N requests before recovering. The test plan also required implemented happy-path and security tests, plus skipped tests with rationale for deferred failure scenarios.
This benchmark made the continuity problem concrete. The question Hakai Lab was answering was whether the agent could preserve the specific design choices that had been established during the original conversation.