
Why We Believe the Future of Legal AI Is Stateful
Serious legal AI cannot reread the matter from scratch every time.
For the last year, our team has been working on one of the hardest problems in legal AI: memory.
Most AI systems do not actually remember your work. They take your uploads, break them into tokens, and reread those tokens from scratch every time you ask a question. Nothing is truly retained. Nothing is understood and set aside for next time.
That may work for a few documents. It breaks down when you are dealing with a 12,000-document record, years of matter history, or a live litigation file. Accuracy suffers. Detail gets lost. Nuance becomes harder to hold. And the cost rises exactly when the work becomes more complex and more important.
It is like having a brilliant associate who forgets the entire case file after every question and has to reread the whole thing before answering the next one. That is not how serious legal work should function.
Most of the industry has responded by building bigger: bigger models, bigger context windows, more compute. We think that is the wrong instinct. It is like trying to fix a forgetful associate by making them read faster.
So we took a different path.
Stateful Swarms
We call the approach Stateful Swarms. The idea is simple to say and hard to build:
Read once. Remember. Query cheaply, forever.
Instead of one large model rereading a matter every time, a coordinated group of specialized agents reads the documents once, writes down what it learns as structured and verifiable findings, and stores that understanding in persistent memory. Once the work of reading is done, the system can query, update, and reason over that memory cheaply going forward.
You pay to understand the document set once. After that, the system works from memory.
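The read-once pattern described above can be sketched in a few lines. This is a minimal illustration, not the Stateful Swarms implementation: the `Finding` and `MatterMemory` names, the line-per-finding extractor, and the sample documents are all hypothetical stand-ins for the agents and persistent store a real system would use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    """One structured fact extracted from a document (hypothetical schema)."""
    doc_id: str
    claim: str


@dataclass
class MatterMemory:
    """Hypothetical in-memory store; a real system would persist to disk or a database."""
    findings: list[Finding] = field(default_factory=list)
    docs_read: int = 0  # how many times a document was actually read

    def ingest(self, doc_id: str, text: str) -> None:
        # The expensive step, done once per document. This stand-in extractor
        # treats each non-empty line as a finding; a real system would use agents.
        self.docs_read += 1
        for line in text.splitlines():
            if line.strip():
                self.findings.append(Finding(doc_id, line.strip()))

    def query(self, keyword: str) -> list[Finding]:
        # The cheap step: search stored findings without rereading any document.
        needle = keyword.lower()
        return [f for f in self.findings if needle in f.claim.lower()]


memory = MatterMemory()
memory.ingest("contract-001", "Termination requires 30 days notice.\nGoverning law is New York.")
memory.ingest("email-017", "Notice of termination was sent on March 3.")

hits = memory.query("termination")
more = memory.query("governing law")
```

Both queries run against stored findings, so `docs_read` stays at 2 no matter how many questions follow.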
The economics are the whole story
It is tempting to read the “39x cheaper” figure from our benchmark results as a discount. It is not. It is a different economic model.
In most legal AI systems, the cost of a matter climbs as the matter grows. More documents mean more rereading. More rereading means more tokens. More tokens mean a larger bill. That punishes the exact matters where AI should matter most: large, complex, high-stakes work.
One of our clients, who uses Irys heavily, recently tried running a similar workflow directly through Claude. In a single week, they burned through roughly $1,700 in usage. They run that same workflow through Irys at no cost beyond their license.
That gap exists because we are not subsidizing an expensive process. We rebuilt the process so it is cheap by design. The system reads and structures a document set once, then operates over that structured memory. The cost per unit of work falls as usage scales, instead of rising every time the system has to reread the record.
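The difference between the two cost curves can be made concrete with a toy model. The sketch below counts tokens processed as a proxy for cost. The corpus size and per-query figure are hypothetical values chosen for illustration, not measurements from Irys or any benchmark.

```python
def stateless_tokens(corpus_tokens: int, queries: int) -> int:
    """Tokens processed when every query rereads the full corpus."""
    return corpus_tokens * queries


def stateful_tokens(corpus_tokens: int, queries: int, tokens_per_query: int) -> int:
    """Tokens processed when the corpus is read once and queries hit memory."""
    return corpus_tokens + tokens_per_query * queries


CORPUS_TOKENS = 1_000_000   # hypothetical corpus size
TOKENS_PER_QUERY = 2_000    # hypothetical slice of memory read per query

for queries in (1, 10, 100):
    reread = stateless_tokens(CORPUS_TOKENS, queries)
    remembered = stateful_tokens(CORPUS_TOKENS, queries, TOKENS_PER_QUERY)
    print(f"{queries:>3} queries: reread={reread:,} remembered={remembered:,}")
```

Under these assumptions the stateful approach costs slightly more for a single query, because the one-time read still has to be paid. The gap opens as the query count grows, which is the regime a live matter sits in.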
There is a broader point here that is easy to miss. A great deal of legal AI today is being sold below its real cost, with the token bill quietly subsidized to win deals. Unless you are one of the frontier labs, you cannot front that cost forever. The credits run out and the discounts end. Our economics do not depend on that kind of subsidy, because the underlying process does not carry the rereading cost in the first place. That is what makes it viable to run AI across an entire firm’s book of business without the bill exploding.
The results
To validate the approach, we ran Stateful Swarms against the Harvey Legal Agent Benchmark. Credit to Harvey for publishing it. A serious public benchmark is a real contribution to the field.
The benchmark includes 1,251 tasks across 24 practice areas. Our benchmark-fair system achieved:
- 39x lower cost
- $1.30 per task versus a $50.90 baseline
- 83.74% of required answer criteria passed
- 17.75% complete-task success versus a 10.4% published baseline, roughly 1.7x higher
A word on what these numbers mean. The benchmark does not simply ask whether an answer sounded good. It checks whether the answer included the specific pieces each task required. Measured across all of those individual requirements, our system passed 83.74% of the criteria. On the harder measure, whether the system completed an entire task with every required piece present, we reached 17.75%, compared with the published baseline of 10.4%. The point is not only that it was cheaper. It was cheaper and stronger on the benchmark’s hardest measure.
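The two headline ratios follow directly from the figures listed above. A short check, using only the numbers reported in this post:

```python
baseline_cost = 50.90      # dollars per task, published baseline
our_cost = 1.30            # dollars per task, benchmark-fair system
cost_ratio = baseline_cost / our_cost

baseline_success = 10.4    # percent, published all-pass baseline
our_success = 17.75        # percent, our all-pass result
success_ratio = our_success / baseline_success

print(f"cost ratio: {cost_ratio:.1f}x, success ratio: {success_ratio:.1f}x")
```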
A few notes in the interest of transparency. We ran the full public benchmark set and scored it with Harvey’s own scorer and rubrics under the all-pass standard. Harvey’s published numbers were produced on a private holdout set that, by their own account, mirrors the practice-area and task distribution of the public set, so the comparison is across the same benchmark family and distribution rather than the identical task instances. Because of rate limits during our run, we used Gemini 3.1 FL as the scoring judge rather than the recommended Sonnet 4.6, and we confirmed over 90% agreement between the two before relying on it. We would gladly re-run against any benchmark or holdout Harvey makes available.
The full methodology, the scorer, the code, and every reasoning trace are documented in Devansh’s technical deep dive.
The breakthrough was architecture, not horsepower
We did not get there by using a larger model. We used frozen, off-the-shelf models. Used the normal way, those same models score zero on these tasks. The performance came entirely from architecture: how the system reads, stores, checks, and reasons over what it has already learned.
The lesson is simple. The future of legal AI will not be won by renting the biggest models. It will be won by building systems that remember.
Why auditability matters
There is another reason this matters for legal work. Every finding the system produces is traceable to its source: the document it came from, the step that created it, and the evidence it rests on. When the system is right, you can see why. When it is wrong, you can see where it went wrong.
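One way to picture a traceable finding is as a record that carries its own provenance and can be checked mechanically. The sketch below is an assumption about what such a record could look like, not the schema Stateful Swarms uses. The `TracedFinding` fields and the `verify` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedFinding:
    """A finding plus the provenance needed to audit it (hypothetical schema)."""
    claim: str
    doc_id: str      # document the finding came from
    step: str        # pipeline step that produced it
    evidence: str    # verbatim quote the claim rests on


def verify(finding: TracedFinding, documents: dict[str, str]) -> bool:
    """Return True only if the quoted evidence appears in the cited document."""
    source = documents.get(finding.doc_id, "")
    return bool(finding.evidence) and finding.evidence in source


documents = {"lease-004": "The tenant shall pay rent on the first day of each month."}

supported = TracedFinding(
    claim="Rent is due on the first of the month.",
    doc_id="lease-004",
    step="extract-obligations",
    evidence="pay rent on the first day of each month",
)
unsupported = TracedFinding(
    claim="Rent is due on the fifteenth.",
    doc_id="lease-004",
    step="extract-obligations",
    evidence="pay rent on the fifteenth day",
)

results = [verify(supported, documents), verify(unsupported, documents)]
```

A check like this does not prove a claim is legally correct. It only confirms that the cited evidence exists where the finding says it does, which is the part a reviewer would otherwise have to hunt for by hand.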
For lawyers, who are professionally responsible for every line of their work product, that difference is the whole game. A legal AI system that is right most of the time is only useful if the rest can be identified, checked, and corrected. That is what structured memory makes possible.
What this means for Irys
Stateful Swarms is not the full Irys platform. It is a benchmark-fair, open-source subset of the infrastructure underneath it.
Inside Irys, this same direction powers a larger vision: a matter operating system where firms draft, research, review, collaborate, and preserve institutional memory across live matters. The product is what lawyers use. The memory layer is what makes the product compound.
Claude answers questions. Irys remembers the matter.
What comes next
We have open-sourced the code, the benchmark outputs, and the reasoning traces under an MIT license. This is v1. We released it because we believe the move toward stateful systems is bigger than any one company, and the underlying truth is hard to ignore:
Recompute is waste. Session loss is waste. A prompt is not memory.
The future is stateful.
The full technical deep dive, repository, and complete results are available in Devansh’s write-up.
Frequently asked questions
Did you use the same benchmark tasks?
We ran the public Legal Agent Benchmark set in full, all 1,251 tasks. Harvey’s published numbers were produced on a private holdout set that, by their own account, mirrors the practice-area and task distribution of the public set. That means the comparison is across the same benchmark family and the same distribution of work, rather than the exact same task instances. We would gladly run any holdout set Harvey chooses to share.
Did you use the same scoring methodology?
Yes. We scored with Harvey’s own benchmark scorer and rubrics, under the same all-pass standard they use, and we have uploaded the outputs so anyone can re-run the scorer and reproduce the numbers independently. One note for completeness: because of rate limits during our run, we used Gemini 3.1 FL as the scoring judge rather than the recommended Sonnet 4.6. We compared the two judges across outputs and found over 90% agreement, so the substitution does not change the conclusions.
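For readers who want to run a similar judge comparison on the published outputs, the simplest form of agreement is the share of criteria on which both judges return the same verdict. The sketch below assumes per-criterion pass/fail verdicts and uses made-up data. The post does not specify exactly how the 90% figure was computed, so treat this as one plausible reading.

```python
def agreement_rate(judge_a: list[bool], judge_b: list[bool]) -> float:
    """Fraction of criteria on which two judges give the same pass/fail verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must score the same criteria")
    if not judge_a:
        raise ValueError("no verdicts to compare")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


# Hypothetical verdicts on ten criteria; the judges disagree on the last one.
judge_a = [True, True, False, True, False, True, True, False, True, True]
judge_b = [True, True, False, True, False, True, True, False, True, False]

rate = agreement_rate(judge_a, judge_b)
```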
What does “strict all-pass” mean?
A task counts as a pass only if every rubric criterion for that task passes. It is 100% or nothing, with no partial credit. On that measure we passed 222 of 1,251 tasks, which is 17.75%. It is the headline metric Harvey reports, and the published baseline on it is 10.4%.
What does “pooled criteria pass” mean?
It is the share of individual rubric criteria passed across all tasks: 62,800 of 74,990, or 83.74%. We lead with this number deliberately. The all-pass metric collapses “missed 1 criterion out of 60” and “missed all 60” into the same failure bucket, which discards exactly the signal an architecture release should expose: where in the pipeline criteria pass and where they fail. For evaluating a multi-agent system, criterion-level resolution is the correct granularity, and it corresponds to the standard micro-averaged metric used in information retrieval and machine learning. We report the all-pass number too, right alongside it in the results, but we do not accept that it is the only number that counts.
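The two metrics are easy to compare side by side. The sketch below uses three made-up tasks of 60 criteria each to show how all-pass scoring treats a near miss and a total miss identically, then recomputes the reported percentages from the counts given in this FAQ. The function names are illustrative and are not taken from Harvey’s scorer.

```python
def all_pass_rate(tasks: list[list[bool]]) -> float:
    """Share of tasks in which every criterion passed (no partial credit)."""
    return sum(all(criteria) for criteria in tasks) / len(tasks)


def pooled_pass_rate(tasks: list[list[bool]]) -> float:
    """Share of individual criteria passed across all tasks (micro-average)."""
    passed = sum(sum(criteria) for criteria in tasks)
    total = sum(len(criteria) for criteria in tasks)
    return passed / total


toy_tasks = [
    [True] * 60,             # every criterion met
    [True] * 59 + [False],   # missed 1 of 60
    [False] * 60,            # missed all 60
]

toy_all_pass = all_pass_rate(toy_tasks)     # 1 of 3 tasks
toy_pooled = pooled_pass_rate(toy_tasks)    # 119 of 180 criteria

# Reported counts from this post.
reported_all_pass = round(100 * 222 / 1251, 2)
reported_pooled = round(100 * 62_800 / 74_990, 2)
```

In the toy data, the second and third tasks both count as failures under all-pass, while the pooled rate still reflects that one of them was a single criterion short.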
Did you tune to the benchmark?
No. The architecture is general across task types, with no per-task logic and no hardcoding to specific tasks. Prompts were written against task verbs and document structure, not against per-task rubrics, and our isolation tooling enforces that separation during runs. Whether we eventually run the private holdout to demonstrate generalization end to end is a decision for down the road.
What is benchmark-fair versus full Irys?
Benchmark-fair means the open-source system runs vanilla API calls with an empty blackboard on every task, carrying nothing between tasks: no persistence, no knowledge graphs, no fact stores. That isolation is required for clean evaluation, because persistence would amount to learning from the benchmark itself. Full Irys, the production system, is the opposite by design: it learns from your matters and from how you work, compounding context over time. The benchmark measures the architecture in isolation. The product is what happens when you let that architecture remember.
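The isolation rule can be illustrated with a small harness. This is a sketch under stated assumptions rather than the project’s actual tooling: `run_isolated`, `demo_solve`, and the dictionary used as a blackboard are hypothetical names chosen for the example.

```python
from collections.abc import Callable

Blackboard = dict[str, str]


def run_isolated(tasks: list[str], solve: Callable[[str, Blackboard], str]) -> list[str]:
    """Run each task against a fresh, empty blackboard so nothing carries over."""
    outputs = []
    for task in tasks:
        blackboard: Blackboard = {}   # new empty state for every task
        outputs.append(solve(task, blackboard))
    return outputs


def demo_solve(task: str, blackboard: Blackboard) -> str:
    # Stand-in solver: reports how much state it inherited, then writes a note.
    inherited = len(blackboard)
    blackboard[task] = "working notes"
    return f"{task}: saw {inherited} prior notes"


outputs = run_isolated(["task-a", "task-b"], demo_solve)
```

A production-style run would instead pass one long-lived store to every call, which is the behavior the benchmark-fair configuration deliberately switches off.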


