How Acme Legal improves its Legal AI with Judgment
Background Context
Acme Legal is a legal AI company specializing in immigration (O-1A, H-1B, etc.). They help businesses and individuals secure work visas, green cards, and other immigration services. To meet heavy demand, Acme builds agentic workflows that generate rough drafts of documents (e.g. immigration recommendation letters) for handoff to lawyers. Lawyers then quickly edit these drafts to ensure every document meets quality standards.
Acme’s team faces a simple operational challenge: the time lawyers spend reviewing AI-generated documents is the bottleneck for both scale and margins. One solution is to improve the AI workflows so drafts need less manual editing, which would reduce costs and increase throughput. However, each time the team updated prompts, models, or search tools, it had no reliable way to check for quality regressions across different quality dimensions. Agent system updates were introducing both major and minor regressions that cost hours of lawyer time and were sometimes difficult to detect.
“Having any method to unblock the iteration cycles of legal services and LLM development teams was a priority. Even if we improved the efficiency of our lawyers by 5-10 percent, we could see massive gains compound over time.”
— Engineering Manager, Acme Legal
Inefficient Status Quo
Prior to working with Judgment Labs, Acme’s approach was essentially manual regression testing. Each time the AI was updated, the legal services team had to compare outputs from the old and new systems to check whether quality had improved or declined. This required pulling in lawyers, and at times AI engineers, to manually inspect drafts side by side.
The process was inefficient and expensive, and because feedback depended on who happened to be reviewing at the moment, it often produced inconsistent or unreliable results. Furthermore, after changes were deployed to production, emergent agent misbehaviors caused even more problems that the legal team had to surface to the engineering teams. Basic out-of-the-box LLM-judge approaches produced results that were inconsistent and misaligned with lawyer judgment.
To address these inefficiencies, Acme asked Judgment Labs to build custom evaluators, trained on Acme’s internal data, directly into its AI workflows to mimic a lawyer’s review and consistently judge the quality of document drafts.
Judgment Labs Approach
Judgment eliminated most of the need for lawyers to compare AI agent outputs after system prompt/model/tool updates, freeing up nearly 5 hours per lawyer every week.
Judgment engineers analyzed Acme’s internal document generation data and post-trained an LLM judge to mimic lawyer review. The data consisted of immigration case data (e.g. candidate profiles, regional precedent for certain visa types) and pairs of anonymized LLM-generated drafts (“rough drafts”) and lawyer-edited drafts (“final drafts”).
Our insight was that the final drafts contained changes representing the writing agents’ common mistakes at scale. From the pairs, Judgment used LLMs to bootstrap reasoning that explained the specific improvements lawyers made when revising each draft. We then ran ablations over the synthetic data to filter for high-quality reasoning traces and finally distilled the reasoning into a judge model via post-training. This judge was evaluated on hold-out sets of document generation tasks to confirm its ability to judge pairwise quality and provide strong explanations for why certain letters were better than others.
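As a rough illustration of the bootstrapping step (the actual pipeline is Judgment’s own and is not reproduced here), the sketch below prompts an LLM to explain the edits between a rough draft and its lawyer-edited final draft, producing one synthetic reasoning trace per pair. It assumes an OpenAI-compatible chat client; all names and prompt wording are illustrative.

```python
from dataclasses import dataclass

from openai import OpenAI  # assumes an OpenAI-compatible chat API is available

client = OpenAI()

@dataclass
class DraftPair:
    case_context: str  # anonymized candidate profile, visa type, regional precedent
    rough_draft: str   # LLM-generated draft
    final_draft: str   # lawyer-edited draft

PROMPT = """You are reviewing an immigration recommendation letter.

Case context:
{context}

ROUGH DRAFT (AI-generated):
{rough}

FINAL DRAFT (lawyer-edited):
{final}

Explain, step by step, the specific improvements the lawyer made and why they
matter for letter quality (evidence use, citations, tone, legal accuracy)."""

def bootstrap_reasoning(pair: DraftPair, model: str = "gpt-4o") -> str:
    """Generate one synthetic reasoning trace for a (rough, final) draft pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(
            context=pair.case_context,
            rough=pair.rough_draft,
            final=pair.final_draft,
        )}],
    )
    return response.choices[0].message.content

# Downstream (not shown): run ablations over these traces to keep the
# high-quality ones, then distill them into the judge model via post-training.
```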
In addition to the post-training approach, Judgment also deployed an LLM-as-jury system to improve evaluator performance via eval-time scaling. Instead of having one LLM judge make a decision, we poll N evaluators built from different base models and ablations of the post-training data, sampling a distribution of votes from the jury. A weighted majority vote is then applied as the final decision from the judge system.
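A minimal sketch of the weighted majority vote, assuming each jury member is exposed as a simple callable that returns "A" or "B" for a pairwise comparison (the interface is hypothetical, not Judgment’s internal API):

```python
from collections import defaultdict
from typing import Callable, Sequence

# Each juror compares two drafts and votes "A" or "B"; its weight reflects
# how much we trust it (e.g. accuracy on a hold-out set).
Juror = Callable[[str, str], str]

def jury_decision(jurors: Sequence[tuple[Juror, float]],
                  draft_a: str, draft_b: str) -> str:
    """Weighted majority vote over a panel of pairwise judges."""
    totals: dict[str, float] = defaultdict(float)
    for judge, weight in jurors:
        totals[judge(draft_a, draft_b)] += weight
    return max(totals, key=totals.get)

# Example with three hypothetical judge variants weighted by hold-out accuracy:
# jurors = [(judge_v1, 0.95), (judge_v2, 0.91), (judge_v3, 0.88)]
# winner = jury_decision(jurors, old_system_draft, new_system_draft)
```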
Outcomes
During back-testing against Acme’s hold-out set, Judgment’s judge system correctly identified the better document 97% of the time (vs. 52% with Acme’s baseline model) and produced feedback that closely matched how a lawyer would make the decision. Since this accuracy matched and sometimes exceeded lawyer performance, Acme deployed the evaluator into its automatic regression testing pipeline using Judgment Labs’ judgeval package.
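The actual judgeval integration isn’t reproduced here; the sketch below only shows the general shape of such a regression gate, reusing the hypothetical jury_decision helper from the earlier sketch.

```python
from typing import Callable, Sequence

def regression_gate(cases: Sequence[str],
                    old_agent: Callable[[str], str],
                    new_agent: Callable[[str], str],
                    jurors,
                    min_win_rate: float = 0.5) -> bool:
    """Block a release if the candidate agent loses to production too often."""
    wins = 0
    for case in cases:
        draft_old = old_agent(case)  # production system's draft ("A")
        draft_new = new_agent(case)  # candidate system's draft ("B")
        if jury_decision(jurors, draft_old, draft_new) == "B":
            wins += 1
    return wins / len(cases) >= min_win_rate
```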
This engineering eliminated the constant need for lawyers to review outputs from old and new AI systems, giving the team freedom to experiment with new prompts, models, tool calls, and agent memory layers without fear of slowing the development cycle or wasting lawyer time.
Judgment enabled Acme to ship 2 new agent releases 3 months ahead of schedule while cutting lawyer review time by more than 85%, saving 100+ hours each month across their caseload.
Acme Legal now runs hundreds of cases per month through the post-trained judge system, supporting a 20% larger caseload with the same team size. With a reduced need for human evaluation in the loop, they have been able to deploy agent updates to production 3x faster.
Because the judge model was trained to compare drafts and identify quality differences, Judgment modified the model to generate and use its own rubric to spot specific issues in documents. The judge system has thus also been adapted and deployed as a monitoring solution, acting as an early warning system that triages poorly generated documents before they bottleneck lawyer review cycles.
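As a rough sketch of that adaptation (function names and prompt wording are illustrative, not Judgment’s implementation): the judge first drafts a case-specific rubric, then grades each production draft against it and flags low scores for early triage.

```python
import re
from typing import Callable

def monitor_draft(judge_llm: Callable[[str], str], case_context: str,
                  draft: str, flag_threshold: float = 0.7) -> dict:
    """Generate a rubric, grade the draft against it, and flag low scores."""
    rubric = judge_llm(
        "Write a concise grading rubric for an immigration recommendation "
        f"letter for this case:\n{case_context}"
    )
    review = judge_llm(
        f"Rubric:\n{rubric}\n\nGrade the draft below from 0.0 to 1.0 and list "
        "specific issues (factual contradictions, misquoted citations, "
        f"misconstrued evidence). Begin your answer with 'SCORE: <number>'.\n\n{draft}"
    )
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", review)
    score = float(match.group(1)) if match else 0.0  # treat unparseable reviews as failing
    return {"flagged": score < flag_threshold, "score": score, "review": review}
```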
To date, Judgment’s Agent Behavior Monitoring (ABM) has been live for three weeks and surfaced over 40 cases with factual contradictions, misquoted citations, and misconstrued evidence.
“Working with Judgment Labs is like having a world-class, on-call applied AI research team. They work quickly to understand your use cases and draft novel solutions with custom evals at the heart of quality control.” — CTO, Acme Legal