Research

At Judgment Labs, our team of researchers from Stanford AI Lab, Berkeley AI Research, and Together AI brings academic rigor to agentic evaluations. We go beyond generalized LLM benchmarks with a specialized suite designed to measure agent planning, reasoning, and tool-use in downstream tasks. Our mission is to redefine the standards for assessing agentic systems through deep execution tracing and agent-specific metrics.

Evaluation Research

Agentic Evaluations

Unlike RAG workflows, agent systems follow non-deterministic execution paths that require deep planning, memory, and the ability to reason over new context. This novelty motivates a new evaluation structure that reflects agents' inherent structural complexity. Judgment provides evaluation metrics to measure performance across agent tool-use, memory, planning, and more.

Osiris Evaluation Suite

Traditional automated evaluations are performed via deterministic algorithms (e.g. BLEU, ROUGE) or LLM-as-a-judge, where an LLM is tasked with scoring an input or output against a set of criteria. Since these are usually constructed via generic prompts to a single LLM judge call, the scores can often be unreliable, inconsistent, and low-signal. To make automated evaluations more robust, our Osiris Evaluation Suite decomposes complex evaluations into subtasks executed by a multi-agent system. We optimize this system end-to-end using research techniques such as test-time scaling, reinforcement learning, and fine-tuning on synthetic data.

This multi-agent architecture produces evaluations that are noticeably more accurate than single-LLM-call systems, as evidenced by our performance on internal benchmarks and publicly available datasets.

Osiris Case Study: Hallucination Detection


To clarify, we define a hallucinated response as an LLM output that contradicts the grounded retrieval context and/or diverges from the task instructions for the LLM call.

As a demonstration of how the Osiris Evaluation Suite outperforms existing methods, we measure our hallucination detection eval on the RAGTruth benchmark, a comprehensive dataset of 18,000 challenging multi-hop reasoning questions with a mix of hallucinated and truthful answers. This dataset stress-tests a model's ability to detect the subtle hallucinations that are representative of the edge cases builders see daily. The Judgment Osiris system scored significantly higher on recall, precision, and F1 than the faithfulness metrics from LangSmith and Arize AI.
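For reference, these detection metrics reduce to standard binary-classification scores over per-response hallucination labels. A minimal sketch using scikit-learn (the label arrays below are illustrative, not RAGTruth data):

```python
# Precision, recall, and F1 over per-response hallucination labels.
# The arrays here are toy examples, not benchmark data.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # 1 = response is actually hallucinated
y_pred = [1, 0, 1, 0, 0, 1]  # 1 = detector flagged the response

print("precision:", precision_score(y_true, y_pred))  # flags that were real hallucinations
print("recall:   ", recall_score(y_true, y_pred))     # hallucinations the detector caught
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```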

Furthermore, we outperformed LangSmith and Arize AI on LegalBench, a collection of adversarial hallucination examples in the legal domain consisting largely of queries about document parsing and casework.

To understand this sharp improvement, we can break down the methodology of the Osiris hallucination detection model, which uses three subtask agents to handle the complex task of detecting contradiction. The first subtask agent extracts the claims from the original LLM output that we want to verify. A second agent reviews those extracted claims against the retrieval context, while a third agent, in parallel, reviews them against the task instructions. Finally, a synthesizer agent reviews the work of the subtask agents and makes a final decision about which parts of the LLM output are hallucinated.
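The sketch below illustrates this decomposition in simplified form. It assumes a caller-supplied `llm` function wrapping any chat model; the prompts and the rule-based synthesis step are illustrative stand-ins, not the production Osiris implementation:

```python
# Simplified sketch of an Osiris-style decomposition for hallucination detection.
# `llm` is a caller-supplied function that sends a prompt to any chat model and
# returns its text response.
from typing import Callable

def extract_claims(output: str, llm: Callable[[str], str]) -> list[str]:
    """Subtask agent 1: pull the verifiable claims out of the LLM output."""
    listing = llm(f"List each factual claim in this response, one per line:\n{output}")
    return [line.strip() for line in listing.splitlines() if line.strip()]

def contradicts_context(claim: str, context: str, llm: Callable[[str], str]) -> bool:
    """Subtask agent 2: does this claim contradict the retrieval context?"""
    verdict = llm(
        f"Context:\n{context}\n\nClaim: {claim}\n"
        "Does the claim contradict the context? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def diverges_from_instructions(claim: str, instructions: str,
                               llm: Callable[[str], str]) -> bool:
    """Subtask agent 3: does this claim diverge from the task instructions?"""
    verdict = llm(
        f"Instructions:\n{instructions}\n\nClaim: {claim}\n"
        "Does the claim diverge from the instructions? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def detect_hallucinations(output: str, context: str, instructions: str,
                          llm: Callable[[str], str]) -> dict:
    """Synthesis step: combine subtask verdicts into a final per-claim decision.
    In the real system this is itself an agent; here it is reduced to a rule."""
    claims = extract_claims(output, llm)
    flagged = [
        c for c in claims
        if contradicts_context(c, context, llm)
        or diverges_from_instructions(c, instructions, llm)
    ]
    return {"hallucinated_claims": flagged, "is_hallucinated": bool(flagged)}
```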

Insights Research

As part of our effort to optimize agentic systems and build the infrastructure layer for AGI, Judgment's research team explores methods that help teams quickly diagnose and address errors in their agents.

Model Context Protocol (MCP)

By integrating with the Model Context Protocol (MCP), an open standard for delivering context and tools to LLMs, Judgment Labs assists with debugging complex workflows. Feeding coding agents like Cursor and Windsurf the rich context of runtime data via traces and evaluations cuts down development time and pushes the boundaries of AI code development.
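As a rough illustration, the sketch below uses the official MCP Python SDK (FastMCP) to expose trace data as tools a coding agent can call. The JSONL trace file, score threshold, and tool names are assumptions for the example, not Judgment's actual MCP server:

```python
# Minimal sketch of an MCP server that exposes agent traces to a coding agent.
# Uses the official MCP Python SDK (FastMCP); the local JSONL trace export and
# tool names are illustrative.
import json
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("trace-debugger")
TRACE_FILE = Path("traces.jsonl")  # assumed local export of runtime traces

@mcp.tool()
def list_failed_traces() -> list[dict]:
    """Return traces whose evaluation score fell below a threshold."""
    traces = [json.loads(line) for line in TRACE_FILE.read_text().splitlines() if line]
    return [t for t in traces if t.get("eval_score", 1.0) < 0.5]

@mcp.tool()
def get_trace(trace_id: str) -> dict:
    """Fetch the full span tree for one trace so the coding agent can inspect it."""
    for line in TRACE_FILE.read_text().splitlines():
        trace = json.loads(line)
        if trace.get("id") == trace_id:
            return trace
    return {"error": f"trace {trace_id} not found"}

if __name__ == "__main__":
    mcp.run()  # Cursor or Windsurf can then be pointed at this server
```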

Clustering

Our custom clustering algorithm groups experimental runs by shared failure signatures, transforming raw LLM trace logs into intuitive visual clusters so that builders can analyze agent behavior and surface errors at scale. Built on HDBSCAN, it provides powerful observability over massive datasets: by filtering for error cases, builders can identify the inputs that lead to hallucinations, poor retrieval results, and other failures. Instead of sifting through raw traces and logs for hours, you can instantly locate failures and apply fixes at scale.
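A minimal sketch of the underlying idea, using scikit-learn's HDBSCAN. TF-IDF vectors stand in for whatever embedding model the platform actually uses, and the failure strings are illustrative:

```python
# Cluster traces by failure signature with HDBSCAN.
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

failure_signatures = [
    "tool call timed out: search_api",
    "tool call timed out: search_api after retry",
    "retrieved context empty for query",
    "retrieved context empty, fallback triggered",
    "output contradicts retrieved document",
]

# Embed each failure signature, then let HDBSCAN find dense groups;
# points labeled -1 are noise (unclustered one-off failures).
X = TfidfVectorizer().fit_transform(failure_signatures).toarray()
labels = HDBSCAN(min_cluster_size=2).fit_predict(X)

for signature, label in zip(failure_signatures, labels):
    print(label, signature)
```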

Synthetic Data Generation

Collecting high-quality real-world data can be challenging. Judgment's synthetic data generation enables developers to test their agents in low-data situations and still ensure proper validation. With just a few sample data points, Judgment's augmentation workflows can produce thousands of unique cases to stress-test your systems.
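A minimal sketch of this kind of seed-based augmentation, assuming a caller-supplied `generate` function wrapping any chat model; the prompt and field names are illustrative:

```python
# Expand a handful of seed examples into many paraphrased / perturbed test cases.
# `generate` is a caller-supplied function that wraps any chat model.
from typing import Callable

def augment(seeds: list[dict], generate: Callable[[str], str],
            n_variants: int = 5) -> list[dict]:
    cases = []
    for seed in seeds:
        for i in range(n_variants):
            prompt = (
                "Rewrite the following test case with different wording and "
                f"edge-case values, keeping the intent identical (variant {i + 1}):\n"
                f"input: {seed['input']}\nexpected: {seed['expected']}"
            )
            cases.append({"source": seed, "synthetic_input": generate(prompt)})
    return cases
```

In practice, the generated cases would be deduplicated and validated before being added to a test suite.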

Publications and Contributions

Our work has been presented at ICLR, EMNLP, and other top research venues. We actively publish open-access papers, share our code, and maintain collaborative forums to push forward the science of reliable agentic AI.

Built and backed by AI leaders from the world's top institutions