The continuous-improvement stack for agents.

Monitor and improve your agent's behavior at scale.

Book demo

Input

Reasoning & Execution

Response

User Interaction

FIG 1: CUSTOMER SUPPORT AGENT

Mission

Agents misbehave in ways that are hard to detect and fix.

When a production failure is reported, you have to decide whether it's worth solving — which customers are affected, on what use cases, how often.

Digging too deep wastes time on one-offs, but skipping investigations risks customer fires.

Once you commit to a fix, the work gets tedious fast: combing through long traces, hand-crafting eval criteria, and manually correlating scattered signals.

The answer isn't more dashboards.

It's agents swarming over your production data to automate the path from detection to maintenance.

Ask Judgment in Slack

When agents misbehave or users complain, work with Judgment in Slack to start investigating right away.

#judgment-agent

Stanley Tang 10:14 AM

@Judgment why did it give a full refund? only one of three items was damaged

Screenshot of agent refund confirmation message

Judgment APP 10:15 AM

The user said “the order is broken” — agent treated it as order-level and refunded $48.99. The damaged-items tool was never called.

Stanley Tang 10:16 AM

@Judgment is this a big issue?

Judgment APP 10:17 AM

It happened 142 times last week — 3.9% of refund runs, ~$6.8k in over-refunds. 3 of your top 10 customers were affected, including Acme.

View Platform

Triage issues easily

Deploy agent swarms to find similar failure cases, analyze which use cases are impacted, and narrow root causes.

Refund escalation triage

Trace: refund_0327

customer received a refund without supervisor approval. is this a common issue?

I found the same sequence in trace refund_0482: policy lookup starts, the refund tool times out, then the agent continues to issue credit.

The missing guard is visible in the refund policy span.

has this happened a lot recently?

There are 18 matched sessions with the same behavior: issue_credit runs before the policy span resolves.

Most affected sessions were refund escalations, so the root cause should be tracked as a behavior instead of a single trace.

can you track this so that we know when it happens again

Thinking...

Reading failed and passing sessions

Extracting policy evidence requirements

Checking false positives

Ask a follow-up
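The behavior described in the thread above, issue_credit running before the policy span resolves, can be expressed as a simple ordering check over a trace's spans. The sketch below is only an illustration under stated assumptions, not Judgment's implementation: the `Span` shape, the `refund_policy` span name, the trace IDs, and the `credit_before_policy` helper are all hypothetical. Only the `issue_credit` name comes from the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    start: float          # seconds since the trace started
    end: Optional[float]  # None if the span never resolved (e.g. it timed out)


def credit_before_policy(spans: list[Span]) -> bool:
    """Return True if issue_credit starts before the policy span has resolved."""
    policy_ends = [s.end for s in spans if s.name == "refund_policy"]  # assumed span name
    credits = [s for s in spans if s.name == "issue_credit"]
    if not credits:
        return False
    # A missing or unresolved policy span means the credit was never guarded.
    if not policy_ends or any(end is None for end in policy_ends):
        return True
    resolved_at = max(policy_ends)
    return any(c.start < resolved_at for c in credits)


# Hypothetical traces, made up for illustration.
traces = {
    "trace_timeout": [Span("refund_policy", 0.0, None), Span("issue_credit", 6.0, 6.2)],
    "trace_early": [Span("refund_policy", 0.0, 2.0), Span("issue_credit", 1.0, 1.2)],
    "trace_ok": [Span("refund_policy", 0.0, 1.0), Span("issue_credit", 1.5, 1.7)],
}
matched = sorted(tid for tid, spans in traces.items() if credit_before_policy(spans))
print(matched)
```

Tracking the pattern as a predicate over every session, rather than bookmarking one trace, is what lets a recurrence be counted when it shows up again.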

Sanity check before you ship

Test your proposed fixes against cases from production so you aren't shipping in the dark.

Refund Approval Guard

Agent Test

Verifies the agent escalates refunds to a supervisor before issuing them

96% Passed

48 True

2 False

Refund Guard Judge v2

True: Agent correctly verified with supervisor before accepting the refund request.
Input: This is becoming a constant headache, and frankly, my team is losing faith in your platform. We're an Enterprise client, paying a significant amount…
Actual Output: I completely understand your frustration, and I want to sincerely apologize for the experience you and your team have been having. Losing visibility into…

True: Agent correctly confirmed the duplicate charge and escalated the refund request with existing evidence instead of asking the user to repeat information.
Input: I was charged twice for the same invoice. I already sent screenshots last week and nobody has fixed it.
Actual Output: I'm sorry this has taken so long. I reviewed the billing history and can confirm there are two charges tied to the same invoice period. I'm escalating this to our billing team for refund…

True: Agent correctly escalated the refund or credit request to the account and billing teams instead of directly approving an enterprise contract refund.
Input: We're an enterprise customer and your product has been unusable for our team this month. I want a refund or credit for the downtime.
Actual Output: I understand why you're requesting this. Because this involves an enterprise contract and service-impact credit, I'll need to escalate it to your account team and billing for review. I'll include the affec…

True: Agent correctly avoided approving an out-of-policy refund and escalated the case for exception review.
Input: I know I'm outside the refund window, but I barely used the product and I need my money back.
Actual Output: I understand. Since this is outside the standard refund window, I can't approve it directly. I can escalate your case for an exception review and…
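The run summary above (96% passed, 48 True, 2 False) is the share of judged cases with a True verdict: 48 / (48 + 2) = 0.96. A minimal sketch of that tally follows, assuming a hypothetical `JudgedCase` record and `summarize` helper; neither is part of Judgment's API, and the case IDs are synthetic.

```python
from dataclasses import dataclass


@dataclass
class JudgedCase:
    case_id: str  # hypothetical identifier for a production case
    passed: bool  # judge verdict: True if the agent escalated before refunding


def summarize(cases: list[JudgedCase]) -> dict:
    """Count verdicts and compute the pass rate for one test run."""
    if not cases:
        raise ValueError("no judged cases to summarize")
    n_true = sum(c.passed for c in cases)
    return {
        "true": n_true,
        "false": len(cases) - n_true,
        "pass_rate": n_true / len(cases),
        "failed_ids": [c.case_id for c in cases if not c.passed],
    }


# Synthetic run matching the counts shown above: 48 True, 2 False.
cases = [JudgedCase(f"case_{i:02d}", passed=(i >= 2)) for i in range(50)]
summary = summarize(cases)
print(f"{summary['pass_rate']:.0%} passed")  # 96% passed
print(summary["failed_ids"])                 # ['case_00', 'case_01']
```

Returning the failing case IDs alongside the rate keeps the two False verdicts easy to open and read before a fix ships.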

Never miss an issue again

Judgment automatically tracks your agent and user behaviors and surfaces any recurrences to protect you from model drift and regressions.

Missed Escalation

Cases where the agent should have escalated but resolved on its own

1.3K traces

5.4% Detection rate

[Chart: x-axis 4/29 7pm to 5/6 7pm, y-axis 0 to 60]

Timestamp | Name | Input | Output
2026-05-07 12:16:39 | run_customer_service_ag | Look, I appreciate you looking into the t | API Error (claude-haiku-4-5-202510
2026-05-07 12:16:39 | run_customer_service_ag | I'm checking the logs now and I'm still s | API Error (claude-haiku-4-5-202510
2026-05-07 12:16:38 | run_customer_service_ag | Sorry, I didn't include it earlier—my acc | API Error (claude-haiku-4-5-202510
2026-05-07 12:16:37 | run_customer_service_ag | Hey, I'm looking at our latest invoice an | API Error (claude-haiku-4-5-202510
2026-05-07 12:16:35 | run_customer_service_ag | actually while ur looking into that, i no | API Error (claude-haiku-4-5-202510
2026-05-07 12:16:34 | run_customer_service_ag | my account email is sarah.jenkins@luminal | API Error (claude-haiku-4-5-202510

Judgment MCP

Claude Code v2.1.167
Opus 4.8 (1M context) with medium effort
~/agents/support-copilot

Use Judgment wherever you work

Search traces, investigate behaviors, run tests, and take action directly from Claude, Codex, Cursor, or any MCP client.

We are an applied-research lab solving last-mile agent reliability in production.

Here's some early research we have productionized.

Agent Search

Beyond filters and input/output keyword search, query across trajectories at a behavioral level.

Agent Judge

Making cheaper and more accurate trajectory-level evaluators using harnesses.

Behavior Discovery

Surfacing failure modes and usage patterns from unlabeled production trajectories.

AutoRubrics

Automatically constructing and refining evaluation rubrics from verifiable signals.

Trusted by teams building the next generation of deep agents.

“We tried every solution in the space and Judgment Labs is in another league. The product is mind blowing. If you're on the fence, just try it.”

Sam Blond

CEO, Monaco

“Awesome, awesome, awesome product! Judgment is the best to work with, you are the most responsive company we've seen.”

Amogh Chaturvedi

CEO, Human Behavior

“SO much better than what we've tried before. Feels like I finally understand when/where my agents are screwing up.”

Aditya Sood

CTO, Contrario

“We love this product. It abstracts the busy work out of our monitoring + evals process. One of the most customer obsessed teams I've worked with.”

Caleb Sirak

CTO, E3 Group

“LOVE Judgment. It's like failure detection on autopilot. Night and day difference between what we were trying beforehand!”

Kole Lee

CEO, Vigil Labs

Blogs

More coming soon

Research

Agent Judge: Solving Long-Horizon Evals for Production Agents

Rishi Gujjar, Andrew Li · May 27, 2026

Perspective

Climbing the Hills That Matter

Andrew Li, Alex Shan · Oct 7, 2025

Company

Announcing $32M in Funding Led by Lightspeed

May 12, 2026

Research

Building the Multi-Agent Infrastructure Behind Judgment (coming soon)

Research

Behavior Discovery with RLMs (coming soon)

Research

Enabling Self-Improving Agent Harnesses with Evals (coming soon)

Your Agents need better Judgment

Book demo