How a Fortune 1000 Enterprise Improved Internal Agents with Agent Behavior Monitoring
Background Context
Acme Corp is a Fortune 1000 enterprise software company pushing to become AI-native. One of its flagship projects was an internal search agent designed to help employees quickly retrieve information across databases, applications, and internal documents. The agent could decompose complex queries into steps and dynamically choose which tools to call to return answers.
However, that sophistication often led the agent to overcomplicate simple queries with long, winding trajectories and redundant tool calls. This slowed responses, drove up compute costs, and produced infinite or incomplete action loops that eroded user trust.
Acme’s engineering leadership understood that improving trajectory efficiency was key to scaling agents across the company. But Acme lacked a consistent, systematic way to measure the efficiency of agent pathing, let alone optimize against efficiency metrics.
Suboptimal Status Quo
Before partnering with Judgment Labs, Acme relied on manual trajectory review. Engineers would compare old and new agent runs side by side, manually deciding which path was shorter, more logical, or more efficient. This approach consumed engineering hours, produced results that varied from reviewer to reviewer, and slowed iteration cycles. Out-of-the-box LLM judges did not solve the problem: their performance was finicky and inconsistent with human judgment.
Without clear metrics, engineers hesitated to ship sweeping changes to production. Updates sometimes improved one scenario while making others worse, and regressions often slipped through unnoticed until end users reported negative experiences in production.
“Before working with Judgment Labs, we honestly thought that human review was the only way to ensure quality. LLM judge approaches were historically infeasible due to quality issues. However, in the right places, we have seen the efficacy of Judgment’s targeted LLM-judge approaches.” — Head of AI, Acme Corp
Judgment Labs Approach
Judgment built automatic evaluators over agent trajectories for every metric Acme had been measuring by hand. Today, Judgment’s evaluators process hundreds of thousands of trajectories per week and enable reinforcement learning (RL) optimization of agent behavior.
Judgment Labs designed custom evaluators that measured trajectory efficiency across multiple dimensions. These evaluators were post-trained LLM judges equipped with rubrics covering questions such as the following (a sketch of one such judge appears after the list):
Was the shortest reasonable path to the answer taken?
Were tool calls made with the right parameters and without redundancy?
Were there loops in the agent’s trajectory that could have been avoided with a different tool call?
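To make this concrete, here is a minimal Python sketch of a rubric-equipped trajectory judge. The `judge_llm` callable, the rubric wording, and the JSON output schema are illustrative assumptions for the example, not Judgment’s production implementation.

```python
import json
from typing import Callable

# Rubric wording is illustrative, paraphrasing the dimensions listed above.
RUBRIC = """Score the agent trajectory on each dimension from 1 (poor) to 5 (ideal):
- path_efficiency: was the shortest reasonable path to the answer taken?
- tool_usage: were tool calls made with the right parameters, without redundancy?
- loop_avoidance: were avoidable loops absent from the trajectory?
Respond with JSON: {"path_efficiency": n, "tool_usage": n, "loop_avoidance": n, "rationale": "..."}"""

def judge_trajectory(trajectory: str, judge_llm: Callable[[str], str]) -> dict:
    """Score one serialized trajectory with a post-trained judge model.

    `judge_llm` is any prompt-in, text-out completion function; it is an
    assumption of this sketch, not a specific vendor API.
    """
    prompt = f"{RUBRIC}\n\nTrajectory:\n{trajectory}"
    return json.loads(judge_llm(prompt))
```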
The rubrics were induced from production feedback data: Judgment’s telemetry collected trajectories at scale and tagged them with user feedback such as “took super long” or “called too many tools”. In parallel, synthetic labels were generated for common trajectory groups to explain why successful trajectories used tools effectively or took the right actions. LLMs then clustered and filtered these failure modes and success qualities into rubrics that represented quality in production. Finally, Judgment engineers used post-training (Direct Preference Optimization (DPO) plus Supervised Fine-Tuning (SFT)) to align the judge models with user preferences on these dimensions.
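One way such preference data might be shaped for DPO is sketched below. All record fields are hypothetical stand-ins for Judgment’s internal schema; the two candidate “verdicts” are assumed to be judge responses (one scoring the trajectory high, one low) sampled from the base judge before post-training.

```python
# Feedback tags mirror the examples above; every field name is hypothetical.
NEGATIVE_TAGS = {"took super long", "called too many tools"}

def build_dpo_pairs(records: list[dict]) -> list[dict]:
    """Prefer the candidate verdict that agrees with real user feedback."""
    pairs = []
    for r in records:
        criticized = bool(set(r["feedback_tags"]) & NEGATIVE_TAGS)
        pairs.append({
            "prompt": r["judge_prompt"],  # rubric + serialized trajectory
            "chosen": r["low_verdict"] if criticized else r["high_verdict"],
            "rejected": r["high_verdict"] if criticized else r["low_verdict"],
        })
    return pairs
```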
Using these evaluators, Acme could automatically compare trajectories and assign quantitative reward signals based on length, efficiency, and correctness. The reward signals fed into a reinforcement learning loop that trained the agent to favor shorter, more efficient paths.
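As a rough illustration of how rubric scores and trajectory length might be collapsed into a scalar reward for that loop (the weights and normalization here are assumptions, not Acme’s actual reward function):

```python
# Blend judge scores (1-5 per dimension) with a tool-call length penalty.
def trajectory_reward(scores: dict, num_tool_calls: int, call_budget: int = 20) -> float:
    rubric_term = (scores["path_efficiency"]
                   + scores["tool_usage"]
                   + scores["loop_avoidance"] - 3) / 12.0  # map 3..15 -> 0..1
    length_penalty = min(num_tool_calls, call_budget) / call_budget
    return 0.8 * rubric_term - 0.2 * length_penalty
```

Keeping the length penalty as a separate term lets the weighting between judged quality and raw trajectory length be tuned independently.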
To strengthen robustness, Judgment also experimented with evaluator ensembles (LLM-as-jury) to provide more consistent judgments, minimizing variance across query types.
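A minimal sketch of such a jury, assuming each judge is a callable like `judge_trajectory` above; taking the per-dimension median is one simple way to damp variance across individual judges:

```python
import statistics
from typing import Callable

def jury_scores(trajectory: str, judges: list[Callable[[str], dict]]) -> dict:
    """Aggregate several judges' verdicts by per-dimension median."""
    verdicts = [judge(trajectory) for judge in judges]
    dims = ("path_efficiency", "tool_usage", "loop_avoidance")
    return {d: statistics.median(v[d] for v in verdicts) for d in dims}
```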
Outcomes
The reinforced agent demonstrated immediate and measurable gains:
40% average reduction in trajectory length among previously problematic query types, as measured by the average number of tool calls/spans.
Seconds shaved off query response times on over half of Acme’s compound queries (question ⇒ action, e.g. “find information on X and send it to Y”).
15% lower infrastructure costs (LLM tokens, tool API calls) by eliminating redundant tool calls and LLM actions.
With evaluators integrated into their monitoring and self-improvement pipeline, Acme’s engineers were free to focus on new features rather than endless trajectory reviews.
“Judgment gave us confidence to ship faster. We no longer waste cycles debating whether the agent is more efficient—it’s measured, reinforced, and improving on its own.” — Acme Engineering Lead
Today, Acme’s employees rely on the improved search agent daily. It delivers faster and more efficient retrieval across the company’s internal tools, supporting the broader transformation to an AI-native workplace.