Blog

Climbing the Hills That Matter

Sep 23, 2025

We cannot improve what we cannot measure. Most teams aren’t measuring what matters.

Evals have become the steering wheel that determines whether agent systems improve in the ways we want. In many cases, optimization is now less challenging than finding and measuring the signals that surface the behaviors of useful and high-quality agents. If AI engineers have reliable metrics to track, they will find ways to make those numbers move up and to the right.

However, despite the potential of evals, it doesn't feel like AI teams are deriving proper value from them yet. What explains the gap between expectation and reality for these important but nebulous metrics?

What are evals?

Evals are notoriously difficult to discuss because the term covers an increasingly broad umbrella of use cases and techniques. For the purposes of this blog, we subscribe to Hamel Husain’s definition of evals as the “systematic measurement of application quality”.

Eval infra has a diverse ecosystem of tooling for measurement (autograders, verifiers), data collection (logging, benchmarks), and more. At Judgment Labs, we focus on Agent Behavior Monitoring (ABM) tooling, enabling teams to judge and alert on agent behavior in production. Our primary concern with evals is how to reliably measure agent behavior from production data.

What is wrong with evals?

Despite their potential, we observe two issues with how evals are implemented in practice:

  1. Poor generalization of measurement methods.

Many teams and eval providers rely on an out-of-the-box LLM judge with a single prompt (e.g. for correctness or tone) and apply it everywhere. While the plug-and-play nature of this approach is attractive, it fails because agent quality is not uniform across use cases. A criterion like “correct” or “clear and concise” means something different in a legal brief than it does in code review.

For instance, an LLM judge checking correctness between generated and reference code solutions might be instructed to examine whether the code samples produce the same outputs, hit the same errors on edge cases, and have the same time/space complexity. A judge measuring correctness for legal briefs, however, may look for accurate citation of precedent, faithful representation of case details, and alignment with jurisdiction-specific standards.
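
To make the contrast concrete, here is a minimal sketch of what domain-specific judge instructions could look like. The rubric strings and the `build_judge_prompt` helper are illustrative assumptions, not a reference to any particular eval library.

```python
# Illustrative only: the same "correctness" metric needs different judge
# instructions per domain. These rubrics and the helper are hypothetical.

CODE_CORRECTNESS_RUBRIC = """Compare the candidate code against the reference solution.
Score 1 only if it (1) produces the same outputs on the provided tests,
(2) fails the same edge cases in the same way, and
(3) has equivalent time/space complexity. Otherwise score 0 and explain why."""

LEGAL_CORRECTNESS_RUBRIC = """Compare the candidate brief against the reference brief.
Score 1 only if it (1) cites precedent accurately, (2) represents the case
details faithfully, and (3) meets the standards of the stated jurisdiction.
Otherwise score 0 and explain why."""

def build_judge_prompt(rubric: str, candidate: str, reference: str) -> str:
    """Assemble a domain-specific LLM-judge prompt for one comparison."""
    return f"{rubric}\n\nCandidate:\n{candidate}\n\nReference:\n{reference}"
```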

This specificity runs deep. Even within a domain such as legal, the boundaries for factuality or taste differ depending on the task, the type of law (European v. American), and the type of case (putative class action v. product liability action).

The context, constraints, and acceptable trade-offs in evaluation methods change by data distribution (i.e. domain), so a one-size-fits-all judge often measures poorly and its scores quietly point you at the wrong hill. Customization over exactly what agent behaviors you check for and how you measure them is key, even if you want to measure the same metric as another company.

As Hamel puts it, “All you get from using these prefab evals is you don’t know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.”

  2. Evaluation methods and datasets are being grounded in vibes instead of production environments.

Even when teams accept that evals must be domain-specific, they often over-index on handcrafted rubrics fed into LLM judges. Because these rubrics are human-written, they are attractive and quick to create when experts are available, yet they frequently encode biased notions of quality that do not reflect what users actually value in production. And because rubrics are usually written once and left alone, they go stale as production data shifts and drift further from the behavior users care about.

The best teams continuously examine their own data and discover what truly matters in their agent’s setting. Scaled analysis of real-world examples of an agent interacting with its environment and users helps shape use-case-specific rubrics that reflect the quality indicators that matter most.

Beyond methods for scoring outputs, many teams lack a quick, clean way to select and maintain their eval sets and metrics as usage drifts across product releases. The process typically revolves around a one-time selection of examples and metrics that may remain unchanged for months. The result is that even teams tracking helpful metrics end up with blind spots in deployment, because distributional shift in user interactions inevitably erodes behavioral coverage. As Gian Segato from Anthropic explains in his latest essay:

“Engineers need to keep the eval dataset up to date with the distribution of actual user behavior…. Having a good system in place to constantly sample the right test cases from real-world past usage is one of the most critical new pieces of work required today to build great AI products.”
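
As one concrete, purely illustrative way to “constantly sample the right test cases from real-world past usage”, the sketch below resamples an eval set so that its task-type mix tracks recent production traffic. The `task_type` field and the proportional-allocation strategy are assumptions for the example, not a prescribed method.

```python
import random
from collections import Counter

def refresh_eval_set(recent_logs, n=200, seed=0):
    """Resample an eval set so its task-type mix tracks recent production usage.

    `recent_logs` is assumed to be a list of dicts with a 'task_type' key."""
    rng = random.Random(seed)
    counts = Counter(log["task_type"] for log in recent_logs)
    by_type = {}
    for log in recent_logs:
        by_type.setdefault(log["task_type"], []).append(log)

    total = sum(counts.values())
    eval_set = []
    for task_type, logs in by_type.items():
        # Allocate examples proportionally to each task type's share of traffic.
        k = max(1, round(n * counts[task_type] / total))
        eval_set.extend(rng.sample(logs, min(k, len(logs))))
    return eval_set
```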

Among other failure modes, the impact of these two mistakes is that teams end up finding and climbing the wrong hills. That leads to slower iteration cycles, false confidence in product quality, and ultimately, user churn. The key lesson is that success with users in production lives in the messy details of human interaction and in domain- and context-specific nuance, all of which keep shifting, and our evals should reflect that. We believe that agent behavior monitoring (ABM) is one of the ways to address these problems.

How do we fix this?

Defining success metrics, LLM-judge systems, and rubrics in advance assumes you know what to optimize for in the first place. It implies you can already define and quantify success. In AI products, you basically can’t do this by armchair pontification. You need to track and measure what’s happening with your agents and end users in production.

The source of truth lies in what we capture in production. High-signal behaviors emerge in the wild: how people react, what they correct, what they prefer. Every interaction is feedback. Every correction is a vote. Harnessing these signals to improve agent behavior is the clearest path to building robust AI agent products.

User preferences about agent behavior are king in evals. As argued in Building AI Products in the Probabilistic Era, evaluation and production performance do not live in separate lanes. They collapse into a single system where agent behavior shapes the entire funnel. In that world, we update what we measure from observed user behavior in production rather than inventing evals in advance. Interaction data is the map that shows where users go, whether they succeed, and what they value. Cursor notably executed on this by capturing user preferences in their recent tab-completion model work, improving code acceptance rates by 28%.

A paradigm shift like this doesn’t happen on its own. It requires robust infrastructure: a systematic way to capture dynamic, messy end user preferences — the only signal that matters — and transform them into scalable evals by surfacing the criteria users implicitly apply and converting those into stable, interpretable metrics of agent behavior.


Towards an Agent Behavior Monitoring (ABM) Layer

At heart, an ABM system tracks and mines user preferences at scale, then converts them into custom evaluation systems that give development teams signals to alert on and improve from.

If the interaction data embeds the evaluation, a practical question arises: how do we harness that data? Raw production logs and trajectories are noisy and overwhelming. To turn them into a compass for improvement, we need a systematic way of distilling user behavior into interpretable signals. Our pursuit of an agent behavior monitoring (ABM) layer consists of four infrastructure blocks:

[Diagram: the four infrastructure blocks of the ABM layer]

1) Trajectory capture

All ABM work begins with permissioned logs of how users actually interact with agents. Spanning reasoning, actions, and environment responses, these trajectories—paired with light context and outcomes—form the raw record of how agents and people behave, where they succeed, and where they experience friction.
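
As a rough illustration, a captured trajectory might be stored as a record like the one below. The `Step` and `Trajectory` types and their field names are hypothetical; real schemas will vary by agent.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Step:
    """One reasoning, tool-call, tool-result, or user turn in a trajectory."""
    role: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class Trajectory:
    """A permissioned record of one agent-user interaction."""
    trajectory_id: str
    steps: list[Step]
    context: dict[str, Any]               # light context: model version, task type, ...
    outcome: str                          # e.g. "resolved", "abandoned", "escalated"
    user_feedback: Optional[str] = None   # edits, ratings, corrections if present
```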

2) Bucketing and analysis

Trajectories only become useful once they’re organized and processed. By grouping interactions into buckets that map to real work, we expose patterns that matter. Simple metadata—task type, agent behavior issues, tool use stats, user satisfaction, and end-state outcome—provides the scaffolding. Cohorts of similar trajectories can surface where models consistently deliver and where they break down.
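
A minimal sketch of this bucketing, assuming trajectories are stored as plain dicts with hypothetical fields like `task_type`, `outcome`, and `user_satisfied`:

```python
from collections import defaultdict

def bucket_trajectories(trajectories):
    """Group trajectory dicts into cohorts by metadata that maps to real work."""
    buckets = defaultdict(list)
    for t in trajectories:
        buckets[(t.get("task_type"), t.get("outcome"))].append(t)
    return buckets

def cohort_summary(buckets):
    """Per-cohort counts and satisfaction rates, to spot where agents break down."""
    summary = {}
    for key, ts in buckets.items():
        rated = [t for t in ts if t.get("user_satisfied") is not None]
        sat = sum(bool(t["user_satisfied"]) for t in rated) / len(rated) if rated else None
        summary[key] = {"count": len(ts), "satisfaction_rate": sat}
    return summary
```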

3) Preference mining and rubric discovery

From these groups of trajectories, user preferences and feedback can be mined to derive behavior criteria. Approvals, edits, retries, and pairwise comparisons reveal implicit judgments. By mining these contrasts with LLM-driven analysis at scale, we can surface candidate dimensions of behavioral quality, then distill them into a small, stable set of rubrics with operational definitions that teams can validate and align on.
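
One way this mining step could look in code, assuming only a generic `llm` callable that maps a prompt to a completion (no specific provider API is implied), with pairs of rejected and accepted outputs as the contrast signal:

```python
MINING_PROMPT = """You are analyzing pairs of agent outputs where the user
rejected or edited the first version and accepted the second.
List the implicit criteria the user appears to be applying
(e.g. tone, citation accuracy, brevity), one criterion per line.

Pairs:
{pairs}
"""

def mine_candidate_criteria(pairs, llm):
    """Surface candidate rubric dimensions from user corrections.

    `pairs` is a list of (rejected, accepted) output strings; `llm` is any
    callable that maps a prompt string to a completion string."""
    rendered = "\n\n".join(f"Rejected:\n{a}\n\nAccepted:\n{b}" for a, b in pairs)
    response = llm(MINING_PROMPT.format(pairs=rendered))
    # Deduplicate while preserving order; humans still validate the final rubric.
    seen, criteria = set(), []
    for line in response.splitlines():
        c = line.strip("-*• ").lower()
        if c and c not in seen:
            seen.add(c)
            criteria.append(c)
    return criteria
```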

4) Scores and reward

Once dimensions are clear, judge models (possibly requiring alignment) can score agent trajectories, with rationales for transparency. Well-orchestrated scorers with embedded logic can run online and surface regressions in real time. And when a judge proves reliable, it can also be promoted into a reward model, slotting directly into post-training optimization workflows such as reinforcement learning or supervised fine-tuning. In practical terms, this looks like combining rewards and trajectories before piping them to post-training platforms and libraries like Fireworks, OpenAI, or Thinking Machines’ new Tinker library. This enables agents to improve directly on the dimensions users prefer in production.
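
A simplified sketch of the scoring-to-reward step, again assuming a generic `llm` callable and hypothetical rubric dimensions; the JSON response format and the weighting scheme are illustrative, not a description of any specific judge or training integration:

```python
import json

JUDGE_PROMPT = """Score the agent trajectory below on each rubric dimension
from 0 to 1, with a one-sentence rationale per dimension.
Respond as JSON: {{"scores": {{"<dimension>": <score>}}, "rationales": {{"<dimension>": "<text>"}}}}

Rubric dimensions: {dimensions}

Trajectory:
{trajectory}
"""

def score_trajectory(trajectory_text, dimensions, llm):
    """Run a judge over one trajectory; `llm` is any prompt -> text callable."""
    raw = llm(JUDGE_PROMPT.format(dimensions=", ".join(dimensions),
                                  trajectory=trajectory_text))
    return json.loads(raw)

def to_reward_records(trajectory_texts, dimensions, llm, weights=None):
    """Collapse per-dimension judge scores into one scalar reward per trajectory,
    yielding records a post-training pipeline could consume."""
    weights = weights or {d: 1.0 for d in dimensions}
    records = []
    for text in trajectory_texts:
        result = score_trajectory(text, dimensions, llm)
        reward = sum(weights[d] * result["scores"][d] for d in dimensions) / sum(weights.values())
        records.append({"trajectory": text, "reward": reward,
                        "rationales": result["rationales"]})
    return records
```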

Each of these steps requires customization to the nuances of different agents and end user categories. A proper agent behavior monitoring layer is tailored to the agent action and user preference data that runs through it.

What happens when we get this right?

Proper ABM grounds evaluations in the nitty-gritty nuances of production data. This grounding unlocks continuous improvements in product quality and higher ROI. Instead of relying on generic benchmarks or handcrafted, potentially biased criteria, agent companies can discover specialized evals rooted in the unique contexts and preferences of their own customers. Products begin to learn directly from users, which lends itself particularly well to post-PMF agent companies: customized evals are a tool to convert distribution advantages into product advantages, turning usage into a compounding source of product superiority. Evals actually become your moat.

This shift powers a new kind of flywheel: the post-building flywheel. Just as post-training makes a model useful, post-building makes an agent reliable. Post-training uses data to refine a model’s skills; post-building uses data to evaluate, monitor, and optimize assembled agents so those skills are applied consistently, safely, and effectively in practice.

The signals are already in your data. Find the right hill.

Written by Andrew Li and Alex Shan

Thank you to James Alcorn, Dakota McKenzie, and the Judgment Labs team for reviewing and debating ideas with us.

To read more, subscribe to our newsletter. Stay tuned for our next post: evals are the greatest moat of AI product advantage.

© 2025 Judgment, Inc.
