How Acme Corp Monitors Its Financial AI in Real Time
Background Context
Acme Corp launched an AI financial agent that helped customers navigate spreadsheets, analyze SEC filings and earnings reports, and interpret market data. The product had clear value but carried equally high risk: in finance, trust is fragile. A couple of hallucinated numbers or missing citations can cause customers to lose confidence and eventually churn.
To scale successfully, Acme needed a system that could ensure accuracy in real time, detect regressions before customers noticed, and continuously improve from how customers interacted with the agent.
Inefficient Status Quo
Before working with Judgment Labs, Acme had no systematic way to monitor the reliability of agent behavior. The only signals of failure came from frustrated customers filing complaints through Slack or email support. By the time engineers investigated, dissatisfaction had already spread. Additionally, engineers had little insight into how behavior patterns drifted over time, such as the query mix shifting from knowledge retrieval on SEC filings to workflow automation on spreadsheets.
This reactive loop meant that:
Issues with agent behavior were discovered late (days or weeks after deployment).
Silent regressions slipped by; e.g., prompt rewrites or provider-driven model updates that quietly degraded accuracy in ways testing didn’t cover.
Customer confidence eroded because failures appeared preventable.
The engineering team was operating in the dark about which workflows to optimize for.
“It was really hard to get ahead of production regressions or unexpected behaviors when we had no Sentry-style system over the non-deterministic nature of LLMs.” — Engineering Manager, Acme Corp
Judgment Labs Approach
Judgment started by grounding evaluation in Acme’s real customer signals, turning every interaction into a measurable data point from which to derive evaluation metrics and methods.
Mining End User Signals → Building Rubrics
Acme had rich production data: thumbs up/down ratings, open-ended feedback, inline edits, and spreadsheet corrections. Judgment engineers bootstrapped synthetic examples and reasoning traces at scale from this data.
By clustering and analyzing these signals with LLMs, we discovered which failure patterns correlated most with end user dissatisfaction and which workflows drove adoption (e.g., quarterly earnings summaries, EPS breakdowns). These insights were distilled into custom evaluation rubrics that reflected production usage patterns instead of abstract financial QA benchmarks.
This rubric-first process ensured that subsequent evaluators and monitoring systems were tuned to the exact failure modes that mattered most to Acme’s end users.
“Grounding evals in our own deployment data really helped us gain confidence that these metrics were telling the real story about agent quality.” — Senior AI Engineer, Acme Corp
Real-Time Agent Behavior Monitoring (ABM)
With rubrics in place, Judgment deployed a post-trained judge model built on Acme’s financial data and customer history. It was tuned with paired examples of agent outputs and corrected analyst drafts, focusing on the most business-critical failure modes:
Hallucinated values in spreadsheets (e.g., fabricated ratios).
Citation mismatches (claims not properly backed by linked filings, or links pointing to non-existent sources).
Coverage failures (skipping key sections of 10-Ks or earnings transcripts).
Every response was evaluated by Judgment in real time, producing a risk verdict and rationale tied directly to Acme’s accuracy rubric.
Regression Detection
The monitoring system logged and flagged regressions across production deployments, including:
A prompt update that led the agent to mis-handle EPS calculations.
A silent model swap by an inference provider that introduced subtle citation gaps.
A tool-use regression where the financial summarizer call was failing or being skipped, yielding incomplete analysis.
Instead of hearing about these regressions from customer complaints, Acme now received instant customized alerts and triage reports via Slack/email/Judgment’s platform.
Outcomes
Customer-visible errors leading to tickets and Slack complaints fell by 55% through real-time detection and intervention.
Regressions were caught instantly that would otherwise have lingered unnoticed for weeks or months.
Customer approval scores improved by 20% after 10 weeks, measured via in-product surveys and retention metrics.
“With Judgment, we went from reactive firefighting to proactive monitoring. We now know about failures before our customers do.” — Head of AI Product, Acme Corp
Conclusion
By starting with customer signal mining to build rubrics, and layering on real-time monitoring and regression detection, Acme kept its financial agent reliable in a high-risk domain. The company reduced churn, increased adoption, and strengthened customer trust while also saving dozens of analyst-hours spent triaging errors. Today, Acme runs hundreds of thousands of customer interactions per week through Judgment’s monitoring pipeline, ensuring their agents stay accurate, efficient, and trustworthy over time.