Beginner's Guide to AI Evals (Walkthrough)
How we discovered our AI still couldn't do basic math and built a system to catch failure modes
Quick note: I launched my first course on AI Product Management this summer, where we went deeper into building production AI systems. The next cohort kicks off this week, on Sept 2nd! We'll cover coding agents, evals, agent architectures, and the practical skills you need to ship AI products that actually work.
At this point, you’ve almost definitely heard about evals and why they matter for PMs building AI products. Peter Yang and I wanted to show what actually happens when you build evals from scratch. That’s why we built a customer support agent for On running shoes to demonstrate the complete process. We discovered a lot of LLM failure modes in the process - and how to catch them.
For example: A customer asked to change their delivery address 45 minutes after ordering. The bot responded: "Unfortunately, you've passed the 60 minute window." Our AI (the best LLM models available today) couldn't figure out that 45 minutes is less than 60 minutes.
Exactly why you need evals!
Watch the full session with Peter Yang
The Four Types of AI Evaluations
From helping companies deploy evals from development through production, I've found these four types cover everything:
1. Code-Based Evals
Binary pass/fail checks using string matching or presence detection.
Example: If you're United Airlines and someone asks "How do I book a flight with Delta?" you verify your bot doesn't provide Delta booking instructions. Basic brand protection through simple string checks.
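A code-based eval like this can be a few lines of deterministic code. Here's a minimal sketch of that brand-protection check; the blocklist and function name are illustrative, not from any real airline's system:

```python
# Minimal sketch of a code-based eval: a binary pass/fail string check.
# The blocked phrases below are illustrative examples, not a real blocklist.

BLOCKED_PHRASES = ["delta.com", "book with delta", "delta app"]

def passes_brand_check(response: str) -> bool:
    """Fail if the bot's response contains competitor booking instructions."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

print(passes_brand_check("You can book United flights at united.com."))  # True
print(passes_brand_check("To book with Delta, visit delta.com."))        # False
```

Because the check is pure string matching, it runs instantly on every response and never drifts, which is exactly what you want for hard guardrails.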
2. Human Evals
PMs or subject matter experts manually label interactions as good/bad/average.
The PM's job is to have judgment on what that end product experience should be. I've never found it useful to completely outsource human evals to contractors or labelers. The PM has to be in the spreadsheet doing this themselves.
At Cruise, we labeled self-driving car decisions almost every day: did the car do what it should have? The same principle applies here. Look at your data with your team and determine quality standards together.
3. LLM as Judge
Using an LLM to evaluate other LLM outputs at scale, acting as if a human is labeling the data.
What we discovered: LLM judges default to rating everything as "good."
Product Knowledge: 100% match with human labels
Tone: 20% match (only 1 out of 5 matched)
You need to provide explicit rules like "Don't rate as good if response is over 200 words" or the judge becomes useless.
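One way to make those rules concrete is to bake them into the judge's prompt as an explicit checklist. This is a hypothetical sketch of assembling such a prompt; the rule list and wording are illustrative:

```python
# Hypothetical sketch of an LLM-judge prompt with explicit, checkable rules
# instead of a vague "evaluate the tone" instruction. Rules are illustrative.

JUDGE_RULES = [
    "Rate 'bad' if the response is over 200 words.",
    "Rate 'bad' if the response does not acknowledge the user's emotion.",
    "Explain your rating in one sentence before giving the label.",
]

def build_judge_prompt(question: str, response: str) -> str:
    rules = "\n".join(f"- {rule}" for rule in JUDGE_RULES)
    return (
        "You are grading a support-bot response. Apply these rules strictly:\n"
        f"{rules}\n\n"
        f"Customer question: {question}\n"
        f"Bot response: {response}\n"
        "Output: your explanation, then a label of good or bad."
    )

prompt = build_judge_prompt("Where is my order?", "It shipped yesterday.")
print(prompt)
```

The prompt string then goes to whatever model you use as the judge; the point is that every criterion is stated as a rule the judge can actually fail a response on.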
4. User Evals / Business Metrics
Real user feedback (thumbs up/down) is sometimes the closest thing you have to a proxy business metric, but it is flawed as well.
The paradox: negative feedback often means the system worked correctly. User requests a refund after 30 days, bot correctly denies it per policy, user gives a thumbs down. The bot followed your rules, but the user dislikes your policy. What does that thumbs down actually mean?
The "Not Sexy" Spreadsheet Truth
The most sophisticated AI systems are evaluated in Google Sheets by PMs manually labeling data.
I built a customer support agent for On running shoes with Peter Yang to demonstrate the complete process. We started with Anthropic's Console, but the real work happened in a spreadsheet where we manually labeled responses across three criteria:
Product Knowledge: Does it know about the products?
Policy Compliance: Is it following the rules?
Tone: What's the tone of the response?
Spreadsheets are the ultimate product for evaluating LLMs. This stuff is not sexy. You're in a Google Sheet. But this is probably one of the most important things you'll get right with your team.
How to Actually Build Evaluations (The 9-Step Process)
Step 1: Generate Your Initial Eval Prompt
Rather than writing prompts from scratch, use this system.
Three required inputs:
User question (what the customer asks)
Product information (from company catalog)
Policy information (return guidelines, support policies)
Tip: You can ask an LLM like ChatGPT "get me policy info for On" to pull initial context. This is called using synthetic data.
Step 2: Build Your Golden Dataset
Definition of Golden Dataset: Data you want to use for evaluation. As a starting point, use three columns: Question, Answer, and Context. Grade each row as good/bad based on your criteria (e.g. did the agent answer the question? That would be correctness). This becomes the schema for your golden dataset.
Start with just 10 rows of data to get aligned on your evaluation criteria, then iterate.
For distribution: I recommend starting with a random sample (i.e. the mix of questions you expect to see in production) - this serves as a proxy for real-world traffic.
Then, down the line, you can build a dataset with importance sampling (i.e. specific failure modes), and also use this as a dataset for testing and benchmarking performance improvements. More on this later.
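The schema above can be sketched as a plain CSV, the same shape you'd keep in a Google Sheet. The field names and the sample row below are illustrative assumptions, not from the actual On dataset:

```python
# Minimal sketch of the golden-dataset schema: Question, Answer, Context,
# plus a human label per criterion. Field names and the row are illustrative.
import csv
import io

FIELDS = ["question", "answer", "context",
          "product_knowledge", "policy_compliance", "tone"]

rows = [
    {
        "question": "Can I change my delivery address 45 minutes after ordering?",
        "answer": "Yes, you're still within the 60-minute window.",
        "context": "Address changes allowed within 60 minutes of ordering.",
        "product_knowledge": "good",
        "policy_compliance": "good",
        "tone": "good",
    },
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Starting from a file like this makes it trivial to load the same 10 rows into any eval platform later, while humans keep labeling in the spreadsheet.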
Step 3: Label, Debate, and Discover Failures
Generate responses and start labeling. Document why things fail.
This is where we discovered the math failure. Customer asked about changing delivery address 45 minutes after ordering. Bot said they'd "passed the 60 minute window." LLMs are really bad at math.
Another edge case sparked debate: A customer lost their shoe box and the bot said "I'd recommend contacting our customer support team."
I pushed back: "Is losing the box covered in the policy or not? If not, why punt to another team? That just creates more work."
Peter thought it seemed reasonable. These debates matter. Document them.
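One practical response to the math failure is to stop asking the model to do arithmetic at all: compute the time comparison deterministically in code and hand the model the result. This is a sketch under that assumption; the 60-minute policy comes from the example above, but the function itself is illustrative:

```python
from datetime import datetime, timedelta

# Deterministic guard for the "45 < 60" failure: do the time comparison in
# code instead of trusting the LLM. The function name is illustrative.
ADDRESS_CHANGE_WINDOW = timedelta(minutes=60)

def within_change_window(order_time: datetime, request_time: datetime) -> bool:
    """True if the address-change request falls inside the policy window."""
    return request_time - order_time <= ADDRESS_CHANGE_WINDOW

order = datetime(2024, 9, 2, 10, 0)
request = order + timedelta(minutes=45)
print(within_change_window(order, request))  # True: 45 minutes is inside 60
```

The bot's prompt can then receive "the customer IS within the change window" as a fact, rather than two timestamps it might compare incorrectly.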
Step 4: Scale From 10 to 100 Examples
You're optimizing for two orthogonal dimensions:
Speed of iteration: How quickly you can test changes
Confidence in results: How certain you are about performance
From our experience:
POC/internal testing: Start with ~10 examples minimum
Production confidence: Need around 100 examples or more
The tradeoff: More data = more confidence but slower iteration
Step 5: Implement LLM as Judge (And Watch It Fail)
This is the moment where theory meets reality. You upload your golden dataset to Arize (or your platform of choice), copy your evaluation criteria into the LLM judge prompt, and hit run. Then you watch your AI evaluator rate everything as "good."
That's exactly what happened to us. 100% positive ratings. Every response—whether helpful or horrible—got a thumbs up. The LLM judge was essentially useless.
The fixes that actually worked:
First, we added explicit criteria. Instead of "evaluate the tone," we specified "penalize responses over 200 words" and "flag any response that doesn't acknowledge user emotion."
Second, we required explanations. The judge couldn't just say "good" or "bad"—it had to explain why. This revealed the judge's reasoning gaps. We discovered it was marking everything good because it found something positive in every response, no matter how small.
Third, and this is critical: we sampled 10% of all AI judgments for human review. Not just initially—continuously. Because here's what nobody tells you: match rates drift. Week one might show 80% alignment with humans. By week four, without updates, you're down to 65%.
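The 10% human-review sampling is simple to automate. Here's a minimal sketch; the record structure and fixed seed are illustrative choices, not from our actual pipeline:

```python
import random

# Sketch of sampling 10% of LLM-judge verdicts for weekly human review.
# The judgment record structure is an illustrative assumption.

def sample_for_review(judgments: list[dict],
                      fraction: float = 0.10,
                      seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of judgments for human spot-checks."""
    rng = random.Random(seed)  # fixed seed so the weekly sample is auditable
    k = max(1, round(len(judgments) * fraction))
    return rng.sample(judgments, k)

judgments = [{"id": i, "label": "good"} for i in range(100)]
review_batch = sample_for_review(judgments)
print(len(review_batch))  # 10
```

Running this weekly against fresh judgments, and comparing the human labels back to the judge's, is how you catch the match-rate drift described above before it quietly erodes your evals.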
The sobering reality from our session:
Even after all our improvements, our tone evaluations didn’t always match human judgement. Product knowledge was a 100% match. But tone—the thing that actually matters for customer experience—was nearly random.
LLM judges require constant calibration, explicit instructions, and human oversight. The judge needs as much work as the system it's evaluating.
Step 6: Launch With Controlled Testing
The temptation to go straight to production is real. You've built your agent, created your evals, everything looks good in testing. But this is where patience pays dividends.
The rollout strategy that actually protects you:
Start with internal dogfooding. Have your team use the system for their actual work. When we built the On shoes agent, we had our team use it to answer their own customer service questions first. The failures you find internally are failures you prevent externally.
What we discovered during dogfooding: the bot worked great for simple queries but completely fell apart with multi-part questions. "I want to return my shoes and order a different size" sent it into a loop of half-answers. We never would have caught this without internal testing.
Next comes the 1% test. Not 10%, not 5%—start with 1% of your traffic. This gives you enough data to spot problems without creating a customer service disaster. Monitor everything: eval scores, response times, escalation rates, and most importantly, user satisfaction.
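A common way to implement that 1% gate is a stable hash of the user ID, so the same user consistently stays in or out of the test group across sessions. This is an illustrative sketch, not how any particular platform does it:

```python
import hashlib

# Sketch of a percentage rollout gate using a stable hash of the user ID.
# Bucket math and names are illustrative assumptions.

def in_rollout(user_id: str, percent: float) -> bool:
    """Deterministically assign a user to the rollout if their bucket qualifies."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100          # percent=1.0 covers buckets 0..99

enrolled = sum(in_rollout(f"user-{i}", 1.0) for i in range(10_000))
print(enrolled)  # roughly 100 of 10,000 simulated users
```

Because assignment is deterministic, you can widen the rollout from 1% to 5% without reshuffling who is already in the test group.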
The surprising divergence:
Here's what surprised us: system performance and user satisfaction often move in opposite directions. Our eval scores improved when we made responses more accurate and policy-compliant. But user satisfaction dropped. Why? The responses became robotic and cold.
Conversely, when we made the bot more conversational and friendly, eval scores for policy compliance dropped but users were happier. The bot would say things like "I totally understand your frustration" even when the customer wasn't frustrated.
This divergence is actually valuable data. It shows you where your evals might be missing something important. In our case, we needed to add a new eval criterion: "appropriateness to user emotion." The system was optimizing for the metrics we gave it, but we hadn't given it the right metrics.
The gradual rollout based on learning, not timeline:
Resist the pressure to scale quickly. Each percentage increase in rollout should be based on what you've learned and fixed, not on a predetermined schedule. We stayed at 1% for two weeks because we kept finding edge cases. Moving to 5% only happened after we'd addressed the major issues.
The truth is, controlled testing feels slow but saves you from catastrophic failures. Every issue you catch at 1% is an issue that doesn't affect the other 99% of your users.
Step 7: Benchmark Your System
Once you have your evals running, you need to track performance over time. Otherwise you're flying blind. Most teams discover their "improvements" made things worse!
When building with Peter, we ran our On shoes agent through multiple iterations. Each change seemed like an improvement in isolation. But without benchmarking against our original dataset, we wouldn't have caught regressions. One prompt change to make responses more concise accidentally removed all the empathy from customer service responses.
The metrics that actually matter:
We tracked three core dimensions:
Response accuracy against our golden dataset (how often was it factually correct?)
Policy compliance (did it follow On's actual return policy?)
Response characteristics (length, tone, format consistency)
What surprised us: improving one metric often degraded another. Making responses more accurate made them longer. Making them more empathetic made them less compliant with policy. This is why you need comprehensive benchmarking.
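Benchmarking across those dimensions can be as simple as diffing a candidate's pass rates against a baseline and flagging any regression. The scores below are made-up numbers for illustration:

```python
# Sketch of regression checking between two prompt/model versions across the
# three dimensions above. All scores here are illustrative pass rates.

def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                tol: float = 0.02) -> list[str]:
    """Return dimensions where the candidate dropped more than `tol` below baseline."""
    return [dim for dim in baseline
            if candidate.get(dim, 0.0) < baseline[dim] - tol]

baseline  = {"accuracy": 0.90, "policy_compliance": 0.85, "tone": 0.70}
candidate = {"accuracy": 0.93, "policy_compliance": 0.86, "tone": 0.55}

print(regressions(baseline, candidate))  # ['tone']
```

This is how the concise-prompt change would have been caught: accuracy up, but tone cratering relative to the original dataset's baseline.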
Model switching revealed hidden assumptions:
When we compared GPT-4 to Claude on the same dataset, the results challenged our assumptions. GPT-4 was verbose but generally accurate. Claude was more concise but occasionally invented policies that didn't exist. The lesson? Always test model changes against your existing eval suite before deploying.
Version control extends beyond code:
We learned to tag everything: prompt versions, model changes, eval updates. Why? Three weeks into production, when customer satisfaction drops, you need to trace exactly what changed. Was it the prompt? The model? Or did your evaluator become more lenient? Without versioning, you're debugging in the dark.
The business connection everyone misses: eval scores need to correlate with actual business outcomes. If your accuracy improves but escalation rates increase, you're measuring the wrong thing. Track both in parallel.
Step 8: Identify New Failure Modes
Your AI will fail in unexpected ways. The 45 < 60 math error we discovered wasn't just amusing—it revealed a fundamental weakness in how LLMs process numerical comparisons.
Here's what actually happened: A customer asked to change their delivery address 45 minutes after ordering. The bot responded: "Unfortunately, you've passed the 60-minute window." This wasn't a typo or a one-off glitch. When we dug deeper, we found the model consistently failed at time-based comparisons. "Is 30 days within our 14-day return window?" Yes, according to the bot.
The shoe box debate that changed our approach:
Another revealing moment came when a customer said they lost their shoe box. The bot responded: "I'd recommend contacting our customer support team."
I pushed back during our review: "Is losing the box covered in the policy or not? If not, why punt to another team? That just creates more work."
Peter thought the response seemed reasonable—after all, it was being helpful. But this exposed a pattern: the bot was deferring to human support whenever it encountered ambiguity. It wasn't making decisions; it was avoiding them. We found this behavior in dozens of interactions—anytime the bot wasn't certain, it punted.
Systematic discovery beats random testing:
We developed a structured approach to finding failures:
First, we kept a failure log. Every weird interaction got documented, no matter how minor. The log revealed patterns we'd never have noticed from individual cases.
Second, we actively tested edge cases. What happens when someone asks about a product that doesn't exist? What if they mention a competitor? What if they're angry and use profanity? Each edge case revealed new failure modes.
Third, we grouped failures by type. When 40% of our failures involved time calculations, we knew we needed dedicated handling for temporal logic. When emotional responses consistently failed, we added emotion detection.
The patterns tell you everything: repeated failures in one area mean you need specialized handling. Random failures across different areas suggest your prompt needs refinement. Consistent misunderstandings about specific topics might indicate training data gaps.
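Grouping the failure log by type is the quickest way to surface those patterns. A minimal sketch, with an illustrative log rather than our real data:

```python
from collections import Counter

# Sketch of grouping a failure log by type to surface patterns.
# The log entries and type names are illustrative.
failure_log = [
    {"id": 1, "type": "time_math"},
    {"id": 2, "type": "time_math"},
    {"id": 3, "type": "punts_to_support"},
    {"id": 4, "type": "time_math"},
    {"id": 5, "type": "tone"},
]

counts = Counter(entry["type"] for entry in failure_log)
for failure_type, n in counts.most_common():
    print(f"{failure_type}: {n}")
# When one type dominates (here, time_math), it needs dedicated handling,
# e.g. a deterministic check outside the model.
```

Even a five-line tally like this tells you whether to fix the prompt broadly or build specialized handling for one failure class.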
Step 9: Use Data to Iterate on Your Evaluator
Your evaluator needs evaluation too. This meta-evaluation is where most teams get stuck, but it's where we had our biggest breakthrough.
When we first ran our LLM judge on the On shoes responses, something seemed off. Every single response was rated as "good." 100% positive ratings. Either we'd built the perfect customer service bot (we hadn't), or our evaluator was broken.
The evolution of our tone evaluator tells the whole story:
Our first attempt was embarrassingly simple: "Determine if the tone is appropriate." The LLM judge loved everything. It was like asking someone who only says yes to be a critic.
So we overcorrected. Version two included strict rules: "Mark as bad if over 100 words or missing an exclamation point." Suddenly 95% of responses failed. We'd gone from a cheerleader to a harsh critic who hated everything.
The breakthrough came when we added specific examples. Instead of describing what good tone meant, we showed it. Five examples of good responses, five examples of bad ones. Our match rate with human labels jumped from 20% to 60%.
The data that exposed our blind spots:
When we compared our LLM judge to human labels across different criteria:
Product Knowledge: 100% match with human labels
Policy Compliance: 90% match
Tone: Only 20% match initially
This mismatch was revealing. The LLM could easily verify facts (does this shoe exist?) and rules (is this within policy?). But subjective qualities like tone? That required much more work.
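The match rates above come from a simple agreement calculation between human labels and judge labels. Here's a sketch with illustrative labels (not our actual data):

```python
# Sketch of the human/LLM-judge match rate behind the 100% / 90% / 20%
# figures above. The label lists below are illustrative examples.

def match_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's label matches the human's."""
    assert len(human) == len(judge), "label lists must align one-to-one"
    agree = sum(h == j for h, j in zip(human, judge))
    return agree / len(human)

human_tone = ["bad", "good", "bad", "bad", "good"]
judge_tone = ["good", "good", "good", "good", "good"]
print(match_rate(human_tone, judge_tone))  # 0.4 (2 of 5 match)
```

Computing this per criterion, every week, is the number that tells you whether your judge is calibrated or drifting.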
The iteration cycle that actually works:
We developed a weekly rhythm. Every Friday, we'd pull 100 random evaluations and have humans label them. Then we'd compare. Week 1 showed 60% alignment. Week 2, after adding more examples, hit 72%. Week 3, with edge cases added, reached 78%. Week 4, with refined criteria, hit 85%.
But here's the catch: without continuous updates, that 85% degraded by about 5% per month. The model's behavior drifted, new types of queries emerged, and our evaluator slowly became less reliable.
The key learning that changed everything:
LLM judges need explicit negative examples. Without them, they default to positive ratings. You can't just describe what "bad" looks like—you need to show it. When we added examples of passive-aggressive responses, sarcastic replies, and overly formal language, our tone detection finally worked.
This is the uncomfortable truth: you'll spend more time tuning your evaluator than your actual product. But without this investment, you're flying blind. Every improvement to your evaluator makes every future product iteration more reliable.
The Messy Reality (And Why It's Normal)
People are always looking for a silver bullet but the reality is constant iteration.
You may need to run 10-20+ prompt iterations just to get started. Some teams run 50-100+ iterations before production. After launch, iteration continues forever.
Each iteration teaches you something. The 45 < 60 math error led us to add "please verify any numerical comparisons" to prompts. The "everything is good" judge problem led to explicit negative criteria.
This process is super messy. You go back and forth constantly trying to figure things out. This is what evals look like in practice.
What Successful Teams Do That Others Don't
The successful patterns:
PMs personally label data in spreadsheets
Teams debate edge cases vigorously ("I would push back and say...")
You can see more about that in this post: How AI PMs and AI Engineers collaborate on evals
Accept evaluation as a permanent discipline that you will need to keep revisiting
Document every weird failure and consider adding to your golden dataset
Start with human labels before any automation
Track match rates between human and AI judges
The failure patterns:
Outsourcing all labeling
Skipping human evaluation
Trusting LLM judges without verification
Expecting evaluation to end after launch
Ignoring the user feedback paradox
Looking for silver bullets
The Ultimate Implementation Test
Before launching any AI feature, can you answer:
"Would our CPO warn customers about this?" (Acknowledging limitations)
"Have we done the unsexy spreadsheet work?" (Ground truth labeling)
"Did we discover our math failures?" (Edge case testing)
"Has the team had healthy debates?" (Quality through disagreement)
"Is our LLM judge calibrated?" (Match rate validation)
"Can we interpret angry users correctly?" (Signal verification)
If yes to all: You're ready for the permanent discipline of AI evaluation.
Start Tomorrow Morning
Your first 30 minutes:
Open Google Sheets
Write three column headers: Product Knowledge | Policy Compliance | Tone
Generate one test example in Anthropic Console
Label it good/average/bad
Ask someone to label it independently
Debate the difference
Everything else builds from this foundation.
The difference between AI products that work and ones that fail isn't the model or the prompt. It's the spreadsheet and the willingness to argue about what "good" means.