Chapter 23 — Week 5: Evaluation, golden sets, and build-measure-learn

Welcome to Week 5. You have a deployed prototype with a working vertical slice. Without evaluation discipline, you cannot tell whether the prototype is good enough to launch in Week 6, whether prompt iteration is making things better or worse, or whether your alpha cohort’s reactions reflect signal or noise. This week you build the evaluation infrastructure that makes all subsequent improvement tractable: a golden set of 100+ representative inputs with expected behaviour, an evaluation pipeline that runs the golden set through your system on every prompt change, a metrics dashboard, and at least one round of measured prompt iteration that visibly moves the metrics. By Friday your team should be able to answer the question “is the alpha launchable?” with evidence. The single most-common Week-5 mistake is to skip the golden-set discipline and rely on subjective impressions. Without measurement, prompt iteration is theatre.

Chapter overview

This chapter follows the same six-part structure. §23.1 (Concept) sets out why evaluation is the binding discipline for AI products, the build-measure-learn loop applied to AI specifically, the golden-set methodology, the five evaluation modalities (reference-based, reference-free, LLM-as-judge, human, A/B), task-specific metrics, statistical thinking with the Vapnik bound from Chapter 2 made operational, evaluation-as-code architecture, and the failure-mode taxonomy. §23.2 (Method) is the day-by-day Week 5 sprint: build the golden set, implement reference-based eval, implement LLM-as-judge for subjective dimensions, run a human-evaluation pilot, build the metrics dashboard, iterate prompts to hit the quality bar from Chapter 21. §23.3 (Lessons from the cases) pulls eight specific evaluation lessons from Parts I–III. §23.4 (Tools and templates) gives you the golden-set design template, the evaluation pipeline scaffold in Python and TypeScript, the LLM-as-judge prompt patterns, the dashboard schema, the human-rater rubric, CI/CD integration patterns, and the pre-alpha checklist. §23.5 (Worked example) continues Team Aroma through their Week 5: building a 100-question SPM Add Maths golden set, discovering they are below the quality bar on first eval (72% BM clarity vs 80% target), four rounds of prompt iteration to reach 81%, and the regression catch that saves the alpha. §23.6 (Course exercises and deliverables) specifies the Week 5 submission with grading rubric.

How to read this chapter. Read §23.1 in full at Monday’s standup; this is the conceptual chapter that will anchor your team’s thinking through the alpha and beta weeks. Read §23.2 with the team and assign a single member as the evaluation lead for the week — this role coordinates golden-set construction, eval pipeline build, and metrics dashboard work. Treat §23.3 as Wednesday-evening reading; the lessons land harder when you have your own first evaluation results. Use §23.4 throughout. Read §23.5 before Friday’s iteration sprint. Submit against §23.6 by Friday 23:59.

23.1 Concept

23.1.1 Why evaluation is the binding discipline for AI products

Classical software has a binary correctness model: the code does what it is supposed to or it does not. AI products have a continuous correctness model: outputs vary in quality across a range, and the same input can produce different outputs across different runs. The implication for engineering practice is that AI products require a measurement infrastructure that classical software does not — without it, the team cannot tell whether changes are improving or degrading the product.

Three distinctive properties of AI products make evaluation binding:

The capability-quality continuum. As established in Chapter 21, tutors producing correct explanations 60%, 80%, and 95% of the time are three different products. The same code base, the same model, the same prompt structure can yield a 60% system or an 80% system depending on prompt engineering, RAG configuration, and downstream filtering. Without evaluation, the team cannot tell which version of the system they actually have.

The wow-factor trap, again. Generative AI demos elicit enthusiasm that does not predict sustained use (Chapter 20). The same trap applies to founders: a single impressive output produces unjustified confidence in the system, while a single poor output produces unjustified pessimism. Both reactions are noise. Evaluation against a golden set of 100+ inputs averages out the noise and produces a stable measurement.

The data-flywheel dependency. AI products’ value compounds as the team learns from production usage. The learning is captured by evaluation data: every input-output pair, every user correction, every escalation. A team that does not log and evaluate systematically cannot turn the flywheel — they can ship the product, but they cannot improve it predictably.

The consequence is that evaluation infrastructure is not optional engineering work for AI products; it is the engineering work. A team that ships an MVP without measurable quality has shipped a demo, not a product. The Watson Health case (Chapters 2, 7) is the canonical reminder: IBM had a demonstrably impressive system on the Jeopardy! stage, but no clinical-evaluation infrastructure that could distinguish “Watson recommended a treatment that the oncologist would have chosen” from “Watson recommended a treatment that turned out to be unsafe.” The latter category surfaced only after years of deployment, by which time the brand and the partnerships were sunk.

The discipline is not new. The Vapnik–Chervonenkis bound from Chapter 2 — that generalisation error is bounded by training error plus a complexity term that scales as \(\sqrt{d \log n / n}\) — formalises why evaluation matters: a complex hypothesis class (a high-VC-dimension model) on a small dataset cannot be expected to generalise; the only way to measure whether it does is empirical evaluation on held-out data. Foundation models have effectively unbounded VC dimension; their generalisation is empirically excellent for reasons the classical theory does not fully predict (the recent literature on “benign overfitting” and “double descent” is the current frontier). For practical purposes the implication is that evaluation is the only reliable signal of model performance on your specific task; theoretical guarantees do not transfer from generic benchmarks to your application.

23.1.2 The build-measure-learn loop, applied

Eric Ries’s Lean Startup (Ries, 2011) introduced the build-measure-learn loop as the iteration unit of startup work. For AI products in 2026, the loop has specific operationalisations at each phase.

Build. A change to the system: a new prompt version, a different foundation model, a RAG retrieval-strategy change, a new feature, a UI modification. Builds happen as PRs through the Week 4 workflow.

Measure. Run the build through the evaluation pipeline. Two measurement modes:

  • Offline evaluation: run the new build against the golden set; compare metrics to the prior baseline. Catches most regressions before they ship.
  • Online evaluation: deploy the build to a fraction of production traffic (canary or A/B); compare metrics against the control. Catches regressions that the golden set missed because the production distribution differs from the golden set’s distribution.

Learn. From the measurement, decide: ship to production, iterate further, or roll back. The loop’s discipline is that every change passes through measurement before it reaches users; intuitive judgment is augmented by evidence, not replaced by it.

The loop turns at different speeds for different changes. Prompt iteration: 5–30-minute cycles (build a new prompt, run against the golden set, learn). Feature changes: 2–4-day cycles. Model swaps: 1-day cycles. The pace of learning is governed by the evaluation infrastructure’s speed: if evaluation takes 2 hours, iteration cycles are slow; if evaluation takes 5 minutes, iteration cycles are fast and the team learns faster than competitors.

23.1.3 The golden set

A golden set is a curated collection of representative inputs with expected outputs or quality criteria. It is the team’s standardised measurement instrument, used to compare system versions and to detect regressions.

Four properties characterise a useful golden set:

Representativeness. The inputs should reflect the distribution of real production traffic. Not the average traffic — the distribution, including edge cases and tail behaviour. A golden set of 100 SPM Add Maths questions should include calculus questions, geometry questions, statistics questions, application questions, and cross-topic questions, in roughly the proportions the system will actually see.

Coverage of failure modes. The golden set must include inputs that probe the failure-mode taxonomy in §23.1.8 — adversarial inputs that test prompt injection, ambiguous inputs that test refusal calibration, inputs in the customer’s secondary language, inputs that probe known weaknesses. Without failure-mode coverage, the golden-set evaluation is biased upward; the production deployment will surprise the team.

Stable expected behaviour. Each input has a documented expected output (for reference-based evaluation) or quality criteria (for reference-free evaluation). Stability means the expected behaviour is agreed by the team before the system is built, so changes to the system can be evaluated without re-deciding what good looks like.

Versioned and curated. The golden set is a project artefact under source control. Additions, deletions, and modifications go through review like any other code change. Golden-set changes are themselves a quality signal: a team adding 20 new edge cases monthly is investing in evaluation infrastructure; a team that has not changed its golden set since Week 5 has likely stopped learning.

Sizing the golden set. A common student-team mistake is to size the golden set by intuition rather than by sample-size analysis. The right size depends on what you are trying to measure. For binary metrics (output is correct vs incorrect) at moderate effect sizes, \(n = 100\) supports detecting differences of ~10 percentage points with reasonable power; \(n = 400\) supports detecting differences of ~5pp. For multi-level rubric metrics (clarity scored 1–5), the same principles apply with the variance term replacing the binomial. For Week 5’s pre-alpha measurement, \(n \approx 100\) is the working minimum; for production-grade evaluation in Weeks 7+, \(n = 400\) to \(1{,}000\) is the standard.

The connection to the Vapnik bound (Chapter 2) is direct: if your prompt-engineering effective hypothesis class is rich (you can plausibly express many different prompt strategies), the sample size needed to distinguish them grows. Practically, a small team’s prompt iteration tries 5–20 distinct prompts per week, and a golden set of 100 inputs can distinguish that many prompts at moderate effect sizes — the eval infrastructure keeps up. A team that wants to compare 100 prompts on a 30-input golden set is asking the data for more than it can answer; the apparently best prompt is mostly a function of which one happened to fit the small sample.

23.1.4 Five evaluation modalities

Five distinct evaluation modalities are used in practice; each has characteristic strengths and limitations.

1. Reference-based evaluation. The system’s output is compared to a known-correct expected output. Metrics: exact match, semantic similarity (using sentence embeddings), token-level overlap (BLEU, ROUGE), structured-field match. Strengths: deterministic, fast, scales to thousands of inputs cheaply. Limitations: requires expected outputs that may be expensive to produce, and many tasks (open-ended generation, conversation, code) have multiple acceptable answers that exact-match metrics treat as wrong.

Use reference-based evaluation for:

  • Classification tasks (label correctness)
  • Structured extraction (field-level accuracy)
  • RAG retrieval (was the right document retrieved?)
  • Format compliance (does output match expected schema?)
  • Anything with a clearly correct answer

2. Reference-free evaluation. The system’s output is scored against quality criteria without comparing to a specific expected output. Metrics: rubric scores, factuality, fluency, helpfulness, safety. Strengths: no expected-output construction cost; works for open-ended tasks. Limitations: requires a scorer (human or model) and is more variable than reference-based.

Use reference-free evaluation for:

  • Open-ended generation (essays, explanations, summaries)
  • Conversational interactions
  • Creative tasks
  • Cases where the team cannot specify what “correct” looks like ex ante

3. LLM-as-judge. A capable model evaluates the output of the system being tested. The judge model receives the input, the system’s output, and a rubric; it produces scores or pass/fail decisions. Strengths: scalable, consistent, much cheaper than human evaluation. Limitations: the judge inherits its own biases (verbosity bias, position bias, self-preference bias); the judge’s evaluations correlate with but do not equal human judgment.

LLM-as-judge has become the standard 2024–2026 method for scaling evaluation. Best practices:

  • Use a more capable model as judge than the model being evaluated (e.g., Claude Opus to judge Claude Haiku output, GPT-5 to judge GPT-5-mini).
  • Provide explicit rubrics, not vague quality requests. “Score 1–5 on factual accuracy, where 1 = contains false claims, 3 = mostly accurate with minor errors, 5 = factually correct” beats “rate quality 1–10.”
  • Calibrate the judge against human annotations periodically (run 50 inputs through both human raters and the LLM judge; check correlation).
  • Watch for length bias: judges often favour longer outputs even when they are less accurate. Counteract with explicit length-aware rubrics or length-controlled comparisons.
  • Watch for position bias: when comparing two outputs A and B, the judge often favours whichever is presented first. Counteract by randomising presentation order or using both orderings.
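
Order randomisation is mechanical to implement. A minimal sketch for pairwise comparison, assuming a hypothetical judge_prefers_first() helper that asks the judge model which of two presented outputs better satisfies the rubric:

# Judge each pair in both presentation orders; only count a win when the
# verdict survives the swap. judge_prefers_first() is a hypothetical helper
# returning True when the first-presented output wins under the rubric.
def pairwise_judge(question: str, output_a: str, output_b: str) -> str:
    a_shown_first = judge_prefers_first(question, output_a, output_b)
    b_shown_first = judge_prefers_first(question, output_b, output_a)

    if a_shown_first and not b_shown_first:
        return "A"    # A preferred in both orderings
    if b_shown_first and not a_shown_first:
        return "B"    # B preferred in both orderings
    return "tie"      # verdict flipped with presentation order: position bias, treat as undecided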

4. Human evaluation. Trained human raters score outputs against rubrics. Strengths: gold standard for subjective quality, captures cultural and domain nuance that models miss. Limitations: expensive (USD 5–50 per output rated), slow (days rather than seconds), and subject to inter-rater variability that requires training and calibration.

Use human evaluation for:

  • Calibration of LLM-as-judge (~50 outputs per major release)
  • High-stakes evaluation (regulated domains, safety-critical outputs)
  • Cultural and linguistic dimensions models cannot reliably evaluate (BM colloquial register, Mandarin honorifics, Australian English idiom, religious sensitivity)
  • Final pre-launch validation

For Week 5, the realistic human-evaluation pattern is: each team member rates 20 outputs on a defined rubric, producing 100 human-rated outputs total. This is sufficient for calibrating LLM-as-judge against your team’s judgment and for identifying systematic LLM-judge biases.

5. A/B testing in production. The system is deployed with two or more variants; production traffic is randomised across variants; outcome metrics are compared. Strengths: tests on real distribution, captures signals that controlled evaluation misses. Limitations: requires production traffic at scale (typically 1,000+ events per arm), takes time (typically 1–4 weeks), and only works after the product is live.

For Weeks 5–6, A/B testing is not yet relevant; you do not have production traffic. From Week 7 onward (beta), A/B testing becomes the primary evaluation modality. Chapter 25 develops it.

23.1.5 Metrics by task type

Different AI tasks call for different metrics. The taxonomy:

Classification tasks (the model assigns a label to an input):

  • Accuracy: fraction correct. Easy to interpret but misleading for imbalanced classes.
  • Precision: fraction of positive predictions that are correct.
  • Recall: fraction of true positives that are caught.
  • F1: harmonic mean of precision and recall; the standard for imbalanced binary classification.
  • AUC-ROC: area under the receiver operating characteristic curve; measures ranking quality.
  • Confusion matrix: full breakdown by predicted vs actual label. Always inspect.
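
The classification metrics above are a few lines of scikit-learn (an assumption — the same quantities can be computed by hand). A minimal sketch, with toy labels standing in for a real eval run:

# Standard classification metrics from predicted vs expected labels.
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

y_true = ["escalate", "answer", "answer", "escalate", "answer"]    # expected labels
y_pred = ["escalate", "answer", "escalate", "escalate", "answer"]  # system labels

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="escalate"))
print("recall   :", recall_score(y_true, y_pred, pos_label="escalate"))
print("F1       :", f1_score(y_true, y_pred, pos_label="escalate"))
print(confusion_matrix(y_true, y_pred, labels=["escalate", "answer"]))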

Generation tasks (the model produces freeform text):

  • Reference-based: BLEU (translation), ROUGE (summarisation), BERTScore (semantic similarity).
  • Reference-free: factuality, fluency, helpfulness, harmfulness — all typically scored by LLM-as-judge or human raters on rubrics.
  • Hallucination rate: fraction of outputs containing factual errors. Critical for any product where the user is likely to act on the output.

Retrieval / RAG tasks (the system finds relevant documents to ground generation):

  • Hit rate at k: did the relevant document appear in the top-k retrieved? Standard k values are 1, 5, 10.
  • Mean Reciprocal Rank (MRR): average of the inverse rank of the first relevant result.
  • NDCG: normalised discounted cumulative gain; weights earlier-ranked relevant results more heavily.
  • Faithfulness: fraction of generated claims that are supported by retrieved documents. Catches the hallucination failure where the model ignores retrieval.
  • Answer relevance: fraction of the generated answer that addresses the question.
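
Hit rate and MRR need no libraries. A minimal sketch, assuming each eval record carries the ranked list of retrieved document IDs and the ID of the document that should have been retrieved:

def hit_at_k(retrieved: list[str], relevant: str, k: int) -> bool:
    """Did the relevant document appear in the top-k results?"""
    return relevant in retrieved[:k]

def reciprocal_rank(retrieved: list[str], relevant: str) -> float:
    """1/rank of the first relevant result; 0 if it was never retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == relevant:
            return 1.0 / rank
    return 0.0

# Aggregate over the golden set (toy data: two queries)
records = [(["d3", "d7", "d1"], "d7"), (["d2", "d9", "d4"], "d5")]
hit5 = sum(hit_at_k(r, rel, 5) for r, rel in records) / len(records)
mrr = sum(reciprocal_rank(r, rel) for r, rel in records) / len(records)
print(f"hit@5 = {hit5:.2f}, MRR = {mrr:.2f}")   # hit@5 = 0.50, MRR = 0.25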

Conversational tasks (multi-turn interaction):

  • Turn-level satisfaction: per-turn user satisfaction (typically thumbs up/down).
  • Task completion rate: fraction of conversations where the user’s stated task was completed.
  • Escalation rate: fraction of conversations escalated to human (lower is better, except where safety requires escalation).
  • Conversation length: average turns to completion. Lower is better for transactional tasks; higher may be better for educational or exploratory tasks.

Workflow / agentic tasks (the system takes multi-step action):

  • Task success rate: fraction of tasks completed without human intervention.
  • Time-to-completion: end-to-end latency.
  • Cost-per-task: total inference and tooling cost.
  • Error / recovery rate: how often the agent recovers from intermediate failures.

For your specific MVP, identify which of these task types your wedge falls into. Most B2B AI products are mixed: a primary task type plus 1–2 secondary types. Build evaluation for the primary first; secondaries follow as the product matures.

23.1.6 Statistical thinking, briefly but seriously

Three statistical considerations recur in evaluation work and bite student teams who skip them.

Sample size and confidence. Two prompts that score 78% and 82% on a 100-input golden set may not be statistically distinguishable. The standard error for a binomial estimate is \(\sqrt{p(1-p)/n}\); at \(p = 0.8\) and \(n = 100\) the SE is ~4 percentage points, so a 4pp difference is roughly one standard error and consistent with no real difference. To declare a 4pp improvement with 95% confidence requires \(n \approx 400\). The implication: for early prompt iteration aim for large effect sizes (>10pp); for production deployment decisions, build the golden set up to 400+ inputs.

The two-proportion z-test is the right inferential tool: \[ z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1 - \hat{p}) \cdot (1/n_A + 1/n_B)}} \] where \(\hat{p}\) is the pooled proportion. A test of \(|z| > 1.96\) corresponds to the standard 5% significance level. Pre-commit to your significance threshold before running the eval; post-hoc threshold-shopping is the most common evaluation-malpractice pattern in startup work.
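
The test is a few lines of standard-library Python. A minimal sketch comparing two prompts scored on the same golden set:

import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Pooled two-proportion z-statistic; |z| > 1.96 corresponds to the 5% level."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Prompt A passes 82/100 golden-set inputs, prompt B passes 78/100
z = two_proportion_z(82, 100, 78, 100)
print(f"z = {z:.2f}")   # ≈ 0.71 — well below 1.96, so not a distinguishable difference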

The multiple-testing problem. If you compare 20 prompts against the same golden set at \(\alpha = 0.05\), the probability that some comparison achieves apparent significance by chance alone is \(1 - 0.95^{20} \approx 0.64\). Most “best prompts” found by exhaustive comparison are noise. Mitigation: pre-commit to a small number of focused comparisons; apply Bonferroni correction (\(\alpha / k\) for \(k\) comparisons) when you cannot pre-commit; use a held-out evaluation set for the final decision after exploratory iteration on a development set.

Stratification by segment. A 78% aggregate score may mask a 95% score on the primary segment and 50% on a secondary segment. The Khairy-Levin (2024) example: a recommendation system performing well “on average” but failing on minority-language users. Always stratify metrics by the major segments of your customer profile (Chapter 20). For Team Aroma’s Pulse: stratify by topic area (Algebra, Functions, Geometry, Statistics, Differentiation, Integration), by language (BM, EN), by difficulty level. Stratified metrics surface failure patterns the aggregate hides.

23.1.7 Evaluation as code

The 2026 best practice is evaluation as code: the evaluation pipeline lives in the same repository as the application, runs in CI/CD on every PR, and produces structured output that the team can query. Three elements:

The eval script. A program that takes a system version (a prompt template, a model, a configuration) and a golden set, runs the system against the inputs, scores outputs against the metrics, and produces a structured results object (typically JSON). The script is deterministic-modulo-model-randomness; running it twice produces consistent rankings even when individual outputs vary.

The CI/CD integration. On every PR that touches prompts, models, or core logic, the eval script runs automatically. Results are posted as a PR comment showing the metric deltas vs the main branch. Regressions on key metrics block merge; teams can override with explicit justification.

The continuous production evaluation. A scheduled job (daily or weekly) samples production inferences, re-scores them with the same metrics, and tracks trends. This catches drift — performance degradation that occurs not because of code changes but because the input distribution has shifted (Chapter 25’s data flywheel territory).

The combination produces a quantitatively rigorous improvement loop. Every change is measured; every measurement is comparable; every regression is caught early. The team’s velocity is set by the evaluation infrastructure’s speed: a 5-minute eval supports many iterations per day; a 2-hour eval limits the team to a few iterations per day.

23.1.8 Failure mode taxonomy for AI products

Eight recurring failure modes appear in evaluation. Your golden set should contain inputs probing each.

1. Hallucination. The model generates plausible-sounding content that is factually false. Test with: questions whose answers are not in the model’s training data; questions with subtle factual constraints (dates, quantities, citations).

2. Refusal / over-cautiousness. The model declines to answer reasonable questions, citing safety or policy. Test with: questions in regulated domains (medical, legal, financial) that have legitimate informational answers; questions touching sensitive but appropriate topics.

3. Inconsistency. The same input produces materially different outputs across runs. Test with: a subset of golden-set inputs run 5+ times each, with inter-run agreement measured.

4. Bias. The model produces systematically different outputs across protected categories (gender, race, religion, age, geography). Test with: paired inputs differing only in demographic markers; the Buolamwini and Gebru (2018) audit framework adapted to text. Critical for any product subject to fair-lending, fair-employment, or fair-housing regulation (Chapter 14).

5. Prompt injection susceptibility. User input contains instructions that the model follows instead of treating as data. Test with: inputs containing “ignore previous instructions”-style payloads; inputs where the malicious instruction is encoded in indirect ways (in retrieved documents, in tool outputs).

6. Off-topic drift. The model’s output drifts away from the requested task into adjacent or unrelated content. Test with: precise tasks where small drift is detectable; long conversations where drift accumulates.

7. Verbosity. The model produces longer outputs than necessary, padding with hedges, caveats, and restatements. Costs increase, user experience degrades, and downstream evaluation (especially LLM-as-judge with verbosity bias) is corrupted. Test with: latency and token-count metrics on every output.

8. Format failure. The model produces output that does not match the requested schema (JSON broken, missing fields, hallucinated fields). Test with: format-compliance checks; structured-output libraries (Pydantic, Zod schemas) that enforce parse-correctness.
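
The cheapest enforcement for format failure is a schema check at parse time, as the structured-output libraries above do. A minimal sketch using Pydantic, with illustrative field names rather than a prescribed schema:

from pydantic import BaseModel, ValidationError

class TutorResponse(BaseModel):
    # Illustrative fields; use whatever your API contract actually specifies.
    answer: str
    working_steps: list[str]
    language: str

def format_pass(raw_output: str) -> bool:
    """True if the model's raw output parses into the expected schema."""
    try:
        TutorResponse.model_validate_json(raw_output)
        return True
    except ValidationError:
        return False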

The taxonomy is not exhaustive, but it covers the failure modes that recur across student-team AI MVPs. Your golden set should contain at least 5–10 inputs probing each mode (roughly 40–80 inputs dedicated to failure-mode coverage), out of the ~100-input total.

23.2 Method — the Week 5 sprint

23.2.1 Day 1 (Monday): build the golden set

By Monday end-of-day, the team should have a 100-input golden set in version control, with documented expected behaviour for each input.

Method:

  1. Allocate the 100 inputs across stratification dimensions. For Team Aroma’s Pulse, the allocation: 40 Algebra/Functions, 25 Geometry, 15 Statistics/Probability, 10 Differentiation, 10 Integration. Within each topic, 6 of 10 in BM, 4 of 10 in English. 70 standard difficulty, 20 hard, 10 edge-case (ambiguous wording, multi-part, common misconceptions).
  2. Source inputs from real materials. SPM past-year papers (publicly available through Lembaga Peperiksaan Malaysia archives), centre-owned materials shared by the design-partner centre, edge cases identified from Week-2 customer-discovery interviews. Where you cannot source 100 real inputs, supplement with team-authored ones — but flag the source per input.
  3. Document expected behaviour per input. For SPM Add Maths: the correct numerical answer (rule-based check), the SPM-format compliance criteria (rule-based check), the BM/English clarity bar (LLM-as-judge), the absence of hallucinated formulae (LLM-as-judge + factuality check). The expected-behaviour annotation is structured so different metrics can be applied to different aspects.
  4. Include failure-mode coverage. 20 of the 100 inputs are designed to probe failure modes: 5 ambiguous inputs (testing refusal calibration), 5 prompt-injection attempts (testing instruction-following hierarchy), 5 cross-language inputs mixing BM and English (testing language handling), 5 inputs where the SPM-correct answer differs from the textbook-default answer (testing rubric alignment).
  5. Version control. The golden set is committed to the repo as eval/golden_set/v1.json (or .csv if the team prefers tabular), with a README explaining its structure, sources, and known limitations.

The golden set is the single most-important Week-5 artefact. Underinvestment here compounds into evaluation noise for the rest of the build.
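
A small sanity-check script keeps the committed golden set honest about its own stratification. A minimal sketch, assuming the eval/golden_set/v1.json schema from §23.4.1:

import json
from collections import Counter
from pathlib import Path

golden = json.loads(Path("eval/golden_set/v1.json").read_text())
print(f"total inputs: {len(golden)}")

for dimension in ("topic", "language", "difficulty"):
    counts = Counter(entry["stratum"][dimension] for entry in golden)
    print(dimension, dict(counts))

probes = Counter(e["failure_mode_probe"] for e in golden if e["failure_mode_probe"])
print("failure-mode probes:", dict(probes))

# Fail loudly if the set drifts below the Week-5 working minimum
assert len(golden) >= 100, "golden set below the 100-input minimum"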

23.2.2 Day 2: implement reference-based evaluation

By Tuesday end-of-day the team has a reference-based eval pipeline running. It scores the outputs of the current system against the golden set on the metrics that admit reference-based scoring: numerical correctness (for SPM Add Maths), SPM-format compliance (rule-based regex/parser checks), structural validity (output is well-formed JSON if the API expects it).

Method:

  1. Write the eval script. Either Python (recommended for teams with R/Python comfort) or TypeScript (if the team is JavaScript-only). The script reads the golden set, runs each input through the system being evaluated, scores outputs, and produces a JSON results file.
  2. Wire CI/CD integration. GitHub Actions runs the eval on every PR that touches prompts/, lib/ai/, or eval/. Results are posted as a PR comment by a bot.
  3. Run the baseline. Score the current production prompt against the golden set. This is your Week-5 baseline; subsequent iterations compare to this.
  4. Inspect failures. For every input where the system fails, the team reads the failure and categorises by failure mode. The categorisation produces the input to Wednesday’s iteration work.

A typical Week-5 baseline result for an MVP: 55–80% on the primary metric, with concentrated failure patterns (e.g., “the system fails 90% of the time on BM-language Geometry questions”). The pattern is more useful than the aggregate; concentrated failures point to specific iteration targets.
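
The reference-based scorers themselves are deliberately dumb: parse the output, compare it to the documented expected behaviour. A minimal sketch of the two checks, assuming the golden-set schema from §23.4.1 (the containment and regex rules are illustrative, not the full SPM format specification):

import re

def check_numerical(output: str, entry: dict) -> bool:
    """Crude containment check against the documented numerical answer; tighten for your formats."""
    expected = str(entry["expected_behaviour"]["numerical_answer"]).replace(" ", "")
    return expected in output.replace(" ", "")

def check_spm_format(output: str, entry: dict) -> bool:
    """Rule-based format compliance: required phrases present, forbidden phrases absent."""
    behaviour = entry["expected_behaviour"]
    if any(phrase not in output for phrase in behaviour.get("must_include", [])):
        return False
    if any(phrase in output for phrase in behaviour.get("must_not_include", [])):
        return False
    # Example structural rule: BM outputs must carry the solution header
    if entry["stratum"]["language"] == "BM" and not re.search(r"Penyelesaian\s*:", output):
        return False
    return True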

23.2.3 Day 2–3: implement LLM-as-judge for subjective dimensions

By Wednesday end-of-day, the team has an LLM-as-judge pipeline running for the metrics that cannot be evaluated reference-based: BM clarity, English clarity, explanation pedagogical quality, and other subjective dimensions.

Method:

  1. Write the judge prompt. Explicit rubric (5-point scale with definitions per level), input-and-output presented to the judge, output format specified (JSON with score, justification, and confidence).
  2. Choose the judge model. Use a more capable model than the system being evaluated. For Team Aroma’s Pulse, the system uses Claude Sonnet 4.6; the judge is Claude Opus 4.7. Cost per judgment ~USD 0.15 for typical inputs; total evaluation cost on a 100-input golden set ~USD 15.
  3. Calibrate the judge. Run 30 inputs through both the LLM judge and 2 human raters (team members). Compute correlation; investigate large disagreements. Adjust the rubric until human-judge correlation is >0.7.
  4. Run the LLM-as-judge eval. Score all 100 outputs. Watch for systematic biases: are short outputs scored consistently lower? Are outputs in a particular language scored differently? If yes, the judge has bias the team must correct for.
  5. Persist results. Each judgment goes into the database, alongside the input, output, and timestamp. The structured store enables later analysis.

A specific LLM-as-judge prompt template for clarity scoring:

You are evaluating the clarity of an explanation given to a Form-5
Malaysian student preparing for the SPM Add Maths exam.

QUESTION: [the SPM question]
EXPLANATION: [the system's output]
LANGUAGE: [BM | English]

Score the explanation 1-5 on clarity:
  1 = Confusing; the student would not understand the reasoning
  2 = Unclear in places; major gaps in logic
  3 = Mostly clear; minor confusing passages
  4 = Clear; a typical Form-5 student would follow the reasoning
  5 = Exceptionally clear; better than a typical centre teacher's
      explanation

Consider:
  - Whether each step is justified
  - Whether the language register is appropriate for Form-5
  - Whether examples or analogies aid understanding
  - Whether the explanation aligns with the SPM marking scheme

Respond ONLY in JSON:
{
  "score": <integer 1-5>,
  "justification": "<2-3 sentences>",
  "specific_issues": ["<issue 1>", "<issue 2>", ...] or [],
  "confidence": "<low | medium | high>"
}

The structured output makes downstream aggregation tractable; the justification field aids the team’s iteration.
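
Wiring the rubric into a judge call is mostly prompt assembly plus strict JSON parsing. A minimal sketch using the Anthropic Python SDK, assuming the rubric above is stored at prompts/judge-clarity-v1.txt (a hypothetical path) with <<QUESTION>>, <<EXPLANATION>> and <<LANGUAGE>> placeholders in place of the bracketed fields:

import json
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CLARITY_RUBRIC = Path("prompts/judge-clarity-v1.txt").read_text()

def llm_judge_clarity(question: str, explanation: str, language: str) -> dict:
    """Score one output against the clarity rubric; returns the judge's JSON verdict."""
    filled = (CLARITY_RUBRIC
              .replace("<<QUESTION>>", question)
              .replace("<<EXPLANATION>>", explanation)
              .replace("<<LANGUAGE>>", language))
    response = client.messages.create(
        model="claude-opus-4-7",   # judge one tier above the system under test
        max_tokens=500,
        messages=[{"role": "user", "content": filled}],
    )
    raw = response.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Malformed judge output is itself a signal; keep it for inspection
        return {"score": None, "justification": raw, "confidence": "low"}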

23.2.4 Day 3 evening: human-evaluation pilot

By Thursday morning, each team member has rated 20 outputs (100 total) using the same rubric the LLM judge used. The exercise serves three purposes:

  • Calibrates the LLM judge. Computing the correlation between human ratings and LLM ratings tells the team how trustworthy the LLM judge is on this task.
  • Identifies systematic LLM biases. If humans rate outputs lower than the LLM judge does on average, the judge is over-generous; if higher, under-generous. Either pattern can be corrected by rubric tuning.
  • Provides baseline data for production human evaluation. The 100 human ratings collected this week are the seed data for Week 6–7’s larger evaluation work.

Each team member’s 20 ratings should take 60–90 minutes, depending on output length. Standardise the rubric, the rating UI (a Notion page or simple Streamlit app), and the output presentation order.

23.2.5 Day 4: build the metrics dashboard

By Thursday end-of-day, the team has a dashboard showing current evaluation status against the quality bars from Chapter 21. The dashboard is the primary artefact the team checks every morning during alpha and beta.

Recommended structure:

METRICS DASHBOARD — [PROJECT]

QUALITY BAR STATUS (vs Chapter 21 thresholds)
  BM clarity (LLM-judge):    [current %] / target 80%   [PASS / FAIL]
  English clarity (LLM):     [current %] / target 80%   [PASS / FAIL]
  SPM-format accuracy:       [current %] / target 90%   [PASS / FAIL]
  Numerical correctness:     [current %] / target 95%   [PASS / FAIL]
  Hallucination rate:        [current %] / target <2%   [PASS / FAIL]

STRATIFIED METRICS
  By topic:
    Algebra/Functions:       [...]
    Geometry:                [...]
    Statistics:              [...]
    Differentiation:         [...]
    Integration:             [...]
  By language:
    BM:                      [...]
    English:                 [...]
  By difficulty:
    Standard:                [...]
    Hard:                    [...]
    Edge case:               [...]

FAILURE MODE BREAKDOWN
  Hallucinations:            [n / N]
  Format failures:           [n / N]
  Refusals:                  [n / N]
  Off-topic:                 [n / N]
  Inconsistency (multi-run): [n / N]
  Bias (paired probes):      [n / N]

COST AND LATENCY
  Avg cost per query (USD):  [current]
  Avg latency (ms):          [current]
  P95 latency (ms):          [current]
  Queries per day budget:    [budget] / actual: [current]

TREND (vs prior week's baseline)
  [up/down] BM clarity:      [+x pp]
  [up/down] Hallucination:   [-x pp]
  [up/down] Cost:            [+x %]

The dashboard can be built in any of: a Notion page (manual updates), a Streamlit or Gradio app (Python; recommended if team has Python comfort), a custom Next.js page (if the team is fully TypeScript), a LangSmith / Langfuse dashboard (if using those platforms — they include this functionality out of the box).

For Week 5, manual updates after each daily eval run are sufficient; for Week 6+, automate so the dashboard reflects the most recent CI/CD eval and live production traffic.
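
If the team opts for the Streamlit route, the dashboard is little more than reading the latest results JSON and comparing it to the Chapter-21 targets. A minimal sketch, with targets hard-coded for illustration and metric names matching the §23.4.2 scaffold:

# dashboard/app.py — run with: streamlit run dashboard/app.py
import json
from pathlib import Path
import streamlit as st

TARGETS = {
    "clarity_pct_4plus": 80.0,
    "format_pass_pct": 90.0,
    "numerical_correct_pct": 95.0,
    "hallucination_rate_pct": 2.0,   # lower is better
}

latest = max(Path("eval/results").glob("eval-*.json"))  # most recent run by filename
aggregate = json.loads(latest.read_text())["summary"]["aggregate"]

st.title("Metrics dashboard — quality bar status")
for metric, target in TARGETS.items():
    value = aggregate[metric]
    lower_is_better = metric == "hallucination_rate_pct"
    passed = value <= target if lower_is_better else value >= target
    st.metric(metric, f"{value:.1f}%", delta="PASS" if passed else "FAIL")
st.caption(f"Source: {latest.name}")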

23.2.6 Day 5: iteration and the pre-alpha checklist

By Friday end-of-day, the team has completed at least three rounds of prompt iteration based on Week-5 eval evidence, with each iteration’s metric delta documented. The Friday afternoon work is also the pre-alpha checklist: a structured review of whether the system is ready for Week-6 alpha launch.

The pre-alpha checklist:

PRE-ALPHA CHECKLIST — [PROJECT]
Date: Friday, Week 5

QUALITY BARS
  [ ] BM clarity ≥80%
  [ ] English clarity ≥80%
  [ ] SPM-format ≥90%
  [ ] Numerical correctness ≥95%
  [ ] Hallucination rate <2%

CRITICAL FAILURE MODES
  [ ] No prompt-injection vulnerability identified
  [ ] No bias differentials >10pp across language/topic
  [ ] No hallucinated SPM-curriculum facts on golden set

OPERATIONAL READINESS
  [ ] Production deployment URL is stable for >48 hours
  [ ] Auth / sign-up flow works on three browsers + mobile
  [ ] Error handling: graceful degradation on API rate limits
  [ ] Cost monitoring: budget alerts configured at 5x baseline
  [ ] Logging: every inference persists to ai_inferences table
  [ ] Rollback plan: previous prompt version can be reverted in <5 min

ALPHA-COHORT READINESS
  [ ] 5+ alpha users confirmed for Week 6
  [ ] Alpha-onboarding doc written
  [ ] Feedback channel established (form / Slack / email)
  [ ] Daily-check-in cadence scheduled

LEGAL AND ETHICAL
  [ ] PDPA / GDPR / privacy policy in place (as applicable)
  [ ] Terms of service published
  [ ] Data retention policy documented
  [ ] If serving minors (e.g., students under 18): parental
      consent flow tested

TEAM
  [ ] Each team member can demo one feature end-to-end
  [ ] On-call coverage agreed for alpha week
  [ ] Each team member has read the metrics dashboard

Items marked PASS go through; items marked FAIL must be addressed before Monday or the alpha is delayed by a week. In practice many student teams will have 1–3 FAIL items by Friday and a clear weekend remediation list.

The discipline of the checklist is what distinguishes a launch from a deployment. A team that ticks off the checklist is shipping; a team that skips the checklist is rolling the dice.

23.3 Lessons from the cases

Eight specific evaluation lessons from Parts I–III shape Week 5 practice.

23.3.1 Watson Health — no measurable evaluation criteria (Chapters 2, 7)

Watson Health’s failure was partly an evaluation failure. The system’s stated goal — “recommend appropriate cancer treatment” — is essentially unmeasurable on a small pilot, because treatment appropriateness depends on multi-year disease course, comorbidities, and patient outcomes. IBM had no closed-loop evaluation infrastructure that could distinguish good from bad recommendations within months. By the time multi-year evidence accumulated, the product had drifted from the customer.

Operational implication. Closed-loop evaluation in Week 5 is non-negotiable. If your system’s quality cannot be measured on a 100-input golden set in 5 minutes, your scope is wrong. Narrow until the evaluation is tractable.

23.3.2 DBS GANDALF — quality bar from inception (Chapters 4, 6)

DBS’s banking-AI deployments specified the quality bar before the build started: credit-card origination 21 days → 4 days; not “faster,” but a specific number. The pre-commitment forced the evaluation infrastructure to be built alongside the product, not retrofitted afterward.

Operational implication. Your Chapter-21 quality bar is the contract. Week 5’s eval pipeline is what tells you whether you have met it. If the eval is not yet in place by mid-Week 5, your alpha launch must be delayed; running a launch against unmeasured quality is the Klarna pattern (§23.3.4).

23.3.3 Cursor — evaluation through founder use (Chapter 5)

Anysphere’s primary evaluation infrastructure for the first ~12 months was the founders’ own daily use of Cursor. Every friction the founders encountered in their own programming work was a data point. The team did not need a 100-input golden set because they had the most demanding daily users — themselves.

Operational implication. Founder use is a free, high-quality evaluation channel for student teams. Each team member should use the product for at least 30 minutes per week from Week 5 onward, with friction-points logged and treated like customer feedback. This does not replace the golden set, but it complements it cheaply.

23.3.4 Klarna — production traffic as the only evaluation (Chapter 8, forthcoming)

Klarna’s deployment in February 2024 effectively used production traffic — millions of customer interactions — as the primary evaluation. The verdict on whether the AI agent maintained quality at scale came back negative, but only after months of customer experience had already been delivered. The reversal was costly because the evaluation came after the launch, not before.

Operational implication. Pre-launch evaluation against a golden set catches at least 80% of issues that production launch would surface, at far lower cost. The 20% that the golden set misses (the genuine surprises) are best handled by staged production rollout (alpha → beta → general availability) so production traffic is itself a controlled experiment, not a full-scale gamble.

23.3.5 Stitch Fix — data flywheel as continuous evaluation (Chapter 8, forthcoming)

Stitch Fix’s recommendation system continuously evaluated itself through the keep/return decisions on every shipment. Every customer’s choice was a data point on whether the recommendation algorithm was improving. The team did not run separate evaluation studies; the product’s normal operation generated the evaluation signal.

Operational implication. Design your Week-5 logging (the ai_inferences table from Chapter 22) so that user actions on outputs (accept / edit / reject; thumbs up / thumbs down; complete / abandon) are captured as evaluation signal. By Week 7, the production interaction data is your primary evaluation source; the golden set is the regression-test layer.

23.3.6 AlphaGo — self-play as evaluation infrastructure (Chapter 2)

DeepMind’s AlphaGo Zero used self-play to generate evaluation signal: the system played millions of games against itself, with each game producing a clear win/loss signal. The evaluation infrastructure was the same as the training infrastructure; the scale was unprecedented.

Operational implication. Where your task admits self-play or model-vs-model evaluation (model-graded tournaments, debate-style evaluation, adversarial probing), the technique scales evaluation cheaply. For most student-team MVPs the technique is overkill; for some specific products (chatbots, assistants where two variants can be compared head-to-head) it is the right approach.

23.3.7 Anthropic Constitutional AI — eval as alignment infrastructure (Chapter 2)

Anthropic’s Constitutional AI methodology (Bai et al., 2022) uses LLM-as-judge as both training signal and evaluation signal. The evaluation infrastructure is not separate from the safety infrastructure; the same judges that score outputs during training also score outputs during evaluation, ensuring that the safety properties measured during training transfer to deployment.

Operational implication. Your LLM-as-judge prompts, once stable, become reusable: the same judges evaluate Week-5 baseline, Week-6 alpha, Week-7 beta, and Week-8+ production. The investment in good judge prompts (with rubrics, with bias-correction calibration) pays back across the rest of the build.

23.3.8 JPMorgan COiN — narrow scope makes evaluation tractable (Chapter 6)

COiN’s evaluation was straightforward because its scope was narrow: did the system extract clauses from commercial credit agreements with sufficient accuracy that lawyer review-and-correction time fell measurably? The narrow framing made the evaluation question well-formed: “lawyer’s review time” is measurable; “general legal AI quality” is not.

Operational implication. Your evaluation tractability is set in Chapter 21 (MVP scoping) by your wedge specification. A team that scoped narrowly in Chapter 21 has a tractable Week 5; a team that scoped broadly is now discovering that their evaluation question has no good answer. If this is your team, the right move in Week 5 is to narrow further, not to invent a more elaborate evaluation.

23.4 Tools and templates

23.4.1 Golden set design template

GOLDEN SET — [PROJECT]
Version: v1
Created: [date]
Total inputs: [n]

STRATIFICATION
  Dimension 1: [e.g., topic area]
    Stratum 1a: [n inputs]
    Stratum 1b: [n inputs]
    ...
  Dimension 2: [e.g., language]
    ...
  Dimension 3: [e.g., difficulty]
    ...

INPUT SCHEMA (one entry per input)
{
  "id": "GS-001",
  "input_text": "[the input as it would appear from a user]",
  "stratum": {
    "topic": "Algebra",
    "language": "BM",
    "difficulty": "standard"
  },
  "expected_behaviour": {
    "numerical_answer": "...",
    "format_requirements": ["..."],
    "must_include": ["..."],
    "must_not_include": ["..."],
    "clarity_threshold": 4
  },
  "failure_mode_probe": null,  // or "prompt_injection", "ambiguity", etc.
  "source": "SPM 2024 Trial Paper, Q5",
  "notes": "Tests whether system handles substitution method
            correctly when both equations are non-linear."
}

23.4.2 Evaluation script scaffold (Python)

# eval/run_eval.py
"""Run the evaluation pipeline against the current system."""

import json
import time
from datetime import datetime
from pathlib import Path

from anthropic import Anthropic

GOLDEN_SET = Path("eval/golden_set/v1.json")
RESULTS_DIR = Path("eval/results")

anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Helpers referenced below (load_prompt, format_user_prompt, check_numerical,
# check_spm_format, llm_judge_clarity, factuality_check, compute_cost,
# stratify_metrics) are project-specific and live elsewhere in eval/.

def run_system(input_text: str, language: str) -> dict:
    """Call the production system with the given input."""
    # In practice, call the same API endpoint as the live app
    # so the eval mirrors production exactly.
    start = time.perf_counter()
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        system=load_prompt("tutor-spm-v3"),
        messages=[{
            "role": "user",
            "content": format_user_prompt(input_text, language),
        }],
        max_tokens=1500,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "output": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": latency_ms,
    }

def evaluate_one(entry: dict) -> dict:
    """Run one golden-set entry through the system and score."""
    result = run_system(entry["input_text"], entry["stratum"]["language"])
    output = result["output"]

    metrics = {}
    metrics["numerical_correct"] = check_numerical(output, entry)
    metrics["format_pass"] = check_spm_format(output, entry)
    metrics["clarity_score"] = llm_judge_clarity(output, entry)
    metrics["hallucinated_facts"] = factuality_check(output, entry)
    metrics["latency_ms"] = result.get("latency_ms")
    metrics["cost_usd"] = compute_cost(result)

    return {
        "id": entry["id"],
        "stratum": entry["stratum"],
        "metrics": metrics,
        "output_excerpt": output[:200],
    }

def aggregate(all_results: list[dict]) -> dict:
    """Compute aggregate and stratified metrics."""
    n = len(all_results)
    # Aggregate
    aggregate_metrics = {
        "numerical_correct_pct": sum(r["metrics"]["numerical_correct"] for r in all_results) / n * 100,
        "format_pass_pct": sum(r["metrics"]["format_pass"] for r in all_results) / n * 100,
        "clarity_mean": sum(r["metrics"]["clarity_score"] for r in all_results) / n,
        "clarity_pct_4plus": sum(1 for r in all_results if r["metrics"]["clarity_score"] >= 4) / n * 100,
        "hallucination_rate_pct": sum(r["metrics"]["hallucinated_facts"] for r in all_results) / n * 100,
    }
    # Stratified
    stratified = stratify_metrics(all_results)
    return {
        "n": n,
        "aggregate": aggregate_metrics,
        "stratified": stratified,
        "timestamp": datetime.utcnow().isoformat(),
    }

if __name__ == "__main__":
    golden = json.loads(GOLDEN_SET.read_text())
    results = [evaluate_one(entry) for entry in golden]
    summary = aggregate(results)

    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    out = RESULTS_DIR / f"eval-{summary['timestamp']}.json"
    out.write_text(json.dumps({"summary": summary, "details": results}, indent=2))

    print(f"Eval complete: {summary['n']} inputs")
    print(f"  Numerical correct: {summary['aggregate']['numerical_correct_pct']:.1f}%")
    print(f"  Format pass: {summary['aggregate']['format_pass_pct']:.1f}%")
    print(f"  Clarity mean: {summary['aggregate']['clarity_mean']:.2f}")
    print(f"  Clarity ≥4: {summary['aggregate']['clarity_pct_4plus']:.1f}%")
    print(f"  Hallucination rate: {summary['aggregate']['hallucination_rate_pct']:.1f}%")

The scaffold is deliberately skeletal — your team’s specific implementation will plug into your specific data sources and APIs. The structure (run, score, aggregate, persist) is universal.

23.4.3 Evaluation script scaffold (TypeScript, for fully-JS teams)

// eval/run-eval.ts
// Helpers (loadPrompt, formatUserPrompt, checkNumerical, checkSpmFormat,
// llmJudgeClarity, factualityCheck, computeCost, aggregate) and the
// GoldenEntry / EvalResult types mirror the Python scaffold and are
// project-specific imports omitted here.
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { writeFileSync } from 'fs';
import goldenSet from './golden_set/v1.json' with { type: 'json' };

async function evaluateOne(entry: GoldenEntry): Promise<EvalResult> {
  const started = Date.now();
  const result = await generateText({
    model: anthropic('claude-sonnet-4-6'),
    system: loadPrompt('tutor-spm-v3'),
    prompt: formatUserPrompt(entry.input_text, entry.stratum.language),
    maxTokens: 1500,
  });
  const latencyMs = Date.now() - started; // generateText does not report latency; measure it

  const metrics = {
    numerical_correct: checkNumerical(result.text, entry),
    format_pass: checkSpmFormat(result.text, entry),
    clarity_score: await llmJudgeClarity(result.text, entry),
    hallucinated_facts: await factualityCheck(result.text, entry),
    latency_ms: latencyMs,
    cost_usd: computeCost(result.usage),
  };

  return { id: entry.id, stratum: entry.stratum, metrics };
}

async function main() {
  // Promise.all fires every request concurrently; chunk the golden set or add a
  // concurrency limit if you hit provider rate limits.
  const results = await Promise.all(goldenSet.map(evaluateOne));
  const summary = aggregate(results);
  const timestamp = new Date().toISOString();
  writeFileSync(
    `eval/results/eval-${timestamp}.json`,
    JSON.stringify({ summary, details: results }, null, 2)
  );
  console.log('Eval complete:', summary);
}

main();

23.4.4 LLM-as-judge prompt template

(See §23.2.3 above for the full BM clarity rubric template; the same structure applies to any subjective dimension. Key elements: explicit rubric with anchor descriptions per level, structured JSON output, position randomisation when comparing two outputs.)

23.4.5 Human-rater interface and rubric

For the Week-5 human-rater pilot, a Streamlit / Notion / Google Forms interface presents the input + output and the rubric:

RATER INSTRUCTIONS

You will rate explanations given to Form-5 students preparing for SPM
Add Maths. Each item shows the question, the system's explanation, and
the language (BM or English).

Rate on the 5-point clarity scale (see definitions below).
Take 2-3 minutes per item.

CLARITY SCALE
  5 = Exceptionally clear
  4 = Clear
  3 = Mostly clear, minor confusing passages
  2 = Unclear in places
  1 = Confusing

ALSO INDICATE
  - Any factual errors (free-text)
  - Any SPM-format issues (free-text)
  - Whether the explanation would be safe to send to a student
    without teacher review (yes/no)

ITEM 1 of 20
[question]
[explanation]
[language]

[Rating: 1 2 3 4 5]
[Factual errors]
[SPM-format issues]
[Safe without review: Y/N]

Aggregate ratings across raters; compute inter-rater agreement (Cohen’s kappa for binary, Krippendorff’s alpha for ordinal); use disagreements as targets for rubric refinement.
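
Both statistics are one-liners once the ratings are in arrays. A minimal sketch using scikit-learn and SciPy (assumptions); quadratically weighted kappa is a common stand-in for ordinal agreement when a Krippendorff’s alpha implementation is not to hand:

from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Two raters' clarity scores on the same items (toy data, ordinal 1-5)
rater_1 = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
rater_2 = [4, 4, 3, 4, 3, 5, 4, 3, 5, 5]
print("Cohen's kappa:", cohen_kappa_score(rater_1, rater_2))
print("weighted kappa:", cohen_kappa_score(rater_1, rater_2, weights="quadratic"))

# Human mean vs LLM-judge score on the same items: the calibration correlation
human_mean = [(a + b) / 2 for a, b in zip(rater_1, rater_2)]
llm_judge = [4, 5, 3, 4, 2, 5, 4, 2, 4, 5]
rho, _ = spearmanr(human_mean, llm_judge)
print("human-judge Spearman rho:", round(rho, 2))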

23.4.6 CI/CD eval integration template (GitHub Actions)

# .github/workflows/eval.yml
name: Run evaluation on PR

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'lib/ai/**'
      - 'eval/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r eval/requirements.txt
      - name: Run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python eval/run_eval.py
      - name: Compare to baseline
        run: python eval/compare_to_baseline.py > eval/comment.md
      - name: Post results to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('eval/comment.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

The workflow runs the eval on every PR touching prompts or AI logic, posts a comment with metric deltas, and (in mature teams) blocks merge if key metrics regress beyond a threshold.

23.4.7 Cost-per-task tracker

A simple table tracking per-feature unit economics:

Feature               Avg input   Avg output   Avg cost   Avg latency   Cost / 1000
                      tokens      tokens       (USD)      (s)           invocations
AI explanation        800         1200         0.04       3.2           $40
Topic check           200         50           0.005      0.6           $5
Format validation     100         30           0.002      0.4           $2
Practice generation   600         1500         0.05       4.0           $50

Updated weekly. Feeds Chapter 26’s unit-economics analysis.

23.4.8 The pre-alpha checklist (full version)

The full pre-alpha checklist is in §23.2.6 above. It is the single most-important Friday artefact; the team’s go/no-go decision for Week 6 alpha rests on it.

23.5 Worked example — Team Aroma’s Week 5

Team Aroma starts the week with a working vertical slice from Week 4 and ambitious quality bars from Chapter 21: BM clarity ≥80% rated 4 or 5, SPM-format ≥90%, numerical correctness ≥95%, hallucination <2%.

Day 1 (Monday): the golden set

Daniel and Priya lead the golden-set construction since they own SPM curriculum content. They source 60 questions from publicly archived SPM Add Maths trial papers (2022–2024), 25 from the design-partner centre’s internal materials (with permission, anonymised), and 15 team-authored edge cases. Aliyah cross-checks the BM/English allocation against Week-2 customer-discovery findings to ensure realistic distribution.

By Monday 5pm KL, the golden set is committed:

  • 100 inputs total
  • 40 Algebra/Functions, 25 Geometry, 15 Statistics/Probability, 10 Differentiation, 10 Integration
  • 60 BM, 40 English
  • 70 standard, 20 hard, 10 edge case
  • 20 inputs probe failure modes (5 ambiguous, 5 prompt-injection, 5 cross-language, 5 SPM-rubric-vs-textbook divergence)

Each input has documented expected behaviour: the correct numerical answer, format-compliance criteria, and clarity threshold (target 4 of 5). The expected-behaviour annotation took roughly 6 hours of Daniel’s and Aliyah’s time combined — the dominant cost of golden-set construction.

Day 2: reference-based eval and the baseline

Wei Hao writes the Python eval script (eval/run_eval.py, ~200 lines). The script reads the golden set, runs each input through the production endpoint (so the eval mirrors what alpha users will see), and scores numerical correctness (parse the output, compare to expected) and SPM-format compliance (regex/parser check for required structural elements like “Penyelesaian:”, proper unit notation, working shown).

Tuesday afternoon: the team runs the baseline. The first results are sobering:

  • Numerical correctness: 89.0% (target 95%)
  • SPM-format pass: 78.0% (target 90%)
  • (Clarity not yet measurable — LLM-judge pipeline arrives Wednesday)

The Tuesday evening team-sync (4pm KL / 7pm Melbourne) examines the failures. The numerical errors cluster in two patterns: (a) compound-fraction rendering where the model produces decimal approximations rather than exact fractions; (b) integration-by-substitution problems where the model’s substitution choice is non-canonical, producing technically-correct but format-noncompliant answers.

The SPM-format failures cluster in three patterns: (a) missing the Malay-language section header “Penyelesaian:” in BM-language outputs; (b) units in problems that involve real-world scenarios; (c) inconsistent decimal-place precision. Wei Hao opens four tickets in Linear, each with a specific failure pattern.

Day 3: LLM-as-judge for clarity

Wei Hao implements the LLM-as-judge pipeline. The judge model is Claude Opus 4.7 (one tier up from the Sonnet 4.6 the production system uses). The clarity rubric is the one drafted in §23.2.3, with explicit 1–5 anchors.

Daniel and Priya pilot the rubric: they each rate 10 outputs from the Tuesday baseline using the rubric. Their inter-rater agreement is high (Cohen’s kappa 0.71). The LLM-as-judge runs on the same 10 outputs; correlation with the human ratings is 0.78. The judge slightly under-rates BM-language explanations (mean LLM rating 3.2 vs mean human rating 3.5). Daniel adjusts the rubric to add explicit BM-language considerations; the recalibrated judge correlates 0.83 with humans.

The Wednesday evening LLM-judge run on all 100 outputs:

  • BM clarity (% rated 4 or 5): 64% (target 80%)
  • English clarity (% rated 4 or 5): 76% (target 80%)

Both fall short of the target. The team is alarmed but not surprised — the Week-3 prompt was a v1 that had not been iterated against measured quality.

Day 4: prompt iteration

Wednesday night and Thursday morning: four rounds of prompt iteration, each scored against the same 100-input golden set.

Iteration 1. Daniel rewrites the system prompt to add explicit SPM-format instructions, BM-language register guidance, and step-by-step structure requirements.

Metric                Baseline    After v2    Δ
Numerical correct     89%         89%         —
SPM-format            78%         86%         +8 pp
BM clarity (≥4)       64%         71%         +7 pp
EN clarity (≥4)       76%         78%         +2 pp

Improvement on SPM-format and BM clarity; English clarity barely moved (which is expected — the English-language register was already adequate; the changes targeted BM).

Iteration 2. Priya adds three worked examples of ideal outputs to the prompt (one Algebra, one Geometry, one Statistics). The intuition: few-shot examples teach the model the format better than instructions alone.

Metric                After v2    After v3    Δ
Numerical correct     89%         92%         +3 pp
SPM-format            86%         91%         +5 pp
BM clarity (≥4)       71%         78%         +7 pp
EN clarity (≥4)       78%         81%         +3 pp

English clarity now clears the 80% bar and BM clarity is approaching it. SPM-format clears the 90% bar. Numerical correctness rises (likely because the worked examples show correct procedural steps).

Iteration 3. Wei Hao adds a self-check pass: after generating the explanation, the model is prompted to verify its own answer against the question. This is a chain-of-thought-style intervention.

Metric                After v3    After v4    Δ
Numerical correct     92%         96%         +4 pp
SPM-format            91%         92%         +1 pp
BM clarity (≥4)       78%         80%         +2 pp
EN clarity (≥4)       81%         82%         +1 pp
Cost per query        USD 0.04    USD 0.07    +75%
Latency               3.2s        5.8s        +81%

Numerical correctness now exceeds the target. BM clarity is exactly at the target. Cost and latency have risen substantially because of the self-check pass. The team debates whether the cost increase is worth the quality improvement.

Iteration 4. Aliyah proposes making the self-check pass conditional: only run it on outputs the model itself flagged as low-confidence in the initial generation, plus a 10% random sample.

  Metric               After v4   After v5   Δ
  Numerical correct    96%        95%        –1 pp
  SPM-format           92%        92%        0 pp
  BM clarity (≥4)      80%        81%        +1 pp
  EN clarity (≥4)      82%        83%        +1 pp
  Cost per query       USD 0.07   USD 0.045  –35%
  Latency              5.8 s      4.1 s      –29%

The conditional self-check costs 1 pp on numerical correctness (still above target) but recovers most of the cost and latency increase. The team commits to v5 as the alpha-launch prompt.
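
One way the conditional self-check might be wired, sketched under assumptions about the confidence-flag convention, prompts, and model ID (none of which are the team's actual code):

```python
# Conditional self-check sketch: run the verification pass only when the first
# pass flags low confidence, plus a 10% random sample.
import random
import re
import anthropic

client = anthropic.Anthropic()
PROD_MODEL = "claude-sonnet-4-6"  # placeholder ID for the production-tier model
SAMPLE_RATE = 0.10                # 10% random sample gets the check regardless
SYSTEM_PROMPT_V5 = "(the v5 system prompt, which asks the model to end with CONFIDENCE: high|low)"

def _ask(system: str, user: str) -> str:
    resp = client.messages.create(model=PROD_MODEL, max_tokens=1500, system=system,
                                  messages=[{"role": "user", "content": user}])
    return resp.content[0].text

def _strip_confidence(text: str) -> str:
    return re.sub(r"\n?CONFIDENCE:.*", "", text, flags=re.IGNORECASE).strip()

def answer_with_conditional_check(question: str, language: str) -> str:
    draft = _ask(SYSTEM_PROMPT_V5, f"Language: {language}\n\n{question}")
    low_confidence = bool(re.search(r"CONFIDENCE:\s*low", draft, re.IGNORECASE))

    if not (low_confidence or random.random() < SAMPLE_RATE):
        return _strip_confidence(draft)   # fast path: no second model call

    # Verification pass: independently re-derive the answer, compare to the draft.
    verdict = _ask("Re-derive the answer to the question and reply MATCH if the draft's "
                   "final answer agrees, MISMATCH otherwise.",
                   f"Question:\n{question}\n\nDraft solution:\n{draft}")
    if "MISMATCH" in verdict:
        draft = _ask(SYSTEM_PROMPT_V5,
                     f"Language: {language}\n\n{question}\n\n"
                     "Your earlier attempt was flagged as incorrect; redo it carefully.")
    return _strip_confidence(draft)
```

The fast path returns without a second model call, which is where the cost and latency recovery in the table above comes from.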

Day 4 evening: the regression catch

On Thursday evening Wei Hao runs a final eval against an extended golden set (the original 100 plus 30 newly-added probes from the day’s failure-mode work). One of the new probes catches a regression: in v3 (when the worked examples were added), the model began ignoring the user’s stated preferred language for ~5% of inputs and replying in the language of the worked example instead. The behaviour is a known idiosyncrasy of few-shot prompting.

The fix is a single-line addition to the prompt: “Always respond in the language requested by the user, regardless of the language of the worked examples below.” Re-run: language-following at 100%. The fix takes 15 minutes; without the regression suite, the issue would likely have surfaced only through user-reported bugs weeks into the alpha.
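
A regression probe of this kind reduces to a simple assertion on the output. The sketch below uses a crude keyword heuristic for language identification; a language-ID library or a judge call would be sturdier, and the marker list is illustrative:

```python
# Language-following probe sketch: check that the response language matches the
# requested language. The Malay-marker list is an illustrative heuristic only.
MALAY_MARKERS = {"penyelesaian", "jawapan", "oleh itu", "maka", "ialah", "dengan"}

def responded_in_malay(output: str) -> bool:
    text = output.lower()
    return sum(marker in text for marker in MALAY_MARKERS) >= 2

def language_following_ok(requested_language: str, output: str) -> bool:
    is_malay = responded_in_malay(output)
    return is_malay if requested_language == "bm" else not is_malay
```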

Day 5: the metrics dashboard and pre-alpha checklist

By Friday afternoon Wei Hao has built a Streamlit-based metrics dashboard (deployed to Streamlit Cloud free tier) that pulls data from the eval results JSONs and displays current vs target. Sara has built a similar view in Notion that the non-engineers on the team can read.
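
A dashboard at this fidelity is a few dozen lines. A minimal sketch, with illustrative file paths, metric keys, and targets:

```python
# Streamlit dashboard sketch: read the latest eval-results JSON and show each
# metric against its quality bar. Path, keys, and targets are illustrative.
import json
import streamlit as st

TARGETS = {"numerical_correct": 0.95, "spm_format_pass": 0.90,
           "bm_clarity_4plus": 0.80, "en_clarity_4plus": 0.80}

st.title("Week 5 eval dashboard")

with open("eval/results/latest.json") as f:
    results = json.load(f)

cols = st.columns(len(TARGETS))
for col, (metric, target) in zip(cols, TARGETS.items()):
    current = results[metric]
    col.metric(label=metric,
               value=f"{current:.0%}",
               delta=f"{(current - target):+.0%} vs target")
```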

The Friday team meeting reviews the pre-alpha checklist. PASS items: all five quality bars at or above target; logging is comprehensive; deployment URL stable for 72 hours; rollback plan tested. FAIL items: alpha cohort has only 4 users confirmed (target 5+); parental-consent flow for student users not yet tested; PostHog dashboard showing zero events on three of the eight tracked actions.

The team commits to fixing the FAIL items over the weekend: Aliyah will personally re-confirm two alpha users from the Week-2 corpus and onboard them; Sara will test the parental-consent flow with a friend’s child; Wei Hao will debug the PostHog event tracking. By Sunday evening all FAIL items will be PASS.

The Friday submission goes in at 10:30pm KL: the golden set, the eval pipeline code, the metrics dashboard URL, the four iteration reports with metric deltas, the failure-mode catalogue (with the language-regression case study), the cost-per-task tracker, and the pre-alpha checklist with three weekend follow-ups documented.

What Team Aroma got right and what they almost got wrong

Three things they did well: (1) the calibration of the LLM judge against human raters caught the BM under-rating bias before it skewed iteration decisions; (2) the conditional self-check in iteration 4 recovered most of the cost without losing quality, demonstrating that prompt engineering is a multi-objective optimisation; (3) the regression suite caught the language-following regression that none of the team would have spotted in casual testing.

Three things they almost got wrong: the team almost shipped v4 (the iteration-3 prompt with the unconditional self-check) because all five quality bars passed, even though the cost increase would have broken the unit economics in Week 8; they almost rated only 5 outputs each in the human-eval pilot, which would have produced unreliable inter-rater agreement; and they almost extrapolated the 100-input golden-set results too confidently to the production population, before the team caught the BM-Geometry stratum’s 12% failure rate hidden under the aggregate 81%; that stratum is the alpha-launch risk to monitor.

The pattern is general. Week 5 is high-leverage because the discipline of measurement forces the team to confront which version of the system they actually have, and to iterate against evidence rather than intuition. Without measurement, the alpha launches against the team’s hopes; with measurement, the alpha launches against the team’s documented quality position.

23.6 Course exercises and Week 5 deliverable

Submit the Week 5 deliverable bundle by Friday 23:59. Required artefacts:

23.6.1 Required artefacts

  1. Golden set v1 in version control (~100 inputs with documented expected behaviour and stratification dimensions).
  2. Evaluation pipeline code in the repo, with CI/CD integration on PRs touching prompts or AI logic (a minimal quality-gate sketch follows this list).
  3. Baseline evaluation report showing initial metrics against quality bars, with stratified breakdowns and failure-mode analysis.
  4. Iteration reports for at least three rounds of prompt iteration, each with before/after metric deltas.
  5. Metrics dashboard (URL or screenshot) displaying current quality status against quality bars from Chapter 21.
  6. Failure-mode catalogue with examples of each major failure mode encountered, and the team’s response.
  7. Cost-per-task tracker showing actual unit economics per major feature.
  8. LLM-as-judge calibration report showing human-vs-judge correlation on at least 50 outputs, with rubric adjustments documented.
  9. Pre-alpha checklist with PASS/FAIL on every item, with weekend follow-up plans for any FAIL.
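
For item 2, a CI-side quality gate can be as small as a script that exits non-zero when any metric drops below its bar; the PR workflow then blocks the merge. A minimal sketch, with illustrative paths and bars:

```python
# CI quality-gate sketch: fail the build if any eval metric falls below its bar.
# File path, metric keys, and bars are illustrative assumptions.
import json
import sys

QUALITY_BARS = {"numerical_correct": 0.95, "spm_format_pass": 0.90,
                "bm_clarity_4plus": 0.80, "en_clarity_4plus": 0.80}

def main(results_path: str = "eval/results/latest.json") -> int:
    with open(results_path) as f:
        results = json.load(f)
    failures = {m: results.get(m, 0.0) for m, bar in QUALITY_BARS.items()
                if results.get(m, 0.0) < bar}
    for metric, value in failures.items():
        print(f"FAIL {metric}: {value:.1%} < {QUALITY_BARS[metric]:.1%}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```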

23.6.2 Grading rubric (50 points)

  Component                    Points   Distinction-level criteria
  Golden-set quality           10       ≥100 inputs; stratified by ≥3 dimensions; ≥20% failure-mode coverage; documented expected behaviour
  Evaluation pipeline rigour   10       Reference-based + LLM-judge implemented; CI/CD integration; output structured for downstream analysis
  Iteration evidence           10       ≥3 iterations with documented metric deltas; trade-offs (quality vs cost vs latency) explicit
  Quality-bar attainment       5        All Chapter-21 quality bars at or above target by Friday; FAIL items have remediation plans
  Statistical thinking         5        Sample-size awareness; stratified metrics; multiple-testing acknowledged where relevant
  LLM-judge calibration        5        Human-judge correlation ≥0.7; biases identified and corrected
  Pre-alpha checklist          5        All items addressed; FAIL items have credible remediation by Monday

Pass: 30. Credit: 36. Distinction: 42. High Distinction: 47.

The team-comprehension penalty from §19.6.2 applies; additionally, every team member must be able to read the metrics dashboard and explain what each metric measures.

23.6.3 Things to do before Monday of Week 6

By Sunday evening of Week 5, in addition to the deliverable submission:

  • Confirm the alpha cohort: 5+ users committed, with onboarding scheduled for Monday or Tuesday of Week 6.
  • Write the alpha-onboarding doc: 1 page covering what the alpha is, what users should and shouldn’t expect, how to give feedback, and how to escalate problems.
  • Set up the alpha-feedback channel (a dedicated Slack channel, a Notion form, or both). Establish team-side response times.
  • Read Chapter 5 (Strategy, collisions, and the new meta) and §24.1–§24.3 of Chapter 24 (Alpha launch and early users) before Monday of Week 6.

References for this chapter

Lean methodology and build-measure-learn

  • Ries, E. (2011). The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.
  • Maurya, A. (2012). Running Lean: Iterate from Plan A to a Plan That Works. O’Reilly.

Statistical learning theory and evaluation

  • Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. (2nd ed.) Springer.
  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the bias-variance trade-off. PNAS 116(32): 15849–15854. (Benign overfitting / double descent.)

LLM evaluation methodology

  • Liang, P., Bommasani, R., Lee, T., et al. (2023). Holistic evaluation of language models (HELM). Transactions on Machine Learning Research.
  • Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). Judging LLM-as-a-judge with MT-bench and Chatbot Arena. NeurIPS Datasets and Benchmarks Track.
  • Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). Chatbot Arena: An open platform for evaluating LLMs by human preference. ICML.
  • Anthropic (2024). Evaluation cookbook. docs.anthropic.com.
  • OpenAI (2024–2026). OpenAI Evals. github.com/openai/evals.

RAG and retrieval evaluation

  • Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2024). RAGAS: Automated evaluation of retrieval-augmented generation. EACL.
  • LangChain Inc. (2024). LangSmith documentation.
  • Langfuse (2024). Langfuse documentation. langfuse.com.

Cases referenced in §23.3

  • Iansiti, M. and Lakhani, K. R. (2020). Competing in the Age of AI. Harvard Business Review Press.
  • Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
  • Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). Mastering the game of Go without human knowledge. Nature 550: 354–359.
  • Klarna AB (2024, 2025). Press releases on AI customer service deployment and reversal.

Statistical methods for AI evaluation

  • Kohavi, R., Tang, D., and Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  • Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. (4th ed.) Sage.

Failure modes and red-teaming

  • Perez, E., Huang, S., Song, F., et al. (2022). Red teaming language models with language models. EMNLP.
  • Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., and Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. AISec.
  • Anthropic (2024). Building safe LLM applications: A guide to red-teaming. anthropic.com.

Further reading

For the foundational treatment of statistical learning theory, Vapnik (1995) is the primary source; Hastie-Tibshirani-Friedman remains the standard graduate text for applied statistical learning. For the modern empirical phenomena that depart from the classical theory (benign overfitting, double descent, scaling laws), the Belkin et al. (2019) PNAS paper is the field’s clearest exposition.

For LLM evaluation specifically, the HELM paper (Liang et al., 2023) is the most-systematic public benchmark suite; the Chatbot Arena papers (Zheng et al., 2023; Chiang et al., 2024) are the standard reference for LLM-as-judge methodology. The Anthropic evaluation cookbook and OpenAI Evals project provide practitioner-oriented templates.

For RAG-specific evaluation, RAGAS (Es et al., 2024) is the current state of the art; the LangSmith and Langfuse platforms provide the operational tooling. For the ongoing red-teaming literature, the Perez et al. (2022) paper is the foundational reference; the Greshake et al. (2023) paper on indirect prompt injection is the most-cited contemporary attack-surface analysis.

For statistical methodology in online experimentation (relevant from Week 7 onward), Kohavi-Tang-Xu Trustworthy Online Controlled Experiments is the practitioner bible.
