Chapter 21 — Week 3: MVP design for AI products

Welcome to Week 3. You have a problem statement backed by 20+ interviews, a named primary segment, and a customer profile (Value Prop Canvas right side). This week you design the smallest product that will produce validated learning — and only that. The single most-common Week-3 mistake is to design v1 of the final product rather than the MVP. By Friday you will have an MVP scoping document, a value proposition canvas with both sides populated, a tech-stack decision, and a quality-bar specification. Build starts Monday of Week 4.

Chapter overview

This chapter follows the same six-part structure as Chapters 19–20. §21.1 (Concept) sets out what an MVP is — and isn’t — for AI products specifically; introduces the five MVP archetypes (Wizard of Oz, Concierge, single-feature, narrow-vertical, proof-of-concept); develops the riskiest-assumption-test framework; and completes the Value Proposition Canvas introduced in Chapter 20 by populating the left side. §21.2 (Method) is the day-by-day Week 3 sprint: identify riskiest assumptions, choose archetype, scope features (MoSCoW), make build/buy/borrow decisions, design the value proposition, write the scoping document. §21.3 (Lessons from the cases) pulls eight specific lessons from Parts I–III on MVP design, including the JPMorgan COiN narrow-framing principle, the Watson Health broad-framing failure, and Anthropic’s progressive-disclosure release pattern. §21.4 (Tools and templates) gives you the riskiest-assumption taxonomy, MoSCoW worksheet, build/buy/borrow matrix, foundation-model selection guide for the 2026 stack, and the nine-section MVP scoping document template. §21.5 (Worked example) continues Team Aroma’s Pulse-for-tutoring-centres project from Week 2 through MVP scoping and a critical build-vs-buy decision. §21.6 (Course exercises and deliverables) specifies the Week 3 submission with grading rubric.

How to read this chapter. Read §21.1 in full at the team’s Monday meeting. Read §21.2 with the team and decide who owns each step. Treat §21.3 as Tuesday-evening reading once the riskiest-assumption work is done. Use §21.4 throughout the week, particularly the foundation-model selection guide in §21.4.5 (which is the most operationally consequential section). Read §21.5 before drafting your own scoping document on Thursday. Submit against §21.6 by Friday 23:59.

21.1 Concept

21.1.1 What a minimum viable product is — and isn’t

The term “minimum viable product” is among the most misused in the startup vocabulary. Eric Ries’s Lean Startup (2011) is unambiguous: an MVP is the smallest product that produces validated learning. It is not v0.5 of the final product. It is not a proof-of-concept. It is not a polished prototype. It is the smallest object you can put in front of a customer that lets you definitively answer the question: should we keep building this, change direction, or stop?

Three properties follow from the validated-learning criterion:

  • The MVP is defined by what you intend to learn, not by which features it includes. If your riskiest assumption is “centre owners will pay RM 30/student/month for an SPM-aligned tutor tool,” the MVP is whatever lets you test that assumption fastest. If the riskiest assumption is “AI explanations in BM are good enough that students prefer them to existing alternatives,” the MVP is something different — perhaps a no-business-model prototype that lets students compare two answers.
  • The MVP exists to make a decision tractable. After the MVP runs, you should be able to say with evidence whether to persevere, pivot, or abandon. An MVP that produces ambiguous results — “well, some users liked it” — has not done its job.
  • The MVP is throwaway by default. The code, the UX, even the architecture should be assumed to be discarded. A team that designs an MVP they intend to scale into production has misunderstood the exercise. The MVP exists to answer questions cheaply; the production system is built later, against the validated answers, often with completely different technology choices.

This is not an academic distinction. In the 2024–2026 case literature, most failed AI startups fail the same way: they spent 12–18 months building a v1 product before validating the problem-customer-segment hypothesis. The Watson Health case (Chapter 7) is the most expensive instance: IBM built years of product before validating that oncologists wanted what was being built. By the time the validation evidence arrived, the product was too far along to redirect.

21.1.2 Why AI products need a different MVP discipline

Five distinctive properties of AI products affect MVP design. A team that designs an MVP as if for a generic SaaS product will systematically misjudge the appropriate scope.

The capability-quality continuum. Classical software either works or it does not. AI software works along a continuum: a tutor that produces correct explanations 60% of the time, one that manages 80%, and one that manages 95% are three different products. The MVP must specify where on the quality continuum you are operating, and whether that quality level is sufficient to answer your validated-learning question. A 60%-quality tutor probably cannot answer “would centres pay for this?” because the centre owner cannot distinguish a poor product from a poor problem statement.

The trust threshold. AI products in regulated and B2B contexts must clear a trust threshold before users will adopt them at all. Below the threshold, even a technically working product produces no learning, because no customer will use it long enough to inform you. This means the MVP must be sufficient to the trust threshold for your specific customer segment, even when the underlying capability is rough.

The wow-versus-reliability gap. Generative AI demos elicit enthusiasm that does not survive contact with daily use. An MVP that wins on first impression but fails on the third, fourth, and fifth use produces a misleading positive signal. The MVP needs to be designed for sustained engagement, not for the demo moment.

The data-flywheel dependency. Many AI products’ value depends on the data flywheel turning — every interaction makes the product better. An MVP with too few users (under ~50 active per week, typically) cannot turn the flywheel and therefore cannot produce the learning that the eventual production product would. The implication: scope the MVP narrowly enough that you can concentrate enough users in a single segment to turn the flywheel even at small scale.

The narrow-specification advantage. AI products benefit disproportionately from narrow specification because the underlying foundation models are general-purpose; the value comes from how you specialise them. A narrow MVP (specific customer, specific workflow, specific quality bar) is much easier to build and much easier to evaluate than a general-purpose one. The JPMorgan COiN pattern (Chapter 6) generalises: read commercial credit agreements, not “any contract.”

21.1.3 Five MVP archetypes for AI products

There are five durable MVP archetypes for AI products. Most successful AI MVPs are one of these five, and most failures are AI products built without a clear archetype.

Archetype 1: Wizard of Oz. Humans simulate the AI capability behind the scenes; the customer experience is identical to the eventual AI product, but the actual processing is manual. The technique is named after the 1939 film: the user sees a powerful wizard; behind the curtain is an ordinary person operating levers. Used famously by IBM in the 1980s for speech-recognition research, and by countless 2020s AI startups.

The Wizard of Oz MVP works particularly well when (a) the riskiest assumption is desirability — would users actually use this if it worked? — and (b) the production AI is hard to build but the manual simulation is fast. The downside: it does not test feasibility (can we actually build the AI?), and it requires manual labour that does not scale. Use Wizard of Oz when you want to validate whether users will adopt a workflow, not whether the AI capability exists.

Archetype 2: Concierge. Fully manual delivery of the eventual product’s value, with no attempt to simulate the production technology. The user knows they are getting a manual service; the startup learns whether the service is valuable enough that customers will pay for it, and what specifications they care about. Stitch Fix’s first months were a concierge MVP: founder Katrina Lake hand-picked clothes for friends and shipped them in boxes, learning which selection criteria mattered before any algorithmic recommendation was built.

The Concierge MVP works when (a) the riskiest assumption is will customers pay for this kind of service at all?, and (b) the founders have time to deliver the service manually for the validation period. It does not test technical feasibility; it tests the business hypothesis. Use Concierge when the business case itself is uncertain, not when the technology is uncertain.

Archetype 3: Single-feature MVP. Build one capability, deeply, and ignore everything else. The product does one thing well rather than many things poorly. Cursor’s first version was a single-feature MVP: the IDE wrapper around an LLM, with one core interaction (code completion in context). It did not have the agent mode, the multi-file edit, or the AI-pair-programming features it has today; those came later. The single-feature MVP works when you are confident the single feature is the wedge — the thing customers will hire your product for — and you can de-risk the surrounding capabilities by iteration.

Archetype 4: Narrow-vertical MVP. Pick one customer segment and build the full workflow for them, ignoring all other segments. Glean’s first version was a narrow-vertical MVP: enterprise search for software-engineering teams at tech companies, ignoring sales teams, marketing teams, and non-tech industries entirely. Once the engineer-segment workflow worked, expansion to adjacent segments was tractable. Use a narrow-vertical MVP when (a) workflow integration matters more than capability breadth, and (b) the segment you choose can produce useful learning even at small scale.

Archetype 5: Proof-of-concept MVP. Prove the hardest technical risk before anything else. Anthropic’s Claude initial release was a proof-of-concept MVP for aligned-by-default foundation models: the team prioritised proving that constitutional AI (Bai et al., 2022) could produce a usable conversational model, before investing in product packaging, distribution, or surface area. Use a proof-of-concept MVP when (a) the technical risk is the dominant uncertainty, and (b) failure to prove technical feasibility makes the rest of the business plan moot.

The five archetypes are not mutually exclusive. A team can run a Concierge MVP first (to test the business hypothesis), then a Wizard of Oz (to test workflow desirability), then a single-feature MVP (to test the capability that matters most). Most successful AI startups in 2024–2026 are on archetype 3 or 4 by the time they are externally visible; archetypes 1 and 2 are typically invisible because they precede public launch.

21.1.4 Riskiest assumption tests (RATs)

Marty Cagan’s Inspired (2017) develops the framework that has become the industry standard: every product faces four classes of risk, and the MVP should test the riskiest first.

  • Desirability risk — will users actually use this? Tested through customer behaviour, not customer surveys.
  • Viability risk — can this be a business? Tested through pricing experiments, willingness-to-pay analysis, unit economics modelling.
  • Feasibility risk — can we actually build this with the technology available? Tested through technical proof-of-concept work.
  • Ethical and regulatory risk — is this allowed and acceptable? Tested through legal review, regulatory consultation, and stakeholder engagement.

For AI products specifically, the four classes have characteristic patterns:

  • Desirability risk is often higher than founders think because the demo-versus-usage gap is wide. Survey-stated enthusiasm overstates actual adoption.
  • Viability risk is often lower than founders think because foundation-model API costs have fallen so substantially (Stanford AI Index, 2025) that previously uneconomic use cases are now viable. The harder viability question is usually distribution cost (CAC), not delivery cost.
  • Feasibility risk varies enormously by product. For wrappers around foundation-model APIs, feasibility is essentially solved. For products requiring fine-tuned models, RAG with proprietary data, or agentic workflows, feasibility risk can be the dominant uncertainty.
  • Ethical/regulatory risk is often much higher than founders think in regulated domains (healthcare, finance, education-of-minors, employment screening, criminal justice), where the EU AI Act, GDPR, PDPA, ECOA, FCRA, and various sector-specific frameworks impose substantial constraints.

The Week-3 method requires you to score each of your assumptions on probability (how likely is the assumption to be wrong?) and impact (if wrong, how badly does the business plan break?). The product of probability × impact identifies the riskiest assumptions. The MVP exists to test the top 1–3 of these.
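
The scoring mechanics are simple enough for a spreadsheet, but a short script keeps the arithmetic honest as the team iterates on the list. A minimal sketch in Python; the assumption texts and scores are illustrative placeholders, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class Assumption:
    text: str
    risk_class: str   # Desirability / Viability / Feasibility / Ethical
    probability: int  # 1-5: how likely is this assumption to be wrong?
    impact: int       # 1-5: if wrong, how badly does the plan break?

    @property
    def risk_score(self) -> int:
        return self.probability * self.impact

# Illustrative entries only; replace with your team's consolidated list.
assumptions = [
    Assumption("Centre owners will pay RM 30/student/month", "Viability", 3, 5),
    Assumption("AI explanations are clear enough for teachers", "Feasibility", 4, 5),
    Assumption("Teachers will adopt the tool day-to-day", "Desirability", 4, 4),
]

# Rank by probability x impact; the top 1-3 become the MVP's targets.
for a in sorted(assumptions, key=lambda a: a.risk_score, reverse=True)[:3]:
    print(f"{a.risk_score:>2}  [{a.risk_class}] {a.text}")
```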

21.1.5 The Value Proposition Canvas, completed

Chapter 20 produced the customer profile (right side of Osterwalder’s canvas): jobs, pains, gains. This week produces the value proposition (left side): products and services, pain relievers, gain creators. Fit is achieved when each pain has a pain reliever and each gain has a gain creator.

The Week-3 method:

  1. Take each pain from the Week-2 customer profile and ask: what specific feature, behaviour, or capability of our product makes this pain go away or smaller? The answer is a pain reliever. Pain relievers should be specific (“explanations rendered in BM with localised idioms”) not generic (“better quality”).
  2. Take each gain from the customer profile and ask: what specific feature creates this gain? That is a gain creator.
  3. Inventory the products and services. The list of distinct features, capabilities, or service elements your MVP will offer.
  4. Cross-check fit. Every high-priority pain should have a corresponding pain reliever in the MVP; every required gain should have a corresponding gain creator. Pain relievers without corresponding pains, or gain creators without corresponding gains, are over-engineering — features that do not address validated customer needs and should be cut.

The Value Proposition Canvas is the artefact that forces you to defend each MVP feature against a customer-side need. A team that completes the canvas honestly will typically find that 30–40% of their initial feature list does not map to a validated pain or gain — those features should be cut or deferred.

21.1.6 The MoSCoW prioritisation method

The MoSCoW method (Clegg and Barker, 1994; widely adopted in agile project management) is the standard tool for MVP feature prioritisation. Each candidate feature is sorted into one of four buckets:

  • Must have: without this feature, the MVP cannot produce validated learning. This is typically 5–8 features.
  • Should have: features that improve the MVP substantially but whose absence does not break the validation experiment. Typically 3–5 features.
  • Could have: features that would be nice but are clearly not in scope. Typically 5–10 features.
  • Won’t have (this version): features that have come up in design conversations but will explicitly not be in the MVP. Naming them prevents scope creep.

The discipline that matters most is the must vs should boundary. A feature in “must” is justified by the validated-learning criterion; without it, you cannot answer the question your MVP is trying to answer. A feature in “should” might improve the MVP but is not strictly necessary. Most novice teams put features in “must” that should be in “should”; the result is over-scoped MVPs that cannot ship in time.

A useful diagnostic question for each “must have”: if we removed this feature, what learning would we lose? If the answer is “nothing important to our validated-learning question,” the feature is not “must” — it is “should” or lower.

21.1.7 Build, buy, or borrow

For each capability your MVP requires, you face a build/buy/borrow choice:

  • Build — write your own code. Appropriate for the capabilities that constitute your competitive advantage.
  • Buy — purchase a hosted service or commercial API. Appropriate for capabilities that are not differentiating but are needed.
  • Borrow — use open-source libraries, free APIs, or public datasets. Appropriate for capabilities that are commoditised and where the open-source option is mature.

For 2026 AI MVPs with low-code preference, the typical decomposition is:

| Capability | Default choice | Reasoning |
|---|---|---|
| Foundation model inference | Buy (API) | Frontier capability for $0.01–0.10 per query; building yourself is uneconomic at MVP scale |
| Frontend / app shell | Borrow (low-code platform) | Lovable, v0, Replit Agent, or Bolt produce production-grade UIs in hours |
| Hosting and database | Buy (PaaS) | Vercel, Railway, Supabase, or Render; minimal devops at low cost |
| Auth and user management | Buy | Clerk or Auth.js; reinventing auth is a known anti-pattern |
| Vector store / RAG | Buy | Pinecone, Weaviate Cloud, or pgvector on hosted Postgres |
| Analytics / event tracking | Buy | PostHog, Plausible, or Mixpanel free tier |
| Payments | Buy | Stripe (Pty Ltd contexts) or local equivalents (KL: iPay88, Billplz) |
| Domain-specific AI capability | Build | This is your wedge; you must own it |
| Integrations with customer systems | Build | Where workflow integration is the moat |

The 80/20 of an AI MVP build is therefore: ~80% of the technical surface is buy/borrow, ~20% is build (and that 20% is your wedge). A team that finds itself building authentication, hosting, or basic LLM API access is over-investing in commoditised infrastructure at the expense of the build that matters.

21.2 Method — the Week 3 sprint

21.2.1 Day 1: identify the riskiest assumptions

By Tuesday morning, the team should have a written list of 8–15 business assumptions, each scored on probability × impact, with the top 1–3 identified as the targets for MVP testing.

The method:

  1. Brainstorm assumptions. Each team member writes down 5–7 assumptions the business plan is making. An assumption is a statement that, if false, would damage the business — “customers will pay X”, “the AI capability is good enough”, “distribution channel Y exists”, “the regulatory pathway is clear”. Aim for 25–35 raw assumptions across the team.
  2. Cluster and dedupe. Group similar assumptions; consolidate to 8–15 distinct claims.
  3. Score each on probability and impact. Probability: how likely is this assumption to be wrong? (1–5, where 1 = nearly certain to be true, 5 = highly uncertain). Impact: if wrong, how badly does the plan break? (1–5, where 1 = minor adjustment, 5 = whole plan collapses). Multiply for the risk score.
  4. Identify the top 1–3 highest-risk assumptions. These are what the MVP must test. Lower-risk assumptions can be tested later or accepted.

A common pattern: novice teams under-score the desirability assumptions (because their customer-discovery interviews produced enthusiasm) and over-score the feasibility assumptions (because the technical work seems hard). The Week-2 lessons (the demo-versus-usage gap; the wow-factor trap) should counterweight this. Desirability risk in AI products is typically higher than founders intuit, not lower.

21.2.2 Day 2: choose the MVP archetype

By Tuesday evening, the team has selected one of the five archetypes from §21.1.3 as the primary frame for the MVP, with explicit rationale tied to the riskiest-assumption analysis.

The selection rubric:

| If the riskiest assumption is… | Choose… |
|---|---|
| Customers will adopt this kind of workflow | Wizard of Oz |
| Customers will pay for this kind of service | Concierge |
| The single core capability is valuable enough | Single-feature |
| We can serve a specific segment end-to-end | Narrow-vertical |
| The hardest technical capability is achievable | Proof-of-concept |

The archetype is not a taxonomy of products; it is a frame for the MVP’s purpose. Most actual MVPs are hybrids — Team Aroma’s MVP in §21.5 is a Concierge + Wizard-of-Oz hybrid with a narrow-vertical scope. The selection’s purpose is to clarify the primary validation question, not to lock in a single architecture.

21.2.3 Day 2–3: scope the features (MoSCoW)

By Wednesday end-of-day, you should have a complete MoSCoW prioritisation. Method:

  1. Brainstorm features. Each team member writes down 10+ candidate features. Aim for a raw list of 40–60 features.
  2. Group similar features. Consolidate to a structured list of 20–30 distinct candidates.
  3. Sort each into Must / Should / Could / Won’t. Apply the diagnostic question from §21.1.6: if we removed this feature, what learning would we lose? If the answer is “nothing tied to our riskiest assumption,” the feature is at most a “should.”
  4. Sanity-check the Must list. A typical 4-week MVP has 5–8 must-have features. If your Must list has 15 features, you are over-scoped; go back down the list and demote.
  5. Document the Won’t list explicitly. Naming what is out of scope prevents scope-creep arguments in Weeks 5–6 when team members try to add “just this one feature.”

The MoSCoW output becomes the input to the build/buy/borrow analysis (§21.2.4) and the MVP scoping document (§21.2.6).

21.2.4 Day 3: build/buy/borrow decisions

For each Must-have feature, complete a build/buy/borrow analysis. The decision is captured in a matrix:

| Feature | Build / Buy / Borrow | Choice rationale | Cost (~) | Risk |
|---|---|---|---|---|
| User authentication | Buy | Clerk; standard for student startups; free tier sufficient (paid tiers from ~AUD 30–50/month) | Free | Low |
| Foundation-model inference | Buy | Anthropic API (Claude Sonnet); strong for BM/multilingual; per-query economics work | ~USD 0.02–0.05/query | Medium (capability) |
| Database / hosting | Buy | Supabase + Vercel; team has zero devops experience | Free tier | Low |
| Question-answer evaluation | Build | Domain logic specific to SPM rubric; this is our wedge | 5–8 dev-days | High |
| Teacher-review workflow | Build | Specific to centre-owner JTBD; integration moat | 8–12 dev-days | High |
| Analytics | Buy | PostHog free tier | Free | Low |
| Payments | Defer (post-MVP) | Pilot is RM 0; payment integration not Must-have | n/a | n/a |

The total Build effort should be 15–25 dev-days for a 4-week MVP. If your matrix exceeds 30 dev-days of Build, return to MoSCoW; you have classified too many features as “must.”

21.2.5 Day 3–4: foundation-model selection

For AI MVPs, the most consequential technical decision of the Week-3 sprint is the foundation-model choice. Five considerations:

Capability. The 2026 frontier models (Claude Sonnet 4.6, GPT-4.5/5 series, Gemini 2.5/3 series, DeepSeek V3+) differ on specific dimensions: code, reasoning, multilingual coverage, latency, function-calling reliability, vision-language. Test each on your specific task before committing. A 30-minute evaluation with 20 representative inputs from your customer-discovery corpus is usually sufficient; commit to whichever model produces the best output.
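
As a concrete sketch of that evaluation, the harness below sends each representative input to one candidate model and writes the outputs to a CSV for side-by-side human rating. It assumes the Anthropic Python SDK with an ANTHROPIC_API_KEY in the environment; the model id is a placeholder, so substitute each candidate you are comparing:

```python
import csv
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 20 representative inputs drawn from your customer-discovery corpus.
questions = [
    "Solve 2x^2 - 5x + 3 = 0, showing working in SPM answer format.",
    # ... 19 more
]

MODEL = "claude-sonnet-4-5"  # placeholder id; swap in each candidate in turn

rows = []
for q in questions:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system="You are a Form 5 Add Maths tutor. Explain step by step.",
        messages=[{"role": "user", "content": q}],
    )
    rows.append({"model": MODEL, "question": q, "answer": msg.content[0].text})

# Rate the CSV by hand (or with your pilot teachers) before committing.
with open(f"eval_{MODEL}.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["model", "question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
```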

Cost. Per-query costs vary by 10–100× across models. As of mid-2026, frontier reasoning models cost ~USD 3–15 per million input tokens and ~USD 15–75 per million output tokens; smaller / faster models cost 5–20× less. For high-volume B2C products, choose the smallest model that clears your quality bar; for low-volume B2B products, capability typically matters more than cost.
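
To sanity-check whether the per-query economics work at your expected volume, a back-of-envelope calculator is enough; the token counts and prices below are illustrative placeholders within the ranges quoted above:

```python
def per_query_cost(tokens_in: int, tokens_out: int,
                   usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Estimated API cost of a single query, in USD."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out

# e.g. a 500-token question producing a 1,500-token explanation at
# USD 3 / 15 per million input / output tokens (mid-range frontier pricing):
print(per_query_cost(500, 1_500, 3.0, 15.0))           # ~USD 0.024 per query
print(per_query_cost(500, 1_500, 3.0, 15.0) * 10_000)  # ~USD 240 at 10k queries/month
```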

Latency. End-user latency budget for B2C interactive products is typically 1–3 seconds; for B2B internal tools, 5–15 seconds is acceptable. Reasoning models have substantially higher latency (5–60+ seconds) and are appropriate only for tasks where the deeper reasoning justifies the wait.

Privacy and data residency. For regulated workloads (finance, healthcare, education-of-minors), data may not be allowed to cross jurisdictional boundaries. Anthropic, OpenAI, and Google all offer regional inference (US-only, EU-only) on enterprise tiers. For sovereign-data requirements, self-hosted open-weight models (Llama, Qwen, DeepSeek) are the only path.

Reliability and rate limits. Free tiers and student credits are subject to rate-limiting that breaks demos and pilots. Plan for a paid tier with reasonable rate limits before alpha launch (Week 6).

A practical 2026 default for KL/Melbourne student teams without specific constraints: Anthropic Claude Sonnet (good multilingual capability including BM and Mandarin; predictable pricing; strong tool-use; reliable enterprise tier) for the primary model, with DeepSeek V3.x as a cost-controlled fallback for high-volume non-sensitive workloads. Keep prompt design portable across both so switching is a matter of changing endpoints rather than rewriting application logic.
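
One way to keep the switch to a matter of changing endpoints is a thin provider-agnostic wrapper, so application code never imports a vendor SDK directly. A sketch under stated assumptions: the model ids and the DeepSeek base URL are placeholders to verify against current provider documentation, and it relies on DeepSeek exposing an OpenAI-compatible API (so the openai client can be pointed at it):

```python
import os
import anthropic
from openai import OpenAI

# Placeholder model ids and base URL; verify against current provider docs.
PROVIDERS = {
    "claude":   {"kind": "anthropic", "model": "claude-sonnet-4-5"},
    "deepseek": {"kind": "openai", "model": "deepseek-chat",
                 "base_url": "https://api.deepseek.com",
                 "key_env": "DEEPSEEK_API_KEY"},
}

def generate(prompt: str, provider: str = "claude", max_tokens: int = 1024) -> str:
    """Send the same prompt to whichever provider the config names."""
    cfg = PROVIDERS[provider]
    if cfg["kind"] == "anthropic":
        client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from the environment
        msg = client.messages.create(
            model=cfg["model"], max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    client = OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])
    resp = client.chat.completions.create(
        model=cfg["model"], max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Swapping providers is then generate(prompt, provider="deepseek") rather than a rewrite of application logic.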

21.2.6 Day 4–5: design the value proposition (left side of canvas)

Take the customer profile from Week 2 (right side of canvas) and design the matching value proposition. Method per §21.1.5:

  1. For each major pain in the customer profile, name a specific pain reliever in the MVP. If you cannot, the pain is going unaddressed — either elevate the pain reliever to a Must in MoSCoW or accept that this pain is out of MVP scope.
  2. For each gain, name a specific gain creator. Same discipline.
  3. Inventory the products and services. The list of distinct features.
  4. Cross-check for over-engineering. Pain relievers and gain creators that do not map to validated pains or gains are candidates to cut.

The output is the Strategyzer Value Proposition Canvas, both sides populated, screenshot-ready for the deliverable.
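
The over-engineering cross-check in step 4 reduces to a pair of set differences, which is worth automating once the canvas lives in a shared document. A minimal sketch with illustrative entries:

```python
# Relievers claimed in the MVP, keyed by the pain each is meant to address.
pain_relievers = {
    "teachers spend most time on routine re-explanation": "AI first-pass explanations",
    "existing tools not aligned with SPM rubric": "SPM-format checking layer",
    "students want gamified streaks": "gamification",  # no interview evidence
}

# Pains actually validated in the Week-2 customer profile.
validated_pains = {
    "teachers spend most time on routine re-explanation",
    "existing tools not aligned with SPM rubric",
    "tools don't handle BM/English code-switching",
}

unaddressed = validated_pains - pain_relievers.keys()      # pains with no reliever
over_engineered = pain_relievers.keys() - validated_pains  # relievers with no pain

print("Pains without relievers:", unaddressed or "none")
print("Over-engineering candidates:", set(over_engineered) or "none")
```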

21.2.7 Day 5: the MVP scoping document

The MVP scoping document is the central Week-3 deliverable. Nine sections:

MVP SCOPING DOCUMENT — [PROJECT]
Date: Friday, Week 3
Build window: Weeks 4–6 (3-week build, alpha launch Week 6)

1. HYPOTHESIS UNDER TEST
   The single sentence stating what the MVP exists to validate.
   Tied to the riskiest assumption from §21.2.1.

2. TARGET CUSTOMER (recap from Week 2)
   Primary segment, with demographic / firmographic specificity.
   Named early-adopter contacts (5+ from Week-2 corpus).

3. VALUE PROPOSITION
   Reference to Value Prop Canvas (attached). Three bullets:
   - Functional value: what the customer gets done.
   - Emotional value: how they feel using it.
   - Social value: how they appear to others.

4. MVP ARCHETYPE
   The chosen archetype from §21.1.3 with rationale.
   Hybrids labelled (e.g., "Concierge + Wizard-of-Oz, narrow-vertical").

5. FEATURE SPECIFICATION (MoSCoW)
   Must / Should / Could / Won't lists, with brief description per
   feature.

6. TECHNICAL STACK
   Build / buy / borrow matrix.
   Foundation model choice with rationale.
   Hosting, auth, database, analytics choices.

7. QUALITY BAR
   The minimum quality threshold for each Must-have feature,
   stated as a measurable criterion.
   Example: "BM explanations rated 4/5 or higher by 80%
            of pilot students on a 5-point clarity scale."

8. BUILD PLAN AND MILESTONES
   Week 4: foundations (auth, DB, basic UI shell, FM integration)
   Week 5: core feature build (the wedge)
   Week 6 Mon-Wed: integration testing, alpha onboarding prep
   Week 6 Thu: alpha launch with 10 friendly users.
   Named owner per task; estimated dev-days.

9. RISK REGISTER
   Top 5 risks with mitigation plans.
   Include technical (FM capability), market (segment WTP),
   operational (build velocity), legal (data, IP, regulatory).

The document is the artefact that the team and the unit instructor read on Friday night. It is also the document the team returns to in Week 6 when the alpha launches and the validated-learning question gets answered.

21.2.8 Friday submission

Submit the Week 3 deliverable bundle by 23:59 Friday. The team-comprehension penalty from §19.6.2 applies.

21.3 Lessons from the cases

Eight specific lessons from Parts I–III shape Week 3 MVP-design decisions.

21.3.1 JPMorgan COiN — narrow framing wins, again (Chapter 6)

We saw this lesson in Chapter 19 (idea selection). It applies again at MVP design. COiN’s MVP read commercial credit agreements only — not “any contract,” not “any document.” The narrowness made evaluation tractable: was the system’s clause extraction accurate enough that a lawyer’s review-and-correction time fell? The bank could answer that question with a small pilot and a structured comparison against the manual workflow.

Operational implication. The MVP’s evaluation criteria should be measurable on a small pilot. If your “quality bar” requires user feedback at scale to evaluate, your MVP scope is wrong; narrow it until evaluation is tractable on the alpha cohort.

21.3.2 Watson Health — broad-framed MVPs cannot be evaluated (Chapters 2, 7)

The Watson Health failure was partly an MVP-design failure: the system was scoped to recommend cancer treatment, which is not measurable on a 10-patient pilot. Whether a recommendation is “correct” depends on subsequent disease course, comorbidities, and patient choice over months or years. The evaluation problem made the validation cycle multi-year, by which time the product had drifted from the customer.

Operational implication. Choose MVPs whose validated-learning answer arrives in the build window of your project (10 weeks for student teams). Open-ended evaluation cycles (“did we cure cancer?”) cannot produce learning in time; closed-ended evaluation cycles (“did the lawyer’s review time fall by ≥40%?”) can.

21.3.3 Cursor — single-feature MVP done deeply (Chapter 5)

Cursor’s first version did one thing: in-context code completion in an IDE. It did not have agent mode, multi-file edit, or AI-pair-programming. The depth on the one feature was what won. The team could iterate that single feature against their own use (founder autoethnography from Chapter 19) before adding the surrounding capabilities.

Operational implication. A team in Week 3 should be able to identify the single feature that is the wedge. If three features are “all critical,” the team has not yet decided. Force the decision; the wedge is the one that, if it works alone, justifies the rest of the build.

21.3.4 Glean — narrow-vertical MVP (Chapter 5)

Glean’s first version was enterprise search for software-engineering teams at tech companies — not “enterprise search” in general. The narrow-vertical scope let the team build the workflow integration (Slack, Jira, GitHub, Confluence, Google Drive) deeply for one user type before generalising. By the time Glean opened to non-engineering teams, the workflow patterns were proven.

Operational implication. When choosing between archetype 3 (single-feature) and archetype 4 (narrow-vertical), ask: does the value depend more on capability depth (single-feature) or on workflow integration depth (narrow-vertical)? Most B2B AI products are narrow-vertical; most B2C AI products are single-feature.

21.3.5 Stitch Fix — Concierge MVP at scale (Chapter 8, forthcoming)

Stitch Fix’s first ~12 months were a Concierge MVP: founder Katrina Lake hand-picked clothes for friends and shipped them. The point was not to build a recommendation engine; it was to learn whether customers would pay for hand-curated style boxes at all. Once that hypothesis was validated, the technology layer (the recommendation engine, the inventory model, the style profiles) was built against a known business case.

Operational implication. When your riskiest assumption is will customers pay for this kind of service?, do not build technology yet. Run the service manually. The technology comes later, against the validated demand.

21.3.6 Anthropic Claude — proof-of-concept MVP with progressive disclosure (Chapter 13)

Anthropic’s initial Claude release was a proof-of-concept MVP for constitutional-AI methodology (Bai et al., 2022): could a foundation model be aligned by self-reference rather than purely by RLHF (Christiano et al., 2017; Ouyang et al., 2022)? The team prioritised proving the technical thesis before broadening the product surface. Public access progressed from research preview → limited beta (specific trusted partners) → general availability over roughly 18 months.

Operational implication. A proof-of-concept MVP for technical risk should run behind a wall until the technical thesis is validated. Public release of a half-validated technical thesis produces brand damage and incorrect feedback (users reacting to capability limitations rather than to the thesis being tested). For student teams, “behind a wall” means alpha within the team and a small trusted-friends cohort, not public launch.

21.3.7 Klarna — over-scoped MVP that skipped validation (Chapter 8, forthcoming)

The Klarna AI customer-service deployment in February 2024 was effectively a full launch presented as a deployment milestone. The validation step (a 5–10% rollout to a defined cohort, with measurement against the existing-agent baseline) was skipped. The reversal came when full-launch data revealed customer-experience deterioration that an alpha would have caught.

Operational implication. Even a well-resourced firm benefits from staged rollout. For student teams, the rule is absolute: you do not “launch” the MVP. You alpha it (Week 6) to a small friendly cohort, learn, then beta (Week 7) to a wider but still-bounded cohort. Public launch comes only after the validation question is answered.

21.3.8 The DBS GANDALF transformation — quality bar from inception (Chapters 4, 6)

DBS’s banking-AI deployments specified the quality bar before the build started. The credit-card origination workflow target was 21 → 4 days; not “faster,” but a specific number. The personalised nudges programme target was an EBIT impact in the SGD-tens-of-millions range; not “more revenue,” but a specific magnitude. The discipline made evaluation tractable.

Operational implication. Section 7 of the MVP scoping document — the quality bar — must be measurable. “Better than existing alternatives” is not measurable. “BM explanations rated 4/5 or higher by 80% of students on a 5-point clarity scale” is. Force the specificity; without it, the alpha launch in Week 6 produces ambiguous results.

21.4 Tools and templates

21.4.1 Riskiest assumption taxonomy

A worksheet template for the §21.2.1 exercise:

RISKIEST ASSUMPTION INVENTORY

For each assumption, score 1–5 on:
  Probability: how likely is this to be wrong? (1=very unlikely, 5=highly uncertain)
  Impact: if wrong, how badly does the plan break? (1=minor adjustment, 5=plan collapses)

Risk score = Probability × Impact.

| # | Assumption (one sentence) | Class | Prob | Impact | Risk |
|---|---|---|---|---|---|
| 1 | [statement] | Desirability / Viability / Feasibility / Ethical |  |  |  |
| ... |   |   |   |   |   |

Top 3 highest-risk assumptions (the targets of the MVP):
1. [...]
2. [...]
3. [...]

21.4.2 MVP archetype selection rubric

ARCHETYPE SELECTION

Riskiest assumption (from §21.4.1): [...]

Match to archetype:
- Adoption / workflow desirability → Wizard of Oz
- Pricing / business-model viability → Concierge
- Single-capability sufficiency → Single-feature
- Segment-end-to-end value → Narrow-vertical
- Hardest technical risk → Proof-of-concept

Chosen archetype: [...]
Rationale (2–3 sentences): [...]

Hybrid? (Y/N): [...]
If yes, primary frame and secondary frame: [...]

21.4.3 MoSCoW worksheet

FEATURE PRIORITISATION (MoSCoW)

For each candidate feature:
  - Must: required to test the riskiest assumption
  - Should: improves MVP but not strictly necessary
  - Could: nice to have, clearly out of MVP scope
  - Won't (this version): explicitly out of scope

Diagnostic question for "must": if we removed this feature, what
  learning would we lose? If the answer is "nothing tied to our
  riskiest-assumption test," the feature is not a must.

| Feature | Bucket | Justification |
|---|---|---|
| [feature] | Must / Should / Could / Won't | [tied to riskiest assumption?] |

Total Must features: ___ (target: 5–8)
Total Should features: ___ (target: 3–5)

21.4.4 Build / buy / borrow decision matrix

BUILD / BUY / BORROW DECISION MATRIX

| Feature | B/B/B | Tool / API | Cost (~) | Build effort (dev-days) | Risk |
|---|---|---|---|---|---|
| Authentication | Buy | Clerk | Free tier | 0 | Low |
| Hosting | Buy | Vercel + Supabase | Free tier | 0 | Low |
| Foundation model | Buy | [Claude / GPT / DeepSeek] | $X/query | 0 | Medium |
| Domain logic (the wedge) | Build |  |  | [days] | High |
| ... |  |  |  |  |  |

Total Build effort: ___ dev-days
Build effort budget: 4 weeks × team's effective dev-days/week ≈ 60–80 dev-days
                    minus ~30% for testing, integration, documentation
                    ≈ 45–55 dev-days available.

21.4.5 Foundation model selection guide (2026 stack)

The current default options for student-team AI MVPs in 2026, with characteristic strengths:

| Model family | Strengths | Use cases | Cost (per 1M output tokens, approx.) |
|---|---|---|---|
| Claude Sonnet 4.x / 5.x (Anthropic) | Multilingual including BM/Mandarin, function-calling, long context, agent reliability | Default choice for B2B and education; Malaysian/SE Asian content | USD 15 |
| Claude Haiku 4.x | Fast, cheap, capable on standard tasks | High-volume light tasks, B2C chat | USD 4 |
| Claude Opus 4.x | Frontier reasoning, complex multi-step | Hardest reasoning tasks; coding agents | USD 75 |
| GPT-4.x / GPT-5 (OpenAI) | Strong general capability, broad ecosystem | Standard default; broad community support | USD 10–60 |
| GPT-4o-mini | Cheap, fast, multimodal | High-volume multimodal tasks | USD 0.6 |
| Gemini 2.5 Pro / 3 Pro (Google) | Long context (1–2M tokens), multimodal | Document analysis, video tasks | USD 5–15 |
| DeepSeek V3.x / R1.x (open-weight) | Strong reasoning at low cost; MIT-licensed | Cost-sensitive workloads, sovereign deployments | USD 1–3 (hosted) or self-host |
| Llama 4.x (open-weight, Meta) | Strong general capability; widely supported | Self-hosted, fine-tuned, customisable | Self-host |
| Qwen 2.5+ / 3 (open-weight, Alibaba) | Strong on Chinese/multilingual; Apache-licensed | Multilingual; APAC contexts | Self-host |
| Mistral Large 2 / 3 (open-weight) | Solid European-trained alternative | EU-data-sovereignty contexts | Self-host or hosted |

For low-code teams, the practical recommendation: default to Claude Sonnet via Anthropic API for B2B and education products; switch to Haiku for high-volume customer-facing chat where cost matters; consider DeepSeek V3.x as a cost-controlled secondary for non-sensitive workloads. Keep prompts portable so model swaps are a configuration change, not a rewrite.

For student teams with R or Python comfort, the open-weight options (Llama, Qwen, DeepSeek) become viable through inference-as-a-service providers (Together AI, Fireworks, Groq, Modal) at lower cost than the closed APIs. Self-hosting on owned hardware is generally not worth the operational overhead for a 10-week build.

21.4.6 Quality bar specification template

The quality bar is the most-skipped section of the scoping document. Force specificity:

QUALITY BAR — [FEATURE]

Measurement method:
  [How will quality be measured? User survey? Internal eval set?
   Comparison with existing baseline?]

Target threshold:
  [The specific level the MVP must reach to pass validation.]

Sample size:
  [How many users / queries / observations are required for the
   measurement to be statistically meaningful?]

Pass / fail decision:
  [What happens if the threshold is met? What happens if it's
   missed? Pivot? Extend? Abandon?]

Example (Team Aroma):
  Feature: AI explanation quality for SPM Add Maths questions.
  Measurement: 5-point teacher-rated clarity score.
  Target: ≥80% of explanations rated 4 or 5 by reviewing teachers.
  Sample size: 100 explanations across 5 SPM topic areas, rated by 3 teachers.
  Pass: continue to beta (Week 7) at chosen quality threshold.
  Miss (60–79%): extend build by one week with prompt-engineering iteration.
  Miss (<60%): pivot to teacher-augmented mode (AI-drafted, teacher-finalised).
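
The tiered pass/extend/pivot rule above is mechanical enough to encode, and doing so surfaces the sample-size point: with 100 ratings, the 95% margin of error on an observed proportion is roughly ±8 percentage points, so thresholds spaced more tightly than that cannot be distinguished. A sketch with the tier boundaries taken from the example:

```python
def quality_gate(ratings: list[int], pass_bar: float = 0.80,
                 extend_bar: float = 0.60) -> str:
    """Tiered pass/extend/pivot decision on 5-point teacher clarity ratings."""
    n = len(ratings)
    clear = sum(1 for r in ratings if r >= 4) / n  # rated 4 or 5 counts as clear
    # Rough 95% margin of error (normal approximation): about +/-0.08 at n=100,
    # which is why the template forces a sample-size declaration up front.
    moe = 1.96 * (clear * (1 - clear) / n) ** 0.5
    if clear >= pass_bar:
        verdict = "pass: continue to beta (Week 7)"
    elif clear >= extend_bar:
        verdict = "miss (60-79%): extend build one week; iterate on prompts"
    else:
        verdict = "miss (<60%): pivot to teacher-augmented mode"
    return f"{clear:.0%} clear (+/-{moe:.0%}) -> {verdict}"
```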

21.4.7 Value proposition canvas (left side) template

VALUE PROPOSITION — [SEGMENT]

Customer profile (from Week 2):
  Pains: [list, priority order]
  Gains: [list, priority order]
  Jobs: [list]

Value proposition:

Products and services:
  1. [feature / service element]
  2. [...]

Pain relievers (one per major pain):
  Pain → Pain reliever
  - [Pain 1] → [specific feature that addresses this]
  - [Pain 2] → [...]

Gain creators (one per major gain):
  Gain → Gain creator
  - [Gain 1] → [specific feature that creates this]
  - [Gain 2] → [...]

Fit check:
  Pains without pain relievers: [list any]
  Gains without gain creators: [list any]
  Pain relievers / gain creators without corresponding pain or gain: [list any]

21.4.8 The MVP scoping document master template

The complete nine-section template from §21.2.7 is the master deliverable. Reproduce it as a separate Notion/Google Doc with team-editable fields, populated through the week as the analyses (§21.2.1 through §21.2.6) complete.

21.5 Worked example — Team Aroma’s Week 3

Team Aroma continues with the centre-as-customer pivot from Week 2. The primary customer is Klang-Valley tutoring centres (10–200 students); the user is the centre’s teachers; the student is the end consumer of the product. Three actors, with the centre owner being the buyer.

Day 1: riskiest assumptions

The team brainstorms 28 raw assumptions and consolidates to 11. The risk-scored list:

| # | Assumption | Class | Prob | Impact | Risk |
|---|---|---|---|---|---|
| 1 | Centre owners will pay RM 30/student/month for a teacher productivity tool | Viability | 3 | 5 | 15 |
| 2 | An AI tutor can produce SPM-aligned BM/English explanations at clarity acceptable to centre teachers | Feasibility | 4 | 5 | 20 |
| 3 | Teachers will adopt the tool day-to-day rather than treat it as imposed by the owner | Desirability | 4 | 4 | 16 |
| 4 | Students will engage with the tool when assigned by the teacher | Desirability | 3 | 3 | 9 |
| 5 | The teacher-review workflow takes <5 min per student per week | Feasibility | 3 | 4 | 12 |
| 6 | Centre owners can be reached through Aliyah’s network + cold outreach | Distribution | 2 | 3 | 6 |
| 7 | Anthropic API quota / rate limits will not break the pilot | Operational | 1 | 3 | 3 |
| 8 | Data-protection requirements (PDPA) can be met within the 8-week build | Regulatory | 2 | 3 | 6 |
| 9 | Centre teachers will not feel threatened by AI replacement | Cultural | 3 | 3 | 9 |
| 10 | The pilot’s 3-month free period will convert to paid at ≥30% rate | Viability | 4 | 4 | 16 |
| 11 | Cross-campus team can ship a working alpha by Week 6 | Operational | 3 | 4 | 12 |

The top 3 highest-risk assumptions are: #2 (FM capability for SPM-aligned BM/English explanations), #3 (teacher day-to-day adoption), and #10 (pilot-to-paid conversion). The MVP must test these.

The team notes that #1 was scored lower than #2 and #3 because the Week-2 interviews provided strong directional evidence on viability (T2 explicitly offered to pilot at RM 30/student/month). Desirability (whether teachers actually use it) and feasibility (whether the AI is good enough) carry the residual risk.

Day 2: archetype choice

The riskiest-assumption pattern points to a hybrid:

  • For #2 (capability): a proof-of-concept is needed — can Claude Sonnet produce SPM-quality explanations?
  • For #3 (teacher adoption): a Wizard-of-Oz / Concierge hybrid is appropriate — show the teachers the workflow even if the AI part is partly human-supervised.

The team chooses a Concierge + Wizard-of-Oz hybrid with narrow-vertical scope: human-supervised AI explanations for one centre’s Form 5 maths students for 4 weeks. The AI proposes answers; the teacher reviews and corrects before sending to students. The teacher’s workflow is real (this is what they would do in the eventual product); the back-end is partly Wizard-of-Oz (the team is in the loop on flagged cases).

This archetype tests:

  • The AI capability hypothesis (#2): we can measure the proportion of explanations the teacher accepts unchanged, the proportion they correct, and the proportion they reject outright.
  • The teacher-adoption hypothesis (#3): we can measure whether teachers continue to use the tool over the 4 weeks, and what their qualitative feedback is.
  • A first-cut on conversion (#10): we can ask the centre owner whether they would pay RM 30/student/month for the validated workflow.

Day 3: MoSCoW

The team’s feature MoSCoW:

Must (8 features):

  1. Teacher dashboard — login, student list, today’s assignments
  2. Student-facing question delivery — SPM Add Maths Form 5 questions, one at a time
  3. AI-generated explanations in BM and English (toggle)
  4. Teacher review/correction interface — see AI answer, accept / edit / reject
  5. SPM-rubric-aligned answer format checking
  6. Per-student progress tracking — accuracy by topic
  7. Basic auth (Clerk) — teacher login, student login
  8. Activity log — for analytics on workflow time

Should (4 features):

  1. Mandarin language support (Wei Hao argued for Must; the team agreed to defer it to post-MVP)
  2. Integrated chat — student asks follow-up questions
  3. Parent visibility — weekly progress report
  4. Topic-level personalised practice generation

Could (5 features):

  1. Mobile app (vs web responsive)
  2. Offline mode
  3. Gamification / streaks
  4. Tutor-tutor messaging
  5. Centre owner billing dashboard

Won’t (this version, 6 features):

  1. Other subjects (Bahasa, English Lit, Sejarah, etc.)
  2. SPM exam simulation
  3. Past-year-paper integration
  4. Multiple-choice question generation
  5. Audio explanations
  6. Direct-to-parent purchasing

The Must list has 8 features against roughly 30 dev-days of available build capacity (see the build/buy/borrow matrix below) — at the upper edge of feasibility but achievable.

Day 3: build / buy / borrow

The build/buy/borrow matrix:

| Feature | B/B/B | Tool | Cost | Build effort | Risk |
|---|---|---|---|---|---|
| Auth | Buy | Clerk | Free | 0.5 days (setup) | Low |
| Hosting (frontend) | Buy | Vercel | Free | 0.5 days | Low |
| Hosting (backend / DB) | Buy | Supabase | Free | 1 day | Low |
| App shell / UI | Borrow | Lovable + Tailwind | Free | 3 days (initial UI) | Low |
| FM inference | Buy | Anthropic Claude Sonnet API | ~USD 100/month at pilot scale | 0 (API integration ~1 day) | Medium |
| AI explanation engine | Build | Custom prompt-engineering + RAG | n/a | 8 days | High |
| Teacher review interface | Build | Custom React/Next.js | n/a | 6 days | Medium |
| SPM-rubric format check | Build | Hybrid LLM + rule-based | n/a | 4 days | High |
| Student-facing question UX | Build | Custom React/Next.js | n/a | 5 days | Medium |
| Progress tracking | Build | Custom DB schema + UI | n/a | 3 days | Low |
| Activity log / analytics | Buy | PostHog | Free tier | 0.5 days (setup) | Low |

Total Build effort: ~30 dev-days
Total cost (4-week pilot): ~USD 100

30 dev-days fits within the 4-week budget for a 5-person team (with effective ~3 dev-days/person/week, totalling ~60 days minus ~50% for testing, integration, and the cross-campus coordination overhead).

Day 3: foundation-model selection

The team evaluates three candidate models on 20 representative SPM Add Maths questions (sourced from public past-year papers):

| Model | BM clarity | English clarity | SPM-format accuracy | Cost / explanation | Latency |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 4.5/5 | 4.7/5 | 4.0/5 | ~USD 0.04 | 3.2 s |
| GPT-4o | 3.8/5 | 4.6/5 | 3.5/5 | ~USD 0.05 | 2.8 s |
| DeepSeek V3.2 | 3.0/5 | 4.2/5 | 2.8/5 | ~USD 0.005 | 4.1 s |

Claude Sonnet wins on the SPM-format and BM-clarity dimensions, which are the binding constraints from the Week-2 customer-discovery findings. The team commits to Claude Sonnet as primary, with the prompt portable enough that DeepSeek V3.2 can be evaluated as a cost-controlled fallback in Week 5–6 if needed.

Day 4: value proposition canvas

The team produces the full canvas. Key entries on the value-proposition side, mapping back to the Week-2 customer profile:

Pain relievers (centre-owner segment):

  • “Teachers spend 60–70% of time on routine grading and re-explanation” → AI generates first-pass explanations and routine practice, teachers review/correct in seconds rather than draft from scratch
  • “Existing tools are not aligned with the SPM rubric” → SPM-format-checking layer enforces marking-scheme conventions (units, working shown, answer-form rules)
  • “Tools don’t handle BM/English code-switching” → Bilingual explanations with toggle; matches Malaysian classroom practice
  • “Teacher-time scaling is the binding constraint on centre growth” → Teacher-time-per-student halves; centre can serve ~50% more students without hiring

Gain creators:

  • “Increase teacher retention by reducing burnout” → Teachers handle 30% more students with same hours; redirect saved time to harder teaching
  • “Improve student outcomes through 24/7 practice availability” → Students can practice anytime; teacher reviews queue at their convenience
  • “Differentiate from competing centres” → Centre marketing position: “AI-augmented teaching” as a feature parents value

Pain relievers without corresponding pains (over-engineering check):

The team identifies that the “gamification” feature in Could has no corresponding validated pain or gain — it appeared in initial design discussions but was not in any interview. They confirm it stays in Could (i.e., out of MVP scope) and document the absence of evidence as the rationale.

Day 5: quality-bar specification

The team specifies measurable thresholds for each Must-have feature:

| Feature | Measurement | Target | Sample size |
|---|---|---|---|
| AI explanation BM clarity | Teacher rating, 5-point scale | ≥80% rated 4 or 5 | 100 explanations across 5 topics, 3 teachers |
| AI explanation English clarity | Teacher rating, 5-point scale | ≥80% rated 4 or 5 | 100 explanations |
| SPM-format accuracy | Format-check pass rate | ≥90% on first AI generation | 200 questions |
| Teacher review time | Average per-student-per-week | <10 min/student/week | All pilot students for 4 weeks |
| Teacher acceptance rate | % of AI explanations sent unchanged | ≥60% | All AI generations during pilot |
| Student weekly engagement | % of pilot students active ≥3 days/week | ≥70% | All 50–80 pilot students |
| Centre-owner Week-4 NPS | NPS score | ≥30 (passable for 4-week pilot) | 1 centre owner (T2 confirmed) |

The thresholds are stretch targets but evidence-grounded: the BM-clarity bar comes from Aliyah’s own assessment of her tutoring quality; the teacher-review-time bar comes from T2’s stated acceptance threshold (“if it doesn’t save time it’s not worth using”); the SPM-format-accuracy bar comes from Aliyah’s experience marking SPM trial papers.

Day 5 evening: scoping document

The team writes the nine-section MVP scoping document, populated from the prior days’ work. Section 1 (hypothesis under test) reads:

“A bilingual AI tutor with SPM-rubric-aligned answer formatting, deployed via teacher-supervised review at one Klang-Valley tutoring centre, will (a) produce explanations clear enough that ≥80% are rated 4 or 5 by reviewing teachers and ≥60% pass review unchanged, (b) reduce teacher per-student-per-week time by ≥40%, and (c) generate sufficient centre-owner satisfaction (NPS ≥30) to justify a paid pilot conversion.”

The remaining sections follow the §21.2.7 template; the document totals 7 pages including the value proposition canvas, MoSCoW, and risk register. Sara presents it at the Friday evening meeting; the team makes two amendments (tightening the BM-clarity threshold from 75% to 80%, and explicitly excluding Mandarin from the MVP) before submitting.

What Team Aroma got right and what they almost got wrong

Three things they did well: (1) the riskiest-assumption analysis correctly identified capability and adoption risks rather than viability risk (which the Week-2 evidence had already partly addressed); (2) the archetype choice (Concierge + Wizard-of-Oz hybrid) matched the actual riskiest-assumption pattern rather than picking a single archetype out of habit; (3) the foundation-model evaluation was empirical (20 representative test questions on three models) rather than reputation-based.

Three things they almost got wrong: Wei Hao initially wanted to push Mandarin into Must (the “Should” placement reflects a team negotiation); the team almost set the SPM-format-accuracy threshold at 75% (which Aliyah pointed out was below the threshold a Klang-Valley tutoring centre would tolerate); the team almost specified “teachers prefer AI to manual” as a quality-bar criterion (which is unmeasurable in a 4-week pilot — they replaced it with the more-measurable acceptance-rate criterion).

The pattern is general. Week 3 is high-leverage because the discipline of writing measurable quality bars forces the team to confront whether their MVP can actually answer the validated-learning question; without that discipline, the alpha launch in Week 6 produces evidence the team cannot interpret.

21.6 Course exercises and Week 3 deliverable

Submit the Week 3 deliverable bundle as a shared folder by Friday 23:59. Required artefacts:

21.6.1 Required artefacts

  1. Riskiest assumption inventory (§21.4.1). 8–15 assumptions scored on probability × impact, with top 3 identified.
  2. Archetype selection memo (§21.4.2). One-paragraph rationale for the chosen archetype, with hybrid framing if applicable.
  3. MoSCoW feature specification (§21.4.3). Must / Should / Could / Won’t lists with brief justifications.
  4. Build / buy / borrow matrix (§21.4.4). Decision per Must-have feature, with cost and effort estimates.
  5. Foundation-model selection memo (§21.4.5). Empirical evaluation on representative inputs from Week-2 customer-discovery corpus, with chosen model and rationale.
  6. Value proposition canvas (§21.4.7). Both sides populated; pain reliever per pain, gain creator per gain.
  7. Quality bar specification (§21.4.6). Measurable threshold per Must-have feature with sample size and pass/fail decision rules.
  8. MVP scoping document (§21.4.8). The nine-section master document, populated.

21.6.2 Grading rubric (50 points)

| Component | Points | Distinction-level criteria |
|---|---|---|
| Riskiest assumption clarity | 5 | Top 3 assumptions are evidently the riskiest given Week-2 evidence; rationale explicit |
| Archetype selection rigour | 5 | Archetype matches riskiest-assumption pattern; hybrid framing used if appropriate |
| MoSCoW discipline | 10 | Must list ≤8 features; each Must justified by tied-to-riskiest-assumption logic; Won’t list explicit |
| Build / buy / borrow rationale | 5 | Build effort ≤30 dev-days; buy/borrow choices defended; the wedge clearly identified |
| Foundation-model selection | 5 | Empirical evaluation on representative inputs; rationale ties to capability and cost |
| Value proposition fit | 5 | Each major pain has a pain reliever; over-engineering identified and removed |
| Quality bar specificity | 5 | Each Must has a measurable threshold, sample size, and pass/fail decision |
| MVP scoping document quality | 10 | All 9 sections populated with specific content; risk register includes mitigations |

Pass: 30. Credit: 36. Distinction: 42. High Distinction: 47.

The team-comprehension penalty from §19.6.2 applies.

21.6.3 Things to do before Monday of Week 4

By Sunday evening of Week 3, in addition to the deliverable submission:

  • Set up the development environment for Monday’s build start: Vercel account, Supabase project, Clerk app, Anthropic API key (with paid-tier quota), GitHub repo, shared development conventions.
  • Confirm the alpha-cohort booking from Week 2 (5–8 interviewees pre-committed to alpha testing in Week 6).
  • Read Chapter 3 (the AI factory) and §22.1–§22.3 of Chapter 22 (the 2026 low-code stack) before Monday of Week 4. The Chapter 3 reading establishes the architectural pattern; the Chapter 22 reading goes into the specific tooling choices for the build.

References for this chapter

MVP and lean methodology

  • Ries, E. (2011). The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.
  • Maurya, A. (2012). Running Lean: Iterate from Plan A to a Plan That Works. O’Reilly.
  • Cagan, M. (2017). Inspired: How to Create Tech Products Customers Love. (2nd ed.) Wiley.
  • Cagan, M. and Jones, C. (2020). Empowered: Ordinary People, Extraordinary Products. Wiley.

Wizard of Oz, Concierge, and MVP archetypes

  • Kelley, J. F. (1984). An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems 2(1): 26–41. (The canonical reference on Wizard of Oz prototyping.)
  • Maurya, A. (2010). The Concierge MVP. LeanStack Blog.

Value proposition and product design

  • Osterwalder, A., Pigneur, Y., Bernarda, G., and Smith, A. (2014). Value Proposition Design: How to Create Products and Services Customers Want. Wiley.
  • Strategyzer AG (2024). The Value Proposition Canvas — official template and guide. strategyzer.com.

Prioritisation methods

  • Clegg, D. and Barker, R. (1994). Case Method Fast-Track: A RAD Approach. Addison-Wesley. (Origin of the MoSCoW method.)
  • Patton, J. (2014). User Story Mapping: Discover the Whole Story, Build the Right Product. O’Reilly.

Foundation models and AI infrastructure

  • Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
  • Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. NeurIPS.
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  • Stanford HAI (2025). AI Index Report 2025.
  • DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
  • DeepSeek-AI (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
  • Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.

Cases referenced in §21.3

  • Iansiti, M. and Lakhani, K. R. (2020). Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World. Harvard Business Review Press. (Watson Health, Stitch Fix, JPMorgan COiN.)
  • Lamarre, E., Smaje, K., and Zemmel, R. (2023). Rewired: The McKinsey Guide to Outcompeting in the Age of Digital and AI. Wiley.
  • Klarna AB (2024, 2025). Press releases and CEO interviews on the AI customer service deployment and reversal.

Build/buy/borrow and 2026 low-code stack

  • Lovable AI (2024–2026). Lovable platform documentation.
  • Anthropic (2024–2026). Claude API documentation. docs.anthropic.com.
  • Supabase (2024–2026). Supabase platform documentation.
  • Vercel (2024–2026). Vercel platform documentation.

Further reading

For MVP design philosophy generally, Ries’s Lean Startup remains the canonical reference; Maurya’s Running Lean is the more practitioner-oriented complement. For product management at AI startups specifically, Cagan’s Inspired and Empowered are the field’s standard texts; the Marty Cagan / Silicon Valley Product Group blog is updated regularly with case material.

For the technical-stack literature, the most-current public sources are the foundation-model providers’ own documentation (Anthropic, OpenAI, Google), the open-source community channels (Hugging Face, the LangChain blog, the LlamaIndex blog), and the practitioner-blog ecosystem (Vercel’s blog, Supabase’s blog, the Render blog, the Latent Space podcast and newsletter).

For the 2026 evolution of low-code AI development specifically, the AI Engineer Summit (annual San Francisco event, with proceedings online), the Anthropic developer documentation, and the OpenAI cookbook are the primary references. Latent Space (Shawn Wang’s newsletter and podcast) provides regular practitioner-focused commentary.

For evaluation methodology — covered in detail in Chapter 23 — Hugging Face’s Evaluate library documentation, the OpenAI Evals project, and the Anthropic evaluation framework provide the technical substrate for build-measure-learn.

Read Chapter 3 (the AI factory) and §22.1–§22.3 of Chapter 22 (the 2026 low-code stack) before Monday of Week 4.