Chapter 7 — Healthcare and pharmaceuticals
The application of artificial intelligence to medicine is the field’s longest-running ambition and its most cautionary record. Every era’s most-public AI programme — Stanford’s MYCIN in the 1970s, INTERNIST-1 and CADUCEUS in the 1980s, IBM’s Watson Health from 2011, the deep-learning imaging wave from 2015, the foundation-model wave from 2022 — has been launched with claims that the technology was at the threshold of transforming clinical practice. Each era’s deployment realities have fallen substantially short of those claims. The pattern is not an accident. Medicine combines high stakes (lives and litigation), heterogeneous data (and the resulting gap between trial conditions and real-world deployment), regulatory rigour, professional gatekeeping, and reimbursement structures that together create a deployment friction unique to healthcare. The lesson — repeatedly relearned — is that clinically-useful AI is harder than benchmark-impressive AI by an order of magnitude.
Yet the period since 2020 has produced changes that are different in kind, not just degree, from prior eras. AlphaFold 2 (2020) and AlphaFold 3 (2024) collapsed the protein-structure-prediction problem that had occupied structural biology for fifty years. Foundation-model performance on medical reasoning benchmarks (USMLE, MedQA, NEJM cases) crossed expert-physician levels by 2023 (Singhal et al., 2023; Nori et al., 2023). Ambient clinical scribes — once a research curiosity — became standard-of-care at major US health systems by 2025. AI-discovered molecules entered Phase III trials. The question is no longer whether AI changes medicine but how, where, and on what timeline.
This chapter develops the question across thirteen sections. Section 7.1 traces the historical arc, with attention to why each era over-promised. Section 7.2 covers medical imaging — the most-mature deployment domain. Section 7.3 develops the Watson Health post-mortem in detail; this is the cautionary case the playbook chapters reference repeatedly. Section 7.4 covers drug discovery and the AlphaFold revolution. Section 7.5 covers ambient clinical scribes, the 2024–2026 deployment frontier. Section 7.6 covers LLM-based clinical applications. Section 7.7 maps the regulatory landscape. Section 7.8 covers hospital-scale deployment, with attention to integration realities. Section 7.9 addresses mental-health and behavioural AI, where the clinical evidence is most-contested. Section 7.10 develops failures and ethical issues including the Obermeyer et al. (2019) bias finding. Section 7.11 covers Southeast Asian and Australian regional context. Section 7.12 sketches the 2026 frontier, particularly clinical agents. Section 7.13 connects to the broader statistical methodology that distinguishes credible from spurious clinical AI claims.
7.1 The historical arc — from MYCIN to Watson to AlphaFold
The earliest serious medical AI programme was MYCIN, developed at Stanford by Edward Shortliffe and colleagues from 1972 (Buchanan and Shortliffe, 1984). MYCIN was a rule-based expert system for diagnosing bacterial infections and recommending antibiotic therapy. In a 1979 evaluation against expert clinicians, MYCIN’s recommendations were judged equivalent to or better than infectious-disease specialists’ on a defined test set (Yu et al., 1979). It was never deployed clinically. The reasons were instructive: integration into clinical workflow was unaddressed; the legal and liability structure for an autonomous diagnostic system did not exist; the burden of manually entering MYCIN’s input variables was high; and physicians did not trust a black-box recommendation, even an accurate one. Every reason that prevented MYCIN’s deployment in 1979 has reappeared in subsequent eras.
The 1980s expert-systems era produced INTERNIST-1 (Miller et al., 1982) at Pittsburgh, an internal-medicine diagnostic system, and CADUCEUS, its successor. By 1990 the broader expert-systems market in healthcare had collapsed. The technology was not the problem; the deployment context was. Building rule bases for the long tail of medical knowledge proved impossibly labour-intensive; rules generated for 1980 were stale by 1985; and the systems could not handle the clinical reasoning that physicians actually do, which is closer to pattern recognition under uncertainty than to deductive rule-application.
The 1990s and 2000s brought clinical decision support (CDS) systems integrated with electronic health records (EHRs). Epic’s BestPractice Advisories, Cerner’s similar functionality, and Wolters Kluwer’s UpToDate became standard. CDS was useful — drug-drug interaction alerts, allergy warnings, dose-range checks — but it was not AI in the sense the field had originally meant. CDS is structured queries against structured data; the underlying methods are database lookup and rule evaluation, not learning systems. The era produced workflow integration but not the diagnostic intelligence that MYCIN and INTERNIST had aspired to.
The 2010s deep-learning wave is when medical AI in the modern sense begins. Esteva et al. (2017) demonstrated that a CNN trained on 130,000 dermatology images matched dermatologists on skin-cancer classification. Gulshan et al. (2016) showed that diabetic retinopathy could be detected from retinal photographs by a CNN at expert ophthalmologist level. Rajpurkar et al. (2017) applied the same approach to pneumonia detection from chest X-rays. The papers were a watershed: AI was no longer aspirational; specific diagnostic tasks had been mastered. By 2018 the FDA had cleared the first autonomous diagnostic AI (IDx-DR for diabetic retinopathy); by 2024 the cumulative count of FDA-cleared AI/ML-enabled medical devices exceeded 800 (FDA, 2024 update of the AI/ML-Enabled Medical Devices list).
In parallel, IBM’s Watson Health (2011–2022) attempted the broader vision — a general-purpose medical AI that would advise clinicians across diagnostics, treatment selection, and care management. Section 7.3 develops the Watson Health story in detail. The short version: Watson Health is the clearest cautionary case in modern medical AI, and the lessons from its failure are operational rather than technological.
The 2020 AlphaFold breakthroughs and the 2022–2026 foundation-model wave constitute the current era. The character of the current era differs from prior ones in three ways. First, the underlying capability is substantially stronger; foundation models have demonstrated competence on a much wider range of medical tasks than prior generations did. Second, the deployment infrastructure is now industrial; cloud computing, EHR APIs, and data-pipeline tooling at hospitals have matured to the point that AI integration is technically tractable in ways it was not for MYCIN or INTERNIST. Third, the regulatory pathway has been clarified, with the FDA’s AI/ML SaMD action plan (FDA, 2021, updated 2023) establishing a path for continuously-learning systems that the agency had previously not addressed. The combination produces an environment where deployment is more plausible than at any prior moment — but the historical record cautions that “more plausible” is not “easy.”
7.2 Medical imaging — the canonical deployment success
Medical imaging is the one domain where AI has achieved durable clinical deployment at scale. The reasons are structural: imaging produces digital data in a standardised format (DICOM); the diagnostic task is often discrete (binary or small-multiclass classification, segmentation, or measurement); the radiologist workflow is amenable to AI augmentation rather than replacement; and the regulatory pathway (510(k) clearance via predicate-device claims) is well-trodden. By 2024, imaging accounted for the majority of FDA-cleared AI/ML-enabled medical devices — radiology alone represented over 75% of the cumulative count.
Diabetic retinopathy is the field’s deepest-deployed application. Diabetic retinopathy (DR) is the leading cause of preventable blindness globally; routine screening of diabetic patients is recommended but under-delivered, particularly in low- and middle-income contexts. Gulshan et al. (2016) demonstrated CNN-based DR screening at sensitivity 90.3% and specificity 98.1% against a board-certified-ophthalmologist consensus; subsequent work (Sayres et al., 2019) extended the system to grade DR severity. The IDx-DR system received FDA De Novo clearance in April 2018 — the first autonomous AI diagnostic cleared in the United States, allowing primary-care use without ophthalmologist review. Google Health partnered with Aravind Eye Care in India (2018 onward) to deploy similar technology at scale; by 2023 the Google-Aravind system had screened over 100,000 patients in southern India. The deployment evidence is substantial: a 2019 prospective study at Aravind’s Madurai facility found that 21.7% of screened patients had referable DR that would otherwise have been missed (Beede et al., 2020). The Aravind deployment also surfaced operational lessons that pure-research evaluation had not captured — image quality variability across capture conditions degraded sensitivity in ways that the lab evaluation had under-predicted.
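The headline sensitivity and specificity figures reduce to simple counts over a labelled test set. The sketch below (with hypothetical counts, not the published Gulshan et al. data) shows the computation.

```python
# Illustrative only: sensitivity and specificity for a binary screening task
# such as referable-DR detection. The counts are hypothetical.
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical screening results: 1,000 patients, 100 with referable DR.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=885, fp=15)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")  # 90.0%, 98.3%
```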
Mammography is the field’s most-publicised research application. McKinney et al. (2020), a Google Health collaboration with Cancer Research UK and Northwestern, reported that an AI system reduced false positives by 5.7% and false negatives by 9.4% on US data, with smaller but directionally consistent improvements on UK data. The paper drew prominent attention but also methodological criticism (Haibe-Kains et al., 2020) for insufficient detail on training data and code; subsequent independent replication has been mixed. As of 2026, mammography AI is deployed at several US health systems as a “second reader” augmenting human radiologists, but it has not displaced the radiologist as the primary reader.
Pathology AI matured later than radiology but is now substantial. PathAI (founded 2016) and Paige (founded 2017) have produced FDA-cleared algorithms for prostate cancer, breast cancer biomarkers, and gastrointestinal pathology. The field’s particular challenge is the scale of digital pathology images (gigapixel-class) and the variation across staining protocols, scanners, and tissue preparation; these produce a domain shift problem that imaging-classification methods designed for radiology must explicitly address.
Radiology more broadly. The radiology AI ecosystem in 2024–2026 is dense with single-task algorithms: stroke detection in CT (Aidoc, RapidAI); pulmonary nodule detection (CT lung cancer screening); fracture detection (BoneView, Gleamer); intracranial haemorrhage detection. The deployment economics in the United States are partly driven by the radiologist shortage: the American College of Radiology has documented sustained imaging-volume growth outpacing radiologist supply, which has produced workflow pressure that AI augmentation addresses directly. Time savings of 10–30% on specific reading tasks have been documented in real-world deployments (van Leeuwen et al., 2021).
Limits of the imaging deployment success. Two limitations deserve note. First, the deployment is augmentation, not autonomy. With the exception of IDx-DR for DR screening, no FDA-cleared imaging AI system can render a final diagnostic interpretation without radiologist review. The AI surfaces candidates and prioritises worklists; the human reads. Second, the deployment economics depend on payer reimbursement, which has lagged the technical maturation. Until 2022 there was no specific CPT (Current Procedural Terminology) code for AI-assisted imaging; AI use was reimbursed implicitly as part of the underlying study. CPT codes for AI-augmented imaging began appearing in 2022 and have expanded since. The reimbursement structure substantially shapes where AI deploys and where it does not; technically-mature AI for non-reimbursed tasks deploys more slowly than technically-equivalent AI for reimbursed tasks.
7.3 Watson Health — the canonical post-mortem
IBM’s Watson Health programme (2011–2022) is the modern era’s clearest example of how AI projects fail when ambition outruns operational discipline. The programme began with the Jeopardy! moment in February 2011, when IBM’s Watson defeated Brad Rutter and Ken Jennings on the television quiz show. The marketing impact was substantial; IBM’s strategy team rapidly identified medicine as the natural commercialisation target for the underlying technology, on the reasoning that the question-answering capability that won Jeopardy! would generalise to clinical knowledge retrieval and diagnostic support.
The 2013 partnerships announced the strategy. IBM partnered with Memorial Sloan Kettering Cancer Center to develop Watson for Oncology, a system intended to recommend treatment plans for cancer patients based on the medical literature and Memorial Sloan Kettering’s clinical expertise. MD Anderson Cancer Center signed a similar partnership in 2013. Cleveland Clinic, Mayo Clinic, and others followed. The marketing was extensive: Watson would democratise expert-level cancer treatment recommendations; community oncologists would have access to leading-cancer-centre-quality decision support; the technology would scale to global use.
By 2017–2018 the reckoning was visible. MD Anderson cancelled its Watson partnership in 2017 after spending USD 62 million without producing a clinical product; an internal University of Texas audit later documented the project’s mismanagement. STAT News (Ross and Swetlitz, 2017) published an investigation revealing that Watson for Oncology’s training data was largely synthetic — Memorial Sloan Kettering oncologists had constructed treatment scenarios for the system to learn from, rather than the system learning from real patient outcomes. IBM internal documents leaked in 2018 revealed that the system had occasionally produced unsafe treatment recommendations. By 2022, IBM sold Watson Health to Francisco Partners for an estimated USD 1 billion — substantially below the cumulative investment.
The post-mortem yields five structural lessons.
Lesson 1 — broad scope without operational definition. “Recommend cancer treatment” is not an operationally-defined task. Cancer treatment selection involves cancer type, stage, comorbidities, prior treatments, patient preferences, trial eligibility, social context, and institution-specific protocols. A treatment “recommendation” can be a one-line drug suggestion or a multi-page treatment plan. Without operational definition, evaluation is impossible; without evaluation, the system cannot be improved. The contrast with successful imaging deployments is sharp: “detect referable diabetic retinopathy in a fundus photograph” is operationally defined; “recommend appropriate cancer treatment” is not.
Lesson 2 — brand momentum substituted for evaluation infrastructure. The Jeopardy! moment generated a marketing trajectory that ran years ahead of the clinical evidence. IBM committed to high-profile partnerships, conference keynotes, and press releases without the closed-loop evaluation that would have surfaced problems early. By the time evaluation evidence became unavoidable, the brand commitment was sunk.
Lesson 3 — the platform framing without unit-of-value clarity. Watson Health was positioned as a platform across oncology, genomics, imaging, drug discovery, and care management. The platform framing produced discovery friction at every customer engagement: each hospital had to be convinced that Watson would solve their specific problem, with the burden of demonstration falling on the deployment team. A narrower scope — say, “we extract relevant clinical-trial eligibility from oncology notes” — would have produced unit-of-value clarity that the platform framing obscured.
Lesson 4 — synthetic training data and the validation gap. The Memorial Sloan Kettering training-data construction substituted curated clinician judgment for real-world outcomes. The substitution produced a system that approximated MSK oncologists’ stated preferences, not real-world treatment effectiveness. When the system generalised to community oncology contexts (different patient populations, different resources, different protocols), the MSK-aligned recommendations were sometimes inappropriate. The lesson generalises: training data that reflects expert judgement under controlled conditions does not transfer to deployment conditions without rigorous external validation, which Watson did not perform.
Lesson 5 — the regulatory and reimbursement pathway not understood. Watson for Oncology was positioned as decision-support, which meant it could avoid the FDA pathway that diagnostic devices require. The ambiguity benefited IBM in early commercialisation but damaged it in mature deployment: hospitals could not bill for Watson use; insurers had no reimbursement framework; the clinical-decision-support carve-out under the 21st Century Cures Act of 2016 explicitly limited what unregulated CDS could do. The mature commercial path required either FDA clearance (which Watson had not pursued seriously) or a workflow positioning that made the unregulated use sustainable, neither of which had been resolved.
The Watson Health post-mortem is now part of the medical-AI canon. The Wang and Topol (2024) review in Nature Medicine describes Watson as “the formative cautionary tale of medical AI deployment”; Strickland’s (2019) IEEE Spectrum reporting remains the most-detailed public-record account; the case appears in Iansiti and Lakhani (2020) as a structural-failure example. The lessons cited in this textbook’s playbook chapters (Ch 19 on idea selection, Ch 21 on MVP scoping, Ch 23 on evaluation, Ch 24 on alpha launch, Ch 28 on commercialisation) all draw on the Watson pattern. The pattern is durable because the underlying failure modes — broad scope, brand momentum, platform framing, weak validation, unclear regulatory path — recur in subsequent medical AI projects whenever discipline lapses.
7.4 Drug discovery and the AlphaFold revolution
Drug discovery before 2020 was structurally constrained by two problems. First, the target validation problem: identifying which proteins, in which disease contexts, were druggable was a slow, biology-intensive process. Second, the molecular design problem: once a target was identified, finding small molecules or biologics that bound the target with adequate selectivity, potency, and pharmacokinetic properties required iterative chemistry and biology cycles that took years and consumed hundreds of millions of dollars per program. The pharmaceutical industry’s productivity decline, documented as “Eroom’s law” (the inverse of Moore’s law; see Scannell et al., 2012), reflected the compounding cost: each new drug approval cost roughly twice as much in real terms as the prior decade’s, and the trend had run for fifty years.
AlphaFold, developed by DeepMind under John Jumper’s leadership, addressed the protein-structure-prediction problem that underlies target validation and molecular design. Predicting a protein’s three-dimensional structure from its amino-acid sequence had been an open problem since Anfinsen’s experimental work in the 1960s established that the sequence determines the structure. The Critical Assessment of Structure Prediction (CASP) competition, run biennially since 1994, served as the field’s benchmark.
AlphaFold 1 (Senior et al., 2020) won CASP13 in 2018 with substantial improvements over prior methods, but the predicted structures were not yet reliable enough for routine biological use. AlphaFold 2 (Jumper et al., 2021), unveiled at CASP14 in 2020, was a different magnitude of advance: a median GDT_TS of 92.4 (out of 100) across targets, above the ~90 level conventionally taken to indicate experimental-quality prediction. The 2021 Nature paper was accompanied by public release of the model, and by 2022 the accompanying AlphaFold database covered over 200 million predicted structures (essentially every known protein in the major sequence databases). The community impact was extraordinary; within 24 months, AlphaFold 2 was cited in over 10,000 papers across structural biology, biochemistry, and drug discovery.
AlphaFold 3 (Abramson et al., 2024), released in May 2024, extended the methodology beyond proteins alone to model protein-protein, protein-ligand, protein-DNA, and protein-RNA complexes. This extension addressed the limitation that had bounded AlphaFold 2’s drug-discovery utility: predicting a drug-target binding required modelling the protein-ligand complex, which AF2 could not do directly. AF3 produces predictions of these complexes at competitive accuracy with experimental methods (X-ray crystallography, cryo-EM), at compute cost orders of magnitude lower. Isomorphic Labs, the Alphabet-owned drug-discovery spinout founded in 2021, built its drug-discovery platform around AlphaFold and related models; by 2024 the company had partnerships with Eli Lilly and Novartis worth a combined USD 3 billion.
The AI-native drug-discovery firms constitute a parallel ecosystem. Recursion Pharmaceuticals (founded 2013, Salt Lake City) built a phenotypic-screening platform combining high-content imaging with machine learning, with five clinical-stage assets by 2024. Insitro (founded 2018) combines machine learning with functional genomics for target discovery; partnerships with Bristol-Myers Squibb, Gilead, and Genentech. Atomwise (founded 2012) focuses on virtual screening of small molecules against protein targets. Schrödinger (founded 1990; IPO 2020) is an older physics-based platform that has incorporated ML; the company’s pipeline of internal drug candidates plus partnership royalties produced approximately USD 200 million in 2023 revenue. Insilico Medicine (founded 2014, Hong Kong) generated attention in 2023 when its AI-discovered fibrosis candidate INS018_055 entered Phase II trials — the first AI-designed drug to reach Phase II. BenevolentAI (founded 2013) had a more difficult trajectory, with multiple program terminations and a significant 2024 restructuring.
The pharmaceutical industry response has been substantial. Eli Lilly, Pfizer, Roche, Merck, AstraZeneca, GSK, Sanofi, Bristol-Myers Squibb — all major pharmaceutical firms — have either built internal AI platforms, acquired AI-platform companies, or signed major partnerships with AI-native firms. The Boston Consulting Group’s 2024 pharmaceutical AI survey found that 80% of major pharmaceutical firms report active AI use in target identification and 65% in molecular design.
The deployment realism check. AI-discovered drugs have not yet reached approval. As of 2026, the most-advanced AI-discovered candidates are in Phase II/III trials. The drug-development cycle is structurally long: even with AI compression of preclinical work, the clinical phases remain bounded by patient-recruitment timelines, regulatory review, and biological reality. AI may reduce the time-to-clinical-candidate from 4–6 years to 1–2 years, but the time-to-approval is bounded by clinical-trial duration, which AI does not change. Industry analysts (e.g., BCG, McKinsey) project that the first AI-discovered approved drugs will appear in the 2027–2029 window, with substantial volume by 2030. The structural promise of AI in drug discovery is genuine; the timeline to industry transformation is decade-scale rather than year-scale.
7.5 Ambient clinical scribes — the 2024–2026 deployment frontier
The administrative burden on clinicians is documented at substantial scale. Sinsky et al. (2016) found that primary-care physicians spent approximately 49% of their working time on EHR and desk work, against 27% on direct patient care. The “two-for-one” pattern — every hour of patient care produced two hours of documentation — is the canonical statistic. Administrative burden is a leading driver of clinician burnout (Shanafelt et al., 2019); burnout drives clinician attrition; and attrition deepens the healthcare workforce shortage. Each link in the chain is well documented.
Ambient clinical scribes — AI systems that listen to clinician-patient encounters and produce clinical documentation automatically — address the chain at its source. The technology stack has three layers: speech recognition (transforming the audio recording into text), clinical-content understanding (identifying the diagnostically and administratively relevant content within the conversation), and structured-document generation (producing SOAP notes, ICD-10 codes, billing codes, and medical-record entries in EHR-acceptable formats). All three layers existed in research form before 2022; the foundation-model wave produced the integrated capability.
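A schematic sketch of the three-layer stack is below. The three functions are deliberately trivial stand-ins for the speech-recognition, content-extraction, and note-generation models a real vendor would use; no actual product API is implied.

```python
# Illustrative three-layer ambient-scribe pipeline; each function is a stub.

def transcribe(encounter_audio: bytes) -> str:
    # Layer 1: speech recognition. A real system calls an ASR model here.
    return "Patient reports three days of productive cough and low-grade fever."

def extract_clinical_content(transcript: str) -> dict:
    # Layer 2: identify diagnostically and administratively relevant content.
    return {"symptoms": ["productive cough", "low-grade fever"],
            "duration": "3 days",
            "icd10_codes": ["R05.9", "R50.9"]}

def generate_soap_note(findings: dict) -> str:
    # Layer 3: produce an EHR-acceptable structured document (SOAP format).
    return (f"S: {', '.join(findings['symptoms'])} for {findings['duration']}.\n"
            "O: ...\nA: ...\nP: ...")

note = generate_soap_note(extract_clinical_content(transcribe(b"<audio bytes>")))
print(note)
```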
Major commercial offerings in 2024–2026 include:
Nuance DAX (Microsoft). Nuance, acquired by Microsoft for USD 19.7 billion in 2022, is the longest-standing incumbent in clinical voice recognition (Dragon Medical, in clinical use since the 2000s). The DAX (Dragon Ambient eXperience) product, launched in 2020 and substantially upgraded in 2023 with GPT-4 integration, is the market leader by deployment scale. Microsoft reports DAX use at over 350 health systems by 2024. Pricing is enterprise: typically USD 200–400 per clinician per month at scale. The Microsoft-Epic partnership is a deployment accelerant; DAX integrates natively with Epic’s clinical workflow, removing the integration friction that plagued earlier scribe products.
Abridge. Founded 2018; Pittsburgh. The early academic-led entrant, with University of Pittsburgh Medical Center as anchor customer. Series C funding of USD 150 million in 2024 valued the company at USD 850 million. The product positions on physician-experience optimisation — minimising clinician interaction with the system — rather than enterprise-IT integration depth.
Suki AI. Founded 2017; Redwood City. Originally positioned as a voice-driven clinical assistant; pivoted to ambient scribing as the foundation-model wave matured. Partnerships with major EHR vendors.
DeepScribe. Founded 2017; San Francisco. Mid-market positioning; serves smaller practices and ambulatory clinics.
Ambience Healthcare. Founded 2020; San Francisco. Focused on the highest-end customer segments with the strongest evaluation infrastructure; published clinical evidence in 2023–2024 demonstrating documentation-time reduction of 70%+ and clinician-burnout-score improvements.
Hippocratic AI. Founded 2023; San Francisco. Distinctive positioning: rather than focusing on the physician scribe market, Hippocratic targets the nurse and care manager market with a clinical agent that conducts patient outreach, follow-up calls, and care navigation. The Polaris model (2024) is the company’s foundation model, trained extensively on clinical content with emphasis on safety-tuning for patient-facing use. The General Catalyst-led 2024 funding round valued the company at USD 1.6 billion. The deployment model differs from Nuance and Abridge: Hippocratic is paid per-conversation, with healthcare systems contracting for outcome-defined services.
Adoption rates and clinical evidence. By end-2024, ambient scribe adoption at US health systems exceeded 50% of large academic medical centres; The Permanente Federation reported in 2024 that 30,000 clinicians across Kaiser Permanente were using ambient scribe technology daily. Published clinical evidence (Gellert et al., 2024; Tierney et al., 2024) documents documentation-time reductions of 30–70%, clinician-satisfaction improvements of 0.5–1.0 points on 5-point scales, and modest or null effects on note quality (the studies generally find that AI-generated notes are not worse than human-dictated notes). The economic case for hospital-scale deployment is increasingly clear: the cost savings from clinician retention, reduced after-hours documentation work (“pajama time”), and increased throughput exceed the per-clinician monthly fee at most reasonable assumptions about clinician value.
The structural lessons from the scribe deployment are the inverse of Watson Health’s. The scribe task is operationally defined (transcribe and structure a conversation; produce a SOAP note); the unit of value is clear (one encounter; one note); the regulatory pathway is established (the scribe is a documentation tool, not a diagnostic device, and falls outside FDA’s oversight scope); the integration is into existing EHR workflows that hospitals already operate; the evaluation methodology is straightforward (note quality; clinician satisfaction; time saved). Where Watson Health failed every test, scribes pass every test. The deployment success is not a coincidence; it reflects the structural fit of the task to the deployment context.
7.6 LLM-based clinical applications
The capability of foundation models on medical reasoning tasks emerged faster than the medical community had projected. In 2022 Singhal et al. (Google Research) introduced Med-PaLM, a medical-domain specialisation of the PaLM language model. Med-PaLM scored 67.6% on the USMLE-style MedQA benchmark, clearing the approximate passing threshold. Twelve months later Med-PaLM 2 (Singhal et al., 2023) scored 86.5% on the same benchmark — comparable to expert physicians. OpenAI’s GPT-4 scored similarly (Nori et al., 2023) without medical-domain fine-tuning, suggesting that frontier foundation models had essentially absorbed medical reasoning capability through broad training. Subsequent models (GPT-5, Claude Opus, Gemini 3) extended the capability further; the standard medical benchmarks are now considered partly saturated, with expert-physician-equivalent performance routine.
Three deployment patterns have emerged for LLM-based clinical applications, each with distinct economics and regulatory implications.
Pattern 1 — clinical search and synthesis. OpenEvidence (founded 2021, formerly OpenLab Inc.) provides a clinical-question-answering interface backed by retrieval over peer-reviewed medical literature. The product is pitched directly to clinicians as a faster alternative to UpToDate or PubMed search. By 2024 OpenEvidence reported over 250,000 active clinician users, with substantial usage at academic medical centres. The economics work because clinicians use the tool many times per shift; the per-query cost (foundation-model inference plus retrieval) is small relative to the time saved. The regulatory framing positions OpenEvidence as decision-support, not diagnostic, which keeps it outside FDA jurisdiction.
Pattern 2 — clinical reasoning assistants. Glass Health (founded 2021), Pearl Health, and several adjacent firms position their products as “clinical reasoning copilots” that help clinicians work through differential diagnoses and treatment options. The positioning is more ambitious than search-and-synthesis but less so than autonomous diagnosis. The current evidence base is thinner; deployment is concentrated at adventurous early-adopter physicians rather than at health-system scale. The challenge is the trust threshold: clinicians use search-and-synthesis tools to confirm what they already know, but treat reasoning-assistance tools sceptically, especially when the assistance disagrees with the clinician’s own judgment.
Pattern 3 — patient-facing clinical agents. Hippocratic AI (Section 7.5) is the most-prominent example; other entrants include the K Health platform and various startup-stage offerings. The positioning is to handle low-acuity patient interactions — care navigation, appointment scheduling, medication reminders, post-discharge follow-up — without requiring physician time. The economic logic is compelling (the labour cost of human nurses or care managers is high; AI scaling is straightforward), but the safety threshold is high (errors can have direct patient consequences) and the regulatory framing is unsettled (does a patient-facing clinical agent require FDA oversight as a medical device?). As of 2026 the regulatory question has not been definitively resolved; the FDA’s 2024 enforcement-discretion guidance suggests the agency intends to apply existing medical-device principles to clinical agents but has not yet issued explicit rules.
The hallucination problem. The single largest risk for clinical LLM applications is hallucination — confident production of incorrect medical content. Hallucination is more dangerous in clinical contexts than in general consumer applications because (a) clinicians may act on the hallucinated content; (b) the consequences of clinical errors are severe; (c) accountability questions are unsettled. Mitigation approaches include retrieval-augmented generation (RAG) that grounds outputs in cited source material; structured evaluation against medical-fact databases; human-in-the-loop review for any patient-affecting recommendation; and explicit uncertainty quantification. The 2024 NEJM AI editorial (Topol, 2024) makes the case that ongoing hallucination risk requires that clinical LLM applications operate as “reading lenses” that surface verified information, rather than as “reasoning oracles” that generate novel conclusions.
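The “reading lens” pattern can be made concrete with a minimal retrieval-augmented-generation sketch: the system may only answer from retrieved, citable passages and escalates when nothing relevant is found. The corpus, relevance scoring, and answer step below are toy stand-ins, not a production clinical system.

```python
# Minimal RAG-style grounding sketch: answer only from citable passages.
CORPUS = {
    "source_1": "Metformin is first-line pharmacotherapy for type 2 diabetes in most guidelines",
    "source_2": "Annual dilated eye examination is recommended for patients with diabetes",
}

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    # Toy relevance score: count of shared lowercase tokens.
    def score(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(CORPUS.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [kv for kv in ranked[:k] if score(kv[1]) > 0]

def grounded_answer(query: str) -> str:
    passages = retrieve(query)
    if not passages:
        return "No supporting source found; escalate to a clinician."
    # A real system would have an LLM synthesise an answer constrained to the
    # retrieved text; here we simply surface the passage with its citation.
    cited = "\n".join(f"[{name}] {text}" for name, text in passages)
    return f"Query: {query}\nSupported by:\n{cited}"

print(grounded_answer("first-line pharmacotherapy for type 2 diabetes"))
```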
The clinical-decision-support carve-out. Under the 21st Century Cures Act of 2016, clinical-decision-support software is excluded from FDA device regulation if it (a) does not acquire or analyse medical images or signals, (b) supports or provides recommendations to a health-care professional rather than acting autonomously, and (c) enables the clinician to independently review the basis for those recommendations rather than relying primarily on them. Most current clinical LLM applications are positioned within this carve-out — the LLM provides information; the clinician decides. The carve-out is being tested as foundation-model capabilities expand toward fully-autonomous decision-support, with the 2024 FDA draft guidance on “Software Functions Excluded from the Definition of Device” (FDA, 2024) re-articulating the boundary. The boundary’s location materially affects the deployment economics; products on one side of the line are unregulated, products on the other side require 510(k) or De Novo clearance.
7.7 The regulatory landscape
Medical AI regulation is structurally complex because the underlying regulatory frameworks were designed for traditional medical devices — physical instruments with stable mechanisms — rather than learning systems that update over time. The frameworks have been adapting; the adaptation is ongoing.
United States — FDA. The FDA regulates medical devices under three risk-based pathways:
- 510(k) clearance for moderate-risk (Class II) devices that can demonstrate substantial equivalence to an already-cleared predicate device. The pathway is the most-used for medical AI: most imaging algorithms enter through 510(k), claiming substantial equivalence to either an earlier algorithm or a non-AI imaging product.
- De Novo classification for novel devices without a predicate. IDx-DR’s 2018 clearance was via De Novo (the first autonomous AI diagnostic, with no predicate). De Novo establishes a new device category that subsequent similar devices can use as their predicate.
- PMA (Premarket Approval) for high-risk (Class III) devices. Few medical AI systems require PMA; most route through 510(k) or De Novo.
The FDA’s AI/ML SaMD Action Plan (2021) and subsequent updates established the agency’s framework for continuously learning AI systems — devices that update their models in deployment. The traditional 510(k) framework assumed device behaviour was fixed at clearance; a model that retrains on new data violates that assumption. The Action Plan’s solution is the Predetermined Change Control Plan (PCCP), in which the manufacturer specifies in advance what kinds of model updates will occur and how they will be validated. The PCCP framework was finalised in late 2023 and has begun appearing in 2024–2025 clearances.
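The logic of a PCCP can be illustrated with a small sketch: an update is accepted only if the retrained model passes pre-registered acceptance bounds on a locked validation set. The thresholds and metric names below are hypothetical illustrations, not FDA requirements.

```python
# Illustrative PCCP-style update gate: pre-specified bounds, pass/fail decision.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeControlPlan:
    min_sensitivity: float = 0.87   # pre-specified acceptance thresholds (hypothetical)
    min_specificity: float = 0.90
    max_subgroup_gap: float = 0.05  # largest allowed performance gap across strata

def approve_update(metrics: dict, plan: ChangeControlPlan) -> bool:
    """Return True only if the retrained model meets every pre-registered bound."""
    return (
        metrics["sensitivity"] >= plan.min_sensitivity
        and metrics["specificity"] >= plan.min_specificity
        and metrics["max_subgroup_gap"] <= plan.max_subgroup_gap
    )

candidate = {"sensitivity": 0.91, "specificity": 0.93, "max_subgroup_gap": 0.03}
print(approve_update(candidate, ChangeControlPlan()))  # True -> deploy; False -> roll back
```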
European Union — MDR / IVDR. The EU Medical Device Regulation (MDR, in force May 2021) and In Vitro Diagnostic Regulation (IVDR, in force May 2022) replaced earlier directives. The MDR/IVDR explicitly cover software as a medical device (SaMD) and apply risk-based classification. The EU AI Act (Chapter 14) layers additional requirements on high-risk AI systems, with clinical AI explicitly identified as high-risk. The combination produces a more demanding compliance burden than US regulation; medical AI products targeting the EU market typically face 12–24 months of additional regulatory work beyond US clearance.
Australia — TGA. The Therapeutic Goods Administration regulates medical devices under the Therapeutic Goods Act 1989 and associated regulations. Software as a medical device is recognised; the TGA aligns broadly with US FDA frameworks via the Medical Device Single Audit Program (MDSAP). The TGA published specific guidance on AI-enabled medical devices in 2021, updated 2024. For Australian-developed medical AI, the typical pathway is TGA clearance followed by international expansion (US FDA, EU MDR); the order matters because TGA approval can support fast-tracked international regulatory consideration.
Malaysia — MDA. The Medical Device Authority (MDA) regulates medical devices under the Medical Device Act 2012 (Act 737). MDA’s framework is risk-based and aligns broadly with the ASEAN Medical Device Directive. AI-specific guidance is less mature than in the US, EU, or Australia; AI-enabled medical devices are typically registered under existing software-as-medical-device categories. The Ministry of Health Malaysia has issued guidance on hospital deployment of AI tools (2023, updated 2024) that overlays the device-registration framework with hospital-side governance requirements.
The continuously-learning challenge. The hardest regulatory question, common across jurisdictions, is how to oversee AI systems that update their models in deployment. Traditional regulatory frameworks rest on the assumption that a regulatory clearance applies to a specific, frozen device; if the device changes, re-clearance is required. AI systems that learn from production data violate this assumption: the regulatory framework either freezes the model (preventing the data-flywheel-driven improvement that justifies the AI) or accepts model updates without re-clearance (requiring a different oversight mechanism). The PCCP approach (above) is the FDA’s attempt to thread this needle; the EU’s approach via the AI Act is more conservative, requiring that high-risk systems either remain frozen or undergo periodic re-evaluation.
The black-box problem in regulated medicine. Beyond the continuously-learning challenge, regulators have traditionally expected medical-device behaviour to be explicable; foundation models challenge that expectation because the relationship between their inputs and outputs cannot be explained the way a rule-based or simple statistical model can be. The FDA’s position has been pragmatic — what matters is the device’s clinical performance, not its mechanism — but the position is contested. Patients and clinicians who use a device may legitimately want to understand the basis for its outputs; “the model produced this output” is not always a satisfactory explanation. The interpretability expectation is more strictly enforced in the EU than in the US or Australia, partly reflecting the EU’s broader privacy-and-data-rights philosophy; the divergence creates a structural complication for global product strategy.
7.8 Hospital-scale deployment — operations and adoption
Deploying AI at hospital scale is operationally distinct from developing AI capability. The capability question is “can the model perform the task?”; the deployment question is “can the model perform the task within the hospital’s existing workflow, with the hospital’s existing IT infrastructure, governed by the hospital’s existing clinical-and-administrative structures, paid for by the hospital’s existing reimbursement mechanisms?” The deployment friction has been the binding constraint on medical AI for the past decade; the institutions that have solved the deployment problem have done so through dedicated structural investment, not through technical capability.
Mount Sinai Health System (New York). Mount Sinai built one of the first dedicated AI deployment programmes in US healthcare. The Mount Sinai Department of AI and Human Health (founded 2017, evolved from the Hammer Institute) operates AI deployment as a clinical-research-and-operations function. By 2024 Mount Sinai had deployed 30+ AI models in active clinical use, with documented evidence on each. The institutional model centres on dedicated AI clinical fellows (clinicians trained in AI methodology), dedicated implementation engineers (working with the IT department), and a governance committee that approves each deployment. The model is replicable but expensive; only large academic medical centres have the resources to support it directly.
Mayo Clinic. Mayo’s approach is more centralised. The Mayo Clinic Platform, founded 2020, is an enterprise-scale data-and-AI platform that supports clinical, research, and external-partner AI work. The Platform offers specific commercial products — including the 2023 launch of an AI-supported atrial fibrillation detection product — alongside internal use. Mayo’s particular advantage is the longitudinal patient data of one of the world’s largest integrated health systems, which produces training and evaluation corpora that smaller institutions cannot match.
HCA Healthcare. HCA, the largest US for-profit hospital operator, built the NATE platform (Natural Language for Tagging Encounters) for clinical-documentation augmentation, with operational scale across HCA’s 180+ hospitals. The HCA approach is industrial: the platform serves multiple workflows across the hospital network, with central AI/ML governance and standardised deployment patterns. The HCA model is the closest US-healthcare analogue to the AI factory pattern of Chapter 3.
Australian context — Royal Children’s Melbourne and Garvan Institute. The Royal Children’s Hospital Melbourne has been a regional leader in clinical AI, with deployments in paediatric radiology and clinical decision support. The Garvan Institute of Medical Research, in Sydney, focuses on genomics-and-AI integration; the Institute’s leadership in clinical genomics is paired with AI methodology that supports translation of genomic findings into clinical decisions. The Walter and Eliza Hall Institute (Melbourne) and the Murdoch Children’s Research Institute (Melbourne) operate similar AI-research-meets-clinical-operations programmes, with smaller scale than the major US examples but with closer integration to Australian primary-care and public-health systems.
Malaysian context — IHH Healthcare and the private-hospital networks. IHH Healthcare, the parent of Pantai Hospital and Mount Elizabeth Hospital networks, operates the largest private-hospital footprint in Southeast Asia. IHH’s AI investments have focused on operational efficiency (patient flow, appointment scheduling, claims processing) rather than diagnostic AI, reflecting the regional reimbursement structure that does not yet directly compensate for clinical AI use. Sunway Medical Centre and KPJ Healthcare have similar operational-AI orientations. The University of Malaya Medical Centre and Hospital Kuala Lumpur, on the public-system side, are the centres of academic medical AI research in Malaysia, often in partnership with international research groups.
The IT integration challenge. Hospital IT environments are structurally complex: typically an Epic, Cerner, or local-vendor EHR as the primary clinical system; multiple departmental systems (radiology PACS, pathology LIS, pharmacy, billing); legacy interfaces using HL7 v2 messaging; modern interfaces using FHIR; bespoke integration layers. AI deployment requires reading from and writing to these systems in ways that respect the institution’s IT governance. The integration cost is typically larger than the AI development cost; vendors that solve the integration problem (Epic’s App Orchard for AI integrations; Cerner’s Code Galaxy) acquire substantial deployment leverage.
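A minimal sketch of reading clinical data over a FHIR REST interface — the “modern interface” referred to above — follows. The base URL and patient identifier are hypothetical; the field paths follow the standard FHIR R4 Bundle/Observation structure, and a real deployment would additionally handle authentication, paging, and missing fields.

```python
# Illustrative FHIR R4 read: fetch the most recent HbA1c for one patient.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/r4"   # hypothetical endpoint
PATIENT_ID = "12345"                                  # hypothetical patient

def latest_hba1c(patient_id: str) -> dict | None:
    """Fetch the latest HbA1c observation (LOINC 4548-4) for a patient."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": "http://loinc.org|4548-4",
                "_sort": "-date", "_count": 1},
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()                 # FHIR Bundle resource
    entries = bundle.get("entry", [])
    if not entries:
        return None
    obs = entries[0]["resource"]         # FHIR Observation resource
    return {"value": obs["valueQuantity"]["value"],
            "unit": obs["valueQuantity"]["unit"],
            "date": obs.get("effectiveDateTime")}

# print(latest_hba1c(PATIENT_ID))  # requires a reachable FHIR server
```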
The clinical-champion model. The single most-reliable predictor of successful AI deployment within a hospital is the existence of a clinical champion — a senior clinician who advocates for the deployment, troubleshoots adoption issues, and provides peer credibility. Without a clinical champion, even well-designed deployments fail; with a strong champion, even rough deployments succeed. The pattern is durable across institutions and AI categories. The implication for AI vendors: customer-acquisition strategy must identify and recruit potential champions, not just sell to administrators.
The data-quality gap. Hospital-generated data is messier than clean training data: missing fields, inconsistent coding, free text where structure was intended, and errors of many kinds. The data-quality gap is a leading cause of deployment underperformance: a model trained on curated research data may perform substantially worse on real-world hospital data of the same nominal type. Successful deployments invest in data-quality work — sometimes via ETL pipelines, sometimes via data-quality monitoring, sometimes via clinician-side data-entry retraining. The investment is not glamorous and is often under-resourced; the resulting under-performance is misattributed to the AI rather than to the data-quality gap.
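A small sketch of the unglamorous monitoring this implies — missing-field rates, out-of-vocabulary codes, implausible timestamps — is below. The column names and allowed code set are hypothetical.

```python
# Illustrative data-quality report for a hospital extract (hypothetical schema).
import pandas as pd

ALLOWED_SEX_CODES = {"M", "F", "U"}

def data_quality_report(df: pd.DataFrame) -> dict:
    return {
        # Share of missing values per column.
        "missing_rate": df.isna().mean().round(3).to_dict(),
        # Values outside the expected code set (missing values also count here).
        "invalid_sex_codes": int((~df["sex"].isin(ALLOWED_SEX_CODES)).sum()),
        # Records whose timestamp is implausibly old (stale or mis-entered).
        "records_before_2000": int((pd.to_datetime(df["encounter_time"]) < "2000-01-01").sum()),
    }

df = pd.DataFrame({
    "sex": ["M", "F", "female", None],
    "encounter_time": ["2024-03-01", "1999-01-01", "2024-03-02", "2024-03-03"],
    "hba1c": [6.1, None, 7.4, 5.9],
})
print(data_quality_report(df))
```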
7.9 Mental health and behavioural AI
Mental health is a domain where AI promises substantial value (large unmet need, scalability, low marginal cost) and where the clinical evidence is most contested (efficacy questions, safety concerns, adversarial events). The tension between the structural opportunity and the deployment caution defines the field.
The mental-health crisis context. The WHO estimates that 970 million people globally live with a mental disorder. The treatment gap is large: an estimated 75–95% of people with mental disorders in low- and middle-income countries receive no treatment, and even in high-income countries the gap exceeds 50% (Patel et al., 2018, Lancet). The clinician supply is structurally constrained — psychiatrist training takes years and the workforce shortage is acute in most countries. AI’s promise is to extend access; the question is whether the AI-mediated intervention is clinically equivalent to human-mediated care.
The CBT-chatbot wave. Cognitive behavioural therapy (CBT) is among the most-evidenced psychotherapy modalities, with structured protocols that are amenable to text-based delivery. Woebot (founded 2017) was the highest-profile CBT-chatbot entrant, with academic-affiliated evidence (Fitzpatrick et al., 2017) showing depression-symptom reduction in a 2-week RCT among college students. Wysa (founded 2015), based in Bangalore and Boston, took similar positioning with stronger international expansion. Youper (founded 2015) combined mood tracking with brief chatbot-delivered interventions. By 2024 the CBT-chatbot category had attracted USD 800+ million in cumulative venture funding.
The clinical evidence. The early efficacy evidence (Fitzpatrick et al., 2017; Inkster et al., 2018) has been challenged by larger and longer-duration trials. A 2023 systematic review (Hunkin et al.) found that effect sizes from CBT chatbots were modest (Cohen’s d 0.2–0.4 for depression and anxiety symptoms) and did not generally persist beyond 2–4 months. The trial structure also under-tested some failure modes; a 2-week RCT with college-student volunteers does not capture the population for whom the unmet-need argument is strongest (severe-symptom adults, in low-resource settings, over months-to-years of use). The current evidence supports CBT chatbots as low-intensity adjuncts to human care, not as standalone replacements; the deployment claim that the chatbots constitute a scalable solution to the global mental-health treatment gap is not yet supported by the evidence base.
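For readers unfamiliar with the effect-size convention, Cohen’s d is the between-group difference in means divided by the pooled standard deviation; the sketch below, with made-up symptom scores, shows what a “modest” d of roughly 0.3 looks like.

```python
# Illustrative Cohen's d computation on synthetic symptom scores.
import numpy as np

def cohens_d(treatment: np.ndarray, control: np.ndarray) -> float:
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)
    ) / (n1 + n2 - 2)
    # Lower symptom scores are better, so control minus treatment gives a
    # positive d when the treatment group improves.
    return (control.mean() - treatment.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
control = rng.normal(loc=14.0, scale=5.0, size=200)    # e.g. PHQ-9-style scores
treatment = rng.normal(loc=12.5, scale=5.0, size=200)  # 1.5-point mean reduction
print(round(cohens_d(treatment, control), 2))          # roughly d ~= 0.3
```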
Crisis intervention vs maintenance. The riskiest mental-health AI use case is crisis intervention — a user in active suicidal crisis, deteriorating psychosis, or severe self-harm. Crisis cases require expert clinical judgment, careful safety planning, and often physical intervention (hospitalisation, in-person evaluation). Most AI mental-health products explicitly disclaim crisis use and route users to crisis lines; the routing’s reliability is itself a research question. Maintenance use — supporting users with stable but ongoing mental-health concerns — is less risky but also produces smaller clinical effect sizes.
The 2024 Character.AI case. In October 2024 Character.AI was sued (Garcia v. Character Technologies, M.D. Fla.) following the suicide of a 14-year-old user who had developed an emotional attachment to a chatbot persona on the platform. The case is the most-prominent legal challenge to AI-mediated mental-health-adjacent interactions. Character.AI’s product is positioned as entertainment, not as mental-health care, but the user’s interactions with the platform were characterised by the suit as constituting de-facto emotional support that contributed to the suicide. The case is in litigation as of 2026; the eventual ruling will have substantial implications for AI products that produce parasocial emotional engagement, even when not positioned as mental-health products.
Companionship agents. A subset of the field positions AI as non-clinical emotional support — companionship for older adults, social practice for individuals with social anxiety, journaling-and-reflection support. Replika (founded 2017) was the early leader; Pi (released 2023 by Inflection AI) was a more-mature foundation-model entry. The category is not regulated as medical; the user-experience claims vary. Anthropic’s constitutional-AI methodology is designed to resist the strongest forms of parasocial dependency; other vendors are less clear on their design priorities. The category is a useful test case for the broader question of whether AI products have responsibilities toward user wellbeing that go beyond the products’ explicit positioning.
The Malaysian and Australian context. Mental-health AI deployment in both countries lags US deployment for two reasons: smaller venture-funded ecosystems, and conservative clinical-and-regulatory cultures that scrutinise mental-health AI more closely than is typical in the US. Australian-developed mental-health products (e.g., the Black Dog Institute’s research applications) emphasise integration with the existing professional-care pathway rather than standalone scalable deployment. Malaysian deployment is concentrated in academic-medical contexts (Universiti Malaya, Universiti Kebangsaan Malaysia) rather than via consumer-facing products.
7.10 Failures, biases, and ethical issues
The medical-AI failure record is substantial. Beyond Watson Health (Section 7.3) and the mental-health-AI nuances (Section 7.9), specific cases illustrate recurring failure modes.
Babylon Health. Babylon Health, founded 2013 in London, was for several years one of the most-funded health technology firms globally, with over USD 1 billion in venture capital and a 2021 SPAC listing at USD 4.2 billion valuation. Babylon’s product was an AI-powered telehealth platform; the marketing claim was that the AI could triage medical concerns at expert-physician quality, scaling primary care globally. The clinical evidence was contested from the start; a 2018 preprint (Razzaki et al.) reported performance equivalent to GPs on a self-administered MRCGP-style examination, but follow-up analyses (Fraser et al., 2018, in The Lancet; subsequent independent assessments) raised concerns about the evaluation methodology and the system’s actual triage performance. Babylon’s commercial model — including its high-profile NHS partnerships — suffered from documented patient-safety incidents and regulatory pressure. The company’s share price collapsed 99% from 2021 to 2023, and Babylon entered administration in August 2023, with assets sold to eMed (US). The Babylon case is a contemporary parallel to Watson Health: ambitious capability claims, marketing momentum that ran ahead of evidence, deployment pressure that surfaced safety problems, and an eventual structural collapse.
The Optum algorithm bias case. In October 2019 Obermeyer et al. published in Science a study demonstrating that a commercial care-management algorithm — used at scale across US health systems by Optum (a UnitedHealth subsidiary) — exhibited substantial racial bias. The algorithm identified patients who would benefit from care-management programmes by predicting future healthcare costs; Black patients with similar health status to white patients had lower predicted costs (because of disparate access to and use of healthcare), so the algorithm assigned them lower priority for care management. The bias was structural: the algorithm was technically working as designed (predicting costs), but the design assumption (that costs proxy for need) embedded the disparate-access disparity into care-management allocation. The paper estimated that correcting the bias would more than double the proportion of Black patients identified for additional support. The case became a textbook example for several reasons: the bias was substantial, the mechanism was well-documented, the affected population was large (the algorithm was applied to ~200 million Americans annually), and the corrective path was technically straightforward but commercially complex. Subsequent regulatory attention to healthcare algorithm bias (including the 2024 HHS final rule under Section 1557 of the ACA) traces to this paper.
Bias in dermatology AI. The Esteva et al. (2017) dermatology paper trained on largely Caucasian-skin images. Subsequent independent evaluation (Kamulegeya et al., 2019; Daneshjou et al., 2022) found that dermatology AI systems performed substantially worse on darker skin tones. The pattern is general across many medical-imaging applications: training data drawn from majority-Caucasian populations does not generalise to global skin colour, ethnicity-correlated anatomical variation, and disease-presentation differences across populations. The mitigation requires training-data diversification (which is expensive and slow) and rigorous external validation across populations (which much of the field has not historically done).
Privacy and data-protection concerns. Healthcare data is among the most-protected categories under privacy law (HIPAA in the US, GDPR in the EU, PDPA in Malaysia and Singapore, the Privacy Act 1988 in Australia). AI training data must be either de-identified (with re-identification risk addressed) or used under specific consent. Notable failure cases include the Google DeepMind–Royal Free London partnership (2015–2016), in which an Information Commissioner’s Office (ICO) investigation found that the partnership had transferred ~1.6 million NHS patient records to DeepMind without adequate legal basis (ICO ruling, July 2017). The case shaped subsequent UK and EU framing of healthcare AI partnerships. As foundation models are increasingly trained on web-scraped data of unclear provenance, the risk of inadvertent inclusion of medical records in training data becomes a parallel concern; regulators are beginning to address this in 2024–2026 guidance.
The “fairness” literature. A substantial academic literature on machine-learning fairness has emerged since the mid-2010s (Hardt et al., 2016; Mitchell et al., 2019; Selbst et al., 2019). The literature has identified multiple definitions of fairness (demographic parity, equalised odds, calibration) that are mutually incompatible in most realistic settings (Chouldechova, 2017); has documented the failure modes of well-intentioned fairness interventions; and has produced both technical mitigations (reweighting, adversarial debiasing) and procedural ones (fairness audits, demographic-stratified evaluation). For medical AI specifically, the practitioner consensus has converged on demographic-stratified evaluation as the minimum baseline: every clinical AI system should report performance across major demographic strata (race, ethnicity, sex, age) and the gaps between them should be explicit. Many published clinical-AI papers still do not meet this standard; this is changing slowly.
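As a concrete illustration of that minimum baseline, the sketch below computes the same metric per demographic stratum and reports the largest between-group gap; the labels, scores, and group names are synthetic.

```python
# Illustrative demographic-stratified evaluation on synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc(y_true, y_score, groups) -> dict:
    """AUC per demographic stratum, plus the largest between-group gap."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    aucs = {
        str(g): roc_auc_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    }
    aucs["max_gap"] = max(aucs.values()) - min(aucs.values())
    return aucs

rng = np.random.default_rng(1)
groups = rng.choice(["group_a", "group_b"], size=1000)
y_true = rng.integers(0, 2, size=1000)
# Synthetic scores that are deliberately less informative for group_b.
noise = np.where(groups == "group_b", 0.8, 0.3)
y_score = y_true + rng.normal(0, noise)
print(stratified_auc(y_true, y_score, groups))
```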
7.11 Southeast Asian and Australian regional context
The non-US, non-EU context for medical AI is structurally different. Healthcare-system structure (single-payer vs multi-payer; public vs private; hospital-led vs primary-care-led), regulatory maturity, language and cultural specificity, and venture-funding ecosystems all vary; AI deployment patterns vary correspondingly.
Malaysia. The Malaysian healthcare system combines a public sector (Ministry of Health-operated hospitals and clinics, which serve the majority of the population) with a substantial private sector (IHH, KPJ, Sunway, and others). AI deployment is concentrated in the private sector for clinical applications and in research-medical-centre contexts (UMMC, Hospital Kuala Lumpur, Hospital Universiti Kebangsaan Malaysia, IMU) for research applications. Specific deployments: AIA Malaysia (a major private health insurer) uses AI for claims-processing and fraud detection; Sunway Medical Centre has invested in AI-supported imaging and pathology; Pantai Hospital networks are rolling out ambient-scribe technology in 2024–2025. The Ministry of Health’s AI deployment is more cautious, reflecting public-system risk-aversion and the broader ecosystem-development strategy. The MDA (Section 7.7) regulates AI medical devices, with framework alignment to ASEAN-wide standards.
Singapore. Singapore has the most-developed regional AI healthcare deployment, partly because of dedicated government funding (under the Smart Nation initiative and AI Singapore programmes) and because of Singapore’s structural advantages (small geography, high-income population, integrated healthcare-IT infrastructure). The Synapxe agency (formerly IHiS) operates the national health-IT platform and has pursued AI-related infrastructure aggressively. SingHealth and the National University Health System operate institution-scale AI programmes. Singapore-developed AI products often launch domestically and then expand regionally; the small domestic market is itself the validation environment for the larger ASEAN target.
Indonesia. Indonesia’s healthcare AI is dominated by the consumer-telehealth wave: Halodoc (founded 2016) and Alodokter (founded 2014) both incorporated AI-powered triage and consultation features as the foundation-model wave matured. The deployment is at scale (Halodoc reports tens of millions of users) but is consumer-facing triage rather than diagnostic clinical AI. The Indonesian regulatory framework for AI medical devices is still developing; deployment proceeds under a consumer-services framework that does not require medical-device approval.
Australia. The Australian system combines Medicare (public), the Pharmaceutical Benefits Scheme, private health insurance, and a mix of public and private hospitals. AI deployment is concentrated at major academic medical centres (described in Section 7.8) and in primary care via Medicare-funded telehealth platforms. The TGA framework is mature; Australian-developed medical AI typically pursues TGA clearance first. The R&D Tax Incentive (Section 19.4.7) provides material funding support for early-stage Australian AI medical research. The University of Melbourne, Monash University, the University of Sydney, the University of Queensland, and the Garvan Institute are the major academic AI-medicine centres. Notable Australian-developed medical AI products include Annalise.ai (radiology, founded 2020 in Sydney; FDA-cleared 2024), Maxwell Plus (urology AI, Brisbane), and several smaller imaging-AI firms. Cross-pollination with US and UK academic networks is substantial; many Australian-trained medical AI researchers spend time in US or UK programmes before returning.
The AIA Malaysia case study. AIA Malaysia, the Malaysian arm of the AIA Group regional insurer, has developed one of the more-sophisticated AI deployments in the regional insurance-and-health context. The deployment focuses on three areas: claims fraud detection (using ML on historical claim patterns to flag anomalous submissions for human review), medical-record OCR-and-extraction (converting paper-based claim documentation into structured data), and customer-service automation (chatbots and email triage). The deployment has produced measurable benefits — AIA’s 2023 annual report cited claims-processing-time reductions of 35% and fraud-detection-rate improvements — and is a useful regional contrast to the US-focused descriptions that dominate the AI-in-healthcare literature. The IHH-Sunway-AIA-Halodoc constellation, taken together, sketches the Southeast Asian medical AI deployment pattern: heavy on operational efficiency, modest on clinical-decision-support, very modest on autonomous diagnostic AI, with regulatory pathways developing alongside rather than ahead of deployment.
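The claims-fraud component described above follows a common pattern: score each submission for anomaly relative to historical claim behaviour and route the most-anomalous tail to human review. The sketch below illustrates the pattern with an unsupervised isolation forest over entirely synthetic claim features; it is an illustration of the technique, not a description of AIA Malaysia’s actual pipeline.

```python
# Illustrative anomaly-scoring pass over synthetic claims features, of the
# kind used to flag submissions for manual review. Feature names, parameters,
# and the review-queue size are assumptions, not any insurer's actual system.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
n = 20_000
claims = pd.DataFrame({
    "claim_amount_myr": rng.lognormal(mean=7.5, sigma=0.8, size=n),
    "days_since_policy_start": rng.integers(1, 2_000, size=n),
    "claims_last_12m": rng.poisson(0.6, size=n),
    "provider_claim_volume": rng.lognormal(mean=4.0, sigma=1.0, size=n),
})

# Fit an unsupervised anomaly detector on historical claim patterns.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
model.fit(claims)

# Lower scores are more anomalous; route the bottom 1% to human review.
claims["anomaly_score"] = model.score_samples(claims)
review_queue = claims.nsmallest(int(0.01 * n), "anomaly_score")
print(f"{len(review_queue)} claims flagged for manual review")
```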
7.12 The 2026 frontier — clinical agents and the path ahead
The current period is characterised by foundation-model capability that has substantially exceeded clinical-deployment capacity. Med-PaLM 2, GPT-5, Claude Opus, and Gemini 3 perform at expert-physician level on standard benchmarks; the question is what to do with that capability. Three distinct frontiers are visible.
Frontier 1 — clinical agents at scale. Hippocratic AI’s Polaris and similar agent systems are positioned to handle low-acuity patient interactions autonomously. The bet is that the cost of human nurse-and-care-manager labour is high enough, and the supply is constrained enough, that AI handling of routine patient interactions will scale rapidly once the trust threshold is crossed. The clinical-evidence build-out in 2024–2026 has been substantial; large health-system pilots are underway. The trust threshold is not yet crossed for most use cases; the next 24 months will resolve whether the agent positioning translates into broad deployment or whether the limitations identified in early deployments (hallucination, escalation-pattern errors, scope drift) cap the deployment scale.
Frontier 2 — drug discovery production. AlphaFold 3 and the AI-native drug-discovery firms have produced molecular candidates at a rate that the pharmaceutical-development pipeline cannot yet absorb. The bottleneck is not at the discovery stage but at the development stage — animal-model studies, IND-enabling toxicology, Phase I dose-finding, Phase II efficacy, Phase III pivotal — which AI compresses minimally. The 2025–2030 question is whether the increased candidate flow translates into more approved drugs. Pharmaceutical analysts project that 5–15% of approved drugs in the 2030s will be AI-discovered; sceptics argue that the development bottleneck will limit AI’s net contribution to single-digit-percent improvements in productivity. The pharmaceutical industry’s response in 2024–2026 has been to invest substantially in both the AI-discovery side and in development-process AI (clinical-trial design, recruitment, data-analysis) that addresses the development bottleneck directly.
Frontier 3 — health-system integration of foundation models. The third frontier is less visible but potentially the most-consequential: foundation models embedded as infrastructure across the health system, supporting diverse applications from documentation to coding to research to administration. Microsoft, Google, and Amazon all have positioned themselves to be the infrastructure layer; Epic and Cerner have positioned themselves to be the integration layer. The pattern is becoming visible at major US health systems by 2024–2026: a single foundation-model platform (often Microsoft Azure OpenAI or AWS Bedrock) underlies dozens of specific-application uses. The infrastructure positioning is more durable than any single application, because the infrastructure compounds across applications. The economics are of cloud-platform-scale rather than medical-device-scale.
The trust-threshold dynamics. A common feature of the three frontiers is the role of trust threshold in determining deployment scale. Low-trust deployments (information retrieval, second-opinion suggestions, documentation) scale rapidly. High-trust deployments (autonomous diagnosis, autonomous treatment, autonomous patient communication) scale slowly. The trust threshold is partly technical (the system must perform reliably enough to deserve trust) and partly social (the institutional and individual willingness to trust the system, given the stakes). The technical side is improving faster than the social side, which means deployment-scale evolution lags capability evolution. The lag is not failure; it is the deployment time-constant of safety-critical infrastructure.
The data-flywheel-without-deployment problem. A specific concern for medical AI in 2026 is that the systems with the strongest production capability (the foundation models from OpenAI, Anthropic, Google) accumulate data through general consumer use, not through medical-deployment use. The medical-deployment data flywheel requires that AI is actually used in clinical settings, with outcomes captured. Until medical deployment is widespread, the medical data flywheel does not turn. The result is that clinical-AI-specific capabilities improve more slowly than general AI capabilities, even though the underlying technology is the same. The 2030s pattern may resolve this — as deployment scales, the medical-domain-specific data accumulates — but the 2024–2028 period is one where the gap between general-AI capability and medical-AI deployment is large and may widen.
7.13 Connection to econometric and statistical methodology
The clinical-AI literature is methodologically a subset of clinical-research methodology more broadly, with specific requirements that connect to the broader statistical and econometric frameworks this textbook draws on.
Randomised controlled trials (RCTs). The gold standard for clinical evaluation remains the RCT. For AI-augmented care, the RCT design requires randomising patients to standard-of-care or AI-augmented care, with prospective endpoints. Few clinical-AI studies have used pure RCT designs; most use stepped-wedge designs (in which the AI is rolled out across hospital units sequentially, allowing within-unit before-after comparison) or cluster-randomised designs (in which hospitals or units are randomised). The 2024 launch of the NEJM AI journal reflects a growing consensus that AI-medicine evaluation needs the same rigour as drug-and-device evaluation; the methodological norms are still being established.
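Stepped-wedge designs lend themselves to simulation before a rollout begins. The sketch below, under purely illustrative parameters (eight units, nine periods, a true effect of -3 on the outcome), simulates a staggered adoption schedule and recovers the treatment effect with a mixed model that includes period fixed effects and a unit-level random intercept.

```python
# A stepped-wedge simulation-and-analysis sketch. Eight hospital units adopt
# the AI tool one period apart; the true effect on the outcome (say,
# documentation minutes per encounter) is -3. All parameters are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
units, periods, patients_per_cell = 8, 9, 30
switch_period = {u: u + 1 for u in range(units)}    # unit u switches on at period u+1
unit_intercepts = rng.normal(0.0, 0.5, size=units)  # unit-level heterogeneity

rows = []
for u in range(units):
    for t in range(periods):
        treated = int(t >= switch_period[u])
        for _ in range(patients_per_cell):
            y = 30 + unit_intercepts[u] + 0.2 * t - 3.0 * treated + rng.normal(0, 4)
            rows.append({"unit": u, "period": t, "treated": treated, "y": y})
df = pd.DataFrame(rows)

# Mixed model: treatment indicator, period fixed effects, unit random intercept.
model = smf.mixedlm("y ~ treated + C(period)", df, groups=df["unit"])
print(model.fit().summary())
```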
Real-world evidence (RWE). The FDA’s Real-World Evidence framework (FDA, 2018, expanded since) provides a pathway for using post-deployment data to support regulatory and labelling decisions. RWE is methodologically softer than RCT — confounding, selection bias, and measurement error are larger — but is the only feasible source of evidence for many post-approval questions. AI-system RWE is particularly important for the continuously-learning systems (Section 7.7) that update during deployment; the evaluation infrastructure must monitor whether the system continues to perform as approved.
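For continuously-learning systems, the RWE obligation reduces in practice to performance monitoring against the level established at approval. A minimal monitoring loop of that kind is sketched below; the approval baseline, tolerance, window size, and simulated drift pattern are all assumptions made for illustration.

```python
# A minimal post-deployment performance-monitoring loop of the kind an RWE
# programme for a continuously-learning device needs. The approval baseline,
# tolerance, window size, and simulated drift are all assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

APPROVAL_AUC = 0.90   # performance established in the pivotal study
TOLERANCE = 0.03      # alert if rolling AUC falls more than this below it

def monitor(y_true, scores, window=500):
    """Yield (window_index, auc, alert) over consecutive deployment windows."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    for start in range(0, len(y_true) - window + 1, window):
        yt, ys = y_true[start:start + window], scores[start:start + window]
        if yt.min() == yt.max():        # AUC needs both classes present
            continue
        auc = roc_auc_score(yt, ys)
        yield start // window, auc, auc < APPROVAL_AUC - TOLERANCE

# Synthetic score stream whose quality degrades over time (distribution drift).
rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=5_000)
noise_sd = 0.3 + np.linspace(0.0, 0.8, 5_000)
s = np.clip(y + rng.normal(0.0, noise_sd), 0, 1)
for idx, auc, alert in monitor(y, s):
    print(f"window {idx:2d}  AUC={auc:.3f}  {'ALERT' if alert else 'ok'}")
```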
Propensity-score and other causal-inference approaches. Where an RCT is not feasible, observational study design with explicit causal-inference methodology is appropriate. Propensity-score matching, instrumental-variable analysis, regression-discontinuity design, and difference-in-differences are the standard tools. The methodology connects directly to the econometric work this textbook’s author has contributed to (the Privilege of the Draw RDD work; the dual-banking TVP-VAR/NARDL work). The clinical-AI evaluation literature increasingly draws on these methods, both to evaluate deployment effects and to adjust for the systematic biases (for instance, selection effects in who receives AI-augmented care) that observational designs introduce.
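To make the selection problem concrete, the sketch below simulates an observational comparison in which sicker patients are more likely to receive AI-augmented care, so the naive difference in outcomes is confounded. It estimates a propensity score and applies inverse-probability weighting, a close cousin of the matching approach named above; every parameter and variable name is synthetic.

```python
# Synthetic observational comparison: sicker patients are more likely to
# receive AI-augmented care, so the naive outcome difference is confounded.
# Propensity-score estimation plus inverse-probability weighting recovers the
# true effect (-0.5 days of stay). Every value here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 10_000
severity = rng.normal(0, 1, size=n)           # confounder: illness severity
teaching_site = rng.integers(0, 2, size=n)    # confounder: site type

# Selection into treatment depends on the confounders.
p_treat = 1 / (1 + np.exp(-(0.8 * severity + 0.7 * teaching_site - 0.5)))
treated = rng.binomial(1, p_treat)

# Outcome (length of stay); true treatment effect = -0.5.
outcome = (5 + 1.5 * severity + 0.4 * teaching_site
           - 0.5 * treated + rng.normal(0, 1, size=n))

# Estimate the propensity score and form inverse-probability weights (ATE).
X = np.column_stack([severity, teaching_site])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

ipw = (np.average(outcome[treated == 1], weights=w[treated == 1])
       - np.average(outcome[treated == 0], weights=w[treated == 0]))
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"naive difference: {naive:+.2f}   IPW estimate: {ipw:+.2f}   truth: -0.50")
```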
The statistical-rigour gap. A persistent concern in the clinical-AI literature is the gap between methodological best practice (rigorous RCT or carefully-controlled observational evaluation) and what most published papers actually do. Surveys (Liu et al., 2019, The Lancet Digital Health; Nagendran et al., 2020, BMJ) have found that the majority of clinical-AI papers fall short of the methodological reporting standards now codified in CONSORT-AI and SPIRIT-AI. The gap reflects partly the field’s youth and partly the publication-incentive structure (impressive-sounding papers with weak methodology are easier to publish than methodologically-rigorous papers with modest findings). The methodological infrastructure is improving: the NEJM AI launch, the CONSORT-AI and SPIRIT-AI reporting standards, and policy changes at major medical journals have begun to enforce higher standards. The improvement is welcome and overdue.
The connection to the broader econometric framework is direct. The clinical-AI literature is a domain application of the same methodology that financial-services econometrics uses (Chapter 6), that policy-evaluation econometrics uses, that the labour-economics literature uses. The specific clinical-AI questions — how does AI augmentation affect patient outcomes, controlling for confounding? what is the causal effect of AI deployment on resource utilisation? — are causal-inference questions that the broader econometric methodology directly addresses. Graduate-level work in clinical AI requires fluency in this methodology, not just in the AI-specific machine-learning techniques. The integration is one of the things that distinguishes clinical-AI work done at a research medical centre from clinical-AI work done at a consumer-AI startup.
References for this chapter
Foundational clinical AI history
- Buchanan, B. G. and Shortliffe, E. H. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
- Miller, R. A., Pople Jr., H. E., and Myers, J. D. (1982). INTERNIST-1, an experimental computer-based diagnostic consultant for general internal medicine. NEJM 307: 468–476.
- Yu, V. L., Buchanan, B. G., Shortliffe, E. H., et al. (1979). Evaluating the performance of a computer-based consultant. Computer Programs in Biomedicine 9(1): 95–102.
Medical imaging deep learning
- Esteva, A., Kuprel, B., Novoa, R. A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature 542: 115–118.
- Gulshan, V., Peng, L., Coram, M., et al. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22): 2402–2410.
- Rajpurkar, P., Irvin, J., Zhu, K., et al. (2017). CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225.
- McKinney, S. M., Sieniek, M., Godbole, V., et al. (2020). International evaluation of an AI system for breast cancer screening. Nature 577: 89–94.
- Haibe-Kains, B., Adam, G. A., Hosny, A., et al. (2020). Transparency and reproducibility in artificial intelligence. Nature 586: E14–E16.
- Beede, E., Baylor, E., Hersch, F., et al. (2020). A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. CHI 2020.
- Sayres, R., Taly, A., Rahimy, E., et al. (2019). Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology 126(4): 552–564.
- van Leeuwen, K. G., Schalekamp, S., Rutten, M. J. C. M., et al. (2021). Artificial intelligence in radiology: 100 commercially available products. European Radiology 31: 3797–3804.
Watson Health
- Strickland, E. (2019). IBM Watson, heal thyself: How IBM overpromised and underdelivered on AI health care. IEEE Spectrum 56(4): 24–31.
- Ross, C. and Swetlitz, I. (2017). IBM pitched its Watson supercomputer as a revolution in cancer care. It’s nowhere close. STAT News (5 September 2017).
- Iansiti, M. and Lakhani, K. R. (2020). Competing in the Age of AI. Harvard Business Review Press.
- Wang, F. and Topol, E. J. (2024). The future of medical AI: from healthcare to health care. Nature Medicine 30: 1247–1257.
AlphaFold and drug discovery
- Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589.
- Senior, A. W., Evans, R., Jumper, J., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature 577: 706–710.
- Abramson, J., Adler, J., Dunger, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630: 493–500.
- Scannell, J. W., Blanckley, A., Boldon, H., and Warrington, B. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery 11(3): 191–200.
- Stokes, J. M., Yang, K., Swanson, K., et al. (2020). A deep learning approach to antibiotic discovery. Cell 180(4): 688–702.
LLMs in medicine
- Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature 620: 172–180.
- Singhal, K., Tu, T., Gottweis, J., et al. (2023). Towards expert-level medical question answering with large language models. arXiv:2305.09617.
- Nori, H., King, N., McKinney, S. M., Carignan, D., and Horvitz, E. (2023). Capabilities of GPT-4 on medical challenge problems. arXiv:2303.13375.
- Topol, E. J. (2024). When AI gets it wrong: Perspectives on hallucination in clinical AI. NEJM AI 1(2).
- Lee, P., Goldberg, C., and Kohane, I. (2023). The AI Revolution in Medicine: GPT-4 and Beyond. Pearson.
Ambient clinical scribes
- Sinsky, C., Colligan, L., Li, L., et al. (2016). Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. Annals of Internal Medicine 165: 753–760.
- Shanafelt, T. D., West, C. P., Sinsky, C., et al. (2019). Changes in burnout and satisfaction with work-life integration in physicians and the general U.S. working population between 2011 and 2017. Mayo Clinic Proceedings 94: 1681–1694.
- Tierney, A. A., Gayre, G., Hoberman, B., et al. (2024). Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst Innovations in Care Delivery.
- Gellert, G. A., Gellert, G. L., Sutton, J., et al. (2024). The role of generative AI in addressing physician burnout. Health Affairs 43(8).
Bias and fairness
- Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464): 447–453.
- Kamulegeya, L. H., Bwire, M., Berman, S., et al. (2019). Using artificial intelligence on dermatology conditions in Uganda: A case for diversity in training data sets for machine learning. bioRxiv.
- Daneshjou, R., Vodrahalli, K., Novoa, R. A., et al. (2022). Disparities in dermatology AI performance on a diverse, curated clinical image set. Science Advances 8(31).
- Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. NeurIPS.
- Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). Model cards for model reporting. FAccT.
- Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5(2).
- Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., and Vertesi, J. (2019). Fairness and abstraction in sociotechnical systems. FAccT.
Mental health AI
- Fitzpatrick, K. K., Darcy, A., and Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health 4(2).
- Inkster, B., Sarda, S., and Subramanian, V. (2018). An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth and uHealth 6(11).
- Hunkin, H., King, D. L., and Zajac, I. T. (2023). Wearable devices, smartphone apps, and chatbots in the treatment of mental health disorders: Systematic review and meta-analysis. Journal of Medical Internet Research.
- Patel, V., Saxena, S., Lund, C., et al. (2018). The Lancet Commission on global mental health and sustainable development. The Lancet 392: 1553–1598.
Methodology and evaluation
- Liu, X., Faes, L., Kale, A. U., et al. (2019). A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. The Lancet Digital Health 1(6).
- Nagendran, M., Chen, Y., Lovejoy, C. A., et al. (2020). Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368.
- Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J., and Denniston, A. K. (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine 26: 1364–1374.
- Cruz Rivera, S., Liu, X., Chan, A. W., Denniston, A. K., and Calvert, M. J. (2020). Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine 26: 1351–1363.
- Topol, E. J. (2019). Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books.
- Rajkomar, A., Dean, J., and Kohane, I. (2019). Machine learning in medicine. NEJM 380: 1347–1358.
- Davenport, T. and Kalakota, R. (2019). The potential for artificial intelligence in healthcare. Future Healthcare Journal 6(2).
Regulatory frameworks
- US Food and Drug Administration (2021). Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. fda.gov.
- US Food and Drug Administration (2023, 2024). Predetermined Change Control Plans for Machine Learning-Enabled Device Software Functions: Guidance for Industry. fda.gov.
- European Commission (2017, 2022). Medical Device Regulation (EU) 2017/745.
- Therapeutic Goods Administration (2024). Regulatory changes for software-based medical devices. tga.gov.au.
- Medical Device Authority Malaysia (2024). Guidance for AI medical devices. mda.gov.my.
- Information Commissioner’s Office UK (2017). Royal Free — Google DeepMind trial failed to comply with data protection law. ICO ruling, 3 July 2017.
Cases and regional context
- Razzaki, S., Baker, A., Perov, Y., et al. (2018). A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis. arXiv:1806.10698.
- Fraser, H., Coiera, E., and Wong, D. (2018). Safety of patient-facing digital symptom checkers. The Lancet 392(10161).
- AIA Group (2023). Annual report.
- Synapxe (formerly IHiS) (2024). Healthcare AI deployment outlook for Singapore.
- Garcia v. Character Technologies, Inc. (2024). M.D. Fla., Case No. 6:24-cv-1903.