Chapter 2 — Five eras of business AI

A short history of business AI from Feigenbaum’s expert systems through the agentic enterprise, organised into five eras whose boundaries are drawn by the technology that defined the dominant deployments of the period.

Chapter overview

This chapter develops the five-era taxonomy with enough technical depth that a graduate student can identify why each era's defining technology unlocked its signature commercial deployments. The chapter is organised around the inflections — the specific moments at which the dominant technology changed — and the residue each era leaves behind. The thesis is that eras do not replace each other: each era's deployments persist as infrastructure, and the cumulative stack of business AI as of 2026 is a stratigraphy of all five.

Reading this chapter

The technical detail in this chapter (architecture descriptions, methodological notes, scaling laws) is not optional for graduate readers. Without it, the rest of the book reads as a chronicle rather than an analysis. Where mathematical detail is given, it is given at the level needed to understand what was new about the era’s signature deployment, not at the level needed to reproduce it.

A timeline at a glance

gantt
    title Five eras of business AI
    dateFormat  YYYY
    axisFormat  %Y
    section I. Pre-ML
    DENDRAL                :1965, 1972
    MYCIN                  :1972, 1979
    XCON / R1              :1980, 1990
    section II. Statistical ML
    FICO Falcon            :1992, 2005
    Deep Blue (IBM)        :1997, 1998
    Amazon CF              :1998, 2010
    Netflix Prize          :2006, 2009
    section III. Deep learning
    AlexNet (ImageNet)     :2012, 2015
    AlphaGo                :2016, 2017
    AlphaGo Zero           :2017, 2018
    BERT in Search         :2019, 2022
    section IV. Transformer / LLM
    Transformer paper      :2017, 2018
    GPT-3                  :2020, 2022
    ChatGPT (Nov 2022)     :2022, 2024
    GPT-4o                 :2024, 2025
    section V. Agentic
    AutoGPT                :2023, 2024
    Computer Use           :2024, 2025
    DeepSeek-R1            :2025, 2026
    Operator               :2025, 2026
Figure 2.1: Five eras of business AI, with selected milestones.

Era I — The pre-ML era (1950s–1990s)

The pre-ML era established the architectural separation of knowledge from inference that still organises regulated AI today. It is the era of symbolic AI — systems that manipulated explicit symbols according to explicit rules, in contrast to the statistical and connectionist systems that followed.

DENDRAL and the birth of knowledge engineering

Feigenbaum (1977) founded what came to be called knowledge engineering with the DENDRAL project at Stanford (1965–1980; Lindsay et al. (1980)). DENDRAL’s task was mass-spectrometry interpretation: given the mass spectrum of an organic molecule of known atomic composition, infer its structure. The technical innovation was the plan-generate-test architecture: a planner used heuristic rules (encoding chemical constraints) to narrow the candidate-structure space; a generator enumerated remaining candidates; a tester scored each candidate’s predicted mass spectrum against the observed one.
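The control flow is compact enough to sketch in code. A minimal, illustrative Python version of plan-generate-test follows (toy candidate space; `constraints` and `predict` are stand-ins for the chemist-supplied heuristics and the spectrum predictor, not DENDRAL's actual rules):

```python
# Plan-generate-test in the DENDRAL style (illustrative toy, not the real system).
# Candidates stand in for molecular structures; constraints stand in for the
# chemist-supplied heuristic rules that prune the combinatorial space.
from itertools import permutations

def plan_and_generate(atoms, constraints):
    """Planner + generator: enumerate structures, keeping only those that
    satisfy every domain heuristic (aggressive pruning of the search space)."""
    for candidate in permutations(atoms):
        if all(rule(candidate) for rule in constraints):
            yield candidate

def test(candidate, observed, predict):
    """Tester: score a candidate's predicted spectrum against the observed one."""
    return sum(abs(p - o) for p, o in zip(predict(candidate), observed))

# Toy demo: "atoms" are labels, and the "spectrum" is just each label's length.
atoms = ("C", "H", "OH", "CH3")
constraints = [lambda c: c[0] == "C"]        # heuristic: backbone starts with carbon
predict = lambda c: [len(a) for a in c]
observed = [1, 2, 1, 3]                      # spectrum of the true structure

best = min(plan_and_generate(atoms, constraints),
           key=lambda c: test(c, observed, predict))
print(best)                                  # ('C', 'OH', 'H', 'CH3'): score 0
```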

Three architectural ideas from DENDRAL persist in modern systems:

  1. Knowledge as data, not code. The chemical heuristics were stored as declarative rules, separate from the inference engine that applied them. This made the system extensible by domain experts without programmer intervention — though, as XCON would later show, rule bases extended this way still required engineers to audit and reconcile the result.
  2. Constraint-based search. The combinatorial explosion of candidate molecular structures was managed by aggressive pruning using domain heuristics. Modern constraint-programming and SAT-solving systems share the same intellectual lineage.
  3. The expert as the bottleneck. Building DENDRAL required years of iterative collaboration between the domain experts (Joshua Lederberg, the Nobel-laureate geneticist, and the chemist Carl Djerassi) and the AI team. Feigenbaum named this the knowledge acquisition bottleneck — and identified it as the operational ceiling on expert systems generally.

MYCIN: the medical-AI reference architecture

Shortliffe and Buchanan (1975) and Buchanan and Shortliffe (1984) developed MYCIN at Stanford (1972–1979) for bacterial-infection diagnosis and antibiotic recommendation. MYCIN’s architecture became the template for subsequent rule-based expert systems:

flowchart LR
    KB["Knowledge base<br/>~600 rules<br/>antimicrobial therapy<br/>encoded as IF-THEN"]
    WM["Working memory<br/>patient facts<br/>current case data"]
    IE["Inference engine<br/>backward chaining"]
    UI["User interface<br/>question/answer<br/>natural language"]
    EXPL["Explanation module<br/>'WHY?' / 'HOW?'<br/>trace rule chains"]

    UI --> WM
    KB --> IE
    WM --> IE
    IE --> UI
    IE --> EXPL
    EXPL --> UI

    style KB fill:#e8f3fa,stroke:#006DAE
    style IE fill:#fdf3e7,stroke:#d97706
    style EXPL fill:#e9f5ec,stroke:#059669
Figure 2.2: MYCIN’s rule-based expert system architecture.

Three technical contributions are worth memorising:

  1. Backward chaining. Given a goal (e.g., “is the organism Streptococcus?”), MYCIN searched for rules whose conclusions matched the goal, then recursively backward-chained on each rule’s premises until reaching directly-observable facts (the patient’s symptoms, lab results) or unanswerable questions (which it asked the user). This contrasts with forward-chaining systems like XCON.
  2. Certainty factors. Each rule had an associated certainty factor (CF) in \([-1, +1]\), intended to capture a human expert’s degree of belief. When two rules concluded for the same hypothesis, their CFs were combined using the Shortliffe and Buchanan (1975) algebra: \[ CF_{\text{comb}}(CF_1, CF_2) = \begin{cases} CF_1 + CF_2 - CF_1 \cdot CF_2 & \text{both} \geq 0 \\ CF_1 + CF_2 + CF_1 \cdot CF_2 & \text{both} < 0 \\ \frac{CF_1 + CF_2}{1 - \min(|CF_1|, |CF_2|)} & \text{otherwise} \end{cases} \] This was a deliberate departure from full Bayesian inference, motivated by the difficulty of eliciting joint priors from experts. The Bayesian network literature would later supersede certainty factors for principled reasons; MYCIN’s CFs are remembered as a pragmatic compromise that worked. (A minimal sketch combining backward chaining with this algebra follows the list.)
  3. Explanation. MYCIN could answer “WHY?” (why are you asking this question?) by displaying the rule it was trying to satisfy, and “HOW?” (how did you reach that conclusion?) by tracing the rule chain backward. This explainability has rarely been matched by modern deep-learning systems and is part of why rule-based architectures persist in regulated domains.
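A minimal sketch combining backward chaining with the CF algebra above, on an invented two-rule toy knowledge base (the organism and premises are illustrative, not MYCIN's actual rules):

```python
# Backward chaining with MYCIN-style certainty factors (illustrative toy).
facts = {"gram_positive": 0.9, "grows_in_chains": 0.7, "catalase_negative": 0.8}
rules = {  # goal -> list of (premises, rule CF)
    "streptococcus": [(["gram_positive", "grows_in_chains"], 0.8),
                      (["catalase_negative"], 0.6)],
}

def combine(cf1, cf2):
    """MYCIN's algebra for two rules concluding for the same hypothesis."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

def prove(goal):
    """Backward-chain from a goal to observable facts, propagating CFs upward."""
    if goal in facts:
        return facts[goal]
    cf = 0.0
    for premises, rule_cf in rules.get(goal, []):     # no rule: MYCIN asks the user
        premise_cf = min(prove(p) for p in premises)  # conjunction = min, as in MYCIN
        cf = combine(cf, rule_cf * max(premise_cf, 0.0))
    return cf

print(f"{prove('streptococcus'):.3f}")  # combine(0.8*0.7, 0.6*0.8) = 0.771
```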

MYCIN performed at expert-physician level on bacterial diagnosis in formal evaluations, but was never deployed clinically owing to liability concerns and the practical impossibility of integrating it with the workflow of a busy infectious-disease ward. This pattern — technical adequacy blocked by institutional risk and workflow misfit — recurs throughout the book; it is arguably the canonical lesson of the entire pre-ML era.

XCON: the canonical commercial expert system

The era’s commercial flagship was John McDermott’s XCON / R1 at Digital Equipment Corporation. Operational from 1980, XCON configured DEC VAX minicomputer systems, ensuring that the components ordered by a sales team were technically compatible and complete. The system was built in OPS5 (a forward-chaining production-rule language) and grew to roughly 2,500 rules processing 80,000 orders annually. By 1986, DEC’s internal estimate of XCON’s annual savings was approximately $25 million.

The case is instructive for two reasons:

  1. The economic case worked. A $25M annual saving against a relatively small build-and-maintenance cost was, in net-present-value terms, an attractive ROI by any standard then or now.
  2. The maintenance cost was substantial. By the late 1980s, XCON’s rule base had grown to a size where maintenance required a team of roughly eight knowledge engineers permanently. Every new VAX model required new rules; every promotion or product-line reorganisation required rule edits; cross-rule conflicts emerged unpredictably and required expert resolution. The rule base behaved like a large, badly-modularised software system rather than a clean knowledge representation.

XCON’s experience taught two generalisable lessons that recur in modern AI deployment: knowledge bases require ongoing maintenance investment at a level often underestimated when the project is scoped, and knowledge representations work best when there is a stable underlying domain (DEC’s product line was relatively stable) — a condition that fails in many commercial contexts.

Other systems worth knowing

  • INTERNIST-1 / CADUCEUS (Pittsburgh, 1974–): internal-medicine diagnosis. Larger than MYCIN (~500 diseases, ~3,500 manifestations) but less widely deployed.
  • PROSPECTOR (SRI, 1979–): mineral exploration. Famously credited with identifying a porphyry molybdenum deposit at Mount Tolman in Washington — though the historical accuracy of this attribution is disputed.
  • Symbolics LISP machines (1980–1990s): specialised hardware for running symbolic AI. The collapse of the Symbolics market in 1987–1988 is often cited as the trigger of the second AI winter.

The two AI winters

The era ended in two AI winters: roughly 1974–1980 (triggered by the Lighthill Report’s pessimistic UK assessment and DARPA’s funding pullback in the US) and 1987–1993 (triggered by the LISP machine market collapse and the failure of expert systems to scale). The first winter killed the early symbolic-AI optimism associated with the General Problem Solver and similar projects; the second killed the commercial expert-systems industry.

The lessons learnt — quietly — were that AI’s economic return was bottlenecked by the same complementary intangibles we identified in Chapter 1 (workflow integration, ongoing maintenance, expert-knowledge engineering bandwidth) rather than by raw capability. The lessons were soon forgotten, then re-learnt, then forgotten again.

Adjacent statistical work

Adjacent statistical work proceeded quietly through the pre-ML era. Peter Keen and Michael Scott Morton at MIT Sloan articulated decision-support systems in the 1970s. Fair, Isaac & Co. introduced the general-purpose FICO score in 1989, perhaps the single most successful pre-ML statistical risk model ever deployed; it is now used to underwrite a substantial fraction of consumer credit decisions globally. The credit-scoring literature was statistical (logistic regression, discriminant analysis) rather than AI in the symbolic sense, but it set the template for subsequent ML deployments in finance.

Era II — The statistical machine learning era (1990s–early 2010s)

The shift from symbolic to statistical AI produced general-purpose classifiers that could be applied across domains without the knowledge-engineering bottleneck. The era’s intellectual peak was Vladimir Vapnik’s statistical learning theory; its commercial peak was the deployment of classical ML in fraud detection, recommendation, and search.

Statistical learning theory

Vapnik (1995) systematised what came to be called statistical learning theory: a framework for thinking about generalisation as a function of model complexity, sample size, and noise. Two ideas from Vapnik’s framework are necessary for understanding what came after.

First, the Vapnik–Chervonenkis dimension (VC dimension) measures the complexity of a hypothesis class as the largest number of points it can shatter: the largest \(d\) for which some arrangement of \(d\) points can be correctly classified under every one of the \(2^d\) possible labellings. For a hypothesis class with VC dimension \(d\) and \(n\) training examples, the generalisation error is bounded with high probability by

\[ \text{err}_{\text{test}} \leq \text{err}_{\text{train}} + O\left(\sqrt{\frac{d \log n}{n}}\right). \]

This inequality formalised the intuition that high-capacity models risk overfitting — and motivated the bias-variance tradeoff that organises classical ML practice.

Second, the structural risk minimisation principle says: choose the hypothesis class that minimises the upper bound, not the one that minimises training error. This is the formal justification for regularisation (ridge, lasso) and for techniques like cross-validation.
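The bound's practical content is easy to see numerically. A minimal sketch of SRM-style model selection, with hypothetical training errors and VC dimensions and the constants inside the \(O(\cdot)\) dropped:

```python
import math

def vc_bound(train_err, d, n):
    """Upper bound on test error: training error + complexity penalty
    (constants inside the O(.) dropped, so the bound is loose)."""
    return train_err + math.sqrt(d * math.log(n) / n)

n = 10_000
# Hypothetical (training error, VC dimension) pairs: richer classes fit the
# training data better but pay a larger complexity penalty.
classes = {"linear": (0.12, 50), "quadratic": (0.08, 500), "high-order": (0.02, 20_000)}

for name, (err, d) in classes.items():
    print(f"{name:10s} bound = {vc_bound(err, d, n):.3f}")
# SRM picks the class with the smallest *bound*, not the smallest training error.
```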

Support vector machines and the kernel trick

Cortes and Vapnik (1995) introduced support vector machines (SVMs) as the practical realisation of structural risk minimisation. An SVM finds the hyperplane that maximally separates two classes in feature space; the kernel trick (replacing the inner product \(x_i \cdot x_j\) with a kernel function \(K(x_i, x_j)\)) lets the same algorithm operate in high- or infinite-dimensional implicit feature spaces without ever computing the embedding directly.

For text classification, polynomial and Gaussian RBF kernels gave SVMs a decade of dominance through the 2000s. Spam filters, sentiment classifiers, document categorisers, and bioinformatics classifiers were SVM-based by default. Even today, sklearn’s SVC is a sensible first model for many small-data problems.
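As a concrete instance of the default-model claim, a minimal scikit-learn sketch, with synthetic data standing in for a real small-data corpus:

```python
# RBF-kernel SVM as the "sensible first model" for small-data classification.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The kernel trick: SVC(kernel="rbf") operates in an implicit
# infinite-dimensional feature space without computing the embedding.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```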

Random forests and ensembles

Breiman (2001) introduced random forests: ensembles of decision trees grown with two sources of randomness (bootstrapped training data per tree; random feature subset at each split). The intuition: averaging many high-variance, low-bias predictors reduces variance without increasing bias, provided the predictors’ errors are sufficiently uncorrelated.

Random forests dominated tabular-data classification through the 2000s and 2010s. Gradient-boosted trees (XGBoost, LightGBM, CatBoost) extended the ensemble idea to additive models in the 2010s and remain — even in 2026 — the workhorses of credit scoring, churn prediction, and similar tabular-data problems.
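The two sources of randomness map directly onto estimator parameters. A minimal scikit-learn sketch, again on synthetic tabular data:

```python
# Random forest: bootstrap sampling per tree + random feature subset per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=30, n_informative=10,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,     # many high-variance trees, averaged
    bootstrap=True,       # randomness source 1: bootstrapped training data per tree
    max_features="sqrt",  # randomness source 2: random feature subset at each split
    oob_score=True,       # out-of-bag estimate: free validation from the bootstrap
    random_state=0,
).fit(X, y)

print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
```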

LSTM and the sequence-modelling revolution

Hochreiter and Schmidhuber (1997) introduced the long short-term memory (LSTM) cell to address the vanishing-gradient problem in recurrent neural networks. The architectural innovation was the gated cell: forget, input, and output gates that together let the cell maintain information across long sequences without gradient decay.

LSTMs powered Google’s machine translation, Amazon’s speech recognition, and a wide range of sequence-modelling tasks through the 2010s. They were superseded by transformers after 2017 — but the intuition that learned gating mechanisms manage information flow recurs in modern architectures.
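The gating arithmetic is compact enough to write out in full. A minimal single-step NumPy sketch (random weights, toy dimensions, no training) showing how the gates regulate the cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of all four gate blocks."""
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g                         # additive cell-state path: this is
    h = o * np.tanh(c)                             # what lets gradients survive long sequences
    return h, c

d, dx = 8, 4                                       # toy hidden and input sizes
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * d, dx)), rng.normal(size=(4 * d, d)), np.zeros(4 * d)
h = c = np.zeros(d)
for x in rng.normal(size=(5, dx)):                 # a five-token toy sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)                            # (8,) (8,)
```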

Three commercial milestones

| System | Year | Why it matters |
| --- | --- | --- |
| FICO Falcon Fraud Manager | 1992 | Scored most of the world’s payment-card transactions on neural networks well before deep learning was respectable; processed every credit-card transaction at participating banks in under 100 ms. Still in production. |
| Amazon item-to-item collaborative filtering | 1998 (paper: Linden, Smith, and York, 2003) | Demonstrated that item-item similarity (precomputed offline) scales to massive product catalogues better than user-user similarity. The template for modern personalisation; cited in virtually every subsequent recommender-system paper. |
| Netflix Prize | 2006–2009 | A $1M open competition to improve Netflix’s recommender by 10% in RMSE. Won by BellKor’s Pragmatic Chaos in September 2009 with a 10.06% improvement, using matrix factorisation (Koren, Bell, and Volinsky, 2009) combined with gradient-boosted residuals. Popularised matrix factorisation, ensembling, and open data competitions. |

Two further milestones from this era still drive enterprise value:

  • Google PageRank (Brin and Page, 1998): the eigenvector-centrality computation on the web’s hyperlink graph that founded Google. The algorithm computes the dominant eigenvector of the (column-stochastic) hyperlink-transition matrix; equivalently, the stationary distribution of a random walk on the web. A power-iteration sketch follows this list.
  • Google Search ad ranking (Quality Score, 2002): an early commercial deployment of a learned ranking model. By the 2010s this was a $100B+ revenue stream, almost entirely AI-mediated; by 2026, it is a $200B+ stream.
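The computation reduces to power iteration on the damped link matrix. A minimal NumPy sketch on an invented four-page web, using the damping factor 0.85 from the original paper:

```python
import numpy as np

# Column-stochastic link matrix: M[i, j] = 1/outdegree(j) if page j links to page i.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # toy web: page -> pages it links to
n = 4
M = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

d = 0.85                                         # damping factor
r = np.full(n, 1.0 / n)                          # start from the uniform distribution
for _ in range(100):                             # power iteration
    r = (1 - d) / n + d * (M @ r)                # random surfer: teleport + follow links

print(r)  # stationary distribution of the random walk = PageRank scores
```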

Deep Blue and the symbolic counterpoint

The era’s most-publicised AI moment was IBM Deep Blue’s 3½–2½ defeat of Garry Kasparov on 11 May 1997. Deep Blue was a brute-force symbolic system — alpha-beta search over chess-position game trees, accelerated by custom hardware — rather than a statistical learner. Its victory was the public proof that “computers beat humans” at a benchmark task, but the methodological lineage was symbolic AI, not the statistical-ML tradition that came to dominate the era. The case is a useful reminder that public AI moments and dominant research paradigms can diverge.

Why deep learning was dormant

The era’s puzzle is why deep learning — proposed in essentials by McCulloch and Pitts in 1943, with backpropagation refined by Rumelhart, Hinton, and Williams in 1986 — remained dormant from the 1990s through the early 2010s. Three reasons recur in retrospective analyses:

  1. Compute. Training a 7-layer neural network on a non-trivial task in 1995 required weeks on the best workstations available. The same training run took hours on a 2012 GPU.
  2. Data. Effective deep learning requires labelled datasets at scales unavailable before web-scale data collection became practical.
  3. Initialisation and optimisation. Vanishing gradients made training networks deeper than 3–4 layers reliably hard. The Hinton–Salakhutdinov pretraining paper (Hinton and Salakhutdinov, 2006) partially addressed this; the breakthrough came with Krizhevsky, Sutskever, and Hinton (2012)’s combination of ReLU activations, GPU training, dropout regularisation, and the much-larger ImageNet dataset.

Era III — The deep learning revolution (2012–2022)

The era has a precise inflection point: Krizhevsky, Sutskever, and Hinton’s AlexNet won ImageNet on 30 September 2012 (Krizhevsky, Sutskever, and Hinton, 2012), with top-5 error of 15.3% versus 26.2% for the runner-up. The 10.9-percentage-point gap was unprecedented in ImageNet’s history and immediately ended the dominance of hand-engineered visual features.

What was new about AlexNet

AlexNet’s architecture combined five convolutional layers, three fully-connected layers, and a 1000-way softmax output, with roughly 60 million parameters. Three innovations made it work:

  1. ReLU activations instead of sigmoid/tanh. Rectified linear units, \(f(x) = \max(0, x)\), do not saturate and so do not suffer from vanishing gradients in deep networks.
  2. GPU training. AlexNet was trained on two NVIDIA GTX 580s for roughly 5–6 days. The same training run on a CPU would have taken weeks to months.
  3. Dropout regularisation. Randomly setting 50% of activations to zero during training prevents co-adaptation among neurons and acts as an ensemble-averaging regulariser.

The DNNresearch team (Krizhevsky, Sutskever, Hinton) was acquired by Google in March 2013 for a reported $44M. The trio went on to receive the 2018 Turing Award; Hinton shared the 2024 Nobel Prize in Physics with John Hopfield.

The methodological substrate

The era produced the methodological substrate for everything that followed:

| Innovation | Year | Why it matters |
| --- | --- | --- |
| Word2Vec (Mikolov et al., 2013) | 2013 | Distributed word representations from local co-occurrence; the conceptual ancestor of contextual embeddings. |
| GANs (Goodfellow et al., 2014) | 2014 | Generative adversarial networks; the founding paper of modern generative imaging. |
| seq2seq learning (Sutskever, Vinyals, and Le, 2014) | 2014 | Encoder-decoder LSTMs for machine translation; the template that transformers later replaced. |
| ResNet (He et al., 2016) | 2016 | Skip connections that enable training of very deep networks (152 layers in the original). The single most-cited image-recognition paper. |

Each of these is worth knowing in technical detail. ResNet’s residual block implements

\[ y = F(x; \{W_i\}) + x \]

where \(F(\cdot)\) is a small subnet (typically two conv layers). The skip connection \(+x\) ensures that gradients can flow back to early layers without vanishing, even when the network has 100+ layers.
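A minimal NumPy sketch of the residual computation, with dense layers standing in for the block's two conv layers:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = F(x; {W_i}) + x: the skip connection adds the input back unchanged,
    so gradients can flow to early layers through the +x path."""
    return relu(W2 @ relu(W1 @ x) + x)

d = 64
rng = np.random.default_rng(0)
# Small-variance init: early in training F(x) is near zero, so each block
# starts close to the identity function.
W1 = rng.normal(scale=0.01, size=(d, d))
W2 = rng.normal(scale=0.01, size=(d, d))

x = rng.normal(size=d)
y = residual_block(x, W1, W2)
print(np.linalg.norm(y - relu(x)))  # tiny: the block is near-identity at init
```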

AlphaGo and the reinforcement-learning peak

AlphaGo, described in Silver et al. (2016), defeated Lee Sedol 4–1 in Seoul over 9–15 March 2016. The system combined three components:

  1. A policy network trained on 30 million human Go moves to predict the next move probability distribution.
  2. A value network trained on game outcomes to predict the probability of winning from a given position.
  3. Monte Carlo Tree Search (MCTS) that used the policy and value networks to focus its search on plausible lines.

AlphaGo’s most-discussed moment was move 37 in game 2, a move so unconventional that human commentators initially called it a mistake; it was later identified as the decisive move of the game. The case is canonical because it illustrates that deep learning can produce strategies humans would not — not just faster execution of human-known strategies.

Silver et al. (2017) then reported AlphaGo Zero, which surpassed AlphaGo with self-play alone in 40 days, learning Go from scratch without human game data. The architectural simplification — a single network trained by self-play and reinforcement — became the template for AlphaZero (chess, shogi, Go) and its successors.

Commercial deployments

Commercial deep-learning deployments followed:

  • Google Smart Reply (November 2015): an LSTM-based suggested-reply system, launched in Inbox by Gmail and later in Gmail itself. Within a year, ~10% of replies sent through the app were Smart Reply suggestions.
  • Tesla Autopilot (October 2015): a CNN-based driver-assist system. The architecture has since been redesigned multiple times (HydraNet, end-to-end, FSD v12); the original deployment is the public marker for production deep learning in safety-critical autonomous systems.
  • Spotify Discover Weekly (July 2015): collaborative filtering augmented with audio analysis (CNN-based content embedding).
  • BERT-powered Google Search (October 2019): bidirectional transformer pre-training for search query understanding. By late 2019, used in roughly 10% of US English Search queries; by mid-2020, “virtually all” queries (Google has never published the precise revenue impact, but it is reasonably estimated at tens of billions of dollars annually).

IBM Watson Health: the era’s most expensive cautionary tale

IBM Watson’s Jeopardy! victory (14–16 February 2011) brought NLP into public consciousness — Watson defeated Ken Jennings and Brad Rutter $77,147 to $24,000 and $21,600. IBM subsequently committed substantial investment to Watson Health, including a 2012 partnership with MD Anderson Cancer Center.

By 2017, internal MD Anderson reviews documented that Watson had “made unsafe and incorrect recommendations.” A series of similar partnerships (Mayo Clinic, Memorial Sloan Kettering, NHS) failed to scale. By 2022, IBM sold Watson Health to Francisco Partners for approximately $1B, having invested an estimated $4–5B in acquisitions and development. The case is the era’s canonical reminder that demonstration capability does not equal deployment capability; we develop the failure mode in detail in Chapter 7.

Era IV — Transformers and large language models (2017–2024)

The era began with one paper: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017 (Vaswani et al., 2017), now cited over 173,000 times. The architecture dispensed with recurrence and convolution in favour of multi-head self-attention, parallelising training and enabling the scale-by-compute regime that followed.

Self-attention

The core operation in a transformer is scaled dot-product attention. Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{n \times d_k}\), and values \(V \in \mathbb{R}^{n \times d_v}\) (each row a token’s representation), the attention output is

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V. \]

The softmax produces a row-stochastic matrix of attention weights; each output row is a convex combination of the value rows, weighted by query-key similarity. The \(\sqrt{d_k}\) scaling stabilises gradients in high-dimensional spaces.
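The whole operation is a few lines of NumPy. A minimal single-head sketch with random projections and toy dimensions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-stochastic attention matrix
    return weights @ V                              # convex combinations of value rows

n, d_k, d_v = 6, 16, 16                             # toy sequence length and head sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 32))                        # token representations
Wq, Wk, Wv = (rng.normal(size=(32, d)) for d in (d_k, d_k, d_v))

out = attention(X @ Wq, X @ Wk, X @ Wv)             # one attention head
print(out.shape)                                    # (6, 16): one output row per token
```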

Multi-head attention runs \(h\) such operations in parallel, with different linear projections of \(Q, K, V\), then concatenates and re-projects. This lets the model attend to different aspects of the input simultaneously — syntactic, semantic, positional — without dedicating any one attention head to all of them.

The transformer’s parallelism is the architectural gift that made the scale-by-compute regime possible: where LSTMs must process tokens sequentially, transformers process all tokens of a sequence simultaneously, exploiting GPU/TPU parallelism almost fully.

The papers that defined the era

| Paper | Year | Why it matters |
| --- | --- | --- |
| Devlin et al. (2019) (BERT) | 2018 | Bidirectional pre-training using masked-language modelling + next-sentence prediction. Powered Google Search ranking from 2019. |
| Radford et al. (2019) (GPT-2) | 2019 | OpenAI’s first “too dangerous to release” model; established generative pre-training (next-token prediction) as a paradigm. |
| Brown et al. (2020) (GPT-3) | 2020 | 175B parameters; established in-context few-shot learning — the model could perform new tasks given only examples in the prompt, without weight updates. |
| Kaplan et al. (2020) (scaling laws) | 2020 | First systematic empirical study of how loss scales with parameters, data, and compute. Identified power-law relationships that motivated training larger models. |
| Hoffmann et al. (2022) (Chinchilla) | 2022 | Reset compute-optimal training: for a fixed compute budget, smaller models trained on more data outperform larger models trained on less. Approximately 20 tokens per parameter. |
| Ouyang et al. (2022) (InstructGPT) | 2022 | Demonstrated that RLHF — reinforcement learning from human feedback, building on Christiano et al. (2017) — let a 1.3B aligned model outperform a raw 175B model on user-rated quality. |
| Bai et al. (2022) (Constitutional AI) | 2022 | Anthropic’s method for using AI feedback rather than human feedback for harmlessness training. |
| Wei et al. (2022) (chain of thought) | 2022 | Empirical demonstration that prompting LLMs to “think step by step” elicits reasoning that improves accuracy on multi-step tasks. |

Scaling laws

Kaplan et al. (2020) found that test loss \(L\) scales as a power law in three independent factors — model size \(N\) (parameters), dataset size \(D\) (tokens), and compute \(C\):

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C} \]

with empirically estimated exponents \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\). These exponents are small — a tenfold increase in compute lowers loss by only about 11%, and halving the loss requires roughly a millionfold more compute — but the relationship is remarkably consistent across orders of magnitude.
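The exponent arithmetic is worth doing once explicitly (it is also Exercise 5 below). A minimal sketch:

```python
# Loss scaling with compute, using the Kaplan et al. (2020) exponent alpha_C.
alpha_C = 0.050                            # L(C) proportional to C^(-alpha_C)

loss_ratio_10x = 10 ** (-alpha_C)          # relative loss after a 10x compute increase
print(f"10x compute -> loss falls to {loss_ratio_10x:.3f} of its value")   # ~0.891

compute_to_halve = 2 ** (1 / alpha_C)      # solve (C'/C)^(-alpha_C) = 1/2 for C'/C
print(f"halving loss needs ~{compute_to_halve:.2e}x more compute")         # ~1.05e+06
```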

Hoffmann et al. (2022) corrected an over-emphasis on parameter count in the original Kaplan paper, showing that for a fixed compute budget, optimal training pairs \(N\) and \(D\) in proportion (approximately 20 tokens per parameter). Models trained on too little data (such as GPT-3, trained on roughly 300 billion tokens, under two tokens per parameter) were “compute-suboptimal.” This finding is the reason why every frontier model from 2023 onward is trained with far more data per parameter than GPT-3.

The ChatGPT inflection

⚡ The ChatGPT moment

ChatGPT launched on 30 November 2022, reaching one million users in five days and an estimated 100 million monthly active users by January 2023 — the fastest-growing consumer application in history at the time. GPT-4 followed on 14 March 2023 (Achiam et al., 2023), scoring at the 90th percentile on the bar exam and passing the USMLE.

The economic significance of ChatGPT is not that it was the most capable LLM in November 2022 — it was not — but that it was the first chat-shaped LLM. The interface change (multi-turn dialogue, with system instructions and user/assistant turns) made the underlying capability accessible to a vastly larger user base than the prior text-completion API.

Enterprise infrastructure crystallised rapidly: ChatGPT Enterprise launched on 28 August 2023, and Microsoft 365 Copilot reached general availability on 1 November 2023 at $30/user/month. (The value pool was not new: JPMorgan’s pre-LLM COiN system, deployed in 2017, was canonically credited with eliminating 360,000 lawyer-hours per year by reading 12,000 commercial credit agreements.)

The 2024 multimodal turn

The 2024 frontier models were materially different from their 2022 predecessors in being natively multimodal. GPT-4o (May 2024) handled vision, audio, and text in a single model with sub-second voice latency. Gemini 1.5 Pro (February 2024) introduced the long-context regime — 1 million tokens, later 2 million — that made full-codebase and full-document reasoning practical. Claude 3.5 Sonnet (June 2024) introduced the “Artifacts” pattern of structured generative outputs, then in October 2024 the “Computer Use” capability that opened the agentic era.

The architectural lesson is that multimodality is not a stitched-together pipeline (vision API → OCR API → LLM API) but a single model with shared internal representations across modalities. The training implication is that pre-training corpora must include not just text but image-caption pairs, video-transcript pairs, and audio-transcript pairs at scale.

Era V — The agentic AI era (2023–2026)

The defining feature of the agentic era is the shift from per-step human-triggered copilots to goal-driven systems that plan, decompose tasks, call APIs, and act with reduced human supervision.

What separates agents from copilots

A copilot suggests; an agent acts. The architectural difference is that agents plan (decompose a goal into sub-goals), perceive (read state from a system), act (call tools or APIs that change state), and iterate (revise plans based on observed results). Copilots are step-level; agents operate over multi-step trajectories with reduced human supervision.

The methodological substrate

Three papers define the technical substrate:

  • ReAct (Yao et al., 2023): interleaved reasoning traces with tool-use actions. The agent emits a Thought (free-text reasoning), an Action (tool call), receives an Observation (tool output), and iterates. The architecture is the most-cited template for modern agents; a skeleton of the loop is sketched after this list.
  • Toolformer (Schick et al., 2023): LLMs that can teach themselves to use tools, by self-generating training examples that include API calls and verifying which calls improved the prediction.
  • Anthropic Model Context Protocol (MCP), open-sourced November 2024: an open standard for connecting LLMs to data sources and tools. By mid-2025, MCP had become the de-facto integration standard for agentic systems.
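The control loop itself is short; the capability lives in the model. A minimal, illustrative skeleton in which `llm` stands in for any text-in/text-out model call and the tool registry is invented for the example (this is not any vendor's agent API):

```python
# ReAct-style agent loop (illustrative skeleton, not a production framework).
import json

TOOLS = {
    "search": lambda q: f"(top results for {q!r})",  # toy tools: a real agent
    "calculator": lambda e: str(eval(e)),            # would call actual APIs
}                                                    # (demo only: never eval untrusted input)

def react_agent(llm, goal, max_steps=8):
    """Thought -> Action -> Observation loop, terminated by a final answer."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        step = llm(transcript).strip()               # model emits one line per call:
        transcript += step + "\n"                    # Thought:, Action:, or Final Answer:
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, arg = json.loads(step.removeprefix("Action:"))
            observation = TOOLS[name](arg)           # execute the tool call
            transcript += f"Observation: {observation}\n"  # feed the result back
    return None                                      # step budget exhausted
```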

The 2024–2026 launches

| Date | System | What was new |
| --- | --- | --- |
| 17 Sep 2024 | Salesforce Agentforce 1.0 | First major CRM-native agent platform |
| 22 Oct 2024 | Anthropic Computer Use (Claude 3.5) | First frontier model to perceive screens, move cursors, type |
| Nov 2024 | Anthropic Model Context Protocol (MCP) | Open standard for tool-and-data integration |
| Dec 2024 | Google Gemini Agentspace | Enterprise agent infrastructure with Workspace integration |
| 23 Jan 2025 | OpenAI Operator | 87% on WebVoyager — autonomous web browsing |
| Mar 2025 | Cognition Labs Devin GA | Autonomous software engineering agent |
| Jun 2025 | Salesforce Agentforce 3.0 | MCP support; Command Center for observability |

Reasoning models inflect the cost-quality curve

OpenAI o1 (September 2024) and DeepSeek-R1 (DeepSeek-AI, 2025) (January 2025) introduced inference-time reasoning — models that “think” before they answer, generating long internal reasoning traces before emitting a final answer. The technical innovation is that the model is rewarded during RL training for traces that lead to correct answers, regardless of length, encouraging genuine multi-step reasoning rather than short-cut pattern matching.

For multi-step tasks (mathematics, code, long-form analysis), reasoning models outperform classical LLMs by 10–40 percentage points on standard benchmarks (AIME, MATH, GPQA Diamond, SWE-bench). The cost is per-query inference compute: a reasoning-model query may consume 10–100× the compute of a classical-LLM query. The economics depend on the task.

The DeepSeek shock

The era’s defining geopolitical moment came at the start of 2025. DeepSeek-V3 (DeepSeek-AI, 2024) (26 December 2024) achieved frontier performance for a reported $5–6 million in training compute — an order of magnitude below the rumoured $100M+ training cost of OpenAI’s GPT-4. DeepSeek-R1 (20 January 2025), MIT-licensed, matched OpenAI’s o1 on reasoning benchmarks.

📉 27 January 2025 — the trillion-dollar repricing

Nvidia lost approximately $600 billion in market capitalisation in a single day — the largest one-day loss for any US company in history — just six days after the unveiling of the Stargate Project, a $500 billion four-year US infrastructure commitment between OpenAI, SoftBank, Oracle, and MGX. On 29 October 2025, Nvidia nonetheless became the first $5 trillion company in history.

The DeepSeek shock did not signal the end of frontier model investment. It signalled, instead, that frontier capability can no longer be taken as a durable moat. The implication for enterprise architecture is that vendor lock-in to a single foundation-model provider is no longer required — a shift we develop in Chapter 5.

A subtler point worth memorising: the DeepSeek shock was at least partly mismeasured. The reported $5–6M training-compute cost did not include the (likely much larger) cost of base-model pretraining, data acquisition, and prior research investment that DeepSeek had been making since 2023. The efficient-frontier-cost claim is real; the order-of-magnitude reduction implied by the headline number is, on careful reading, smaller than the markets initially priced.

What the eras have in common

A useful exercise after surveying the five eras is to ask what is constant. Three things, at least:

Capability runs ahead of organisational complement

XCON’s brittleness, Watson Health’s deployment failure, the 78%-adoption-versus-5%-value pattern, and Klarna’s reversal are all instances of the same gap. Each era’s defining technology entered enterprises faster than the workflow redesign, governance, and training that determine value capture. The pattern is the central organising claim of this book.

The most economically valuable deployments are quiet

FICO Falcon, Amazon’s recommender, Google’s Quality Score, BERT in Search, and DBS’s GANDALF transformation produced more measurable value than any of the era-defining headline systems. This is not coincidental — value is captured by deployments that integrate with existing workflows and produce measurable financial impact, and such deployments are typically incremental, internal, and unphotogenic.

The corollary for graduate students is that public attention is a terrible proxy for economic value. The headlines focus on what is novel; the value follows what is integrated.

The architecture template is more durable than the technology

The four-component AI factory pattern (Chapter 3) describes Amazon’s pre-2010 deployments as well as Anthropic’s 2026 ones. The Iansiti–Lakhani framework was developed observing pre-LLM AI factories, but it applies without modification to LLM-native and agentic firms. The implication is that the conceptual scaffolding of this book — the four-component factory, the six Rewired capabilities, the three Agrawal-Gans-Goldfarb solution layers, the five Iansiti-Lakhani rules of the new meta — is more durable than any specific 2026 deployment.

Exercises 2.1

  1. Era boundaries. The chapter argues that the five-era taxonomy is a periodisation convenience rather than a sharp boundary. Identify three deployments that could plausibly belong to two adjacent eras. For each, defend the era-attribution that you find most informative.

  2. The MYCIN counterfactual. MYCIN performed at expert-physician level in formal evaluations but was never clinically deployed. Construct a counterfactual: if MYCIN had been deployed in 1980, what would have had to be true (technically, organisationally, legally) to make the deployment sustainable? Which of these conditions are present today, and which still are not?

  3. The maintenance cost of XCON. XCON eventually required eight knowledge engineers permanently. Construct the equivalent ongoing-maintenance cost for a modern LLM-based RAG system in a regulated industry. What roles, what FTEs, what annual budget?

  4. VC dimension and modern models. The Vapnik inequality bounds generalisation error in terms of VC dimension. Modern transformers have effectively unbounded VC dimension. Why do they generalise at all? (See the recent literature on benign overfitting and double descent.)

  5. Scaling laws in practice. Using the Kaplan et al. (2020) exponents, calculate (a) by how much loss falls when compute is increased 10×, (b) what compute increase is required to halve the loss, (c) whether the Hoffmann et al. (2022) correction changes either answer.

  6. The transformer’s gift. The chapter argues that transformer parallelism enabled the scale-by-compute regime. Identify one specific commercial AI deployment that would not exist without this parallelism, and explain why.

  7. Watson Health post-mortem. IBM invested an estimated $4–5B in Watson Health and sold it for ~$1B in 2022. Construct a managerial post-mortem at three levels: (a) technical (what did the system fail at?), (b) organisational (what management decisions enabled the technical failure?), (c) strategic (what should IBM have done instead?).

  8. AlphaGo’s move 37. In game 2 of the Lee Sedol match, AlphaGo played a move that human commentators initially called a mistake. Read accounts of the game and write a 500-word analysis of what move 37 reveals about the difference between deep learning and expert-system AI.

  9. The DeepSeek pricing. The chapter notes that the headline $5–6M training-compute cost for DeepSeek-V3 likely undercounts total investment by roughly 10×. Estimate the true total investment and identify three reasons why the headline figure was nonetheless economically informative.

  10. Why now for agents? AutoGPT was technically possible in 2018 (the building blocks — function calling, structured output, planning — existed in some form). It became commercially viable only in 2024. Identify three changes between 2018 and 2024 that crossed thresholds for agentic AI.

  11. The quiet-value claim. Identify a 2024–2026 deployment that is quiet but plausibly already producing $500M+ annual value. Defend the value estimate quantitatively. What systematic biases produce the gap between value and coverage?

  12. The next era. The chapter ends with the agentic era. Forecast the sixth era (2027–2030). What technology will define it? What organisational complements will be required for value capture?

Further reading

For symbolic AI’s methodological foundations, Buchanan and Shortliffe (1984) is the canonical text; the Lindsay et al. (1980) DENDRAL volume is the historical reference. For statistical learning theory, Vapnik (1995) remains the indispensable text; Cortes and Vapnik (1995) is the SVM founding paper; Breiman (2001) the random forests paper. For deep learning, Krizhevsky, Sutskever, and Hinton (2012), He et al. (2016), and Silver et al. (2016) are the three indispensable papers; the Goodfellow-Bengio-Courville textbook Deep Learning (2016) is the standard reference. For the transformer revolution, Vaswani et al. (2017), Brown et al. (2020), Kaplan et al. (2020), and Hoffmann et al. (2022) must be read in sequence. For RLHF and alignment, Ouyang et al. (2022) and Christiano et al. (2017) are foundational; Bai et al. (2022) (Constitutional AI) is the Anthropic alternative. For the agentic frontier, Yao et al. (2023) (ReAct), Schick et al. (2023) (Toolformer), and the Anthropic MCP specification are the working substrate. The McAfee-Brynjolfsson trilogy (Brynjolfsson and McAfee, 2014; McAfee and Brynjolfsson, 2017) remains the best general-readership account of the macroeconomic implications.

References for this chapter

  • Feigenbaum, E. A. (1977). The art of artificial intelligence: Themes and case studies of knowledge engineering. Proceedings of the Fifth International Joint Conference on Artificial Intelligence, 1014–1029.
  • Lindsay, R. K., Buchanan, B. G., Feigenbaum, E. A., and Lederberg, J. (1980). Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project. McGraw-Hill.
  • Shortliffe, E. H. and Buchanan, B. G. (1975). A model of inexact reasoning in medicine. Mathematical Biosciences 23(3–4): 351–379.
  • Buchanan, B. G. and Shortliffe, E. H., eds. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
  • Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
  • Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20(3): 273–297.
  • Breiman, L. (2001). Random forests. Machine Learning 45(1): 5–32.
  • Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9(8): 1735–1780.
  • Linden, G., Smith, B., and York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7(1): 76–80.
  • Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. IEEE Computer 42(8): 30–37.
  • Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1–7): 107–117.
  • Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science 313(5786): 504–507.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS.
  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
  • Goodfellow, I. et al. (2014). Generative adversarial nets. NeurIPS.
  • Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. CVPR.
  • Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.
  • Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). Mastering the game of Go without human knowledge. Nature 550(7676): 354–359.
  • Vaswani, A. et al. (2017). Attention is all you need. NeurIPS.
  • Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers. NAACL.
  • Brown, T. et al. (2020). Language models are few-shot learners. NeurIPS.
  • Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
  • Hoffmann, J. et al. (2022). Training compute-optimal large language models (Chinchilla). arXiv:2203.15556.
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  • Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. NeurIPS.
  • Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
  • Achiam, J., Adler, S., Agarwal, S., et al. (2023). GPT-4 technical report. arXiv:2303.08774.
  • Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.
  • Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. NeurIPS.
  • DeepSeek-AI (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
  • DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv:2412.19437.
  • Brynjolfsson, E. and McAfee, A. (2014). The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. W. W. Norton.
  • McAfee, A. and Brynjolfsson, E. (2017). Machine, Platform, Crowd: Harnessing Our Digital Future. W. W. Norton.