The Ancestor's Error
Epic's sepsis prediction model missed 67% of sepsis cases at Michigan Medicine. The audit methods we have built for AI cannot catch what a system never said.
In 2017 Epic Systems began rolling out a sepsis prediction model. It scored hospitalized patients on the likelihood of developing sepsis and surfaced alerts to clinicians. The model became one of the most widely deployed AI tools in American hospital medicine. On its internal validation, Epic's documentation reported an area under the curve (AUC) between 0.76 and 0.83. Those numbers sit at the edge of "good" in the medical machine-learning literature.
In June 2021 a research team from Michigan Medicine published an external validation in JAMA Internal Medicine. They tested the model on 27,697 of their own patients across 38,455 hospitalizations. The AUC came back at 0.63. At the threshold the model used to fire alerts, sensitivity was 33%. The model missed 67% of patients who developed sepsis. It also fired alerts on 18% of all hospitalizations, generating an alert load no clinical team can carry. The model's bedside fingerprint, in other words, was invisibility for two-thirds of the cases that mattered and noise everywhere else.
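To make that fingerprint concrete, here is a back-of-envelope sketch. The 33% sensitivity and 18% alert rate are the paper's figures; the roughly 7% sepsis incidence is an assumption, in the ballpark of what the Michigan cohort reported, so the derived counts are illustrative rather than published.

```python
# Back-of-envelope reconstruction of the Michigan Medicine numbers.
# Sensitivity and alert rate are from the published validation; the
# ~7% sepsis incidence is an assumption, so derived counts are rough.

hospitalizations = 38_455
sepsis_rate = 0.07    # assumed incidence, for illustration only
sensitivity = 0.33    # reported: fraction of sepsis cases flagged
alert_rate = 0.18     # reported: fraction of hospitalizations alerted on

sepsis_cases = hospitalizations * sepsis_rate  # ~2,700 cases
caught = sepsis_cases * sensitivity            # ~890 flagged
missed = sepsis_cases - caught                 # ~1,800 silent failures
alerts = hospitalizations * alert_rate         # ~6,900 alerts fired
false_alerts = alerts - caught                 # ~6,000 alerts on patients without sepsis
precision = caught / alerts                    # ~0.13: about 1 alert in 8 is real

print(f"missed sepsis cases: {missed:,.0f}")        # ~1,804
print(f"false alerts:        {false_alerts:,.0f}")  # ~6,034
print(f"alert precision:     {precision:.0%}")      # ~13%
```

The ~6,000 false alerts announce themselves on the ward. The ~1,800 missed cases announce nothing.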
What stayed with me when I read that paper was the structure of the failure.
Every alert the Epic Sepsis Model produced was a true claim about its own judgment. The model said, "this patient meets my threshold for sepsis risk." The harm was in the alerts that never came, for the patients who fell beneath the threshold and developed sepsis anyway. The model's silence on those patients was not an error in the conventional sense. The model was working as designed. It was the silence that contained the harm, and the silence had no surface. There was no line in the chart that said "missed case." There was a patient who deteriorated and a model that did not flag it, and from the clinician's vantage no signal that the model had failed, because the absence of an alert is not a failure-shaped object.
This is the structure I have in mind when I argue that omission is the AI failure mode we have not learned to audit.
We have built a vocabulary for what these systems get wrong. Hallucination, confabulation, distributional shift, adversarial vulnerability. We have not built the equivalent vocabulary for what they fail to produce. The missing finding in a radiology summary. The unflagged clause in a contract review. The threat vector not surfaced in a security assessment. Each of these failures has the property the Epic case has. The system was working as designed. The thing that was missing left no trace.
The reason this matters more than ordinary error is that we audit AI systems the way we audit human ones. We check what they say. We compare their outputs against ground truth where ground truth is available. We flag and correct discrepancies. The audit catches commission, not omission. Catching omission requires knowing what should have been there, and if you know what should have been there you do not need the system.
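The asymmetry is easy to state in code. A minimal sketch, where claim_supported_by is a hypothetical stand-in for any grounding check (string matching, an NLI model, a human reviewer); nothing here is a real library's API:

```python
# Hypothetical sketch of the audit asymmetry. `claim_supported_by`
# stands in for any grounding check; it is not a real API.

def audit_commission(output_claims, source_text, claim_supported_by):
    """Flag what the system said that the source does not support.
    Needs only the output and the source."""
    return [c for c in output_claims
            if not claim_supported_by(c, source_text)]

def audit_omission(output_claims, expected_items, claim_supported_by):
    """Flag what the system should have said but did not. Needs
    `expected_items`: a ground-truth list of what belongs in the
    output. Writing that list is doing the system's job."""
    return [e for e in expected_items
            if not any(claim_supported_by(e, c) for c in output_claims)]
```

The first function's inputs exist in every deployment. The second function's expected_items is precisely the thing the system was supposed to produce, which is the circularity in the paragraph above.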
There is a cognitive phenomenon in the judgment literature that Daniel Kahneman called "what you see is all there is." Presented with a coherent narrative, we rarely ask what information might be missing. The narrative feels complete because it hangs together, not because we have verified its completeness. AI summaries exploit this with relentless efficiency. A well-formed paragraph about contract risks reads as thorough. The reviewer's machinery registers it as a complete treatment of the subject. The fact that it says nothing about the change-of-control provision or the IP assignment clauses, which may be critical to the deal, does not trigger an alarm, because alarms are calibrated to detect presence and not absence.
The natural response is to propose coverage metrics. Measure how much of the source document a summary captures. Flag summaries that fall below a threshold. Several commercial tools now do this. The approach is intuitive and almost certainly insufficient. A coverage metric that checks whether a contract summary mentions each section heading will flag a missing "Termination" section. It will not catch the absence of the distinction between termination for cause and termination for convenience, because that distinction lives within a section, not in the document's skeleton. The metric measures the bones. The omissions happen in the muscle.
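For concreteness, here is roughly what a heading-level coverage check amounts to. The contract and summary are hypothetical, and real tools use semantic matching rather than substrings, but the granularity problem is identical:

```python
import re

def heading_coverage(contract_text: str, summary: str) -> float:
    """Fraction of top-level section headings the summary mentions.
    This is the 'bones' check: it sees only the document's skeleton."""
    headings = re.findall(r"^\d+\.\s+([A-Z][A-Za-z ]+)$", contract_text, re.M)
    mentioned = [h for h in headings if h.lower() in summary.lower()]
    return len(mentioned) / len(headings) if headings else 1.0

contract = """1. Definitions
2. Termination
3. Intellectual Property"""

# The summary names every section, so coverage comes back perfect,
# even though it never distinguishes termination for cause from
# termination for convenience -- the distinction inside section 2.
summary = ("Covers Definitions, Termination rights, and "
           "Intellectual Property assignment.")

print(heading_coverage(contract, summary))  # 1.0
```

The check returns a perfect score. The missing for-cause versus for-convenience distinction lives inside section 2, below the resolution the metric can see.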
A high coverage score also creates a new layer of false confidence. The reviewer now has two sources of false reassurance: the fluent summary, and the quantitative claim that the summary is "95 percent complete." The number replaces the judgment that was supposed to be the point of the exercise.
I do not have a clean answer to this. I have a working position. Until we build methods for making the shape of an omission visible, for rendering what a system chose not to surface as concretely as we render what it did, every AI summary in a high-stakes workflow needs to be treated as an oracle that has answered honestly and incompletely, and the gap is the user's to find. That is an uncomfortable working position. It is also the one Michigan Medicine's clinicians had to adopt, in the space of one published paper, with respect to a system already running across their hospital.
Epic released a redesigned sepsis model in 2022. I have not seen comparable external validation of the new version published yet. The lesson of the original is not specific to that vendor. It is a lesson about audits, and about what they catch by design, and about what they cannot catch by design. The shape of what is missing has no shape. It is invisible until someone, usually too late, notices.