The Ancestor's Error
A 2025 CHI paper showed that human confidence aligns with AI confidence, and the alignment outlasts the tool. The error rate stays. The calibration moves.
Calibration is the technical name for a property most professionals develop without ever calling it that. A calibrated practitioner is one whose confidence tracks their accuracy. When they say they're sure, they're right. When they say they're uncertain, the answer is genuinely close to the line. Calibration is different from accuracy. A calibrated person can be uncertain about most things and remain useful, because the uncertainty itself is information.
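For readers who want the property made concrete, here is a minimal sketch of the measurement in Python; the function and the toy numbers are mine, not drawn from any of the papers discussed below. Group judgments by stated confidence and check the hit rate inside each group.

```python
from collections import defaultdict

def calibration_report(judgments, bin_width=0.1):
    """Compare stated confidence with observed accuracy.

    `judgments` is a list of (confidence, was_correct) pairs,
    e.g. (0.9, True) for a call made at 90% confidence that
    turned out to be right.
    """
    bins = defaultdict(list)
    for confidence, was_correct in judgments:
        bins[round(confidence / bin_width) * bin_width].append(was_correct)
    for level in sorted(bins):
        outcomes = bins[level]
        hit_rate = sum(outcomes) / len(outcomes)
        print(f"said ~{level:.0%} sure -> right {hit_rate:.0%} of the time "
              f"(n={len(outcomes)})")

# A calibrated practitioner's 60% calls land right about 60% of the time.
calibration_report([(0.9, True), (0.9, True), (0.9, False),
                    (0.6, True), (0.6, False)])
```

The point of the exercise is that calibration is checkable. It requires only that the confidence be recorded and the outcome eventually arrive, which, as we will see, is exactly the loop most deployments never close.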
In April 2025 a team of researchers presented a paper at CHI titled "As Confidence Aligns." Across multiple experiments they measured what happens when a human decision-maker uses an AI system that expresses its own confidence. The finding is the kind of result that should have caused a pause across the field. It did not, and I am still working out why.
The paper showed that the human's self-confidence aligns with the AI's confidence over the course of using the system, and that the alignment persists after the AI is removed. The user keeps the inflated confidence even when the prosthetic has been taken away. Calibration drift is a durable change in how the user relates to their own uncertainty, not a transient symptom of relying on the tool. The one intervention the team found effective at reducing the drift was real-time correctness feedback, the kind of signal almost no operational deployment provides.
I have been thinking about this against the backdrop of a different finding presented at the same conference. Another paper, on metacognition in human-AI reasoning, found that participants using AI assistance on LSAT problems improved their performance by three points relative to a norm population and overestimated their performance by four points. Higher AI literacy correlated with less accurate self-assessment: the participants who knew the most about the tools were the most confident and the least accurate about their own results. Higher task accuracy paired with inflated metacognition is the worst combination for downstream judgment. The user is right more often, by a thin margin, and now believes they are right by a thicker one.
This is where I land most often when general counsels and chief risk officers ask me what AI is doing to their organizations. The cleanest answer I have is that AI does not move the error rate of the analysts who use it. AI moves their calibration. The analysts are right roughly as often. They believe they are right far more often.
The reason this matters is that calibration is the substrate of risk management. Every escalation rule, every "stop and check" trigger, every human review threshold in a workflow is implicitly calibrated to a confidence distribution. When the analyst is appropriately uncertain, the workflow knows to slow down. When the analyst is appropriately certain, the workflow can move. If the analyst is now systematically more certain than the situation warrants, the slow-down triggers do not fire. The work moves faster. The errors that the slow-down was designed to catch slip through, and the system records itself as more efficient.
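A toy illustration of that mechanism, with every number invented for the purpose: a review rule tuned to the old confidence distribution, an analyst whose accuracy has not moved, and a confidence shift of the kind the CHI paper describes.

```python
import random

random.seed(0)

ESCALATE_BELOW = 0.75  # review trigger tuned to the old confidence distribution

def run_workflow(cases, confidence_shift=0.0):
    """Count escalations and missed errors when stated confidence
    inflates but underlying accuracy does not move."""
    escalated = missed = 0
    for difficulty in cases:
        accuracy = 1.0 - difficulty                 # unchanged by the tool
        confidence = min(1.0, accuracy + confidence_shift)  # drifted self-report
        if confidence < ESCALATE_BELOW:
            escalated += 1                          # slow-down trigger fires
        elif random.random() > accuracy:
            missed += 1                             # wrong call sails through
    return escalated, missed

cases = [random.uniform(0.0, 0.5) for _ in range(10_000)]
print("well calibrated :", run_workflow(cases))
print("confidence +0.15:", run_workflow(cases, confidence_shift=0.15))
# Same cases, same accuracy: fewer escalations, more missed errors, and the
# second run looks like the more efficient one on a throughput dashboard.
```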
The Epic Sepsis Model story I have written about elsewhere has the same structure. An AI tool that produces an output, a human who treats the output as a stand-in for their own assessment, a downstream workflow that has lost the calibration signal it needed. In the Epic case the loss showed up as missed sepsis. In a corporate decision context the loss shows up as what people call, after the fact, "we should have caught that." The catching was a calibration function. The function got rewritten.
I do not think this is solved by training. The CHI paper's only effective intervention, real-time correctness feedback, is exactly what most operational AI deployments do not have. The contracts review tool is not told, in real time, when a clause review was wrong. The threat assessment system is not told, in real time, when an attack succeeded. The medical decision support is not told whether the patient improved. The feedback loop closes weeks or months later, after the calibration has already settled into a new equilibrium with the AI's confidence.
The recommendation I give clients is the most boring version of the obvious one. Build the feedback loop yourself. Stand up ground-truth review processes that put the AI and the human in front of their own errors close enough in time for calibration to adjust. This is expensive, organizationally disruptive, and the first thing that gets cut when the budget tightens. It is also, as far as I can tell from the research, the only thing that works.
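What that looks like at its smallest is below: a sketch under the assumption that decisions are logged with the analyst's stated confidence and that ground truth eventually exists to be collected. The record fields and the sampling rate are placeholders, not a reference design.

```python
import datetime as dt
import random

def sample_for_review(decision_log, rate=0.05, max_age_days=30):
    """Pull a random slice of recent AI-assisted decisions for
    ground-truth review, so the error signal reaches the analyst
    while calibration can still adjust. Each record is assumed to
    carry `decided_at` and `analyst_confidence`, and gains
    `was_correct` once a reviewer has ruled on it."""
    cutoff = dt.datetime.now() - dt.timedelta(days=max_age_days)
    recent = [d for d in decision_log if d["decided_at"] >= cutoff]
    if not recent:
        return []
    return random.sample(recent, max(1, int(len(recent) * rate)))

def calibration_gap(reviewed):
    """Mean stated confidence minus observed accuracy.
    Positive and growing is the drift this essay is about."""
    confidence = sum(d["analyst_confidence"] for d in reviewed) / len(reviewed)
    accuracy = sum(d["was_correct"] for d in reviewed) / len(reviewed)
    return confidence - accuracy
```

Tracked month over month, that gap is a single number an organization could watch, which is more than most currently have.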
The harder thing to say is that this is not a problem inside an organization. It is a problem with what the organization has become. An organization whose analysts have been miscalibrated by their tools is an organization whose risk management has lost an unmeasured part of its function. Nobody will see this on a dashboard. They will see it the way Air Canada saw it in February 2024. A tribunal decision arrives, and the organization realizes that the calibration that would have caught the bad chatbot answer was already gone, three years before anyone went to the tribunal.