AI Passed the Exam. What Did It Understand?

In 2025, an AI model built by researchers at the University at Buffalo outperformed most human physicians on the United States Medical Licensing Examination. OpenAI's o1 model scored 96% on MedQA, a benchmark drawn from the same exam. DeepSeek hit 93% on the clinical knowledge portion. These aren't outliers: across law, engineering, accounting, and philosophy, AI systems now match or exceed human performance on professional examinations.

A reasonable person, looking at these numbers, might conclude that these systems understand medicine. Or law. Or whatever domain they've been tested in. The scores suggest it. The outputs are indistinguishable from those of someone who does understand. And in any practical context — hiring, credentialing, evaluation — that would be enough.

But a systematic review of 761 studies evaluating language models in medicine found something worth pausing on: only 5% tested performance on actual patient care. The rest used exam questions. The machines passed the tests. Whether they understood what they were doing remained, by the field's own admission, unresolved.

This isn't a small distinction. It might be the whole question.

What passing a test means, and what it doesn't

Tests are not empty exercises. A well-designed exam does test understanding — it asks you to apply knowledge in unfamiliar contexts, to reason through cases you haven't seen, to demonstrate that you grasp the principles beneath the facts. Medical licensing exams are specifically designed to do this. They're not trivial to pass, even for humans who've spent years studying.

So when an AI scores 96%, that means something. It means the system can produce the same outputs that a person who understands would produce, across a wide range of situations, including novel ones. That's impressive, and it would be dishonest to dismiss it.

The question is whether producing those outputs is the same thing as understanding.

Think about what happens when you actually understand something. Not when you recall a fact or recognise a pattern — when you genuinely get it. There's a qualitative shift. Something clicks. The concept stops being information you're holding in your head and becomes something you can move around in, see from different angles, apply in ways that weren't prescribed. It has meaning to you. Not meaning in the abstract sense — meaning in the sense that it matters, that it connects to other things you know, that it changes the way you see.

Understanding is experiential. It's not just what you can do with the knowledge. It's what the knowledge does to you.

An AI model that scores 96% on a medical exam can produce correct diagnoses. It can reason through differential diagnoses, weigh competing factors, and arrive at the right answer. But does the diagnosis mean anything to it? Does it experience the moment of recognition when the symptoms click into place? Does it understand the patient, or does it process the patient?

These might sound like philosophical indulgences. But they're not. They're practical questions with real consequences, because if understanding is purely a matter of producing correct outputs, then the machines already understand. And if it's something else — something experiential, something that involves meaning, not just accuracy — then we're looking at something far stranger: systems that perfectly imitate understanding without having it.

The room that made the problem visible

In 1980, the philosopher John Searle proposed a thought experiment that still sits at the centre of this debate. He called it the Chinese Room.

The setup is simple. Imagine a person — an English speaker who knows no Chinese — locked in a room. Through a slot in the door, people pass in questions written in Chinese characters. Inside the room, there's a detailed manual: for every possible combination of Chinese characters that comes in, the manual specifies which characters to write and pass back out.

The person follows the instructions perfectly. They look up the incoming characters, find the matching response in the manual, write the correct characters, and slide them back through the slot. From the outside, the room appears to understand Chinese. The responses are fluent, contextually appropriate, indistinguishable from those of a native speaker.

But the person inside the room doesn't understand a word of Chinese. They don't know what the characters mean. They don't know what questions are being asked or what answers they're giving. They're following rules — manipulating symbols according to a system — without any comprehension of what those symbols refer to.

Searle's point was that this is what computers do. They manipulate symbols according to rules. The outputs can be perfect. The process can be sophisticated. But if there's no one inside who grasps what the symbols mean — if there's no experience of meaning — then there's no understanding. There's just processing.
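
The rule-following in the room can be sketched in a few lines of Python. This is only a toy illustration: the names `RULE_BOOK`, `FALLBACK`, and `room_reply` and the sample entries are invented here, and Searle's manual is imagined to cover every possible input, not two.

```python
# Toy Chinese Room: the "manual" is a lookup table that maps incoming
# symbol strings to outgoing symbol strings. All entries are invented
# for illustration; nothing here comes from Searle's paper.
RULE_BOOK = {
    "你好吗？": "我很好，谢谢。",
    "你叫什么名字？": "我没有名字。",
}

# Returned when the manual has no entry for the incoming symbols.
FALLBACK = "请再说一遍。"


def room_reply(incoming: str) -> str:
    """Return whatever the manual lists for the incoming symbols."""
    return RULE_BOOK.get(incoming, FALLBACK)


if __name__ == "__main__":
    print(room_reply("你好吗？"))
```

Nothing in the function refers to what the characters mean; it only matches one string against another. Modern language models are far more complicated than a lookup table, so the sketch illustrates Searle's setup and is not a model of how such systems work.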

Searle died in September 2025 at the age of 93. His argument, forty-five years old, has never been more relevant — or more contested. In March 2025, Anthropic published research tracing the internal processes of one of its language models and found something that looked structurally like multi-step reasoning, not simple pattern recall. Whether that constitutes understanding or simply a more complex form of symbol manipulation is precisely what Searle was asking. It remains open.

The gap that won't close

The debate has generated hundreds of papers, conferences, and counterarguments. The most common objection to Searle is the "systems reply": maybe the person in the room doesn't understand Chinese, but the whole system — person plus manual plus room — does. Understanding is a property of the system, not any single part.

It's a reasonable objection. But it also reveals the problem: nobody agrees on what understanding is. The systems reply works only if understanding is defined as correct input-output behaviour. If you define it that way, then yes, the room understands Chinese. And yes, AI understands medicine. And the debate is over.

But most people don't actually believe that. Not when they think carefully about it. When you understand something, you don't just produce the right response. Something happens. The concept lands. You see it. You feel the weight of it. You can explain it in ten different ways because you're not recalling an answer — you're describing something you've experienced from the inside.

This is what separates understanding from computation. Computation is a process. Understanding is an experience. Computation gets the answer right. Understanding knows what the answer means.

A calculator computes the square root of 144. A student who understands square roots grasps why the answer is 12: what it means for a number to be the product of another number with itself, where that concept connects to geometry, to area, to the shape of a square in physical space. The calculator arrives at the same number. The student arrives at the same number and also arrives somewhere else, a place where the number has significance beyond itself.

What this means for us

The standard version of this debate asks: will machines ever truly understand? That's an important question, and it may take decades to resolve. But there's a closer, more uncomfortable version that deserves attention first.

How much of your own daily cognition is actually understanding — and how much is computation?

Consider what it's like to stand at the edge of the ocean. Not to think about the ocean. Not to recall facts about tidal patterns or salinity or depth. To stand there — wind on your skin, the sound filling your chest, the sheer scale of the thing pressing against you in a way that has nothing to do with data.

You understand something in that moment. Something real. But it isn't the kind of understanding you could write on an exam. It's bodily, spatial, felt. It's the experience of being small in front of something vast. A machine can compute the volume of the Pacific Ocean to the nearest cubic kilometre. It can tell you how far the horizon is from where you stand. But the feeling of standing at the water's edge and understanding, in your body, what that vastness means — that's a different kind of knowing. Not a better kind, necessarily. A different kind. And it's the kind that computation can't touch.

Or think about black holes. You can explain the physics — mass so dense that light can't escape, spacetime curved beyond recognition, a singularity at the centre where the equations break down. An AI can produce that explanation more accurately and more fluently than most people. But when you sit with the concept long enough — when you really try to hold in your mind what it would mean for space itself to stop working — something happens that isn't computational. Your imagination reaches for it and fails. And in that failure, you understand something about the limits of understanding itself. You feel the strangeness. The vertigo of a concept that exceeds your capacity to picture it. That feeling is a form of comprehension. Not because you've mastered the facts, but because you've met the edge of what facts can do.

This is the territory that the AI debate keeps skirting. Not whether machines can produce correct outputs — they can, impressively. But whether there is a dimension of understanding that is inherently experiential. That lives in the body, in the imagination, in the felt sense of what something means when it actually lands.

A machine can compute the distance between galaxies. A human can feel lonely looking at the stars. Both are responses to the same information. Only one of them is understanding.

The question underneath

None of this diminishes what AI can do. These systems are extraordinary. They will save lives, accelerate research, transform industries. Their inability to experience what they process — if it is an inability — doesn't make them less useful. A calculator doesn't understand arithmetic, and it's still indispensable.

But the existence of systems that perfectly replicate the outputs of understanding without the experience of it forces a question that the debate, for all its sophistication, keeps circling without quite landing on.

What is the experience of understanding actually for? Why does it exist? If correct outputs are all that matter — if understanding is just a means to the right answer — then the machines have already made it redundant. But if understanding is something more than that — if the felt sense of meaning, the moment of recognition, the experience of comprehension serves a purpose that goes beyond accuracy — then we're looking at something the machines can't replace. Not because they're not good enough. Because it's not the kind of thing that computation does.

The machines, by doing the functional part so well, are clarifying what the experiential part actually is. And it turns out to be something worth paying attention to. Not because machines are deficient, but because the human capacity to feel what we know — to experience understanding rather than just perform it — may be more significant than we assumed in a world that increasingly rewards outputs over insight.

The question isn't only whether machines can understand. It's what understanding is for — and whether we've been valuing it for the wrong reasons all along.

Sources and further reading:

Siam et al., "Benchmarking large language models on the USMLE" (2025) — on AI scoring 93–96% on medical licensing exams
"Knowledge-Practice Performance Gap in Clinical Large Language Models," systematic review (2026) — on the gap between exam scores and real clinical competence
John Searle, "Minds, Brains, and Programs" (1980) — the Chinese Room argument
Anthropic, "On the Biology of a Large Language Model" (2025) — interpretability research showing multi-step reasoning in LLMs