An AI Just Scored 100% on the USMLE. Here's What That Actually Means for Med Students.
In August 2025, OpenEvidence scored 325 out of 325 on a standardized USMLE benchmark — the first AI to achieve a perfect score across all three Steps. The headline generated predictable reactions: dismissal ("just pattern matching") and alarm ("why am I studying?"). Both miss the important part. This article covers what actually happened, the fine print that changes the interpretation, and — crucially — why AI's perfect score does not transfer to your study strategy in the ways you might assume.
What Actually Happened
In August 2025, a company called OpenEvidence announced that its AI had achieved a perfect 100% accuracy on the USMLE, becoming the first AI system in history to do so. The score covered all three Steps: Step 1, Step 2 CK, and Step 3.
OpenEvidence is a Miami-based AI company founded by Daniel Nadler, a Harvard PhD who previously built Kensho Technologies (an AI analytics firm acquired by S&P Global for roughly $550 million in 2018). The company's main product is a clinical decision support platform used by physicians — think of it as a medical search engine where doctors ask clinical questions and get answers grounded in peer-reviewed evidence from sources like the New England Journal of Medicine, JAMA, Cochrane, and NCCN guidelines.
The company has moved fast. As of late 2025, over 430,000 U.S. physicians — roughly 40% of all practicing doctors in the country — had registered on the platform, which was handling about 18 million clinical consultations per month. Nadler was named to TIME's 100 Most Influential People in Global Health list in 2025, and the company's valuation has climbed from $1 billion in early 2025 to $12 billion by January 2026, with investors including Sequoia Capital, Google Ventures, Nvidia, Kleiner Perkins, and the Mayo Clinic.
So this is not a garage project. It's a well-funded, widely adopted platform with serious institutional backing.
The Fine Print You Should Actually Read
Here is where most coverage of the story stops, and where things actually get interesting.
The 100% score was achieved on the Kung et al. dataset — a standardized benchmark based on the official USMLE sample exam questions available on usmle.org. The dataset includes 94 Step 1 questions, 109 Step 2 CK questions, and 122 Step 3 questions, totaling 325 items.
Three things to know about the methodology:
- Image-based questions were excluded. The AI was only tested on text-based questions. USMLE exams include histology slides, ECG tracings, imaging studies, and dermatology photos. Those weren't part of this benchmark.
- Recording errors in the original dataset were corrected. The team identified and fixed mistakes in the Kung et al. dataset before running the evaluation. This is reasonable methodology, but it means the exact question set differs slightly from what earlier AIs were tested on.
- This was not the actual USMLE. It was a performance evaluation on a publicly available set of sample questions. The real USMLE is a proctored, multi-hour exam administered by the NBME at Prometric testing centers. Nobody is claiming the AI sat for the actual test.
None of this makes the achievement less impressive. Getting 325 out of 325 on USMLE-level questions is genuinely remarkable. But "AI scores 100% on a curated subset of sample questions with images removed" lands differently than "AI aces the USMLE," even though both describe the same event.
How We Got Here: The AI Performance Timeline
The USMLE has quietly become the benchmark that AI researchers use to measure medical reasoning ability. Here's how the progression looked:
| Year | AI System | Approximate USMLE Accuracy |
|---|---|---|
| Late 2022 | ChatGPT (GPT-3.5) | ~60% (near passing threshold) |
| Late 2022 | Flan-PaLM (Google) | ~68% |
| Early 2023 | GPT-4 (OpenAI) | ~86% |
| Early 2023 | Med-PaLM 2 (Google) | ~86.5% |
| July 2023 | OpenEvidence | >90% (first AI above 90%) |
| 2024 | GPT-4o (OpenAI) | ~90% |
| April 2025 | SCAI (Univ. at Buffalo) | 95.2% on Step 3 |
| August 2025 | OpenEvidence | 100% |
The jump from "can barely pass" to "perfect score" happened in roughly three years. That's fast, even by AI standards.
When OpenEvidence first crossed the 90% mark in July 2023, the company reported making 77% fewer errors than ChatGPT, 24% fewer than GPT-4, and 31% fewer than Google's Med-PaLM 2. By August 2025, the errors were gone entirely.
How OpenEvidence's AI Actually Works
OpenEvidence has not published a detailed peer-reviewed paper on the architecture behind the 100% score, so we're working from public statements and press releases. Here's what's known:
The system uses what Nadler describes as "second and third derivative reasoning" — not just recalling a fact, but figuring out what a set of facts implies, and then reasoning through those implications. In his words: "not just taking the facts that come in before you, but taking the factors in before you, figuring out what those imply, and then reasoning through the implications."
The AI has access to a massive curated evidence base through multi-year content partnerships with NEJM Group (all published content from 1990 onward), JAMA Network, NCCN, Wiley, and Cochrane. So unlike a general-purpose chatbot trained on the open internet, OpenEvidence reasons over decades of peer-reviewed medical literature specifically.
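OpenEvidence has not published its architecture, but systems in this category typically follow a retrieval-grounded pattern: find the most relevant passages in a curated corpus, then have the model answer with those passages in its context. Below is a minimal sketch of that pattern, with a crude keyword-overlap retriever standing in for real dense-embedding search; every name in it is illustrative, not OpenEvidence's actual code.

```python
# Minimal sketch of retrieval-grounded question answering.
# Illustrative only: a production system would use dense embeddings
# and a real LLM call instead of keyword overlap and a prompt string.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by crude keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def build_grounded_prompt(question: str, corpus: list[str]) -> str:
    """Assemble a prompt that forces the model to answer from evidence."""
    evidence = retrieve(question, corpus)
    context = "\n".join(f"- {passage}" for passage in evidence)
    return (
        "Answer using ONLY the evidence below.\n"
        f"Evidence:\n{context}\n"
        f"Question: {question}\n"
    )

corpus = [
    "ACE inhibitors block conversion of angiotensin I to angiotensin II.",
    "Granulomatous inflammation features epithelioid macrophages.",
    "Beta blockers reduce heart rate and myocardial oxygen demand.",
]
print(build_grounded_prompt("How do ACE inhibitors affect angiotensin II?", corpus))
```

The design point is the grounding itself: the model's answer space is constrained to a vetted literature base rather than whatever the open internet contained at training time.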
Alongside the 100% score, the company released a free explanation model that teaches the reasoning behind each correct answer and generates clinical vignettes tailored to the learner's training level. The stated goal is to democratize access to quality medical education resources.
The Part That Should Temper the Hype
Here's the fact that didn't make it into most headlines.
In November 2025, a pilot study posted on medRxiv tested OpenEvidence on the MedXpertQA dataset — a set of complex medical subspecialty scenarios that go well beyond the scope and format of USMLE questions. These are the kinds of messy, ambiguous, multi-system problems that show up in real clinical practice.
OpenEvidence scored 34%.
That's not a typo. The same AI that got every single USMLE question right could only answer about one in three complex subspecialty questions correctly.
This result highlights something fundamental about what standardized exams actually test. USMLE questions, by design, have one clearly correct answer among five options. They test a specific body of knowledge in a structured format. Real clinical reasoning is open-ended, ambiguous, involves incomplete information, and frequently has no single right answer. Mastering the former does not automatically translate to the latter.
This is true for AI systems, and it's worth noting that it's also true for human examinees. A perfect Step 1 score has never guaranteed clinical excellence.
What This Means for Your USMLE Prep
If you are a medical student reading this between question blocks, here is what the OpenEvidence result actually changes — and does not change — about your preparation.
Your exam is not getting replaced
The USMLE exists to certify that human physicians have the foundational knowledge needed to practice medicine safely. The fact that an AI can now ace the test doesn't change that purpose. The NBME has given no indication that AI performance will change exam policies, scoring, or requirements. You still need to pass. You still need to know this material.
AI study tools are getting significantly better
The same category of technology that enabled OpenEvidence's benchmark result — domain-specific AI grounded in curated medical literature rather than generic internet text — is filtering into study tools. Platforms that apply this approach responsibly, anchoring AI features to expert-reviewed question content rather than generating answers from scratch, represent a meaningful step forward from the static QBanks of even two years ago.
The "what" hasn't changed; the "how" is evolving
You still need to learn the renin-angiotensin-aldosterone system. You still need to recognize the histology of granulomatous inflammation. The core knowledge hasn't changed. What's changing is the quality of the tools available to help you learn it — from better explanations and adaptive question routing to AI tutoring that can meet you where you are.
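To make "adaptive question routing" concrete: the usual idea is to weight question selection toward the topics where your error rate is highest. The sketch below shows one simple version, error-rate-weighted sampling with Laplace smoothing; the topic names, counts, and weighting rule are all hypothetical, not any specific platform's algorithm.

```python
import random

# Illustrative sketch of adaptive question routing: sample the next
# question's topic in proportion to your smoothed error rate there.
attempts = {"cardio": 40, "renal": 25, "biochem": 30}
misses   = {"cardio": 8,  "renal": 11, "biochem": 6}

def error_rate(topic: str) -> float:
    # Laplace smoothing so barely-attempted topics still get selected.
    return (misses[topic] + 1) / (attempts[topic] + 2)

topics = list(attempts)
weights = [error_rate(t) for t in topics]
next_topic = random.choices(topics, weights=weights, k=1)[0]
print(f"Next question drawn from: {next_topic}")  # 'renal' most often (~44% error rate)
```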
Do not use general AI chatbots as your primary study resource
This matters especially after a headline like "AI scores 100% on USMLE." OpenEvidence achieved that benchmark using curated peer-reviewed literature from NEJM, JAMA, Cochrane, and NCCN — decades of validated medical knowledge feeding its reasoning. ChatGPT draws from broad internet text without that clinical verification layer. A 2025 study in the Journal of the American College of Clinical Pharmacy documented 0% accuracy from ChatGPT-3.5 across categories like IV compatibility and monitoring parameters. The perfect-score AI and the chatbot on your phone are fundamentally different tools despite sharing the "AI" label.
The real competitive advantage is knowing how to learn
AI systems are getting better at answering questions. They are not getting better at being physicians. The clinical reasoning, pattern recognition, and judgment that residency programs are ultimately looking for cannot be outsourced to a chatbot. Your job is to build that foundation — and the best way to do that is still active recall, spaced repetition, and deliberate practice with high-quality question banks.
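Of those techniques, spaced repetition is the most algorithmic and worth understanding. Most flashcard tools descend from the SM-2 scheduling rule: each successful recall stretches the next review interval, while a failed recall resets it. Here is a simplified sketch of SM-2 as originally published; real tools such as Anki use modified variants.

```python
# Simplified SM-2 update. "quality" is a 0-5 self-rating of recall.

def sm2_update(quality: int, reps: int, interval: int, ef: float):
    """Return (reps, interval_in_days, easiness_factor) after one review."""
    if quality < 3:            # failed recall: restart the schedule
        return 0, 1, ef
    # Easiness factor rises for easy recalls, falls for hard ones.
    ef = max(1.3, ef + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    reps += 1
    if reps == 1:
        interval = 1
    elif reps == 2:
        interval = 6
    else:
        interval = round(interval * ef)
    return reps, interval, ef

reps, interval, ef = 0, 0, 2.5
for q in [5, 4, 5]:            # three successful reviews in a row
    reps, interval, ef = sm2_update(q, reps, interval, ef)
    print(f"next review in {interval} day(s), EF={ef:.2f}")  # 1, 6, then 16 days
```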
Why AI Performance Does Not Transfer to Your Study Strategy
OpenEvidence (and other medical AI systems) achieve high scores through a fundamentally different mechanism than human test-taking. Understanding this difference prevents drawing the wrong conclusions about what AI benchmarks mean for your preparation.
AI Pattern Matching vs. Human Clinical Reasoning
AI processes the entire question simultaneously, matching text patterns against its training data to identify the most probable correct answer. Humans read sequentially, build a clinical picture progressively, generate a differential diagnosis, and apply diagnostic frameworks to narrow it down. These are fundamentally different cognitive processes. AI's success on a question does not mean the question is "easy," and the AI's approach cannot teach you how to answer it — because you are not going to take the exam with access to a billion-parameter model and decades of indexed literature.
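To make the difference concrete: one standard way a language model "answers" a multiple-choice item is to score every option's plausibility given the stem and take the highest. The toy example below shows that mechanism; the stem, options, and logit values are invented for illustration, and a real system would derive the scores from the model's token probabilities.

```python
import math

# Toy illustration of answer selection by probability scoring.
# Note that no differential diagnosis is ever constructed.
stem = "A 55-year-old man with crushing substernal chest pain..."
option_logits = {            # hypothetical model scores
    "A. GERD": 1.2,
    "B. Acute MI": 4.7,
    "C. Costochondritis": 0.3,
    "D. Panic attack": 0.9,
    "E. Aortic dissection": 2.1,
}

# Softmax turns the scores into a probability distribution over options.
z = sum(math.exp(v) for v in option_logits.values())
probs = {opt: math.exp(v) / z for opt, v in option_logits.items()}
answer = max(probs, key=probs.get)
print(answer, f"({probs[answer]:.0%})")  # B. Acute MI (88%)
```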
AI Does Not Experience Cognitive Fatigue, Time Pressure, or Test Anxiety
A human taking a 280-question, 8-hour exam with breaks faces cumulative fatigue, attention drift, and emotional fluctuation. Your accuracy in block 7 is measurably lower than in block 1 — not because you know less, but because your brain is tired. AI performance is constant across all questions. Your study strategy must account for human factors that AI is immune to: stamina building through full-length practice tests, break optimization to manage energy, and anxiety management techniques for test day. None of these appear in AI benchmark discussions.
What the AI Benchmark DOES Tell You
If an AI system scores perfectly on a USMLE-style question set, it confirms that the questions are answerable from the available information in the stem — there is no ambiguity that makes any question fundamentally unanswerable. This is useful quality control for question banks. It also demonstrates that the knowledge needed to answer USMLE questions is well-documented in peer-reviewed literature. These are validation insights, not study strategy insights.
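That suggests one practical use for question-bank authors: run each item through a strong model several times and flag misses or inconsistent answers for human review. Below is a sketch of that loop, assuming a hypothetical `ask_model` stub in place of a real model API and an arbitrary flagging rule.

```python
import random

# Sketch of a model-based quality-control pass over a question bank.
def ask_model(stem: str, options: list[str]) -> str:
    return random.choice(options)   # stub: replace with a real model call

def qc_flags(bank: list[dict], trials: int = 3) -> list[str]:
    flagged = []
    for item in bank:
        answers = {ask_model(item["stem"], item["options"]) for _ in range(trials)}
        # Inconsistent answers or a miss both suggest the item needs review.
        if len(answers) > 1 or item["key"] not in answers:
            flagged.append(item["id"])
    return flagged

bank = [
    {"id": "q1", "stem": "Which vessel...?", "options": ["A", "B", "C"], "key": "B"},
]
print(qc_flags(bank))
```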
The Future Implications for Board Exams
As AI systems become capable of scoring perfectly on current board exams, the exam format will likely evolve to test skills that AI cannot replicate: physical examination interpretation from video, patient communication assessment, ethical judgment in genuinely ambiguous situations, and procedural decision-making under uncertainty. Students who invest in these "human-only" skills alongside knowledge acquisition are positioning themselves for the future of medical assessment, not just the current exam format.
The Bigger Picture
Daniel Nadler has framed OpenEvidence's mission around equity: "There's an enormous amount of inequality in medical education in the United States and in preparation for medical school exams." The free explanation model released alongside the 100% score is meant to make high-quality reasoning accessible regardless of which medical school you attend or how much you can afford to spend on prep materials.
That vision resonates with a real problem. USMLE prep costs $2,000–$4,000 when you add up all the resources, and for IMGs in countries where that represents months of income, the financial barrier is significant.
Whether AI-powered tools will meaningfully close that gap remains to be seen. But the trajectory is clear: medical AI is improving faster than most people expected, and the tools available to students today are qualitatively different from what existed even two years ago.
The students who will benefit most are the ones who understand what these tools can and cannot do — and use them accordingly.
Frequently Asked Questions
Did an AI really pass the USMLE?
In August 2025, OpenEvidence became the first AI to score a perfect 100% on all three USMLE Steps using the Kung et al. benchmark dataset of 325 sample questions from usmle.org. Image-based questions were excluded. The AI did not take the actual proctored exam.
What is OpenEvidence?
OpenEvidence is an AI-powered clinical decision support platform used by over 430,000 U.S. physicians. Founded by Daniel Nadler (Harvard PhD), it provides evidence-grounded answers to clinical questions using content from NEJM, JAMA, Cochrane, and other peer-reviewed sources. It is valued at $12 billion as of January 2026.
How does OpenEvidence compare to ChatGPT for medical questions?
OpenEvidence is purpose-built for medicine with access to curated peer-reviewed literature. ChatGPT is a general-purpose language model. In benchmarks, OpenEvidence made 77% fewer errors than ChatGPT. A 2025 study found ChatGPT-3.5 had 0% accuracy on several drug information categories where precision matters most.
Will AI replace the USMLE?
There is no indication from the NBME or USMLE program that AI performance will change exam requirements. The USMLE certifies human physician competency, and that purpose remains regardless of AI capabilities.
Should I use AI tools to study for the USMLE?
AI tools can be valuable supplements — especially for concept explanation, adaptive learning, and personalized study plans. The key is using purpose-built tools with expert-reviewed content rather than general chatbots, and always treating AI as a supplement to active recall and structured practice, not a replacement for it.
Studying for the USMLE? Harness AI for your own preparation — QuantaPrep brings adaptive intelligence to your study sessions through personalized question selection and performance pattern analysis, built on expert-reviewed content rather than AI-generated medical facts. Free to start.
Ready to start practicing?
QuantaPrep's question bank features detailed explanations, performance analytics, and study modes designed around active recall.