If you’re looking for a new reason to worry about AI, try this: Some of the smartest people in the world are racing to create tests that AI systems can’t pass.
For years, AI systems have been measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging SAT-caliber problems in areas such as math, science, and logic. Comparing the models’ scores over time served as a rough measure of AI progress.
But AI systems eventually got too good at those tests, so new, more difficult tests were created — often with the types of questions that graduate students might encounter on their exams.
Those tests aren’t aging well, either. New models from companies such as OpenAI, Google, and Anthropic have scored highly on several PhD-level challenges, limiting the usefulness of those tests and leading to a chilling question: Are AI systems getting too smart to measure?
This week, researchers at the Center for AI Safety and Scale AI released a possible answer to that question: a new assessment, called “Humanity’s Last Exam,” that they claim is the toughest test ever given to artificial intelligence systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was rejected as too dramatic.)
Mr. Hendrycks worked with Scale AI, an artificial intelligence company for which he is a consultant, to write the test, which consists of about 3,000 multiple-choice and short-answer questions designed to test the abilities of AI systems in areas ranging from analytical philosophy to rocket engineering.
The questions were submitted by experts in these fields, including college professors and award-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.
Here, try your hand at a hummingbird anatomy question from the quiz:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your thing, try this:
A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a massless rigid rod of length R. A mass is attached to the other end. Both objects have a weight W. The system is initially at rest, with the mass directly above the block. The mass is given an infinitesimal thrust, parallel to the rail. Assume that the system is designed so that the rod can rotate 360 degrees without stopping. When the rod is horizontal, it carries the tension T1. When the rod is vertical again, with the mass directly below the block, it carries the tension T2. (Both of these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
(I’d print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m not smart enough to check the answers myself.)
The questions in Humanity’s Last Exam went through a two-step filtering process. First, they were given to leading artificial intelligence models to solve.
If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote the top-rated questions were paid between $500 and $5,000 per question and given credit for their contributions to the exam.
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions for the test. Three of his questions were chosen; he told me they were “in the upper range of what one might see on a graduate exam.”
Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create tougher AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Mr. Musk’s AI company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.
“Elon looked at the MMLU questions and said, ‘These are undergraduate level. I want things that a world-class expert could do,’” Mr. Hendrycks said.
There are other tests that try to measure advanced AI capabilities in certain fields, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet.
But Humanity’s Last Exam aims to determine how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what could be considered an overall intelligence score.
“We’re trying to assess the extent to which artificial intelligence can automate very, very difficult mental work,” Mr. Hendrycks said.
Once the list of questions was compiled, the researchers gave Humanity’s Last Exam to six top AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, with a score of 8.3 percent.
(The New York Times has sued OpenAI and its partner Microsoft, accusing them of copyright infringement of news content related to artificial intelligence systems. OpenAI and Microsoft have denied those claims.)
Mr. Hendrycks said he expected those scores to rise quickly, possibly exceeding 50 percent by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we may need to look for other ways to measure AI’s impact, such as examining economic data or judging whether it can make new discoveries in fields like math and science.
“You can imagine a better version of this where we can ask questions that we don’t know the answers to yet and we’re able to verify if the model can help us solve it,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.
Part of what’s confusing about AI progress these days is how jagged it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Mathematical Olympiad and beating top human programmers in competitive coding challenges.
But these same models sometimes struggle with basic tasks like arithmetic or writing metered poetry. This has given them a reputation for being amazingly brilliant at some things and completely useless at others, and has created very different impressions of how fast AI is improving depending on whether you’re looking at the best or worst performers.
This jaggedness has also made these models hard to measure. I wrote last year that we need better evaluations of AI systems. I still believe that. But I also believe we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do, and what we fear AI will do better than us, can’t be captured on a written exam.
Mr. Zhou, the theoretical particle physics researcher who submitted questions for Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he didn’t see them as a threat to him and his colleagues, because their work involves a lot more than spitting out correct answers.
“There is a big gap between what it means to take exams and what it means to be a practicing physicist and a researcher,” he said. “Even an AI that can answer these questions may not be ready to help with research, which is inherently less structured.”