Chatbots Are Academically Dishonest

AI models might not be as “smart” as companies’ testing claims.

This is Atlantic Intelligence, a newsletter in which our writers help you wrap your mind around artificial intelligence and a new machine age. Sign up here.

Even by the AI industry’s frenetic standards, 2025 has been dizzying. OpenAI, Anthropic, Google, and xAI have all released major AI models and products, almost invariably touting them as the “best” and “smartest” in the world.

But determining exactly how “intelligent” programs such as GPT-4.5 or Claude 3.7, the latest models from OpenAI and Anthropic, really are is tricky. That vagueness is great for marketing, because fuzzy notions of “intelligence” make for easy claims, but it’s a problem for anyone trying to measure just how powerful or competent any one AI model is compared with the rest. Still, companies have coalesced around a set of industry-wide benchmark tests of AI-model abilities, and a new high score on these benchmarks is typically what tech companies mean when they say their AI models are the “smartest.”

The problem with these benchmarks, however, is that the chatbots seem to be cheating on them. Over the past two years, a number of studies have suggested that leading AI models from OpenAI, Google, Meta, and other companies “have been trained on the text of popular benchmark tests, tainting the legitimacy of their scores,” Alex Reisner wrote this week. “Think of it like a human student who steals and memorizes a math test, fooling his teacher into thinking he’s learned how to do long division.” This may not be tech companies’ intent—many of these benchmarks, or the questions on them, simply exist on the internet and thus get hoovered into AI models’ training data. (Of the labs Reisner mentioned, only Google DeepMind responded to a request for comment, telling him it takes the issue seriously.) Intentional or not, though, the unreliability of these benchmarks makes separating fact from marketing even more challenging.
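How do researchers catch this kind of leakage? One common approach in contamination studies is to look for long, word-for-word overlaps between benchmark questions and a model’s training text. Below is a minimal, hypothetical sketch of that idea in Python; the function names, the eight-word threshold, and the toy data are illustrative, not any particular lab’s actual method.

```python
# Illustrative sketch only: flag a benchmark question as possibly
# "contaminated" if a long run of its words appears verbatim in training text.

def word_ngrams(text: str, n: int) -> set[str]:
    """Return the set of n-word sequences in lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_text: str, n: int = 8) -> bool:
    """True if any n-word span of the question appears word for word in the training text."""
    return bool(word_ngrams(question, n) & word_ngrams(training_text, n))

# Example: a scraped web page that happens to contain a test question verbatim.
training_text = "study guide: what is 144 divided by 12? show your long division work step by step."
question = "What is 144 divided by 12? Show your long division work step by step."
print(looks_contaminated(question, training_text))  # True
```

In practice, the hard part is access: Outside researchers rarely get to inspect a model’s full training data, which is one reason contamination is so difficult to rule out.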


[Illustration: robotic hands holding Scantron tests. Illustration by The Atlantic. Source: Getty.]

Chatbots Are Cheating on Their Benchmark Tests

By Alex Reisner

Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its “largest and best model for chat yet.” Earlier in February, Google called its latest version of Gemini “the world’s best AI model.” And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI’s o1 model—which Sam Altman had called “the smartest model in the world” the previous month.

Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak. This is troubling, given that the promise of advancement has become a political issue; massive amounts of land, power, and money have been earmarked to drive the technology forward. How much is it actually improving? How much better can it get? These are important questions, and they’re almost impossible to answer because the tests that measure AI progress are not working. (The Atlantic entered into a corporate partnership with OpenAI in 2024. The editorial division of The Atlantic operates independently from the business division.)

Read the full article.


What to Read Next

  • “It feels like it’s chaotic on purpose”: At around midnight last Friday, the Trump administration, in its attempt to optimize government services with technology, eliminated a team of technologists dedicated to doing just that, I reported over the weekend. The unit, known as 18F and formed during the Obama years, had helped build IRS Direct File; a website that assisted in sending out free COVID tests; and a new way to file civil-rights complaints online, among numerous other projects. “The team was tailor-made for government efficiency and technology,” I wrote, “something the newly formed Department of Government Efficiency and its allies might, in theory, uplift.”