Chatbots Are Academically Dishonest

AI models might not be as “smart” as companies’ testing claims.

This is Atlantic Intelligence, a newsletter in which our writers help you wrap your mind around artificial intelligence and a new machine age. Sign up here.

Even by the AI industry’s frenetic standards, 2025 has been dizzying. OpenAI, Anthropic, Google, and xAI have all released major AI models and products, almost invariably touting them as the “best” and “smartest” in the world.

But determining exactly how “intelligent” programs such as GPT-4.5 or Claude 3.7, the latest models from OpenAI and Anthropic, really are is tricky. That vagueness is great for marketing, because fuzzy notions of “intelligence” make for easy claims, but it’s a problem for anyone trying to measure just how powerful or competent any one AI model is compared with the rest. Still, companies have coalesced around a set of industry-wide benchmark tests of AI-model abilities, and a new high score on these benchmarks is typically what tech companies mean when they say their AI models are the “smartest.”

The problem with these benchmarks, however, is that the chatbots seem to be cheating on them. Over the past two years, a number of studies have suggested that leading AI models from OpenAI, Google, Meta, and other companies “have been trained on the text of popular benchmark tests, tainting the legitimacy of their scores,” Alex Reisner wrote this week. “Think of it like a human student who steals and memorizes a math test, fooling his teacher into thinking he’s learned how to do long division.” This may not be tech companies’ intent—many of these benchmarks, or the questions on them, simply exist on the internet and thus get hoovered into AI models’ training data. (Of the labs Reisner mentioned, only Google DeepMind responded to a request for comment, telling him it takes the issue seriously.) Intentional or not, though, the unreliability of these benchmarks makes separating fact from marketing even more challenging.
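How do researchers catch this kind of leakage? One common approach in contamination studies is to look for long, word-for-word overlaps between benchmark questions and a model’s training text. Below is a minimal, hypothetical sketch of that idea in Python; the function names, the eight-word threshold, and the toy data are illustrative, not any particular lab’s actual method.

```python
# Illustrative sketch only: flag a benchmark question as possibly
# "contaminated" if a long run of its words appears verbatim in training text.

def word_ngrams(text: str, n: int) -> set[str]:
    """Return the set of n-word sequences in lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question: str, training_text: str, n: int = 8) -> bool:
    """True if any n-word span of the question appears word for word in the training text."""
    return bool(word_ngrams(question, n) & word_ngrams(training_text, n))

# Example: a scraped web page that happens to contain a test question verbatim.
training_text = "study guide: what is 144 divided by 12? show your long division work step by step."
question = "What is 144 divided by 12? Show your long division work step by step."
print(looks_contaminated(question, training_text))  # True
```

In practice, the hard part is access: Outside researchers rarely get to inspect a model’s full training data, which is one reason contamination is so difficult to rule out.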


[Illustration: robotic hands holding Scantron tests. Illustration by The Atlantic. Source: Getty.]

Chatbots Are Cheating on Their Benchmark Tests

By Alex Reisner

Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its “largest and best model for chat yet.” Earlier in February, Google called its latest version of Gemini “the world’s best AI model.” And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI’s o1 model—which Sam Altman had called “the smartest model in the world” the previous month.

Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak. This is troubling, given that the promise of advancement has become a political issue; massive amounts of land, power, and money have been earmarked to drive the technology forward. How much is it actually improving? How much better can it get? These are important questions, and they’re almost impossible to answer because the tests that measure AI progress are not working. (The Atlantic entered into a corporate partnership with OpenAI in 2024. The editorial division of The Atlantic operates independently from the business division.)

Read the full article.


What to Read Next

  • “It feels like it’s chaotic on purpose”: At around midnight last Friday, the Trump administration, in its attempt to optimize government services with technology, eliminated a team of technologists dedicated to doing just that, I reported over the weekend. The unit, known as 18F and formed during the Obama years, had helped build IRS Direct File; a website that assisted in sending out free COVID tests; and a new way to file civil-rights complaints online, among numerous other projects. “The team was tailor-made for government efficiency and technology,” I wrote, “something the newly formed Department of Government Efficiency and its allies might, in theory, uplift.”