GPT, Claude, Llama? How to tell which AI model is best

When Meta, the parent company of Facebook, announced its latest open-source large language model (LLM) on July 23rd, it claimed that the most powerful version of Llama 3.1 had “state-of-the-art capabilities that rival the best closed-source models” such as GPT-4o and Claude 3.5 Sonnet. Meta’s announcement included a table, showing the scores achieved by these and other models on a series of popular benchmarks with names such as MMLU, GSM8K and GPQA.

On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%, against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5 Sonnet had itself been unveiled on June 20th, again with a table of impressive benchmark scores. And on July 24th, the day after Llama 3.1’s debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM, with—you’ve guessed it—yet another table of benchmarks. Where do such numbers come from, and can they be trusted?

Having accurate, reliable benchmarks for AI models matters, and not just for the bragging rights of the firms making them. Benchmarks “define and drive progress”, telling model-makers where they stand and incentivising them to improve, says Percy Liang of the Institute for Human-Centred Artificial Intelligence at Stanford University. Benchmarks chart the field’s overall progress and show how AI systems compare with humans at specific tasks. They can also help users decide which model to use for a particular job and identify promising new entrants in the space, says Clémentine Fourrier, a specialist in evaluating LLMs at Hugging Face, a startup that provides tools for AI developers.

But, says Dr Fourrier, benchmark scores “should be taken with a pinch of salt”. Model-makers are, in effect, marking their own homework—and then using the results to hype their products and talk up their company valuations. Yet all too often, she says, their grandiose claims fail to match real-world performance, because existing benchmarks, and the ways they are applied, are flawed in various ways.

One problem with benchmarks such as MMLU (massive multitask language understanding) is that they are simply too easy for today’s models. MMLU was created in 2020 and consists of 15,908 multiple-choice questions, each with four possible answers, across 57 topics including maths, American history, science and law. At the time, most language models scored little better than 25% on MMLU, which is what you would get by picking answers at random; OpenAI’s GPT-3 did best, with a score of 43.9%. But since then, models have improved, with the best now scoring between 88% and 90%.
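
To see why 25% is the floor, consider how a multiple-choice benchmark is scored. The sketch below is a toy illustration, not the official MMLU evaluation code; the sample question and scoring rule are invented for the example.

```python
import random

# A toy scorer for MMLU-style multiple-choice items (not the official
# evaluation code). Each item offers four options, labelled A to D.
questions = [
    {"question": "What is 2 + 2?", "options": ["3", "4", "5", "6"], "answer": "B"},
    # ...MMLU proper has 15,908 such items across 57 topics
]

def accuracy(predict, items):
    """Fraction of items on which the predicted letter matches the answer key."""
    return sum(1 for item in items if predict(item) == item["answer"]) / len(items)

def random_guesser(item):
    return random.choice("ABCD")  # picking at random: a one-in-four chance

# Repeating the tiny item list gives the random baseline room to converge
# on roughly 25%, the figure early models barely beat.
print(f"Random baseline: {accuracy(random_guesser, questions * 10000):.1%}")
```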

This means it is difficult to draw meaningful distinctions from their scores, a problem known as “saturation”. “It’s like grading high-school students on middle-school tests,” says Dr Fourrier. More difficult benchmarks have been devised: MMLU-Pro has tougher questions and ten possible answers rather than four. GPQA is like MMLU at PhD level, on selected science topics; today’s best models tend to score between 50% and 60% on it. Another benchmark, MuSR (multi-step soft reasoning), tests reasoning ability using, for example, murder-mystery scenarios. When a person reads such a story and works out who the killer is, they are combining an understanding of motivation with language comprehension and logical deduction. AI models are not so good at this kind of “soft reasoning” over multiple steps. So far, few models score better than random on MuSR.

MMLU also highlights two other problems. One is that the answers in such tests are sometimes wrong. A study carried out by Aryo Gema of the University of Edinburgh and colleagues, published in June, found that, of the questions they sampled, 57% of MMLU’s virology questions and 26% of its logical-fallacy ones contained errors. Some had no correct answer; others had more than one. (The researchers cleaned up the MMLU questions to create a new benchmark, MMLU-Redux.)

Then there is a deeper issue, known as “contamination”. LLMs are trained using data from the internet, which may include the exact questions and answers for MMLU and other benchmarks. In short, the models may be cheating, intentionally or not, because they have seen the tests in advance. Indeed, some model-makers may deliberately train a model on benchmark data to boost its score, even though that score then fails to reflect the model’s true ability. One way around this problem is to create “private” benchmarks whose questions are kept secret, or released only in a tightly controlled manner, to ensure they are not used for training (GPQA does this). But then only those with access can independently verify a model’s scores.
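
Contamination can be probed, imperfectly, by checking for verbatim overlap between benchmark questions and a model’s training corpus. The sketch below shows one common heuristic under simple assumptions; the 13-word window and the function names are illustrative choices, not any lab’s actual decontamination pipeline.

```python
# One common contamination heuristic: flag a benchmark question as possibly
# seen in training if a long word n-gram from it appears verbatim in the
# training corpus. The 13-word window here is an illustrative assumption.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_ngrams: set, n: int = 13) -> bool:
    # Any shared n-gram counts as evidence the model may have seen the question.
    return bool(ngrams(question, n) & training_ngrams)

# Usage: build training_ngrams once by streaming the training corpus through
# ngrams(), then screen every benchmark question before trusting a score.
```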

To complicate matters further, it turns out that small changes in the way questions are posed to models can significantly affect their scores. In a multiple-choice test, asking an AI model to state the answer directly, or to reply with the letter or number corresponding to the correct answer, can produce different results. That undermines reproducibility and comparability.
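
To make the problem concrete, here are two ways the same question might be posed. The templates are invented for illustration; the point is that a leaderboard must commit to one of them, because models can score differently under each.

```python
# Two templates for the same multiple-choice item. Both are illustrative,
# not any benchmark's official wording.
item = {
    "question": "Which planet is the largest?",
    "options": ["Mercury", "Jupiter", "Earth", "Mars"],
    "answer": "B",
}

choices = "\n".join(f"{letter}. {option}"
                    for letter, option in zip("ABCD", item["options"]))

# Template 1: reply with the letter of the correct option.
prompt_letter = f"{item['question']}\n{choices}\nAnswer with a single letter:"

# Template 2: state the answer directly, with no options shown.
prompt_direct = f"{item['question']}\nAnswer with the name of the planet:"

# The same model can pass one template and fail the other, so scores are
# comparable only when every model is tested with the same template.
```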

Automated testing systems are now used to evaluate models against benchmarks in a standardised manner. Dr Liang’s team at Stanford has built one such system, called HELM (holistic evaluation of language models), which generates leaderboards showing how a range of models perform on various benchmarks. Dr Fourrier’s team at Hugging Face uses another such system, EleutherAI Harness, to generate leaderboards for open-source models. These leaderboards are more trustworthy than the tables of results provided by model-makers, because the benchmark scores have been generated in a consistent way.
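
Conceptually, such a harness fixes one prompt template and one scoring rule and applies them identically to every model. The sketch below is a schematic of that idea only; the Model interface and item format are hypothetical, and HELM and the EleutherAI harness are far more elaborate in practice.

```python
from typing import Callable, Dict, List, Protocol

class Model(Protocol):
    """Hypothetical minimal interface: any model that completes a prompt."""
    def complete(self, prompt: str) -> str: ...

def leaderboard(models: Dict[str, Model],
                items: List[dict],
                format_prompt: Callable[[dict], str]) -> Dict[str, float]:
    # One prompt template and one scoring rule, applied identically to
    # every model: the consistency that makes leaderboards comparable.
    scores = {}
    for name, model in models.items():
        correct = sum(1 for item in items
                      if model.complete(format_prompt(item)).strip() == item["answer"])
        scores[name] = correct / len(items)
    # Sort best-first, as a leaderboard would display it.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```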

The greatest trick AI ever pulled

As models gain new skills, new benchmarks are being developed to assess them. GAIA, for example, tests AI models on real-world problem-solving. (Some of the answers are kept secret to avoid contamination.) NoCha (novel challenge), announced in June, is a “long context” benchmark consisting of 1,001 questions about 67 recently published English-language novels. The answers depend on having read and understood the whole book, which is supplied to the model as part of the test. Recent novels were chosen because they are unlikely to have been used as training data. Other benchmarks assess models’ ability to solve biology problems or their tendency to hallucinate.

But new benchmarks can be expensive to develop, because they often require human experts to create a detailed set of questions and answers. One answer is to use LLMs themselves to develop new benchmarks. Dr Liang is doing this with a project called AutoBencher, which extracts questions and answers from source documents and identifies the hardest ones.
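
Based only on the description above, such a pipeline might look roughly like the following sketch. Every name in it (generate_qa, the reference models, the item format) is a hypothetical stand-in, not AutoBencher’s actual code.

```python
# A rough sketch of the idea: have an LLM draft question-answer pairs from
# source documents, then keep the questions that current models most often
# get wrong. All names here are hypothetical stand-ins.

def build_benchmark(documents, generate_qa, reference_models, keep=100):
    # Draft candidate question-answer pairs from every source document.
    candidates = [qa for doc in documents for qa in generate_qa(doc)]

    def difficulty(qa):
        # Fraction of reference models that answer incorrectly.
        wrong = sum(1 for model in reference_models
                    if model.answer(qa["question"]) != qa["answer"])
        return wrong / len(reference_models)

    # The hardest surviving questions make the most discriminating benchmark.
    return sorted(candidates, key=difficulty, reverse=True)[:keep]
```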

Anthropic, the startup behind the Claude LLM, has started funding the creation of benchmarks directly, with a particular emphasis on AI safety. “We are super-undersupplied on benchmarks for safety,” says Logan Graham, a researcher at Anthropic. “We are in a dark forest of not knowing what the models are capable of.” On July 1st the company began inviting proposals for new benchmarks, and tools for generating them, which it will co-fund, with a view to making them available to all. This might involve developing ways to assess a model’s ability to develop cyber-attack tools, say, or its willingness to provide advice on making chemical or biological weapons. These benchmarks can then be used to assess the safety of a model before public release.

Historically, says Dr Graham, AI benchmarks have been devised by academics. But as AI is commercialised and deployed in a range of fields, there is a growing need for reliable and specific benchmarks. Startups that specialise in providing AI benchmarks are starting to appear, he notes. “Our goal is to pump-prime the market,” he says, to give researchers, regulators and academics the tools they need to assess the capabilities of AI models, good and bad. The days of AI labs marking their own homework could soon be over.

© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
