DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs

Large language models (LLMs) with very long context windows have been making headlines lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks many possibilities for developers. 

But how well do these long-context LLMs really understand and utilize the vast amounts of information they receive?

Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have progressed in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the structure of that data.

The need for better long-context benchmarks

The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the focus has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, where the model is tasked with finding a specific piece of information within a large context.

“Over time, models have grown considerably more capable in long context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For instance, the popular needle-in-a-haystack evaluation for retrieval has now been well saturated up to extremely long context lengths. Thus, it has become important to determine whether the harder tasks models are capable of solving in short context regimes are also solvable at long ranges.”

Retrieval tasks don’t necessarily reflect a model’s capacity for reasoning over the entire context. A model might be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that evaluate a model’s ability to reason over long contexts have limitations.

“It is easy to develop long reasoning evaluations which are solvable with a combination of only using retrieval and information stored in model weights, thus ‘short-circuiting’ the test of the model’s ability to use the long-context,” Vodrahalli said.

Michelangelo

To address the limitations of current benchmarks, the researchers introduced Michelangelo, a “minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.” 

Michelangelo is based on the analogy of a sculptor chiseling away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information within its context window, rather than simply retrieving isolated facts.

The benchmark consists of three core tasks:

Latent list: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list. “Latent List measures the ability of a model to track a latent data structure’s properties over the course of a stream of code instructions,” the researchers write.

Multi-round co-reference resolution (MRCR): The model must produce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understanding ordering in natural text, to distinguish between similar drafts of writing, and to reproduce a specified piece of previous context subject to adversarially difficult queries,” the researchers write.

“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures the model’s ability to understand whether it knows what it doesn’t know based on the presented context,” the researchers write.
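
To make the Latent List idea concrete, here is a toy generator in Python. It is only an illustrative sketch, not the researchers' actual task generator: the function name `make_latent_list_task`, the distractor statements, and the choice of list operations are all assumptions made for this example. It mixes operations that change a list `a` with statements that only read it, and records the final state a model would be expected to report.

```python
import random

# Hypothetical distractors: statements that read the list `a` but never change it.
DISTRACTORS = ["b = sorted(a)", "n = len(a)", "c = a.count(3)"]


def make_latent_list_task(num_relevant, num_distractors, seed=0):
    """Return (prompt, expected_final_list) for a toy Latent List-style task."""
    rng = random.Random(seed)
    state = []      # ground-truth contents of `a`, tracked as we generate
    relevant = []   # statements that actually mutate `a`
    for _ in range(num_relevant):
        if state and rng.random() < 0.3:
            state.pop()
            relevant.append("a.pop()")
        else:
            value = rng.randint(0, 9)
            state.append(value)
            relevant.append(f"a.append({value})")

    # Interleave relevant and irrelevant statements, keeping the relevant
    # ones in their original order so the ground truth stays valid.
    slots = ["r"] * num_relevant + ["d"] * num_distractors
    rng.shuffle(slots)
    ops = iter(relevant)
    body = [next(ops) if slot == "r" else rng.choice(DISTRACTORS) for slot in slots]

    prompt = "\n".join(["a = []", *body, "# What is the final value of a?"])
    return prompt, state


prompt, answer = make_latent_list_task(num_relevant=4, num_distractors=6)
print(prompt)
print("expected:", answer)
```

In this sketch, a model that only retrieves the last statement mentioning `a` would usually get the wrong answer; it has to follow every mutating operation and ignore the rest.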

Latent Structure Queries

The tasks in Michelangelo are based on a novel framework called Latent Structure Queries (LSQ). LSQ provides a general approach for designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test the model’s understanding of implicit information as opposed to retrieving simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.

“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test language model context understanding beyond retrieval,” the researchers write.

LSQ has three key differences from other approaches to evaluating long-context LLMs. First, it has been explicitly designed to avoid short-circuiting flaws in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently. And finally, it is general enough to capture a large range of reasoning tasks. The three tests used in Michelangelo cover code interpretation and reasoning over loosely written text.
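
The second point, scaling task complexity and context length independently, can be pictured with a small Python sketch. This is an assumption-laden illustration rather than the LSQ methodology itself: the helper `build_context`, the filler sentence, and the "Step" statements are hypothetical. The number of relevant statements stands in for complexity, while the amount of irrelevant filler sets the context length.

```python
import random

# Hypothetical irrelevant text used to pad the context (9 words).
FILLER = "The quick brown fox jumps over the lazy dog."


def build_context(relevant, target_words, seed=0):
    """Scatter `relevant` statements, in order, through filler text
    until the context is roughly `target_words` words long."""
    rng = random.Random(seed)
    used = sum(len(r.split()) for r in relevant)
    n_filler = max(0, (target_words - used) // len(FILLER.split()))

    # Pick a filler gap for each relevant statement; sorting the gaps
    # keeps the statements in their original order.
    positions = sorted(rng.randrange(n_filler + 1) for _ in relevant)
    placed = list(zip(positions, relevant))

    chunks, idx = [], 0
    for gap in range(n_filler + 1):
        while idx < len(placed) and placed[idx][0] == gap:
            chunks.append(placed[idx][1])
            idx += 1
        if gap < n_filler:
            chunks.append(FILLER)
    return " ".join(chunks)


# Vary the two knobs separately: complexity (relevant statements)
# and context length (target word count).
for complexity in (2, 8):
    steps = [f"Step {k}: set x{k} = {k}." for k in range(complexity)]
    for target in (100, 1000):
        context = build_context(steps, target)
        print(complexity, target, len(context.split()))
```

Holding `complexity` fixed while raising `target` isolates the effect of context length, and the reverse isolates the effect of task difficulty.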

“The goal is that long-context beyond-reasoning evaluations implemented by following LSQ will lead to fewer scenarios where a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.

Evaluating frontier models on Michelangelo

The researchers evaluated ten frontier LLMs on Michelangelo, including variants of Gemini, GPT-4, GPT-4o, and Claude. They tested the models on contexts of up to 1 million tokens. Gemini models performed best on MRCR, GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.

However, all models exhibited a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve in their ability to reason over large amounts of information.

Figure: Frontier LLMs struggle with reasoning on long-context windows (source: arXiv)

“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we investigate in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well on different context ranges and on different tasks. What does seem to be universal across models is the initial drop in performance on long reasoning tasks.”

The Michelangelo evaluations capture basic primitives necessary for long-context reasoning and the findings can have important implications for enterprise applications. For example, in real-world applications where the model can’t rely on its pretraining knowledge and must perform multi-hop reasoning over many disparate locations in very long contexts, Vodrahalli expects performance to drop as the context length grows.

“This is particularly true if the documents have a lot of information that is irrelevant to the task at hand, making it hard for a model to easily immediately distinguish which information is relevant or not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all of the relevant information to answer a question is located in one general spot in the document.”

The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.

