How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

Share This Post

[ad_1]

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More


Most people interested in generative AI likely already know that Large Language Models (LLMs) — like those behind ChatGPT, Anthropic’s Claude, and Google’s Gemini — are trained on massive datasets: trillions of words pulled from websites, books, codebases, and, increasingly, other media such as images, audio, and video. But why?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns, and the world — encoded in the form of billions of parameters, or “settings,” in a network of artificial neurons (which are mathematical functions that transform input data into output signals).

By being exposed to all this training data, LLMs learn to detect and generalize patterns that are reflected in the parameters of their neurons. For instance, the word “apple” often appears near terms related to food, fruit, or trees, and sometimes computers. The model picks up that apples can be red, green, or yellow, or even sometimes other colors if rotten or rare, are spelled “a-p-p-l-e” in English, and are edible. This statistical knowledge influences how the model responds when a user enters a prompt — shaping the output it generates based on the associations it “learned” from the training data.

But a big question — even among AI researchers — remains: how much of an LLM’s training data is used to build generalized representations of concepts, and how much is instead memorized verbatim or stored in a way that is identical or nearly identical to the original data?

This is important not only for better understanding how LLMs operate — and when they go wrong — but also as model providers defend themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not — if the models are found to generate outputs based on generalized patterns rather than exact replication — developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.

Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week from researchers at Meta, Google DeepMind, Cornell University, and NVIDIA finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits means in practice:

  • A single bit is the smallest unit of digital data, representing either a 0 or a 1. Eight bits make up one byte.
  • Storing 3.6 bits allows for approximately 12.13 distinct values, as calculated by 2^3.6.
  • This is about the amount of information needed to choose one of 12 options—similar to selecting a month of the year or the outcome of a roll of a 12-sided die.
  • It is not enough to store even one English letter (which needs about 4.7 bits), but it is just enough to encode a character from a reduced set of 10 common English letters (which requires about 3.32 bits).
  • In bytes, 3.6 bits is 0.45 bytes—less than half the size of a typical character stored in ASCII (which uses 8 bits or 1 byte).

This number is model-independent within reasonable architectural variations: different depths, widths, and precisions produced similar results. The estimate held steady across model sizes and even precision levels, with full-precision models reaching slightly higher values (up to 3.83 bits/parameter).

More training data DOES NOT lead to more memorization — in fact, a model will be less likely to memorize any single data point

One key takeaway from the research is that models do not memorize more when trained on more data. Instead, a model’s fixed capacity is distributed across the dataset, meaning each individual datapoint receives less attention.

Jack Morris, the lead author, explained via the social network X that “training on more data will force models to memorize less per-sample.”

These findings may help ease concerns around large models memorizing copyrighted or sensitive content.

If memorization is limited and diluted across many examples, the likelihood of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.

How the researchers identified these findings

To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring that no patterns, structure, or redundancy existed across examples.

Because each sample is unique and devoid of shared features, any ability the model shows in reconstructing or identifying these strings during evaluation directly reflects how much information it retained—or memorized—during training.

The key reason for this setup was to completely eliminate the possibility of generalization. Unlike natural language—which is full of grammatical structure, semantic overlap, and repeating concepts—uniform random data contains no such information. Every example is essentially noise, with no statistical relationship to any other. In such a scenario, any performance by the model on test data must come purely from memorization of the training examples, since there is no distributional pattern to generalize from.

The authors argue their method is perhaps one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, even when they produce an output that matches the training data, it’s difficult to know whether they memorized the input or merely inferred the underlying structure from the patterns they’ve observed.

This method allows the researchers to map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed consistent results: 3.6 bits memorized per parameter, which they report as a fundamental measure of LLM memory capacity.

The team applied their methodology to models trained on real-world datasets as well. When trained on text, models exhibited a balance of memorization and generalization.

Smaller datasets encouraged more memorization, but as dataset size increased, models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as “double descent,” where performance temporarily dips before improving once generalization kicks in.

The study also examined how model precision—comparing training in bfloat16 versus float32—affects memorization capacity. They observed a modest increase from 3.51 to 3.83 bits-per-parameter when switching to full 32-bit precision. However, this gain is far less than the doubling of available bits would suggest, implying diminishing returns from higher precision.

Unique data is more likely to be memorized

The paper proposes a scaling law that relates a model’s capacity and dataset size to the effectiveness of membership inference attacks.

These attacks attempt to determine whether a particular data point was part of a model’s training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.

While the paper focuses on average-case behavior, some researchers have pointed out that certain types of data—such as highly unique or stylized writing—may still be more susceptible to memorization.

The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.

Moving toward greater human understanding of LLM understanding

By introducing a principled and quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy, and ethical standards in AI development. The findings suggest that more data—and not less—may be the safer path when training large-scale language models.

To put total model memorization in perspective:

  • A 500K-parameter model can memorize roughly 1.8 million bits, or 225 KB of data.
  • A 1.5 billion parameter model can hold about 5.4 billion bits, or 675 megabytes of raw information.
  • This is not comparable to typical file storage like images (e.g., a 3.6 MB uncompressed image is about 30 million bits), but it is significant when distributed across discrete textual patterns.

I’m no lawyer or legal expert, but I would highly expect such research to be cited in the numerous ongoing lawsuits between AI providers and data creators/rights owners.


[ad_2]
Source link

Related Posts

Eat and Run Verification as a Safety Standard in Online Betting

The Growing Need for Safety in Online BettingOnline betting...

High-Quality Online Gaming Sites Like Gaza88

The online gaming industry has matured into a highly...

Online Gaming Platform Shutdown Scams: A Warning Report

The world of online gaming is filled with exciting...

The Best Apps for Mobile Live Video Broadcasting

Why Mobile Live Broadcasting Keeps GrowingMobile live video broadcasting...

Top Benefits of Choosing Mobile Crane Hire Over Buying

In today’s fast-moving construction and industrial landscape, flexibility and...

Dive Into New Challenges and Win Big

Embrace the Excitement of Overcoming Challenges and Achieving Great...
- Advertisement -spot_img
Slot Gacor Slot777slot mahjongslot mahjongjudi bola onlinesabung ayam onlinejudi bola onlinelive casino onlineslot danaslot thailandsabung ayam onlinejudi bola onlinesitus live casino onlineslot mahjong waysbandar togel onlinejudi bolasabung ayam onlinejudi bolaSABUNG AYAM ONLINESABUNG AYAM ONLINEJUDI BOLA ONLINESABUNG AYAM ONLINEjudi bola onlineslot mahjong wayslive casino onlinejudi bola onlinejudi bola onlinesabung ayam onlinejudi bola onlinemahjong wayssabung ayam onlinesbobet88slot mahjongsabung ayam onlinesbobet mix parlayslot777judi bola onlinesabung ayam onlinesabung ayam onlinejudi bola onlinelive casino onlineslot mahjong waysjuara303juara303juara303juara303juara303juara303juara303juara303SV388Mix ParlayBLACKJACKSLOT777Sabung Ayam OnlineBandar Judi BolaAgen Sicbo Online
agen sabung ayamslot mahjong gacorsabung ayam onlinejudi bola onlinelive casino onlineslot mahjongsabung ayam onlinejudi bola onlinelive casino onlineslot mahjongslot mahjongsabung ayam onlinescatter hitamlive casino onlinemix parlaysabung ayam onlinelive casinomahjong waysmix parlaysabung ayam onlinelive casinomahjong waysmix parlaySBOBETSBOBETCASINO ONLINESBOBETSBOBET88SABUNG AYAM ONLINESBOBETagen judi bolalive casino onlinesabung ayam onlinejudi bola sbobetsabung ayam onlineSabung Ayam OnlineJudi Bola OnlineAgen Live Casino OnlineMahjong Ways 2Sabung Ayam OnlineJudi Bola OnlineAgen Live Casino OnlineMahjong Ways 2Sabung Ayam OnlineJudi Bola OnlineAgen Live Casino OnlineMahjong Ways 2slot gacorjudi bolamix parlayjudi bolasv388SABUNG AYAM ONLINELIVE CASINO ONLINEJUDI BOLAMAHJONG WAYSSLOT MAHJONGJUDI BOLA ONLINELIVE CASINO ONLINESABUNG AYAM ONLINE
SABUNG AYAM ONLINESABUNG AYAM ONLINEJUDI BOLA ONLINEJUDI BOLA ONLINESABUNG AYAM ONLINESABUNG AYAM ONLINESABUNG AYAM ONLINESABUNG AYAM ONLINEjudi bola onlinesabung ayam onlinelive casino onlinesitus toto 4djudi bola onlinejudi bola onlinesabung ayam onlinelive casino onlinejudi bola onlinemix parlaysbobet88sv388sbobet mix parlayws168sbobet88sv388sv388sbobet88sabung ayam onlinejudi bola onlinesabung ayam onlinesbobet mix parlaysabung ayam onlinejudi bola onlineslot gacorsabung ayam onlinejudi bola onlinelive casino onlineslot mahjong waysjuara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303juara303SV388Mix ParlayLive Casino OnlineSitus Slot GacorSV388SBOBET WAPBlackjackPragmatic PlaySV388Judi Bola OnlineBlackjackKakek ZeusSV388Mix ParlayAgen BlackjackSlot Gacor Onlinesabung ayam onlinejudi bola onlinesabung ayam onlinejudi bola onlinejudi bola onlinejudi bola onlinejudi bola onlinesabung ayam onlinejudi bola onlineslot mahjong wayssabung ayam onlinejudi bolaslot mahjonglive casino onlinesabung ayam onlinejudi bola onlineslot mahjong gacorsitus toto togel 4Dsabung ayam onlinesitus toto togel 4Dsitus live casinojudi bola onlinesitus slot mahjongjudi bolasabung ayam onlinesabung ayam onlinemahjong wayssabung ayam onlinejudi bolasabung ayam onlinejudi bola
judi bola onlinejudi bola onlinejudi bola onlinejudi bola onlineJUDI BOLA ONLINESBOBET88JUDI BOLA ONLINEJUDI BOLA ONLINESV388Judi Bola OnlineBlackjackKakek ZeusSV388SBOBET WAPAgen BlackjackSlot Gacor Onlinejuara303juara303juara303juara303juara303juara303juara303juara303judi bola onlinejudi bola onlinejudi bola onlinesabung ayam onlinejudi bolasabung ayam onlinesabung ayam onlinejudi bola onlinesitus live casino onlineslot mahjong wayssabung ayam onlinesitus live casinojudi bola onlinedexel
Slot Mahjong Waysslot danaslot danaslot danasabung ayam onlinesabung ayam onlineJUDI BOLA ONLINESV388Mix ParlayAgen Casino OnlineSLOT777Sabung Ayam OnlineAgen Judi BolaLive Casino Onlinesabung ayam onlinesabung ayam onlinejudi bola onlineslot mahjong wayssabung ayam onlinejudi bola onlinesitus live casino onlineagen togel onlineSabung Ayam OnlineJudi Bola OnlineSlot MahjongBandar togelSabung Ayam OnlineJudi Bola Onlinejudi bola onlinejudi bola onlinesabung ayam onlinelive casino onlineJUDI BOLA ONLINESBOBET88JUDI BOLA ONLINEmix parlaymix parlaylive casinosabung ayam onlinemix parlayslot danaslot mahjongslot mahjongjudi bolaMAHJONG WAYS 2SABUNG AYAM ONLINELIVE CASINO ONLINESABUNG AYAM ONLINESBOBETLIVE CASINO ONLINESLOT MAHJONG WAYSSABUNG AYAM ONLINEMIX PARLAYSABUNG AYAM ONLINESABUNG AYAM ONLINEWALA MERONWALA MERONSITUS SABUNG AYAMSITUS SABUNG AYAMjudi bola terpercayaSabung Ayam Onlinemix parlaySabung Ayam OnlineZeus Slot GacorSitus Judi BolaSabung Ayam Onlinesitus sabung ayamSlot MahjongSV388SBOBET88live casino onlineslot mahjong gacorSV388SBOBET88live casino onlineslot mahjong gacorSabung Ayam OnlineJudi Bola OnlineCasino OnlineMahjong Ways 2Sabung Ayam OnlineJudi Bola OnlineLive Casino OnlineMahjong Ways 2judi bolacasino onlinesv388sabung ayam onlinejudi bola onlineagen live casino onlinemahjong waysLIVE CASINOJUDI BOLA ONLINESABUNG AYAM ONLINESITUS BOLASV388LIVE CASINO ONLINESLOT QRISSABUNG AYAM ONLINEMIX PARLAYMIX PARLAYJUDI BOLA ONLINESLOT MAHJONG
Mahjong Ways 2mahjong ways 2indojawa88daftar dan login wahanabetCapWorks Official ContactAynsley Official SitedexelHarifuku Clinic Official AccessNusa Islands Bali Official PackagesTrinidad and Tobago Pilots’ Association Official About PageNusa Islands Bali Official ContactCapworks Official SiteTech With Mike First Official SiteSahabat Tiopan Official SiteOcean E Soft Official SiteCang Vu Hai Phong Official SiteThe Flat Official SiteTop Dawg Tavern Official SiteDuhoc Interlink Official SiteRatiohead Official SiteMAN Surabaya E-Learning Official SiteShaker Group Official SiteTakaKawa Shoten Official SiteBrydan Solutions Official SiteConcursos Rodin Official SiteConmou Official SiteCareer Wings Official SiteMontero Espinosa Official SiteBDF Ventura Official SiteAkura Official SiteNamulanda Technical Institute Official Sitemenu home roasted coffeetosayama academy workshopjudi bola onlineContactez le Monaco Rugby Sevens - Club Professionnel à 7Virtual Eco Museum Official Event 2025DRT Seitai Official Contacta leading company in UWB technology development