
How much do language models memorize?

This summary was generated with NotebookLM.

Summary

The paper explores the concept of memorization in large language models (LLMs), introducing a novel method to quantify it and distinguish it from generalization. The authors define model capacity and investigate its relationship with dataset size, training dynamics, and membership inference.

Here are the main themes and most important ideas or facts from the source:

  1. A Novel Definition of Memorization and Generalization: The paper proposes a new definition of memorization that quantifies the extent to which a model retains information about a specific datapoint, leveraging the concept of compression rate in bits. This approach draws inspiration from Kolmogorov information theory and Shannon information theory.
    • Separation: The core innovation is formally separating memorization into two components:
      • Unintended Memorization (memU): “the information a model contains about a specific dataset.” This captures the sample-level specifics stored by the model.
      • Generalization (memI): “the information a model contains about the true data-generation process.” This represents the reusable patterns and knowledge acquired by the model.
    • Practical Measurement: Unlike previous definitions that faced challenges with uncomputability (e.g., exact Kolmogorov complexity) or were limited to random variables, this approach is designed to be “easily measured in practice using model likelihoods.” The paper approximates Kolmogorov complexity using the best available compression schemes based on model likelihoods (a minimal likelihood-as-compression sketch appears after this list).
    • Critique of Prior Definitions: The authors argue that previous definitions of memorization, particularly extraction-based methods, are insufficient. They state, “Language models can be coerced to output almost any string… hence the fact that a model outputs something is not necessarily a sign of memorization.” They also note that “verbatim reproduction of a text is not a prerequisite for memorization.” Existing mathematical definitions (e.g., based on membership inference or differential privacy) are often defined at the dataset/distribution level, making them “inadequate for measuring memorization for certain instances.”
  2. Model Capacity: A Key Concept: The paper defines and measures the “capacity” of language models, which is “the total amount of memorization that can be stored in θ across all its parameters.”
    • Empirical Capacity Limit: Through experiments with uniform random bitstrings, the authors estimate that models in the GPT family have an approximate capacity of 3.6 bits-per-parameter. This finding is robust, with measurements consistently between 3.5 and 4 bits of information per parameter depending on architecture and precision. “Memorization plateaus at the empirical capacity limit of different-sized models from the GPT-family, approximately 3.6 bits-per-parameter.” (Figure 1 caption).
    • Linear Scaling: Model capacity scales linearly with the number of parameters. “Our models consistently memorize between 3.5 and 3.6 bits per parameter.” This corroborates prior work finding that fact storage scales linearly with model size (a back-of-the-envelope capacity calculation appears after this list).
    • Precision’s Effect: Doubling precision from bfloat16 to float32 only results in a “small increase in capacity, and an increase in α from 3.51 to 3.83 bits-per-parameter on average,” suggesting that “most of the extra model bits added when increasing precision from bfloat16 to float32 are not used for raw storage.”
  3. Training Dynamics and Double Descent: The study investigates how models memorize and generalize during training, particularly observing the phenomenon of “grokking” and “double descent.”
    • Memorization-to-Generalization Shift: When training on real text, language models “memorize until their capacity fills, at which point ‘grokking’ begins, and unintended memorization decreases as models begin to generalize” (Abstract). In other words, on real text models memorize up to a certain capacity, at which point they begin to trade unintended memorization for generalization, learning general, reusable patterns rather than sample-level specifics.
    • Double Descent Explanation: The paper provides an intuitive explanation for the double descent phenomenon, where test performance temporarily worsens and then improves again as data or model size grows. “Our framework shows that [the] double descent phenomenon begins to occur at this point, when the data size exceeds the model capacity in bits.” “Double descent occurs exactly when the dataset size begins to exceed the model’s capacity, when unintended memorization is no longer beneficial for lowering the loss.” (Figure 3 caption). This suggests that models are “forced to share information between datapoints to save capacity, which leads to generalization” once memorizing individual datapoints is no longer feasible (a rough onset calculation appears after this list).
  4. Membership Inference and Scaling Laws: The research explores the relationship between model capacity, dataset size, and the success rate of membership inference attacks.
    • Difficulty with Large Datasets: Membership inference attacks become significantly harder as the dataset size increases relative to model capacity. “If the dataset size is too large compared to the model, membership inference of an average training sample may not be possible.” (Figure 14).
    • Scaling Law for F1 Score: The authors develop a predictive scaling law for the F1 score of a loss-based membership inference attack, based on model capacity and dataset size. The F1 score, representing the attack’s effectiveness, “follows a roughly sigmoidal form with respect to dataset size,” decreasing towards 0.5 (random guessing) as dataset size approaches infinity. “Our scaling laws extrapolate to larger models, and predict most modern language models are trained on too much data to do reliable membership inference on the average data point.” (A sketch of the attack and this sigmoidal shape appears after this list.)
    • Validation on Larger Models: The proposed scaling law was validated on larger GPT-2 models (125M and 1.5B parameters), showing predictions “generally within 1.5 points of the true F1 score,” providing “evidence for why membership inference attacks fail on models trained on extremely large datasets.”
  5. Characteristics of Memorized Data: Even with careful deduplication of training data, models still memorize certain specific datapoints.
    • Rare Words and Unintended Memorization: There is a “strong correlation between trainset TF-IDF and memorization: examples with more rare words are more memorized.” (Figure 16). (A sketch of this correlation check appears after this list.)
    • Non-English Content: Manual analysis revealed that “the most memorized datapoints have extremely rare tokens, typically ones not found in English.” (Table 5). For instance, out of the top twenty memorized sequences, “all but three contain sequences of tokens from other languages (Japanese, Chinese, and Hebrew).” This highlights that despite efforts to generalize, models retain precise information about unique or outlier data points.
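
As referenced in point 1, here is a minimal sketch of the likelihood-as-compression idea: the bits of unintended memorization for a sample are read as the gap between its compressed length (negative log2-likelihood) under a reference model standing in for the data-generation process and under the model being probed. The model choices, the use of an off-the-shelf reference, and the per-sample bookkeeping are illustrative assumptions, not the authors' released code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_under_model(model, tokenizer, text: str) -> float:
    """Compressed length of `text` in bits: -log2 p(text | model)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean cross-entropy in nats over the predicted tokens.
    n_predicted = ids.shape[1] - 1
    return out.loss.item() * n_predicted / math.log(2)

tok = AutoTokenizer.from_pretrained("gpt2")
probed = AutoModelForCausalLM.from_pretrained("gpt2")            # model suspected of memorizing
reference = AutoModelForCausalLM.from_pretrained("gpt2-large")   # stand-in for the true data distribution

sample = "An example training sequence whose memorization we want to score."
unintended_bits = (bits_under_model(reference, tok, sample)
                   - bits_under_model(probed, tok, sample))
print(f"approx. unintended memorization: {unintended_bits:.1f} bits")
```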
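
The bits-per-parameter estimate from point 2 lends itself to back-of-the-envelope arithmetic. The sketch below assumes only the ~3.6 bits-per-parameter figure and linear scaling; the parameter counts are illustrative inputs, not paper results.

```python
BITS_PER_PARAM = 3.6  # empirical capacity estimate for GPT-family models

def capacity_bits(n_params: float) -> float:
    """Total raw-storage capacity in bits, assuming linear scaling in parameters."""
    return BITS_PER_PARAM * n_params

for n_params in (125e6, 1.5e9):
    bits = capacity_bits(n_params)
    print(f"{n_params:.3g} params -> ~{bits:.3g} bits (~{bits / 8 / 2**20:.0f} MiB of raw storage)")
```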
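
Point 3's reading of double descent, that it begins once the dataset's information content exceeds model capacity, can be turned into a rough onset estimate. The bits-per-token value below is an assumed placeholder for the information content of real text, not a number from the paper.

```python
BITS_PER_PARAM = 3.6   # empirical capacity estimate
BITS_PER_TOKEN = 1.5   # assumed information content of real text per token (placeholder)

def double_descent_onset_tokens(n_params: float) -> float:
    """Dataset size (tokens) at which its information content roughly matches model capacity."""
    return BITS_PER_PARAM * n_params / BITS_PER_TOKEN

for n_params in (1e6, 1e8, 1e9):
    print(f"{n_params:.0e} params -> onset near {double_descent_onset_tokens(n_params):.2e} tokens")
```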
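
Point 4 refers to a loss-based membership inference attack and a sigmoidal scaling law for its F1 score. The sketch below shows the generic attack (threshold on per-sample loss) and an illustrative sigmoid with the qualitative shape described, near 1 when capacity dwarfs the data and decaying toward 0.5 as the data grows; the threshold rule, steepness parameter `k`, and the sigmoid itself are assumptions, not the paper's fitted form or coefficients.

```python
import numpy as np

def loss_based_mia(losses: np.ndarray, threshold: float) -> np.ndarray:
    """Predict 'member' (1) when a sample's loss is below the threshold."""
    return (losses < threshold).astype(int)

def f1_score(pred: np.ndarray, truth: np.ndarray) -> float:
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return 2 * tp / (2 * tp + fp + fn)

def predicted_f1(dataset_bits: float, capacity_bits: float, k: float = 5.0) -> float:
    """Illustrative sigmoid: ~1.0 when capacity >> data, ~0.5 when data >> capacity."""
    ratio = dataset_bits / capacity_bits
    return 0.5 + 0.5 / (1.0 + np.exp(k * np.log(ratio)))

# Example: members were trained on, so their losses tend to be lower.
member_losses, nonmember_losses = np.array([1.1, 0.8, 1.4]), np.array([2.0, 1.7, 2.3])
losses = np.concatenate([member_losses, nonmember_losses])
truth = np.array([1, 1, 1, 0, 0, 0])
print("attack F1:", f1_score(loss_based_mia(losses, threshold=1.5), truth))
print("predicted F1 at data = 10x capacity:", predicted_f1(10.0, 1.0))
```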
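
Finally, the TF-IDF correlation in point 5 can be reproduced at a sketch level: score each training example by the rarity of its words and correlate that with a per-example memorization estimate. The corpus and memorization scores below are placeholders, and mean IDF is used as a simple rarity proxy; the paper's exact TF-IDF scoring may differ.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder training examples and placeholder per-example memorization scores (bits).
train_texts = [
    "the cat sat on the mat",
    "a common sentence about the weather today",
    "zxqv arcane hapax tokens rarely seen anywhere",
    "an invoice number qx9981zzt with unusual identifiers",
]
memorization_bits = np.array([0.5, 0.7, 9.2, 6.1])

vec = TfidfVectorizer().fit(train_texts)   # fits the vocabulary and per-term IDF weights
idf, vocab, analyze = vec.idf_, vec.vocabulary_, vec.build_analyzer()

def mean_idf(text: str) -> float:
    """Average IDF of the example's in-vocabulary tokens: a simple word-rarity proxy."""
    tokens = [t for t in analyze(text) if t in vocab]
    return float(np.mean([idf[vocab[t]] for t in tokens]))

rarity = np.array([mean_idf(t) for t in train_texts])
rho, p = spearmanr(rarity, memorization_bits)
print(f"Spearman correlation between word rarity and memorization: {rho:.2f} (p={p:.2g})")
```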

Conclusion

This paper provides a rigorous, measurable framework to understand how language models store and generalize information, defining model capacity in bits per parameter, explaining the double descent phenomenon, and predicting the efficacy of membership inference attacks based on scaling laws. It emphasizes that while generalization occurs, specific, often rare, data points can still be highly memorized.

