Are Scaling Laws Really Plateauing?
We’ve all seen the charts: as you increase model/dataset size, test error (used to measure performance) decreases following a seemingly predictable curve (e.g. Kaplan et al. (2020) and Hoffmann et al. (2022)). At some point, the curve flattens, and people start talking about “plateaus” in scaling laws—implying we’ve reached some fundamental limit.
In this blog post, I will focus on data scaling laws: simple rules that describe how performance improves as we feed more data into a model. Researchers (mostly in industry, where such experiments are feasible) often report that error decreases at a certain rate, something like $\text{error} \approx C\, n^{-\alpha}$, where $n$ is the number of training samples (e.g. tokens in the case of language models). On a log-log plot, such a law appears as a straight line with slope $-\alpha$.
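To make the form of the law concrete, here is a minimal sketch (my own illustration; the dataset sizes and errors are made-up numbers, not measurements from any real model) of how one would estimate the exponent $\alpha$ by fitting a line in log-log space:

```python
import numpy as np

# Hypothetical (made-up) measurements: dataset sizes and corresponding test errors.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
test_error = np.array([0.91, 0.52, 0.30, 0.17, 0.10])

# If error ~ C * n^(-alpha), then log(error) = log(C) - alpha * log(n),
# so an ordinary least-squares line fit in log-log space recovers alpha.
slope, intercept = np.polyfit(np.log(n), np.log(test_error), 1)
alpha, C = -slope, np.exp(intercept)
print(f"estimated exponent alpha ~ {alpha:.2f}, constant C ~ {C:.2f}")
```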
However, such scaling laws are established in a very specific setting: the model architecture is fixed (only width/depth etc. are scaled), the dataset is fixed and only the number of training samples drawn from it is scaled, and the parameterization of the hyperparameters (learning rate, batch size, etc.) is fixed (e.g. the same learning rate is used for all layers in the MLP block). With this in mind, it is natural to ask: are the observed plateaus real fundamental limits, or the result of suboptimal scaling? (Here, suboptimal should be understood both in terms of asymptotic performance and scaling rate.)
In this post, I’ll show through a simple example that the scaling laws we see depend heavily on how we use the data and the quality of the data itself. With different data selection strategies—even starting from the same underlying distribution—we can end up with different scaling rates and different asymptotic performance. This suggests that before concluding we’ve hit a true plateau, we need to understand what the optimal scaling laws could be. Without that, “plateauing” might be less about nature’s hard limit and more about our current suboptimal approach to learning at scale. While theorists (especially statisticians) might be familiar with this, I think it is important to highlight this point to practitioners.
I. Scaling Laws depend on Data Strategies
Consider a simple linear model:
$$y = \beta^\star x + \varepsilon, \qquad x \sim \mathcal{N}(0, 1), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$
where $\beta^\star$ is the true slope and $\sigma^2$ is the noise level.
We fit a slope via ordinary least squares (OLS) without an intercept:
$$\hat{\beta}_n = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta x_i)^2.$$
The solution to this problem is given by
$$\hat{\beta}_n = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},$$
and the test error is given by
$$\mathcal{E}(\hat{\beta}_n) = \mathbb{E}_{x,\varepsilon}\big[(y - \hat{\beta}_n x)^2\big] = \sigma^2 + (\hat{\beta}_n - \beta^\star)^2,$$
using $\mathbb{E}[x^2] = 1$ for the test distribution.
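As a quick sanity check of these formulas, here is a minimal numpy sketch (my own addition; the true slope, noise level, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_star, sigma = 1.0, 0.5          # illustrative true slope and noise level

def fit_ols_no_intercept(x, y):
    """Closed-form OLS slope without intercept: sum(x*y) / sum(x^2)."""
    return np.sum(x * y) / np.sum(x * x)

def test_error(beta_hat):
    """Expected test error: sigma^2 + (beta_hat - beta_star)^2, since E[x^2] = 1."""
    return sigma**2 + (beta_hat - beta_star) ** 2

# Draw a training set from the true model and check the fit.
n = 10_000
x = rng.normal(size=n)
y = beta_star * x + sigma * rng.normal(size=n)
beta_hat = fit_ols_no_intercept(x, y)
print(f"beta_hat = {beta_hat:.4f}, test error = {test_error(beta_hat):.4f}")
```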
We will see that under the full, ideal usage of data, the test error converges to $\sigma^2$ at a rate $n^{-1}$. But if we restrict or filter our data in certain ways, that rate can worsen, or improve.
1. Scenario A: Full-Range. In this case, we assume that training data is drawn from the true distribution.
Training Data: $n$ i.i.d. samples $(x_i, y_i)$ with $x_i \sim \mathcal{N}(0, 1)$.
Standard OLS results show that $\hat{\beta}_n$ converges to $\beta^\star$ with $\mathbb{E}\big[(\hat{\beta}_n - \beta^\star)^2\big] = O(\sigma^2 / n)$, which yields the following scaling law:
- The test error $\mathcal{E}(\hat{\beta}_n)$ converges to $\sigma^2$ at a rate $n^{-1}$.
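To see the $n^{-1}$ rate numerically, here is a small simulation sketch (my own; the constants are arbitrary) that averages the excess test error $(\hat{\beta}_n - \beta^\star)^2$ over repeated draws:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_star, sigma, trials = 1.0, 0.5, 200

for n in [100, 1_000, 10_000, 100_000]:
    excess = []
    for _ in range(trials):
        x = rng.normal(size=n)                        # full-range inputs, E[x^2] = 1
        y = beta_star * x + sigma * rng.normal(size=n)
        beta_hat = np.sum(x * y) / np.sum(x * x)      # OLS without intercept
        excess.append((beta_hat - beta_star) ** 2)    # excess test error
    # The average excess error shrinks roughly like sigma^2 / n.
    print(f"n={n:>7}: excess ~ {np.mean(excess):.2e}, sigma^2/n = {sigma**2 / n:.2e}")
```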
2. Scenario B: Samples Near Zero. Here, we assume that the data collection process is imperfect, and we only have access to samples with small norms.
Training Data: $n$ samples with $|x_i|$ small, e.g. $|x_i| \le \delta_n$, where $\delta_n \to 0$ as $n \to \infty$.
In this case, the slope signal is weak, so $\hat{\beta}_n$ has a high variance, which naturally slows down the convergence rate. More precisely, we have:
$$\operatorname{Var}\big(\hat{\beta}_n \mid x_{1:n}\big) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2} \ge \frac{\sigma^2}{n\,\delta_n^2}.$$
As a result, we have a slower scaling law:
- The test error converges to $\sigma^2$ at a rate $(n\,\delta_n^2)^{-1}$: for instance, if $\delta_n = n^{-\gamma}$ for some $\gamma \in (0, 1/2)$, the rate is $n^{-(1-2\gamma)}$, slower than $n^{-1}$.
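A similar sketch (again my own, with the arbitrary choice $\delta_n = n^{-1/4}$, i.e. $\gamma = 1/4$) shows the slower $n^{-1/2}$ decay when the inputs are concentrated near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
beta_star, sigma, trials = 1.0, 0.5, 200

for n in [100, 1_000, 10_000, 100_000]:
    delta_n = n ** -0.25                              # input scale shrinks with n (gamma = 1/4)
    excess = []
    for _ in range(trials):
        x = delta_n * rng.normal(size=n)              # inputs of typical magnitude delta_n
        y = beta_star * x + sigma * rng.normal(size=n)
        beta_hat = np.sum(x * y) / np.sum(x * x)
        excess.append((beta_hat - beta_star) ** 2)
    # The excess error now scales like sigma^2 / (n * delta_n^2) = sigma^2 / n^(1 - 2*gamma).
    print(f"n={n:>7}: excess ~ {np.mean(excess):.2e}, predicted {sigma**2 / (n * delta_n**2):.2e}")
```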
3. Scenario C: Select Low-Noise Samples ($|\varepsilon_i|$ small). Here, we assume that we have access to a verifier that can select low-noise samples, i.e. samples for which $|\varepsilon_i|$ is small, effectively lowering the noise in training.
Training Data: $n$ low-noise samples, i.e. samples for which $|\varepsilon_i|$ is small, so that the effective training noise level is below $\sigma^2$. The variance of $\hat{\beta}_n$ might then shrink faster than $\sigma^2 / n$. This is straightforward to see:
$$\operatorname{Var}\big(\hat{\beta}_n \mid x_{1:n}\big) = \frac{\sigma_{\mathrm{train}}^2}{\sum_{i=1}^{n} x_i^2} \approx \frac{\sigma_{\mathrm{train}}^2}{n},$$
where $\sigma_{\mathrm{train}}^2$ is the effective noise level in the training data. The final test error still approaches $\sigma^2$, the irreducible noise of the test distribution. However, if we can use the verifier to select low-noise samples, we can speed up the convergence rate. For instance, if we can select samples such that $\sigma_{\mathrm{train}}^2 \lesssim \sigma^2\, n^{-\alpha}$ for some $\alpha > 0$, we have:
- The test error converges to $\sigma^2$ at a rate $n^{-(1+\alpha)}$, faster than $n^{-1}$.
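Here is a sketch of this scenario (my own; the verifier is idealized by directly generating training noise with standard deviation $\sigma\, n^{-\alpha/2}$, using the arbitrary choice $\alpha = 1/2$):

```python
import numpy as np

rng = np.random.default_rng(3)
beta_star, sigma, alpha, trials = 1.0, 0.5, 0.5, 200

for n in [100, 1_000, 10_000, 100_000]:
    sigma_train = sigma * n ** (-alpha / 2)           # idealized verifier: lower training noise
    excess = []
    for _ in range(trials):
        x = rng.normal(size=n)
        y = beta_star * x + sigma_train * rng.normal(size=n)
        beta_hat = np.sum(x * y) / np.sum(x * x)
        excess.append((beta_hat - beta_star) ** 2)
    # The excess error now scales like sigma_train^2 / n = sigma^2 / n^(1 + alpha).
    print(f"n={n:>7}: excess ~ {np.mean(excess):.2e}, predicted {sigma_train**2 / n:.2e}")
```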
Key Takeaway
Even though the true process is the same, the way we use or filter the data can drastically alter the scaling law: how quickly $\hat{\beta}_n$ converges to the true slope $\beta^\star$, and hence how quickly the test error approaches $\sigma^2$. Some strategies slow down learning (Scenario B), while others can speed it up (Scenario C).
II. Asymptotic Performance depends on Data Quality
To show the impact of training data quality on asymptotic performance, we consider a linear model with a bias term that we don't model. Specifically, suppose the true data comes from the following model:
$$y = \beta^\star x + b + \varepsilon,$$
where $b$ is a fixed bias term, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise, and the input $x$ is drawn from $\mathrm{Unif}([-1, 1])$ at test time.
We fit a model of the form $\hat{y} = \beta x$ (ignoring the bias). This introduces an irreducible error in the model. By minimizing the mean squared error on the training data, we find the optimal parameter:
$$\hat{\beta}_n = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
As $n$ increases, $\hat{\beta}_n$ converges to the population minimizer
$$\beta_\infty = \frac{\mathbb{E}_{\mathrm{train}}[x y]}{\mathbb{E}_{\mathrm{train}}[x^2]} = \beta^\star + b\,\frac{\mathbb{E}_{\mathrm{train}}[x]}{\mathbb{E}_{\mathrm{train}}[x^2]}.$$
The asymptotic test error is given by:
$$\mathcal{E}_\infty = \mathbb{E}\big[(y - \beta_\infty x)^2\big] = \sigma^2 + b^2 + \tfrac{1}{3}\,(\beta_\infty - \beta^\star)^2,$$
using $\mathbb{E}[x] = 0$ and $\mathbb{E}[x^2] = 1/3$ under the test distribution $\mathrm{Unif}([-1, 1])$.
We will now consider two scenarios that differ in how they use the data, and see how they affect the asymptotic performance.
1. Scenario A (Full-Range): In this case, the training inputs are drawn uniformly from $[-1, 1]$, which covers the entire input space. We then have $\mathbb{E}_{\mathrm{train}}[x] = 0$, so $\beta_\infty = \beta^\star$, and the test error approaches the asymptotic limit $\sigma^2 + b^2$.
2. Scenario B ($x \ge 0$): Assume that for some reason, we only have access to training samples satisfying $x_i \ge 0$, i.e. the training inputs are drawn uniformly from $[0, 1]$, which introduces a bias in the training data distribution. In real-life scenarios, this represents the case where the training data does not cover the whole input space (of the true data distribution). This changes the asymptotic solution to:
$$\beta_\infty = \beta^\star + b\,\frac{\mathbb{E}_{\mathrm{train}}[x]}{\mathbb{E}_{\mathrm{train}}[x^2]} = \beta^\star + b\,\frac{1/2}{1/3} = \beta^\star + \frac{3b}{2}.$$
The asymptotic test error becomes:
$$\mathcal{E}_\infty = \sigma^2 + b^2 + \tfrac{1}{3}\Big(\tfrac{3b}{2}\Big)^2 = \sigma^2 + b^2 + \tfrac{3b^2}{4} = \sigma^2 + \tfrac{7b^2}{4},$$
where the extra term $\tfrac{3b^2}{4}$ accounts for the bias introduced by using only half the support of $x$, and the final test error is worse than in Scenario A.
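A small numerical sketch (my own; the values of $\beta^\star$, $b$, and $\sigma$ are arbitrary illustrative choices) recovers both asymptotic limits:

```python
import numpy as np

rng = np.random.default_rng(4)
beta_star, b, sigma = 1.0, 0.3, 0.5
n = 1_000_000                                         # large n to approximate the asymptotic fit

def fit_and_eval(x_train):
    """Fit the misspecified model y ~ beta * x on x_train, evaluate on x ~ Unif[-1, 1]."""
    y_train = beta_star * x_train + b + sigma * rng.normal(size=n)
    beta_hat = np.sum(x_train * y_train) / np.sum(x_train**2)
    x_test = rng.uniform(-1, 1, size=n)
    y_test = beta_star * x_test + b + sigma * rng.normal(size=n)
    return np.mean((y_test - beta_hat * x_test) ** 2)

err_full = fit_and_eval(rng.uniform(-1, 1, size=n))   # Scenario A: full-range training inputs
err_half = fit_and_eval(rng.uniform(0, 1, size=n))    # Scenario B: only x >= 0
print(f"full range: {err_full:.3f} (theory: sigma^2 + b^2 = {sigma**2 + b**2:.3f})")
print(f"x >= 0 only: {err_half:.3f} (theory: sigma^2 + 7b^2/4 = {sigma**2 + 7 * b**2 / 4:.3f})")
```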
III. What Does This Mean for "Plateauing" Scaling Laws?
In the scenarios above, the underlying data distribution is identical. Yet, the test errors and the scaling laws are different:
- Different scaling rates ($n^{-1}$, $n^{-(1-2\gamma)}$, $n^{-(1+\alpha)}$, etc.).
- Different asymptotic limits ($\sigma^2 + b^2$ vs. $\sigma^2 + \tfrac{7b^2}{4}$).
This demonstrates that the way we use/select data—not just the data itself—can significantly impact both the scaling law and the best achievable performance.
IV. Beyond Simple Models: Implications for Large-Scale Models
For large language models and other neural networks, scaling laws are often treated as fixed. But as these examples show, they are highly dependent on data strategies. The impact of data quality has been studied through the lens of data pruning: see Sorscher et al. (2022), where the authors studied the effect of data pruning on scaling laws, and our paper Ayed et al. (2023) for a detailed analysis of how the quality of data selection affects scaling laws. While this blog post focused on data scaling laws, the same principles apply to other scaling laws, such as model scaling laws.
If we rely on suboptimal scaling strategies, we might misinterpret observed plateaus as fundamental limitations. In reality, they could reflect missed opportunities to optimize scaling strategies. Before concluding that a scaling law has plateaued, we need to ask: Is this the best possible scaling law? Is the model/dataset parameterization optimal? Without knowing what the optimal scaling law looks like, it's premature to assume that any observed behavior reflects a true limit rather than a shortcoming in our approach.
References
Hestness, J. et al. Deep learning scaling is predictable, empirically. (2017).
Kaplan, J. et al. Scaling laws for neural language models. (2020).
Rosenfeld, J. et al. A constructive prediction of the generalization error across scales. (2020).
Hoffmann, J. et al. Training compute-optimal large language models. (2022).
Sorscher, B. et al. Beyond neural scaling laws: beating power law scaling via data pruning. (2022).
Ayed, F. et al. Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (2023).