What are scaling laws?

"Scaling laws" describe the relationship between an AI model's performance and the three aspects of compute: the length of the model's training run, the amount of training data used, and the size of the model. These laws are used to guide the allocation of limited resources between those three variables in a way that produces the most capable model possible.

The compute used to train large foundation models like GPT-4 is not cheap, so researchers want to be confident that they’re allocating their resources efficiently. So, in 2020, rather than continuing to rely on gut feeling, OpenAI published the first generation of scaling laws.

Scaling laws are used to decide on trade-offs like: Should I pay Stack Overflow for a license to train on their data? Or should I buy more GPUs? Or should I pay the bigger electricity bill that comes with training my model for longer? If my compute budget goes up by 10×, how many parameters should I add to my model to make the best possible use of my GPUs?
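As a minimal sketch of how that last question gets answered (the function name and starting numbers below are made up for illustration, and the exponents are rough values in the spirit of OpenAI’s 2020 paper rather than exact figures), a scaling law boils down to a rule for splitting extra compute between parameters and tokens:

```python
# Sketch: splitting a compute increase between model size and dataset size.
# The exponents are illustrative; each scaling-law paper fits its own values.

def rescale(n_params, n_tokens, compute_multiplier, a=0.73, b=0.27):
    """Suggest new model/data sizes after multiplying the compute budget.

    `a` and `b` control how much of the extra compute goes to parameters
    vs. tokens (a + b = 1). a = 0.73, b = 0.27 is roughly the allocation
    suggested by OpenAI's 2020 scaling laws, which heavily favor model size.
    """
    return n_params * compute_multiplier ** a, n_tokens * compute_multiplier ** b

# Example: with 10x the compute, a 20B-parameter model trained on 300B tokens
# would grow to roughly 107B parameters, while the dataset would grow only ~1.9x.
print(rescale(20e9, 300e9, 10))
```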

In the case of very large language models like GPT-4, these trade-offs look more like training a 20-billion-parameter model on 40% of an archive of the Internet vs. training a 200-billion-parameter model on 4% of that archive, or any of the infinitely many other points along the same compute boundary.
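To see why those two options sit on the same boundary, note that training compute is roughly proportional to the product of parameter count and token count (a common back-of-the-envelope approximation is C ≈ 6ND). Writing T for the total number of tokens in the archive, both plans cost about the same:

```latex
% Rough iso-compute check using C \approx 6ND; T denotes the archive's total token count.
C_{\text{small}} \approx 6 \times (2 \times 10^{10}) \times (0.40\,T) = 4.8 \times 10^{10}\,T
\qquad
C_{\text{large}} \approx 6 \times (2 \times 10^{11}) \times (0.04\,T) = 4.8 \times 10^{10}\,T
```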

OpenAI’s paper found that it is almost always better to spend additional compute on a larger model than on a larger dataset. Subsequent researchers and institutions took this philosophy to heart and focused on engineering ever-larger models rather than on training smaller models on more data. The following table and graph show the resulting trend in the parameter counts of machine learning models; note the rise to half a trillion parameters while the amount of training data stayed roughly constant.

DeepMind updated these scaling laws in 2022. They found that for every increase in compute, you should increase the data size and the model size by approximately the same proportion. To test the new law, DeepMind trained a 70-billion-parameter model ("Chinchilla") using the same compute budget that had gone into its 280-billion-parameter Gopher. That is, the smaller Chinchilla was trained on 1.4 trillion tokens, while the larger Gopher had been trained on only 300 billion. And, as the new scaling laws predict, Chinchilla beats Gopher on pretty much every metric.[1]
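As a back-of-the-envelope check (using the same rough C ≈ 6ND approximation as above; the figures are the published headline numbers for the two models), the two training runs consumed a similar compute budget but spent it very differently:

```python
# Sketch: compare Gopher and Chinchilla using the rough approximation
# C ~ 6 * N * D for training FLOPs.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # ~5.0e23 FLOPs, ~1 token per parameter
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs, ~20 tokens per parameter

print(f"Gopher:     {gopher:.1e} FLOPs, {300e9 / 280e9:.1f} tokens/param")
print(f"Chinchilla: {chinchilla:.1e} FLOPs, {1.4e12 / 70e9:.1f} tokens/param")
# Similar compute, but Chinchilla spends far more of it on data -- and wins.
```

The roughly 20 tokens per parameter that fall out of this comparison are often quoted as a rule of thumb for compute-optimal training.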

Is scaling models enough to get AGI?


  1. 1a3orn (2022), New Scaling Laws for Large Language Models ↩︎