What are scaling laws?
"Scaling laws", in the context of training an AI model, describe how model performance depends on three key quantities: the model's size (number of parameters), the length of the training run, and the amount of data it was trained on. These three quantities determine how much compute is used in the training process, and scaling laws are used to allocate a fixed amount of compute between them so as to produce the most capable model.
Scaling laws are used to decide on trade-offs like: Should I pay Stack Overflow to train on its data? Or should I buy more GPUs? Or should I pay the higher electricity bills I would get by training my model longer? If my compute goes up 10×, how many parameters should I add to my model to make the best possible use of my GPUs?
In the case of frontier models like GPT-4, these trade-offs might look like training a 20-billion parameter model on 40% of an archive of the Internet, training a 200-billion parameter model on 4% of an archive of the Internet, or any strategy in between.
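To see why these two options can cost about the same, it helps to use a common rule of thumb, which is an assumption here and not something this article states: training compute is roughly 6 × (number of parameters) × (number of training tokens). Under that approximation, a model with 10× the parameters trained on one tenth of the data uses the same compute. The sketch below is only illustrative; the archive size of 10 trillion tokens and the names `ARCHIVE_TOKENS` and `training_flops` are hypothetical.

```python
import math

# Hypothetical size of the "archive of the Internet", in tokens (illustration only).
ARCHIVE_TOKENS = 10e12


def training_flops(params: float, tokens: float) -> float:
    """Rough training compute, assuming the C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


# Option A: 20-billion-parameter model trained on 40% of the archive.
flops_small = training_flops(20e9, 0.40 * ARCHIVE_TOKENS)
# Option B: 200-billion-parameter model trained on 4% of the archive.
flops_large = training_flops(200e9, 0.04 * ARCHIVE_TOKENS)

# Under the assumed rule of thumb, both options fit the same compute budget.
same_budget = math.isclose(flops_small, flops_large, rel_tol=1e-9)
print(f"small model: {flops_small:.2e} FLOPs")
print(f"large model: {flops_large:.2e} FLOPs")
print("same compute budget:", same_budget)
```

A scaling law is what tells you which point along this equal-compute line is expected to give the best model.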
In 2020, OpenAI proposed the first scaling laws, based on the finding that, at least for the largest models at the time, increasing model size was more effective than using more data. Subsequent research largely accepted this hypothesis: in the table below, model size grows rapidly from one model to the next, while the amount of training data stays relatively constant.
| model | year | size (#parameters) | data (#training tokens) | 
|---|---|---|---|
| LaMDA | 2021 | 137 billion | 168 billion | 
| GPT-3 | 2020 | 175 billion | 300 billion | 
| Jurassic | 2021 | 178 billion | 300 billion | 
| Gopher | 2021 | 280 billion | 300 billion | 
| MT-NLG 530B | 2022 | 530 billion | 270 billion | 
Caption: The number of parameters has been increasing faster in recent years. Note the logarithmic scale. Graph from Epoch.
DeepMind researchers proposed new scaling laws in 2022. They found that scaling up the size of the model and the size of the dataset in roughly equal proportion was a more effective use of compute than mainly increasing model size. To test the new scaling law, DeepMind trained a 70-billion-parameter model called "Chinchilla" using the same amount of compute as the 280-billion-parameter Gopher. Chinchilla’s smaller size allowed DeepMind to reallocate compute to train the model on a much larger dataset (1.4 trillion tokens compared to Gopher’s 300 billion). As the new scaling laws predicted, Chinchilla performed significantly better than Gopher.
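The Gopher and Chinchilla figures above can be checked with a rough calculation. The sketch below assumes the common approximation that training compute is about 6 × parameters × tokens, which is a rule of thumb and not something this article states. It also reads "roughly the same amount" as meaning that parameters and tokens each grow with the square root of compute, which is a simplification. The function names `training_flops` and `scale_equally` are hypothetical.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Rough training compute, assuming the C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def scale_equally(params: float, tokens: float, compute_multiplier: float):
    """Grow parameters and tokens by the same factor for a larger budget.

    Assumes compute is proportional to params * tokens, so each quantity
    grows by the square root of the compute multiplier.
    """
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Figures quoted in the text above.
gopher_flops = training_flops(280e9, 300e9)
chinchilla_flops = training_flops(70e9, 1.4e12)

tokens_per_param_gopher = 300e9 / 280e9
tokens_per_param_chinchilla = 1.4e12 / 70e9

print(f"Gopher:     {gopher_flops:.2e} FLOPs, "
      f"{tokens_per_param_gopher:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs, "
      f"{tokens_per_param_chinchilla:.1f} tokens per parameter")

# With 10x the compute, equal scaling grows each quantity by sqrt(10).
new_params, new_tokens = scale_equally(70e9, 1.4e12, 10.0)
print(f"10x compute: {new_params:.2e} parameters, {new_tokens:.2e} tokens")
```

Under these assumptions, the two training runs come out within about 17% of each other in estimated compute, while Chinchilla sees about 20 tokens per parameter compared with roughly 1 for Gopher. A 10× larger budget would then call for about 3.2× more parameters and 3.2× more tokens, not 10× more parameters.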