Reliable scaling in the over-trained regime and for downstream error prediction. (left) We fit a scaling law for model validation loss, parameterized by (i) a token multiplier, the ratio of training tokens to parameters, and (ii) the approximate compute in FLOPs used to train a model. We extrapolate, in both the number of parameters and the token multiplier, the validation performance of models requiring over 300x the training compute used to construct the scaling law. (right) We also fit a scaling law to predict average downstream top-1 error as a function of validation loss. We find that fitting scaling laws for downstream error benefits from using more expensive models than fitting for loss prediction does. We predict the average error over 17 downstream tasks for models trained with over 20x the compute.
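
The two fits in the figure can be read as instances of the functional forms below. The specific parameterization is an illustrative assumption for exposition rather than the paper's fitted constants, with C the training compute in FLOPs, M the token multiplier, L the validation loss, and PPL = e^L the corresponding perplexity.

```latex
% Illustrative forms only; E_M, A_M, \alpha_M, \epsilon, k, \gamma are assumed constants, not the fitted values.
\begin{align*}
  L(C, M) &\approx E_M + A_M\, C^{-\alpha_M}
    &&\text{(loss vs.\ compute at a fixed token multiplier $M$)} \\
  \mathrm{Err}(L) &\approx \epsilon - k\,\mathrm{PPL}^{-\gamma} \;=\; \epsilon - k\, e^{-\gamma L}
    &&\text{(average downstream top-1 error vs.\ validation loss)}
\end{align*}
```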

Abstract

Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., “Chinchilla optimal” regime); however, in practice, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32x over-trained) and a 6.9B parameter, 138B token run — each from experiments that take 300x less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20x less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
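As a rough illustration of the two-step prediction pipeline described in the abstract, the sketch below fits the assumed functional forms from the caption with scipy on placeholder small-scale measurements and extrapolates them. All data points, initial guesses, and constants are hypothetical, and the function names (loss_law, error_law) are ours; the actual fitting code and data live in the repository linked above.

```python
# Minimal sketch of the two-step prediction pipeline described above.
# All data points, initial guesses, and constants are placeholders; the
# functional forms are illustrative assumptions, not the paper's exact fits.
import numpy as np
from scipy.optimize import curve_fit

# Step 1: validation loss as a saturating power law in training compute C
# (FLOPs), fit at a fixed token multiplier M (training tokens / parameters).
def loss_law(C, E, A, alpha):
    return E + A * C ** (-alpha)

C_small = np.array([1e17, 3e17, 1e18, 3e18, 1e19])      # cheap training runs (FLOPs)
loss_small = np.array([4.10, 3.78, 3.49, 3.26, 3.05])   # their validation losses

(E, A, alpha), _ = curve_fit(
    loss_law, C_small, loss_small,
    p0=[2.0, 500.0, 0.2], bounds=([0, 0, 0], [10, np.inf, 1]),
)

# Extrapolate ~300x beyond the most expensive run used for fitting.
C_target = 3e21
loss_pred = loss_law(C_target, E, A, alpha)
print(f"predicted validation loss at {C_target:.0e} FLOPs: {loss_pred:.3f}")

# Step 2: average downstream top-1 error as a power law in perplexity
# (equivalently, an exponential decay in loss, since PPL = exp(loss)).
def error_law(ppl, eps, k, gamma):
    return eps - k * ppl ** (-gamma)

ppl_small = np.exp(loss_small)
err_small = np.array([0.73, 0.70, 0.68, 0.65, 0.63])    # placeholder avg. top-1 errors

(eps, k, gamma), _ = curve_fit(
    error_law, ppl_small, err_small,
    p0=[0.8, 2.0, 0.8], bounds=([0, 0, 0], [1, np.inf, np.inf]),
)

err_pred = error_law(np.exp(loss_pred), eps, k, gamma)
print(f"predicted average downstream top-1 error: {err_pred:.3f}")
```

Chaining the two fits in this way mirrors the structure of the figure: the loss law extrapolates from cheap runs to the target compute budget, and the error law maps the predicted loss (via its perplexity) to an average downstream top-1 error.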

Acknowledgements

SYG is supported by an NSF Graduate Research Fellowship, GS by the Onassis Foundation - Scholarship ID: F ZS 056-1/2022-2023, and MN by the Federal Ministry of Education and Research of Germany under grant no. 01IS22094B WEST-AI. We thank Stability AI and Toyota Research Institute (TRI) for access to compute resources. This research has been supported by NSF Grants AF 1901292, CNS 2148141, Tripods CCF 1934932, IFML CCF 2019844, and research gifts by Western Digital, Amazon, WNCG IAP, UT Austin Machine Learning Lab (MLL), Cisco, and the Stanly P. Finch Centennial Professorship in Engineering. We also thank Kushal Arora, Alper Canberk, Mia Chiquier, Sachit Menon, Chuer Pan, Purva Tendulkar, and Mandi Zhao for valuable feedback.

* equal advising