Researchers from Lawrence Livermore National Laboratory and MosaicML have published a survey of over 200 papers on algorithmically-efficient deep learning. The survey includes a taxonomy of methods to speed up training as well as a practitioner’s guide for mitigating training bottlenecks.
The team began with the observation that efficiency metrics for deep learning all have confounding factors, such as hardware, which make it difficult to compare results from different research papers. With this in mind, they developed a definition of algorithmic speedup: changing the training recipe to reduce the total training time while maintaining comparable model quality. Given this definition, they categorized speedup strategies using three axes: components, or where to make changes; actions, or what changes to make; and mechanisms, or when and how to make the changes. After categorizing the existing literature on speedup, the team produced their practitioner's guide for training speedup. According to the researchers:
Our central contributions are an organization of the algorithmic-efficiency literature…and a technical characterization of the practical issues affecting the reporting and achievement of speedups….With these contributions, we hope to improve the research and application of algorithmic efficiency, a critical piece of the compute-efficient deep learning needed to overcome the economic, environmental, and inclusion-related roadblocks faced by existing research.
Deep learning models have been achieving impressive results and are often capable of superhuman performance on many benchmarks. However, this has come at the cost of increasing the size of the models along with increased training time and cost, with models such as GPT-3 estimated to have cost nearly $2 million to train. Besides the financial cost, many people are concerned about the energy used to train and deploy the models. The simplest way to reduce these burdens is to reduce the time spent training a model. The bulk of the research paper is devoted to summarizing techniques to reduce training time while maintaining a high model quality—that is, its accuracy on a benchmark or test dataset.
Taxonomy of Deep Learning Speedup (source: https://arxiv.org/abs/2210.06640)
These techniques are categorized first by component: function (e.g., model parameters), data (e.g., the training dataset), and optimization (e.g., the training objective). Next, they are categorized by the actions that can be taken on those components. The actions are the "5 Rs," and each targets a reduction in the time per training iteration, the number of iterations, or both:
- Remove: remove elements of components to reduce iteration time
- Restrict: reduce the space of possible values to reduce iteration time
- Reorder: shift when elements are introduced to reduce both iteration time and number of iterations
- Replace: replace one element with another to reduce both iteration time and number of iterations
- Retrofit: add elements to components (the opposite of Remove) to reduce the number of iterations
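As a concrete illustration of one of these actions, the Remove action applied to the data component could mean training on a subset of the dataset to cut the number of iterations per epoch. The sketch below is my own illustration, not taken from the paper; the function name `remove_examples` is hypothetical.

```python
import random

def remove_examples(dataset, keep_fraction):
    """'Remove' applied to the data component: train on a random
    subset, reducing the number of iterations per epoch."""
    k = max(1, int(len(dataset) * keep_fraction))
    return random.sample(dataset, k)

full = list(range(1000))            # stand-in for a training set
subset = remove_examples(full, 0.5)
print(len(subset))                  # 500: half as many iterations per epoch
```

In practice the removed examples are often chosen by a scoring heuristic (e.g., easy or redundant examples) rather than uniformly at random, which is where most of the research effort in this action lies.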
The paper concludes with a set of practical guidelines for reducing training time. The authors identify a set of bottlenecks at the hardware level, such as GPU memory or storage capacity, along with tips for mitigating these bottlenecks. For example, to mitigate the GPU compute bottleneck, one method is to reduce the size of the tensors being operated on. They point out that data loading is a frequent bottleneck, but "by no means the only one."
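Shrinking tensor sizes pays off quadratically for spatial dimensions: halving the height and width of an image batch cuts the arithmetic of a convolution or matmul over it by roughly 4x. The snippet below is a minimal sketch of this idea (my own, not from the paper), using strided slicing as a crude downsampler.

```python
import numpy as np

def downsample_batch(batch, factor=2):
    """Shrink spatial dims by strided slicing; layers operating on the
    smaller tensor need roughly factor**2 fewer FLOPs and less memory."""
    return batch[:, ::factor, ::factor, :]

batch = np.zeros((32, 224, 224, 3), dtype=np.float32)  # NHWC image batch
small = downsample_batch(batch)
print(small.shape)                    # (32, 112, 112, 3)
print(batch.nbytes // small.nbytes)   # 4: memory per batch drops 4x
```

Real training pipelines would use proper image resizing (interpolation) rather than slicing, and often restore full resolution late in training to recover quality.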
Co-author Davis Blalock, a research scientist at MosaicML, posted a summary of the work on Twitter, where he noted that “just training for less time” is a very powerful strategy. He also recommended:
Watch out for dataloader bottlenecks. If you’re training an image classifier and you’re not sure if your training speed is limited by the dataloader, it is. This not only wastes compute, but also artificially penalizes fast models—e.g., your method might not seem slower than a baseline, but that’s just because your dataloader is hiding the slowdown.
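One simple way to act on this advice is to time the data-fetch and compute phases of the training loop separately and see which dominates. The sketch below is my own illustration of that check, not code from MosaicML; `fraction_waiting_on_data` and the simulated loader are hypothetical names.

```python
import time

def fraction_waiting_on_data(loader, train_step, num_batches=20):
    """Estimate the fraction of wall-clock time spent fetching batches.
    A value near 1.0 means the dataloader, not the model, limits speed."""
    wait = compute = 0.0
    it = iter(loader)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = next(it)              # time spent waiting on data
        t1 = time.perf_counter()
        train_step(batch)             # time spent in forward/backward
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait / (wait + compute)

def slow_loader():
    """Simulated dataloader with per-batch disk/decode latency."""
    while True:
        time.sleep(0.01)
        yield "batch"

frac = fraction_waiting_on_data(slow_loader(), lambda b: time.sleep(0.002))
print(f"{frac:.0%} of time waiting on data")  # dataloader-bound here
```

If the fraction is high, the usual mitigations apply before touching the model: more loader workers, cached or pre-decoded data, and faster storage.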
MosaicML recently participated in the MLPerf competition, where they “achieved leading NLP performance” in the Open division with a 2.7x speedup when training a BERT model compared to the baseline recipe. In early 2022, InfoQ covered the previous round of MLPerf results from December of 2021.