Deep Dive into DeepSpeed: Enhancing Large Model Training Efficiency
Discover how to enhance large model training with DeepSpeed. Learn techniques for efficient distributed training, compression, and more.
Welcome to the "Practical Application of AI Large Language Model Systems" Series
In the lesson "Building a 100M Parameter Transformer Model from Scratch," we built a Transformer from scratch and ran a full training pass. On a single A10-24G GPU with 500M of training text, the estimated training time was about one month. That figure shows how demanding training is on hardware: we used only 500M of data, yet real large models are trained on far more data with far more parameters.
As far as I know, training models such as GPT-3 and GLM-130B took around three months each. At that scale, our single-GPU approach is clearly not feasible.
How can we speed up training in practice?
The answer is distributed training. Popular tools include Microsoft's DeepSpeed framework and NVIDIA's NCCL communication library. This course focuses on Microsoft's DeepSpeed.
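To give a first feel for what DeepSpeed looks like in practice, here is a minimal sketch of its basic usage: wrap an existing PyTorch model with deepspeed.initialize, then let the returned engine drive the backward pass and optimizer step. The toy model, dummy data, and configuration values below are illustrative placeholders, not the course's actual training setup.

```python
import torch
import deepspeed

# A toy model standing in for the Transformer built earlier in the series.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
)

# Illustrative DeepSpeed config: ZeRO stage 2 plus fp16 mixed precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
}

# deepspeed.initialize wraps the model in a distributed engine that owns
# the optimizer, gradient accumulation, and mixed-precision handling.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    # Dummy batch; a real run would pull from a tokenized text dataloader.
    x = torch.randn(8, 512, device=model_engine.device, dtype=torch.half)
    loss = model_engine(x).float().pow(2).mean()
    model_engine.backward(loss)   # replaces loss.backward()
    model_engine.step()           # replaces optimizer.step() + zero_grad()
```

Such a script is started with the DeepSpeed launcher (for example `deepspeed train.py`), which spawns one process per GPU so the same code scales from one card to many. Later lessons cover the configuration options in detail.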