Deep Dive into DeepSpeed: Enhancing Large Model Training Efficiency
Discover how to enhance large model training with DeepSpeed. Learn techniques for efficient distributed training, compression, and more.
Welcome to the "Practical Application of AI Large Language Model Systems" Series
In the course "Building a 100M Parameter Transformer Model from Scratch," we built a Transformer from scratch and ran a full training pass. Using a single A10 (24 GB) GPU and roughly 500M of training text, the estimated training time was about one month. That illustrates how demanding training is on hardware: we used only 500M of data, yet training a real large model typically involves far more data and far larger parameter counts.
As far as I know, training large models such as GPT-3 and GLM-130B takes around three months. Clearly, our current single-GPU approach does not scale to that level.
How can we speed up training in practice?
The answer is distributed training. Popular tooling includes Microsoft's DeepSpeed and NVIDIA's NCCL (a collective-communication library that underpins most multi-GPU setups). This course focuses on Microsoft's DeepSpeed.
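To make that concrete, here is a minimal sketch of what a DeepSpeed training script looks like. The toy model, random token data, and inline config are all hypothetical stand-ins for the Transformer and dataset from the earlier course; the config values (fp16, ZeRO stage 2, batch size, learning rate) are illustrative assumptions, not recommendations.

```python
import torch
import deepspeed
from torch.utils.data import DataLoader, TensorDataset

# Toy language model standing in for the Transformer built earlier in the series.
class TinyLM(torch.nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        self.proj = torch.nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.proj(self.embed(x))

# Random token sequences just to make the sketch self-contained.
tokens = torch.randint(0, 1000, (256, 32))
loader = DataLoader(TensorDataset(tokens), batch_size=8)

# Hypothetical DeepSpeed config: mixed precision plus ZeRO stage 2 sharding.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = TinyLM()

# deepspeed.initialize wraps the model in an engine that manages data
# parallelism, mixed precision, and ZeRO sharding according to the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for (batch,) in loader:
    batch = batch.to(model_engine.device)
    logits = model_engine(batch[:, :-1])          # next-token prediction
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch[:, 1:].reshape(-1),
    )
    model_engine.backward(loss)  # engine-managed backward (handles loss scaling)
    model_engine.step()          # engine-managed optimizer step
```

On a multi-GPU machine, the same script would typically be started with the deepspeed launcher (for example, `deepspeed train.py`), which spawns one process per GPU and sets up the distributed communication for you.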