DeepSeek-V3 Goes Viral: 671B MoE at $5.58M
DeepSeek-V3 is a 671B-parameter mixture-of-experts model that matches or surpasses state-of-the-art systems while setting a new bar for open-source training efficiency.
Today, DeepSeek's new model has taken the AI world by storm.
Open X and your feed is flooded with discussions of DeepSeek-V3. One of the hottest topics is its colossal 671B parameters paired with a surprisingly efficient training process: pre-training required just 2.664 million H800 GPU hours, and even with context extension and post-training, the total came to only 2.788 million H800 GPU hours.
By comparison, the Llama 3 series was trained with a computational budget of 39.3 million H100 GPU hours, enough to train DeepSeek-V3 roughly 14 times over.
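The $5.58M figure in the headline follows directly from these GPU-hour counts. Here is a minimal back-of-the-envelope check, assuming the technical report's stated rental price of $2 per H800 GPU hour (the script itself is illustrative, not from the report):

```python
# Back-of-the-envelope check of the headline figures.
# Assumption: $2 per H800 GPU hour, the rental price stated in the DeepSeek-V3 report.
total_gpu_hours = 2.788e6      # H800 GPU hours, including context extension and post-training
price_per_gpu_hour = 2.0       # USD, assumed rental price

total_cost_usd = total_gpu_hours * price_per_gpu_hour
print(f"Estimated training cost: ${total_cost_usd / 1e6:.2f}M")   # -> ~$5.58M

llama3_gpu_hours = 39.3e6      # reported H100 GPU hours for the Llama 3 series
print(f"Llama 3 budget vs. DeepSeek-V3: {llama3_gpu_hours / total_gpu_hours:.1f}x")  # -> ~14.1x
```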
Despite this much smaller compute budget, DeepSeek-V3 delivers performance on par with, and in some cases surpassing, other state-of-the-art models.
According to the newly released DeepSeek-V3 technical report, its base model excels in tasks spanning English, code, mathematics, Chinese, and multilingual scenarios. On benchmarks like AGIEval, CMath, and MMMLU-non-English, it even significantly outperforms other open-source models.