Meta Releases Llama 3.2 1B/3B Quantized Models: Accelerated Edge Inference, Reduced Memory Usage
Meta launches quantized Llama 3.2 1B and 3B models with 2-4x faster inference and a reduced memory footprint, optimized for mobile devices.
On October 24, 2024, Meta announced the release of its first quantized versions of the lightweight Llama 3.2 models. These models pair competitive performance with a compact footprint, enabling them to run on many popular mobile devices.
As AI technology develops rapidly, demand for on-device inference keeps growing, and Meta's release is aimed squarely at this pain point.
Overview
Meta's two quantized Llama 3.2 models bring two major improvements:
Speed Boost: The new models run inference 2-4x faster, noticeably improving the interactive experience for end users.
Reduced Memory Usage: Model size is reduced by 56% on average and memory usage by 41% on average, allowing the models to run on memory-constrained devices such as mobile phones; the sketch below shows roughly where savings on this scale come from.
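The arithmetic behind these numbers is straightforward. Here is a minimal back-of-envelope sketch, assuming a ~3B-parameter model and pure 4-bit weights; Meta's actual scheme keeps some tensors (such as embeddings and the output layer) at higher precision, which is one reason the reported average reduction is 56% rather than the naive weight-only figure:

```python
# Back-of-envelope weight-storage estimate (illustrative numbers only:
# real deployments keep some tensors at higher precision and also need
# memory for activations and the KV cache, so actual savings differ).

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

n = 3e9  # roughly the parameter count of Llama 3.2 3B

bf16 = weight_gb(n, 16)  # 16-bit baseline weights
int4 = weight_gb(n, 4)   # 4-bit quantized weights

print(f"BF16 weights:  {bf16:.1f} GB")                # ~6.0 GB
print(f"4-bit weights: {int4:.1f} GB")                # ~1.5 GB
print(f"weight-only reduction: {1 - int4/bf16:.0%}")  # 75%
```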
Quantization Techniques
To achieve these significant performance improvements, Meta introduced two key quantization techniques: Quantization-Aware Training (QAT) with LoRA adaptors, and SpinQuant, a post-training approach. Both played a crucial role in optimizing model size and inference performance.
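Before diving in, a quick intuition for each helps. QAT simulates low-bit arithmetic during training so the model learns weights that survive quantization; its core mechanism is a "fake-quantize" step whose rounding is bypassed in the backward pass (the straight-through estimator). The following is a minimal sketch of that mechanism in PyTorch, not Meta's exact recipe (which combines QAT with LoRA adaptors and a 4-bit groupwise scheme); the per-tensor scale and bit width here are simplifying assumptions:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Round a tensor to a low-bit grid in the forward pass, but let
    gradients flow through unchanged (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, bits=4):
        qmax = 2 ** (bits - 1) - 1        # 7 for signed 4-bit
        scale = w.abs().max() / qmax      # simple per-tensor scale
        q = torch.clamp((w / scale).round(), -qmax - 1, qmax)
        return q * scale                  # dequantized "fake" weights

    @staticmethod
    def backward(ctx, grad_out):
        # Pretend rounding was the identity: pass the gradient straight
        # through to the full-precision weights (no gradient for `bits`).
        return grad_out, None

w = torch.randn(16, 16, requires_grad=True)
w_q = FakeQuant.apply(w)   # use w_q in the layer's forward pass
loss = w_q.sum()
loss.backward()            # gradients still reach the full-precision w
```

SpinQuant, by contrast, is applied after training: it inserts orthogonal rotations that leave a layer's output mathematically unchanged while spreading activation outliers across channels, which makes the tensors friendlier to low-bit quantization. SpinQuant learns these rotations; the snippet below uses a random orthogonal matrix purely to demonstrate the invariance:

```python
import torch

d = 8
x = torch.randn(4, d)                      # activations
W = torch.randn(d, d)                      # layer weights

R, _ = torch.linalg.qr(torch.randn(d, d))  # random orthogonal matrix

y_ref = x @ W                              # original layer output
y_rot = (x @ R) @ (R.T @ W)                # rotated activations/weights

# Orthogonality (R @ R.T == I) guarantees the outputs match, so the
# rotation can be folded in "for free" before quantization.
print(torch.allclose(y_ref, y_rot, atol=1e-4))  # True
```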
Let's dive into these quantization methods and their application scenarios.