Surprising Truths About GPU Memory for Large Models: How Much Do You Really Need?
Learn how to accurately estimate the GPU memory needed to train large models like Llama-6B, and how the requirement changes with fp32, fp16, and int8 precision.
This article explains how to estimate the required GPU memory from the model's parameter count, the training precision and optimizer, and the batch size.
Suppose you want to fully train a Llama-6B model. How much GPU memory do you need?
We'll also explore how memory requirements change under fp32, fp16, and int8 modes.
Memory Composition for Large Models
GPU memory usage for large models has three components: the model itself, the CUDA kernels, and the batch (the intermediate variables, which grow with batch size).
The Model Itself
The model's memory needs can be divided into three areas: model parameters, gradients, and optimizer parameters.
Model Parameters
Memory required = number of parameters * memory per parameter.
Consider the impact of precision on memory (a short sketch after this list turns these byte sizes into numbers):
fp32 precision: 32 bits per parameter, 4 bytes.
fp16 precision: 16 bits per parameter, 2 bytes.
int8 precision: 8 bits per parameter, 1 byte.
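As a quick sanity check, here is a minimal Python sketch of this rule; the byte sizes are the ones listed above, and the 6B parameter count matches the running example:

# Bytes per parameter for each precision listed above
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def parameter_memory_gb(num_params, precision):
    # number of parameters * memory per parameter, reported in GB (1 GB = 1e9 bytes)
    return num_params * BYTES_PER_PARAM[precision] / 1e9

print(parameter_memory_gb(6e9, "fp32"))  # 24.0 GB
print(parameter_memory_gb(6e9, "fp16"))  # 12.0 GB
print(parameter_memory_gb(6e9, "int8"))  # 6.0 GB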
Gradients
Memory required = number of parameters * memory per gradient.
One gradient is stored per trainable parameter, typically at the same precision as the parameters, so this matches the parameter memory.
Optimizer Parameters
The amount of memory depends on the optimizer. AdamW needs twice the parameter memory, because it stores two states (the first and second moments) for every parameter.
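Putting the three pieces together, the model-side memory can be sketched as below. This keeps the article's simplifying assumption that gradients and AdamW states use the same number of bytes per value as the parameters (in practice they are often kept in higher precision):

def model_state_memory_gb(num_params, bytes_per_value):
    params = num_params * bytes_per_value            # model parameters
    grads = num_params * bytes_per_value             # one gradient per parameter
    adamw_states = 2 * num_params * bytes_per_value  # first and second moments
    return (params + grads + adamw_states) / 1e9

print(model_state_memory_gb(6e9, 1))  # 24.0 GB for a 6B model at int8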
CUDA Kernel
Loading the CUDA kernels (the CUDA context) takes around 1.3GB of GPU memory, as shown below:
import torch
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()
>>> GPU memory occupied: 1343 MB
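print_gpu_utilization is not a built-in; a minimal version (similar to the helper used in the Hugging Face performance docs) can be written with the pynvml package, assuming a single GPU at index 0:

# pip install nvidia-ml-py3
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    # Report the memory currently used on GPU 0, as seen by the NVIDIA driver
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB")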
Batch
First, estimate the memory used by the intermediate variables (activations) of a single instance, then scale by the batch size:
Memory = number of intermediate values per instance * memory per value * batch size.
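A minimal sketch of this rule, with the per-instance count of intermediate values left as an input (how to estimate that count for Llama is shown in the next section):

def activation_memory_gb(values_per_instance, bytes_per_value, batch_size):
    # intermediate values per instance * memory per value * batch size
    return values_per_instance * bytes_per_value * batch_size / 1e9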
GPU Memory Calculation Example
Let's calculate the memory needed for a Llama-6B model with a batch size of 50 and int8 precision.
The Model Itself
Model parameters: LLaMA-6B with int8 requires 6B * 1 byte = 6GB.
Gradients: Also 6GB.
Optimizer parameters: AdamW for int8 LLaMA-6B requires 6B * 1 byte * 2 = 12GB.
CUDA kernel: 1.3GB.
Total for the model: 6GB + 6GB + 12GB + 1.3GB = 25.3GB (reproduced in the snippet below).
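Continuing from the model_state_memory_gb sketch above and adding the measured CUDA-context overhead:

model_side_gb = model_state_memory_gb(6e9, 1) + 1.3  # params + grads + AdamW states + CUDA context
print(model_side_gb)  # 25.3 GB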
Batch
LLaMA architecture:
hidden_size = 4096
intermediate_size = 11008
num_hidden_layers = 32
context_length = 2048
For each instance:
Memory = (hidden_size + intermediate_size) * context_length * num_hidden_layers * 1 byte = (4096 + 11008) * 2048 * 32 bytes ≈ 990MB (about 0.99GB).
For a batch size of 50:
Memory ≈ 0.99GB * 50 ≈ 49.5GB (reproduced in the sketch below).
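Continuing from the activation_memory_gb sketch in the batch section, plugging in the Llama-6B architecture numbers:

hidden_size, intermediate_size = 4096, 11008
num_hidden_layers, context_length = 32, 2048

values_per_instance = (hidden_size + intermediate_size) * context_length * num_hidden_layers
print(activation_memory_gb(values_per_instance, 1, 1))   # ≈ 0.99 GB per instance
print(activation_memory_gb(values_per_instance, 1, 50))  # ≈ 49.5 GB for a batch of 50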
Total memory required for Llama-6B with int8 precision and a batch size of 50:
25.3GB + 49.5GB ≈ 74.8GB.
This just fits within an A100 GPU with 80GB of memory, allowing full-parameter fine-tuning of Llama-6B with a batch size of 50 at int8 precision.
You can apply the same calculation to other scenarios by changing the precision, model size, intermediate-variable count, and batch size. The sketch below wraps the whole estimate into a single function you can re-run for other configurations.
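As a closing sketch (same simplifying assumptions as above, with the 1.3GB CUDA context treated as a constant), here is a self-contained estimator:

def estimate_total_gb(num_params, bytes_per_value, hidden_size, intermediate_size,
                      num_layers, context_length, batch_size, cuda_context_gb=1.3):
    model_states = 4 * num_params * bytes_per_value  # params + grads + 2 AdamW moments
    activations = ((hidden_size + intermediate_size) * context_length * num_layers
                   * bytes_per_value * batch_size)
    return (model_states + activations) / 1e9 + cuda_context_gb

# Llama-6B, int8, batch size 50
print(estimate_total_gb(6e9, 1, 4096, 11008, 32, 2048, 50))  # ≈ 74.8 GB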