No Out Of Memory!
Nothing kills the momentum of training a large model quite like CUDA out of memory. You've got your dataset ready, your hyperparameters tuned, and then—boom—your job crashes before the first gradient update.
This post is a practical guide to understanding and estimating GPU memory usage, specifically for post-training scenarios (SFT, RLHF, GRPO) and inference. By the end, you'll be able to predict whether your workload fits on your hardware before you even start.
TL;DR: Quick Estimation Formulas
Let's cut to the chase. Here are the formulas you can use right away. All assume bf16/fp16 precision, with P denoting parameter count in billions.
Inference:
Memory (GB) ≈ 2P + KV_Cache
KV_Cache (GB) ≈ 4 × num_layers × hidden_dim × seq_len × batch_size / 1024³
SFT (Full Fine-tuning with Adam):
Memory (GB) ≈ 16P + Activations
SFT (LoRA Fine-tuning):
Memory (GB) ≈ 2P + Activations + negligible_LoRA_overhead
GRPO (LoRA, with weight sharing between Actor and Reference):
Memory (GB) ≈ max(Generation_Peak, Training_Peak)
Generation_Peak ≈ 2P + KV_Cache
Training_Peak ≈ 2P + Activations
Quick Reference (Single H100 80GB, GRPO-LoRA, seq_len=2048):
| Model Size | Feasible? | Approx. Batch Size |
|------------|-----------|--------------------|
| 7B | ✅ Yes | 8–16 |
| 14B | ✅ Yes | 4–8 |
| 32B | ❌ No | Requires 2+ GPUs |
| 72B | ❌ No | Requires 8+ GPUs |
Understanding Memory Components
Before diving into estimation, let's understand what actually consumes GPU memory. When you run nvidia-smi, you see the total memory usage, but this includes both system-level overhead (CUDA context, cuDNN, etc.) and framework-level consumption (PyTorch tensors, etc.). We'll focus on the framework level—that's what we can actually optimize.
GPU memory during training and inference can be broken down into five main components:
1. Model Parameters
This is the most straightforward component. Each parameter occupies memory based on its data type:
| Precision | Bytes per Parameter | 7B Model | 72B Model |
|-----------|---------------------|----------|-----------|
| fp32 | 4 | 28 GB | 288 GB |
| bf16/fp16 | 2 | 14 GB | 144 GB |
| int8 | 1 | 7 GB | 72 GB |
| int4 | 0.5 | 3.5 GB | 36 GB |
The formula is simple:
Parameter Memory (GB) = bytes_per_param × num_params / 1024³
For bf16, this simplifies to approximately 2P GB where P is in billions.
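As a sanity check, the table above is just one multiplication. A quick sketch (P is treated as a round number of billions, and the result is in decimal GB, which is why 7B in bf16 comes out to 14 GB):

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def param_memory_gb(params_in_billions, dtype="bf16"):
    """Weights-only memory in GB: bytes per parameter times billions of parameters."""
    return BYTES_PER_PARAM[dtype] * params_in_billions

print(param_memory_gb(7))            # 14 GB  (bf16)
print(param_memory_gb(72, "fp32"))   # 288 GB
print(param_memory_gb(72, "int4"))   # 36 GB
```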
2. Optimizer States
This is often the silent killer for training memory. Different optimizers have different memory footprints:
Adam/AdamW (most common for LLM training):
- First moment (momentum): 4 bytes per parameter (fp32)
- Second moment (variance): 4 bytes per parameter (fp32)
- Master weights copy: 4 bytes per parameter (fp32, when using mixed precision)
Total: 12 bytes per parameter, regardless of model precision.
For a 7B model, that's 84 GB just for optimizer states—more than the model itself in bf16!
SGD with Momentum: 4 bytes per parameter (just momentum)
AdaFactor: Can be more memory-efficient with factored second moments
The key insight: even if your model uses bf16, the optimizer typically maintains fp32 states for numerical stability. This is why full fine-tuning is so memory-intensive.
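A quick sketch of just the optimizer-state term, which also shows why LoRA sidesteps the problem (the 20M LoRA-parameter figure below is an illustrative assumption, not a measurement):

```python
ADAM_BYTES_PER_PARAM = 4 + 4 + 4   # fp32 momentum + fp32 variance + fp32 master weights

def adam_state_gb(trainable_params_in_billions):
    """Adam/AdamW optimizer-state memory (GB) under mixed-precision training."""
    return ADAM_BYTES_PER_PARAM * trainable_params_in_billions

print(adam_state_gb(7))      # 84 GB when every parameter of a 7B model is trainable
print(adam_state_gb(0.02))   # ~0.24 GB for ~20M trainable LoRA parameters
```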
3. Gradients
During backpropagation, gradients are computed for each trainable parameter. Gradient memory matches the training precision:
Gradient Memory (GB) = bytes_per_param × num_trainable_params / 1024³
For bf16 training of a 7B model: 14 GB.
With LoRA, you only store gradients for the adapter parameters (typically 0.1%–1% of total), making this negligible.
4. Activations
Activations are the intermediate outputs from each layer during the forward pass. They must be stored for the backward pass to compute gradients. This is often where OOM surprises come from, because activation memory scales with:
- Batch size: Linear scaling
- Sequence length: Linear scaling
- Model depth: Linear scaling
- Hidden dimension: Linear scaling
The Megatron-LM paper provides a detailed formula:
Activation Memory (bytes) ≈ s × b × h × (34 + 5 × a × s / h) × L
Where:
- s = sequence length
- b = micro-batch size
- h = hidden dimension
- a = number of attention heads
- L = number of layers

The constants in the formula already assume 16-bit activations, so no separate precision factor is needed.
For practical estimation with gradient checkpointing enabled (which most frameworks use by default), activations are significantly reduced because intermediate values are recomputed during the backward pass:
Activation Memory (with checkpointing) ≈ 2 × L × h × s × b × precision_bytes / 1024³
For a 7B model (L=32, h=4096) with batch_size=8 and seq_len=2048 in bf16:
Activations ≈ 2 × 32 × 4096 × 2048 × 8 × 2 / 1024³ ≈ 8 GB
Double the batch size, and you double the activation memory.
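A small helper makes the scaling explicit. This uses the simplified checkpointing approximation above, not the full Megatron formula, so treat the outputs as rough estimates:

```python
def activation_memory_gb(num_layers, hidden_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate activation memory (GB) with gradient checkpointing enabled."""
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_elem / 1024**3

# 7B-class model: 32 layers, hidden size 4096
print(activation_memory_gb(32, 4096, 2048, 8))    # ≈ 8 GB
print(activation_memory_gb(32, 4096, 2048, 16))   # ≈ 16 GB (double the batch, double the memory)
print(activation_memory_gb(32, 4096, 8192, 8))    # ≈ 32 GB (4× the sequence length, 4× the memory)
```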
5. KV Cache (Inference and Generation)
During autoregressive generation, each new token depends on all previous tokens. To avoid recomputing attention for the entire sequence at each step, we cache the Key and Value projections—this is the KV cache.
For each layer, we store:
- Keys: batch_size × seq_len × hidden_dim (2 bytes in bf16)
- Values: batch_size × seq_len × hidden_dim (2 bytes in bf16)
Total KV cache:
KV Cache (GB) = 2 × 2 × num_layers × hidden_dim × seq_len × batch_size / 1024³
= 4 × L × h × s × b / 1024³
For a 7B model (L=32, h=4096) with batch_size=8 and seq_len=2048:
KV Cache ≈ 4 × 32 × 4096 × 2048 × 8 / 1024³ ≈ 8 GB
For a 72B model (L=80, h=8192) with the same settings:
KV Cache ≈ 4 × 80 × 8192 × 2048 × 8 / 1024³ ≈ 40 GB
This is why long-context generation is so memory-hungry—the cache grows linearly with sequence length.
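The same kind of helper works for the KV cache. Note that this assumes standard multi-head attention, where keys and values span the full hidden dimension; models using grouped-query attention cache proportionally less:

```python
def kv_cache_gb(num_layers, hidden_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size (GB): keys plus values for every layer, bf16 by default."""
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_elem / 1024**3

print(kv_cache_gb(32, 4096, 2048, 8))   # ≈ 8 GB   (7B-class model)
print(kv_cache_gb(80, 8192, 2048, 8))   # ≈ 40 GB  (72B-class model)
print(kv_cache_gb(80, 8192, 32768, 8))  # ≈ 640 GB (same 72B model at 32k context)
```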
Inference Memory Estimation
Inference is the simpler case. You need:
- Model parameters: 2P GB (bf16)
- KV cache: grows during generation
Inference Memory (GB) = 2P + KV_Cache
Practical Example: Qwen2.5-7B Inference
Model parameters: 14 GB
For batch_size=16, max_seq_len=4096:
KV Cache = 4 × 32 × 4096 × 4096 × 16 / 1024³ ≈ 32 GB
Total ≈ 14 + 32 = 46 GB
Fits comfortably on an 80GB H100 with room for system overhead.
Optimizing Inference Memory
Several techniques can reduce inference memory:
- Quantization (int8/int4): Reduces parameter memory by 2–4×
- PagedAttention (vLLM): More efficient KV cache management
- Sliding window attention: Limits KV cache size for very long sequences
- Multi-Query/Grouped-Query Attention: Reduces KV cache by sharing keys/values across heads
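Of these, a dedicated inference engine is usually the first thing to try. A minimal vLLM sketch (the model name and limits are placeholders; `gpu_memory_utilization` caps how much GPU memory vLLM claims for weights plus KV cache):

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; swap in the model you actually serve.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="bfloat16",
    max_model_len=4096,           # bounds the per-sequence KV cache
    gpu_memory_utilization=0.85,  # leave headroom for CUDA context and other overhead
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```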
Post-Training Memory Estimation
Post-training includes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Let's examine each.
Supervised Fine-Tuning (SFT)
SFT is conceptually straightforward: you train the model on curated examples using cross-entropy loss. Memory requirements depend heavily on whether you're doing full fine-tuning or using parameter-efficient methods.
Full Fine-Tuning
Every component contributes:
SFT Full Memory = Parameters + Gradients + Optimizer_States + Activations
= 2P + 2P + 12P + Activations
= 16P + Activations
For a 7B model:
Static memory: 16 × 7 = 112 GB
+ Activations (varies with batch/seq)
This exceeds the 80GB of an H100. Full fine-tuning of 7B+ models requires either:
- Multiple GPUs with tensor/pipeline parallelism
- ZeRO-style optimizer sharding
- CPU offloading (slow)
LoRA Fine-Tuning
LoRA (Low-Rank Adaptation) freezes the base model and only trains small adapter matrices. Typically, LoRA parameters are 0.1%–1% of total parameters.
SFT LoRA Memory = Parameters + LoRA_Gradients + LoRA_Optimizer_States + Activations
≈ 2P + negligible + negligible + Activations
≈ 2P + Activations
For a 7B model with gradient checkpointing:
Parameters: 14 GB
Activations: ~8 GB (batch=8, seq=2048)
Total: ~22 GB
Easily fits on a single H100 with room for larger batches.
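For concreteness, this is roughly what the LoRA setup looks like with Hugging Face peft. It's a sketch: the checkpoint name, rank, and target modules are illustrative assumptions, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. "trainable params: ~10M || all params: ~7.6B || trainable%: ~0.1"
```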
SFT Memory Comparison
| Method | 7B Model | 72B Model |
|--------|----------|-----------|
| Full FT (static) | 112 GB | 1.15 TB |
| LoRA (static) | 14 GB | 144 GB |
| LoRA + Activations | ~22 GB | ~160 GB |
This is why LoRA has become the de facto standard for post-training.
Reinforcement Learning: GRPO
Reinforcement Learning from Human Feedback (RLHF) adds complexity because it involves multiple models working together. Traditional PPO requires four models:
- Actor: The policy being trained
- Critic: Estimates value functions (trainable)
- Reference: Frozen copy of the original policy
- Reward: Scores generated responses
GRPO (Group Relative Policy Optimization) simplifies this by eliminating the critic. Instead of learning a value function, GRPO uses group-relative advantages computed directly from reward scores within a batch. This makes it more memory-efficient and increasingly popular for LLM post-training.
GRPO Training Loop
A typical GRPO step involves:
- Generation (Rollout): Actor generates responses to prompts
- Reward Scoring: Reward model (or rule-based function) scores responses
- Advantage Computation: Relative advantages computed within groups
- Policy Update: Actor is updated to maximize advantage-weighted log probabilities
- KL Penalty: Reference model computes KL divergence to prevent policy drift
Memory Analysis for GRPO
The key insight is that generation and training don't happen simultaneously. This means we can analyze peak memory for each phase separately:
Generation Phase:
Generation Memory = Actor_Params + KV_Cache
= 2P + KV_Cache
If using a separate reward model:
Generation Memory = Actor_Params + Reward_Params + KV_Caches
Training Phase:
Training Memory = Actor_Params + Gradients + Optimizer_States + Activations + Reference_Params
With LoRA:
Training Memory ≈ Actor_Params + Activations + Reference_Params
= 2P + Activations + 2P (if separate reference)
= 2P + Activations (if sharing weights)
Critical Optimization: When using LoRA, the Actor and Reference can share the same base weights. The Actor applies LoRA adapters during the forward pass; the Reference does not. This eliminates the need for a separate Reference model copy.
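With peft, this sharing is one context manager away. A sketch, assuming `model` is a LoRA-wrapped PeftModel (like the one in the SFT section) and `input_ids` is a batch of generated token ids:

```python
import torch

def policy_and_reference_logits(model, input_ids):
    """Compute Actor and Reference logits from one shared set of base weights.

    Assumes `model` is a peft PeftModel with LoRA adapters attached; disabling
    the adapters recovers the frozen reference policy without a second copy.
    """
    actor_logits = model(input_ids=input_ids).logits          # base weights + LoRA adapters

    with torch.no_grad(), model.disable_adapter():            # base weights only
        reference_logits = model(input_ids=input_ids).logits

    return actor_logits, reference_logits
```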
GRPO with Weight Sharing (verl's approach)
verl implements a 3D-HybridEngine that takes this further. The same model weights are used for both generation and training, with the framework dynamically switching between inference and training modes:
Peak Memory = max(Generation_Peak, Training_Peak)
Generation_Peak = 2P + KV_Cache
Training_Peak = 2P + Activations + LoRA_overhead
Since these phases don't overlap, we take the maximum rather than summing.
GRPO Memory Example: Qwen2.5-7B
Configuration: LoRA, batch_size=16, seq_len=2048, gradient checkpointing enabled
Generation Phase:
Model params: 14 GB
KV Cache: 4 × 32 × 4096 × 2048 × 16 / 1024³ ≈ 16 GB
Generation Peak: ~30 GB
Training Phase:
Model params: 14 GB
Activations: 2 × 32 × 4096 × 2048 × 16 × 2 / 1024³ ≈ 16 GB
Training Peak: ~30 GB
Total Peak: ~30 GB (plus system overhead)
An H100 80GB handles this comfortably, leaving room for larger batches.
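A small helper reproduces these numbers under the same assumptions (LoRA with Actor/Reference weight sharing, gradient checkpointing, bf16). The 32B configuration below is an assumed 64-layer, 5120-hidden setup for illustration:

```python
def grpo_lora_peak_gb(params_b, num_layers, hidden_dim, seq_len, batch_size):
    """Peak memory (GB) for GRPO-LoRA when Actor and Reference share base weights."""
    weights = 2 * params_b
    kv_cache = 4 * num_layers * hidden_dim * seq_len * batch_size / 1024**3
    activations = 2 * num_layers * hidden_dim * seq_len * batch_size * 2 / 1024**3
    generation_peak = weights + kv_cache        # rollout phase
    training_peak = weights + activations       # policy-update phase
    return max(generation_peak, training_peak)  # the phases never overlap

print(grpo_lora_peak_gb(7, 32, 4096, 2048, 16))    # ≈ 30 GB: fits on one H100
print(grpo_lora_peak_gb(32, 64, 5120, 2048, 16))   # ≈ 104 GB: needs tensor parallelism
```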
Empirical Data from verl
From verl's documentation:
| Model | GPUs | Config | Max Batch |
|-------|------|--------|-----------|
| Qwen2.5-7B | 1×H100 | GRPO-LoRA | 16 |
| Qwen2.5-32B | 4×H100 | GRPO-LoRA | 180 |
| Qwen2.5-72B | 8×H100 | Full Fine-tuning | 176 |
Note that 32B and 72B require multiple GPUs with tensor parallelism.
When You Need a Reward Model
If using a learned reward model instead of rule-based rewards, you need additional memory:
Total Generation = Actor_Params + Reward_Params + KV_Caches
For same-sized Actor and Reward models:
Generation Peak ≈ 4P + 2 × KV_Cache
This roughly doubles the generation phase memory. Strategies to handle this:
- Use a smaller reward model
- Quantize the reward model (int8)
- Run reward inference on separate GPUs
- Use rule-based rewards when possible (e.g., code execution, math verification)
Parallelism Strategies
When a single GPU isn't enough, parallelism strategies can distribute memory across multiple devices.
Tensor Parallelism (TP)
Splits individual operations (matrix multiplications) across GPUs. Each GPU holds a fraction of each layer.
Per-GPU Memory = Parameters / TP + Activations / TP + ...
Best for: Large models that don't fit on a single GPU. Requires high-bandwidth interconnect (NVLink).
Pipeline Parallelism (PP)
Distributes layers across GPUs. GPU 0 has layers 0–N, GPU 1 has layers N+1–2N, etc.
Per-GPU Memory = Parameters / PP + ...
Best for: Very deep models. Can work with lower bandwidth but introduces pipeline bubbles.
Data Parallelism (DP)
Common misconception: DP does not reduce per-GPU memory. Each GPU holds the full model and processes different data batches. DP increases throughput, not memory efficiency.
ZeRO (DeepSpeed)
Partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across GPUs.
- ZeRO-1: Reduces optimizer state memory by N (number of GPUs)
- ZeRO-2: Also shards gradients
- ZeRO-3: Also shards parameters
Trade-off: ZeRO-3 has significant communication overhead, which can hurt throughput. For post-training, TP/PP often provides better efficiency.
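To get a feel for the trade-offs, here is a rough per-GPU estimate of static training memory under each ZeRO stage (full fine-tuning, bf16 weights and gradients with fp32 Adam states; activations and communication buffers are excluded):

```python
def zero_static_gb(params_b, num_gpus, stage=0):
    """Approximate per-GPU static memory (GB) for full fine-tuning under ZeRO."""
    weights, grads, optim = 2 * params_b, 2 * params_b, 12 * params_b
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= num_gpus   # ZeRO-3: also shard parameters
    return weights + grads + optim

for stage in range(4):
    print(f"7B, 8 GPUs, ZeRO-{stage}: {zero_static_gb(7, 8, stage):.1f} GB/GPU")
# ZeRO-0: 112.0, ZeRO-1: 38.5, ZeRO-2: 26.2, ZeRO-3: 14.0
```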
Practical Tips for Avoiding OOM
Ranked by impact:
1. Use LoRA (Essential for Post-Training)
Reduces trainable parameters by 99%+, eliminating optimizer state overhead. Rank 8–64 works well for most cases.
2. Enable Gradient Checkpointing (Default in Most Frameworks)
Trades ~30% extra compute for 5–10× activation memory reduction. Almost always worth it.
3. Reduce Batch Size / Sequence Length
Both activations and KV cache scale linearly with batch size and sequence length. If you hit OOM:
- First: reduce micro-batch size
- Use gradient accumulation to maintain effective batch size
- Consider shorter sequences if task allows
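A minimal training-loop sketch that combines tips 2 and 3: gradient checkpointing plus gradient accumulation, so the micro-batch shrinks while the effective batch size stays put. It assumes `model` is a Hugging Face causal LM and `dataloader` yields tokenized micro-batches; both are placeholders:

```python
import torch

accumulation_steps = 8          # effective batch = micro_batch × accumulation_steps
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.gradient_checkpointing_enable()   # recompute activations in the backward pass
model.config.use_cache = False          # the KV cache is useless (and wasteful) during training

for step, batch in enumerate(dataloader):
    outputs = model(**batch)                       # batch holds input_ids, attention_mask, labels
    loss = outputs.loss / accumulation_steps       # scale so accumulated gradients average correctly
    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)      # free gradient memory between updates
```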
4. Use Specialized Inference Engines for Generation
vLLM and SGLang provide optimized KV cache management:
- PagedAttention reduces memory fragmentation
- Continuous batching improves throughput
- verl and ROLL both support vLLM integration
5. Apply Tensor Parallelism for Large Models
When a single GPU is insufficient:
- TP=2 halves per-GPU memory
- TP=4 quarters it
- Requires NVLink for good performance
6. Quantize When Possible
- int8 inference reduces model memory by 2×
- int4 reduces by 4×
- Training typically requires bf16 for stability
7. CPU Offloading (Last Resort)
Moves optimizer states or even parameters to CPU RAM. Works but significantly slows training. Only use when you're just slightly over GPU memory limits.
Conclusion
GPU memory estimation for LLM post-training comes down to understanding five components:
- Parameters: 2P GB in bf16
- Optimizer States: 12P GB for Adam (eliminated with LoRA)
- Gradients: 2P GB (eliminated with LoRA)
- Activations: Scales with batch × sequence (reduced with checkpointing)
- KV Cache: Scales with batch × sequence (only during generation)
Key takeaways:
- LoRA is essential for single-GPU post-training—it removes the optimizer state bottleneck
- Generation and training peaks don't overlap in GRPO—take the maximum, not the sum
- Activations and KV cache are your dynamic variables—adjust batch size and sequence length to fit
- When in doubt, estimate conservatively and leave headroom for system overhead
Next time you see CUDA out of memory, you'll know exactly where to look: is it the KV cache during generation, or activations during training? Then you can apply the right fix.