No Out Of Memory!
Nothing kills the momentum of training a large model quite like CUDA out of memory. You've got your dataset ready, your hyperparameters tuned, and then—boom—your job crashes before the first gradient update.
This post is a practical guide to understanding and estimating GPU memory usage, specifically for post-training scenarios (SFT, RLHF, GRPO) and inference. By the end, you'll be able to predict whether your workload fits on your hardware before you even start.
TL;DR: Quick Estimation Formulas
Let's cut to the chase. Here are the formulas you can use right away. All assume bf16/fp16 precision, with P denoting parameter count in billions.
Inference:
Memory (GB) ≈ 2P + KV_Cache
KV_Cache (GB) ≈ 4 × num_layers × hidden_dim × seq_len × batch_size / 1024³
SFT (Full Fine-tuning with Adam):
Memory (GB) ≈ 16P + Activations
SFT (LoRA Fine-tuning):
Memory (GB) ≈ 2P + Activations + negligible_LoRA_overhead
GRPO (LoRA, with weight sharing between Actor and Reference):
Memory (GB) ≈ max(Generation_Peak, Training_Peak)
Generation_Peak ≈ 2P + KV_Cache
Training_Peak ≈ 2P + Activations
Quick Reference (Single H100 80GB, GRPO-LoRA, seq_len=2048):
| Model Size | Feasible? | Approx. Batch Size |
|------------|-----------|--------------------|
| 7B | ✅ Yes | 8–16 |
| 14B | ✅ Yes | 4–8 |
| 32B | ❌ No | Requires 2+ GPUs |
| 72B | ❌ No | Requires 8+ GPUs |
Understanding Memory Components
Before diving into estimation, let's understand what actually consumes GPU memory. When you run nvidia-smi, you see the total memory usage, but this includes both system-level overhead (CUDA context, cuDNN, etc.) and framework-level consumption (PyTorch tensors, etc.). We'll focus on the framework level—that's what we can actually optimize.
GPU memory during training and inference can be broken down into five main components:
1. Model Parameters
This is the most straightforward component. Each parameter occupies memory based on its data type:
| Precision | Bytes per Parameter | 7B Model | 72B Model |
|-----------|---------------------|----------|-----------|
| fp32 | 4 | 28 GB | 288 GB |
| bf16/fp16 | 2 | 14 GB | 144 GB |
| int8 | 1 | 7 GB | 72 GB |
| int4 | 0.5 | 3.5 GB | 36 GB |
The formula is simple:
Parameter Memory (GB) = bytes_per_param × num_params / 1024³
For bf16, this simplifies to approximately 2P GB where P is in billions.
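As a sanity check, the table above is just one multiplication. A quick sketch (P is treated as a round number of billions, and the result is in decimal GB, which is why 7B in bf16 comes out to 14 GB):

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def param_memory_gb(params_in_billions, dtype="bf16"):
    """Weights-only memory in GB: bytes per parameter times billions of parameters."""
    return BYTES_PER_PARAM[dtype] * params_in_billions

print(param_memory_gb(7))            # 14 GB  (bf16)
print(param_memory_gb(72, "fp32"))   # 288 GB
print(param_memory_gb(72, "int4"))   # 36 GB
```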
2. Optimizer States
This is often the silent killer for training memory. Different optimizers have different memory footprints:
Adam/AdamW (most common for LLM training):
- First moment (momentum): 4 bytes per parameter (fp32)
- Second moment (variance): 4 bytes per parameter (fp32)
- Master weights copy: 4 bytes per parameter (fp32, when using mixed precision)
Total: 12 bytes per parameter, regardless of model precision.
For a 7B model, that's 84 GB just for optimizer states—more than the model itself in bf16!
SGD with Momentum: 4 bytes per parameter (just momentum)
AdaFactor: Can be more memory-efficient with factored second moments
The key insight: even if your model uses bf16, the optimizer typically maintains fp32 states for numerical stability. This is why full fine-tuning is so memory-intensive.
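A quick sketch of just the optimizer-state term, which also shows why LoRA sidesteps the problem (the 20M LoRA-parameter figure below is an illustrative assumption, not a measurement):

```python
ADAM_BYTES_PER_PARAM = 4 + 4 + 4   # fp32 momentum + fp32 variance + fp32 master weights

def adam_state_gb(trainable_params_in_billions):
    """Adam/AdamW optimizer-state memory (GB) under mixed-precision training."""
    return ADAM_BYTES_PER_PARAM * trainable_params_in_billions

print(adam_state_gb(7))      # 84 GB when every parameter of a 7B model is trainable
print(adam_state_gb(0.02))   # ~0.24 GB for ~20M trainable LoRA parameters
```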
3. Gradients
During backpropagation, gradients are computed for each trainable parameter. Gradient memory matches the training precision:
Gradient Memory (GB) = bytes_per_param × num_trainable_params / 1024³
For bf16 training of a 7B model: 14 GB.
With LoRA, you only store gradients for the adapter parameters (typically 0.1%–1% of total), making this negligible.
4. Activations
Activations are the intermediate outputs from each layer during the forward pass. They must be stored for the backward pass to compute gradients. This is often where OOM surprises come from, because activation memory scales with:
- Batch size: Linear scaling
- Sequence length: Linear scaling
- Model depth: Linear scaling
- Hidden dimension: Linear scaling
The Megatron-LM paper provides a detailed formula:
Activation Memory (bytes) ≈ s × b × h × (34 + 5 × a × s / h) × L
Where:
- s = sequence length
- b = micro-batch size
- h = hidden dimension
- a = number of attention heads
- L = number of layers

The constants in the formula already assume 16-bit activations, so no separate precision factor is needed.
For practical estimation with gradient checkpointing enabled (which most frameworks use by default), activations are significantly reduced because intermediate values are recomputed during the backward pass:
Activation Memory (with checkpointing) ≈ 2 × L × h × s × b × precision_bytes / 1024³
For a 7B model (L=32, h=4096) with batch_size=8 and seq_len=2048 in bf16:
Activations ≈ 2 × 32 × 4096 × 2048 × 8 × 2 / 1024³ ≈ 8 GB
Double the batch size, and you double the activation memory.
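A small helper makes the scaling explicit. This uses the simplified checkpointing approximation above, not the full Megatron formula, so treat the outputs as rough estimates:

```python
def activation_memory_gb(num_layers, hidden_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate activation memory (GB) with gradient checkpointing enabled."""
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_elem / 1024**3

# 7B-class model: 32 layers, hidden size 4096
print(activation_memory_gb(32, 4096, 2048, 8))    # ≈ 8 GB
print(activation_memory_gb(32, 4096, 2048, 16))   # ≈ 16 GB (double the batch, double the memory)
print(activation_memory_gb(32, 4096, 8192, 8))    # ≈ 32 GB (4× the sequence length, 4× the memory)
```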
5. KV Cache (Inference and Generation)
During autoregressive generation, each new token depends on all previous tokens. To avoid recomputing attention for the entire sequence at each step, we cache the Key and Value projections—this is the KV cache.
For each layer, we store:
- Keys: batch_size × seq_len × hidden_dim (2 bytes in bf16)
- Values: batch_size × seq_len × hidden_dim (2 bytes in bf16)
Total KV cache:
KV Cache (GB) = 2 × 2 × num_layers × hidden_dim × seq_len × batch_size / 1024³
= 4 × L × h × s × b / 1024³
For a 7B model (L=32, h=4096) with batch_size=8 and seq_len=2048:
KV Cache ≈ 4 × 32 × 4096 × 2048 × 8 / 1024³ ≈ 8 GB
For a 72B model (L=80, h=8192) with the same settings:
KV Cache ≈ 4 × 80 × 8192 × 2048 × 8 / 1024³ ≈ 40 GB
This is why long-context generation is so memory-hungry—the cache grows linearly with sequence length.
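The same kind of helper works for the KV cache. Note that this assumes standard multi-head attention, where keys and values span the full hidden dimension; models using grouped-query attention cache proportionally less:

```python
def kv_cache_gb(num_layers, hidden_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size (GB): keys plus values for every layer, bf16 by default."""
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_elem / 1024**3

print(kv_cache_gb(32, 4096, 2048, 8))   # ≈ 8 GB   (7B-class model)
print(kv_cache_gb(80, 8192, 2048, 8))   # ≈ 40 GB  (72B-class model)
print(kv_cache_gb(80, 8192, 32768, 8))  # ≈ 640 GB (same 72B model at 32k context)
```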
Inference Memory Estimation
Inference is the simpler case. You need:
- Model parameters: 2P GB (bf16)
- KV cache: grows during generation
Inference Memory (GB) = 2P + KV_Cache
Practical Example: Qwen2.5-7B Inference
Model parameters: 14 GB
For batch_size=16, max_seq_len=4096:
KV Cache = 4 × 32 × 4096 × 4096 × 16 / 1024³ ≈ 32 GB
Total ≈ 14 + 32 = 46 GB
Fits comfortably on an 80GB H100 with room for system overhead.
Optimizing Inference Memory
Several techniques can reduce inference memory:
- Quantization (int8/int4): Reduces parameter memory by 2–4×
- PagedAttention (vLLM): More efficient KV cache management
- Sliding window attention: Limits KV cache size for very long sequences
- Multi-Query/Grouped-Query Attention: Reduces KV cache by sharing keys/values across heads
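Of these, a dedicated inference engine is usually the first thing to try. A minimal vLLM sketch (the model name and limits are placeholders; `gpu_memory_utilization` caps how much GPU memory vLLM claims for weights plus KV cache):

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; swap in the model you actually serve.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="bfloat16",
    max_model_len=4096,           # bounds the per-sequence KV cache
    gpu_memory_utilization=0.85,  # leave headroom for CUDA context and other overhead
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```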
Post-Training Memory Estimation
Post-training includes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Let's examine each.
Supervised Fine-Tuning (SFT)
SFT is conceptually straightforward: you train the model on curated examples using cross-entropy loss. Memory requirements depend heavily on whether you're doing full fine-tuning or using parameter-efficient methods.
Full Fine-Tuning
Every component contributes:
SFT Full Memory = Parameters + Gradients + Optimizer_States + Activations
= 2P + 2P + 12P + Activations
= 16P + Activations
For a 7B model:
Static memory: 16 × 7 = 112 GB
+ Activations (varies with batch/seq)
This exceeds the 80GB of an H100. Full fine-tuning of 7B+ models requires either:
- Multiple GPUs with tensor/pipeline parallelism
- ZeRO-style optimizer sharding
- CPU offloading (slow)
LoRA Fine-Tuning
LoRA (Low-Rank Adaptation) freezes the base model and only trains small adapter matrices. Typically, LoRA parameters are 0.1%–1% of total parameters.
SFT LoRA Memory = Parameters + LoRA_Gradients + LoRA_Optimizer_States + Activations
≈ 2P + negligible + negligible + Activations
≈ 2P + Activations
For a 7B model with gradient checkpointing:
Parameters: 14 GB
Activations: ~8 GB (batch=8, seq=2048)
Total: ~22 GB
Easily fits on a single H100 with room for larger batches.
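For concreteness, this is roughly what the LoRA setup looks like with Hugging Face peft. It's a sketch: the checkpoint name, rank, and target modules are illustrative assumptions, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. "trainable params: ~10M || all params: ~7.6B || trainable%: ~0.1"
```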
SFT Memory Comparison
| Method | 7B Model | 72B Model |
|--------|----------|-----------|
| Full FT (static) | 112 GB | 1.15 TB |
| LoRA (static) | 14 GB | 144 GB |
| LoRA + Activations | ~22 GB | ~160 GB |
This is why LoRA has become the de facto standard for post-training.
Reinforcement Learning: GRPO
Reinforcement Learning from Human Feedback (RLHF) adds complexity because it involves multiple models working together. Traditional PPO requires four models:
- Actor: The policy being trained
- Critic: Estimates value functions (trainable)
- Reference: Frozen copy of the original policy
- Reward: Scores generated responses
GRPO (Group Relative Policy Optimization) simplifies this by eliminating the critic. Instead of learning a value function, GRPO uses group-relative advantages computed directly from reward scores within a batch. This makes it more memory-efficient and increasingly popular for LLM post-training.
GRPO Training Loop
A typical GRPO step involves:
- Generation (Rollout): Actor generates responses to prompts
- Reward Scoring: Reward model (or rule-based function) scores responses
- Advantage Computation: Relative advantages computed within groups
- Policy Update: Actor is updated to maximize advantage-weighted log probabilities
- KL Penalty: Reference model computes KL divergence to prevent policy drift
Memory Analysis for GRPO
The key insight is that generation and training don't happen simultaneously. This means we can analyze peak memory for each phase separately:
Generation Phase:
Generation Memory = Actor_Params + KV_Cache
= 2P + KV_Cache
If using a separate reward model:
Generation Memory = Actor_Params + Reward_Params + KV_Caches
Training Phase:
Training Memory = Actor_Params + Gradients + Optimizer_States + Activations + Reference_Params
With LoRA:
Training Memory ≈ Actor_Params + Activations + Reference_Params
= 2P + Activations + 2P (if separate reference)
= 2P + Activations (if sharing weights)
Critical Optimization: When using LoRA, the Actor and Reference can share the same base weights. The Actor applies LoRA adapters during the forward pass; the Reference does not. This eliminates the need for a separate Reference model copy.
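With peft, this sharing is one context manager away. A sketch, assuming `model` is a LoRA-wrapped PeftModel (like the one in the SFT section) and `input_ids` is a batch of generated token ids:

```python
import torch

def policy_and_reference_logits(model, input_ids):
    """Compute Actor and Reference logits from one shared set of base weights.

    Assumes `model` is a peft PeftModel with LoRA adapters attached; disabling
    the adapters recovers the frozen reference policy without a second copy.
    """
    actor_logits = model(input_ids=input_ids).logits          # base weights + LoRA adapters

    with torch.no_grad(), model.disable_adapter():            # base weights only
        reference_logits = model(input_ids=input_ids).logits

    return actor_logits, reference_logits
```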
GRPO with Weight Sharing (verl's approach)
verl implements a 3D-HybridEngine that takes this further. The same model weights are used for both generation and training, with the framework dynamically switching between inference and training modes:
Peak Memory = max(Generation_Peak, Training_Peak)
Generation_Peak = 2P + KV_Cache
Training_Peak = 2P + Activations + LoRA_overhead
Since these phases don't overlap, we take the maximum rather than summing.
GRPO Memory Example: Qwen2.5-7B
Configuration: LoRA, batch_size=16, seq_len=2048, gradient checkpointing enabled
Generation Phase:
Model params: 14 GB
KV Cache: 4 × 32 × 4096 × 2048 × 16 / 1024³ ≈ 16 GB
Generation Peak: ~30 GB
Training Phase:
Model params: 14 GB
Activations: 2 × 32 × 4096 × 2048 × 16 × 2 / 1024³ ≈ 16 GB
Training Peak: ~30 GB
Total Peak: ~30 GB (plus system overhead)
An H100 80GB handles this comfortably, leaving room for larger batches.
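A small helper reproduces these numbers under the same assumptions (LoRA with Actor/Reference weight sharing, gradient checkpointing, bf16). The 32B configuration below is an assumed 64-layer, 5120-hidden setup for illustration:

```python
def grpo_lora_peak_gb(params_b, num_layers, hidden_dim, seq_len, batch_size):
    """Peak memory (GB) for GRPO-LoRA when Actor and Reference share base weights."""
    weights = 2 * params_b
    kv_cache = 4 * num_layers * hidden_dim * seq_len * batch_size / 1024**3
    activations = 2 * num_layers * hidden_dim * seq_len * batch_size * 2 / 1024**3
    generation_peak = weights + kv_cache        # rollout phase
    training_peak = weights + activations       # policy-update phase
    return max(generation_peak, training_peak)  # the phases never overlap

print(grpo_lora_peak_gb(7, 32, 4096, 2048, 16))    # ≈ 30 GB: fits on one H100
print(grpo_lora_peak_gb(32, 64, 5120, 2048, 16))   # ≈ 104 GB: needs tensor parallelism
```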
Empirical Data from verl
From verl's documentation:
| Model | GPUs | Config | Max Batch |
|-------|------|--------|-----------|
| Qwen2.5-7B | 1×H100 | GRPO-LoRA | 16 |
| Qwen2.5-32B | 4×H100 | GRPO-LoRA | 180 |
| Qwen2.5-72B | 8×H100 | Full Fine-tuning | 176 |
Note that 32B and 72B require multiple GPUs with tensor parallelism.
When You Need a Reward Model
If using a learned reward model instead of rule-based rewards, you need additional memory:
Total Generation = Actor_Params + Reward_Params + KV_Caches
For same-sized Actor and Reward models:
Generation Peak ≈ 4P + 2 × KV_Cache
This roughly doubles the generation phase memory. Strategies to handle this:
- Use a smaller reward model
- Quantize the reward model (int8)
- Run reward inference on separate GPUs
- Use rule-based rewards when possible (e.g., code execution, math verification)
Parallelism Strategies
When a single GPU isn't enough, parallelism strategies can distribute memory across multiple devices.
Tensor Parallelism (TP)
Splits individual operations (matrix multiplications) across GPUs. Each GPU holds a fraction of each layer.
Per-GPU Memory = Parameters / TP + Activations / TP + ...
Best for: Large models that don't fit on a single GPU. Requires high-bandwidth interconnect (NVLink).
Pipeline Parallelism (PP)
Distributes layers across GPUs. GPU 0 has layers 0–N, GPU 1 has layers N+1–2N, etc.
Per-GPU Memory = Parameters / PP + ...
Best for: Very deep models. Can work with lower bandwidth but introduces pipeline bubbles.
Data Parallelism (DP)
Common misconception: DP does not reduce per-GPU memory. Each GPU holds the full model and processes different data batches. DP increases throughput, not memory efficiency.
ZeRO (DeepSpeed)
Partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across GPUs.
- ZeRO-1: Reduces optimizer state memory by N (number of GPUs)
- ZeRO-2: Also shards gradients
- ZeRO-3: Also shards parameters
Trade-off: ZeRO-3 has significant communication overhead, which can hurt throughput. For post-training, TP/PP often provides better efficiency.
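To get a feel for the trade-offs, here is a rough per-GPU estimate of static training memory under each ZeRO stage (full fine-tuning, bf16 weights and gradients with fp32 Adam states; activations and communication buffers are excluded):

```python
def zero_static_gb(params_b, num_gpus, stage=0):
    """Approximate per-GPU static memory (GB) for full fine-tuning under ZeRO."""
    weights, grads, optim = 2 * params_b, 2 * params_b, 12 * params_b
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= num_gpus   # ZeRO-3: also shard parameters
    return weights + grads + optim

for stage in range(4):
    print(f"7B, 8 GPUs, ZeRO-{stage}: {zero_static_gb(7, 8, stage):.1f} GB/GPU")
# ZeRO-0: 112.0, ZeRO-1: 38.5, ZeRO-2: 26.2, ZeRO-3: 14.0
```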
Practical Tips for Avoiding OOM
Ranked by impact:
1. Use LoRA (Essential for Post-Training)
Reduces trainable parameters by 99%+, eliminating optimizer state overhead. Rank 8–64 works well for most cases.
2. Enable Gradient Checkpointing (Default in Most Frameworks)
Trades ~30% extra compute for 5–10× activation memory reduction. Almost always worth it.
3. Reduce Batch Size / Sequence Length
Both activations and KV cache scale linearly with batch size and sequence length. If you hit OOM:
- First: reduce micro-batch size
- Use gradient accumulation to maintain effective batch size
- Consider shorter sequences if task allows
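A minimal training-loop sketch that combines tips 2 and 3: gradient checkpointing plus gradient accumulation, so the micro-batch shrinks while the effective batch size stays put. It assumes `model` is a Hugging Face causal LM and `dataloader` yields tokenized micro-batches; both are placeholders:

```python
import torch

accumulation_steps = 8          # effective batch = micro_batch × accumulation_steps
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.gradient_checkpointing_enable()   # recompute activations in the backward pass
model.config.use_cache = False          # the KV cache is useless (and wasteful) during training

for step, batch in enumerate(dataloader):
    outputs = model(**batch)                       # batch holds input_ids, attention_mask, labels
    loss = outputs.loss / accumulation_steps       # scale so accumulated gradients average correctly
    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)      # free gradient memory between updates
```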
4. Use Specialized Inference Engines for Generation
vLLM and SGLang provide optimized KV cache management:
- PagedAttention reduces memory fragmentation
- Continuous batching improves throughput
- verl and ROLL both support vLLM integration
5. Apply Tensor Parallelism for Large Models
When a single GPU is insufficient:
- TP=2 halves per-GPU memory
- TP=4 quarters it
- Requires NVLink for good performance
6. Quantize When Possible
- int8 inference reduces model memory by 2×
- int4 reduces by 4×
- Training typically requires bf16 for stability
7. CPU Offloading (Last Resort)
Moves optimizer states or even parameters to CPU RAM. Works but significantly slows training. Only use when you're just slightly over GPU memory limits.
Conclusion
GPU memory estimation for LLM post-training comes down to understanding five components:
- Parameters: 2P GB in bf16
- Optimizer States: 12P GB for Adam (eliminated with LoRA)
- Gradients: 2P GB (eliminated with LoRA)
- Activations: Scales with batch × sequence (reduced with checkpointing)
- KV Cache: Scales with batch × sequence (only during generation)
Key takeaways:
- LoRA is essential for single-GPU post-training—it removes the optimizer state bottleneck
- Generation and training peaks don't overlap in GRPO—take the maximum, not the sum
- Activations and KV cache are your dynamic variables—adjust batch size and sequence length to fit
- When in doubt, estimate conservatively and leave headroom for system overhead
Next time you see CUDA out of memory, you'll know exactly where to look: is it the KV cache during generation, or activations during training? Then you can apply the right fix.