Multi-GPU Training Setup
Training large machine learning models requires multiple GPUs working together. This guide covers setting up a multi-GPU environment with CUDA, PyTorch distributed training, and best practices for maximizing GPU utilization.
Prerequisites
- Multiple NVIDIA GPUs (same model recommended)
- Ubuntu 22.04/24.04
- Sufficient RAM (2x total GPU memory recommended)
- NVMe storage for fast data loading
CUDA and Driver Installation
# Install NVIDIA drivers (refresh package lists first)
sudo apt update
sudo apt install -y nvidia-driver-550
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent
# Add the toolkit to PATH, then verify
export PATH=/usr/local/cuda-12.4/bin:$PATH
nvidia-smi
nvcc --version
PyTorch with Multi-GPU
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Verify GPU availability
python3 -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
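Beyond the count, it is often useful to confirm which GPUs PyTorch actually sees. A small sketch (the `list_gpus` helper is illustrative, not a PyTorch API):

```python
import torch

def list_gpus():
    """Return (index, name, total_memory_GiB) for each visible CUDA device."""
    gpus = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gpus.append((i, props.name, props.total_memory / 2**30))
    return gpus

if __name__ == "__main__":
    for idx, name, mem in list_gpus():
        print(f"GPU {idx}: {name} ({mem:.1f} GiB)")
```

On a machine without CUDA this simply returns an empty list, which makes it a safe pre-flight check in training scripts.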
Distributed Data Parallel (DDP)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT in the
    # environment, so init_process_group can read them automatically
    dist.init_process_group("nccl")

def train():
    setup()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = YourModel().to(local_rank)  # YourModel: your nn.Module subclass
    ddp_model = DDP(model, device_ids=[local_rank])
    # Training loop with ddp_model goes here
    dist.destroy_process_group()

if __name__ == "__main__":
    train()

# Launch: torchrun --nproc_per_node=4 train.py
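Each DDP process should also see a disjoint shard of the training data, which is what DistributedSampler provides. A minimal sketch (num_replicas and rank are passed explicitly here for illustration; under torchrun they are taken from the initialized process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset of 8 samples; with 2 replicas, each rank gets 4 of them.
dataset = TensorDataset(torch.arange(8).float())

sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

# With shuffle=True, call sampler.set_epoch(epoch) at the start of each
# epoch so the shuffling order differs between epochs.
```

Rank 0 here receives indices 0, 2, 4, 6 and rank 1 the rest, so no sample is processed twice per epoch.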
NCCL Configuration
# Configure NCCL for multi-GPU communication
export NCCL_DEBUG=INFO # Verbose logging, useful when debugging hangs
export NCCL_IB_DISABLE=1 # Disable InfiniBand if not available
export NCCL_P2P_DISABLE=0 # Enable peer-to-peer GPU communication
export NCCL_SOCKET_IFNAME=eth0 # Network interface for multi-node
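Before launching a job with these settings, it can help to confirm that the installed PyTorch build ships with NCCL at all (this checks the build, not GPU-to-GPU connectivity):

```python
import torch
import torch.distributed as dist

# True if this PyTorch build includes the NCCL backend
print("NCCL available:", dist.is_nccl_available())
print("CUDA devices:", torch.cuda.device_count())
```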
Monitoring GPU Usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# GPU monitoring with nvitop (better UI)
pip install nvitop
nvitop
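For scripted monitoring, such as logging utilization alongside training metrics, nvidia-smi can emit machine-readable CSV via --query-gpu. A small sketch, assuming the four query fields shown (the parse_gpu_csv and gpu_stats helper names are illustrative):

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(line):
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    index, util, used, total = (field.strip() for field in line.split(","))
    return {"index": int(index), "util_pct": int(util),
            "mem_used_mib": int(used), "mem_total_mib": int(total)}

def gpu_stats():
    """Return one stats dict per GPU by shelling out to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [parse_gpu_csv(line) for line in out.strip().splitlines()]
```

Calling gpu_stats() once per logging interval gives per-GPU utilization and memory figures you can feed into TensorBoard or any metrics system.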
Best Practices
- Prefer DistributedDataParallel over DataParallel (DataParallel is single-process and bottlenecked by the Python GIL)
- Keep the per-GPU batch size fixed; the global batch size then scales linearly with GPU count, so adjust the learning rate to match
- Use mixed precision training (torch.amp autocast with GradScaler) to reduce GPU memory use and speed up compute
- Set pin_memory=True (and num_workers > 0) on DataLoader to speed up host-to-GPU transfers
- Use NVMe storage for training data to prevent I/O bottlenecks
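The batch-size point above can be made concrete with the common linear learning-rate scaling heuristic (the function name and the example numbers are illustrative):

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, num_gpus):
    """Scale the learning rate linearly with the global batch size."""
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_batch

# LR of 0.1 tuned at batch 256: 4 GPUs x 64/GPU keeps the global batch at
# 256 and the LR at 0.1, while 4 GPUs x 128/GPU doubles both.
print(scaled_lr(0.1, 256, 64, 4))   # 0.1
print(scaled_lr(0.1, 256, 128, 4))  # 0.2
```

A warmup period for the first few epochs is commonly paired with this rule when the scaled learning rate is large.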