Multi-GPU Training Setup
Training large machine learning models requires multiple GPUs working together. This guide covers setting up a multi-GPU environment with CUDA, PyTorch distributed training, and best practices for maximizing GPU utilization.
Prerequisites
- Multiple NVIDIA GPUs (same model recommended)
- Ubuntu 22.04/24.04
- Sufficient RAM (2x total GPU memory recommended)
- NVMe storage for fast data loading
CUDA and Driver Installation
# Install NVIDIA drivers (refresh package lists first)
sudo apt update
sudo apt install -y nvidia-driver-550
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent
# Add the toolkit to PATH, then verify
export PATH=/usr/local/cuda-12.4/bin:$PATH
nvidia-smi
nvcc --version
PyTorch with Multi-GPU
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Verify GPU availability
python3 -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
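Beyond the count, it is often useful to confirm which GPUs PyTorch actually sees. A small sketch (the `list_gpus` helper is illustrative, not a PyTorch API):

```python
import torch

def list_gpus():
    """Return (index, name, total_memory_GiB) for each visible CUDA device."""
    gpus = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gpus.append((i, props.name, props.total_memory / 2**30))
    return gpus

if __name__ == "__main__":
    for idx, name, mem in list_gpus():
        print(f"GPU {idx}: {name} ({mem:.1f} GiB)")
```

On a machine without CUDA this simply returns an empty list, which makes it a safe pre-flight check in training scripts.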
Distributed Data Parallel (DDP)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT in the
    # environment, so init_process_group can read them automatically
    dist.init_process_group("nccl")

def train():
    setup()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = YourModel().to(local_rank)  # YourModel: your nn.Module subclass
    ddp_model = DDP(model, device_ids=[local_rank])
    # Training loop with ddp_model goes here
    dist.destroy_process_group()

if __name__ == "__main__":
    train()

# Launch: torchrun --nproc_per_node=4 train.py
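Each DDP process should also see a disjoint shard of the training data, which is what DistributedSampler provides. A minimal sketch (num_replicas and rank are passed explicitly here for illustration; under torchrun they are taken from the initialized process group):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset of 8 samples; with 2 replicas, each rank gets 4 of them.
dataset = TensorDataset(torch.arange(8).float())

sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

# With shuffle=True, call sampler.set_epoch(epoch) at the start of each
# epoch so the shuffling order differs between epochs.
```

Rank 0 here receives indices 0, 2, 4, 6 and rank 1 the rest, so no sample is processed twice per epoch.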
NCCL Configuration
# Configure NCCL for multi-GPU communication
export NCCL_DEBUG=INFO # Verbose logging, useful when debugging hangs
export NCCL_IB_DISABLE=1 # Disable InfiniBand if not available
export NCCL_P2P_DISABLE=0 # Enable peer-to-peer GPU communication
export NCCL_SOCKET_IFNAME=eth0 # Network interface for multi-node
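Before launching a job with these settings, it can help to confirm that the installed PyTorch build ships with NCCL at all (this checks the build, not GPU-to-GPU connectivity):

```python
import torch
import torch.distributed as dist

# True if this PyTorch build includes the NCCL backend
print("NCCL available:", dist.is_nccl_available())
print("CUDA devices:", torch.cuda.device_count())
```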
Monitoring GPU Usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# GPU monitoring with nvitop (better UI)
pip install nvitop
nvitop
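For scripted monitoring, such as logging utilization alongside training metrics, nvidia-smi can emit machine-readable CSV via --query-gpu. A small sketch, assuming the four query fields shown (the parse_gpu_csv and gpu_stats helper names are illustrative):

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_csv(line):
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    index, util, used, total = (field.strip() for field in line.split(","))
    return {"index": int(index), "util_pct": int(util),
            "mem_used_mib": int(used), "mem_total_mib": int(total)}

def gpu_stats():
    """Return one stats dict per GPU by shelling out to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [parse_gpu_csv(line) for line in out.strip().splitlines()]
```

Calling gpu_stats() once per logging interval gives per-GPU utilization and memory figures you can feed into TensorBoard or any metrics system.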
Best Practices
- Prefer DistributedDataParallel over DataParallel (DataParallel is single-process and bottlenecked by the Python GIL)
- Keep the per-GPU batch size fixed; the global batch size then scales linearly with GPU count, so adjust the learning rate to match
- Use mixed precision training (torch.amp autocast with GradScaler) to reduce GPU memory use and speed up compute
- Set pin_memory=True (and num_workers > 0) on DataLoader to speed up host-to-GPU transfers
- Use NVMe storage for training data to prevent I/O bottlenecks
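The batch-size point above can be made concrete with the common linear learning-rate scaling heuristic (the function name and the example numbers are illustrative):

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, num_gpus):
    """Scale the learning rate linearly with the global batch size."""
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_batch

# LR of 0.1 tuned at batch 256: 4 GPUs x 64/GPU keeps the global batch at
# 256 and the LR at 0.1, while 4 GPUs x 128/GPU doubles both.
print(scaled_lr(0.1, 256, 64, 4))   # 0.1
print(scaled_lr(0.1, 256, 128, 4))  # 0.2
```

A warmup period for the first few epochs is commonly paired with this rule when the scaled learning rate is large.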