
Multi-GPU Training Environment Setup

By Admin · Mar 15, 2026 · Updated Apr 23, 2026


Training large machine learning models often requires multiple GPUs working in parallel. This guide covers setting up a multi-GPU environment with CUDA, PyTorch distributed training, and best practices for maximizing GPU utilization.

Prerequisites

  • Multiple NVIDIA GPUs (same model recommended)
  • Ubuntu 22.04/24.04
  • Sufficient RAM (2x total GPU memory recommended)
  • NVMe storage for fast data loading

CUDA and Driver Installation

# Install NVIDIA drivers
sudo apt update
sudo apt install -y nvidia-driver-550

# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent

# Verify (reboot first so the new driver loads)
nvidia-smi
nvcc --version

PyTorch with Multi-GPU

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Verify GPU availability
python3 -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"

Distributed Data Parallel (DDP)

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    # in the environment, so no arguments are needed here
    dist.init_process_group("nccl")

def train():
    setup()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = YourModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    # Training loop with ddp_model
    dist.destroy_process_group()

if __name__ == "__main__":
    train()

# Launch: torchrun --nproc_per_node=4 train.py
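Under the hood, DDP keeps a full model replica on each GPU and averages gradients across ranks with an NCCL all-reduce after every backward pass, so all replicas take the same optimizer step. A minimal sketch of that averaging step in plain Python (no GPUs involved; the function name is illustrative):

```python
def allreduce_mean(grads_per_rank):
    """Average each parameter's gradient across all ranks,
    mimicking the all-reduce DDP performs after backward()."""
    world_size = len(grads_per_rank)
    n_params = len(grads_per_rank[0])
    averaged = []
    for p in range(n_params):
        total = sum(rank_grads[p] for rank_grads in grads_per_rank)
        averaged.append(total / world_size)
    # Every rank receives the same averaged gradients
    return [list(averaged) for _ in range(world_size)]

# Two ranks, two parameters: rank 0 computed [1.0, 2.0], rank 1 computed [3.0, 4.0]
synced = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
print(synced[0])  # → [2.0, 3.0]
```

Because every rank ends up with identical gradients, the model replicas never drift apart.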

NCCL Configuration

# Optimize NCCL for multi-GPU communication
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1       # Disable InfiniBand if not available
export NCCL_P2P_DISABLE=0      # Enable peer-to-peer GPU communication
export NCCL_SOCKET_IFNAME=eth0  # Network interface for multi-node
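If you prefer to set these from Python rather than the shell, they must be in the environment before init_process_group() is called. A small helper along these lines (the function name and defaults are illustrative):

```python
import os

def configure_nccl(ifname="eth0", enable_p2p=True, have_infiniband=False):
    """Set common NCCL tuning variables; call before
    torch.distributed.init_process_group()."""
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_IB_DISABLE"] = "0" if have_infiniband else "1"
    os.environ["NCCL_P2P_DISABLE"] = "0" if enable_p2p else "1"
    os.environ["NCCL_SOCKET_IFNAME"] = ifname

configure_nccl()
print(os.environ["NCCL_IB_DISABLE"])  # → 1
```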

Monitoring GPU Usage

# Real-time GPU monitoring
watch -n 1 nvidia-smi

# GPU monitoring with nvitop (better UI)
pip install nvitop
nvitop
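For logging rather than interactive viewing, nvidia-smi can emit machine-readable CSV via its --query-gpu option. A sketch that collects per-GPU stats this way (the query fields are real nvidia-smi options; the parser and function names are illustrative):

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"

def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, mem_used, mem_total = (f.strip() for f in line.split(","))
        stats.append({
            "index": int(idx),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return stats

def gpu_stats():
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out)

# Sample output from a hypothetical 2-GPU box:
sample = "0, 97, 20312, 24576\n1, 95, 20288, 24576"
print(parse_gpu_stats(sample)[0]["util_pct"])  # → 97
```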

Best Practices

  • Prefer DistributedDataParallel over DataParallel; DDP runs one process per GPU and avoids the Python GIL bottleneck
  • Scale the global batch size linearly with GPU count, and adjust the learning rate to match
  • Use mixed precision training (torch.amp autocast with a GradScaler) to save GPU memory and speed up training
  • Pin data loader memory with pin_memory=True for faster host-to-device transfers
  • Use NVMe storage for training data to prevent I/O bottlenecks
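The batch-size point above is the linear scaling rule: if the global batch size grows by a factor of k, multiply the base learning rate by k as well. A tiny helper as a sketch (names and values are illustrative):

```python
def scaled_lr(base_lr, base_batch, n_gpus, per_gpu_batch):
    """Linear scaling rule: learning rate grows in proportion
    to the global batch size."""
    global_batch = n_gpus * per_gpu_batch
    return base_lr * global_batch / base_batch

# Tuned at lr=0.1 with batch size 256; moving to 4 GPUs x 256 samples each:
print(scaled_lr(0.1, 256, 4, 256))  # → 0.4
```

In practice a learning-rate warmup is often combined with this rule so the larger step size does not destabilize early training.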
