Multi-GPU training with Accelerate

What is Accelerate?

Accelerate is a library designed to simplify multi-GPU training of PyTorch models.

It supports many different parallelization strategies like Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP) and DeepSpeed.

The main selling point of the library is that it can handle things like model placement, dataloader division and gradient accumulation across multiple GPUs on multiple machines with minimal configuration.
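For example, gradient accumulation, which normally requires manual bookkeeping in a distributed setting, reduces to a single context manager. The following is only a minimal sketch (not taken from the examples below); the model, optimizer, dataloader and loss function are assumed to be defined elsewhere:

from accelerate import Accelerator

# Sketch only: model, optimizer, train_dataloader and loss_function
# are assumed to be defined elsewhere.
accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for inputs, targets in train_dataloader:
    # Gradients are synchronized and the optimizer step is applied
    # only on every 4th batch; the other batches just accumulate gradients.
    with accelerator.accumulate(model):
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()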

Accelerate includes a huge number of advanced features, so here we’ll focus on two aspects: how Accelerate works when training a model with the Hugging Face Trainer, and how to convert an existing PyTorch training loop to multi-GPU training.

Configuring Accelerate

Accelerate is configured using a YAML file that sets everything from the model distribution strategy to networking settings.

A typical configuration file might look something like this (accelerate_config.yaml):

# Accelerate configuration for multi-GPU distributed training
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
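The same kind of file can also describe multi-node training, in which case it includes the networking settings mentioned above. A hypothetical two-node variant could look like the following; the machine rank, IP address, port and process counts are placeholders that you would adapt to your own cluster:

# Hypothetical two-node configuration (addresses and counts are placeholders)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0              # 0 on the main node, 1 on the other node
main_process_ip: 10.0.0.1    # reachable address of the main node
main_process_port: 29500
mixed_precision: bf16
num_machines: 2
num_processes: 16            # total number of processes across all nodes
rdzv_backend: static
same_network: true
use_cpu: false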

The easiest way to create this configuration file is to run accelerate config, which launches a series of prompts about your desired parallelization strategy.

The main command used to launch Accelerate code, accelerate launch, has a huge number of arguments that can be set. All of them can be set in the configuration file, but they can also be given on the command line.
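For example, the number of processes and the mixed-precision setting from the configuration above could also be overridden directly on the command line (the script name here is just a placeholder):

accelerate launch --num_processes 4 --mixed_precision bf16 my_training_script.py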

Using Trainer with Accelerate

Trainer has been designed to utilize Accelerate automatically.

Let’s consider the following training code (trainer_mnist_cnn.py) that trains a simple CNN model on the MNIST dataset:

import time
import torch
from torch import nn
import transformers

from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

from transformers import Trainer, TrainingArguments

# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, images=None, **kwargs):
        return self.layers(images)

def main():
    """
    main function that does the training
    """

    data_dir = './data'

    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())

    model = SimpleCNN()

    def collator_fn(data):
        images = torch.stack([d[0] for d in data])
        labels = torch.tensor([d[1] for d in data])
        return {"images":images, "labels":labels}


    class MNISTTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
            images = inputs.pop('images')
            target = inputs.pop('labels')
            output = model(images)
            criterion = nn.CrossEntropyLoss()
            loss = criterion(output, target)
            return (loss, output) if return_outputs else loss


    trainer_args = TrainingArguments(
        report_to="none",
        num_train_epochs=4,
        eval_strategy="epoch",
        logging_steps=0.1,
    )

    trainer = MNISTTrainer(
        model=model,
        args=trainer_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=collator_fn,
    )
    
    trainer.train()


if __name__ == "__main__":
    main()

This code can be launched on a single GPU with

python trainer_mnist_cnn.py

Converting code that uses Trainer to multi-GPU code is quite trivial, as Trainer has been designed to work together with Accelerate.

When launched with a configuration file like the one given above, Accelerate sets up DDP (distributed_type: MULTI_GPU) across 8 GPUs (num_processes: 8):

accelerate launch --config_file accelerate_config.yaml trainer_mnist_cnn.py

In this case both the model and the data are so small that there would not be any benefit from using multiple GPUs.

A more complex example that fine-tunes a language model for sentiment classification can already see some benefit from multi-GPU training.
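One detail worth keeping in mind when moving Trainer code to multiple GPUs is that the batch size in TrainingArguments is per device, so the effective batch size scales with the number of processes. As an illustrative sketch (the values are made up), with the 8-process configuration above the arguments below would give an effective batch size of 8 * 32 * 2 = 512:

trainer_args = TrainingArguments(
    report_to="none",
    num_train_epochs=4,
    per_device_train_batch_size=32,   # batch size on each GPU
    gradient_accumulation_steps=2,    # accumulate gradients before each optimizer step
    eval_strategy="epoch",
    logging_steps=0.1,
)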

Using Accelerate with PyTorch code

Let’s consider the following training code that again trains a simple CNN model on the MNIST dataset, but this time with a training loop closer to regular PyTorch:

import time
import torch
from torch import nn
import transformers

from transformers import get_linear_schedule_with_warmup

from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

import logging

# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)

def main():
    """
    main function that does the training
    """
    
    data_dir = './data'

    n_epochs = 8
    batch_size = 32

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

    # Define data sets and data loaders
    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
    
    # Define loss function
    loss_function = nn.CrossEntropyLoss()

    # Define model
    model = SimpleCNN()

    # Define optimizer
    optimizer = torch.optim.AdamW(model.parameters())

    num_training_steps = n_epochs * len(train_dataloader)

    # Define learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=min(500, num_training_steps // 10),  # 10% warmup
        num_training_steps=num_training_steps
    )
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    model = model.to(device)
    model.train()

    total_steps = 0
    start_time = time.time()

    for epoch in range(n_epochs):

        epoch_loss = 0
        epoch_steps = 0
        
        logger.info(f"Starting epoch {epoch + 1}/{n_epochs}")
        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            total_steps += 1
            epoch_loss += loss.item()
            epoch_steps += 1

            # Enhanced logging
            if total_steps % 100 == 0:
                elapsed_time = time.time() - start_time
                avg_loss = epoch_loss / epoch_steps
                current_lr = scheduler.get_last_lr()[0]
                steps_per_sec = total_steps / elapsed_time
                
                logger.info(
                    f"Step {total_steps} | Loss: {loss.item():.4f} | "
                    f"Avg Loss: {avg_loss:.4f} | LR: {current_lr:.2e} | "
                    f"Steps/sec: {steps_per_sec:.2f}"
                )

if __name__ == "__main__":
    main()

This can be converted to use Accelerate with minor changes to the original code (accelerate_mnist_cnn.py):

import time
import torch
from torch import nn
import transformers

from transformers import get_linear_schedule_with_warmup

from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

from accelerate import Accelerator
import logging

# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)

def main():
    """
    main function that does the training
    """
    
    data_dir = './data'

    n_epochs = 8
    batch_size = 32

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

    # Define data sets and data loaders
    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
    
    # Define loss function
    loss_function = nn.CrossEntropyLoss()

    # Define model
    model = SimpleCNN()

    # Define optimizer
    optimizer = torch.optim.AdamW(model.parameters())

    num_training_steps = n_epochs * len(train_dataloader)

    # Define learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=min(500, num_training_steps // 10),  # 10% warmup
        num_training_steps=num_training_steps
    )
    
    accelerator = Accelerator()

    device = accelerator.device
    
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, scheduler
    )
    
    model.train()

    total_steps = 0
    start_time = time.time()

    for epoch in range(n_epochs):

        epoch_loss = 0
        epoch_steps = 0
        
        if accelerator.is_main_process:
            logger.info(f"Starting epoch {epoch + 1}/{n_epochs}")
        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            
            total_steps += 1
            epoch_loss += loss.item()
            epoch_steps += 1

            # Enhanced logging
            if total_steps % 100 == 0 and accelerator.is_main_process:
                elapsed_time = time.time() - start_time
                avg_loss = epoch_loss / epoch_steps
                current_lr = scheduler.get_last_lr()[0]
                steps_per_sec = total_steps / elapsed_time
                
                logger.info(
                    f"Step {total_steps} | Loss: {loss.item():.4f} | "
                    f"Avg Loss: {avg_loss:.4f} | LR: {current_lr:.2e} | "
                    f"Steps/sec: {steps_per_sec:.2f}"
                )
                
    accelerator.end_training()


if __name__ == "__main__":
    main()
    # torch.distributed.destroy_process_group()  # some setups clean up the process group explicitly at exit

The main modifications are setting up the Accelerator and letting Accelerate handle the placement of the model, optimizer, dataloader and scheduler:

    accelerator = Accelerator()

    device = accelerator.device
    
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, scheduler
    )

and replacing loss.backward() with accelerator.backward(loss), which makes certain that gradients are handled correctly across all processes:

        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
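Note that after accelerator.prepare each process only iterates over its own shard of the data. If you extend the script with an evaluation loop over test_dataloader, the per-process results have to be gathered before computing a metric. A minimal sketch, assuming test_dataloader has also been passed through accelerator.prepare:

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, targets in test_dataloader:
        outputs = model(inputs)
        predictions = outputs.argmax(dim=-1)
        # Collect predictions and targets from all processes;
        # gather_for_metrics also drops samples duplicated by padding.
        predictions, targets = accelerator.gather_for_metrics((predictions, targets))
        correct += (predictions == targets).sum().item()
        total += targets.numel()
if accelerator.is_main_process:
    logger.info(f"Test accuracy: {correct / total:.4f}")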

This training loop can now be launched in the same way as the Trainer version, and it will utilize multiple GPUs:

accelerate launch --config_file accelerate_config.yaml accelerate_mnist_cnn.py
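After training you will usually also want to save the resulting weights. A minimal sketch of how this could be done at the end of the example above; the model is first unwrapped from its distributed wrapper, and the file name is just a placeholder:

# Make sure all processes have finished before saving
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), "mnist_cnn.pt")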