Multi-GPU training with Accelerate
What is Accelerate?
Accelerate is a library designed to simplify multi-GPU training of PyTorch models.
It supports many different parallelization strategies like Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP) and DeepSpeed.
The main selling point of the library is that it can handle things like model placement, dataloader division and gradient accumulation across multiple GPUs on multiple machines with minimal configuration.
Accelerate includes a large number of advanced features, so here we’ll focus on two use cases: how Accelerate works when training a model with the Hugging Face Trainer, and how to convert an existing PyTorch training loop to multi-GPU training with Accelerate.
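To make "dataloader division" concrete, here is a minimal pure-Python sketch of how batches can be dealt out to processes in a data-parallel run. It is a simplified model for illustration, not Accelerate's actual implementation: the helper name shard_batches and the plain round-robin assignment are assumptions, and details such as padding the last batches are ignored.

```python
def shard_batches(num_batches, num_processes, process_index):
    """Return the batch indices one process would see (hypothetical helper).

    Simplified model: batches are dealt out round-robin, so process i
    gets batches i, i + N, i + 2N, ... where N is the number of processes.
    """
    return list(range(process_index, num_batches, num_processes))


num_batches = 16     # batches in the full, unsharded dataloader
num_processes = 8    # one process per GPU

for rank in range(num_processes):
    print(f"process {rank}: batches {shard_batches(num_batches, num_processes, rank)}")

# Every batch is handled by exactly one process.
seen = sorted(
    index
    for rank in range(num_processes)
    for index in shard_batches(num_batches, num_processes, rank)
)
assert seen == list(range(num_batches))
```

Each process iterates over only its own share of the batches, so an epoch takes roughly 1/N of the steps it would take on a single GPU.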
Configuring Accelerate
Accelerate is configured using a YAML file that sets everything
from the model distribution strategy to networking settings.
A typical configuration file might look something like this
(accelerate_config.yaml):
# Accelerate configuration for multi-GPU distributed training
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
The easiest way to create this configuration file is to run accelerate config, which launches a series of prompts about your desired parallelization strategy.
The main command used to launch Accelerate code, accelerate launch,
accepts a large number of arguments. All of them can be
set in the configuration file, but they can also be given on the command line.
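As a small illustration of mixing the two, the sketch below assembles an accelerate launch command in Python, taking most settings from the configuration file and overriding a few on the command line. The helper build_launch_command is hypothetical; --config_file, --num_processes and --mixed_precision are existing accelerate launch options. The script name is the one used in the Trainer example of this guide, and the assumption is that values given on the command line take precedence over those in the file.

```python
import shlex


def build_launch_command(script, config_file=None, **overrides):
    """Assemble an `accelerate launch` command line (hypothetical helper).

    Keyword arguments become `--name value` options, e.g.
    num_processes=2 -> `--num_processes 2`.
    """
    command = ["accelerate", "launch"]
    if config_file is not None:
        command += ["--config_file", config_file]
    for name, value in overrides.items():
        command += [f"--{name}", str(value)]
    command.append(script)
    return command


command = build_launch_command(
    "trainer_mnist_cnn.py",
    config_file="accelerate_config.yaml",
    num_processes=2,           # use 2 GPUs instead of the 8 set in the file
    mixed_precision="bf16",
)
print(shlex.join(command))
# To actually start the run: subprocess.run(command, check=True)
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.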
Using Trainer with Accelerate
Trainer has been designed to utilize Accelerate automatically.
Let’s consider the following training code
(trainer_mnist_cnn.py) that trains a simple
CNN model on the MNIST dataset:
import time

import torch
from torch import nn
import transformers
from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from transformers import Trainer, TrainingArguments


# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, images=None, **kwargs):
        return self.layers(images)


def main():
    """
    main function that does the training
    """
    data_dir = './data'
    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())
    model = SimpleCNN()

    # Turn a list of (image, label) pairs into the dict format Trainer expects
    def collator_fn(data):
        images = torch.stack([d[0] for d in data])
        labels = torch.tensor([d[1] for d in data])
        return {"images": images, "labels": labels}

    # Trainer subclass that computes the classification loss for our model
    class MNISTTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
            images = inputs.pop('images')
            target = inputs.pop('labels')
            output = model(images)
            criterion = nn.CrossEntropyLoss()
            loss = criterion(output, target)
            return (loss, output) if return_outputs else loss

    trainer_args = TrainingArguments(
        report_to="none",
        num_train_epochs=4,
        eval_strategy="epoch",
        logging_steps=0.1,  # a float below 1 is a fraction of the total training steps
    )
    trainer = MNISTTrainer(
        model=model,
        args=trainer_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=collator_fn,
    )
    trainer.train()


if __name__ == "__main__":
    main()
This code can be launched on a single GPU with
python trainer_mnist_cnn.py
Converting code that uses Trainer to multi-GPU code requires no changes to the script,
as Trainer has been designed to work together with Accelerate.
When the script is launched with a configuration file like the one given above, Accelerate
sets up DDP (distributed_type: MULTI_GPU) with 8
GPUs (num_processes: 8):
accelerate launch --config_file accelerate_config.yaml trainer_mnist_cnn.py
In this case the model and the data are both so small that using multiple GPUs brings no real benefit.
A more complex example that
fine-tunes a language model for sentiment classification can already see some benefits
from multi-GPU training.
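A rough back-of-the-envelope calculation shows what changes when the same script runs on several GPUs. The sketch assumes the 60,000-image MNIST training set, Trainer's default per-device batch size of 8 (the example above does not set per_device_train_batch_size), and plain data parallelism; the helper name steps_per_epoch is hypothetical.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, num_processes):
    """Optimizer steps per epoch under data parallelism (hypothetical helper)."""
    effective_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_samples / effective_batch_size)


num_samples = 60_000         # MNIST training set size
per_device_batch_size = 8    # assumed: Trainer default

for num_processes in (1, 8):
    steps = steps_per_epoch(num_samples, per_device_batch_size, num_processes)
    print(
        f"{num_processes} GPU(s): effective batch size "
        f"{per_device_batch_size * num_processes}, {steps} steps per epoch"
    )
```

With 8 processes the number of steps per epoch drops from 7500 to 938, but each step of this tiny model does very little computation, so start-up and gradient communication overhead can easily outweigh the savings. Note also that the effective batch size grows from 8 to 64, which may call for retuning the learning rate.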
Using Accelerate with PyTorch code
Let’s consider the following training code that again trains a simple CNN model
on the MNIST dataset, but this time with a training loop closer to regular PyTorch
(trainer_mnist_cnn.py):
import time
import logging

import torch
from torch import nn
import transformers
from transformers import get_linear_schedule_with_warmup
from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor


# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)


def main():
    """
    main function that does the training
    """
    data_dir = './data'
    n_epochs = 8
    batch_size = 32

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

    # Define data sets and data loaders
    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

    # Define loss function
    loss_function = nn.CrossEntropyLoss()

    # Define model
    model = SimpleCNN()

    # Define optimizer
    optimizer = torch.optim.AdamW(model.parameters())
    num_training_steps = n_epochs * len(train_dataloader)

    # Define learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=min(500, num_training_steps // 10),  # 10% warmup, at most 500 steps
        num_training_steps=num_training_steps
    )

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    model.train()

    total_steps = 0
    start_time = time.time()
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_steps = 0
        logger.info(f"Starting epoch {epoch + 1}/{n_epochs}")
        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            # Compute gradients before the optimizer step
            loss.backward()
            optimizer.step()
            scheduler.step()

            total_steps += 1
            epoch_loss += loss.item()
            epoch_steps += 1

            # Enhanced logging
            if total_steps % 100 == 0:
                elapsed_time = time.time() - start_time
                avg_loss = epoch_loss / epoch_steps
                current_lr = scheduler.get_last_lr()[0]
                steps_per_sec = total_steps / elapsed_time
                logger.info(
                    f"Step {total_steps} | Loss: {loss.item():.4f} | "
                    f"Avg Loss: {avg_loss:.4f} | LR: {current_lr:.2e} | "
                    f"Steps/sec: {steps_per_sec:.2f}"
                )


if __name__ == "__main__":
    main()
This can be converted to use Accelerate with minor changes to the original code
(accelerate_mnist_cnn.py):
import time
import logging

import torch
from torch import nn
import transformers
from transformers import get_linear_schedule_with_warmup
from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from accelerate import Accelerator


# Describe model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding='valid'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(32*13*13, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)


def main():
    """
    main function that does the training
    """
    data_dir = './data'
    n_epochs = 8
    batch_size = 32

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

    # Define data sets and data loaders
    train_dataset = datasets.MNIST(data_dir, train=True, download=True, transform=ToTensor())
    test_dataset = datasets.MNIST(data_dir, train=False, transform=ToTensor())
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

    # Define loss function
    loss_function = nn.CrossEntropyLoss()

    # Define model
    model = SimpleCNN()

    # Define optimizer
    optimizer = torch.optim.AdamW(model.parameters())
    num_training_steps = n_epochs * len(train_dataloader)

    # Define learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=min(500, num_training_steps // 10),  # 10% warmup, at most 500 steps
        num_training_steps=num_training_steps
    )

    # Let Accelerate place the model and wrap the training objects
    accelerator = Accelerator()
    device = accelerator.device
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, scheduler
    )
    model.train()

    total_steps = 0
    start_time = time.time()
    for epoch in range(n_epochs):
        epoch_loss = 0
        epoch_steps = 0
        if accelerator.is_main_process:
            logger.info(f"Starting epoch {epoch + 1}/{n_epochs}")
        for step, batch in enumerate(train_dataloader):
            optimizer.zero_grad()
            inputs, targets = batch
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            # Replaces loss.backward()
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()

            total_steps += 1
            epoch_loss += loss.item()
            epoch_steps += 1

            # Enhanced logging (main process only)
            if total_steps % 100 == 0 and accelerator.is_main_process:
                elapsed_time = time.time() - start_time
                avg_loss = epoch_loss / epoch_steps
                current_lr = scheduler.get_last_lr()[0]
                steps_per_sec = total_steps / elapsed_time
                logger.info(
                    f"Step {total_steps} | Loss: {loss.item():.4f} | "
                    f"Avg Loss: {avg_loss:.4f} | LR: {current_lr:.2e} | "
                    f"Steps/sec: {steps_per_sec:.2f}"
                )

    accelerator.end_training()


if __name__ == "__main__":
    main()
    # torch.distributed.destroy_process_group()
The main modifications are setting up the Accelerator and letting Accelerate handle the model placement
accelerator = Accelerator()
device = accelerator.device
model, optimizer, train_dataloader, scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, scheduler
)
and making certain that the backward pass goes through accelerator.backward(loss) instead of loss.backward(), so that gradients are synchronized across all model replicas:
for step, batch in enumerate(train_dataloader):
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
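Why combining gradients across GPUs keeps training consistent can be seen with a toy calculation. The pure-Python sketch below is not Accelerate code; the model, data and helper name are made up for illustration. It fits y = w * x with a mean squared error loss and checks that averaging the gradients of two equally sized shards gives the same gradient as the full batch, which is the property DDP-style training relies on.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w (hypothetical helper)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient computed by a single process on the full batch.
full_gradient = mse_gradient(w, xs, ys)

# Two "processes", each seeing one half of the batch.
shard_gradients = [
    mse_gradient(w, xs[:2], ys[:2]),
    mse_gradient(w, xs[2:], ys[2:]),
]

# Averaging the per-shard gradients mimics the all-reduce step of DDP.
averaged_gradient = sum(shard_gradients) / len(shard_gradients)

print(full_gradient, averaged_gradient)
assert abs(full_gradient - averaged_gradient) < 1e-12
```

The equality holds exactly only when every shard has the same number of samples, which is one reason distributed dataloaders typically keep batch sizes equal across processes.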
Now this training loop can be launched in the same way as the Trainer one
and it will utilize multiple GPUs:
accelerate launch --config_file accelerate_config.yaml accelerate_mnist_cnn.py