Taking advantage of multiple GPUs to train larger models such as RoBERTa-Large on NLP datasets

This article is co-authored by Saichandra Pandraju.

This tutorial shows how to implement Model Parallelism (splitting a model's layers across multiple GPUs) so that you can train models that are too large to fit on a single GPU. We will use the RoBERTa-Large model from Hugging Face Transformers. The approach achieves Model Parallelism with plain PyTorch, without wrappers such as PyTorch Lightning.

Lastly, we will use the IMDB dataset of 50K movie reviews to fine-tune our RoBERTa-Large model.

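To make the idea of splitting layers across GPUs concrete, here is a minimal sketch of how a split could be planned. It is an illustration under stated assumptions, not necessarily the split this tutorial ends up using: the helper name `assign_layers` is hypothetical, the count of 4 GPUs is an example value, and the only model fact it relies on is that RoBERTa-Large has 24 encoder layers. The sketch assigns contiguous blocks of layers to each device as evenly as possible.

```python
from typing import Dict, List


def assign_layers(num_layers: int, num_devices: int) -> Dict[int, List[int]]:
    """Hypothetical helper: map each device index to a contiguous block of layer indices."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if num_layers < num_devices:
        raise ValueError("need at least one layer per device")

    # Every device gets `base` layers; the first `extra` devices get one more.
    base, extra = divmod(num_layers, num_devices)
    plan: Dict[int, List[int]] = {}
    start = 0
    for device in range(num_devices):
        count = base + (1 if device < extra else 0)
        plan[device] = list(range(start, start + count))
        start += count
    return plan


# Example: 24 encoder layers (RoBERTa-Large) over an assumed 4 GPUs.
plan = assign_layers(num_layers=24, num_devices=4)
for device, layers in plan.items():
    print(f"cuda:{device} -> encoder layers {layers[0]} to {layers[-1]}")
```

In plain PyTorch, a plan like this would typically be applied by moving each block of layers to its device with `.to("cuda:<index>")` and moving the hidden states to the next device between blocks during the forward pass.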

First, you will need a machine/VM with multiple GPUs…

Sakthi Ganesh

Machine Learning Engineer looking to solve real world problems and make AI/ML more efficient and accessible
