Taking advantage of multiple GPUs to train larger models such as RoBERTa-Large on NLP datasets
This article is co-authored by Saichandra Pandraju.
This tutorial will help you implement Model Parallelism (splitting a model's layers across multiple GPUs) so you can train larger models than a single GPU can hold. We will use the RoBERTa-Large model from Hugging Face Transformers. This approach achieves Model Parallelism with PyTorch alone, without any PyTorch wrappers such as PyTorch Lightning.
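The core idea can be sketched in a few lines of plain PyTorch: assign the first part of the model to one GPU and the rest to another, then move the activations between devices inside `forward`. The `SplitClassifier` module below is a hypothetical toy stand-in for RoBERTa-Large (the real split over its transformer layers follows the same pattern); it falls back to CPU when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Place the two halves on separate GPUs if we have them; otherwise fall back to CPU.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

class SplitClassifier(nn.Module):
    """Toy stand-in for a large model: front layers on dev0, back layers on dev1."""
    def __init__(self, hidden=16, num_labels=2):
        super().__init__()
        self.front = nn.Linear(hidden, hidden).to(dev0)  # first "half" of the model
        self.back = nn.Linear(hidden, hidden).to(dev1)   # second "half"
        self.head = nn.Linear(hidden, num_labels).to(dev1)

    def forward(self, x):
        x = torch.relu(self.front(x.to(dev0)))
        # Hand the activations over to the second device before the remaining layers.
        x = torch.relu(self.back(x.to(dev1)))
        return self.head(x)

model = SplitClassifier()
logits = model(torch.randn(4, 16))
print(tuple(logits.shape))
```

Note that only the intermediate activations cross the device boundary; each set of weights (and its gradients during backprop) stays on its assigned GPU, which is what lets the combined model exceed single-GPU memory.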
Lastly, we will use the IMDB dataset of 50K movie reviews to fine-tune our RoBERTa-Large model.
First, you will need a machine/VM with multiple GPUs…