{"id":90699,"date":"2022-03-21T09:00:00","date_gmt":"2022-03-21T16:00:00","guid":{"rendered":"https:\/\/cloudblogs.microsoft.com\/opensource\/?p=90699"},"modified":"2024-06-19T10:50:28","modified_gmt":"2024-06-19T17:50:28","slug":"supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/","title":{"rendered":"Supporting efficient large model training on AMD Instinct\u2122 GPUs with DeepSpeed"},"content":{"rendered":"\n<p><em>This post was co-authored by Jithun Nair and Aswin Mathews, members of technical staff at AMD<\/em>.<\/p>\n\n\n\n<p>In recent years, large-scale deep learning models have demonstrated impressive capabilities, excelling at tasks across natural language processing, computer vision, and speech domains. Companies now use these models to power novel AI-driven user experiences across a whole spectrum of applications and industries. However, efficiently training large models with tens or hundreds of billions of parameters is difficult\u2014the sheer size of these models requires them to be distributed across multiple nodes with careful orchestration of compute and communication.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.deepspeed.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSpeed<\/a>, as part of <a href=\"http:\/\/aka.ms\/aiatscale\">Microsoft\u2019s AI at Scale initiative<\/a>, is a popular open-source library for PyTorch that addresses these difficulties and vastly improves the scale, speed, cost, and usability of large model training and inference. 
It addresses the scaling challenges by allowing users to easily apply a powerful suite of compute, memory, and communication optimization techniques with minimal code changes. With these techniques, DeepSpeed has enabled training the <a href=\"https:\/\/www.microsoft.com\/research\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/\" target=\"_blank\" rel=\"noreferrer noopener\">largest transformer model with 530 billion parameters for language generation<\/a> and helped speed up training and inference by 2 to 20 times in real-life scenarios. It is also integrated into popular training libraries such as HuggingFace Transformers and PyTorch Lightning.<\/p>\n\n\n\n<p>Since 2006, AMD has been developing and continuously improving its GPU hardware and software technology for high-performance computing (HPC) and machine learning. Its open software platform, ROCm, contains the libraries, compilers, runtimes, and tools necessary for accelerating compute-intensive applications on AMD GPUs. Today, the major machine learning frameworks (such as PyTorch and TensorFlow) have fully upstreamed ROCm-supported binaries, so users can directly run their code written using these frameworks on <a href=\"https:\/\/www.amd.com\/en\/graphics\/instinct-server-accelerators\" target=\"_blank\" rel=\"noreferrer noopener\">AMD Instinct GPU hardware<\/a> and other <a href=\"https:\/\/docs.amd.com\/bundle\/Hardware_and_Software_Reference_Guide\/page\/Hardware_and_Software_Support.html\" target=\"_blank\" rel=\"noreferrer noopener\">ROCm compatible GPU hardware<\/a>\u2014without any porting effort.<\/p>\n\n\n\n<p>AMD has worked closely with the Microsoft DeepSpeed team to bring DeepSpeed\u2019s suite of parallelization and optimization techniques to ROCm-enabled AMD GPUs for efficient large model training. 
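To give a concrete flavor of the "minimal code changes" involved, DeepSpeed is typically driven by a small JSON configuration plus a wrapper call around an existing PyTorch model. The sketch below is illustrative only: the field values, and the commented-out `deepspeed.initialize` call, describe a typical setup under stated assumptions, not code from this post.

```python
import json

# Illustrative DeepSpeed configuration (values are placeholders, not tuned):
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # ZeRO stage 1, 2, or 3
}

# With DeepSpeed installed, an existing PyTorch training loop changes only slightly:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
#   loss = model_engine(batch)
#   model_engine.backward(loss)
#   model_engine.step()
print(json.dumps(ds_config, indent=2))
```

The same configuration file drives DeepSpeed identically on CUDA and ROCm builds of PyTorch, which is what keeps the user-facing API unchanged across GPU vendors.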
This unlocks the ability to efficiently train models with hundreds of billions of parameters on a wide choice of GPU hardware and system configurations, ranging from a single desktop to a distributed cluster of high-performance AMD Instinct\u2122 MI100\/MI200 accelerators.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Enabling a state-of-the-art DL stack on AMD GPUs<\/h2>\n\n\n\n<p>As an effective enabler of large model training, DeepSpeed provides a suite of powerful parallelism and memory optimizations, such as <a href=\"https:\/\/www.deepspeed.ai\/tutorials\/zero\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO<\/a>, <a href=\"https:\/\/www.deepspeed.ai\/tutorials\/zero-offload\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO-Offload<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO-Infinity<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/\" target=\"_blank\" rel=\"noreferrer noopener\">3D parallelism<\/a>, which are crucial for efficient training at massive model scales. With DeepSpeed, model scientists can significantly scale up their model sizes on AMD GPUs, well beyond the limits of pure data parallelism. As an example, figure 1 shows the model sizes that can be trained using 128 MI100 GPUs (on eight nodes) with different DeepSpeed optimizations. In general, each DeepSpeed optimization enables model scaling of two orders of magnitude beyond the 1.5-billion-parameter limit of data parallelism. 
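A back-of-envelope calculation, following the model-state accounting in the ZeRO paper for mixed-precision Adam, shows why pure data parallelism stalls around 1.5B parameters on a 32GB GPU while ZeRO-style partitioning goes much further. The numbers below are illustrative assumptions only; they ignore activation memory and fragmentation.

```python
# Model-state bytes per parameter under mixed-precision Adam, per the ZeRO
# paper's accounting: fp16 weights (2) + fp16 gradients (2) + fp32 master
# weights and Adam moments (12) = 16 bytes/parameter.
BYTES_PER_PARAM = 2 + 2 + 12

def model_state_gb(params, gpus=1, partitioned=True):
    """Model-state memory per GPU in GB; ZeRO-3 partitions states across GPUs."""
    total = params * BYTES_PER_PARAM
    return total / (gpus if partitioned else 1) / 1e9

# Pure data parallelism replicates all 16 bytes/param on every GPU, so a
# 1.5B-parameter model already needs ~24 GB per GPU, near a 32 GB MI100's limit:
dp_only = model_state_gb(1.5e9, gpus=128, partitioned=False)
# ZeRO-3 shards model states across all 128 GPUs, so far larger models fit:
zero3 = model_state_gb(200e9, gpus=128, partitioned=True)
print(round(dp_only, 1), round(zero3, 1))  # -> 24.0 25.0
```

This is why each partitioning step buys roughly a factor of the data-parallel degree in model scale, before offloading to CPU or NVMe memory extends it further.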
At the extreme, ZeRO-Infinity powers models with nearly 2 trillion parameters.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1024x602.webp\" alt=\"DeepSpeed enables over three orders of magnitude of model scaling on 128 MI100 GPUs.\" class=\"wp-image-90702 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1024x602.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-300x177.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-768x452.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1536x904.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-800x471.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-400x235.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-450x265.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-650x382.webp 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1.webp 1686w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1024x602.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1024x602.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-300x177.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-768x452.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-1536x904.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-800x471.png 800w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-400x235.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-450x265.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-650x382.png 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1.png 1686w\"><figcaption>Figure 1: DeepSpeed enables over three orders of magnitude of model scaling on 128 MI100 GPUs.<\/figcaption><\/figure>\n\n\n\n<p>To achieve the optimizations in compute, memory, and communication, DeepSpeed makes use of <a href=\"https:\/\/docs.amd.com\/bundle\/AMD_HIP_Programming_Guide_v5.0\/page\/Introduction.html\" target=\"_blank\" rel=\"noreferrer noopener\">HIP<\/a> (language\/runtime), <a href=\"https:\/\/docs.amd.com\/bundle\/RocBLAS_documentation\/page\/usermanual.html\" target=\"_blank\" rel=\"noreferrer noopener\">rocBLAS<\/a> (for GEMMs), and <a href=\"https:\/\/docs.amd.com\/bundle\/Welcome-to-RCCL-s-documentation----RCCL-0.8-documentation\/page\/library.html\" target=\"_blank\" rel=\"noreferrer noopener\">RCCL<\/a> (for communication) libraries in the ROCm stack. The following figure shows how DeepSpeed interacts with AMD\u2019s ROCm software stack. 
It requires a version of PyTorch that is built for ROCm.<\/p>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture2.webp\" alt=\"Diagram of DeepSpeed and the AMD ROCm stack.\" class=\"wp-image-90705 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture2.webp\"><figcaption>Figure 2: DeepSpeed and the AMD ROCm stack.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Through careful work by engineers at both AMD and Microsoft, we are proud to announce that <a href=\"https:\/\/github.com\/microsoft\/DeepSpeed\/releases\/tag\/v0.6.0\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSpeed v0.6<\/a> works natively with ROCm-enabled GPUs. This new release of DeepSpeed uses the same APIs as prior releases and does not require any user code changes to leverage the full features of DeepSpeed on ROCm-enabled GPUs. DeepSpeed&#8217;s Python-level code remains unchanged, primarily due to the seamless ROCm experience on PyTorch. DeepSpeed&#8217;s CUDA-specific kernels are exposed to users through ROCm&#8217;s automatic hipification tools embedded in the PyTorch runtime. This automatic hipification allows DeepSpeed users to continue to enjoy a simple install <a href=\"https:\/\/pypi.org\/project\/deepspeed\/\" target=\"_blank\" rel=\"noreferrer noopener\">through PyPI<\/a>, with just-in-time (JIT) hipification and compilation at runtime if and when kernels are used. 
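To illustrate what hipification does, the toy sketch below mechanically rewrites a few CUDA API names to their HIP equivalents. This tiny mapping is a simplified assumption for illustration only; ROCm's actual hipify tools also handle headers, types, kernel launches, and many more APIs.

```python
# Toy sketch of hipification: source-to-source rewriting of CUDA API names
# to their HIP equivalents. Real hipify tooling is far more complete; this
# mapping only shows the concept.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
}

def hipify(source: str) -> str:
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

cuda_snippet = "cudaMalloc(&buf, n); cudaMemcpy(buf, host, n, cudaMemcpyHostToDevice); cudaFree(buf);"
print(hipify(cuda_snippet))
# -> hipMalloc(&buf, n); hipMemcpy(buf, host, n, hipMemcpyHostToDevice); hipFree(buf);
```

Because this translation is mechanical, it can run just-in-time when a kernel is first compiled, which is how the same DeepSpeed source tree serves both CUDA and ROCm users.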
In addition, ROCm has become an integral part of DeepSpeed&#8217;s continuous integration (CI) testing, which will elevate ROCm support in DeepSpeed for all future pull requests and new features.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Power efficient distributed training of large models<\/h2>\n\n\n\n<p>DeepSpeed enables high training efficiency while running distributed training for large models with billions of parameters across multiple MI100 GPUs and nodes. For example, figure 3 shows that on 8 MI100 nodes\/64 GPUs, DeepSpeed trains a wide range of model sizes, from 0.3 billion parameters (such as Bert-Large) to 50 billion parameters, at efficiencies that range from 38TFLOPs\/GPU to 44TFLOPs\/GPU.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1024x505.webp\" alt=\"DeepSpeed enables efficient training for a wide range of real-world model sizes.\" class=\"wp-image-90708 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1024x505.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-300x148.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-768x378.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1536x757.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-800x394.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-400x197.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-450x222.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-650x320.webp 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3.webp 2001w\" 
data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1024x505.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1024x505.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-300x148.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-768x378.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-1536x757.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-800x394.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-400x197.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-450x222.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-650x320.png 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3.png 2001w\"><figcaption>Figure 3: DeepSpeed enables efficient training for a wide range of real-world model sizes.<\/figcaption><\/figure>\n\n\n\n<p>DeepSpeed also empowers MI100 GPUs to obtain good training scalability for large models as the number of GPUs increases. For example, the plot below in figure 4 shows that for training a 10B parameter model on up to 64 MI100 GPUs, DeepSpeed achieves a throughput scaling (weak scaling) that is close to perfect linear speedup.<\/p>\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-6234de0665344.webp\" alt=\"chart, line chart, scatter chart. 
\" class=\"wp-image-90858 webp-format\" width=\"799\" height=\"423\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture1-6234de0665344.webp\"><figcaption>Figure 4: DeepSpeed achieves good training throughput scalability on 64 MI100 GPUs.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Democratizing large model training<\/h2>\n\n\n\n<p>An important democratizing feature of DeepSpeed is the ability to reduce the number of GPUs required to fit large models by offloading model states to central processing unit (CPU) memory and NVMe memory. Offloading makes large models accessible to users with a limited GPU budget by enabling the training (or finetuning) of models with tens or hundreds of billions of parameters on a single node. Below, we give a flavor of the model scaling that DeepSpeed enables on a single MI100 GPU.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Efficient model scaling on single GPU<\/h3>\n\n\n\n<p>The figure below shows that ZeRO-Offload (that is, offloading to CPU memory) can train much larger models (up to 12B parameters) on a single MI100 GPU than baseline PyTorch, which runs out of memory (OOM) for models larger than 1.2B parameters. Moreover, ZeRO-Offload sustains higher training throughput (41\u201451 TFLOPs) than PyTorch (30 TFLOPs) by enabling larger batch sizes. 
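Offloading is controlled through the DeepSpeed JSON configuration. The sketch below shows the general shape of a ZeRO-Offload (CPU) configuration and a ZeRO-Infinity (NVMe) configuration; the specific values, and the "/local_nvme" path, are illustrative assumptions rather than the settings used for the experiments in this post.

```python
import json

# Sketch of DeepSpeed offload configurations (values illustrative).
# ZeRO-Offload: keep ZeRO stage 2 and push optimizer states to CPU memory.
offload_cpu = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
}

# ZeRO-Infinity: ZeRO stage 3 with parameters and optimizer states
# offloaded to NVMe storage (path is a placeholder).
offload_nvme = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    }
}
print(json.dumps(offload_cpu))
print(json.dumps(offload_nvme))
```

Switching between GPU-only, CPU-offload, and NVMe-offload training is therefore a configuration change rather than a code change.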
In summary, ZeRO-Offload supports model sizes ten times larger than otherwise possible, and at higher performance.<\/p>\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture2-6234de838b5a4.webp\" alt=\"chart, bar chart\" class=\"wp-image-90861 webp-format\" width=\"799\" height=\"408\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture2-6234de838b5a4.webp\"><figcaption>Figure 5: DeepSpeed enables 10 times model scaling on a single MI100 GPU with great efficiency. PyTorch runs out of memory for models larger than 1.2B.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Extreme model scaling on single GPU<\/h3>\n\n\n\n<p>The figure below shows that ZeRO-Infinity (that is, offloading to NVMe memory) enables even more dramatic scaling in model size on a single MI100 GPU. ZeRO-Infinity utilizes the 3.5TB of NVMe memory available in the server to train models as large as 120B parameters on one GPU, thereby scaling the model size by two orders of magnitude compared to the baseline.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1024x388.webp\" alt=\"\" class=\"wp-image-90864 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1024x388.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-300x114.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-768x291.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1536x582.png 1536w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-800x303.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-400x152.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-450x171.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-650x246.webp 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4.webp 1546w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1024x388.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1024x388.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-300x114.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-768x291.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-1536x582.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-800x303.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-400x152.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-450x171.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4-650x246.png 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/03\/Picture3-6234debe9c7b4.png 1546w\"><figcaption>Figure 6: DeepSpeed enables two orders of magnitude of model scaling on a single MI100 GPU.<\/figcaption><\/figure>\n\n\n\n<p>We also noticed that the training throughput with a single NVMe device (6.2GB\/sec reads and 3.2GB\/sec writes) was 12 TFLOPs 
and that throughput scales with the number of devices: using four NVMe devices, we were able to double the throughput to 24 TFLOPs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Using DeepSpeed on AMD GPUs\u2014Getting started<\/h2>\n\n\n\n<p>It is convenient to use DeepSpeed with the latest ROCm software on a range of <a href=\"https:\/\/docs.amd.com\/bundle\/Hardware_and_Software_Reference_Guide\/page\/Hardware_and_Software_Support.html\" target=\"_blank\" rel=\"noreferrer noopener\">supported AMD GPUs<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Installation<\/h3>\n\n\n\n<p>The simplest way to use DeepSpeed for ROCm is to use the pre-built Docker image (rocm\/deepspeed:latest) available on <a href=\"https:\/\/hub.docker.com\/r\/rocm\/deepspeed\" target=\"_blank\" rel=\"noreferrer noopener\">Docker Hub<\/a>.<\/p>\n\n\n\n<p>You can also install DeepSpeed for ROCm simply by running \u201c<code>pip install deepspeed<\/code>\u201d. For details and advanced options, please refer to the <a href=\"https:\/\/www.deepspeed.ai\/getting-started\/#installation\" target=\"_blank\" rel=\"noreferrer noopener\">installation section of the DeepSpeed documentation page<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Using DeepSpeed on ROCm with HuggingFace models<\/h3>\n\n\n\n<p>The <a href=\"https:\/\/github.com\/huggingface\/transformers\" target=\"_blank\" rel=\"noreferrer noopener\">HuggingFace Transformers<\/a> library is compatible with the latest DeepSpeed and ROCm stack.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Several <a href=\"https:\/\/github.com\/huggingface\/transformers\/tree\/master\/examples\/pytorch\" target=\"_blank\" rel=\"noreferrer noopener\">language examples<\/a> in the HuggingFace repository can be run on AMD GPUs without any code modifications. 
We have tested several models, such as BERT, BART, DistilBERT, T5-Large, DeBERTa-V2-XXLarge, GPT2, and RoBERTa-Large, with DeepSpeed ZeRO-2 on ROCm.<\/li><li>DeepSpeed can be activated in HuggingFace examples using the <code>--deepspeed=deepspeed_config.json<\/code> command-line argument.<\/li><\/ul>\n\n\n\n<p>We\u2019ve demonstrated how DeepSpeed and AMD GPUs work together to enable efficient large model training, both on a single GPU and across distributed GPU clusters. We hope you can use these capabilities to quickly transform your ideas into fully trained models on AMD GPUs.<\/p>\n\n\n\n<p>We would love to hear feedback and welcome contributions on the <a href=\"https:\/\/github.com\/microsoft\/DeepSpeed\/issues\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSpeed<\/a> and AMD <a href=\"https:\/\/github.com\/RadeonOpenCompute\/ROCm\/issues\" target=\"_blank\" rel=\"noreferrer noopener\">ROCm GitHub<\/a> repos.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Related work<\/h2>\n\n\n\n<p>AMD ROCm is also supported as an execution provider in <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\/tree\/master\/orttraining\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime for Training<\/a>, another open-source project led by Microsoft. We refer you to a <a href=\"https:\/\/cloudblogs.microsoft.com\/opensource\/2021\/07\/13\/onnx-runtime-release-1-8-1-previews-support-for-accelerated-training-on-amd-gpus-with-the-amd-rocm-open-software-platform\/\">previous blog<\/a> for more details. DeepSpeed is composable with ONNX Runtime using the open-source ORTModule that is part of the <a href=\"https:\/\/github.com\/pytorch\/ort\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime for PyTorch<\/a> package. This allows the composition of DeepSpeed and ONNX Runtime optimizations. 
You can find more details in this <a href=\"https:\/\/techcommunity.microsoft.com\/t5\/ai-machine-learning-blog\/accelerate-pytorch-transformer-model-training-with-onnx-runtime\/ba-p\/2540471\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Contributors<\/h2>\n\n\n\n<p>This work was made possible through deep collaboration between system researchers and engineers at AMD and Microsoft. The contributors of this work include Jithun Nair, Jeff Daily, Ramya Ramineni, Aswin Mathews, and Peng Sun from AMD; Olatunji Ruwase, Jeff Rasley, Jeffrey Zhu, Yuxiong He, and Gopi Kumar from Microsoft.<\/p>\n\n\n\n<p>This posting is the authors\u2019 own opinion and may not represent AMD\u2019s or Microsoft\u2019s positions, strategies, or opinions. Links to third-party sites are provided for convenience and unless explicitly stated, neither AMD nor Microsoft is responsible for the contents of such linked sites and no endorsement is implied.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post was co-authored by Jithun Nair and Aswin Mathews, members of technical staff at AMD. 
In recent years, large-scale deep learning models have demonstrated impressive capabilities, excelling at tasks across natural language processing, computer vision, and speech domains.<\/p>\n","protected":false},"author":5562,"featured_media":95490,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[1905,2272,1824],"content-type":[361],"topic":[2238,2250],"programming-languages":[2265],"coauthors":[1899],"class_list":["post-90699","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-deepspeed","tag-microsoft","tag-onnx-runtime","content-type-project-updates","topic-ai-machine-learning","topic-deep-learning","programming-languages-pytorch","review-flag-1593580428-734","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-5-1593580453-725","review-flag-6-1593580457-852","review-flag-8-1593580468-572","review-flag-lever-1593580265-989","review-flag-new-1593580248-669"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Supporting efficient large model training on AMD Instinct\u2122 GPUs with DeepSpeed | Microsoft Open Source Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Supporting efficient large model training on AMD Instinct\u2122 GPUs with DeepSpeed | Microsoft Open 
Source Blog\" \/>\n<meta property=\"og:description\" content=\"This post was co-authored by Jithun Nair and Aswin Mathews, members of technical staff at AMD. In recent years, large-scale deep learning models have demonstrated impressive capabilities, excelling at tasks across natural language processing, computer vision, and speech domains.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2022-03-21T16:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-06-19T17:50:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1170\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Olatunji Ruwase\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Olatunji Ruwase\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\"},\"author\":[{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/olatunji-ruwase\/\",\"@type\":\"Person\",\"@name\":\"Olatunji Ruwase\"}],\"headline\":\"Supporting efficient large model training on AMD Instinct\u2122 GPUs with DeepSpeed\",\"datePublished\":\"2022-03-21T16:00:00+00:00\",\"dateModified\":\"2024-06-19T17:50:28+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\"},\"wordCount\":1567,\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp\",\"keywords\":[\"DeepSpeed\",\"Microsoft\",\"ONNX Runtime\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/03\/21\/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed\/\",\"name\":\"Supporting efficient large model training on AMD Instinct\u2122 GPUs with DeepSpeed | Microsoft Open Source 