ONNX Runtime is an open source project that is designed to accelerate machine learning across a wide range of frameworks, operating systems, and hardware platforms. It is used extensively in Microsoft products, like Office 365 and Bing, delivering over 20 billion inferences every day and up to 17 times faster inferencing.
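For readers new to the project, the sketch below shows what a minimal inference call looks like; the file name model.onnx and the image-shaped input are placeholders for your own model and data, not specifics from this post.

```python
# Minimal ONNX Runtime inference sketch; "model.onnx" is a placeholder
# for any exported ONNX model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# Query the model's first input so the example works for any single-input model.
input_meta = session.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image-shaped batch

outputs = session.run(None, {input_meta.name: x})  # None = return all outputs
print(outputs[0].shape)
```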
Today we are introducing significant updates to ONNX Runtime. In addition to improvements for model inferencing, we’re announcing the preview of training acceleration.
ONNX Runtime now supports accelerated training of transformer models. Transformer models have become the building blocks for advanced language processing and generation. These models contain hundreds of millions of parameters, and training them can consume clusters of GPUs for days. Reducing the total training time helps enable rapid improvements in, and thus faster deployment of, these models.
To further accelerate training, we built custom kernels and graph optimizations to eliminate redundant operations. Additionally, ONNX Runtime fits larger batch sizes within the same 32 GB of memory on NVIDIA V100 Tensor Core GPUs. We tested ONNX Runtime by pretraining BERT-Large, reusing the training scripts and datasets from NVIDIA's benchmarking tests.
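The training APIs have evolved across ONNX Runtime releases, so the sketch below should be read as illustrative rather than as the exact interface described in this post: it assumes the ORTModule wrapper from the torch-ort package, which routes forward and backward computation through ONNX Runtime while leaving the rest of a standard PyTorch training loop unchanged. The model choice and dummy batch are placeholders.

```python
# Illustrative training sketch, assuming the ORTModule wrapper from the
# torch-ort package; the exact training API has varied across releases.
import torch
from torch_ort import ORTModule
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model = ORTModule(model)  # forward/backward now execute via ONNX Runtime
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy masked-LM batch: 8 sequences of 128 token ids (BERT's vocab is 30522).
input_ids = torch.randint(0, 30522, (8, 128))
labels = input_ids.clone()

model.train()
for step in range(10):
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
```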
In the table below, you’ll see the relative training time improvements for pretraining the BERT-Large model on a four-node NVIDIA DGX-2 cluster. The batch sizes reflect the Phase 1 and Phase 2 stages of the training experiment, using the datasets detailed in the NVIDIA repo. The detailed test report is here.
| 4x DGX-2 (64x V100 32GB) | PyTorch 1.5 with NGC 20.03-py3 | PyTorch 1.5 with ONNX Runtime | % Gain with ONNX Runtime |
| --- | --- | --- | --- |
| Phase 1 time (hours) | 11.12 | 9.99 | 10.16% |
| Phase 2 time (hours) | 6.62 | 5.77 | 12.84% |
| Total time (hours) | 17.74 | 15.76 | 11.16% |
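For clarity, the gain column is computed relative to the PyTorch baseline: for Phase 1, (11.12 - 9.99) / 11.12 ≈ 10.16%, and likewise for the other rows.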
Developers can use the sample for pretraining BERT-Large with ONNX Runtime and fine-tune it on their own datasets as needed. We have also published a ready-to-use sample to start experiments in Azure Machine Learning. To use ONNX Runtime in custom environments, developers can build from the source code using the instructions published here.
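For orientation, submitting such a training script via the Azure Machine Learning Python SDK could look roughly like the sketch below; the workspace configuration, compute target, experiment, and script names are hypothetical stand-ins, not the published sample's actual values.

```python
# Hypothetical Azure ML (SDK v1) submission sketch; the compute target,
# experiment, and script names are placeholders.
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads a local config.json for your workspace

run_config = ScriptRunConfig(
    source_directory=".",          # directory containing the training script
    script="train.py",             # placeholder training entry point
    compute_target="gpu-cluster",  # placeholder GPU compute target
)

run = Experiment(ws, "bert-pretraining").submit(run_config)
run.wait_for_completion(show_output=True)
```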
We continue to improve inference acceleration with ONNX Runtime and are now partnering with Hugging Face to make it easy to accelerate popular transformer models.
“We have seen gains from using ONNX Runtime with transformer models and are excited to release functionality that makes it easy to inference Hugging Face Transformer models with ONNX Runtime,” said Clément Delangue, CEO of Hugging Face.
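As a taste of what this integration enables, the following sketch exports a Hugging Face model to ONNX with PyTorch's built-in exporter and runs it through ONNX Runtime; the model name, opset version, and input sentence are illustrative choices, not specifics from this announcement.

```python
# Illustrative export-and-run sketch for a Hugging Face Transformer model;
# the model name and opset version are example choices.
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
# torchscript=True makes the model return plain tuples, which traces cleanly.
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

inputs = tokenizer("ONNX Runtime accelerates transformers.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=11,
)

session = ort.InferenceSession("bert.onnx")
logits = session.run(None, {"input_ids": inputs["input_ids"].numpy(),
                            "attention_mask": inputs["attention_mask"].numpy()})[0]
print(logits)
```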
Today, we are also releasing multiple updates to ONNX Runtime for inferencing. The new ONNX Runtime release, version 1.3, includes:
Questions or feedback? Please let us know in the comments below.