ONNX Runtime is an open source project designed to accelerate machine learning across a wide range of frameworks, operating systems, and hardware platforms. It is used extensively in Microsoft products such as Office 365 and Bing, where it serves over 20 billion inferences every day and speeds up inferencing by as much as 17 times.
Today we are introducing significant updates to ONNX Runtime. In addition to improvements for model inferencing, we’re announcing the preview of training acceleration.
ONNX Runtime for training
ONNX Runtime now supports accelerated training of transformer models. Transformer models have become the building blocks for advanced language processing and generation. These models contain hundreds of millions of parameters and training them can consume many clusters of GPUs over days. Reducing the total training time can help enable rapid improvements in, and thus faster deployment of, these models.
Today’s preview release of training acceleration incorporates innovations from the AI at Scale initiative, such as ZeRO optimization and Project Parasail, that improve memory utilization and parallelism on GPUs. ONNX Runtime also features a mixed-precision implementation that fits more training data into a single NVIDIA GPU’s available memory, which helps training jobs converge faster and saves time. It is integrated into the existing trainer code for PyTorch and TensorFlow. ONNX Runtime is already being used for training models at Microsoft. For example:
Office 365 uses ONNX Runtime to accelerate pre-training of the Turing Natural Language Representation (T-NLR) model, a transformer model with more than 400 million parameters, powering rich end-user features like Suggested Replies, Smart Find, and Inside Look. Using ONNX Runtime has reduced training time by 45% on a cluster of 64 NVIDIA V100 Tensor Core GPUs in Azure Machine Learning.
Bing uses large transformer models with more than 500 million parameters to train and service task-specific models. These models use ONNX Runtime to accelerate pre-training and fine-tuning throughput, cutting training time by 44%.
Visual Studio uses ONNX Runtime to accelerate pre-training of a model similar to GPT-2 Medium, with more than 300 million parameters, that powers code autocompletion in the IntelliCode feature.
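The reductions in training time quoted above can also be read as speedup factors. The following is a minimal sketch of that arithmetic; the function name is hypothetical and not part of ONNX Runtime, and the only inputs are the 45% and 44% figures reported for Office 365 and Bing.

```python
def speedup_from_time_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in training time into a speedup factor.

    If a job takes (100 - r)% of its original time, it runs
    100 / (100 - r) times faster.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


# Figures quoted in the examples above.
for team, pct in [("Office 365 (T-NLR)", 45), ("Bing", 44)]:
    factor = speedup_from_time_reduction(pct)
    print(f"{team}: {pct}% less training time is about a {factor:.2f}x speedup")
```

Read this way, a 45% reduction means the same job finishes roughly 1.8 times faster on the same hardware.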
To further accelerate training, we built custom kernels and graph optimizations to eliminate redundant operations. Additionally, ONNX Runtime enables larger batch sizes within the same 32 GB of memory on NVIDIA V100 Tensor Core GPUs. We tested ONNX Runtime by pre-training BERT-Large, reusing the training scripts and datasets from benchmarking tests by NVIDIA.
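To see why memory optimizations leave room for larger batches, it helps to estimate how much GPU memory the model state alone occupies. The sketch below uses the commonly cited accounting for mixed-precision training with Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights, momentum, and variance) and assumes that only the optimizer state is partitioned across GPUs. These are illustrative assumptions, not a description of what ONNX Runtime does internally; activations and temporary buffers are ignored, and the function name is hypothetical.

```python
# Bytes per parameter under mixed-precision Adam (illustrative accounting).
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADIENTS = 2
BYTES_FP32_OPTIMIZER = 12  # fp32 master weights + Adam momentum + Adam variance


def model_state_gib(num_params: int, num_gpus: int = 1,
                    partition_optimizer: bool = False) -> float:
    """Estimate per-GPU memory (GiB) for weights, gradients, and optimizer state.

    When partition_optimizer is True, the optimizer state is assumed to be
    split evenly across num_gpus; weights and gradients stay replicated.
    """
    if num_params < 0 or num_gpus < 1:
        raise ValueError("num_params must be >= 0 and num_gpus must be >= 1")
    optimizer_bytes = BYTES_FP32_OPTIMIZER
    if partition_optimizer:
        optimizer_bytes /= num_gpus
    per_param = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADIENTS + optimizer_bytes
    return num_params * per_param / 2**30


# A 400-million-parameter model on a 64-GPU cluster (sizes taken from the text).
params = 400_000_000
replicated = model_state_gib(params)
partitioned = model_state_gib(params, num_gpus=64, partition_optimizer=True)
print(f"Replicated model state:  {replicated:.2f} GiB per GPU")
print(f"Partitioned optimizer:   {partitioned:.2f} GiB per GPU")
```

Under these assumptions, whatever memory the model state no longer occupies on a 32 GB GPU becomes available for activations, which is what allows a larger batch size.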
In the table below, you’ll see the relative training time improvements for pre-training the BERT-Large model on a 4-node NVIDIA DGX-2 cluster. The batch sizes reflect the Phase-1 and Phase-2 stages of the training experiment, using the datasets as detailed in the NVIDIA repo. The detailed test report is here.
Developers can use the sample for pre-training BERT-Large with ONNX Runtime and fine-tune it on their own datasets as needed. We have also published a ready-to-use sample to start experiments in Azure Machine Learning. To use ONNX Runtime in custom environments, developers can build from the source code using the instructions published here.
ONNX Runtime for inferencing
We continue to improve inference acceleration with ONNX Runtime and are now partnering with Hugging Face to make it easy to accelerate popular transformer models.
“We have seen gains from using ONNX Runtime with transformer models and are excited to release functionality that makes it easy to inference Hugging Face Transformer models with ONNX Runtime,” said Clément Delangue, CEO of Hugging Face.
Today, we are also releasing multiple updates to ONNX Runtime for inferencing. The new ONNX Runtime inference version 1.3 includes: