This post is co-authored by Emma Ning, Azure Machine Learning; Nathan Yan, Azure Machine Learning; Jeffrey Zhu, Bing; Jason Li, Bing
One of the most popular deep learning models used for natural language processing is BERT (Bidirectional Encoder Representations from Transformers). Due to the significant computation required, inferencing BERT at high scale can be extremely costly and may not even be possible with strict latency constraints.
Recently, we shared how Bing has improved BERT inference on NVIDIA GPU for its real-time service needs, serving more than one million BERT inferences per second within Bing’s latency limits. We are excited to announce that Microsoft has open sourced enhanced versions of these optimizations into the ONNX Runtime and extended them to work on both GPU and CPU.
With ONNX Runtime, AI developers can now easily productionize large transformer models with high performance across both CPU and GPU hardware, using the same technology Microsoft uses to serve its customers. Read on to learn more, including how to integrate it into your own project.
To serve the most relevant results to our customers, Bing uses cutting-edge natural language processing (NLP) techniques to better understand user queries, webpages, and other documents. A key component of NLP is language representation models like BERT, RoBERTa, or MT-DNN. Bing has been developing and fine-tuning its own language representation models for tasks like web search, question answering and captions, and multimedia search.
However, applying large transformer networks in a real-time production environment has both latency and cost challenges, since running a 12- or 24-layer BERT for every query is computationally expensive. As announced in November, we first used knowledge distillation to condense the larger model to a three-layer BERT model without any significant loss in accuracy, significantly reducing the computation cost. But the distilled three-layer BERT model was still benchmarked at 77ms serving latency and running this over millions of queries and documents per second remained prohibitively expensive. To optimize further, the entire model was re-implemented using C++ APIs to take full advantage of NVIDIA GPU architecture, which achieved an 800x throughput improvement when compared to CPU.
Once these optimizations were successfully used in Bing production, there was more to do. Because these large transformer networks are reusable for many NLP tasks beyond web search, we needed an easy way to share this beneficial work with others. Our solution at the time required each model developer to reimplement the model with our C++ libraries, which was time-consuming. To further democratize transformer inference and empower others to benefit from these advances, we optimized them further, extended them to CPU, and open sourced them in ONNX Runtime.
ONNX Runtime is a high-performance inference engine for machine learning models. It's compatible with PyTorch, TensorFlow, and many other frameworks and tools that support the ONNX standard. ONNX Runtime is designed with an open and extensible architecture for easily optimizing and accelerating inference by leveraging built-in graph optimizations and various hardware acceleration capabilities across CPU, GPU, and edge devices. ONNX Runtime can easily plug into your technology stack, since it works on Linux, Windows, Mac, and Android, and has convenient APIs for Python, C#, C++, C, and Java.
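As a minimal sketch of what that looks like from Python (the model path and input names below are placeholders for your own exported graph), loading an ONNX model and running a single inference takes only a few lines:

```python
import numpy as np
import onnxruntime as ort

# Create an inference session; ONNX Runtime falls back to CPU if no GPU
# execution provider is available.
session = ort.InferenceSession(
    "bert.onnx",  # placeholder path to an exported ONNX model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy BERT-style inputs: batch size 1, sequence length 128. The input
# names must match those chosen when the model was exported.
inputs = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
}

outputs = session.run(None, inputs)  # None means "return all outputs"
print([o.shape for o in outputs])
```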
Transformer models like BERT consist of a graph of many operators. Graph optimization, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations, is an essential technique built into ONNX Runtime. Since the BERT model is mainly composed of stacked transformer cells, we optimize each cell by fusing key sub-graphs of multiple elementary operators into single kernels for both CPU and GPU, including Self-Attention, LayerNormalization, and Gelu layers. This significantly reduces memory copy between numerous elementary computations.
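If you want to see these fusions applied to your own model, one way (sketched below; the file paths are placeholders) is to enable ONNX Runtime's highest graph optimization level and ask it to save the optimized graph, which you can then inspect for the fused Attention, LayerNormalization, and Gelu nodes:

```python
import onnxruntime as ort

# Enable all graph optimizations, including the extended operator fusions,
# and write the optimized graph to disk so the fused nodes can be inspected
# (for example with a model viewer such as Netron).
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "bert_optimized.onnx"  # placeholder output path

session = ort.InferenceSession(
    "bert.onnx",  # placeholder input path
    sess_options=so,
    providers=["CPUExecutionProvider"],
)
```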
Additionally, in the CPU implementation of Self-Attention, the columns of the Q, K, and V matrices are partitioned by the number of self-attention heads. With this optimization, we can significantly increase parallelization and fully leverage the available CPU cores. Moreover, the transpose op that follows the fully connected projections of Q, K, and V can be folded into the GEMM itself, which further reduces the computation cost.
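The NumPy sketch below is purely illustrative rather than ONNX Runtime's actual kernel code, but it shows the idea: the columns of Q, K, and V are sliced per attention head, each head's attention is computed independently (and can therefore be scheduled on a separate core), and the results are concatenated back together.

```python
import numpy as np

seq_len, hidden_size, num_heads = 128, 768, 12
head_dim = hidden_size // num_heads  # 64 columns per head

# Q, K, V as produced by the fully connected projections: [seq_len, hidden].
Q = np.random.rand(seq_len, hidden_size).astype(np.float32)
K = np.random.rand(seq_len, hidden_size).astype(np.float32)
V = np.random.rand(seq_len, hidden_size).astype(np.float32)

def attention_head(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Partition the columns of Q, K, and V by head; each slice is independent,
# so the per-head computations can run in parallel across CPU cores.
heads = [
    attention_head(
        Q[:, h * head_dim:(h + 1) * head_dim],
        K[:, h * head_dim:(h + 1) * head_dim],
        V[:, h * head_dim:(h + 1) * head_dim],
    )
    for h in range(num_heads)
]
output = np.concatenate(heads, axis=-1)  # back to [seq_len, hidden]
```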
With these optimizations in place, we benchmarked ONNX Runtime inference on BERT-SQUAD with a sequence length of 128 and batch size 1 on an Azure Standard NC6s_v3 VM (NVIDIA V100 GPU).
Here are the detailed performance numbers for 3-layer BERT with 128 sequence length measured from ONNX Runtime. On CPU, we saw a 17x latency speedup with ~100 queries per second throughput. On NVIDIA GPUs we saw more than a 3x latency speedup, and with a batch size of 64 we reached a throughput of ~10,000 queries per second.
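If you would like to reproduce this style of measurement for your own model (the path, input names, and batch size below are placeholders), a straightforward approach is to time repeated session.run calls after a short warm-up:

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "bert_optimized.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)

batch_size, seq_len = 1, 128
inputs = {
    "input_ids": np.ones((batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((batch_size, seq_len), dtype=np.int64),
}

# Warm up so one-time costs (memory allocation, lazy initialization)
# are excluded from the measurement.
for _ in range(10):
    session.run(None, inputs)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, inputs)
elapsed = time.perf_counter() - start

print(f"average latency: {1000 * elapsed / runs:.2f} ms")
print(f"throughput: {runs * batch_size / elapsed:.1f} queries/sec")
```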
With the latest BERT optimizations available in ONNX Runtime, Bing transitioned its transformer inferencing codebase to the jointly developed ONNX Runtime. Not only did ONNX Runtime handle inference for large transformer networks at the scale of Bing's traffic, but the new optimizations also improved Bing's latencies. Furthermore, Bing found ONNX Runtime much easier to use, cutting the time needed to reuse the optimizations for new scenarios from multiple days to a few hours.
Besides Bing, ONNX Runtime is deployed by dozens of Microsoft products and services, including Office, Windows, Cognitive Services, Skype, Bing Ads, and Power BI, running on hundreds of millions of devices and serving billions of requests. ONNX Runtime is used for a variety of models for computer vision, speech, language processing, forecasting, and more. Teams have achieved up to 18x performance improvements over their previous inference solutions on the same hardware.
You can take advantage of the same acceleration used by Microsoft in your own products, whether you are targeting the cloud or the intelligent edge, and whether you are using CPUs or GPUs. To get started:
We are providing example code for both PyTorch BERT acceleration and TensorFlow BERT acceleration. We also demonstrate how you can use Azure Machine Learning for creating and managing a seamless pipeline for training and deploying with ONNX Runtime in this tutorial.
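As a quick sketch of the PyTorch path (this assumes the Hugging Face transformers package and the bert-base-uncased checkpoint, which are our choices here rather than requirements of the linked examples), exporting a BERT model to ONNX for use with ONNX Runtime looks roughly like this:

```python
import torch
from transformers import BertModel, BertTokenizer

# return_dict=False makes the model return a plain tuple of tensors,
# which keeps the ONNX export straightforward.
model = BertModel.from_pretrained("bert-base-uncased", return_dict=False)
model.eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello, ONNX Runtime!", return_tensors="pt")

# Export to ONNX with dynamic batch and sequence dimensions so the same
# graph can serve different batch sizes and sequence lengths.
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"], encoded["token_type_ids"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
    },
    opset_version=12,
)
```

The resulting bert.onnx file can then be optimized and served with ONNX Runtime as shown earlier; the linked example notebooks walk through the full workflow in detail.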
We look forward to seeing how these ONNX Runtime advancements will improve the performance of your production CPU and GPU workloads. Get started with ONNX Runtime here.
Questions or feedback? Let us know in the comments below.