Improve BERT inference speed by combining the power of Optimum, OpenVINO™, ONNX Runtime, and Azure
Written by Cassie Breviu
Big models are slow, we need to make them faster
Today’s best-performing language processing models use huge neural architectures with hundreds of millions of parameters. State-of-the-art transformer-based architectures like BERT are available as pretrained models for anyone to use for any language task.
These big models deliver outstanding accuracy, but they are difficult to use in practice: their large number of parameters makes them resource hungry, and serving a fine-tuned model takes a lot of memory and time to process even a single message. A state-of-the-art model is of little use if it can only handle one message per second. To improve throughput, we need to accelerate the well-performing BERT model by reducing the amount of computation, or the number of operations, with the help of quantization.
Overview of Optimum Intel and quantization aware training
Optimum Intel is an extension of the Hugging Face Optimum library that uses the OpenVINO™ runtime as a backend for Transformers architectures. It also provides an interface to the Intel® NNCF (Neural Network Compression Framework) package, so you can apply Intel's optimizations through NNCF by changing just a few lines of code in the training pipeline.
Quantization aware training (QAT) is a widely used technique for optimizing models during training. It inserts nodes into the neural network during training that simulate the effect of lower precision, which allows the training algorithm to treat quantization error as part of the overall loss that gets minimized. QAT gives better accuracy and reliability than quantizing a model after it has been trained. The output after training with our tooling is a quantized PyTorch model, an ONNX model, and the OpenVINO™ IR (.xml).
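To build intuition, fake quantization can be pictured as a quantize-dequantize round trip applied to weights and activations in the forward pass. The snippet below is a minimal plain-PyTorch sketch of that idea; NNCF inserts and manages these nodes for you, so the function here is purely illustrative:
import torch

def fake_quantize(x, num_bits=8):
    # Map the tensor onto an 8-bit integer grid, then back to float,
    # so the forward pass "sees" the rounding error of INT8.
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale

w = torch.randn(4, 4)
w_q = fake_quantize(w)          # what a fake-quantization node would feed forward
print((w - w_q).abs().max())    # the quantization error QAT learns to tolerate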
Overview of ONNX Runtime and the OpenVINO™ Execution Provider
ONNX Runtime is an open source project that is designed to accelerate machine learning across a wide range of frameworks, operating systems, languages, and hardware platforms. It enables the acceleration of machine learning inferencing across all of your deployment targets using a single set of APIs.
Intel and Microsoft collaborated to create the OpenVINO™ Execution Provider (OVEP) for ONNX Runtime, which lets ONNX models run inference through the ONNX Runtime APIs while using the OpenVINO™ Runtime as a backend. With the OpenVINO™ Execution Provider, ONNX Runtime delivers better inference performance on the same hardware than generic acceleration on Intel® CPU, GPU, and VPU. Now that you have a basic understanding of quantization, ONNX Runtime, and OVEP, let's take the best of both worlds and stitch the story together.
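In practice, switching ONNX Runtime to the OpenVINO™ Execution Provider is a small change when creating the inference session. Here is a sketch, assuming the onnxruntime-openvino package is installed and that the model path and input names/shapes below (which are placeholders) match your exported BERT model:
import numpy as np
import onnxruntime as ort

# Ask for the OpenVINO EP first; ONNX Runtime falls back to the default CPU EP if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",   # placeholder path to your exported BERT model
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)

# Dummy BERT-style inputs; the real names and shapes come from session.get_inputs().
feed = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}
outputs = session.run(None, feed)
The providers list acts as a priority order, so the same code keeps working on machines without the OpenVINO™ EP installed.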
Putting the tools together to achieve better performance
In the next steps, we will do quantization aware training using Optimum Intel and inference using Optimum-ORT with the OpenVINO™ Execution Provider through Azure Machine Learning. Optimum can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inference.
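As a rough sketch of what that looks like with Optimum's ONNX Runtime integration (the checkpoint name below is a placeholder, and the export/provider arguments may vary slightly between Optimum versions):
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-cased-distilled-squad"   # placeholder checkpoint, not the one from this post
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Export the checkpoint to ONNX and run it with ONNX Runtime on the OpenVINO EP.
model = ORTModelForQuestionAnswering.from_pretrained(
    model_id,
    export=True,
    provider="OpenVINOExecutionProvider",
)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(qa(question="What backend does the pipeline use?",
         context="The pipeline runs on ONNX Runtime with the OpenVINO Execution Provider."))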
Converting PyTorch FP32 model to INT8 ONNX model with QAT
When using the Hugging Face training pipelines, you only need to update a few lines of code to invoke the NNCF optimizations for quantizing the model. The output is an optimized INT8 PyTorch model, an ONNX model, and the OpenVINO™ IR. See the flow diagram below:
Run the training pipeline
1. Import OVConfig:
from optimum.intel.openvino import OVConfig
from trainer_qa import QuestionAnsweringOVTrainer
2. Initialize a default OVConfig:
ov_config = OVConfig()
3. Initialize our Trainer:
trainer = QuestionAnsweringOVTrainer()
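Putting the three steps together, a minimal sketch of the training pipeline could look like the following. The model, datasets, tokenizer, and training arguments are assumed to be prepared as in the example scripts, and the exact keyword arguments of QuestionAnsweringOVTrainer depend on the version of trainer_qa.py you use, so treat this as illustrative:
from optimum.intel.openvino import OVConfig
from trainer_qa import QuestionAnsweringOVTrainer

ov_config = OVConfig()   # default NNCF quantization-aware-training configuration

trainer = QuestionAnsweringOVTrainer(
    model=model,                    # fine-tuned BERT QA model (assumed prepared earlier)
    args=training_args,             # transformers.TrainingArguments (assumed prepared earlier)
    ov_config=ov_config,
    train_dataset=train_dataset,    # tokenized SQuAD-style datasets (assumed prepared earlier)
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()                     # QAT: fake-quantization nodes are active during training
trainer.save_model(training_args.output_dir)   # the example scripts then export ONNX and OpenVINO IR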
Comparison of FP32 model and INT8 ONNX model with Netron model visualization tool
When compared with FP32, the INT8 model has QuantizeLinear and DequantizeLinear operations added to mimic the lower precision after QAT.
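Besides inspecting the graph visually in Netron, you can also count the inserted nodes programmatically with the onnx Python package (the file name below is a placeholder for the QAT output):
import onnx
from collections import Counter

model = onnx.load("bert-int8.onnx")   # placeholder path to the QAT-produced ONNX model
op_counts = Counter(node.op_type for node in model.graph.node)

print("QuantizeLinear:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))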
To replicate this example, check out the reference code on GitHub, which has detailed instructions on QAT and inference using OpenVINO™ and Azure Machine Learning Jupyter Notebooks.
Performance improvement results
Comparison of ONNXRUNTIME_PERF_TEST application for ONNX-FP32 and ONNX-INT8 models
We used the ONNX Runtime APIs to run inference for the BERT model. As you can see, performance for the INT8-optimized model improved to almost 2.95x that of FP32, without much compromise in accuracy.
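For a quick sanity check outside the perf test application, a simple timing loop over the two ONNX Runtime sessions gives a rough throughput comparison; the file names, input names, and sequence length below are placeholders:
import time
import numpy as np
import onnxruntime as ort

def inferences_per_second(model_path, n_iters=100):
    sess = ort.InferenceSession(
        model_path,
        providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    )
    # Dummy BERT-style inputs; adjust names and shapes to your exported model.
    feed = {
        "input_ids": np.ones((1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
    }
    sess.run(None, feed)                      # warm-up so graph compilation is not timed
    start = time.perf_counter()
    for _ in range(n_iters):
        sess.run(None, feed)
    return n_iters / (time.perf_counter() - start)

fp32 = inferences_per_second("bert-fp32.onnx")   # placeholder file names
int8 = inferences_per_second("bert-int8.onnx")
print(f"Speedup: {int8 / fp32:.2f}x")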
Optimize speed and performance
As neural networks move from servers to the edge, optimizing speed and size becomes even more important. In this blog, we gave an overview of how to use open source tooling to make it easy to improve performance.
References:
- Enhanced Low-Precision Pipeline to Accelerate Inference with OpenVINO™ toolkit.
- Developer Guide: Model Optimization with the OpenVINO™ Toolkit.
- Evaluating QA: Metrics, Predictions, and the Null Response.
Framework configuration: ONNX Runtime, Optimum Intel [NNCF]
Application configuration: ONNX Runtime (EP: OpenVINO™): ./onnx_perf_test; OpenVINO™ 2022.2: ./benchmark_app
Input: Question and context
Application Metric: Normalized throughput
Platform: Intel Icelake-8380
Number of Nodes: 2
Number of Sockets: 2
CPU or Accelerator: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
Cores/socket, Threads/socket or EU/socket: 40,2
ucode: 0xd000375
HT: Enabled
Turbo: Enabled
BIOS Version: American Megatrends International, LLC. V1.4
System DDR Mem Config: slots / cap / run-speed: 32/32 GB/3200 MT/s
Total Memory/Node (DDR+DCPMM): 1024GB
Storage – boot: INTEL_SSDSC2KB019T8 1.8T
NIC: 2 x Ethernet Controller X710 for 10GBASE-T
OS: Ubuntu 20.04.4 LTS
Kernel: 5.15.0-46-generic