{"id":94179,"date":"2023-01-25T09:00:00","date_gmt":"2023-01-25T17:00:00","guid":{"rendered":""},"modified":"2024-06-19T10:50:42","modified_gmt":"2024-06-19T17:50:42","slug":"improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/","title":{"rendered":"Improve BERT inference speed by combining the power of Optimum, OpenVINO\u2122, ONNX Runtime, and Azure"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In this blog post, we will discuss one way to make huge models like BERT smaller and faster with the OpenVINO\u2122 Neural Network Compression Framework (NNCF) and ONNX Runtime with the OpenVINO\u2122 Execution Provider through <a href=\"https:\/\/azure.microsoft.com\/products\/machine-learning\/#product-overview\">Azure Machine Learning<\/a>.<\/p>\n\n\n\n\n\n<h2 class=\"wp-block-heading\">Big models are slow, we need to make them faster<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Today&#8217;s best-performing language processing models use huge neural architectures with hundreds of millions of parameters. State-of-the-art transformer-based architectures like BERT are available as pretrained models for anyone to use for any language task.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The big models have outstanding accuracy, but they are difficult to use in practice. They are resource hungry because of their large number of parameters. The problem gets worse when serving the fine-tuned model, because it needs a lot of memory and time to process a single message. A state-of-the-art model is of little use if it can handle only one message per second. 
To improve throughput, we need to accelerate the well-performing BERT model by using quantization to reduce its computational cost.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Overview of Optimum Intel and quantization aware training<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Optimum Intel is an extension for the Hugging Face Optimum library with OpenVINO\u2122 runtime as a backend for the Transformers architectures. It also provides an interface to the Intel\u00ae NNCF (Neural Network Compression Framework) package. It helps implement Intel\u2019s optimizations through NNCF with changes to just a few lines of code in the training pipeline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Quantization aware training (QAT) is a widely used technique for optimizing models during training. It inserts nodes into the neural network that simulate the effect of lower precision. This allows the training algorithm to treat quantization errors as part of the overall training loss that gets minimized during training. QAT typically preserves accuracy better than quantizing the model after it has been trained. The output after training with our tool is a quantized PyTorch model, an ONNX model, and an OpenVINO\u2122 IR (IR.xml).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Overview of ONNX Runtime and OpenVINO\u2122 Execution Provider<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">ONNX Runtime is an open source project that is designed to accelerate machine learning across a wide range of frameworks, operating systems, languages, and hardware platforms. 
It enables the acceleration of machine learning inferencing across all of your deployment targets using a single set of APIs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Intel and Microsoft joined hands to create the <a href=\"https:\/\/onnxruntime.ai\/docs\/execution-providers\/OpenVINO-ExecutionProvider.html\" target=\"_blank\" rel=\"noreferrer noopener\">OpenVINO\u2122 Execution Provider<\/a> (OVEP) for ONNX Runtime, which enables ONNX models for running inference using ONNX Runtime APIs while using the OpenVINO\u2122 Runtime as a backend. With the OpenVINO\u2122 Execution Provider, ONNX Runtime delivers better inferencing performance on the same hardware compared to generic acceleration on Intel\u00ae CPU, GPU, and VPU. Now you\u2019ve got a basic understanding of quantization, ONNX Runtime, and OVEP, let&#8217;s take the best of both worlds and stitch the story together.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Putting the tools together to achieve better performance<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In our next steps, we will be doing quantization aware training using Optimum-Intel and Inference using Optimum-ORT with OpenVINO\u2122 Execution Provider through Azure Machine Learning. Optimum can be used to load optimized models from the&nbsp;<a href=\"https:\/\/huggingface.co\/docs\/optimum\/v1.2.1\/en\/onnxruntime\/hf.co\/models\">Hugging Face Hub<\/a>&nbsp;and create pipelines to run accelerated inferences.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Converting PyTorch FP32 model to INT8 ONNX model with QAT<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When utilizing the Hugging Face training pipelines all you need is to update a few lines of code and you can invoke the NNCF optimizations for quantizing the model. The output of this would be an optimized INT8 PyTorch model, ONNX model, and OpenVINO\u2122 IR. 
See the flow diagram below:<\/p>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/image.webp\" alt=\"An End-to-End NLP workflow diagram to showcase the necessary intermediate steps e.g., model download from Hugging Face, model quantization, model inferencing, and model deployment with respective component software and delivery formats.\" class=\"wp-image-94180 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/image.webp\"><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For this case study, we have chosen the <a href=\"https:\/\/huggingface.co\/bert-large-uncased-whole-word-masking-finetuned-squad\" target=\"_blank\" rel=\"noreferrer noopener\">bert-squad pretrained model<\/a> from Hugging Face. This model has been fine-tuned on the SQuAD dataset for the question-answering use case. QAT can be applied by replacing the <a href=\"https:\/\/huggingface.co\/docs\/transformers\/main\/en\/main_classes\/trainer#trainer\" target=\"_blank\" rel=\"noreferrer noopener\">Transformers Trainer<\/a>&nbsp;with the Optimum Intel trainer (OVTrainer). See below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; gutter: false; title: ; quick-code: false; notranslate\" title=\"\">\n# Run the training pipeline\n\n# 1. Import OVConfig and the question-answering trainer\nfrom optimum.intel.openvino import OVConfig\nfrom trainer_qa import QuestionAnsweringOVTrainer\n\n# 2. Initialize a config\nov_config = OVConfig()\n\n# 3. 
Initialize our Trainer (constructor arguments omitted here; see the reference code linked below)\ntrainer = QuestionAnsweringOVTrainer()\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Comparison of FP32 model and INT8 ONNX model with Netron&nbsp;model visualization tool<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When compared with FP32, the INT8 model has <strong>QuantizeLinear<\/strong> and <strong>DequantizeLinear<\/strong> operations added to mimic the lower precision after QAT.<\/p>\n\n\n\n<div class=\"wp-block-group is-content-justification-center is-nowrap is-layout-flex wp-container-core-group-is-layout-23441af8 wp-block-group-is-layout-flex\">\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\"><figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Fig-1.webp\" alt=\"A fp32 model visualization for deep learning model.\" class=\"wp-image-94181 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Fig-1.webp\"><figcaption class=\"wp-element-caption\"><em><strong>Fig1: FP32 model<\/strong><\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Fig-2.webp\" alt=\"An INT8 model visualization for deep learning model that has Quantize Linear and Dequantize Linear operations added to the ops to mimic the lower precision after the quantization aware training.\" class=\"wp-image-94183 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Fig-2.webp\"><figcaption class=\"wp-element-caption\"><em><strong>Fig2: INT8 model<\/strong><\/em><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">To replicate this example, check out the reference code with detailed 
instructions on <a href=\"https:\/\/github.com\/intel\/nlp-training-and-inference-openvino\/tree\/main\/question-answering-bert-qat\" target=\"_blank\" rel=\"noreferrer noopener\">QAT and Inference using OpenVINO<\/a> and <a href=\"https:\/\/github.com\/microsoft\/onnxruntime-inference-examples\/tree\/main\/python\/OpenVINO_EP\/azureml\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Machine Learning Jupyter Notebooks<\/a> on GitHub.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Performance improvement results<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Accuracy<\/th><th>Original FP32<\/th><th>QAT INT8<\/th><th>Explanation<\/th><\/tr><\/thead><tbody><tr><td>F1<\/td><td>93.1<\/td><td>92.83<\/td><td>In this case, it&#8217;s computed over the individual words in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.<\/td><\/tr><tr><td>Eval_exact<\/td><td>86.91<\/td><td>86.94<\/td><td>This metric is as simple as it sounds. For each question + answer pair, if the&nbsp;<em>characters<\/em>&nbsp;of the model&#8217;s prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Comparison of ONNXRUNTIME_PERF_TEST application for ONNX-FP32 and ONNX-INT8 models<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve used ONNXRuntime APIs for running inference for the BERT model. 
As the comparison below shows, the INT8-optimized model delivers almost 2.95x the throughput of the FP32 model, with little loss of accuracy.<\/p>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Table.webp\" alt=\"A fp32 versus int8 model performance comparison using ONNX Runtime performance test application with OpenVINO&trade; Execution Provider. Performance test application gives 2.95 times throughput with int8 model.\" class=\"wp-image-94184 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/Table.webp\"><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Quantized PyTorch, ONNX, and INT8 models can also be served using OpenVINO\u2122 Model Server for high scalability and optimization for Intel\u00ae solutions, so that you can take advantage of all the power of the Intel\u00ae Xeon\u00ae processor or Intel\u2019s AI accelerators and expose it over a network interface.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Optimize speed and performance<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">As neural networks move from servers to the edge, optimizing speed and size becomes even more important. 
In this blog, we gave an overview of how to use open source tooling to make it easy to improve performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">References:<\/h3>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/medium.com\/openvino-toolkit\/enhanced-low-precision-pipeline-to-accelerate-inference-with-openvino-toolkit-deefe0206c24\" target=\"_blank\" rel=\"noreferrer noopener\">Enhanced Low-Precision Pipeline to Accelerate Inference with OpenVINO\u2122 toolkit<\/a>.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/medium.com\/openvino-toolkit\/developer-guide-model-optimization-with-the-openvino-toolkit-d19a201dd3ce\" target=\"_blank\" rel=\"noreferrer noopener\">Developer Guide: Model Optimization with the OpenVINO\u2122 Toolkit<\/a>. <\/li>\n\n\n\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/qa.fastforwardlabs.com\/no%20answer\/null%20threshold\/bert\/distilbert\/exact%20match\/f1\/robust%20predictions\/2020\/06\/09\/Evaluating_BERT_on_SQuAD.html\" target=\"_blank\" rel=\"noreferrer noopener\">Evaluating QA: Metrics, Predictions, and the Null Response<\/a>.<\/li>\n<\/ol>\n\n\n\n<h4 class=\"wp-block-heading\">SW\/HW configuration<\/h4>\n\n\n\n<div class=\"wp-block-group has-global-padding is-layout-constrained wp-block-group-is-layout-constrained\">\n<p class=\"wp-block-paragraph\"><strong>Framework configuration: <\/strong>ONNXRuntime, Optimum-Intel [NNCF]<br><strong>Application configuration: <\/strong>ONNXRuntime, EP: OpenVINO\u2122 <strong>.\/onnx_perf_test<\/strong> OPENVINO 2022.2: .\/benchmark_app<br><strong>Input:<\/strong> Question and context<br><strong>Application Metric:<\/strong>&nbsp;Normalized throughput<br><strong>Platform<\/strong>: Intel Icelake-8380<br><strong>Number of Nodes<\/strong>: 2<br><strong>Number of Sockets<\/strong>: 2<br><strong><strong>CPU or Accelerator<\/strong><\/strong>: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz<br><strong>Cores\/socket, 
Threads\/socket or EU\/socket: <\/strong>40,2<br><strong>ucode<\/strong>: 0xd000375<br><strong>HT:<\/strong> Enabled<br><strong>Turbo: <\/strong>Enabled<br><strong>BIOS Version<\/strong>: American Megatrends International, LLC. V1.4<br><strong>System DDR Mem Config: slots \/ cap \/ run-speed<\/strong>: 32\/32 GB\/3200 MT\/s<br><strong>Total Memory\/Node (DDR+DCPMM)<\/strong>: 1024GB<br><strong>Storage \u2013 boot<\/strong>: INTEL_SSDSC2KB019T8 1.8T<br><strong>NIC<\/strong>: 2 x&nbsp;Ethernet&nbsp;Controller X710 for 10GBASE-T<br><strong>OS<\/strong>: Ubuntu 20.04.4 LTS<br><strong>Kernel<\/strong>: 5.15.0-46-generic<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.<\/p>\n","protected":false},"author":5562,"featured_media":95489,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","_alt_title":"","ms-ems-related-posts":[],"footnotes":""},"tags":[1824,1827],"programming-languages":[],"content-type":[340],"job-role":[],"topic":[2238,2252],"coauthors":[1972,1973,1969,1974],"class_list":["post-94179","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-onnx-runtime","tag-transformer","content-type-tutorials-and-demos","topic-ai-machine-learning","topic-tools","review-flag-1593580428-734","review-flag-1593580419-521","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-5-1593580453-725"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source Blog<\/title>\n<meta name=\"description\" content=\"Make 
large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2023-01-25T17:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-06-19T17:50:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/image.png\" \/>\n<meta name=\"author\" content=\"Cassie Breviu, Akhila Vidiyala, Devang Aggarwal, Sachin Rastogi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Cassie Breviu, Akhila Vidiyala, Devang Aggarwal, Sachin Rastogi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/\"},\"author\":[{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/cassie-breviu\\\/\",\"@type\":\"Person\",\"@name\":\"Cassie Breviu\"},{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/akhila-vidiyala\\\/\",\"@type\":\"Person\",\"@name\":\"Akhila Vidiyala\"},{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/devang-aggarwal\\\/\",\"@type\":\"Person\",\"@name\":\"Devang Aggarwal\"},{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/sachin-rastogi\\\/\",\"@type\":\"Person\",\"@name\":\"Sachin Rastogi\"}],\"headline\":\"Improve BERT inference speed by combining the power of Optimum, OpenVINO\u2122, ONNX Runtime, and 
Azure\",\"datePublished\":\"2023-01-25T17:00:00+00:00\",\"dateModified\":\"2024-06-19T17:50:42+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/\"},\"wordCount\":1129,\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/STB13_Michelle_03.webp\",\"keywords\":[\"ONNX Runtime\",\"Transformer\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/\",\"name\":\"OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/STB13_Michelle_03.webp\",\"datePublished\":\"2023-01-25T17:00:00+00:00\",\"dateModified\":\"2024-06-19T17:50:42+00:00\",\"description\":\"Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#primaryimage\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/STB13_Michelle_03.webp\",\"contentUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/STB13_Michelle_03.webp\",\"width\":1170,\"height\":640},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2023\\\/01\\\/
25\\\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Improve BERT inference speed by combining the power of Optimum, OpenVINO\u2122, ONNX Runtime, and Azure\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability\",\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"contentUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source 
Blog\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/OpenAtMicrosoft\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/person\\\/4d7e7cd8266dc319e43a6de1e173495f\",\"name\":\"Teri Dormer\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g98331fbdc1fedab03f83292cd9dfa932\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g\",\"caption\":\"Teri Dormer\"},\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/teridormer\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source Blog","description":"Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/","og_locale":"en_US","og_type":"article","og_title":"OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source Blog","og_description":"Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine Learning.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2023-01-25T17:00:00+00:00","article_modified_time":"2024-06-19T17:50:42+00:00","og_image":[{"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/01\/image.png","type":"","width":"","height":""}],"author":"Cassie Breviu, Akhila Vidiyala, Devang Aggarwal, Sachin Rastogi","twitter_card":"summary_large_image","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Cassie Breviu, Akhila Vidiyala, Devang Aggarwal, Sachin Rastogi","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/cassie-breviu\/","@type":"Person","@name":"Cassie Breviu"},{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/akhila-vidiyala\/","@type":"Person","@name":"Akhila Vidiyala"},{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/devang-aggarwal\/","@type":"Person","@name":"Devang Aggarwal"},{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/sachin-rastogi\/","@type":"Person","@name":"Sachin Rastogi"}],"headline":"Improve BERT inference speed by combining the power of Optimum, OpenVINO\u2122, ONNX Runtime, and Azure","datePublished":"2023-01-25T17:00:00+00:00","dateModified":"2024-06-19T17:50:42+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/"},"wordCount":1129,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Michelle_03.webp","keywords":["ONNX 
Runtime","Transformer"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/","url":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/","name":"OpenVINO\u2122, ONNX Runtime, and Azure improve BERT inference speed | Microsoft Open Source Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Michelle_03.webp","datePublished":"2023-01-25T17:00:00+00:00","dateModified":"2024-06-19T17:50:42+00:00","description":"Make large models smaller and faster with OpenVino Execution Provider, NNCF and ONNX Runtime leveraging Azure Machine 
Learning.","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Michelle_03.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Michelle_03.webp","width":1170,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/01\/25\/improve-bert-inference-speed-by-combining-the-power-of-optimum-openvino-onnx-runtime-and-azure\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Improve BERT inference speed by combining the power of Optimum, OpenVINO\u2122, ONNX Runtime, and Azure"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, 
interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]},{"@type":"Person","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/person\/4d7e7cd8266dc319e43a6de1e173495f","name":"Teri Dormer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g98331fbdc1fedab03f83292cd9dfa932","url":"https:\/\/secure.gravatar.com\/avatar\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4f1c6b1df49619573e006bda75a18efb7f99db184762acc79d899b8a6ef768aa?s=96&d=microsoft&r=g","caption":"Teri 
Dormer"},"url":"https:\/\/opensource.microsoft.com\/blog\/author\/teridormer\/"}]}},"bloginabox_animated_featured_image":null,"bloginabox_display_generated_audio":false,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94179","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/5562"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=94179"}],"version-history":[{"count":2,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94179\/revisions"}],"predecessor-version":[{"id":98693,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94179\/revisions\/98693"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/95489"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=94179"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/tags?post=94179"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=94179"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=94179"},{"taxonomy":"job-role","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/job-role?post=94179"}
,{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=94179"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=94179"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}