{"id":91533,"date":"2022-05-02T11:00:00","date_gmt":"2022-05-02T18:00:00","guid":{"rendered":"https:\/\/cloudblogs.microsoft.com\/opensource\/?p=91533"},"modified":"2024-06-19T10:50:28","modified_gmt":"2024-06-19T17:50:28","slug":"optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/","title":{"rendered":"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs"},"content":{"rendered":"<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Mohit-150x150.webp\" alt=\"a man wearing glasses and looking at the camera\" class=\"wp-image-91701 size-thumbnail webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Mohit.webp\"><\/figure><div class=\"wp-block-media-text__content\">\n<p>Mohit Ayani, Solutions Architect, NVIDIA<\/p>\n<\/div><\/div>\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/ShangZhang-photo-150x150.webp\" alt=\"Shang Zhang\" class=\"wp-image-91704 size-thumbnail webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/ShangZhang-photo.webp\"><\/figure><div class=\"wp-block-media-text__content\">\n<p>Shang Zhang, Senior AI Developer Technology Engineer, NVIDIA<\/p>\n<\/div><\/div>\n\n\n<div class=\"wp-block-media-text alignwide is-stacked-on-mobile\" 
style=\"grid-template-columns:15% auto\"><figure class=\"wp-block-media-text__media\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Jay-Rodge-150x150.webp\" alt=\"Jay Rodge\" class=\"wp-image-91707 size-thumbnail webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Jay-Rodge.webp\"><\/figure><div class=\"wp-block-media-text__content\">\n<p>Jay Rodge, Product Marketing Manager-AI, NVIDIA<\/p>\n<\/div><\/div>\n\n\n\n<p>Transformer-based models have revolutionized the natural language processing (NLP) domain. Since its inception, the transformer architecture has been integrated into models like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) for tasks such as text generation, summarization, and question answering, to name a few. Newer models keep growing in size by stacking more transformer layers and accepting longer input sequences, which has improved model accuracy but comes at the cost of higher inference times.<\/p>\n\n\n\n<p><a href=\"https:\/\/nvda.ws\/3JWE4jf\" target=\"_blank\" rel=\"noreferrer noopener\">NVIDIA TensorRT<\/a> is an SDK for high-performance deep learning inference on NVIDIA GPUs. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference. One of the key features of TensorRT is that it allows models to be deployed in reduced precisions like FP16 and INT8 with little loss of accuracy. Recently, Bing announced support for running its <a href=\"https:\/\/blogs.bing.com\/Engineering-Blog\/october-2021\/Bing-delivers-more-contextualized-search-using-quantized-transformer-inference-on-NVIDIA-GPUs-in-Azu\" target=\"_blank\" rel=\"noreferrer noopener\">transformer models<\/a> on Azure T4 GPUs leveraging TensorRT INT8 optimization. 
Starting with TensorRT 8.0, users can see <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-announces-tensorrt-8-slashing-bert-large-inference-down-to-1-millisecond\/\" target=\"_blank\" rel=\"noreferrer noopener\">inference latency as low as 1.2 ms<\/a> using INT8 optimization on BERT Large.<\/p>\n\n\n\n<p>Many of these transformer models from different frameworks (such as PyTorch and TensorFlow) can be converted to the Open Neural Network Exchange (ONNX) format, an open standard format for representing AI and deep learning models, for further optimizations. <a href=\"https:\/\/onnxruntime.ai\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime<\/a> is a high-performance inference engine for running machine learning models, with multi-platform support and a flexible execution provider interface for integrating hardware-specific libraries. As shown in Figure 1, ONNX Runtime integrates TensorRT as an execution provider to accelerate model inference on NVIDIA GPUs by harnessing the TensorRT optimizations. Based on the TensorRT capability, ONNX Runtime partitions the model graph and offloads the parts that TensorRT supports to the TensorRT execution provider for efficient model execution on NVIDIA hardware.<\/p>\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-1024x440.webp\" alt=\"Different execution providers supported by ONNX Runtime\" class=\"wp-image-91545 webp-format\" width=\"800\" height=\"343\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-1024x440.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-300x129.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-768x330.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-800x344.png 800w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-400x172.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-450x193.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-650x279.webp 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1.webp 1099w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-1024x440.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-1024x440.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-300x129.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-768x330.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-800x344.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-400x172.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-450x193.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1-650x279.png 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture1.png 1099w\"><figcaption>Figure 1: Different execution providers supported by ONNX Runtime.<\/figcaption><\/figure>\n\n\n\n<p>In this blog post, we take the <a href=\"https:\/\/huggingface.co\/docs\/transformers\/model_doc\/bert\" target=\"_blank\" rel=\"noreferrer noopener\">HuggingFace BERT<\/a> model, apply TensorRT INT8 optimizations, and accelerate inference with ONNX Runtime using the TensorRT execution provider.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><a><\/a>Setup<\/h2>\n\n\n\n<p>To get started, you can clone the <a href=\"https:\/\/github.com\/huggingface\/transformers\" target=\"_blank\" rel=\"noreferrer noopener\">transformers<\/a> repository from the HuggingFace 
GitHub page.<\/p>\n\n\n\n<p><code>$ git clone <a href=\"https:\/\/github.com\/huggingface\/transformers.git\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/huggingface\/transformers.git<\/a><br>$ cd transformers<\/code><\/p>\n\n\n\n<p>Then build and launch the Docker container with the following commands, which use the <a href=\"https:\/\/developer.nvidia.com\/blog\/new-on-ngc-security-reports-latest-containers-for-pytorch-tensorflow-hpc-and-more\/\" target=\"_blank\" rel=\"noreferrer noopener\">NGC PyTorch<\/a> container image.<\/p>\n\n\n\n<p><code>$ docker build . -f examples\/research_projects\/quantization-qdqbert\/Dockerfile -t bert_quantization:latest<\/code><\/p>\n\n\n\n<p><code>$ docker run --gpus all --privileged --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 bert_quantization:latest<\/code><\/p>\n\n\n\n<p>Once inside the container, navigate to the quantization directory.<\/p>\n\n\n\n<p><code>$ cd transformers\/examples\/research_projects\/quantization-qdqbert\/<\/code><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">INT8 optimization<\/h2>\n\n\n\n<p>Model quantization is an increasingly popular deep learning optimization method: performing calculations with 8-bit integers lets the model use the faster and cheaper 8-bit Tensor Cores. These cores compute convolution and matrix-multiplication operations with higher throughput, which is particularly effective on compute-limited layers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a><\/a>Quantization Toolkit<\/h3>\n\n\n\n<p><a href=\"https:\/\/github.com\/NVIDIA\/TensorRT\/tree\/main\/tools\/pytorch-quantization\" target=\"_blank\" rel=\"noreferrer noopener\">TensorRT Quantization Toolkit for PyTorch<\/a> provides a convenient tool to train and evaluate PyTorch models with simulated quantization. 
This library can automatically or manually add quantization to PyTorch models, and the quantized model can be exported to ONNX and imported by TensorRT 8.0 and later.<\/p>\n\n\n\n<p>If you already have an ONNX model, you can directly apply the ONNX Runtime quantization tool with Post Training Quantization (PTQ) and run the result with ONNX Runtime-TensorRT. Please refer to <a href=\"https:\/\/github.com\/microsoft\/onnxruntime-inference-examples\/tree\/main\/quantization\/nlp\/bert\/trt\" target=\"_blank\" rel=\"noreferrer noopener\">this example<\/a> for more details. This blog post focuses on starting from a PyTorch model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">HuggingFace QDQBERT model<\/h3>\n\n\n\n<p>The <a href=\"https:\/\/huggingface.co\/docs\/transformers\/model_doc\/qdqbert\" target=\"_blank\" rel=\"noreferrer noopener\">HuggingFace QDQBERT<\/a> model starts from the <a href=\"https:\/\/huggingface.co\/docs\/transformers\/model_doc\/bert\" target=\"_blank\" rel=\"noreferrer noopener\">HuggingFace BERT<\/a> model and uses the TensorRT Quantization Toolkit for PyTorch to insert Q\/DQ nodes into the network. Fake quantization operations (pairs of QuantizeLinear\/DequantizeLinear ops) are added to (1) linear layer inputs and weights, (2) matmul inputs, and (3) residual add inputs in the BERT model. After that, the QDQBERT model is exported to ONNX format, which can be imported into TensorRT. 
The QDQBERT model can be loaded from any checkpoint of the HuggingFace BERT model (for example <em>bert-large-uncased<\/em>), after which you can perform Quantization Aware Training (QAT) or Post Training Quantization (PTQ).<\/p>\n\n\n\n<p>Run the following command to perform calibration first:<\/p>\n\n\n\n<p><code>python3 run_quant_qa.py \\<br>--model_name_or_path bert-large-uncased \\<br>--dataset_name squad \\<br>--max_seq_length 128 \\<br>--doc_stride 32 \\<br>--output_dir calib\/bert-large-uncased \\<br>--do_calib \\<br>--calibrator percentile \\<br>--percentile 99.99<\/code><\/p>\n\n\n\n<p>Then launch QAT by running the script:<\/p>\n\n\n\n<p><code>python3 run_quant_qa.py \\<br>--model_name_or_path calib\/bert-large-uncased \\<br>--dataset_name squad \\<br>--do_train \\<br>--do_eval \\<br>--per_device_train_batch_size 12 \\<br>--learning_rate 4e-5 \\<br>--num_train_epochs 2 \\<br>--max_seq_length 128 \\<br>--doc_stride 32 \\<br>--output_dir finetuned_int8\/bert-large-uncased \\<br>--tokenizer_name bert-large-uncased \\<br>--save_steps 0<\/code><\/p>\n\n\n\n<p>At a high level, TensorRT processes ONNX models with Q\/DQ operators similarly to how it processes any other ONNX model. TensorRT imports an ONNX model containing Q\/DQ operations, performs a set of optimizations dedicated to Q\/DQ processing, continues with the general optimization passes, and builds a platform-specific execution-plan file for inference. This plan file contains quantized operations and weights. 
<\/p>\n\n\n\n<p>You can now export the fine-tuned model with Q\/DQ operations to the ONNX format using the following:<\/p>\n\n\n\n<p><code>python3 run_quant_qa.py \\<br>--model_name_or_path finetuned_int8\/bert-large-uncased \\<br>--output_dir .\/ \\<br>--save_onnx \\<br>--per_device_eval_batch_size 1 \\<br>--max_seq_length 128 \\<br>--doc_stride 32 \\<br>--dataset_name squad \\<br>--tokenizer_name bert-large-uncased<\/code><\/p>\n\n\n\n<p>Starting from TensorRT 8.0, TensorRT processes Q\/DQ networks with <a href=\"https:\/\/developer.nvidia.com\/blog\/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt\/\" target=\"_blank\" rel=\"noreferrer noopener\">new optimizations<\/a>, which increase Q\/DQ model performance and provide predictable and user-controlled arithmetic precision transitions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><a><\/a>Results<\/h2>\n\n\n\n<p>Inference performance experiments were run on an NVIDIA A100 GPU, using ONNX Runtime 1.11 and TensorRT 8.2 with the HuggingFace BERT-large model. The inference task is SQuAD, with INT8 quantization by the HuggingFace QDQBERT-large model.<\/p>\n\n\n\n<p>Benchmarking can be done using trtexec:<\/p>\n\n\n\n<p><code>trtexec --onnx=model.onnx --explicitBatch --workspace=16384 --int8 --shapes=input_ids:64x128,attention_mask:64x128,token_type_ids:64x128 --verbose<\/code><\/p>\n\n\n\n<p>Alternatively, you can use the Python script, which runs ONNX Runtime with the TensorRT execution provider:<\/p>\n\n\n\n<p><code>python3 ort-infer-benchmark.py<\/code><\/p>\n\n\n\n<p>With the optimizations of ONNX Runtime with the TensorRT EP, we see up to a sevenfold speedup over PyTorch inference for BERT Large and BERT Base, with latency under 2 ms and 1 ms respectively at batch size 1 (BS=1). 
Figure 2 shows the inference latency comparison when running BERT Large with sequence length 128 on an NVIDIA A100 GPU.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1024x721.webp\" alt=\"Compute latency comparison between ONNX Runtime-TensorRT and PyTorch for running BERT-Large on NVIDIA A100 GPU for sequence length 128.\" class=\"wp-image-91548 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1024x721.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-300x211.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-768x541.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1536x1081.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-800x563.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-400x282.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-450x317.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-650x458.webp 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2.webp 1916w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1024x721.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1024x721.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-300x211.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-768x541.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-1536x1081.png 1536w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-800x563.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-400x282.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-450x317.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2-650x458.png 650w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2022\/04\/Picture2.png 1916w\"><figcaption>Figure 2: Compute latency comparison between ONNX Runtime-TensorRT and PyTorch for running BERT-Large on NVIDIA A100 GPU for sequence length 128.<\/figcaption><\/figure>\n\n\n\n<p>You can also check the accuracy of the INT8 model using the following script:<\/p>\n\n\n\n<p><code>python3 evaluate-hf-trt-qa.py \\<br>--onnx_model_path=.\/model.onnx \\<br>--output_dir .\/ \\<br>--per_device_eval_batch_size 64 \\<br>--max_seq_length 128 \\<br>--doc_stride 32 \\<br>--dataset_name squad \\<br>--tokenizer_name bert-large-uncased \\<br>--int8 \\<br>--seed 42<\/code><\/p>\n\n\n\n<p>Accuracy metrics with ONNX Runtime-TensorRT 8.2 EP for the SQuAD task are:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>&nbsp;<\/td><td>INT8<\/td><td>FP16<\/td><td>FP32<\/td><\/tr><tr><td>F1 score<\/td><td>87.52263875<\/td><td>87.69072304<\/td><td>87.96610141<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">At the end<\/h2>\n\n\n\n<p>ONNX Runtime-TensorRT INT8 quantization shows very promising results on NVIDIA GPUs. We\u2019d love to hear any feedback or suggestions as you try it in your production scenarios. 
You can submit feedback by participating in our GitHub repos (<a href=\"https:\/\/github.com\/NVIDIA\/TensorRT\/tree\/main\/tools\/pytorch-quantization\" target=\"_blank\" rel=\"noreferrer noopener\">TensorRT<\/a> and <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime<\/a>).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mohit Ayani, Solutions Architect, NVIDIA Shang Zhang, Senior AI Developer Technology Engineer, NVIDIA Jay Rodge, Product Marketing Manager-AI, NVIDIA Transformer-based models have revolutionized the natural language processing (NLP) domain.<\/p>\n","protected":false},"author":5562,"featured_media":95464,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[1824,1911,1908,1827],"content-type":[361],"topic":[],"programming-languages":[2265],"coauthors":[699],"class_list":["post-91533","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-onnx-runtime","tag-quantization","tag-tensorrt","tag-transformer","content-type-project-updates","programming-languages-pytorch","review-flag-1593580428-734","review-flag-1593580771-946","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-5-1593580453-725","review-flag-8-1593580468-572","review-flag-integ-1593580288-449","review-flag-new-1593580248-669"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"Mohit Ayani, Solutions Architect, NVIDIA Shang Zhang, Senior AI Developer Technology Engineer, NVIDIA Jay Rodge, Product Marketing Manager-AI, NVIDIA Transformer-based models have revolutionized the natural language processing (NLP) domain.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2022-05-02T18:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-06-19T17:50:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1170\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Emma Ning\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Emma Ning\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\"},\"author\":[{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/emma-ning\/\",\"@type\":\"Person\",\"@name\":\"Emma Ning\"}],\"headline\":\"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs\",\"datePublished\":\"2022-05-02T18:00:00+00:00\",\"dateModified\":\"2024-06-19T17:50:28+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\"},\"wordCount\":989,\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp\",\"keywords\":[\"ONNX Runtime\",\"Quantization\",\"TensorRT\",\"Transformer\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\",\"name\":\"Optimizing and 
deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source Blog\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp\",\"datePublished\":\"2022-05-02T18:00:00+00:00\",\"dateModified\":\"2024-06-19T17:50:28+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp\",\"width\":1170,\"height\":640,\"caption\":\"A woman smiles at coworker in an 
office.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/opensource.microsoft.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability\",\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source Blog\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source Blog","og_description":"Mohit Ayani, Solutions Architect, NVIDIA Shang Zhang, Senior AI Developer Technology Engineer, NVIDIA Jay Rodge, Product Marketing Manager-AI, NVIDIA Transformer-based models have revolutionized the natural language processing (NLP) domain.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2022-05-02T18:00:00+00:00","article_modified_time":"2024-06-19T17:50:28+00:00","og_image":[{"width":1170,"height":640,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.png","type":"image\/png"}],"author":"Emma Ning","twitter_card":"summary_large_image","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Emma Ning","Est. 
reading time":"5 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/emma-ning\/","@type":"Person","@name":"Emma Ning"}],"headline":"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs","datePublished":"2022-05-02T18:00:00+00:00","dateModified":"2024-06-19T17:50:28+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/"},"wordCount":989,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp","keywords":["ONNX Runtime","Quantization","TensorRT","Transformer"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/","url":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/","name":"Optimizing and deploying transformer INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs | Microsoft Open Source 
Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp","datePublished":"2022-05-02T18:00:00+00:00","dateModified":"2024-06-19T17:50:28+00:00","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_030.webp","width":1170,"height":640,"caption":"A woman smiles at coworker in an office."},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2022\/05\/02\/optimizing-and-deploying-transformer-int8-inference-with-onnx-runtime-tensorrt-on-nvidia-gpus\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Optimizing and deploying transformer 
INT8 inference with ONNX Runtime-TensorRT on NVIDIA GPUs"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_display_generated_audio":false,"msxcm_animated_featured_image":null,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/91533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/5562"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=91533"}],"version-history":[{"count":1,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/91533\/revisions"}],"predecessor-version":[{"id":95737,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/91533\/revisions\/95737"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/95464"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=91533"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=91533"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=91533"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=91533"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=91533"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=91533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}