{"id":94796,"date":"2023-09-07T08:00:00","date_gmt":"2023-09-07T15:00:00","guid":{"rendered":""},"modified":"2024-08-23T09:16:24","modified_gmt":"2024-08-23T16:16:24","slug":"boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/","title":{"rendered":"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors"},"content":{"rendered":"\n<p>One of the exciting features of the 4<sup>th<\/sup> Gen Intel\u00ae Xeon\u00ae CPU is Intel\u00ae Advanced Matrix Extension (AMX). Intel\u00ae AMX is an x86 extension that accelerates matrix multiplications common in deep learning (DL) workloads. To take advantage of this for performance acceleration through ONNX Runtime, Intel and Microsoft developed the 8-bit integer matrix multiplication kernel in ONNX Runtime using Intel\u00aeAMX instructions, resulting in four times faster performance than 3<sup>rd<\/sup> Gen Intel\u00ae Xeon\u00ae using Intel\u00ae DL Boost. This blog explores how <a href=\"https:\/\/onnxruntime.ai\/\">ONNX Runtime<\/a> harnesses Intel\u00ae AMX to accelerate performance for the 4<sup>th<\/sup> Gen Intel\u00ae Xeon\u00ae CPUs. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"intel-amx\">Intel AMX<\/h2>\n\n\n\n<p><a href=\"https:\/\/www.intel.com\/content\/www\/us\/en\/products\/docs\/accelerator-engines\/advanced-matrix-extensions\/overview.html\" target=\"_blank\" rel=\"noreferrer noopener\">Intel AMX<\/a> is an x86 extension that operates on matrices. The extension consists of two key components shown in Figure 1: a set of 2D register files called tiles representing sub-arrays of a larger 2D memory image and a set of accelerators called the tile multiplication unit (TMUL) that operate on these tiles. 
Intel AMX instructions are synchronous with the central processing unit (CPU) instruction stream, and tile load\/store operations are coherent with CPU memory operations. Intel AMX instructions can be interleaved with other x86 instructions and can run in parallel with other extensions such as Intel\u00ae AVX512. For more details on the Intel AMX architecture, refer to the <a href=\"https:\/\/cdrdv2.intel.com\/v1\/dl\/getContent\/671488?explicitVersion=true\" target=\"_blank\" rel=\"noreferrer noopener\">Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual<\/a>.<\/p>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture1.webp\" alt=\"Here is the Intel AMX Architecture. A set of 2D register files called tiles representing sub-arrays of a larger 2D memory image and a set of accelerators called the tile multiplication unit (TMUL) that operate on these tiles.\" class=\"wp-image-94799 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture1.webp\"><\/figure>\n\n\n\n<p>Figure 1. Intel AMX Architecture.<\/p>\n\n\n\n<p>The TMUL comprises a grid of fused multiply-add (FMA) units that operate on Intel AMX tiles. The matrix multiplication operation in the TMUL instruction computes C[M][N] += A[M][K] * B[K][N], as shown in Figure 2.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The A tile can have 1-16 rows and 1-MAX_TILE_K columns.<\/li>\n\n\n\n<li>The B tile can have 1-MAX_TILE_K rows and 1-16 columns.<\/li>\n\n\n\n<li>The C tile can have 1-16 rows and 1-16 columns.<\/li>\n<\/ul>\n\n\n\n<p>Here, MAX_TILE_K = 64\/sizeof(type_t), where type_t is the type of the data being operated on. 
So MAX_TILE_K = 64 for (u)int8 data, and MAX_TILE_K = 32 for bfloat16 data.<\/p>\n\n\n\n<p>The data type of the output tile depends on the data type of the input tiles.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A tiles and B tiles contain data of type type_t, i.e. (u)int8 or bfloat16.<\/li>\n\n\n\n<li>C tiles contain data of type res_type_t:<ul><li>int32 if type_t=(u)int8<\/li>\n<li>float if type_t=bfloat16<\/li><\/ul>\n<\/li>\n<\/ul>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture2.webp\" alt=\"This figure explains Intel AMX matrix multiplication with max-sized int8 tiles. The matrix multiplication operation in the TMUL instruction computes C[M][N] += A[M][K] * B[K][N]\" class=\"wp-image-94800 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture2.webp\"><\/figure>\n\n\n\n<p>Figure 2. Intel AMX matrix multiplication with max-sized int8 tiles.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"onnx-runtime-and-intel-amx\">ONNX Runtime and Intel AMX<\/h2>\n\n\n\n<p>ONNX Runtime is an open-source, high-performance machine learning engine. It integrates a range of hardware acceleration techniques to enhance performance across diverse hardware platforms. Intel and Microsoft collaborate on an ongoing basis to ensure accelerated performance across the Intel hardware ecosystem. Previously, Intel and Microsoft developed 8-bit integer matrix multiplication and convolution kernels using Intel\u00ae DL Boost instructions, which were introduced in the 2<sup>nd<\/sup> Gen Intel\u00ae Xeon\u00ae processor line. 
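<\/p>\n\n\n\n<p>The tile arithmetic described in the previous section can be made concrete with a short reference model. The sketch below is illustrative only, written in plain Python rather than the MLAS kernel: it checks the tile shape limits, computes MAX_TILE_K from the element size, and accumulates 8-bit products into 32-bit results exactly as C[M][N] += A[M][K] * B[K][N] describes.<\/p>\n\n\n\n

```python
# Illustrative reference model of one Intel AMX TMUL int8 tile multiplication:
# C[M][N] += A[M][K] * B[K][N], with 8-bit operands accumulating into int32.
# Names and shapes here are assumptions for illustration, not the MLAS kernel.

MAX_TILE_ROWS = 16          # a tile has at most 16 rows
MAX_TILE_BYTES_PER_ROW = 64  # and at most 64 bytes per row

def max_tile_k(itemsize: int) -> int:
    """MAX_TILE_K = 64 / sizeof(type_t): 64 for (u)int8, 32 for bfloat16."""
    return MAX_TILE_BYTES_PER_ROW // itemsize

def tile_matmul_int8(A, B, C):
    """C[M][N] += A[M][K] * B[K][N]; A holds u8 and B holds s8 values
    (by convention, not enforced), with Python ints modeling the int32
    accumulators in the C tile."""
    M, K, N = len(A), len(B), len(B[0])
    assert M <= MAX_TILE_ROWS and N <= MAX_TILE_ROWS
    assert K <= max_tile_k(1)  # (u)int8 itemsize is 1 byte
    for m in range(M):
        for n in range(N):
            acc = C[m][n]
            for k in range(K):
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C
```

\n\n\n\n<p>For example, max_tile_k(1) returns 64 and max_tile_k(2) returns 32, matching the (u)int8 and bfloat16 cases above.<\/p>\n\n\n\n<p>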
With the advent of the latest Intel hardware generation, Intel has partnered with Microsoft to add Intel AMX instructions for 8-bit integer matrix multiplication to the Microsoft Linear Algebra Subroutine library (MLAS), which powers the default CPU execution provider in ONNX Runtime. Figure 3 shows a high-level view of the execution providers in ONNX Runtime and the Intel technologies they have been enabled for. The steps to build ONNX Runtime with the CPU execution provider can be found in the <a href=\"https:\/\/onnxruntime.ai\/docs\/build\/inferencing.html#cpu\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime documentation<\/a>. Add <code>--build_micro_benchmarks<\/code> to the build command to build the micro benchmarks. To run the QGEMM micro benchmarks, execute <code>onnxruntime_mlas_benchmark.exe --benchmark_filter=QGEMM*<\/code>.<\/p>\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture3.webp\" alt=\"Here is the ONNX Runtime architecture with Intel technology it has enabled.\" class=\"wp-image-94801 webp-format\" srcset=\"\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture3.webp\"><\/figure>\n\n\n\n<p>Figure 3. ONNX Runtime architecture.<\/p>\n\n\n\n<p>Code Listing 1 is a snippet of the ONNX Runtime matrix multiplication kernel optimized with Intel AMX instructions. <code>tile_loadd<\/code> loads data into a TMM tile. <code>tile_dpbusd<\/code> executes the matrix multiplication on the TMM tiles. The code was written using intrinsics and assembly. 
You can get more details of the ONNX Runtime matrix multiplication code from <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\/blob\/main\/onnxruntime\/core\/mlas\/lib\/qgemm_kernel_amx.cpp\" target=\"_blank\" rel=\"noreferrer noopener\">qgemm_kernel_amx.cpp<\/a>, <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\/blob\/main\/onnxruntime\/core\/mlas\/lib\/amx_common.h\" target=\"_blank\" rel=\"noreferrer noopener\">amx_common.h<\/a>, and <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\/blob\/main\/onnxruntime\/core\/mlas\/lib\/x86_64\/QgemmU8S8KernelAmxCommon.S\" target=\"_blank\" rel=\"noreferrer noopener\">QgemmU8S8KernelAmxCommon.S<\/a>.<\/p>\n\n\n\n<p>Code Listing 1. ONNX Runtime matrix multiplication using Intel AMX instructions.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<p><code>const MLAS_GEMM_U8S8_KERNEL_AMX::PackedAType* a_blk = A;<\/code><\/p>\n\n\n\n<p><code>const MLAS_GEMM_U8S8_KERNEL_AMX::PackedAType* a_next_blk = A + PackedCountK * TILE_M;<\/code><\/p>\n\n\n\n<p><code>for (size_t k = PackedCountK; k &gt; 0; k -= TILE_K) {<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; tile_loadd(TMM0, b_blk, TILE_K);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; tile_loadd(TMM2, a_blk, static_cast&lt;int&gt;(PackedCountK));<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; tile_loadd(TMM1, (void*)(b_blk + PackedCountK * TILE_N), TILE_K);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; tile_dpbusd(TMM4, TMM2, TMM0);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; tile_dpbusd(TMM6, TMM2, TMM1);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; if (m1 &gt; 0) {<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; &nbsp; &nbsp; tile_loadd(TMM3, a_next_blk, static_cast&lt;int&gt;(PackedCountK));<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; &nbsp; &nbsp; tile_dpbusd(TMM5, TMM3, TMM0);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; &nbsp; &nbsp; 
tile_dpbusd(TMM7, TMM3, TMM1);<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; }<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; b_blk += TILE_N * TILE_K;<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; a_blk += TILE_K;<\/code><\/p>\n\n\n\n<p><code>&nbsp; &nbsp; a_next_blk += TILE_K;<\/code><\/p>\n\n\n\n<p><code>}<\/code><\/p>\n<\/div>\n<\/div>\n\n\n\n<p>Powered by Intel AMX instructions, the 8-bit integer matrix multiplication in ONNX Runtime on 4<sup>th<\/sup> Gen Intel Xeon delivered more than four times the performance of 3<sup>rd<\/sup> Gen Intel Xeon, as shown in Figure 4. To realize these gains for a model that contains matrix multiplications, first quantize the model using <a href=\"https:\/\/onnxruntime.ai\/docs\/performance\/model-optimizations\/quantization.html\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime quantization<\/a>, then run the quantized model with the ONNX Runtime CPU package on a 4th Gen Intel Xeon processor.<\/p>\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1024x655.webp\" alt=\"This figure shows the performance benchmark that the 8-bit integer matrix multiplication in ONNX Runtime on 4th Gen Intel Xeon resulted in more than four times performance gain over 3rd Gen Intel Xeon.\" class=\"wp-image-94802 webp-format\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1024x655.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-300x192.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-768x491.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1536x983.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-2048x1311.png 2048w, 
https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-800x512.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-400x256.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-450x288.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-650x416.webp 650w\" data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1024x655.png\" data-orig-srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1024x655.png 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-300x192.png 300w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-768x491.png 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-1536x983.png 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-2048x1311.png 2048w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-800x512.png 800w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-400x256.png 400w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-450x288.png 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2023\/09\/Picture4-650x416.png 650w\"><\/figure>\n\n\n\n<p>Figure 4. ONNX Runtime 8-bit integer matrix multiplication performance benefit of 4<sup>th<\/sup> Gen over 3<sup>rd<\/sup> Gen Intel Xeon processors across square matrix shapes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"looking-ahead\">Looking ahead<\/h2>\n\n\n\n<p>ONNX Runtime achieves enhanced performance on 4<sup>th<\/sup> Gen Intel Xeon processors thanks to the incorporation of Intel AMX for the implementation of 8-bit integer matrix multiplications. 
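<\/p>\n\n\n\n<p>To give a feel for the quantization step, here is a minimal sketch of symmetric per-tensor int8 quantization, the kind of float-to-integer mapping that lets matrix multiplications run on 8-bit integer kernels. This is an illustration under simplifying assumptions, not the exact scheme ONNX Runtime applies; the real tooling (onnxruntime.quantization) selects quantization schemes and parameters per operator.<\/p>\n\n\n\n

```python
# Illustrative sketch of symmetric per-tensor int8 quantization.
# NOT the exact ONNX Runtime scheme; the real tooling (onnxruntime.quantization)
# chooses quantization schemes and parameters per operator.

def quantize_int8(values):
    """Map floats to int8 codes with a single scale (symmetric, zero_point = 0)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Clamp to the symmetric int8 range [-127, 127].
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]
```

\n\n\n\n<p>The int8 codes feed integer matrix multiplication kernels such as the one in Code Listing 1, and the int32 results are rescaled by the product of the operand scales.<\/p>\n\n\n\n<p>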
DL models that contain a large number of matrix multiplications can benefit from using ONNX Runtime optimized with Intel AMX instructions. Intel and Microsoft will continue to develop ONNX Runtime for new DL hardware features.<\/p>\n\n\n\n<p>We invite you to try <a href=\"https:\/\/onnxruntime.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime<\/a> to accelerate your models on 4th Gen Intel Xeon processors. We look forward to your feedback and requests, which you can submit through our GitHub projects.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"configurations\">Configurations<\/h2>\n\n\n\n<p>Intel Xeon Platinum 8380, 40 cores, HT On, Turbo On, Total Memory 1024GB (16x64GB DDR4 3200 MT\/s [3200 MT\/s]), Ubuntu 22.04 LTS, 5.15.0-57-generic, GCC 11.3.0, ONNX Runtime v1.14.1<\/p>\n\n\n\n<p>Intel Xeon Platinum 8480+, 56 cores, HT On, Turbo On, Total Memory 1024GB (16x64GB DDR5 4800 MT\/s [4800 MT\/s]), Ubuntu 22.04 LTS, 5.15.0-57-generic, GCC 11.3.0, ONNX Runtime v1.14.1<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae 
CPUs.<\/p>\n","protected":false},"author":6194,"featured_media":95466,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[1824],"content-type":[346],"topic":[2241],"programming-languages":[],"coauthors":[2048,2049],"class_list":["post-94796","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-onnx-runtime","content-type-news","topic-cloud","review-flag-1593580428-734","review-flag-1593580419-521","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-5-1593580453-725","review-flag-8-1593580468-572","review-flag-machi-1680214156-53","review-flag-new-1593580248-669"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source Blog<\/title>\n<meta name=\"description\" content=\"Explore how ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae CPUs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"Explore how ONNX Runtime 
harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae CPUs.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-07T15:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-23T16:16:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1170\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Chen Fu, Kiefer Kuah\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Chen Fu, Kiefer Kuah\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\"},\"author\":[{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/chen-fu\/\",\"@type\":\"Person\",\"@name\":\"Chen Fu\"},{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/kiefer-kuah\/\",\"@type\":\"Person\",\"@name\":\"Kiefer Kuah\"}],\"headline\":\"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors\",\"datePublished\":\"2023-09-07T15:00:00+00:00\",\"dateModified\":\"2024-08-23T16:16:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\"},\"wordCount\":913,\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp\",\"keywords\":[\"ONNX 
Runtime\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\",\"name\":\"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source Blog\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp\",\"datePublished\":\"2023-09-07T15:00:00+00:00\",\"dateModified\":\"2024-08-23T16:16:24+00:00\",\"description\":\"Explore how ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae 
CPUs.\",\"breadcrumb\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp\",\"width\":1170,\"height\":640},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/opensource.microsoft.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, 
interoperability\",\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source Blog\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source Blog","description":"Explore how ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae CPUs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/","og_locale":"en_US","og_type":"article","og_title":"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source Blog","og_description":"Explore how ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae CPUs.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2023-09-07T15:00:00+00:00","article_modified_time":"2024-08-23T16:16:24+00:00","og_image":[{"width":1170,"height":640,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.png","type":"image\/png"}],"author":"Chen Fu, Kiefer Kuah","twitter_card":"summary_large_image","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Chen Fu, Kiefer Kuah","Est. 
reading time":"4 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/chen-fu\/","@type":"Person","@name":"Chen Fu"},{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/kiefer-kuah\/","@type":"Person","@name":"Kiefer Kuah"}],"headline":"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors","datePublished":"2023-09-07T15:00:00+00:00","dateModified":"2024-08-23T16:16:24+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/"},"wordCount":913,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp","keywords":["ONNX Runtime"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/","url":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/","name":"Boosting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors | Microsoft Open Source 
Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp","datePublished":"2023-09-07T15:00:00+00:00","dateModified":"2024-08-23T16:16:24+00:00","description":"Explore how ONNX Runtime harnesses Intel\u00ae AMX to accelerate performance for the 4th Gen Intel\u00ae Xeon\u00ae CPUs.","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/CLO19_Ubisoft_Azure_055.webp","width":1170,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2023\/09\/07\/boosting-performance-in-onnx-runtime-with-intel-amx-for-4th-gen-intel-xeon-processors\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Boo
sting performance in ONNX Runtime with Intel\u00ae AMX for 4th Gen Intel\u00ae Xeon\u00ae Processors"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_display_generated_audio":false,"msxcm_animated_featured_image":null,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94796","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/6194"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=94796"}],"version-history":[{"count":17,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94796\/revisions"}],"predecessor-version":[{"id":96346,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/94796\/revisions\/96346"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/95466"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=94796"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=94796"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=94796"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=94796"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=94796"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=94796"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}