{"id":87048,"date":"2021-07-09T09:00:18","date_gmt":"2021-07-09T16:00:18","guid":{"rendered":""},"modified":"2025-05-30T15:36:04","modified_gmt":"2025-05-30T22:36:04","slug":"simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/","title":{"rendered":"Simple steps to create scalable processes to deploy ML models as microservices"},"content":{"rendered":"\n<p><strong><em>This post was co-authored by Alejandro Saucedo, Director of Machine Learning Engineering at Seldon Technologies.<\/em><\/strong><\/p>\n\n\n\n<p><em>About the co-author: Alejandro leads teams of machine learning engineers focused on the scalability and extensibility of machine learning deployment and monitoring products with over five million installations. Alejandro is also the Chief Scientist at the Institute for Ethical AI and Machine Learning, where he leads the development of industry standards on machine learning explainability, adversarial robustness, and differential privacy. With over 10 years of software development experience, Alejandro has held technical leadership positions across hyper-growth scale-ups and has a strong track record building cross-functional teams of software engineers.<\/em><\/p>\n\n\n\n<p>As organizations adopt machine learning in production, they face growing challenges that arise when the number of production machine learning models starts to increase. In this article, we provide a practical tutorial that will enable AI practitioners to leverage production-ready workflows to deploy their machine learning models at scale. 
More specifically, we will demonstrate the benefits of open source tools and frameworks like ONNX Runtime, Seldon Core, and HuggingFace, as well as how these can be integrated with Azure Kubernetes Service to achieve robust and scalable machine learning operations capabilities.<\/p>\n\n\n\n<p>By the end of this blog post, you will have a simple, repeatable, and scalable process to deploy complex machine learning models. You will learn by example, deploying the OpenAI GPT-2 natural language processing (NLP) model as a fully-fledged microservice with real-time metrics and robust monitoring capabilities. Try out our <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/triton_gpt2_example_azure.html\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-2 Azure AKS Deployment Notebook<\/a>, which demonstrates the full process.<\/p>\n\n\n\n<p>The steps that will be carried out in this blog are outlined in the image below, and include the following:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Fetch the pre-trained GPT-2 model using HuggingFace and export it to ONNX.<\/li>\n\n\n\n<li>Set up the Kubernetes environment and upload the model artifact.<\/li>\n\n\n\n<li>Deploy the ONNX model with Seldon Core to Azure Kubernetes Service.<\/li>\n\n\n\n<li>Send inference requests to the GPT-2 model deployed on Kubernetes.<\/li>\n\n\n\n<li>Visualize real-time monitoring metrics with Azure dashboards.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture1-60e459d1d317b-1024x162.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<p>We\u2019ll be using the following tools:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/workflow\/github-readme.html\" target=\"_blank\" rel=\"noreferrer noopener\">Seldon Core<\/a>: A 
machine learning model deployment and monitoring framework for Kubernetes which will allow us to convert our model artifact into a scalable microservice with real-time metrics.<\/li>\n\n\n\n<li><a href=\"https:\/\/www.onnxruntime.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime<\/a>: An optimized runtime engine to improve the performance of model inference, which we\u2019ll be using to optimize and run our models.<\/li>\n\n\n\n<li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/kubernetes-service\/\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Kubernetes Service (AKS)<\/a>: Azure\u2019s managed Kubernetes service, where we will be running the deployed machine learning models.<\/li>\n\n\n\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/azure-monitor\/overview\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Monitor<\/a>: Azure\u2019s service for managed monitoring, where we will be able to visualize all the performance metrics.<\/li>\n\n\n\n<li><a href=\"https:\/\/huggingface.co\/gpt2\" target=\"_blank\" rel=\"noreferrer noopener\">HuggingFace<\/a>: An ecosystem for training and pre-trained transformer-based NLP models, which we will leverage to get access to the OpenAI GPT-2 model.<\/li>\n<\/ul>\n\n\n\n<p>Let\u2019s get started.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1-fetch-the-trained-gpt-2-model-with-huggingface-and-export-to-onnx\">1. Fetch the trained GPT-2 Model with HuggingFace and export to ONNX<\/h2>\n\n\n\n<p><a href=\"https:\/\/openai.com\/blog\/better-language-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-2<\/a> is a popular NLP language model trained on a huge dataset that can generate human-like text. We will use Hugging Face\u2019s utilities to import the pre-trained GPT-2 tokenizer and model. 
First, we download the tokenizer as follows.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom transformers import GPT2Tokenizer\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n<\/pre><\/div>\n\n\n<p>Now we can download the GPT-2 TensorFlow model and save it for export:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom transformers import GPT2Tokenizer, TFGPT2LMHeadModel\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n# from_pt=True loads the PyTorch checkpoint into the TensorFlow model class\nmodel = TFGPT2LMHeadModel.from_pretrained(\"gpt2\", from_pt=True, pad_token_id=tokenizer.eos_token_id)\n# saved_model=True also writes a TensorFlow SavedModel under .\/tfgpt2model\/saved_model\/1\nmodel.save_pretrained(\".\/tfgpt2model\", saved_model=True)\n<\/pre><\/div>\n\n\n<p>Finally, we convert the saved model to the ONNX format with the command below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npython -m tf2onnx.convert --saved-model .\/tfgpt2model\/saved_model\/1 --opset 11 --output model.onnx\n<\/pre><\/div>\n\n\n<p>Two of the main advantages of <a href=\"https:\/\/onnxruntime.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">ONNX Runtime<\/a> are its high-performance inference capabilities and its broad compatibility. Practitioners can work in the machine learning framework of their choice and convert the resulting models to the Open Neural Network Exchange (ONNX) format. 
Once these models are converted, the ONNX Runtime can be used to deploy them to a variety of targets including desktop, IoT, mobile, and in our case Azure Kubernetes Service through Seldon Core.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture9-1024x474.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<p>A broad range of benchmarks showcase the high throughput and low latencies that ONNX Runtime can achieve, which you can see in \u201c<a href=\"https:\/\/medium.com\/microsoftazure\/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333\" target=\"_blank\" rel=\"noreferrer noopener\">Accelerate your NLP pipelines using Hugging Face Transformers and ONNX Runtime<\/a>\u201d, as well as in \u201c<a href=\"https:\/\/medium.com\/microsoftazure\/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7\" target=\"_blank\" rel=\"noreferrer noopener\">Faster and smaller quantized NLP with Hugging Face and ONNX Runtime<\/a>\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture10.png\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2-setup-kubernetes-environment-and-upload-model-artifact\">2. Set up Kubernetes environment and upload model artifact<\/h2>\n\n\n\n<p><a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">Seldon Core<\/a> is one of the leading open-source frameworks for machine-learning model deployment and monitoring at scale on Kubernetes. It allows machine learning practitioners to convert their trained model artifacts or machine learning model code into fully-fledged microservices. 
All models deployed with Seldon are enabled with advanced monitoring, robust promotion strategies, and scalable architectural patterns. We will be using Seldon Core to deploy our GPT-2 model.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.png\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<p>Seldon provides out-of-the-box a broad range of <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/stable\/servers\/overview.html\" target=\"_blank\" rel=\"noreferrer noopener\">Pre-Packaged Inference Servers<\/a> to deploy model artifacts to TFServing, Triton, ONNX Runtime, etc. It also provides <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/stable\/workflow\/overview.html#two-types-of-model-servers\" target=\"_blank\" rel=\"noreferrer noopener\">Custom Language Wrappers<\/a> to deploy custom Python, Java, C++, and more. In this blog post, we will be leveraging the Triton Prepackaged server with the ONNX Runtime backend. 
To set up Seldon Core, you can follow the <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/seldon_core_setup.html#Seldon-Core-Setup\" target=\"_blank\" rel=\"noreferrer noopener\">Seldon Core setup instructions<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"setting-up-the-azure-kubernetes-environment\">Setting Up the Azure Kubernetes Environment<\/h3>\n\n\n\n<p>The following diagram depicts our target architecture, which uses Azure Kubernetes Service (AKS), a fully managed Kubernetes service provided on Azure that removes the complexity of managing infrastructure and allows developers and data scientists to focus on machine learning models.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/seldon_.png\" alt=\"diagram\"\/><\/figure>\n\n\n\n<p>We recommend creating an AKS cluster with three node pools and the Azure Blob CSI driver installed. The <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/triton_gpt2_example_azure_setup.html\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Setup Notebook<\/a> contains the scripts, which you can run yourself. The cluster consists of:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CPU System Nodepool<\/strong>: running Kubernetes system components.<\/li>\n\n\n\n<li><strong>CPU User Nodepool<\/strong>: running Seldon Operator and Istio components.<\/li>\n\n\n\n<li><strong>GPU User Nodepool<\/strong>: running machine learning model inference with GPU hardware optimizations.<\/li>\n\n\n\n<li><strong>Azure Blob CSI driver<\/strong>: mounting the Azure Storage Account used for model hosting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"upload-model-artifacts\">Upload model artifacts<\/h3>\n\n\n\n<p>Seldon is able to automatically download your model artifacts from an object store, so we will start by uploading our model to Azure Blob Storage. 
To abstract away details of storage connection from SeldonDeployment, we will be able to use the PersistentVolume reference in our model manifest, which will be mounted with our model container. For details on setting up PersistentVolume for Azure Blob with Blob CSI driver refer to our example <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/triton_gpt2_example_azure.html\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. We can then upload ONNX model file to Azure Blob following the default directory structure as per the <a href=\"https:\/\/github.com\/triton-inference-server\/server\/blob\/master\/docs\/model_repository.md#onnx-models\" target=\"_blank\" rel=\"noreferrer noopener\">Triton model repository format<\/a>:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture13-1024x54.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3-deploy-to-kubernetes-aks-with-seldon-core\">3. Deploy to Kubernetes (AKS) with Seldon Core<\/h2>\n\n\n\n<p>Now, we are ready to deploy our model using <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/servers\/triton.html?highlight=Triton#triton-inference-server\" target=\"_blank\" rel=\"noreferrer noopener\">Seldon Core\u2019s Triton pre-packaged server<\/a>. 
For that, we need to define a SeldonDeployment Kubernetes manifest for the prepackaged Triton server and apply it, as shown below:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture14-599x1024.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<p>Some of the key attributes to notice:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u2018implementation\u2019 field is set to \u2018TRITON_SERVER\u2019.<\/li>\n\n\n\n<li>\u2018modelUri\u2019 points to the PVC from which the model is downloaded (&lt;pvc&gt;:\/\/&lt;name&gt;).<\/li>\n\n\n\n<li>\u2018name\u2019 field is set to \u2018gpt2\u2019, which identifies the model to load.<\/li>\n\n\n\n<li>\u2018componentSpecs\u2019 overrides Pod spec fields such as limits\/requests and tolerations so that the Kubernetes scheduler runs the Pods on GPU nodes.<\/li>\n\n\n\n<li>\u2018protocol\u2019 field selects the widely adopted <a href=\"https:\/\/github.com\/kubeflow\/kfserving\/blob\/master\/docs\/predict-api\/v2\/required_api.md\" target=\"_blank\" rel=\"noreferrer noopener\">inference protocol<\/a>.<\/li>\n\n\n\n<li>\u2018annotations\u2019 direct Azure Monitor to scrape real-time metrics.<\/li>\n<\/ul>\n\n\n\n<p>Once you deploy it, you can check the logs to verify that the model loaded, as follows:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nkubectl logs &lt;podname&gt; -c gpt2\n=============================\n== Triton Inference Server ==\n=============================\nI0531 14:01:19.977429 1 metrics.cc:193] GPU 0: Tesla V100-PCIE-16GB\nI0531 14:01:19.977639 1 server.cc:119] Initializing Triton Inference Server\nI0531 14:01:21.819524 1 onnx_backend.cc:198] Creating instance gpt2_0_gpu0 on GPU 0 (7.0) using model.onnx\nI0531 14:01:24.692839 1 model_repository_manager.cc:925] successfully loaded 'gpt2' version 1\nI0531 14:01:24.695776 1 http_server.cc:2679] Started HTTPService at 0.0.0.0:9000\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" 
id=\"4-run-inference-requests-with-your-deployed-model\">4. Run inference requests with your deployed model<\/h2>\n\n\n\n<p>Now that we have deployed our model, we are able to perform text generation in real time. This can be done by sending REST requests directly to our productionized model. The process takes three steps: tokenizing our input, sending the request, and decoding the resulting tokens. This is shown in detail below.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tokenize the input sentence using the Hugging Face GPT-2 pre-trained tokenizer:<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom transformers import GPT2Tokenizer\ntokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\ninput_text = 'I love Artificial Intelligence'\ninput_ids = tokenizer.encode(input_text, return_tensors='tf')\n<\/pre><\/div>\n\n\n<ul class=\"wp-block-list\">\n<li>Now we can send the tokens by constructing the input payload:<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport requests\n\nshape = input_ids.shape.as_list()\npayload = {\n           \"inputs\": [\n               {\"name\": \"input_ids:0\",\n                \"datatype\": \"INT32\",\n                \"shape\": shape,\n                \"data\": input_ids.numpy().tolist()\n                },\n               ...\n               ]\n           }\nres = requests.post(\n  'http:\/\/&lt;ingressIP&gt;\/seldon\/seldon\/gpt2gpu\/v2\/models\/gpt2\/infer',\n  json=payload\n)\n<\/pre><\/div>\n\n\n<ul class=\"wp-block-list\">\n<li>Our GPT-2 model returns <em>logits<\/em>: unnormalized scores over the vocabulary for the next token, given the input sequence. 
Following the \u201cgreedy\u201d approach, we pick the highest-scoring token, decode it to a string, and append it to the input sentence.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nnext_token_str = postprocess_response(res)\ngenerated_sentence += ' ' + next_token_str\n<\/pre><\/div>\n\n\n<ul class=\"wp-block-list\">\n<li>We repeat this to generate the full synthetic sentence:<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n'I love Artificial Intelligence. I love the way it\u2019s designed'\n<\/pre><\/div>\n\n\n<p>The full end-to-end implementation can be found in our <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/triton_gpt2_example_azure.html\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-2 Notebook<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"5-visualize-monitoring-metrics-with-azure-monitor\">5. Visualize monitoring metrics with Azure Monitor<\/h2>\n\n\n\n<p>We are now able to visually monitor the real-time metrics generated by our Seldon model by enabling <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/azure-monitor\/containers\/container-insights-overview\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Monitor Container Insights<\/a> in the AKS cluster. We can navigate to the Insights blade, check whether the resources\/limits configured for the SeldonDeployment are within healthy thresholds, and monitor the changes in memory or CPU usage during model inference.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture15-1024x520.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<p>In addition to container metrics, we can collect the specialized metrics generated by the Seldon Triton orchestrator. 
Azure Monitor Container Insights can scrape Prometheus metrics from declared endpoints out of the box, so there is no need to install and operate a Prometheus server. To learn more, see <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/azure-monitor\/containers\/container-insights-prometheus-integration\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft docs: Prometheus metrics with Container insights<\/a>. In our case, we can now visualize the real-time metrics with specialized dashboards, as demonstrated in our <a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/examples\/triton_gpt2_example_azure.html\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-2 Notebook<\/a> and the following example dashboard:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture16-1024x790.webp\" alt=\"Placeholder\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"uncover-repeatable-ml-deployment-processes-today\">Uncover repeatable ML deployment processes today<\/h2>\n\n\n\n<p>In this blog, we have covered a repeatable and scalable process to deploy a GPT-2 NLP model as a fully-fledged microservice with real-time metrics, enabling observability and monitoring at scale using Seldon Core in Azure.<\/p>\n\n\n\n<p>More specifically, we were able to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Fetch the trained GPT-2 Model with HuggingFace and export to ONNX.<\/li>\n\n\n\n<li>Set up Kubernetes Environment and Upload Model Artifact.<\/li>\n\n\n\n<li>Deploy ONNX Model with Seldon Core to Azure Kubernetes Service.<\/li>\n\n\n\n<li>Send requests to generate text with deployed GPT-2 Model.<\/li>\n\n\n\n<li>Visualize monitoring metrics with Azure dashboards.<\/li>\n<\/ul>\n\n\n\n<p>These workflows are continuously being refined and evolved through the Seldon Core open source project, and advanced state-of-the-art 
algorithms on outlier detection, concept drift, explainability, and more are improving continuously\u2014if you are interested in learning more or contributing please feel free to reach out. If you are interested in further hands-on examples of scalable deployment strategies of machine learning models, you can check out:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/towardsdatascience.com\/production-machine-learning-monitoring-outliers-drift-explainers-statistical-performance-d9b1d02ac158\" target=\"_blank\" rel=\"noreferrer noopener\">Production machine learning monitoring<\/a>: Outliers, drift, explainers, and statistical performance.<\/li>\n\n\n\n<li>Real-time machine learning at scale using <a href=\"https:\/\/towardsdatascience.com\/real-time-stream-processing-for-machine-learning-at-scale-with-spacy-kafka-seldon-core-6360f2fedbe\" target=\"_blank\" rel=\"noreferrer noopener\">SpaCy, Kafka, and Seldon Core<\/a>.<\/li>\n\n\n\n<li><a href=\"https:\/\/docs.seldon.io\/projects\/seldon-core\/en\/latest\/workflow\/github-readme.html\" target=\"_blank\" rel=\"noreferrer noopener\">Seldon Core quick-start documentation<\/a>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This post was co-authored by Alejandro Saucedo, Director of Machine Learning Engineering at Seldon Technologies. 
About the co-author: Alejandro leads teams of machine learning engineers focused on the scalability and extensibility of machine learning deployment and monitoring products with over five million installations.<\/p>\n","protected":false},"author":5562,"featured_media":87060,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[308,1668,158,2272,1824],"content-type":[340],"topic":[2238,2244],"programming-languages":[],"coauthors":[2326],"class_list":["post-87048","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-azure-kubernetes-service","tag-gpt-2","tag-kubernetes","tag-microsoft","tag-onnx-runtime","content-type-tutorials-and-demos","topic-ai-machine-learning","topic-devops","review-flag-1593580362-584","review-flag-1593580428-734","review-flag-1593580419-521","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-5-1593580453-725","review-flag-7-1593580463-151","review-flag-artif-1680214273-578","review-flag-free-1593619513-693","review-flag-iot-1680213327-385","review-flag-lever-1593580265-989","review-flag-machi-1680214156-53","review-flag-ml-1680214110-748"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" 
\/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"This post was co-authored by Alejandro Saucedo, Director of Machine Learning Engineering at Seldon Technologies. About the co-author: Alejandro leads teams of machine learning engineers focused on the scalability and extensibility of machine learning deployment and monitoring products with over five million installations.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2021-07-09T16:00:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-05-30T22:36:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.png\" \/>\n\t<meta property=\"og:image:width\" content=\"894\" \/>\n\t<meta property=\"og:image:height\" content=\"599\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Elena Neroslavskaya\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.png\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Elena Neroslavskaya\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/\"},\"author\":[{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/elena-neroslavskaya\\\/\",\"@type\":\"Person\",\"@name\":\"Elena Neroslavskaya\"}],\"headline\":\"Simple steps to create scalable processes to deploy ML models as microservices\",\"datePublished\":\"2021-07-09T16:00:18+00:00\",\"dateModified\":\"2025-05-30T22:36:04+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/\"},\"wordCount\":1627,\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/Picture11.webp\",\"keywords\":[\"Azure Kubernetes Service\",\"GPT-2\",\"Kubernetes\",\"Microsoft\",\"ONNX 
Runtime\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/\",\"name\":\"Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/Picture11.webp\",\"datePublished\":\"2021-07-09T16:00:18+00:00\",\"dateModified\":\"2025-05-30T22:36:04+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#primaryimage\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/Picture11.webp\",\"cont
entUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/07\\\/Picture11.webp\",\"width\":894,\"height\":599,\"caption\":\"diagram, schematic\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2021\\\/07\\\/09\\\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Simple steps to create scalable processes to deploy ML models as microservices\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability\",\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\",\"name\":\"Microsoft Open Source 
Blog\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"contentUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source Blog\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/","og_locale":"en_US","og_type":"article","og_title":"Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source Blog","og_description":"This post was co-authored by Alejandro Saucedo, Director of Machine Learning Engineering at Seldon Technologies. 
About the co-author: Alejandro leads teams of machine learning engineers focused on the scalability and extensibility of machine learning deployment and monitoring products with over five million installations.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2021-07-09T16:00:18+00:00","article_modified_time":"2025-05-30T22:36:04+00:00","og_image":[{"width":894,"height":599,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.png","type":"image\/png"}],"author":"Elena Neroslavskaya","twitter_card":"summary_large_image","twitter_image":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.png","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Elena Neroslavskaya","Est. reading time":"7 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/elena-neroslavskaya\/","@type":"Person","@name":"Elena Neroslavskaya"}],"headline":"Simple steps to create scalable processes to deploy ML models as 
microservices","datePublished":"2021-07-09T16:00:18+00:00","dateModified":"2025-05-30T22:36:04+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/"},"wordCount":1627,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.webp","keywords":["Azure Kubernetes Service","GPT-2","Kubernetes","Microsoft","ONNX Runtime"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/","url":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/","name":"Simple steps to create scalable processes to deploy ML models as microservices | Microsoft Open Source 
Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.webp","datePublished":"2021-07-09T16:00:18+00:00","dateModified":"2025-05-30T22:36:04+00:00","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2021\/07\/Picture11.webp","width":894,"height":599,"caption":"diagram, schematic"},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2021\/07\/09\/simple-steps-to-create-scalable-processes-to-deploy-ml-models-as-microservices\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Simple steps to create scalable processes to deploy ML models as 
microservices"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_animated_featured_image":null,"bloginabox_display_generated_audio":false,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/87048","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/5562"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=87048"}],"version-history":[{"count":2,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/87048\/revisions"}],"predecessor-version":[{"id":97506,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/87048\/revisions\/97506"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/87060"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=87048"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=87048"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=87048"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=87048"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=87048"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=87048"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}