{"id":97917,"date":"2025-07-07T08:00:00","date_gmt":"2025-07-07T15:00:00","guid":{"rendered":""},"modified":"2026-03-05T09:59:59","modified_gmt":"2026-03-05T17:59:59","slug":"optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/","title":{"rendered":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3"},"content":{"rendered":"\n<p>Large Language Models (LLMs)&nbsp;have revolutionized AI, but fine-tuning these massive models&nbsp;remains&nbsp;a significant challenge\u2014especially for organizations with limited computing resources.&nbsp;To address this, the Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.&nbsp;<\/p>\n\n\n\n<p>Fine-tuning is the process of adapting pre-trained LLMs to perform better on specific tasks by training them on specialized data\u2014whether that\u2019s domain-specific text&nbsp;(legal&nbsp;or medical), task-specific examples&nbsp;(summaries&nbsp;or chat transcripts), or business-specific content&nbsp;(internal&nbsp;documentation or customer interactions). 
This critical step allows organizations to significantly improve model accuracy, tailor outputs to their needs, and&nbsp;leverage&nbsp;powerful foundation models without the enormous cost of training from scratch.&nbsp;<\/p>\n\n\n\n<p>In this post, we share&nbsp;best practices based on&nbsp;insights from our experiments fine-tuning Microsoft&#8217;s Phi-3-mini-128k model using&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>KAITO<\/strong><\/a>,&nbsp;a&nbsp;Cloud Native Computing Foundation (CNCF)-governed&nbsp;open source&nbsp;project&nbsp;that&nbsp;simplifies running AI workloads on Kubernetes&nbsp;and the associated managed add-on&nbsp;for Azure Kubernetes Service (AKS).&nbsp;You&#8217;ll&nbsp;learn&nbsp;strategies to help you fine-tune powerful LLMs even within&nbsp;reasonable hardware constraints, making advanced AI&nbsp;a realistic&nbsp;option&nbsp;for your organization.&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/kaito-project\/kaito\">Learn more about KAITO<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"understanding-the-model\">Understanding the&nbsp;model&nbsp;<\/h2>\n\n\n\n<p>Understanding our&nbsp;model\u2019s&nbsp;characteristics&nbsp;is&nbsp;crucial for developing effective fine-tuning strategies, especially when working with limited computational resources.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Phi-3-mini-128k-instruct is a 3.8 billion parameter model (~14.6 GB)\u2014compact enough to run efficiently on most single- or dual-GPU setups, yet powerful enough to support a wide range of tasks.&nbsp;<\/p>\n\n\n\n<p>Microsoft provides both 4k and 128k context length variants of this model&nbsp;<a href=\"https:\/\/huggingface.co\/microsoft\" 
target=\"_blank\" rel=\"noreferrer noopener\">on Hugging Face<\/a>.&nbsp;These numbers refer to the maximum number of tokens the model can process at once\u2014essentially the&nbsp;amount of text the AI can &#8220;remember&#8221; in a single conversation. The 4k variant handles about 6-7 pages of text (~3,000 words), while the 128k version processes an entire novel&#8217;s worth of content simultaneously (~90,000 words).&nbsp;&nbsp;<\/p>\n\n\n\n<p>However, longer context windows require significantly more memory and computational resources during fine-tuning, which is one of the key challenges&nbsp;we&#8217;ll&nbsp;address in this article.&nbsp;&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-memory-challenge-in-fine-tuning\">The&nbsp;memory&nbsp;challenge in&nbsp;fine-tuning&nbsp;<\/h2>\n\n\n\n<p>During the fine-tuning process, memory becomes the primary bottleneck\u2014especially when&nbsp;training on&nbsp;longer sequences of tokens (larger chunks of text) or when working with limited hardware.&nbsp;<\/p>\n\n\n\n<p>Unlike inference\u2014which simply generates text\u2014fine-tuning must store&nbsp;additional&nbsp;data like computational graphs and gradients. 
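<\/p>

<p>As a rough sketch of where that extra memory goes (illustrative numbers; this assumes full fine-tuning with the Adam optimizer, a heavier setup than the LoRA-based approaches covered below), the fixed per-parameter state alone is substantial:<\/p>

```python
# Back-of-the-envelope estimate of the fixed training state for full
# fine-tuning with Adam: bf16 weights and gradients plus fp32 optimizer
# moments. Activations come on top of this and grow with sequence length.

def training_state_gb(params: float) -> dict:
    weights = params * 2 / 1e9    # bf16 weights: 2 bytes per parameter
    grads = params * 2 / 1e9      # bf16 gradients: 2 bytes per parameter
    optimizer = params * 8 / 1e9  # Adam m and v moments in fp32: 8 bytes per parameter
    return {
        "weights_gb": weights,
        "gradients_gb": grads,
        "optimizer_gb": optimizer,
        "total_gb": weights + grads + optimizer,
    }

# For a 3.8B-parameter model this fixed state is already ~45 GB,
# before a single activation is stored.
print(training_state_gb(3.8e9))
```

<p>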
As a result,&nbsp;inference memory grows&nbsp;roughly linearly with sequence length, while&nbsp;fine-tuning memory grows&nbsp;non-linearly&nbsp;and can quickly exhaust&nbsp;GPU&nbsp;resources.&nbsp;<\/p>\n\n\n\n<p>In practice, while inference might handle 32,000 tokens on a single GPU, fine-tuning the same model might be limited to just 2,000-4,000 tokens before running out of memory.&nbsp;This creates a fundamental challenge: even though Phi-3-mini-128k-instruct can process novel-length text during use, memory constraints during fine-tuning often force developers to work with much shorter sequences, limiting the model&#8217;s full potential.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-memory-optimizations\">Key&nbsp;memory&nbsp;optimizations&nbsp;<\/h2>\n\n\n\n<p>To&nbsp;identify&nbsp;effective memory optimizations, we leveraged&nbsp;KAITO&#8217;s&nbsp;existing&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\/blob\/main\/presets\/workspace\/inference\/text-generation\/inference_api.py\" target=\"_blank\" rel=\"noreferrer noopener\">inference<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\/blob\/main\/presets\/workspace\/tuning\/text-generation\/fine_tuning.py\" target=\"_blank\" rel=\"noreferrer noopener\">fine-tuning<\/a>&nbsp;services with added memory logging to conduct our tests. 
All experiments were conducted on an NVIDIA A100 (80GB) GPU using standard KAITO&nbsp;deployments (code available in our&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub repository<\/a>).&nbsp;<\/p>\n\n\n\n<p>For those looking to reproduce these results, both our fine-tuning and inference services&nbsp;now&nbsp;expose a&nbsp;<em>\/metrics<\/em>&nbsp;endpoint that allows you to track memory usage with these optimizations in your own environment.&nbsp;<\/p>\n\n\n\n<p>Through this approach, we&nbsp;identified&nbsp;several effective strategies for&nbsp;optimizing&nbsp;memory usage during fine-tuning:&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1-precision-format-selection\">1. Precision&nbsp;format&nbsp;selection&nbsp;<\/h2>\n\n\n\n<p>Here\u2019s how memory usage compares across formats:&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Precision&nbsp;format<\/strong>&nbsp;<\/td><td><strong>Memory&nbsp;usage<\/strong>&nbsp;<\/td><\/tr><tr><td>Float32&nbsp;<\/td><td>15.73 GB&nbsp;<\/td><\/tr><tr><td>Float16\/BFloat16&nbsp;<\/td><td>8.09 GB&nbsp;<\/td><\/tr><tr><td>8-bit Quantized (INT8)&nbsp;<\/td><td>4.64 GB&nbsp;<\/td><\/tr><tr><td>4-bit Quantized (INT4)&nbsp;<\/td><td>3.00 GB&nbsp;<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>As&nbsp;shown, selecting&nbsp;the right precision format can be the difference between being able to fine-tune a model or not. 
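<\/p>

<p>These measured numbers track a simple bytes-per-parameter estimate. The sketch below assumes a 3.8B-parameter model; the measured figures in the table run slightly higher because of quantization constants, non-quantized layers, and runtime overhead:<\/p>

```python
# Naive weight-memory estimate: parameter count x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4.0, "float16/bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params: float, fmt: str) -> float:
    return params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>16}: {weight_memory_gb(3.8e9, fmt):5.2f} GB")
```

<p>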
The 80% memory savings from 4-bit quantization allows you to work with models that would otherwise require&nbsp;high-end GPUs or distributed&nbsp;setups, significantly reducing your cloud&nbsp;compute&nbsp;costs.&nbsp;<\/p>\n\n\n\n<p>In KAITO, you can&nbsp;configure precision using&nbsp;your&nbsp;tuning&nbsp;parameters&nbsp;ConfigMap&nbsp;like so:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\nModelConfig:\n  torch_dtype: \"bfloat16\" # Options: float32, float16, bfloat16\n\nQuantizationConfig:\n  load_in_4bit: true        # Enable 4-bit quantization\n  bnb_4bit_quant_type: \"nf4\"\n  bnb_4bit_compute_dtype: \"bfloat16\"\n  bnb_4bit_use_double_quant: true\n<\/pre><\/div>\n\n\n<p>For more details on precision formats and&nbsp;parameters, see the&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\/blob\/main\/charts\/kaito\/workspace\/templates\/qlora-params.yaml\" target=\"_blank\" rel=\"noreferrer noopener\">KAITO QLoRA parameters template<\/a>. For&nbsp;full examples,&nbsp;visit the&nbsp;<a href=\"https:\/\/github.com\/kaito-project\/kaito\/tree\/main\/examples\/fine-tuning\" target=\"_blank\" rel=\"noreferrer noopener\">fine-tuning examples directory<\/a>&nbsp;to see how these settings are used in practice.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2-lora-vs-qlora\">2. LoRA vs.&nbsp;QLoRA&nbsp;<\/h2>\n\n\n\n<p><strong>LoRA (Low-Rank Adaptation)<\/strong>&nbsp;fine-tunes&nbsp;models by freezing the original weights and inserting small trainable adapter layers. 
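<\/p>

<p>For intuition about how small these adapters are, the arithmetic is simple: next to a frozen d_out x d_in weight matrix, LoRA trains only two rank-r matrices. The dimensions below are illustrative (a 3072-wide projection over 32 layers), not Phi-3&#8217;s exact module shapes:<\/p>

```python
# LoRA trains B (d_out x r) and A (r x d_in) beside a frozen W (d_out x d_in),
# so trainable parameters per adapted matrix = r * (d_in + d_out).

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Illustrative: rank-8 adapters on four 3072x3072 attention projections
# in each of 32 layers, against a ~3.8B-parameter base model.
per_matrix = lora_params(3072, 3072, 8)
trainable = per_matrix * 4 * 32
print(f"trainable parameters: {trainable:,} "
      f"({trainable / 3.8e9:.3%} of the base model)")
```

<p>The exact fraction depends on the rank and on which modules are targeted, but it stays well under 1% in the configurations discussed here.<\/p>

<p>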
This drastically reduces&nbsp;compute&nbsp;by training only a tiny fraction of the model parameters.&nbsp;<\/p>\n\n\n\n<p><strong>QLoRA&nbsp;(Quantized LoRA)<\/strong>&nbsp;takes this further by storing the frozen weights in 4-bit precision instead of 16-bit, reducing memory usage without sacrificing&nbsp;LoRA\u2019s&nbsp;core benefits.&nbsp;<\/p>\n\n\n\n<p><strong>Memory&nbsp;efficiency<\/strong>&nbsp;<br>Standard LoRA (16-bit) showed steep memory growth\u2014exceeding 80GB at ~3,500 tokens.&nbsp;QLoRA, in contrast,&nbsp;maintained&nbsp;a stable profile and reduced memory usage by ~75%, enabling fine-tuning with much longer sequences on the same hardware.&nbsp;<\/p>\n\n\n\n<p>This comes with a modest tradeoff in processing speed due to quantization overhead, but the benefits are clear\u2014especially for domains like legal or medical, where preserving long context improves model&nbsp;comprehension and&nbsp;accuracy.&nbsp;<\/p>\n\n\n\n<p><strong>In short:<\/strong>&nbsp;LoRA is faster but&nbsp;memory-heavy.&nbsp;QLoRA&nbsp;is slower but far more&nbsp;memory-efficient.&nbsp;<\/p>\n\n\n\n<p>In KAITO, you can enable&nbsp;QLoRA&nbsp;using your tuning parameters&nbsp;ConfigMap&nbsp;like so:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\nQuantizationConfig:\n  load_in_4bit: true\n  bnb_4bit_quant_type: \"nf4\"\n  bnb_4bit_compute_dtype: \"bfloat16\"\n  bnb_4bit_use_double_quant: true\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"3-batch-size-optimization\">3. 
Batch&nbsp;size&nbsp;optimization&nbsp;<\/h2>\n\n\n\n<p>Contrary to intuition, increasing batch size&nbsp;can deliver several benefits:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Improved memory efficiency per total tokens processed&nbsp;<\/li>\n\n\n\n<li>Higher training throughput, with&nbsp;more tokens processed per second&nbsp;<\/li>\n\n\n\n<li>Better&nbsp;utilization&nbsp;of&nbsp;available compute&nbsp;resources&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>Choosing the&nbsp;right batch size can&nbsp;reduce&nbsp;fine-tuning time and cost by&nbsp;2-3&nbsp;times. For example, processing 10,000 training examples with&nbsp;a&nbsp;batch size&nbsp;of&nbsp;2 instead of&nbsp;1 could reduce a 12-hour job to just&nbsp;five to six&nbsp;hours while using the same hardware.&nbsp;<\/p>\n\n\n\n<p>While&nbsp;a&nbsp;batch size&nbsp;of&nbsp;1&nbsp;is ideal for handling&nbsp;the longest individual sequences, larger batch sizes (2-4)&nbsp;tend to&nbsp;offer a better balance of speed and memory efficiency when working within&nbsp;a fixed token budget.&nbsp;This results in faster training and more effective resource use.&nbsp;<\/p>\n\n\n\n<p>In&nbsp;KAITO, you can&nbsp;configure batch size&nbsp;using your tuning parameters&nbsp;ConfigMap&nbsp;like so:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\nTrainingArguments:\n  per_device_train_batch_size: 2  # Adjust based on your sequence length needs\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"4-lora-rank-and-target-module-selection\">4. 
LoRA&nbsp;rank&nbsp;and&nbsp;target&nbsp;module&nbsp;selection&nbsp;<\/h2>\n\n\n\n<p>In LoRA fine-tuning, the&nbsp;<strong>rank<\/strong>&nbsp;parameter&nbsp;determines&nbsp;the size of the adapter layers and how much the model can adapt to new data.&nbsp;Higher ranks provide greater&nbsp;capacity&nbsp;for learning new patterns\u2014but also increase memory and&nbsp;compute&nbsp;requirements.&nbsp;<\/p>\n\n\n\n<p>We evaluated&nbsp;ranks of&nbsp;8, 16, 64, 256, and 16,384. Our findings:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ranks 8-256&nbsp;demonstrated&nbsp;minimal differences in memory usage and processing speed.&nbsp;<\/li>\n\n\n\n<li>Very&nbsp;high&nbsp;ranks (like&nbsp;16,384)&nbsp;were&nbsp;significantly&nbsp;slower and&nbsp;more&nbsp;memory-intensive.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>We also explored which parts of the model to target:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Focusing&nbsp;only&nbsp;on&nbsp;attention layers&nbsp;affected just ~0.04%&nbsp;of the model.&nbsp;<\/li>\n\n\n\n<li>Including&nbsp;MLP layers&nbsp;increased this to&nbsp;~0.12%, with little&nbsp;added&nbsp;memory cost.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p><strong>Recommendation:&nbsp;<\/strong><\/p>\n\n\n\n<p>Start with smaller ranks (8\u201364), which deliver comparable quality to higher values at a fraction of the resource cost\u2014ideal for most production and business use cases.&nbsp;<\/p>\n\n\n\n<p>Here\u2019s&nbsp;an example config for memory-efficient tuning:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\nLoraConfig:\n  r: 8\n  lora_alpha: 8\n  lora_dropout: 0.0\n  target_modules: [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\"]\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"5-pytorch-memory-management\">5.&nbsp;PyTorch&nbsp;memory&nbsp;management&nbsp;<\/h2>\n\n\n\n<p>KAITO uses&nbsp;PyTorch&nbsp;to run 
fine-tuning jobs on GPUs. By default,&nbsp;PyTorch&nbsp;reserves large chunks of GPU memory early on, which can cause out-of-memory (OOM) errors\u2014even when enough memory&nbsp;is technically available.&nbsp;<\/p>\n\n\n\n<p>To address this, we enabled the following setting:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nPYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n<\/pre><\/div>\n\n\n<p>It enables&nbsp;PyTorch&nbsp;to&nbsp;allocate&nbsp;memory in smaller, flexible pieces&nbsp;rather than&nbsp;all at once, reducing&nbsp;memory waste&nbsp;and improving reliability\u2014especially on shared or busy GPUs.&nbsp;<\/p>\n\n\n\n<p>This setting is now enabled by default in recent versions of KAITO. However, if&nbsp;you&#8217;re&nbsp;using an older release or running in a custom environment, you may need to configure it manually.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"best-practices\">Best&nbsp;practices&nbsp;<\/h2>\n\n\n\n<p>Based on our experiments with Phi-3-mini-128k, we recommend the following approach for efficient fine-tuning with KAITO:&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Start with&nbsp;QLoRA<\/strong>&nbsp;for memory-efficient fine-tuning, especially with longer sequences or limited GPU resources.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li><strong>Optimize&nbsp;batch size<\/strong>&nbsp;rather than defaulting to batch size 1. For many scenarios, a larger batch size processing the same total tokens will be more efficient.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li><strong>Start with low LoRA rank values (8\u201364)<\/strong>. 
Higher ranks offer more adaptability but&nbsp;don\u2019t&nbsp;always improve model quality\u2014so only scale up if your task shows clear benefit.&nbsp;<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"complete-kaito-configmap-example\">Complete KAITO&nbsp;ConfigMap&nbsp;example&nbsp;<\/h3>\n\n\n\n<p>A complete YAML&nbsp;ConfigMap&nbsp;with these optimizations is <a href=\"https:\/\/github.com\/kaito-project\/kaito\/blob\/268b12891126e7137cc698fe59561c816436365d\/examples\/fine-tuning\/kaito_configmap_tuning_phi_3.yaml\" target=\"_blank\" rel=\"noreferrer noopener\">available here.<\/a>&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"continue-to-fine-tune-llms\">Continue to fine-tune LLMs<\/h2>\n\n\n\n<p>Efficiently fine-tuning LLMs requires understanding the complex interplay between model architecture, memory management, and training dynamics. Our experiments&nbsp;demonstrate&nbsp;that with the right combination of techniques\u2014particularly&nbsp;QLoRA, optimized batch sizes, and appropriate LoRA configurations\u2014it&#8217;s&nbsp;possible to fine-tune powerful models like Phi-3 within&nbsp;reasonable hardware requirements.&nbsp;<\/p>\n\n\n\n<p>By implementing these optimizations in KAITO, you can work with larger models and longer sequences, even with limited computational resources, advancing the accessibility and practical application of&nbsp;state-of-the-art&nbsp;language models in your projects.&nbsp;<\/p>\n\n\n<div class=\"wp-block-msxcm-cta-block\" data-moray data-bi-an=\"CTA Block\">\n\t<div class=\"card d-block mx-ng mx-md-0\">\n\t\t<div class=\"row no-gutters material-color-brand-light bg-fabric-white\">\n\n\t\t\t\t\t\t\t<div class=\"col-md-4\">\n\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-1024x683.jpg\" class=\"card-img img-object-cover\" alt=\"Computer programmer working with 
male colleague in office\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-1024x683.jpg 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-388x259.jpg 388w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-768x513.jpg 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-1536x1025.jpg 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-2048x1367.jpg 2048w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-450x300.jpg 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2025\/01\/CLO25-Security-Lifestyle-Getty-1084167628-650x434.jpg 650w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t<\/div>\n\t\t\t\n\t\t\t<div class=\"d-flex col-md\">\n\t\t\t\t<div class=\"card-body align-self-center p-4 p-md-5\">\n\t\t\t\t\t\n\t\t\t\t\t<h2>KAITO<\/h2>\n\n\t\t\t\t\t<div class=\"mb-3\">\n\t\t\t\t\t\t<p>A\u00a0CNCF-governed\u00a0open source\u00a0project\u00a0that\u00a0simplifies running AI workloads on Kubernetes.<\/p>\n\t\t\t\t\t<\/div>\n\n\t\t\t\t\t\t\t\t\t\t\t<div class=\"link-group\">\n\t\t\t\t\t\t\t<a href=\"https:\/\/github.com\/kaito-project\/kaito\" class=\"btn btn-link text-decoration-none p-0\" target=\"_blank\">\n\t\t\t\t\t\t\t\t<span>Explore more<\/span>\n\t\t\t\t\t\t\t\t<span class=\"glyph-append glyph-append-chevron-right glyph-append-xsmall\"><\/span>\n\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t<\/div>\n\n\t\t\t\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n\n<p>We encourage you to try these techniques in your own fine-tuning workflows and share your experience with the KAITO community on&nbsp;<a 
href=\"https:\/\/github.com\/kaito-project\/kaito\/discussions\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Discussions<\/a>.&nbsp;We\u2019re&nbsp;excited to see what you build.&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>This research was conducted using NVIDIA A100 (80GB) GPU, CUDA: 12.4,&nbsp;PyTorch: 2.2.0<\/em>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.<\/p>\n","protected":false},"author":6194,"featured_media":95490,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[136,158],"content-type":[340],"topic":[2238,2241],"programming-languages":[2265],"coauthors":[2607],"class_list":["post-97917","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-github","tag-kubernetes","content-type-tutorials-and-demos","topic-ai-machine-learning","topic-cloud","programming-languages-pytorch","review-flag-1593580362-584","review-flag-1593580428-734","review-flag-1-1593580432-963","review-flag-2-1593580437-411","review-flag-3-1593580442-169","review-flag-4-1593580448-609","review-flag-5-1593580453-725","review-flag-6-1593580457-852","review-flag-7-1593580463-151","review-flag-8-1593580468-572","review-flag-alway-1593580310-39","review-flag-lever-1593580265-989","review-flag-new-1593580248-669"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source Blog<\/title>\n<meta name=\"description\" content=\"The Cloud Native team at Azure is 
working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn more.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn more.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-07T15:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-05T17:59:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1170\" \/>\n\t<meta property=\"og:image:height\" content=\"640\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Ishaan Sehgal\u00a0\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta 
name=\"twitter:data1\" content=\"Ishaan Sehgal\u00a0\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\"},\"author\":[{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/author\/ishaan-sehgal\/\",\"@type\":\"Person\",\"@name\":\"Ishaan Sehgal\u00a0\"}],\"headline\":\"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3\",\"datePublished\":\"2025-07-07T15:00:00+00:00\",\"dateModified\":\"2026-03-05T17:59:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\"},\"wordCount\":1768,\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp\",\"keywords\":[\"GitHub\",\"Kubernetes\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-larg
e-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\",\"name\":\"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source Blog\",\"isPartOf\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp\",\"datePublished\":\"2025-07-07T15:00:00+00:00\",\"dateModified\":\"2026-03-05T17:59:59+00:00\",\"description\":\"The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn 
more.\",\"breadcrumb\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp\",\"width\":1170,\"height\":640},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/opensource.microsoft.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#website\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, 
interoperability\",\"publisher\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"contentUrl\":\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source Blog\"},\"image\":{\"@id\":\"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source Blog","description":"The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn more.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source Blog","og_description":"The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn more.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2025-07-07T15:00:00+00:00","article_modified_time":"2026-03-05T17:59:59+00:00","og_image":[{"width":1170,"height":640,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.png","type":"image\/png"}],"author":"Ishaan Sehgal\u00a0","twitter_card":"summary_large_image","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Ishaan Sehgal\u00a0","Est. 
reading time":"7 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/ishaan-sehgal\/","@type":"Person","@name":"Ishaan Sehgal\u00a0"}],"headline":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3","datePublished":"2025-07-07T15:00:00+00:00","dateModified":"2026-03-05T17:59:59+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/"},"wordCount":1768,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp","keywords":["GitHub","Kubernetes"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/","url":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/","name":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3 | Microsoft Open Source 
Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp","datePublished":"2025-07-07T15:00:00+00:00","dateModified":"2026-03-05T17:59:59+00:00","description":"The Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users.\u00a0Learn more.","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2024\/06\/STB13_Rick_03.webp","width":1170,"height":640},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2025\/07\/07\/optimizing-memory-usage-in-large-language-models-fine-tuning-with-kaito-best-practices-from-phi-3\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensour
ce.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Optimizing memory usage in large language models fine-tuning with KAITO: Best practices from Phi-3"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_display_generated_audio":false,"msxcm_animated_featured_image":null,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/97917","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/6194"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=97917"}],"version-history":[{"count":14,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/97917\/revisions"}],"predecessor-version":[{"id":98358,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/97917\/revisions\/98358"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/95490"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=97917"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=97917"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=97917"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=97917"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=97917"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=97917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}