
Optimizing memory usage in large language model fine-tuning with KAITO: Best practices from Phi-3

Large Language Models (LLMs) have revolutionized AI, but fine-tuning these massive models remains a significant challenge—especially for organizations with limited computing resources. To address this, the Cloud Native team at Azure is working to make AI on Kubernetes more cost-effective and approachable for a broader range of users. 

Fine-tuning is the process of adapting pre-trained LLMs to perform better on specific tasks by training them on specialized data—whether that’s domain-specific text (legal or medical), task-specific examples (summaries or chat transcripts), or business-specific content (internal documentation or customer interactions). This critical step allows organizations to significantly improve model accuracy, tailor outputs to their needs, and leverage powerful foundation models without the enormous cost of training from scratch. 

In this post, we share best practices based on insights from our experiments fine-tuning Microsoft’s Phi-3-mini-128k model using KAITO, a Cloud Native Computing Foundation (CNCF)-governed open source project that simplifies running AI workloads on Kubernetes, and its associated managed add-on for Azure Kubernetes Service (AKS). You’ll learn strategies to help you fine-tune powerful LLMs even within reasonable hardware constraints, making advanced AI a realistic option for your organization.

Understanding the model 

Understanding our model’s characteristics is crucial for developing effective fine-tuning strategies, especially when working with limited computational resources.  

Phi-3-mini-128k-instruct is a 3.8 billion parameter model (~14.6 GB)—compact enough to run efficiently on most single- or dual-GPU setups, yet powerful enough to support a wide range of tasks. 
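
As a back-of-the-envelope check on that figure, the raw weight footprint is roughly the parameter count times the bytes per parameter for the storage precision. The sketch below is illustrative only; real checkpoints and the measured GPU usage reported later in this post also include buffers, quantization constants, and framework overhead, so exact numbers differ slightly.

# Rough weight-only footprint for a ~3.8B-parameter model at different precisions.
# Illustrative estimate only; measured usage (see the precision table below)
# also includes buffers, quantization constants, and framework overhead.
PARAMS = 3.8e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>17}: ~{PARAMS * nbytes / 1e9:.1f} GB")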

Microsoft provides both 4k and 128k context length variants of this model on Hugging Face. These numbers refer to the maximum number of tokens the model can process at once—essentially the amount of text the AI can “remember” in a single conversation. The 4k variant handles about 6-7 pages of text (~3,000 words), while the 128k version processes an entire novel’s worth of content simultaneously (~90,000 words).  

However, longer context windows require significantly more memory and computational resources during fine-tuning, which is one of the key challenges we’ll address in this article.  

The memory challenge in fine-tuning 

During the fine-tuning process, memory becomes the primary bottleneck—especially when training on longer sequences of tokens (larger chunks of text) or when working with limited hardware. 

Unlike inference, which simply generates text, fine-tuning must also store intermediate activations, gradients, and optimizer state. As a result, inference memory grows roughly linearly with sequence length, while fine-tuning memory grows non-linearly and can quickly exhaust GPU resources.

In practice, while inference might handle 32,000 tokens on a single GPU, fine-tuning the same model might be limited to just 2,000-4,000 tokens before running out of memory. This creates a fundamental challenge: even though Phi-3-mini-128k-instruct can process novel-length text during use, memory constraints during fine-tuning often force developers to work with much shorter sequences, limiting the model’s full potential. 
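
The gap is easy to reproduce even on a toy network. The following sketch (generic PyTorch on a small stack of linear layers, not KAITO’s training loop or Phi-3 itself) compares peak GPU memory for a gradient-free forward pass against a forward-plus-backward pass, where intermediate activations and gradients have to be kept around:

# Toy comparison of inference-style vs. training-style peak GPU memory.
# Requires a CUDA GPU; shapes are arbitrary and chosen only for illustration.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).to(device)
batch = torch.randn(8, 2048, 4096, device=device)

# Inference: no gradients, intermediate activations can be discarded layer by layer.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(batch)
print(f"inference peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")

# Training: activations are kept for the backward pass and gradients are stored.
torch.cuda.reset_peak_memory_stats()
model(batch).sum().backward()
print(f"training peak:  {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")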

Key memory optimizations 

To identify effective memory optimizations, we added memory logging to KAITO’s existing inference and fine-tuning services and used them to run our tests. All experiments were conducted on an NVIDIA A100 (80GB) GPU using standard KAITO deployments (code available in our GitHub repository).

For those looking to reproduce these results, both our fine-tuning and inference services now expose a /metrics endpoint that allows you to track memory usage with these optimizations in your own environment. 
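
For example, you can poll the endpoint while a job runs. The sketch below is minimal; the URL is a placeholder for however you expose the service from your environment, such as kubectl port-forward.

# Poll a KAITO service's /metrics endpoint during a fine-tuning run.
# METRICS_URL is a placeholder; point it at wherever the tuning or inference
# service is reachable from your environment (e.g. via kubectl port-forward).
import time
import requests

METRICS_URL = "http://localhost:8080/metrics"

for _ in range(10):
    response = requests.get(METRICS_URL, timeout=5)
    response.raise_for_status()
    print(response.text)
    time.sleep(30)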

Through this approach, we identified several effective strategies for optimizing memory usage during fine-tuning: 

1. Precision format selection 

Model weights can be stored at different numerical precisions, from 32-bit floats down to 4-bit quantized integers, and that choice largely determines the model’s memory footprint. Here’s how memory usage compares across formats:

Precision format          Memory usage
Float32                   15.73 GB
Float16/BFloat16           8.09 GB
8-bit Quantized (INT8)     4.64 GB
4-bit Quantized (INT4)     3.00 GB

As shown, selecting the right precision format can be the difference between being able to fine-tune a model and not being able to at all. The roughly 80% memory savings from 4-bit quantization lets you work with models that would otherwise require high-end GPUs or distributed setups, significantly reducing your cloud compute costs.

In KAITO, you can configure precision using your tuning parameters ConfigMap like so: 

ModelConfig: 
  torch_dtype: "bfloat16" # Options: float32, float16, bfloat16 
 
QuantizationConfig: 
  load_in_4bit: true       # Enable 4-bit quantization 
  bnb_4bit_quant_type: "nf4" 
  bnb_4bit_compute_dtype: "bfloat16" 
  bnb_4bit_use_double_quant: true 
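
For readers familiar with the underlying Hugging Face stack, these field names mirror the transformers and bitsandbytes options of the same name. A roughly equivalent standalone setup (a sketch, not KAITO’s internal code) looks like this:

# Loading Phi-3-mini in reduced precision with Hugging Face transformers,
# mirroring the ModelConfig / QuantizationConfig fields above. In practice
# you would pick one of the two options below, not load both.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-128k-instruct"

# Option 1: half precision, roughly 2 bytes per parameter.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Option 2: 4-bit NF4 quantization, roughly 0.5 bytes per parameter plus overhead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)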

For more details on precision formats and parameters, see the KAITO QLoRA parameters template. For full examples, visit the fine-tuning examples directory to see how these settings are used in practice. 

2. LoRA vs. QLoRA 

LoRA (Low-Rank Adaptation) fine-tunes models by freezing the original weights and inserting small trainable adapter layers. This drastically reduces compute by training only a tiny fraction of the model parameters. 

QLoRA (Quantized LoRA) takes this further by storing the frozen weights in 4-bit precision instead of 16-bit, reducing memory usage without sacrificing LoRA’s core benefits. 

Memory efficiency 
Standard LoRA (16-bit) showed steep memory growth—exceeding 80GB at ~3,500 tokens. QLoRA, in contrast, maintained a stable profile and reduced memory usage by ~75%, enabling fine-tuning with much longer sequences on the same hardware. 

This comes with a modest tradeoff in processing speed due to quantization overhead, but the benefits are clear—especially for domains like legal or medical, where preserving long context improves model comprehension and accuracy. 

In short: LoRA is faster but memory-heavy. QLoRA is slower but far more memory-efficient. 

In KAITO, you can enable QLoRA using your tuning parameters ConfigMap like so: 

QuantizationConfig: 
  load_in_4bit: true 
  bnb_4bit_quant_type: "nf4" 
  bnb_4bit_compute_dtype: "bfloat16" 
  bnb_4bit_use_double_quant: true 
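
For reference, the same recipe expressed directly with the Hugging Face peft library (a standalone QLoRA sketch under the same quantization settings, not KAITO’s internal implementation):

# QLoRA outside of KAITO: frozen 4-bit base weights plus small trainable adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "microsoft/Phi-3-mini-128k-instruct"

bnb_config = BitsAndBytesConfig(  # same 4-bit settings as the ConfigMap above
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)  # stabilizes norms, enables checkpointing

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.0,
    target_modules="all-linear",  # needs a recent peft release; module names vary by model
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapters are trainable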

3. Batch size optimization 

Contrary to intuition, increasing batch size can deliver several benefits: 

  • Improved memory efficiency per total tokens processed 
  • Higher training throughput, with more tokens processed per second 
  • Better utilization of available compute resources 

Choosing the right batch size can cut fine-tuning time and cost by a factor of 2-3. For example, processing 10,000 training examples with a batch size of 2 instead of 1 could reduce a 12-hour job to roughly five to six hours on the same hardware.

While a batch size of 1 is ideal for handling the longest individual sequences, larger batch sizes (2-4) tend to offer a better balance of speed and memory efficiency when working within a fixed token budget. This results in faster training and more effective resource use.  
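
As a rough illustration of that fixed-token-budget tradeoff (illustrative numbers only, not measurements from our runs):

# How batch size trades per-example sequence length against optimizer steps
# under a fixed per-step token budget (the rough OOM ceiling discussed above).
examples = 10_000
token_budget_per_step = 4_000  # assumed ceiling for this illustration

for batch_size in (1, 2, 4):
    max_seq_len = token_budget_per_step // batch_size
    steps = -(-examples // batch_size)  # ceiling division
    print(f"batch_size={batch_size}: up to {max_seq_len} tokens per example, {steps} steps")

Fewer, fuller steps keep the GPU busier, which is where the throughput gains come from, although per-step time also grows somewhat with batch size, so wall-clock savings are not perfectly proportional.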

In KAITO, you can configure batch size using your tuning parameters ConfigMap like so: 

TrainingArguments: 
  per_device_train_batch_size: 2  # Adjust based on your sequence length needs 

4. LoRA rank and target module selection 

In LoRA fine-tuning, the rank parameter determines the size of the adapter layers and how much the model can adapt to new data. Higher ranks provide greater capacity for learning new patterns—but also increase memory and compute requirements. 

We evaluated ranks of 8, 16, 64, 256, and 16,384. Our findings: 

  • Ranks 8-256 demonstrated minimal differences in memory usage and processing speed. 
  • Very high ranks (like 16,384) were significantly slower and more memory-intensive.

We also explored which parts of the model to target: 

  • Focusing only on attention layers affected just ~0.04% of the model. 
  • Including MLP layers increased this to ~0.12%, with little added memory cost. 

Recommendation: 

Start with smaller ranks (8–64), which deliver comparable quality to higher values at a fraction of the resource cost—ideal for most production and business use cases. 

Here’s an example config for memory-efficient tuning: 

LoraConfig: 
  r: 8 
  lora_alpha: 8 
  lora_dropout: 0.0 
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"] 
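
To see why rank 16,384 behaves so differently from ranks 8 through 256, note that a LoRA adapter on a d_out x d_in projection adds r x (d_in + d_out) trainable parameters, so adapter size grows linearly with r. A toy calculation with assumed layer shapes (illustrative only, not Phi-3’s exact architecture or the percentages measured above):

# Linear scaling of LoRA adapter size with rank r, for assumed layer shapes:
# 32 decoder layers, four 3072x3072 attention projections per layer.
layers, projections_per_layer, d = 32, 4, 3072

for r in (8, 64, 256, 16_384):
    adapter_params = layers * projections_per_layer * r * (d + d)
    print(f"r={r:>6}: ~{adapter_params / 1e6:,.0f}M trainable adapter parameters")

At r = 16,384 the "low-rank" factors are wider than the assumed 3,072-dimensional projections themselves, which is why memory and speed degrade so sharply, while ranks 8 through 256 all stay a small fraction of the base model.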

5. PyTorch memory management 

KAITO uses PyTorch to run fine-tuning jobs on GPUs. By default, PyTorch’s caching allocator reserves GPU memory in fixed-size blocks that cannot grow, which can leave memory fragmented and trigger out-of-memory (OOM) errors even when enough total memory is technically free.

To address this, we enabled the following setting: 

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True 

This setting lets PyTorch expand existing memory segments as needed instead of reserving new fixed-size ones, reducing fragmentation and memory waste and improving reliability, especially on shared or busy GPUs.

This setting is now enabled by default in recent versions of KAITO. However, if you’re using an older release or running in a custom environment, you may need to configure it manually. 
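
If you do need to set it yourself, for example in a custom training image, the variable must be in the environment before PyTorch initializes its CUDA allocator. A minimal sketch for a standalone script (with recent KAITO releases this is already handled for you):

# Set the allocator config before the CUDA caching allocator is initialized,
# i.e. before importing torch (or at least before the first CUDA allocation).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
x = torch.zeros(1, device="cuda")  # the allocator now uses expandable segments

In a Kubernetes deployment, the equivalent is setting PYTORCH_CUDA_ALLOC_CONF as an environment variable on the training container.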

Best practices 

Based on our experiments with Phi-3-mini-128k, we recommend the following approach for efficient fine-tuning with KAITO: 

  1. Start with QLoRA for memory-efficient fine-tuning, especially with longer sequences or limited GPU resources.
  2. Optimize batch size rather than defaulting to batch size 1. For many scenarios, a larger batch size processing the same total tokens will be more efficient.
  3. Start with low LoRA rank values (8–64). Higher ranks offer more adaptability but don’t always improve model quality—so only scale up if your task shows clear benefit.

Complete KAITO ConfigMap example 

A complete YAML ConfigMap with these optimizations is available here. 

Continue to fine-tune LLMs

Efficiently fine-tuning LLMs requires understanding the complex interplay between model architecture, memory management, and training dynamics. Our experiments demonstrate that with the right combination of techniques—particularly QLoRA, optimized batch sizes, and appropriate LoRA configurations—it’s possible to fine-tune powerful models like Phi-3 within reasonable hardware requirements. 

By implementing these optimizations in KAITO, you can work with larger models and longer sequences, even with limited computational resources, advancing the accessibility and practical application of state-of-the-art language models in your projects. 


We encourage you to try these techniques in your own fine-tuning workflows and share your experience with the KAITO community on GitHub Discussions. We’re excited to see what you build. 


This research was conducted using an NVIDIA A100 (80GB) GPU with CUDA 12.4 and PyTorch 2.2.0.