OpenAI masters scale with Kubernetes on Azure

By Anand Chandramohan, a Senior Product Manager at Microsoft Azure focused on compute services, including containers and Kubernetes.

OpenAI’s mission is to build safe artificial general intelligence (AGI) and ensure AGI’s benefits are as widely and evenly distributed as possible. As a non-profit AI research company, they focus on long-term research, working on problems that require fundamental advances in AI capabilities.

OpenAI runs Kubernetes for their deep learning research because it provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate, which makes it ideal for most of OpenAI’s experiments. They currently operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which they have pushed to over 2,500 nodes. That largest cluster runs in Azure on a combination of D15v2 and NC24 VMs.

To find out more about how OpenAI adopted Kubernetes and how they resolved some common deployment issues, check out this detailed OpenAI blog post on scaling Kubernetes to 2,500 nodes.

If you want to learn more about Azure Container Service (AKS), the new managed Kubernetes service that OpenAI is using, visit the AKS site. You pay only for the VMs that add value to your business, and you can try AKS for free.
