CNCF to Certify Kubernetes for AI/ML Workloads: What It Means for Developers
If you need seamless portability for AI inference and model workloads across clouds, what should Kubernetes support? The Cloud Native Computing Foundation (CNCF) is asking exactly that.
The foundation is launching a program to certify Kubernetes distributions that can reliably run specific AI/ML workloads. But first, it needs a clear set of requirements and recommendations: think of it as an AI-focused counterpart to the existing Kubernetes conformance program, which already helps more than 100 distributions remain interoperable across environments.
Building Baseline Compatibility Across Clouds
CNCF CTO Chris Aniszczyk explained at KubeCon + CloudNativeCon events that the goal is workload portability across platforms: if a workload runs on one certified Kubernetes distribution, it should run without modification on another, regardless of cloud provider.
A new working group inside SIG‑Architecture is drafting a specification: a standardized set of APIs, capabilities, and configs that a Kubernetes cluster must support to handle AI/ML jobs reliably and efficiently.
This effort lays the foundation for a broader Cloud Native AI Conformance framework, covering telemetry, storage, security, and other core cloud-native concerns.
Three Core AI Workload Types for Kubernetes Certification
According to the draft, the group is focusing on three key AI workload types:
- Large-scale training and fine-tuning: must-have features include high-performance accelerators (GPUs, NPUs), topology-aware networking, gang scheduling, and data access at scale (see the queued-Job sketch after this list).
- High-performance inference: requires not just accelerators but also advanced traffic routing and standardized metrics for monitoring latency and throughput.
- MLOps pipelines: needs include batch job orchestration, queue systems for resource contention, secure integration with object storage or model registries, and robust support for CRDs/operators.
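As an illustration of the gang-scheduling and queueing requirements, here is a minimal sketch of a multi-worker training Job submitted through Kueue, the CNCF batch-queueing project, which reserves resources for the whole Job and admits it all-or-nothing. The queue name, image, and GPU count below are hypothetical, not anything taken from the draft.

```yaml
# A multi-worker training Job submitted through Kueue. Kueue holds the
# Job in a queue, reserves quota for all four workers at once, and only
# unsuspends it when every worker can be placed.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llm                       # hypothetical Job name
  labels:
    kueue.x-k8s.io/queue-name: team-queue  # hypothetical LocalQueue
spec:
  parallelism: 4        # four workers that must start together
  completions: 4
  suspend: true         # created suspended; Kueue flips this on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1   # one accelerator per worker
```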
Essential Capabilities: What Must Be Supported
The draft splits into “must-have” requirements and “should-have” recommendations—many based on recent Kubernetes enhancements:
- Dynamic Resource Allocation (DRA): fully available in Kubernetes 1.34, enabling flexible GPU sharing and fine-grained resource control (a minimal manifest sketch follows below).
- Kubernetes Gateway API Inference Extension: defines traffic-routing patterns for large language models (LLMs).
- Cluster Autoscaler: must support scaling node groups while respecting requested accelerator types.
These are just a few examples of the AI-ready features the framework aims to formalize.
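To make the DRA item concrete, here is a minimal sketch of a ResourceClaim and a Pod that consumes it, written against the resource.k8s.io/v1 API that reached GA in Kubernetes 1.34. The DeviceClass name and container image are hypothetical stand-ins for whatever a vendor's DRA driver actually provides.

```yaml
# A ResourceClaim requesting one device from a DeviceClass, plus a Pod
# that consumes the claim. Field names follow resource.k8s.io/v1 (GA in
# Kubernetes 1.34); the DeviceClass and image are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com  # installed by a vendor DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu   # binds the Pod to the claim above
  containers:
  - name: model
    image: registry.example.com/model-server:latest  # hypothetical image
    resources:
      claims:
      - name: gpu   # references the resourceClaims entry by name
```

Unlike the older device-plugin model, the claim is a first-class API object, which is what lets the scheduler and drivers negotiate the sharing and fine-grained device selection the draft calls out.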
How Certification Will Work
A dedicated, independent accreditation group will oversee the certification process. It will define, enforce, and update the test criteria used to decide whether a Kubernetes distribution can genuinely support standardized AI/ML workloads as the new guidelines define them.
Each distribution seeking certification will undergo a comprehensive evaluation: compatibility checks, workload simulations, and verification against the defined API, configuration, and hardware-capability requirements. Every certified distribution must pass a YAML-based conformance checklist, modeled on the existing Kubernetes certification process, so that tests are run and conformance is measured consistently and transparently; a hypothetical sketch of such a checklist follows below.
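For a sense of shape: the existing conformance suite is described by a YAML file listing each test's name, codename, description, and the release it entered, so an AI checklist modeled on it might look roughly like this. Every entry below is invented for illustration and is not from the draft.

```yaml
# Hypothetical AI-conformance checklist entries, loosely modeled on the
# conformance.yaml format behind the existing Kubernetes conformance
# suite. Test names, codenames, and descriptions are invented.
- testname: DRA device allocation
  codename: '[sig-node] [Conformance:AI] a Pod referencing a ResourceClaim gets its device'
  description: A Pod that references a ResourceClaim must be scheduled
    onto a node that satisfies the claim, and the allocated device must
    be visible inside the container.
  release: v1.34
- testname: Inference traffic routing
  codename: '[sig-network] [Conformance:AI] Gateway routes requests to an inference backend'
  description: Requests sent through a conformant Gateway implementation
    must reach the model-serving backend selected by the route.
  release: v1.34
```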
Certified distributions will be listed on a public website so developers, enterprises, and cloud providers can easily verify which platforms meet the new AI conformance standards. Certification won't be permanent: each distribution will be re-tested annually to confirm it still meets requirements that are expected to evolve quickly as Kubernetes and AI technologies advance.