NVIDIA GPU Operator

Install

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update nvidia \
&& helm search repo nvidia/gpu-operator -l | head -n 10
helm pull nvidia/gpu-operator --version v25.10.1
helm show values nvidia/gpu-operator --version v25.10.1 > gpu-operator-v25.10.1.yaml
gpu-operator-values.yaml
nfd:
  enabled: false

cdi:
  enabled: true

daemonsets:
  tolerations: []

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: true
    deployDefaultCR: false

toolkit:
  enabled: true

devicePlugin:
  enabled: true

dcgmExporter:
  enabled: true

# GPU Feature Discovery
gfd:
  enabled: false

migManager:
  enabled: false

vgpuDeviceManager:
  enabled: false

vfioManager:
  enabled: false

sandboxDevicePlugin:
  enabled: false
  • driver
    • enabled: true
      • Enables GPU Operator driver management.
      • If the NVIDIA GPU driver must remain pre-installed and host-managed, set this value to false instead.
    • nvidiaDriverCRD.enabled: true
      • Uses NVIDIADriver resources to define driver versions and node selection.
    • nvidiaDriverCRD.deployDefaultCR: false
      • Prevents Helm from creating a default NVIDIADriver that matches all GPU nodes.
      • No operator-managed driver is installed until an explicit NVIDIADriver resource is applied.
  • toolkit
    • enabled: true
      • If the NVIDIA Container Toolkit is already installed and managed on the host, set this value to false.
helm template gpu-operator nvidia/gpu-operator \
--version v25.10.1 \
-n nvidia-gpu \
-f gpu-operator-values.yaml \
> gpu-operator.yaml
helm upgrade -i gpu-operator nvidia/gpu-operator \
--history-max 5 \
--create-namespace \
--version v25.10.1 \
-n nvidia-gpu \
-f gpu-operator-values.yaml
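After the release is installed, a quick sanity check along these lines confirms the operator pods are up and the CRDs exist (the namespace follows the commands above):

```shell
# Operator and operand pods should reach Running/Completed
kubectl -n nvidia-gpu get pods

# CRDs installed by the chart
kubectl get crd clusterpolicies.nvidia.com nvidiadrivers.nvidia.com

# Cluster-wide policy created by the chart
kubectl get clusterpolicies.nvidia.com
```

With deployDefaultCR: false, no driver pods appear yet; only the operator and non-driver operands run until an NVIDIADriver resource is applied.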

Driver CRD

Create one or more NVIDIADriver resources after the Operator is installed. See Driver CRD for node-group examples and upgrade policy guidance.
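A minimal NVIDIADriver sketch is shown below; the resource name, driver version, and node label are placeholders to adapt to your cluster and GPU generation:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-default   # placeholder name
spec:
  driverType: gpu
  repository: nvcr.io/nvidia
  image: driver
  version: "580.65.06"       # placeholder; pick a version supported by your GPUs
  nodeSelector:
    nvidia.com/gpu.deploy.driver: "true"   # placeholder label; any node label works
```

Because deployDefaultCR is false in the values above, driver pods are scheduled only on nodes matched by this resource's nodeSelector.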

Upgrade Impact

Driver updates can interrupt GPU workloads on the target node because the NVIDIA kernel modules and Fabric Manager may need to be restarted or reloaded. For production clusters, set maxParallelUpgrades: 1 so that nodes are drained and upgraded one at a time.
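In the chart values this maps to the driver upgrade policy; a sketch of the relevant fields (the drain settings shown are assumptions to adapt, not recommendations for every workload):

```yaml
driver:
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1   # upgrade one node at a time
    drain:
      enable: true
      force: false
      deleteEmptyDir: false
      timeoutSeconds: 300
```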

For clusters with pre-installed host drivers, plan a node-by-node migration. Cordon and drain the node, remove or disable the host-managed NVIDIA driver and Fabric Manager stack, reboot if required, then let the NVIDIADriver resource install the operator-managed stack.
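The migration steps above can be sketched as follows; the node name is a placeholder, and the package-removal commands are an Ubuntu/apt example to adjust per distribution:

```shell
# 1. Cordon and drain the node
kubectl cordon gpu-node-01
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data

# 2. Remove or disable the host-managed driver and Fabric Manager stack
sudo systemctl disable --now nvidia-fabricmanager
sudo apt-get purge -y 'nvidia-*' 'libnvidia-*'

# 3. Reboot if the kernel modules cannot be unloaded cleanly
sudo reboot

# 4. Once the operator-managed driver pod is Running on the node
kubectl uncordon gpu-node-01
```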

Info

NFD (Node Feature Discovery) and GFD (GPU Feature Discovery) publish hardware and system information as node labels. Use these labels in nodeAffinity rules or in NVIDIADriver.spec.nodeSelector.

If the NFD PCI deviceLabelFields setting includes [class, vendor], a node with an NVIDIA PCI device can receive a label such as feature.node.kubernetes.io/pci-0302_10de.present: "true".
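One way to confirm which nodes carry that label (the label key follows the example above; 0302 is the PCI 3D-controller class and 10de the NVIDIA vendor ID):

```shell
# Nodes labeled as having an NVIDIA PCI device
kubectl get nodes -l 'feature.node.kubernetes.io/pci-0302_10de.present=true'
```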

Uninstall

helm uninstall gpu-operator -n nvidia-gpu
kubectl delete crd nvidiadrivers.nvidia.com
kubectl delete crd clusterpolicies.nvidia.com