NVIDIA GPU Operator
Install
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update nvidia \
&& helm search repo nvidia/gpu-operator -l | head -n 10
helm pull nvidia/gpu-operator --version v25.10.1
helm show values nvidia/gpu-operator --version v25.10.1 > gpu-operator-v25.10.1.yaml
nfd:
  enabled: false
cdi:
  enabled: true
daemonsets:
  tolerations: []
driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: true
    deployDefaultCR: false
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
# GPU Feature Discovery
gfd:
  enabled: false
migManager:
  enabled: false
vgpuDeviceManager:
  enabled: false
vfioManager:
  enabled: false
sandboxDevicePlugin:
  enabled: false
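If GPU nodes carry a taint, the operand DaemonSets need a matching toleration, which is what the empty `daemonsets.tolerations` list is for. A minimal sketch, assuming GPU nodes are tainted with the key `nvidia.com/gpu` (adjust to your cluster's actual taints):

```yaml
daemonsets:
  tolerations:
    # Assumed taint key; match whatever taint your GPU nodes carry.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```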
- `driver.enabled: true` - Enables GPU Operator driver management. If the NVIDIA GPU driver must remain pre-installed and host-managed, set this value to `false`.
- `nvidiaDriverCRD.enabled: true` - Uses `NVIDIADriver` resources to define driver versions and node selection.
- `nvidiaDriverCRD.deployDefaultCR: false` - Prevents Helm from creating a default `NVIDIADriver` that matches all GPU nodes. No operator-managed driver is installed until an explicit `NVIDIADriver` resource is applied.
- `toolkit.enabled: true` - If the NVIDIA Container Toolkit is already installed and managed on the node, set this value to `false`.
helm template gpu-operator nvidia/gpu-operator \
--version v25.10.1 \
-n nvidia-gpu \
-f gpu-operator-values.yaml \
> gpu-operator.yaml
helm upgrade -i gpu-operator nvidia/gpu-operator \
--history-max 5 \
--create-namespace \
--version v25.10.1 \
-n nvidia-gpu \
-f gpu-operator-values.yaml
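After the release installs, it helps to confirm that the operator and its operand pods reconcile. A sketch using the namespace from the command above; the ClusterPolicy `status.state` field reports overall operator health:

```shell
# Watch operator and operand pods in the install namespace
kubectl -n nvidia-gpu get pods

# Overall operator health is reported on the ClusterPolicy status
kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
```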
Driver CRD
Create one or more NVIDIADriver resources after the Operator is installed. See Driver CRD for node-group examples and upgrade policy guidance.
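Because `deployDefaultCR: false` suppresses the default CR, no operator-managed driver is installed until a resource like the following is applied. A minimal sketch, assuming the `v1alpha1` API; the driver version and node label are placeholders, so check the Driver CRD documentation for the fields your chart version supports:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: gpu-driver-default
spec:
  driverType: gpu
  image: driver
  repository: nvcr.io/nvidia
  # Placeholder version; use a driver version published for your GPUs.
  version: "580.65.06"
  nodeSelector:
    # Hypothetical label; select nodes however you group them.
    driver.config: "default"
```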
Upgrade Impact
Driver updates can interrupt GPU workloads on the target node because the NVIDIA kernel modules and Fabric Manager may need to be restarted or reloaded. Use maxParallelUpgrades: 1 and drain one node at a time for production clusters.
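In chart values, the one-node-at-a-time behavior maps to `driver.upgradePolicy`. A sketch of the relevant fields (the timeout is an example value to tune for your workloads):

```yaml
driver:
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1   # upgrade one node at a time
    drain:
      enable: true
      force: false
      timeoutSeconds: 300    # example timeout; tune for your workloads
```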
For clusters with pre-installed host drivers, plan a node-by-node migration. Cordon and drain the node, remove or disable the host-managed NVIDIA driver and Fabric Manager stack, reboot if required, then let the NVIDIADriver resource install the operator-managed stack.
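The per-node migration can be sketched roughly as follows. The node name and package globs are placeholders, and the removal steps assume a .deb-packaged host driver; adjust for runfile or other install methods:

```shell
# Cordon and drain the target node
kubectl cordon gpu-node-01
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data

# On the node: stop Fabric Manager and remove the host-managed driver stack
sudo systemctl disable --now nvidia-fabricmanager
sudo apt-get purge -y 'nvidia-driver-*' 'nvidia-fabricmanager-*'
sudo reboot

# After the NVIDIADriver-managed pods become Ready, return the node to service
kubectl uncordon gpu-node-01
```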
NFD (Node Feature Discovery) and GFD (GPU Feature Discovery) add system information to node labels. Use these labels in nodeAffinity rules or in NVIDIADriver.spec.nodeSelector.
If the NFD PCI deviceLabelFields setting includes [class, vendor], a node with an NVIDIA PCI device can receive a label such as feature.node.kubernetes.io/pci-0302_10de.present: "true".
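Such a label can then gate scheduling. A sketch of a pod-spec nodeAffinity term using the example label above:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # NFD-generated label for an NVIDIA (vendor 10de) display controller (class 0302)
            - key: feature.node.kubernetes.io/pci-0302_10de.present
              operator: In
              values: ["true"]
```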
Uninstall
helm uninstall gpu-operator -n nvidia-gpu
kubectl delete crd nvidiadrivers.nvidia.com
kubectl delete crd clusterpolicies.nvidia.com