NVIDIA GPU Operator Driver CRD

Driver CRD

Use the NVIDIADriver CRD when driver versions must be managed per node group. Enable the CRD mode in the GPU Operator Helm values and disable the default driver custom resource so that no catch-all driver is installed before explicit node-group resources are created.

gpu-operator-values.yaml
driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: true
    deployDefaultCR: false
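
To load these values, install or upgrade the GPU Operator release with Helm. This is a sketch: the release name gpu-operator and the nvidia chart repository alias are assumptions, while the nvidia-gpu namespace matches the one used in the verification commands on this page.

helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu --create-namespace \
  -f gpu-operator-values.yaml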
Info

With deployDefaultCR: false, the GPU Operator does not create a default NVIDIADriver resource. Operator-managed driver installation starts only after a matching NVIDIADriver resource is applied.
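Before applying node-group resources, it is worth confirming that no catch-all resource exists; the list should be empty:

kubectl get nvidiadrivers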

Node Groups

Select target nodes with stable labels. Keep node selectors disjoint so that a node matches only one NVIDIADriver resource.

kubectl label node <node> node.type=b300
kubectl label node <node> node.type=l40s
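
To verify that every GPU node carries exactly one node.type value, print the label as a column:

kubectl get nodes -L node.type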

B300

nvidia-driver-b300.yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: b300-580-126-20
spec:
  driverType: gpu
  nodeSelector:
    node.type: b300
  repository: nvcr.io/nvidia
  image: driver
  version: "580.126.20"
  imagePullPolicy: IfNotPresent
  usePrecompiled: false
  kernelModuleType: auto
  manager:
    repository: nvcr.io/nvidia/cloud-native
    image: k8s-driver-manager
    version: v0.9.1
    imagePullPolicy: IfNotPresent
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: 1
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    gpuPodDeletion:
      force: false
      deleteEmptyDir: false
      timeoutSeconds: 300
    drain:
      enable: true
      force: false
      deleteEmptyDir: false
      timeoutSeconds: 300

kubectl apply -f nvidia-driver-b300.yaml
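
After applying, the resource should transition to a ready state and a driver pod should appear on each b300 node. A quick check, assuming the nvidia-gpu operator namespace used elsewhere on this page:

kubectl get nvidiadrivers b300-580-126-20
kubectl get pods -n nvidia-gpu -o wide | grep nvidia-driver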

L40S

Use a separate resource for L40S nodes if they require a different rollout window, upgrade policy, or driver version.

nvidia-driver-l40s.yaml
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: l40s-580-126-20
spec:
  driverType: gpu
  nodeSelector:
    node.type: l40s
  repository: nvcr.io/nvidia
  image: driver
  version: "580.126.20"
  imagePullPolicy: IfNotPresent
  usePrecompiled: false
  kernelModuleType: auto
  manager:
    repository: nvcr.io/nvidia/cloud-native
    image: k8s-driver-manager
    version: v0.9.1
    imagePullPolicy: IfNotPresent
  upgradePolicy:
    autoUpgrade: true
    maxParallelUpgrades: 1
    maxUnavailable: 1
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    gpuPodDeletion:
      force: false
      deleteEmptyDir: false
      timeoutSeconds: 300
    drain:
      enable: true
      force: false
      deleteEmptyDir: false
      timeoutSeconds: 300

kubectl apply -f nvidia-driver-l40s.yaml

Upgrade Impact

Driver upgrades can interrupt GPU workloads on the target node because the NVIDIA kernel modules and Fabric Manager may need to be restarted or reloaded. Set maxParallelUpgrades: 1 and drain one node at a time for production clusters.
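If a rollout must be paused before the maintenance window closes, autoUpgrade can be turned off with a generic merge patch. This is a sketch that assumes upgradePolicy lives in the NVIDIADriver spec as shown in the resources above:

kubectl patch nvidiadrivers b300-580-126-20 --type merge \
  -p '{"spec":{"upgradePolicy":{"autoUpgrade":false}}}'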

For clusters with pre-installed host drivers, plan a node-by-node migration. Cordon and drain the node, remove or disable the host-managed NVIDIA driver and Fabric Manager stack, reboot if required, then let the matching NVIDIADriver resource install the operator-managed stack.
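A per-node sequence might look like the following sketch. The drain flags are standard kubectl; the package-removal commands are an Ubuntu-flavored assumption and must be adapted to the node's actual distribution and package names.

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=600s

# On the node: stop and remove the host-managed stack (Ubuntu example; adjust per distro)
sudo systemctl disable --now nvidia-fabricmanager
sudo apt-get purge -y 'nvidia-driver-*' 'nvidia-fabricmanager-*'
sudo reboot

# After the matching NVIDIADriver resource has installed the operator-managed stack:
kubectl uncordon <node>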

Verification

kubectl get nvidiadrivers
kubectl get pods -n nvidia-gpu -o wide | grep nvidia-driver
kubectl get nodes -l nvidia.com/gpu.present=true \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/cuda\.driver-version\.full}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' \
| sort
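
To cross-check the version the kernel actually loaded, run nvidia-smi inside a driver pod; <driver-pod> is a placeholder for a pod name from the listing above:

kubectl exec -n nvidia-gpu <driver-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader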
Warning

Do not create overlapping NVIDIADriver resources. A node must match exactly one driver custom resource. Keep deployDefaultCR: false whenever custom node-group resources are in use.