Commit 742f3b3

Add comprehensive GPU Operator debugging output
- List all GPU Operator pods and DaemonSets
- Show DCGM exporter DaemonSet, Service, ServiceMonitor details
- Display recent events for troubleshooting deployment failures

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
1 parent 459ec97 commit 742f3b3

1 file changed

Lines changed: 27 additions & 0 deletions

tests/e2e/scenarios/ai-conformance/run-test.sh

@@ -279,6 +279,33 @@ else
 echo "Warning: Prometheus pod not found"
 fi
 
+echo "----------------------------------------------------------------"
+echo "GPU Operator Component Status for Debugging"
+echo "----------------------------------------------------------------"
+
+echo "All GPU Operator pods:"
+kubectl get pods -n gpu-operator -o wide || true
+
+echo ""
+echo "GPU Operator DaemonSets:"
+kubectl get daemonsets -n gpu-operator -o wide || true
+
+echo ""
+echo "DCGM Exporter DaemonSet details:"
+kubectl describe daemonset -n gpu-operator nvidia-dcgm-exporter || true
+
+echo ""
+echo "DCGM Exporter Service:"
+kubectl get service -n gpu-operator nvidia-dcgm-exporter -o yaml || echo "No DCGM service found"
+
+echo ""
+echo "DCGM Exporter ServiceMonitor:"
+kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter -o yaml || echo "No ServiceMonitor found"
+
+echo ""
+echo "Recent GPU Operator events:"
+kubectl get events -n gpu-operator --sort-by='.lastTimestamp' | tail -20 || true
+
 echo "AI Conformance Environment Setup Complete."
 
 # Now run the actual AI conformance tests
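
The added block repeats one pattern: print a heading, run a diagnostic command, and keep going if that command fails. As a rough sketch only, the pattern could be factored into a helper. The function name `debug_section` and its fallback message are hypothetical and are not part of this commit or of `run-test.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not taken from the commit):
# prints a heading, runs the given command, and always returns success
# so that a failing diagnostic cannot abort the surrounding script.
debug_section() {
  local title="$1"
  shift
  echo ""
  echo "${title}:"
  # Run the remaining arguments as a command; report failure instead of exiting.
  "$@" || echo "(command failed or resource not found: $*)"
}

# Example calls mirroring the diff (these need kubectl and cluster access):
# debug_section "All GPU Operator pods" kubectl get pods -n gpu-operator -o wide
# debug_section "GPU Operator DaemonSets" kubectl get daemonsets -n gpu-operator -o wide
```

A helper like this takes a single command, so the piped events query (`kubectl get events ... | tail -20`) would still have to be written inline or wrapped in its own function.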
