scanprobe v0 · alpha

A tiny read-only scan for the first few minutes of GPU troubleshooting.

Status: v0 alpha  ·  NVIDIA local evidence only  ·  seeking incident reports from real GPU operators

Use scanprobe when a GPU node has become a suspect and you need obvious local NVIDIA evidence before rerunning, draining, or filing a support ticket.

It answers one narrow question: does local NVIDIA evidence show something obviously wrong with this node or GPU, or should I keep looking elsewhere?

What it checks

scanprobe gathers the first-pass checks an operator often runs by hand.

It prints a primary issue, the visible evidence, a next action, and one conservative triage label.
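
As an illustration of the kind of check an operator might otherwise run by hand, the sketch below pulls Xid events out of kernel-log text. It is not scanprobe's implementation. The function name `find_xid_events`, the regular expression, and the sample log lines are all assumptions made for this example; the pattern follows the usual `NVRM: Xid (PCI:...): <code>` shape of NVIDIA driver messages.

```python
import re

# Assumed pattern for NVIDIA driver Xid lines in the kernel log, e.g.
# "NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, ..."
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)"
)


def find_xid_events(log_text):
    """Return (pci_address, xid_code) pairs found in kernel-log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("code"))))
    return events


# Made-up sample lines for illustration, not captured from a real incident.
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
[ 1300.000001] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 1400.123456] NVRM: Xid (PCI:0000:86:00): 48, pid=5555, An uncorrectable double bit error (DBE) has been detected
"""

if __name__ == "__main__":
    for pci, code in find_xid_events(SAMPLE_LOG):
        print(f"{pci} xid={code}")
```

The sketch only reads text it is given, which matches the read-only spirit of a first-pass scan. How the log text is obtained (for example from `dmesg` or the journal) is left to the operator.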

Commands

Clone the repo:

git clone https://github.com/cv700/scanprobe
cd scanprobe

Run the default human-readable scan:

python3 scanprobe.py

JSON for wrappers, scripts, or pasting into tickets:

python3 scanprobe.py --json
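
A wrapper around the JSON mode could look roughly like the sketch below. The JSON schema is not documented here, so the example makes no assumption about field names and simply pretty-prints whatever object comes back. The helper name `run_json_scan` and the 60-second timeout are assumptions for this example. The exit code is deliberately ignored, because it is not stated whether scanprobe exits non-zero when it finds an issue.

```python
import json
import subprocess
import sys


def run_json_scan(command):
    """Run a command that prints JSON on stdout and return the parsed object.

    Returns None when the command cannot be started, times out,
    or prints something that is not valid JSON.
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    # Assumes the current directory is the cloned scanprobe repo.
    report = run_json_scan([sys.executable, "scanprobe.py", "--json"])
    if report is None:
        print("no parseable JSON from scanprobe")
    else:
        # Field names are not assumed; print the whole object for pasting.
        print(json.dumps(report, indent=2, sort_keys=True))
```

Treating "no parseable output" as its own outcome keeps a wrapper from mistaking a failed scan for a clean one.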

Constraints

No daemon. No telemetry. No mutation. No stress workload. No API key. No hidden benchmark. No claim that a GPU is healthy.

scanprobe is not a DCGM replacement. DCGM is the serious datacenter GPU management and diagnostics stack. scanprobe is the smaller first-pass scan before rerun, drain, escalation, or deeper diagnostics.

The goal is modest: save a few minutes, catch obvious local evidence, and make the first response easier to paste into Slack, Jira, or a provider ticket. In alpha, we are testing whether that means roughly 2–10 minutes saved per first-pass incident.

Use it for the first look after something feels off. Do not use it as a routine heartbeat or a health certificate.

Roadmap

v0  ·  save one minute
one command, pasteable summary

One command gathers obvious local NVIDIA evidence and prints a pasteable summary. Current alpha.

v1  ·  save ten minutes
better fixtures, clearer issue tags, stronger failure-case handling

Better fixtures from real operator reports, clearer issue tags, and stronger handling for common visibility, Xid, ECC, throttle, and nvidia-smi failure cases.

v2  ·  save an hour
read-only runbook pilot

A read-only runbook pilot that helps operators decide which deeper tool to run next, including DCGM or provider diagnostics when appropriate.

Scope

scanprobe currently ships NVIDIA local evidence only. Ashiba will add AMD support once real AMD SMI, ROCm, and kernel-log fixtures show which read-only signals change an operator's next action.

Resources