Runbook
Decision trees for the repeat operator. Each entry (R1 to R18) maps a situation to the minimum set of commands that resolves it. For your first build, follow Quickstart end to end. To take a fresh flash to the complete verified AV stack (CUDA, ZED X, Metis inference), follow R18. For symptom-driven debugging, see Troubleshooting.
Scaling past one unit (the build-once, flash-N workflow that the community post links to): R10 reuses one build across devices, R13 produces a signed release tarball and batch-flashes from fleet.csv, and R16 clones a fully provisioned golden unit with per-device identity. Full guides: Fleet and Golden Image.
This runbook targets a Jetson Orin NX 16GB (p3767-0000 module on a p3768 Orin Nano carrier) running L4T R36.4.3 / JetPack 6.2 with a PREEMPT_RT kernel, an NVMe root on /dev/nvme0n1p1, the Axelera Metis NPU, and the ZED X camera overlay.
R1: Flash a brand-new Jetson
make doctor # 30 s, abort here on any FAIL
make ignite-no-flash # 60 to 90 min, produces a clean, audited image
# Now plug the Jetson in recovery mode.
make flash # 15 to 25 min
# Wait for boot; first-boot service runs ~3 to 5 min then reboots.
make verify # post-flash gauntlet
Note on make verify: the first-boot service provisions the /opt/av-env Python environment only when the device has internet. After an offline first boot the service re-runs on every boot and finishes provisioning once a network is available (for example, an Ethernet cable); until then the venv-import step of make verify fails, which is expected.
Judging boot: a blank HDMI screen or a boot logo frozen on its last square is normal on Orin and proves nothing. Judge boot health by the USB gadget instead: wait for lsusb to show 0955:7020, then ping 192.168.55.1, then ssh. A healthy boot reaches sshd in about 60 seconds.
Or run the full interactive sequence:
make ignite # doctor -> all -> audit -> flash (interactive) -> verify
R2: Changed a CONFIG flag, minimum rebuild?
Did you change LOCALVERSION, PREEMPT_RT, MODVERSIONS, or anything that
changes UTS_RELEASE or vermagic constituents?
yes -> full rebuild path: make clean && make all && make audit && make flash
no -> try the partial path:
make extract # idempotent, only re-applies changed patches
make build # rebuilds kernel + all modules + headers .deb
make bake # restages payloads
make audit # confirms gates still green
make flash # write
Always end with make verify after the flash completes.
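The yes/no branch above can be sketched as a small helper. This is a minimal sketch: rebuild_path is a hypothetical function, and its list holds only the three symbols named in R2, so it assumes nothing else you changed feeds UTS_RELEASE or vermagic. When in doubt, take the full path.

```shell
# Sketch: pick the rebuild path from the names of the changed CONFIG symbols.
# Only the three symbols named in R2 are listed; extend the pattern if other
# symbols in your tree affect UTS_RELEASE or vermagic (assumption).
rebuild_path() {
    for sym in "$@"; do
        case "${sym#CONFIG_}" in
            LOCALVERSION|PREEMPT_RT|MODVERSIONS)
                echo "full"
                return 0
                ;;
        esac
    done
    echo "partial"
}

rebuild_path CONFIG_DMABUF_HEAPS             # prints: partial
rebuild_path CONFIG_DMABUF_HEAPS PREEMPT_RT  # prints: full
```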
R3: Changed a kernel patch, minimum rebuild?
# Edit scripts/01_extract_and_patch.sh or KERNEL_PATCHES.md instructions
make clean # most reliable, patches are not always reversible
make all && make audit
# ... recovery mode ...
make flash && make verify
Skipping make clean after a patch change risks Phase 1 silently no-op’ing because of its idempotency guards.
R4: Audit failed
scripts/pre_flash_audit.sh prints which check FAILed. Decode it:
| Failed check | Most likely cause | Fix |
|---|---|---|
| Version string | Phase 2 ran with wrong LOCALVERSION | rebuild Phase 2 |
| Real-time core | PREEMPT_RT not active in built kernel | check defconfig, run generic_rt_build.sh enable, rebuild |
| DMABUF heaps | CONFIG_DMABUF_HEAPS=y missing | check that defconfig injection ran in Phase 1 |
| PCIe retries | LINK_WAIT_MAX_RETRIES != 200 | re-run Phase 1 (the patch forces 200, pinned in versions.env) |
| CPU isolation | extlinux.conf missing isolcpus | re-run make bake |
| Tickless mode | extlinux.conf missing nohz_full | re-run make bake |
| CMA policy | extlinux.conf contains a cma= boot arg (it must not; the device tree linux,cma pool supplies CMA) | remove the cma= arg, re-run make bake |
| ZED X overlay | DTBO missing in /boot/ | re-run Phase 2 (the DTBO compile is in 02_build_kernel.sh) |
| Module vermagic | At least one .ko has the wrong vermagic | make clean && make all (never partial) |
After fixing, re-run make audit. If it stays red, gather logs (make logs) and consult Troubleshooting §B.
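The three extlinux.conf rows of the table (CPU isolation, Tickless mode, CMA policy) come down to a text check on the boot arguments. A minimal sketch, assuming the arguments sit on APPEND lines; check_extlinux_append is a hypothetical helper for a quick manual look, not the implementation inside pre_flash_audit.sh:

```shell
# Sketch of the three extlinux.conf gates (not the real audit script).
# Usage: check_extlinux_append <path-to-extlinux.conf>
check_extlinux_append() {
    conf="$1"
    status=0
    # Only APPEND lines carry kernel boot arguments.
    args=$(grep -E '^[[:space:]]*APPEND[[:space:]]' "$conf" || true)
    for required in isolcpus nohz_full; do
        if ! printf '%s\n' "$args" | grep -Eq "(^|[[:space:]])${required}="; then
            echo "FAIL: missing ${required}="
            status=1
        fi
    done
    # A cma= argument must NOT be present; the device tree pool supplies CMA.
    if printf '%s\n' "$args" | grep -Eq '(^|[[:space:]])cma='; then
        echo "FAIL: cma= present"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "PASS"
    return "$status"
}
```

Point it at the staged extlinux.conf before flashing, or at /boot/extlinux/extlinux.conf on a booted target.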
R5: Flash hangs
Work the tree top to bottom:
1. lsusb -t
- Jetson on a hub or extender? -> move to a direct motherboard port.
- Multiple USB devices on same controller? -> unplug others.
2. lsusb | grep 0955
- Nothing? -> Jetson is not in recovery mode. Power off, re-short
REC+GND, re-power.
- Shows 0955:7323? -> recovery OK; the flash should be progressing.
3. ip link show
- Look for usb0. If absent, the RNDIS gadget did not enumerate on
the host. Check dmesg for the rndis_host module loading.
- sudo modprobe rndis_host
- sudo udevadm control --reload-rules
4. sudo dmesg -w (in another terminal)
- Watch for USB enumeration and disconnect events.
5. Disable USB autosuspend and raise the usbfs memory limit:
sudo sh -c 'echo -1 > /sys/module/usbcore/parameters/autosuspend'
sudo sh -c 'echo 200 > /sys/module/usbcore/parameters/usbfs_memory_mb'
6. Re-run: make flash
If it is still stuck, run make logs and see Troubleshooting §F.
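Step 2 of the tree can be scripted. A minimal sketch that classifies lsusb output using the two USB IDs this runbook names (0955:7323 for recovery mode, 0955:7020 for the gadget of a booted system); jetson_usb_state is a hypothetical helper name:

```shell
# Sketch: classify `lsusb` output read from stdin.
jetson_usb_state() {
    listing=$(cat)
    case "$listing" in
        *0955:7323*) echo "recovery" ;;        # recovery mode, ready to flash
        *0955:7020*) echo "booted" ;;          # USB gadget of a running system
        *0955:*)     echo "unknown-nvidia" ;;  # NVIDIA device with another ID
        *)           echo "absent" ;;          # not enumerated: re-enter recovery mode
    esac
}

# On the flash host:
#   lsusb | jetson_usb_state
```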
R6: Jetson boots but verify_tuning shows FAILs
scripts/verify_tuning.sh reports each check. Common patterns:
RT Kernel: FAIL
-> Boot landed on the stock kernel. Likely the wrong kernel image was
written, or extlinux.conf points at the wrong label. SSH in and check
the default label in /boot/extlinux/extlinux.conf.
CPU Isolation: FAIL
-> extlinux.conf is missing isolcpus. The first-boot script may have
failed before the inject step. Run it manually:
sudo /home/j/jetson_first_boot.sh
sudo reboot
CMA Reservation: FAIL
-> The verified-good state is CmaTotal=262144 kB (256MB), supplied by the
device tree linux,cma pool. There must be NO cma= arg on the cmdline:
a cmdline cma= bypasses the device tree pool, and large values fail to
reserve on Orin NX ("cma: Failed to reserve 2048 MiB"), leaving zero
CMA. With zero CMA, nvgpu cannot make its 64MB physically-contiguous
comptag allocation at GPU poweron ("DMA alloc FAILED"), GR falcon
recovery fails, CUDA is unavailable, and nvpmodel cannot set any
power mode.
- grep CmaTotal /proc/meminfo expect 262144 kB
- cat /proc/cmdline must contain no cma= arg
- dmesg | grep -i cma any "Failed to reserve" means a cma=
arg slipped in
If a cma= arg is present, remove it from /boot/extlinux/extlinux.conf
and reboot; the device tree pool takes over.
Power Mode: FAIL
-> The flash installs the SUPER nvpmodel conf as the default table.
Verified mode IDs on this image: 0=MAXN_SUPER, 1=10W, 2=15W, 3=25W,
4=40W. Set the maximum profile with:
sudo nvpmodel -m 0 # MAXN_SUPER
rt-tune requests mode 0 on every boot (NVPMODEL_MODE in power.conf
overrides it). If nvpmodel cannot set ANY mode, it is failing to read
the GPU frequency table, which means the GPU did not come up: work the
CMA entry above first.
Axelera Metis: GHOST
-> The PCIe link failed to train. See KERNEL_PATCHES.md §1.
Most likely the LINK_WAIT_MAX_RETRIES=200 patch did not apply.
ZED X Driver: MISSING
-> sl_zedx.ko is not loaded. Check its vermagic:
modinfo /lib/modules/$(uname -r)/.../sl_zedx.ko | grep vermagic
If the first field of the vermagic string differs from $(uname -r), the
module was built against a different kernel. Re-flash with a fresh build.
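The CMA Reservation entry above combines two conditions: CmaTotal must be 262144 kB and the cmdline must carry no cma= arg. A minimal sketch of that pair as one function; cma_state is a hypothetical helper and not part of verify_tuning.sh:

```shell
# Sketch of the CMA check. Takes meminfo-style and cmdline-style files so it
# can be run against saved copies as well as the live /proc entries.
cma_state() {
    meminfo="$1"
    cmdline="$2"
    # Any cma= boot arg bypasses the device tree pool, so fail on it first.
    if grep -Eq '(^|[[:space:]])cma=' "$cmdline"; then
        echo "FAIL: cma= on cmdline"
        return 1
    fi
    total=$(awk '$1 == "CmaTotal:" {print $2}' "$meminfo")
    if [ "$total" = "262144" ]; then
        echo "PASS"
    else
        echo "FAIL: CmaTotal=${total:-missing} kB"
        return 1
    fi
}

# On the target:
#   cma_state /proc/meminfo /proc/cmdline
```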
R7: Vermagic mismatch
Never force-load. Rebuild.
make clean
make all # kernel + all modules + headers .deb in one Docker run
make audit # verify_vermagic.sh --rootfs gates here
make flash
Never run insmod --force. Never apt install nvidia-l4t-kernel-modules. The Pin-Priority: -1 in /etc/apt/preferences.d/99-jetson-av-kernel-lock already blocks the latter, but operator discipline is the final defense.
See Vermagic strategy.
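For a one-off manual check of a single module, the comparison is simple: the first whitespace-separated field of a module's vermagic string is the kernel release it was built for. A minimal sketch; vermagic_matches is a hypothetical helper and does not replace the verify_vermagic.sh gate:

```shell
# Sketch: does a vermagic string belong to the given kernel release?
# $1 = vermagic string (as printed by `modinfo -F vermagic`), $2 = release.
vermagic_matches() {
    vermagic="$1"
    release="$2"
    # Strip everything from the first space onward, leaving the release field.
    [ "${vermagic%% *}" = "$release" ]
}

# On the target (the module path is a placeholder):
#   vermagic_matches "$(modinfo -F vermagic /path/to/module.ko)" "$(uname -r)" \
#       || echo "mismatch: rebuild, never force-load"
```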
R8: Install ZED SDK / Voyager SDK after first boot
First boot handles this when the artifacts were staged at bake time and the device has internet; after an offline first boot the service re-runs each boot and completes provisioning once a network is available. To install manually on the target (the pip step requires /opt/av-env, which first boot creates during online provisioning):
# CUDA userspace first: the flashed image has no JetPack packages and the
# ZED SDK requires CUDA 12.6 (TROUBLESHOOTING P-3)
sudo apt update && sudo apt install nvidia-jetpack
# ZED SDK
ls /opt/zed-sdk/ZED_SDK_Tegra_*.run # already there?
sudo /opt/zed-sdk/install_zed_sdk.sh # idempotent
# Voyager SDK (pip wheels, not install.sh). torch==2.8.0 is restated as a
# constraint: without it pip "upgrades" the cu126 wheel to PyPI's CPU-only
# torch (TROUBLESHOOTING P-2)
sudo /opt/av-env/bin/pip install axelera-rt axelera-devkit torch==2.8.0 \
--extra-index-url https://software.axelera.ai/artifactory/api/pypi/axelera-pypi/simple
Both depend on the linux-headers .deb being installed (handled by first boot). Confirm with dpkg -l | grep linux-headers.
R9: OTA update overwrote the kernel
If apt upgrade proposes nvidia-l4t-kernel*, decline it. First boot pins these packages at Pin-Priority: -1.
If it ran anyway:
# Confirm the damage
uname -r # expect 5.15.x-tegra
# if it is anything else, the kernel is stock again
# Recovery: re-flash from a fresh build
make clean && make all && make audit && make flash
There is no in-place repair. The module ABI is ruined the moment a stock kernel boots.
R10: Fleet-deploy multiple Jetsons
Build once, then reuse latest_jetson/ for each subsequent flash:
# Build once (or pull a pre-built artifact from CI)
make doctor
make all # produces deterministic artifacts (SOURCE_DATE_EPOCH locked)
make audit
# Per device: put each Jetson in recovery mode, then
make flash # rewrites the same image
make verify # confirm
To validate that two builds are byte-identical (fleet QA):
sha256sum latest_jetson/Linux_for_Tegra/kernel/Image
sha256sum latest_jetson/Linux_for_Tegra/staging/kernel-headers/linux-headers-*.deb
sha256sum latest_jetson/Linux_for_Tegra/kernel/dtb/*.dtbo
See Build for the reproducibility theory.
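To compare two builds without eyeballing hashes, hash each tree with relative paths and compare the lists. A minimal sketch, assuming both build outputs are available as directories on the same host; hash_tree and compare_builds are hypothetical helpers, and which subtrees you pass in is up to you:

```shell
# Sketch: hash every file under a tree. Paths are relative to the tree root,
# so the lists from two different checkouts are directly comparable.
hash_tree() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 )
}

# Prints "identical" when both trees hash the same, otherwise "differ".
compare_builds() {
    if [ ! -d "$1" ] || [ ! -d "$2" ]; then
        echo "usage: compare_builds <dir-a> <dir-b>" >&2
        return 2
    fi
    left=$(hash_tree "$1")
    right=$(hash_tree "$2")
    if [ "$left" = "$right" ]; then
        echo "identical"
    else
        echo "differ"
        return 1
    fi
}
```

When the result is "differ", running hash_tree on each side and diffing the two lists shows which files diverged.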
R11: Bundle logs for a support request
make logs
Produces support-bundle-YYYYMMDD-HHMMSS.tar.gz containing:
- BUILD_LOG.md, FLASH_LOG.txt, IGNITION_*.log
- EXPECTED_VERMAGIC, BUILD_MANIFEST.json
- the staged defconfig and extlinux.conf
- a vermagic snapshot (vermagic-rootfs.txt)
- if the target is reachable: target-uname.txt, target-dmesg-err.txt, target-journal-{first-boot,rt-tune}.txt, target-hardware.txt, target-vermagic.txt
Attach the tarball to your support ticket.
R12: Check pinned versions
make versions
Reads versions.env and prints every pinned version, URL, USB ID, and RT tuning value. It also prints the last-build vermagic and manifest if a build has run.
R13: Release tarball + batch flash N units
make doctor # preflight
make all && make audit # build + gate
GPG_KEY=YOUR_KEY make release VERSION=v1.0.0 # signed release tarball
# -> releases/release-v1.0.0.tar.gz
# On the flash station(s):
make fleet-init # creates fleet.csv from the example
$EDITOR fleet.csv # add device labels + hostnames + IPs
make flash-batch FLEET=fleet.csv # iterates devices; per-device PASS/FAIL log
make fleet-status # summarize fleet_log.csv
The full workflow is in Fleet.
R14: Verify resilience layer
ssh j@av-07 << 'EOF'
systemctl is-active systemd-journald jetson-blackbox.service \
jetson-brownout-guard.service tmp.mount
cat /etc/jetson-av-resilience-installed
journalctl --disk-usage
chronyc tracking | head -3
sudo ufw status verbose
ls /var/log/jetson-av/flights/
EOF
Every line should report a healthy state. The resilience layer’s core services were verified live on the reference device (2026-06-11): jetson-blackbox, jetson-brownout-guard, and jetson-av-pcie-aer-monitor all active, and the btrfs data partition mounted with the weekly scrub timer (see SAMPLES.md §5). The post-flash validator (make verify) runs most of these checks automatically.
Force a black-box flush before powering off (for example, when aborting a flight):
ssh j@av-07 'sudo kill -USR1 $(systemctl show jetson-blackbox -p MainPID --value)'
Full guide in Platform Resilience and Black Box.
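When checking several units, the systemctl is-active part of the check can be reduced to a count. A minimal sketch, relying on systemctl is-active printing one state per line when given several units; count_inactive is a hypothetical helper:

```shell
# Sketch: read `systemctl is-active` output from stdin and print how many
# units reported a state other than "active". Blank lines are ignored.
count_inactive() {
    awk 'NF && $0 != "active" { n++ } END { print n + 0 }'
}

# From the host:
#   ssh j@av-07 'systemctl is-active systemd-journald jetson-blackbox.service \
#       jetson-brownout-guard.service tmp.mount' | count_inactive
```

A result of 0 means every listed unit is active.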
R15: Verify AV stack (GPU / Metis / DLA)
Status: the Phase 5 stack (CUDA userspace, ROS 2, Isaac ROS, the mission service) is installed and verified live on the reference device (2026-06-11): verify_opengl_cuda.sh 14/14, ZED X capture at 29.5 FPS, live Metis inference at 49.2 FPS (29.6 camera-limited), full mission graph active under systemd. See R18. To check a unit:
ssh j@av-07 'jetson-av-version' # build identity
ssh j@av-07 'sudo /home/j/phase5/verify_opengl_cuda.sh' # CUDA stack
ssh j@av-07 'ros2 pkg list | grep -E "isaac_ros|nav2"'
ssh j@av-07 'sudo /usr/local/bin/launch_av_mission.sh --dry-run'
To start the mission:
ssh j@av-07 'sudo systemctl start jetson-av-mission.service'
ssh j@av-07 'systemctl status "jetson-av-*"'
Inspect it at runtime:
ssh j@av-07 'ros2 topic hz /zed/zed_node/rgb/color/rect/image' # camera
ssh j@av-07 'ros2 topic hz /detections' # Metis
ssh j@av-07 'ros2 topic echo /vslam/pose --once' # SLAM
Full guide in AV Stack and CUDA Libraries; measured throughput for every camera + Metis configuration is in Benchmarks.
R16: Clone a configured Jetson to N units
Flash one unit with make ignite, install apps, ROS packages, and models, then validate it. From that golden unit:
# On Jetson #0 (the golden), boot, install everything, validate.
# Then power off, put it in APX recovery mode, and from the host:
make clone-golden TAG=v1.0-bench-validated
# -> golden-images/golden-v1.0-bench-validated-<timestamp>/
make list-goldens # confirm
# For each receiving Jetson (in recovery mode each time):
make flash-golden GOLDEN=golden-v1.0-bench-validated-<ts> DEVICE=av-07
make verify
Every clone runs personalize_first_boot.sh at first boot, which sets a unique hostname, fresh SSH host keys, and an optional static IP. The result is bit-identical at flash time and divergent in identity at boot.
Full guide in Golden Image.
R17: Audit step manifest
logs/STEP_MANIFEST.tsv records every step. To query it:
column -t -s$'\t' logs/STEP_MANIFEST.tsv | tail -50
# Just the failures:
awk -F$'\t' '$4!="PASS" && $4!="SKIPPED" && NR>1' logs/STEP_MANIFEST.tsv
# Time spent per phase:
awk -F$'\t' 'NR>1 {s[$3]+=$5} END {for (p in s) printf " %-12s %ds\n", p, s[p]}' \
logs/STEP_MANIFEST.tsv
Per-step logs are at logs/<timestamp>_<slug>.log and are bundled by make logs. See Verification.
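A per-status tally is a quick first look before reading individual step logs. A minimal sketch, assuming a header row and the status in column 4, as the queries above imply; status_tally is a hypothetical helper:

```shell
# Sketch: count steps per status from a STEP_MANIFEST-style TSV on stdin.
# Skips the header row; the status column (4) is inferred from the queries
# in this entry, not from a documented schema.
status_tally() {
    awk -F '\t' 'NR > 1 { c[$4]++ } END { for (s in c) printf "%s %d\n", s, c[s] }' \
        | LC_ALL=C sort
}

# Usage:
#   status_tally < logs/STEP_MANIFEST.tsv
```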
R18: Full AV stack on a fresh flash (the complete replication sequence)
The end-to-end sequence that takes a freshly flashed unit to a verified working stack: CUDA, OpenCV-CUDA, ZED X capture (sharp, with IMU), ROS 2 + Isaac ROS, and live Metis inference. Every step and trap below was verified live on the reference device 2026-06-10/11. Run on the Jetson, in this order; each stage gates the next.
Stage 0: prerequisites. Flash per R1. First boot must complete online (/opt/av-env provisioned, marker ~/.jetson_initialized present). For the camera, the zedx-driver vendor tree must be available, either staged at /opt/zedx-daemons/vendor by the bake (images built after 2026-06-11) or cloned to ~/zedx-driver on the device.
Stage 1: CUDA userspace (~15 min, ~3.8 GB). The sample rootfs ships no JetPack packages (Troubleshooting P-3):
sudo apt update && sudo apt install -y nvidia-jetpack
/usr/local/cuda/bin/nvcc --version # CUDA 12.6
/opt/av-env/bin/python -c "import torch; print(torch.cuda.is_available())" # True
If torch prints False or a +cpu version, repair per P-2.
Stage 2: Phase 5, OpenCV-CUDA + ROS 2 + Isaac ROS + mission service (~70 min first unit, minutes thereafter via /opt/opencv-cache).
cd ~/Documents/jetson-rt-stack # or wherever the repo is checked out
sudo bash scripts/install_av_phase5.sh
/opt/av-env/bin/python -c "import cv2; print(cv2.cuda.getCudaEnabledDeviceCount())" # 1
ros2 pkg list | grep -cE 'isaac_ros|nav2|mavros' # > 25
Isaac ROS installs from NVIDIA’s apt repo (added automatically); no source build needed (FIELD_CONFIRM 3.4).
Stage 3: ZED X camera (~10 min; ~2 min with the daemon cache).
sudo bash scripts/install_zedx_daemons.sh # or /opt/zedx-daemons/install_zedx_daemons.sh
sg zed -c '/opt/av-env/bin/python -c "
import pyzed.sl as sl
c = sl.Camera(); p = sl.InitParameters(); print(c.open(p))"' # SUCCESS
The script installs the BMI088/SPSC IMU modules, the three vendor daemons, the udev rule, and the patched libnvisppg.so (sharp image). Re-login (or use sg zed) after the script adds you to the zed group. Details: Drivers §1.4-1.5.
Stage 3b: mission camera + detect node (~20 min first unit).
sudo bash scripts/install_zed_ros2_wrapper.sh # colcon-builds wrapper v5.3.1 at /opt/zed_ros_ws
sudo bash scripts/install_mission_inference.sh # models -> /opt/jetson-av/models, detect node, SLAM wiring
(Both run automatically as Phase 5 steps 3b/3c on fresh installs.) Note the wrapper >= 5.1 topic rename: consumers use /zed/zed_node/rgb/color/rect/image (Troubleshooting H-8).
Stage 4: Voyager inference (~25 min first model, then cached). Follow AV stack § "Voyager inference: verified procedure": install the app deps (minus opencv-python/pyopencl), run make operators (needs ninja-build opencl-headers ocl-icd-opencl-dev libsimde-dev), then:
cd ~/voyager-sdk
GST_PLUGIN_FEATURE_RANK=nvv4l2decoder:NONE PYTHONPATH=$PWD DISPLAY=:0 \
/opt/av-env/bin/python inference.py yolov5s-v7-coco media/h264/traffic1_1080p.mp4
# expected: ~49 FPS end-to-end, <15% CPU
DISPLAY=:0 PYTHONPATH=$PWD sg zed -c \
"/opt/av-env/bin/python ~/Documents/jetson-rt-stack/scripts/demo_zedx_metis.py"
# expected: ~29.6 FPS (camera-limited), live detections on screen
For the C++ equivalents on the live camera (detector-only and full sensor fusion with depth, skeletons, and IMU-fused pose), build per ZED X + Metis C++: the fusion sample keeps every feature at speed via --depth-every and records annotated H.264 through NVENC with --record. Measured numbers for each configuration, plus the reproducible harness, are in Benchmarks.
Stage 5: verify everything.
sudo bash scripts/verify_tuning.sh # RT + vermagic + power gauntlet
sudo bash scripts/verify_opengl_cuda.sh # 14/14 CUDA/GL/TRT/VPI checks
sudo /usr/local/bin/launch_av_mission.sh --dry-run # all 6 nodes resolve
The cyclictest <100 µs gate must be re-run headless (an interactive desktop session adds ~150 µs IPI spikes; the 3 µs average is unaffected).
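When re-running the latency test by hand, the gate can be applied to cyclictest's per-thread summary lines. A minimal sketch: cyclictest_gate is a hypothetical helper, and the cyclictest flags in the usage comment are illustrative, not the parameters verify_tuning.sh uses.

```shell
# Sketch: read cyclictest summary lines ("T: 0 ( pid) ... Max: N") from stdin
# and pass only if the worst Max value is below the limit in microseconds.
# Usage: cyclictest_gate <limit_us>
cyclictest_gate() {
    limit="$1"
    awk -v limit="$limit" '
        /^T:/ {
            # The value following the "Max:" label is that thread maximum.
            for (i = 1; i < NF; i++)
                if ($i == "Max:" && $(i + 1) + 0 > worst) worst = $(i + 1) + 0
            seen = 1
        }
        END {
            if (!seen) { print "FAIL: no thread lines"; exit 1 }
            if (worst < limit + 0) { printf "PASS: max %d us\n", worst; exit 0 }
            printf "FAIL: max %d us\n", worst; exit 1
        }'
}

# Headless, over SSH (flags are an example only):
#   sudo cyclictest --mlockall --smp --priority=90 --interval=1000 \
#       --duration=60 --quiet | cyclictest_gate 100
```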
Still operator-provided afterwards: the compiled models and detect_metis.py under /opt/jetson-av/, the zed_wrapper ROS package (source build against the SDK), Pixhawk on ttyTHS1, and the FIELD_CONFIRM §3.6 HV-rail test before any flight on MAXN_SUPER.