Troubleshooting
This is a symptom-first failure catalog for the jetson-rt-stack pipeline: building the PREEMPT_RT kernel, flashing the Jetson Orin NX 16GB to NVMe, and bringing up the Metis NPU, ZED X camera, and RTL8822CE Wi-Fi on L4T R36.4.3 / JetPack 6.2.
Each entry follows the same shape: symptom, then cause, then fix. Find the symptom that matches what you see in the index below, then work the fix. Entries are grouped by phase: build and integration, flashing, vermagic and module loadability, first-boot provisioning, and hardware.
Related docs: RUNBOOK.md for the happy-path procedures, VERIFICATION_REPORT.md for the audit behind each pinned value, SAMPLES.md for known-good run commands, and BENCHMARKS.md for healthy performance baselines.
Symptom index
Build & integration: B-1 cross-compiler missing · B-2 bindeb-pkg tools missing · B-3 vermagic gate fails · B-4 DTBO empty or absent · B-5 ZED X modules absent · B-6 incremental rebuild fails · B-7 CUDA build OOM (no swap)
Flashing & boot: F-1 flash hangs · F-2 no boot after flash · F-3 device boots stock kernel · F-4 frozen at NVIDIA logo · F-5 blank HDMI (normal) · F-6 RT kernel early-boot hang · F-7 ping OK but no ssh · F-8 nvgpu falcon errors, no CUDA · F-9 Wi-Fi PCIe link never trains · F-10 modules-load.d boot hang
Vermagic / module loadability: V-1 Invalid module format · V-2 sl_zedx.ko DKMS build fails · V-3 fixdep: Exec format error · V-4 loaded module misbehaves
First-boot / provisioning: P-1 Phase 5 installs nothing · P-2 torch is CPU-only · P-3 nvcc missing · P-4 OpenCV apt deps conflict · P-5 ROS setup.bash unbound variable · P-6 cv2 numpy import failure
Hardware & runtime: H-1 Metis not in lspci · H-2 stereo depth wrong · H-3 RT latency high · H-4 performance tanks mid-mission · H-5 image soft · H-6 motion sensors not detected · H-7 daemon socket lost · H-8 SLAM/detect get no frames · H-9 mission units crash-loop · H-10 jtop GPU tab crash · H-11 no Metis utilization view · H-12 CAMERA STREAM FAILED TO START
Build & integration
B-1. Build fails with missing cross-compiler
- Symptom: aarch64-buildroot-linux-gnu-gcc: command not found.
- Cause: Phase 2 was launched on the host, not inside Docker.
- Fix: make docker-build once, then make build (the Makefile routes through Docker automatically when not running inside the container).
B-2. make build fails at bindeb-pkg stage
- Symptom: make[1]: dpkg-buildpackage: command not found or similar.
- Cause: Headers .deb build needs Debian packaging tools.
- Fix: Add to Dockerfile: apt-get install -y dpkg-dev fakeroot build-essential debhelper. Already present in the current image; if you rebuilt locally without it, run make docker-build again.
B-3. Vermagic mismatch in the build tree (Phase 2 fails the gate)
- Symptom: end-of-Phase-2 vermagic gate prints [FAIL] Vermagic mismatches detected.
- Cause: a module was built outside the kernel’s make modules invocation (e.g., a vendor build script that called make against the wrong KDIR).
- Fix: rebuild from clean (make clean && make extract && make build). Check that the relevant plugin ran (run_hook post_extract in 01_extract_and_patch.sh) and registered the driver in-tree via its Kconfig+Makefile shim.
B-4. DTBO file empty (~0 bytes) or absent
- Symptom: pre_flash_audit.sh reports ZED X Overlay: FAIL (Missing binary in rootfs).
- Cause: NVIDIA’s kernel-devicetree build system silently skipped the dtbo-y target.
- Fix: 02_build_kernel.sh now compiles the overlay directly via cpp -DBUILDOVERLAY ... | dtc -@ -f. Confirm Phase 2 ran the manual compile block (look for ZED X overlay DTBO compiled in the log).
B-5. ZED X (stereolabs) modules absent, or extract logs patching file null
- Symptom: no sl_zedx.ko in the rootfs after make build; or make extract logs patching file null for 0006-...stereolabs_compilation_makefile.patch; or the DTBO compile fails with Utils/camera-clks.dtsi: No such file or directory.
- Cause: three distinct traps the Stereolabs vendor tree springs, all now handled by plugins/zedx/plugin.sh:
  - The 0006 patch adds source/stereolabs/Makefile{,-dkms} from /dev/null; patch -p2 mangles the /dev/null source to a file literally named null and the patch’s src/kernel/stereolabs/ layout doesn’t line up with our source/stereolabs/ copy at any -p level, so the Makefiles are never created.
  - The in-tree zedx-wrapper/ Kbuild shim uses a custom modules: target that Kbuild’s obj-m subdir descent never invokes (same trap as metis-wrapper/), so the modules silently don’t build during the in-tree pass.
  - The overlays #include "Utils/camera-clks.dtsi" and Utils/<sensor>-sl-modes.dtsi relative to their nv-public location, but Utils/ isn’t an overlay file so the *.dtsi glob copy misses it.
- Fix (all automatic now):
  - zedx_post_extract awk-materializes source/stereolabs/Makefile{,-dkms} straight from the *compilation_makefile*.patch body to the correct path.
  - 02_build_kernel.sh builds the stereolabs tree as an external module (M=) against the freshly built kernel, exactly like Metis: look for Building ZED X (stereolabs) external module in the Phase 2 log.
  - zedx_post_extract copies the Utils/ include tree into nv-public beside the overlays.
- Layer 1 (Metis, no camera) is still selectable independently with CONFIG_CAMERA_NONE=y.
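The repo's zedx_post_extract is not reproduced here, so the following is only a minimal sketch of the "materialize a new file straight from the patch body" idea, assuming a unified diff whose hunks carry `+start,count` headers. The function name materialize_from_patch and its suffix-matching argument are hypothetical, chosen for illustration.

```shell
# Sketch (hypothetical helper, not the repo's implementation):
# print the added lines of one file from a unified diff, bypassing patch(1)
# and its -p path handling entirely.
#   $1 = patch file
#   $2 = path suffix of the file to extract (e.g. stereolabs/Makefile)
materialize_from_patch() {
  awk -v want="$2" '
    # Inside a hunk: consume exactly as many new-side lines as the header declared,
    # so content that happens to look like a header is never misread.
    remaining > 0 {
      c = substr($0, 1, 1)
      if (c == "+") { if (capture) print substr($0, 2); remaining-- }
      else if (c == " " || c == "") remaining--
      next
    }
    # "+++ b/path" header: capture only if the path ends with the wanted suffix.
    /^\+\+\+ / {
      path = $2
      n = length(path) - length(want)
      capture = (n >= 0 && substr(path, n + 1) == want)
      next
    }
    # "@@ -a,b +c,d @@" header: d is the number of new-side lines (1 if omitted).
    /^@@ / {
      split($3, a, ",")
      remaining = (a[2] == "" ? 1 : a[2]) + 0
      next
    }
  ' "$1"
}
# Example use (paths are placeholders):
#   materialize_from_patch some.patch stereolabs/Makefile > source/stereolabs/Makefile
```

Matching on a path suffix sidesteps the layout mismatch described above, since the caller chooses the output location instead of relying on a -p strip level.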
B-6. make build fails on a second run with cp: cannot stat .../debian/linux-headers/...
- Symptom: an incremental re-run fails at the nvidia-headers target with cp: cannot stat '.../debian/linux-headers/.../include-prefixes/arm64'.
- Cause: bindeb-pkg consumes its own debian/ staging, so the tree does not rebuild reliably in place.
- Fix: always build clean, make clean && make extract && make build. Phase 2 runs under set -eo pipefail, so a failed compile aborts loudly rather than exiting 0 behind a stale Image.
B-7. CUDA pip builds (mmcv) get OOM-killed
- Symptom: a pip build of a CUDA extension (e.g. mmcv==2.1.0, pulled by the Axelera mmseg model deploy unet_fcn_256-cityscapes-onnx) dies mid-compile with a bare Killed. journalctl -k shows Out of memory: Killed process NNNNN (cicc) (or (nvcc)), constraint=CONSTRAINT_NONE ... global_oom.
- Cause: this stack ships ZRAM deliberately disabled for RT memory determinism (CONFIG_ZRAM is not set; nvzramconfig.service masked in 03_bake_rootfs.sh), and a freshly baked rootfs has no disk swap. CUDA extension builds fan out one nvcc/cicc process per source file, each ~2 GB resident; with zero swap on a 16 GB Orin NX (worse if another compile runs concurrently) the kernel OOM-kills the build. 00_doctor.sh already assumes “swap covers peaks” for builds, so the swap is expected to be present but is not created by the flash/bake flow.
- Fix (two independent levers; use both):
  - Add a disk swapfile on the NVMe (fast, no flash-wear; ZRAM stays off so the RT path is untouched). Keep vm.swappiness low so the kernel swaps only under real pressure and the real-time app never pages out:
        sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile
        sudo mkswap /swapfile && sudo swapon /swapfile
        echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots
        echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf   # persist
        sudo sysctl -w vm.swappiness=10
  - Cap build parallelism so peak RAM stays bounded even without swap, and build only the Orin GPU arch (less RAM, less time):
        MAX_JOBS=2 TORCH_CUDA_ARCH_LIST=8.7 pip install --no-build-isolation --no-deps "mmcv==2.1.0"
- Note: ZRAM (compressed-RAM swap) is intentionally NOT the remedy here. It is masked on this stack on purpose (RT determinism); a low-swappiness NVMe swapfile gives the same build headroom without the latency jitter ZRAM adds to the RT scheduling path. The Axelera runtime separately carries an OOM shield (oom_score_adj=-1000, applied by jetson_rt_tune.sh), so under memory pressure the mission-critical inference process is never the OOM victim; build-time compilers are.
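To confirm that a bare Killed really was the OOM killer, and which compiler processes it took, the kernel log line quoted in the symptom can be filtered. This is a small sketch; the helper name oom_victims is hypothetical and it assumes the log line format shown above.

```shell
# Sketch (hypothetical helper): read kernel log text on stdin and print the
# distinct process names the OOM killer terminated.
oom_victims() {
  sed -n 's/.*Out of memory: Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' | sort -u
}
# On the device: sudo journalctl -k | oom_victims
```

If the output lists cicc or nvcc, the build ran out of memory and the swapfile plus MAX_JOBS levers above apply.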
Flashing
F-1. Flash hangs at “Sending mb1” or “Waiting for target to boot-up”
- Symptom: l4t_initrd_flash.sh polls forever; no progress past ~30 s.
- Cause: USB chain issue (hub, autosuspend, RNDIS not enumerating), or Jetson did not enter recovery mode.
- Fix:
  - lsusb -t: Jetson must be on a direct motherboard port, not a hub.
  - lsusb | grep 0955:7323: APX must be present (= recovery mode).
  - sudo sh -c 'echo -1 > /sys/module/usbcore/parameters/autosuspend'.
  - sudo sh -c 'echo 200 > /sys/module/usbcore/parameters/usbfs_memory_mb'.
  - Reseat USB-C, re-short REC+GND, re-power.
F-2. Flash succeeds but Jetson doesn’t boot
- Symptom: After flash and recovery-pin removal, no HDMI, no SSH.
- Most common cause (check first): wrong root device, see F-4 below. A frozen logo with no 192.168.55.1 is almost always root=/dev/mmcblk0p1 on a board with no eMMC.
- Other cause: extlinux.conf has duplicate or malformed lines from a partial bake.
- Fix: Re-run Phase 3 (make bake): the script now defensively removes prior RT args and OVERLAYS lines before injecting fresh ones. Then make flash again.
F-3. apply_binaries.sh clobbers the custom kernel (device boots stock)
- Symptom: the flash completes, but on-device uname -v shows SMP preempt (not preempt_rt), cyclictest latency is bad, and dmesg is full of Invalid module format / version magic mismatch for nvgpu, the camera, and Metis. Nothing custom loads.
- Cause: 04_flash_nvme.sh must run apply_binaries.sh to stage NVIDIA userspace and firmware into the rootfs, but apply_binaries.sh also installs the stock nvidia-l4t-kernel* debs (Image + base/OOT/display modules, vermagic ...preempt). That overwrites the PREEMPT_RT Image and ~105 OOT modules built in Phase 2, so the device flashes a stock kernel with stock modules and none of our RT-built .ko (vermagic ...preempt_rt) match.
- Fix (automatic): 04_flash_nvme.sh now backs up rootfs/lib/modules/<rel>, boot/Image, and the ZED X .dtbo before apply_binaries.sh, restores them after, re-runs depmod, and then runs the rootfs vermagic gate (verify_vermagic.sh --rootfs). If the gate fails it aborts before any device write, so a clobbered rootfs can never reach the NVMe. Look for Restoring custom RT kernel over apply_binaries' stock kernel and Custom RT kernel restored and vermagic-verified in the flash log.
- Manual recovery (if you ever hit a half-clobbered rootfs): re-install the base modules with make install -C kernel version=-tegra, re-run the OOT make modules_install with KERNEL_HEADERS=<src>/kernel/kernel-jammy-src set (without it, modules_install targets /lib/modules/$(uname -r)/build and fails), copy the RT display modules from source/nvdisplay/kernel-open/ over the stock ones in updates/opensrc-disp/, then depmod -b rootfs <rel> and re-run the gate.
F-4. Frozen right after the NVIDIA logo (kernel waits forever for root)
- Symptom: the flash completes, the board powers on, shows the NVIDIA logo, and hangs there forever. No SSH, no 192.168.55.1 USB gadget on the host, the fan spins down. It looks dead but the kernel is actually alive, just stuck.
- Cause: the kernel command line carries root=/dev/mmcblk0p1 (eMMC). An Orin NX has no eMMC, the rootfs is on NVMe (/dev/nvme0n1p1). With rootwait the kernel boots fine, then waits indefinitely for a disk that will never appear. The root of the root-device bug: a custom extlinux.conf that relies only on ${cbootargs} inherits the board’s eMMC default. A stock NVMe flash writes root=/dev/nvme0n1p1 explicitly; a hand-rolled boot config must do the same. The boot.img used by L4TLauncher’s “Attempting Recovery Boot” path carries the same default (p3767.conf.common CMDLINE_ADD has no root=), so both boot paths need fixing.
- Confirm it: read the baked-in cmdline straight out of the kernel partition image:
      python3 -c "import sys;d=open('bootloader/boot.img','rb').read();print(d[64:576].split(b'\\0')[0])"
  If you see root=/dev/mmcblk0p1, this is your bug.
- Fix (automatic now): the last root= on the cmdline wins, so we append the right one in both places. 03_bake_rootfs.sh adds root=/dev/<TARGET_STORAGE_DEV> rootwait rootfstype=ext4 to the extlinux APPEND (normal boot); 01_extract_and_patch.sh appends root=/dev/<TARGET_STORAGE_DEV> to p3767.conf.common CMDLINE_ADD (the recovery boot.img path). Both default to nvme0n1p1.
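Because the last root= on the command line wins, it helps to compute the effective value rather than eyeball a long cmdline. A minimal sketch follows; the function name effective_root is hypothetical.

```shell
# Sketch (hypothetical helper): print the root= value the kernel will actually
# use, i.e. the LAST root= token on the given command line.
effective_root() (
  set -f                      # no glob expansion while word-splitting the cmdline
  root=""
  for tok in $1; do
    case "$tok" in
      root=*) root=${tok#root=} ;;   # later tokens overwrite earlier ones
    esac
  done
  printf '%s\n' "$root"
)
# On the device:        effective_root "$(cat /proc/cmdline)"
# On the staged rootfs: pass the extlinux APPEND line instead.
```

An empty result means no root= was present at all, which on this board falls back to the eMMC default described above.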
F-5. Blank HDMI after “Exiting boot services” is NORMAL, not a hang
- Symptom: HDMI shows the UEFI menu and the EFI stub: ... lines, then goes completely dark the moment Exiting boot services... prints. Tempting to read as “the kernel died here.” Do not.
- Why it’s normal: NVIDIA’s tegra_defconfig (and therefore ours) does not set CONFIG_FRAMEBUFFER_CONSOLE, so the kernel never draws a text console on any framebuffer. The EFI/GOP framebuffer that efifb would use is torn down at ExitBootServices on ARM64, and Orin’s real HDMI/DP comes from the out-of-tree nvidia-drm/nvidia-modeset modules that load late from userspace. So a perfectly healthy Orin also shows nothing on HDMI between “Exiting boot services” and the desktop. The blank screen is non-diagnostic: it looks identical whether the kernel is booting fine or fatally hung. Don’t chase FRAMEBUFFER_CONSOLE or HDMI tweaks.
- earlycon=efifb does NOT help on Orin: the GOP framebuffer is gone by the time it would print. It produces nothing; skip it.
- The real “did it boot?” signal is the USB device-mode gadget: if 192.168.55.1 (RNDIS) and /dev/ttyACM0 appear on the host, the kernel reached multi-user.target. If they never appear, it hung before multi-user. That, not the screen, is the test.
- The real boot log goes to console=ttyTCU0,115200 (the Tegra Combined UART), which ${cbootargs} injects via CMDLINE_ADD. On a carrier with a debug-UART header, a USB-TTL adapter on it reads the entire boot up to the hang. (The p3768 devkit has no onboard USB-serial bridge, so this needs an adapter.)
- Related: L4TLauncher: Attempting Recovery Boot means the normal extlinux boot was not used and the QSPI boot.img was loaded instead (a fallback). Attempting Direct Boot means the rootfs extlinux path was used. If you see Recovery Boot and a hang, suspect the root device (F-4) in the boot.img cmdline.
F-6. Custom RT kernel hangs in early boot, but the stock kernel boots fine
- Symptom: the flash succeeds, but our PREEMPT_RT kernel never reaches multi-user.target (no 192.168.55.1, no /dev/ttyACM0), while a bone-stock NVIDIA kernel on the same board boots normally.
- Step 1, isolate it (the decisive A/B test): re-flash with STOCK_BASELINE=1: STOCK_BASELINE=1 SEED_USER=j SEED_PASS=jetson bash scripts/04_flash_nvme.sh. This leaves apply_binaries’ stock kernel in place (no RT restore). If the stock image boots and comes up on 192.168.55.1, the board, SSD, PCIe (NVMe + Metis), and the entire flash pipeline are proven good and the fault is 100% the RT kernel, and you now have SSH into a working device to debug from. If even stock hangs, the problem is below the kernel (DTB / hardware).
- Gotcha in stock-baseline mode: the flasher boots kernel/Image + l4t_initrd.img as its RAM environment. apply_binaries rebuilds l4t_initrd.img with stock preempt modules, but kernel/Image is still the RT build, so the flasher’s RT kernel can’t load the stock initrd’s USB/RNDIS modules and you get “Device failed to boot to the initrd flash kernel.” 04_flash_nvme.sh handles this by copying the stock Image over kernel/Image in stock-baseline mode so the flash environment is internally consistent. (A genuinely transient version of that same error is USB autosuspend: sudo sh -c 'echo -1 > /sys/module/usbcore/parameters/autosuspend' and use a direct USB port, then retry.)
- Step 2, the leading fix: build the NVMe root chain into the kernel rather than as modules: CONFIG_NVME_CORE=y, CONFIG_BLK_DEV_NVME=y, CONFIG_PCIE_TEGRA194=y, CONFIG_PCIE_TEGRA194_HOST=y, CONFIG_PHY_TEGRA194_P2U=y (wired into 01_extract_and_patch.sh). This takes the initrd module-load step out of the root-mount path, the most likely location of the PREEMPT_RT early hang. The verified kernel runs the NVMe and PCIe chain built-in, which removes the initrd’s dependency on a preempt_rt nvme.ko. Keep the Axelera LINK_WAIT_MAX_RETRIES=200 patch in place so the slow Metis endpoint trains reliably. (The RTL8822CE Wi-Fi link not coming up is a separate problem, see F-9: a confirmed board-side hardware fault, not link-wait timing.) Confirm on-device that the rebuilt RT kernel reaches 192.168.55.1 before treating this as settled.
F-7. Boot reaches the USB gadget (ping works) but never reaches sshd
- Symptom: after flashing, the host sees the booted-OS USB gadget (lsusb shows 0955:7020 ... running on Tegra, an L4T-README volume mounts, and 192.168.55.1 pings), but ssh/:22 stays refused indefinitely (10+ min, no reboot) and the USB serial gadget shows no login prompt. The kernel booted; userspace never delivered a login. Field-diagnosed 2026-06-10: two independent traps produce this signature, and both were present at once. Check both.
- Trap 1, the oem-config wizard (unseeded flash): apply_binaries.sh re-creates /etc/systemd/system/default.target -> nv-oem-config.target on EVERY flash; only the user pre-seed (tools/l4t_create_default_user.sh) removes it. A flash without the seed boots into the first-boot wizard flow, which waits forever before sshd with the USB gadget up and pinging. scripts/04_flash_nvme.sh now seeds by default (user j unless SEED_USER is overridden; set SEED_USER="" for the interactive wizard on purpose) and aborts the flash if the wizard symlink survives seeding.
- Trap 2, a hardware-probing driver force-loaded at sysinit: see F-10. A /etc/modules-load.d/*.conf entry runs inside systemd-modules-load.service (sysinit); if that probe hangs, sysinit.target never completes and ssh/getty never start, while the kernel still answers ICMP.
- Diagnosis order: confirm the seed state first (readlink rootfs/etc/systemd/system/default.target must NOT be nv-oem-config.target on the staged rootfs), then audit every name in rootfs/etc/modules-load.d/. The blunt instrument, a diagnostic boot with the suspect modules held back (install <mod> /bin/true in a modprobe.d file) plus default.target -> multi-user.target, still works as a last resort but was NOT needed once both traps above were fixed: a healthy image reaches sshd in about 60 seconds.
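The diagnosis order above can be run as one pass over the staged rootfs. This is a sketch only: the function name audit_boot_traps and its output wording are hypothetical, and it is not the repo's pre-flash audit.

```shell
# Sketch (hypothetical helper): check a staged rootfs for both F-7 traps.
#   $1 = path to the staged rootfs
# Exits non-zero if the oem-config wizard symlink is still in place; always
# lists every module named under etc/modules-load.d/ for manual review.
audit_boot_traps() (
  rootfs=$1
  rc=0
  target=$(readlink "$rootfs/etc/systemd/system/default.target" 2>/dev/null || true)
  case "$target" in
    *nv-oem-config.target) echo "TRAP1: default.target -> $target"; rc=1 ;;
  esac
  for f in "$rootfs"/etc/modules-load.d/*.conf; do
    [ -e "$f" ] || continue
    # Skip blank lines and comment lines (starting with '#' or ';').
    grep -Ev '^[[:space:]]*([#;]|$)' "$f" | while read -r mod; do
      echo "force-load: $mod ($f)"
    done
  done
  exit "$rc"
)
# Example: audit_boot_traps rootfs || echo "do not flash yet"
```

Any force-load line naming a driver that probes real hardware deserves a second look before flashing (see F-10).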
F-8. nvgpu “Unable to recover GR falcon”, no CUDA, nvpmodel fails (CMA)
- Symptom: dmesg loops nvgpu ... Unable to recover GR falcon and FECS context switch init error; CUDA never initialises; the GPU devfreq node (/sys/devices/platform/17000000.gpu/devfreq/) is absent; nvpmodel fails with Error opening /sys/devices/platform/17000000.gpu/devfreq_dev/available_frequencies, so no power mode can be set either.
- Root cause (live-verified 2026-06-10): zero CMA. The boot args carried cma=2048M; on Orin NX the kernel cannot place a 2 GiB contiguous reservation (cma: Failed to reserve 2048 MiB), and any cmdline cma= ALSO bypasses the device tree’s linux,cma pool (“Reserved memory: bypass linux,cma node”). Net effect: CmaTotal: 0 kB. At GPU power-on, nvgpu must allocate a ~64 MB physically contiguous comptag buffer (ga10b_cbc_alloc_comptags); with zero CMA that allocation can never succeed (DMA alloc FAILED: [sysmem] size=64225280 ... PHYSICALLY_ADDRESSED), and the GR falcon / FECS chain collapses. The dramatic falcon errors are all downstream of one missing memory pool.
- Fix: pass NO cma= boot argument at all. The device tree linux,cma pool (256 MB, NVIDIA-sized, proven on stock) then takes over. scripts/03_bake_rootfs.sh no longer emits cma= and strips stale ones; the pre-flash audit fails if any cma= is present. Verified result on device: CmaTotal 262144 kB, zero nvgpu errors for the whole boot, GPU devfreq present, /dev/nvgpu/igpu0 populated, and nvpmodel sets MAXN_SUPER.
- Verify with: grep -i cma /proc/meminfo (CmaTotal must be > 0) and sudo journalctl -k -b | grep -iE "cma|falcon" (no “Failed to reserve”, no falcon errors). Build hygiene that is still good practice but was NOT the root cause here: export IGNORE_PREEMPT_RT_PRESENCE=1 for the OOT GPU build (02_build does), keep OOT modules in lockstep with the kernel, and run depmod -b $ROOTFS -F System.map at flash time (04_flash does).
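Both conditions above (no cma= on the cmdline, CmaTotal above zero) are easy to script. The two helper names below are hypothetical; this is a sketch of the check, not the repo's pre-flash audit.

```shell
# Sketch (hypothetical helpers).
# Print CmaTotal in kB from /proc/meminfo-style text on stdin.
cma_total_kb() {
  awk '$1 == "CmaTotal:" { print $2 }'
}

# Succeed (exit 0) if the given kernel command line carries any cma= token.
cmdline_has_cma() (
  set -f                      # no glob expansion while word-splitting
  for tok in $1; do
    case "$tok" in cma=*) exit 0 ;; esac
  done
  exit 1
)
# On the device:
#   cma_total_kb < /proc/meminfo                       # expect a value > 0
#   cmdline_has_cma "$(cat /proc/cmdline)" && echo "stale cma= present"
```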
F-9. RTL8822CE Wi-Fi: PCIe link never trains (no wlan0)
- Symptom: no wireless interface; lspci -d 10ec:c822 is empty; the C1 controller logs tegra194-pcie 14100000.pcie: Phy link never came up. The card itself is alive: its Bluetooth half enumerates over USB (lsusb shows 0bda:c822).
- Verdict (final, 2026-06-10): board-side hardware fault in this dev board’s M.2 Key-E slot. The software stack is exonerated by a five-track investigation:
  - The STOCK NVIDIA kernel fails identically on this board, so the RT kernel and its config are not the cause. The Realtek options ARE enabled in our build (RTW88=m, RTW88_8822CE=m, vendor rtl8822ce.ko present) and the config diff against the pristine BSP contains zero lane-specific deltas.
  - The flashed QSPI inputs (ODMDATA/BPMP UPHY config, MB1 pinmux, BCTs) are bit-identical to stock.
  - The PHY bank is healthy: Metis trains on HSIO lanes 4-7, the live BPMP DTB confirms lane 3 is owned by PCIe C1, and the USB3 ports that could share the lane are disabled.
  - The original card trains fine in another Jetson running a stock image.
  - A second identical card ALSO fails to train on this board.

  Slot power is good too: the C1 node carries vpcie3v3-supply = <&vdd_3v3_pcie> (wired into 01_extract_and_patch.sh), the rail measures 3300 mV, and retrains plus a full slot power cycle change nothing. The LTSSM never leaves Detect: electrically, nothing answers on lane 3. Full experiment log and captures: WIFI_EVIDENCE.md.
- If YOUR board shows this: the Wi-Fi support in this image works fine on healthy boards; do not start by suspecting the kernel. Check your hardware the same way: sudo dmesg | grep 14100000 for Phy link never came up, try your card in another machine, and try a known-good card in your slot. If two known-good cards fail in your slot under the stock NVIDIA kernel too, it is your board, not this image.
- Bring-up on a healthy board: the vendor driver is present but blacklisted from autoload (see F-10), and the NetworkManager profile for the target SSID is staged. sudo modprobe rtl8822ce (manual loads bypass the blacklist) plus the staged profile bring Wi-Fi up without a reflash; once the driver probes cleanly, re-bake with WIFI_AUTOLOAD=1 for boot-time autoload.
- Boot-time cost of a dead C1: the controller burns ~40 s per boot in link-wait windows (two ~20 s LTSSM waits at our LINK_WAIT_MAX_RETRIES=200). If Wi-Fi on a faulted board is abandoned, trim LINK_WAIT_MAX_RETRIES back toward stock or disable the C1 controller in the device tree to reclaim that time.
F-10. Never force-load a hardware-probing driver from modules-load.d
- Symptom: identical to F-7 trap 2: gadget up, ping OK, no ssh/getty, forever.
- Mechanism: files in /etc/modules-load.d/ are loaded by systemd-modules-load.service at sysinit. A driver whose probe touches real hardware (PCI/I2C) and hangs in D-state blocks sysinit.target, and with it every login path, while ICMP (kernel) still answers. This happened with a vendor rtl8822ce.ko force-load: harmless while the Wi-Fi slot was unpowered (probe found nothing and returned), fatal the moment the slot was powered.
- Policy (enforced by scripts/03_bake_rootfs.sh): the bake never writes wifi force-load entries. The vendor driver is blacklisted, which stops both modalias autoload (udev coldplug) and the systemd-modules-load path, while a deliberate sudo modprobe rtl8822ce still works for supervised bring-up. Once the driver is proven to probe cleanly, re-bake with WIFI_AUTOLOAD=1 to enable boot-time autoload.
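To check that the blacklist policy actually landed in a staged rootfs, something like the following can be used. It is a sketch: the function name is_blacklisted is hypothetical, the file names under modprobe.d are whatever the bake wrote, and the match is literal (it does not treat - and _ as equivalent the way modprobe does).

```shell
# Sketch (hypothetical helper): succeed if some *.conf file in a modprobe.d
# directory contains a "blacklist <module>" line.
#   $1 = module name, $2 = modprobe.d directory
is_blacklisted() {
  grep -Eqs "^[[:space:]]*blacklist[[:space:]]+$1([[:space:]]|\$)" "$2"/*.conf
}
# Example: is_blacklisted rtl8822ce rootfs/etc/modprobe.d || echo "not blacklisted"
```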
Vermagic / module loadability
V-1. Invalid module format in dmesg
- Symptom: dmesg | grep -i "invalid module format" shows matches, or a service fails because its driver isn’t loaded.
- Cause: vermagic mismatch, a .ko whose vermagic doesn’t match the running kernel’s.
- Fix:
  - Identify the offender (printing each path next to its vermagic, since grep -H on piped input cannot name the file): for ko in /lib/modules/$(uname -r)/... /*.ko; do echo "$ko: $(modinfo -F vermagic "$ko")"; done.
  - Confirm the running kernel’s expectation: cat /proc/version.
  - If a vendor .deb was installed post-flash (e.g. someone ran apt install nvidia-l4t-kernel-modules), re-flash from a clean build. APT pinning (Pin-Priority: -1) added by first-boot prevents recurrence.
  - See VERMAGIC_STRATEGY.md.
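When many modules are installed, comparing every vermagic against one reference string is quicker than reading them all. The sketch below is not the repo's verify_vermagic.sh; the helper name find_vermagic_mismatches and the tab-separated input format are assumptions made for illustration.

```shell
# Sketch (hypothetical helper): read "<module path><TAB><vermagic>" lines on
# stdin and print the paths whose vermagic differs from the expected string.
#   $1 = expected vermagic (whitespace is normalised before comparing)
find_vermagic_mismatches() {
  awk -F'\t' -v want="$1" '
    function norm(s) { gsub(/[ \t]+/, " ", s); sub(/^ /, "", s); sub(/ $/, "", s); return s }
    norm($2) != norm(want) { print $1 }
  '
}
# On the device, using a module known to load as the reference (placeholder path):
#   ref=$(modinfo -F vermagic /path/to/known-good.ko)
#   find "/lib/modules/$(uname -r)" -name '*.ko' | while read -r ko; do
#     printf '%s\t%s\n' "$ko" "$(modinfo -F vermagic "$ko")"
#   done | find_vermagic_mismatches "$ref"
```

Comparing the full string catches a preempt versus preempt_rt difference even when the release number is identical.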
V-2. ZED SDK install fails to build sl_zedx.ko
- Symptom: dkms install -m sl_zedx ... fails with Cannot find sources for kernel 5.15.x-tegra.
- Cause: linux-headers-*.deb wasn’t installed before the SDK installer ran.
- Fix: dpkg -i /opt/kernel-headers/linux-headers-*.deb, then re-run /opt/zed-sdk/install_zed_sdk.sh. If the .deb is missing entirely, re-run Phase 2 (make build) and re-bake (make bake).
V-3. On-target module build fails: scripts/basic/fixdep: Exec format error
- Symptom: any module build against /usr/src/linux-headers-$(uname -r) (DKMS, or a manual make -C ... M=...) dies with Exec format error on fixdep, later genksyms: not found or modpost failures.
- Cause: the headers .deb is produced by the cross-compile in Phase 2, so its scripts/ host tools are x86_64 binaries that cannot execute on the Jetson. This silently breaks the “DKMS installers can rebuild against our headers” guarantee.
- Fix: rebuild the host tools natively. install_zedx_daemons.sh §1 does it automatically (purge x86 binaries under scripts/, gcc fixdep.c, bison/flex + gcc for genksyms, mk_elfconfig + gcc for modpost, then freeze mtimes of auto.conf/autoconf.h so kbuild does not trigger syncconfig). Do NOT run make scripts in the headers tree: the deb ships no Kconfig files, so it fails AND deletes autoconf.h (restore by re-installing the deb: dpkg -i --force-overwrite /opt/kernel-headers/*.deb).
V-4. Loaded modules look fine but driver behaves wrong
- Symptom: lsmod shows the module, but downstream subsystems (V4L2 device, PCIe link) misbehave.
- Cause: per-symbol CRC drift (CONFIG_MODVERSIONS=y mismatch even though vermagic strings happen to match): extremely rare but possible if someone built modules against a different kernel source tree with the same LOCALVERSION.
- Fix: Force-load is not allowed (CONFIG_MODULE_FORCE_LOAD is off). The only path is rebuild from a clean tree.
First-boot / provisioning
P-1. Phase 5 “completes” but installs nothing (step::run: command not found)
- Symptom: first-boot journal shows [WARN] Phase 5 install reported issues, every step logs install_av_phase5.sh: line NN: step::run: command not found, and afterwards there is no OpenCV, no /opt/ros/humble, no jetson-av-mission.service, no /etc/jetson-av/mission.conf, yet the script printed its “Phase 5 Install Complete” banner.
- Cause: two stacked bugs, both fixed in the repo 2026-06-10. (1) install_av_phase5.sh / install_av_stack.sh sourced lib/verify.sh without lib/config.sh first; verify.sh refuses to define the step::* functions, and STRICT=0 swallowed the failures. (2) The lib include guards (JETSON_AV_LOG_LOADED etc.) were exported, so a sub-script (build_opencv_cuda.sh invoked by the orchestrator) inherited the guard but not the functions (env vars cross fork+exec, shell functions don’t) and its libs short-circuited to the same command not found. Staged copies under /home/j/phase5 on devices flashed before the fix are still broken (their lib/config.sh also dies on unbound pins without versions.env).
- Fix: run Phase 5 from a repo checkout on the device: cd ~/Documents/jetson-rt-stack && sudo bash scripts/install_av_phase5.sh. Devices baked after the fix run it correctly at first boot.
P-2. torch in /opt/av-env is CPU-only (2.x.y+cpu, torch.cuda.is_available() False)
- Symptom: /opt/av-env/bin/python -c "import torch; print(torch.__version__)" prints a +cpu build even though first boot installed the cu126 wheel.
- Cause: a later unconstrained pip run (observed live 2026-06-10: the Voyager SDK step, PyPI primary + Axelera extra-index) “upgrades” torch to generic PyPI’s CPU wheel. An unpinned torchvision similarly resolves to 0.26.0 and drags in numpy 2.x, breaking the Voyager pins.
- Fix:
      sudo /opt/av-env/bin/pip install --force-reinstall torch==2.8.0 \
        torchvision==0.23.0 "numpy<2.0.0" \
        --index-url https://pypi.jetson-ai-lab.io/jp6/cu126
  jetson_first_boot.sh now restates torch==2.8.0 in the Voyager pip call and pins torchvision/numpy in the torch call, so freshly baked images don’t regress.
P-3. nvcc not found and zero cuda-* packages on a freshly flashed device
- Symptom: which nvcc prints nothing, dpkg -l | grep -c '^ii cuda' is 0, and torch import fails with libcudart.so.12: cannot open shared object file.
- Cause: the NVIDIA sample rootfs ships without JetPack userspace (CUDA toolkit, cuDNN, TensorRT, VPI, NVIDIA GL). Nothing in Phases 1-4 installs it; it comes from apt on the running device.
- Fix: sudo apt update && sudo apt install nvidia-jetpack (~3.8 GB). Safe by design: the kernel/bootloader pins (Pin-Priority: -1) block any RT-kernel clobber. Then see CUDA_LIBS.md for the OpenCV build and the verification gauntlet.
P-4. OpenCV build deps fail: nvidia-cuda-toolkit : Depends: nvidia-cuda-dev (= 11.5.1-1ubuntu1)
- Symptom: the “Install OpenCV build deps” step fails with an apt dependency conflict, and downstream cmake: command not found (the whole apt transaction was rolled back, so none of the build deps installed).
- Cause: nvidia-cuda-toolkit is Ubuntu universe’s CUDA 11.5, which conflicts with JetPack’s nvidia-cuda-dev 6.x. On Jetson the CUDA toolkit comes from JetPack (cuda-toolkit-12-6, installed by nvidia-jetpack) and lives at /usr/local/cuda. It must never be apt-installed from Ubuntu.
- Fix: removed from build_opencv_cuda.sh deps (2026-06-10). If you hit this on an old checkout, drop nvidia-cuda-toolkit from the apt line and ensure /usr/local/cuda/bin is on PATH for the build.
P-5. Installer or mission launcher dies right after “STEP COMPLETE: Install ROS 2”
- Symptom: the script aborts with /opt/ros/humble/setup.bash: line 8: AMENT_TRACE_SETUP_FILES: unbound variable immediately after the ROS apt install succeeds; launch_av_mission.sh can die the same way at startup.
- Cause: ROS 2 setup scripts reference unset variables. Under set -u an expansion error aborts a non-interactive bash even when the source is guarded with || true.
- Fix: wrap every ROS setup.bash source in set +u … set -u (applied 2026-06-10 to install_av_stack.sh and launch_av_mission.sh).
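The set +u … set -u wrapper can be factored into one helper so no call site forgets it. This is a sketch of that pattern, not the code in install_av_stack.sh; the name source_relaxed is hypothetical, and it restores nounset only if it was on to begin with.

```shell
# Sketch (hypothetical helper): source a file that references unset variables
# without tripping `set -u`, then restore the caller's nounset state.
#   $1 = file to source
source_relaxed() {
  case $- in
    *u*) _had_u=1 ;;
    *)   _had_u=0 ;;
  esac
  set +u
  . "$1"
  _rc=$?
  if [ "$_had_u" = 1 ]; then set -u; fi
  return "$_rc"
}
# Example: source_relaxed /opt/ros/humble/setup.bash
```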
P-6. cv2 import dies with numpy.core.multiarray failed to import
- Symptom: /opt/av-env/bin/python -c "import cv2" fails right after an OpenCV-CUDA build that completed cleanly; cmake output (or CMakeCache.txt) shows numpy ... (ver 2.x) with a numpy/_core/include path.
- Cause: the bindings compile against the venv’s numpy headers and only run on the same numpy major version. pyzed’s wheel declares an unbounded numpy dependency and its installer runs --ignore-installed, silently bumping the venv to numpy 2.x; an OpenCV build configured in that window produces bindings that break once the Voyager numpy<2 pin is restored. Two guards added 2026-06-10: install_zed_sdk.sh re-asserts the pin after pyzed, and build_opencv_cuda.sh refuses to configure when the venv numpy is >=2.
- Fix (without a full rebuild: only the bindings need recompiling):
      sudo /opt/av-env/bin/pip install --force-reinstall "numpy<2.0.0"
      cd /var/tmp/opencv-build/opencv/build
      sudo cmake -U 'PYTHON3*' -U 'OPENCV_PYTHON3*' \
        -D PYTHON3_EXECUTABLE=/opt/av-env/bin/python -D BUILD_opencv_python3=ON .
      sudo make opencv_python3 -j6 && sudo make install
  Also remove pip’s CPU opencv-python if present (it shadows the CUDA build): sudo /opt/av-env/bin/pip uninstall -y opencv-python, and ensure the venv sees the system install: ln -sfn /usr/local/lib/python3.10/site-packages/cv2 /opt/av-env/lib/python3.10/site-packages/cv2.
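A guard of the kind described for build_opencv_cuda.sh could look roughly like this. It is a sketch under assumptions, not the script's actual code; the function name numpy_pin_ok is hypothetical.

```shell
# Sketch (hypothetical helper): succeed only if a numpy version string has a
# major version below 2. Returns 2 for input that is not a version number.
#   $1 = version string, e.g. 1.26.4
numpy_pin_ok() {
  major=${1%%.*}
  case "$major" in
    ''|*[!0-9]*) return 2 ;;   # empty or non-numeric major: refuse to guess
  esac
  [ "$major" -lt 2 ]
}
# Example, before configuring the OpenCV build:
#   ver=$(/opt/av-env/bin/python -c 'import numpy; print(numpy.__version__)')
#   numpy_pin_ok "$ver" || { echo "numpy $ver breaks the numpy<2 pin"; exit 1; }
```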
Hardware
H-1. lspci shows nothing for Axelera
- Symptom: lspci -d 1f9d: is empty; the Metis NPU does not enumerate at 0004:01:00.0.
- Cause: PCIe link training failed. A missing cold-boot LINK_WAIT_MAX_RETRIES patch is the most common culprit.
- Fix: Verify pcie-designware.h shows LINK_WAIT_MAX_RETRIES 200 in the source tree, then rebuild and re-flash. The patch is applied by plugins/axelera in Phase 1. See KERNEL_PATCHES.md §1.
H-2. ZED X frames appear but stereo depth is wrong
- Cause: MAX96712 deserializer was selected instead of MAX9296.
- Fix: See DRIVERS.md §1.2, “MAX9296 vs MAX96712 silent-corruption trap”.
H-3. cyclictest reports max latency > 1 ms
- Cause: RT boot args missing, governor wrong, or IRQs not pinned.
- Fix:
  - `cat /proc/cmdline` must contain `isolcpus=1-5 nohz_full=1-5 rcu_nocbs=1-5`.
  - `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` must be `performance`.
  - Run `sudo /home/j/jetson_rt_tune.sh` to re-apply per-boot tuning.
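The boot-argument check above can be sketched as a function that names whichever arguments are absent. The helper name `missing_rt_args` is hypothetical, and the sample command line is made up for the demo.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each required RT boot arg absent from a cmdline.
missing_rt_args() {
    local cmdline="$1" arg
    for arg in isolcpus=1-5 nohz_full=1-5 rcu_nocbs=1-5; do
        case " $cmdline " in
            *" $arg "*) ;;                 # present as a whole word
            *) printf '%s\n' "$arg" ;;     # absent
        esac
    done
}

# Sample string; on the device pass "$(cat /proc/cmdline)" instead.
sample='root=/dev/nvme0n1p1 rw isolcpus=1-5 nohz_full=1-5'
missing_rt_args "$sample"
```

Empty output means all three arguments are present; any printed name points at a boot-args problem rather than a governor or IRQ-pinning problem.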
H-4. Performance tanks mid-mission
- Cause: thermal throttling; check `cat /sys/class/thermal/thermal_zone*/temp` (millidegrees C).
- Fix: `jetson_rt_tune.sh` already pegs the fan at PWM 255. If you’re still hitting >85°C, add active cooling.
- Baselines: expected FPS, GPU load, power, and temperature per configuration are in BENCHMARKS.md. For the C++ fusion sample specifically, the `--depth-every` cadence flag is the biggest throughput lever; see ZEDX_METIS_CPP.md.
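Because the sysfs values are in millidegrees, it can help to convert them and print only the zones above the 85°C mark. A minimal sketch: the helper name `hot_zones` is hypothetical, and the demo uses a fake directory tree so it runs anywhere.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "<zone> <temp> C" for every thermal zone whose
# reading exceeds the limit (in millidegrees C, default 85000).
hot_zones() {
    local base="${1:-/sys/class/thermal}" limit_mc="${2:-85000}" f t
    for f in "$base"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        t="$(cat "$f" 2>/dev/null)" || continue      # some zones refuse reads
        [[ "$t" =~ ^-?[0-9]+$ ]] || continue
        if (( t > limit_mc )); then
            printf '%s %d.%d C\n' "$(basename "$(dirname "$f")")" \
                "$((t / 1000))" "$(((t % 1000) / 100))"
        fi
    done
}

# Demo on a fake tree; call hot_zones with no arguments on the device.
fake="$(mktemp -d)"
mkdir -p "$fake/thermal_zone0" "$fake/thermal_zone1"
echo 91500 > "$fake/thermal_zone0/temp"
echo 45000 > "$fake/thermal_zone1/temp"
hot_zones "$fake"
rm -rf "$fake"
```

Empty output during a slowdown suggests the cause is something other than temperature, in which case the BENCHMARKS.md baselines are the next comparison point.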
H-5. ZED X works but the image is soft / not crisp
- Symptom: capture runs at full FPS, colors fine, but the image lacks sharpness regardless of `.isp` tuning files.
- Cause: stock L4T R36.4.x `libnvisppg.so` (NVIDIA ISP post-processing library) has an image-quality regression with the ZED X. This is distinct from the `.isp` calibration files (§1.4 of DRIVERS.md).
- Fix: install Stereolabs’ patched build from `zedx-driver/nvidia_364_fix/R36.4.3/libnvisppg.so`. `install_zedx_daemons.sh` does it automatically via `dpkg-divert` and restarts `nvargus-daemon`. Verified live 2026-06-11 (sharp at 29.5 FPS sustained, HD1200).
H-6. ZED SDK: CAMERA MOTION SENSORS NOT DETECTED / No camera-to-SPSC mappings found
- Symptom: the camera enumerates in V4L2 (`/dev/video*` present) but `pyzed`/SDK `open()` fails; later stages may show `Device 0 not accessible` or `Failed to configure device 0`.
- Cause: ZED SDK 5.3 needs the BMI088 IMU + SPSC kernel modules and the vendor daemons (`ZEDX_Driver`, `ZEDX_Daemon`, `IMU_Daemon`) from the zedx-driver tree, none of which the SDK installer provides. Two traps: the loader insmods from `kernel/drivers/stereolabs/` (NOT `updates/`), and `/dev/spsc_bmi*` is root-only without a udev rule.
- Fix: `sudo scripts/install_zedx_daemons.sh` (builds modules against the running kernel, installs daemons + udev rule, restarts the chain). Full background: DRIVERS.md §1.5.
H-7. ZED SDK suddenly fails: Failed to connect to daemon socket
- Symptom: the camera worked, then `pyzed` open fails with `Failed to connect to daemon socket` / `Failed to configure device 0`; `IMU_Daemon.service` shows active.
- Cause: `IMU_Daemon` listens on `/tmp/imu_daemon.sock`. Anything that wipes `/tmp` (the Phase 7 resilience step activating `tmp.mount` is the documented offender) deletes the socket while the daemon keeps running.
- Fix: restart the chain: `sudo systemctl restart driver_zed_loader && sleep 3 && sudo systemctl restart zed_x_daemon IMU_Daemon`. `install_uav_resilience.sh` now does this automatically after enabling `tmp.mount`.
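To tell this case apart from other open failures, check whether the socket file itself still exists. A minimal sketch: the helper name `socket_state` and the `IMU_SOCK` override variable are hypothetical; the default path is the one named in the Cause.

```shell
#!/usr/bin/env bash

# Hypothetical helper: report whether a path exists as a Unix domain socket.
socket_state() {
    local sock="$1"
    if [ -S "$sock" ]; then
        echo present
    else
        echo missing
    fi
}

sock="${IMU_SOCK:-/tmp/imu_daemon.sock}"   # path from the Cause above
echo "IMU socket ${sock}: $(socket_state "$sock")"
```

If this prints `missing` while `systemctl is-active IMU_Daemon` still prints `active`, the socket was deleted underneath a running daemon and the restart chain in the Fix applies.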
H-8. SLAM / detect nodes run but receive no frames (silent)
- Symptom: `zed_wrapper`, cuVSLAM, and `detect_metis.py` all show running; `ros2 topic hz` on the consumer side prints nothing; no errors anywhere.
- Cause: zed-ros2-wrapper 5.1.0 renamed every image topic (`rgb/image_rect_color` → `rgb/color/rect/image` etc.), topics are lazy (published only while subscribed), and the left/right topics, which are exactly what cuVSLAM needs, are off by default (`video.publish_left_right: false`). Any consumer wired to a pre-5.1 name waits forever.
- Fix: subscribe to the >=5.1 names (`/zed/zed_node/rgb/color/rect/image`, `/zed/zed_node/depth/depth_registered`); for stereo consumers enable left/right via `/etc/jetson-av/zedx_overrides.yaml` (staged by `install_mission_inference.sh`; the mission launcher passes it as `ros_params_override_path`). cuVSLAM is wired through `/opt/jetson-av/zedx_vslam.launch.py`. The stock `isaac_ros_visual_slam.launch.py` declares NO remappings at all.
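Since a stale topic name fails silently, a text search over your own launch and config files is a cheap first check. The sketch below looks only for the one pre-5.1 name quoted above; the helper name `find_stale_topics` and the demo files are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list files and lines that still use the pre-5.1
# rgb topic name. Only the one rename quoted above is checked.
find_stale_topics() {
    local dir="$1"
    grep -rn -- 'rgb/image_rect_color' "$dir" || true
}

# Demo on a throwaway tree; point it at your own launch/config directory.
demo="$(mktemp -d)"
printf '%s\n' "remap = ('image', '/zed/zed_node/rgb/image_rect_color')" > "$demo/old_consumer.py"
printf '%s\n' "remap = ('image', '/zed/zed_node/rgb/color/rect/image')" > "$demo/new_consumer.py"
find_stale_topics "$demo"
rm -rf "$demo"
```

Only `old_consumer.py` is reported. Other renamed topics would need their own patterns, which this sketch does not attempt to enumerate.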
H-9. Every mission unit crash-loops instantly (activating auto-restart)
- Symptom: `systemctl start jetson-av-mission` spawns the units, but `jetson-av-camera/slam/nav2/...` all cycle `activating auto-restart`; journals show `ros2: command not found` or `failed to initialize logging: Failed to get logging directory`.
- Cause: `systemd-run` transient units start from PID 1 with no ROS environment and no HOME: nothing from the launcher’s shell carries over. Both bit the first live mission run (2026-06-11).
- Fix (in `launch_av_mission.sh` since 2026-06-11): nodes spawn via `/bin/bash -lc` (login shell sources `/etc/profile.d/jetson-av-stack.sh`) with `--setenv=HOME=/root --setenv=ROS_LOG_DIR=/var/log/jetson-av/ros`. If you add new spawn sites, keep that pattern.
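For a new spawn site, the pattern can be sketched as a command builder that is printed before it is run. This is not the code in `launch_av_mission.sh`: the helper name `build_spawn_cmd`, the `--unit=` naming, the unit name `jetson-av-demo`, and the node command are assumptions for illustration. Only the `--setenv` values and `/bin/bash -lc` come from the Fix above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: assemble a systemd-run command that gives the
# transient unit a HOME, a ROS log dir, and a login shell.
build_spawn_cmd() {
    local unit="$1" node_cmd="$2"
    SPAWN_CMD=(systemd-run --unit="$unit"
        --setenv=HOME=/root
        --setenv=ROS_LOG_DIR=/var/log/jetson-av/ros
        /bin/bash -lc "$node_cmd")
}

# Dry run: print the command instead of executing it.
build_spawn_cmd jetson-av-demo 'ros2 topic list'
printf '%q ' "${SPAWN_CMD[@]}"
echo
```

Running `"${SPAWN_CMD[@]}"` as root would start the unit. Keeping the command in an array avoids re-quoting the node command when it contains spaces.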
H-10. jtop crashes when opening the GPU tab (KeyError: '3d_scaling')
- Symptom: `jtop` runs fine until you switch to the GPU page, then the curses UI dies with `KeyError: '3d_scaling'` raised from `jtop/gui/pgpu.py`. Confirmed on this image 2026-06-11 (jetson-stats 4.3.2, L4T r36.4.3).
- Cause: upstream jetson-stats bug on JetPack 6. The GUI reads `gpu_status['3d_scaling']` unconditionally, but the jtop service only populates that key when `/sys/devices/platform/17000000.gpu/enable_3d_scaling` is readable, and r36’s out-of-tree nvgpu driver no longer exposes that node anywhere in `/sys` (`railgate_enable` survives, so the rest of the page is fine). 4.3.2 is the latest PyPI release, so upgrading does not help.
- Fix: make the missing key non-fatal in the two places that assume it (the page render, and the toggle-button callback path in the client API):

      sudo sed -i "s/gpu_status\['3d_scaling'\]/gpu_status.get('3d_scaling', False)/g" \
        /usr/local/lib/python3.10/dist-packages/jtop/gui/pgpu.py
      sudo sed -i "s/\['status'\]\['3d_scaling'\]/['status'].get('3d_scaling', False)/" \
        /usr/local/lib/python3.10/dist-packages/jtop/core/gpu.py

  No service restart needed. Showing “Disable” is accurate: the 3D-scaling knob genuinely no longer exists on JP6. Re-apply after any jetson-stats reinstall/upgrade unless upstream has fixed it by then.
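Before patching files in place, the two substitutions can be previewed on sample text. The sketch below reuses the exact `sed` expressions from the Fix but reads from stdin; the function names `patch_pgpu` and `patch_client` are hypothetical, and the input lines are made up for illustration, not copied from jetson-stats.

```shell
#!/usr/bin/env bash

# The two sed expressions from the Fix, applied to stdin instead of in place.
patch_pgpu()   { sed "s/gpu_status\['3d_scaling'\]/gpu_status.get('3d_scaling', False)/g"; }
patch_client() { sed "s/\['status'\]\['3d_scaling'\]/['status'].get('3d_scaling', False)/"; }

# Made-up sample lines, not copied from jetson-stats.
echo "scaling = gpu_status['3d_scaling']"  | patch_pgpu
echo "return data['status']['3d_scaling']" | patch_client
```

Both outputs replace the hard subscript with `.get('3d_scaling', False)`. On the device, `grep -n "3d_scaling"` on each target file before and after the in-place edit confirms what changed.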
H-11. No way to see Metis utilization: axmonitor missing from the pip-installed runtime
- Symptom: jtop shows nothing for the NPU (it only knows Tegra hardware), and the Metis monitoring tool `axmonitor` is not in `/opt/av-env/bin` alongside `axdevice`/`axrunmodel`. Found on this image 2026-06-11 (Voyager runtime 1.6.1 via pip).
- Cause: not a bug. The pip install path deliberately omits `axmonitor` and its backend service `axsystemserver`; Axelera only ships them in the apt packages (`axelera-voyager-sdk-base-*` → `axelera-runtime-*`). The SDK docs note this in `~/voyager-sdk/docs/tutorials/axmonitor.md`. arm64 builds matching runtime 1.6.1 are confirmed present in their apt repo.
- Fix: add the Axelera apt repo and install the matching base metapackage, then start `axsystemserver.service`. Full commands and usage in Samples § “Monitoring the Metis”. Verified live 2026-06-11. Two traps in the packaging, both hit on this unit:
  - `axmonitor` dies with `libQt6Svg.so.6: cannot open shared object file`: the deb does not declare its Qt dependency. `sudo apt-get install libqt6svg6` fixes it (sole missing lib per `ldd .../bin/axmonitor`).
  - `systemctl enable axsystemserver` aborts with `update-rc.d: error: axsystemserver Default-Start contains no runlevels`: the package ships both a native unit and a malformed SysV init script, and the SysV sync kills the enable. `systemctl start` works; for boot persistence, symlink the unit into `/etc/systemd/system/multi-user.target.wants/` by hand (then `systemctl is-enabled` reports `enabled`).

  No power readings on the M.2 module (4-Metis PCIe card only). Use jtop’s INA3221 rails for board draw.
H-12. ZED SDK: CAMERA STREAM FAILED TO START after a camera client is killed
- Symptom: `sl::Camera::open()` returns `CAMERA STREAM FAILED TO START`, and every subsequent open fails the same way, even though `/dev/video0,1` exist and `zed_x_daemon` reports `active`. Usually appears right after a camera app was hard-killed or hung mid-stream (an OOM-killed or `timeout`-killed process, or a thread-join deadlock on shutdown).
- Cause: the GMSL2 / Argus capture pipeline is left wedged. The deserializer plus the `nvargus` session were not torn down cleanly, so the stream cannot restart until the daemons reset it. The devices and the daemon still look healthy, which makes it read like a camera fault when it is really a stuck pipeline.
- Fix: restart the capture daemons, then retry:

      sudo systemctl restart nvargus-daemon zed_x_daemon

  If it persists, also run `sudo systemctl restart driver_zed_loader` and re-probe with a short capture. Prevent it by shutting camera clients down gracefully (let them finalize) instead of sending `SIGKILL`; a well-behaved client releases the stream on exit. Distinct from H-7 (daemon socket lost): there the daemon is still running but its socket file is gone; here the daemon is up and reachable but the stream pipeline is stuck. Verified live 2026-06-11.
Logs
Include with any support request:
- Kernel build: `tail -2000 BUILD_LOG.md`.
- Flash: full `l4t_initrd_flash.sh --showlogs` output.
- First-boot: `journalctl -b 0 -u jetson-first-boot.service`.
- Per-boot tune: `journalctl -b 0 -u jetson-rt-tune.service`.
- dmesg: `dmesg --level=err,warn,crit,alert,emerg`.
- Vermagic state: the full output of `sudo /home/j/verify_tuning.sh`.