RT Kernel Optimization
This is the tuning loop for the PREEMPT_RT Jetson built by jetson-rt-stack: measure, diagnose, change, verify, repeat. It covers cyclictest methodology, CPU isolation, IRQ affinity, scheduler and power tuning, and how the project’s automated latency gate works.
For the rationale behind each kernel flag, see KERNEL_OPTIMIZATIONS.md; for the source-tree patches those flags ride on, see KERNEL_PATCHES.md; for the per-boot service knobs and cross-component coordination, see FINE_TUNING.md.
Scope
- Hardware: Jetson Orin NX 16GB (p3767-0000 module on a p3768 Orin Nano devkit carrier).
- BSP: L4T R36.4.3 / JetPack 6.2, kernel 5.15 with the PREEMPT_RT patch.
- Build host: Docker container with the Bootlin toolchain, driven by the numbered scripts/.
- Inputs: kernel source under latest_jetson/Linux_for_Tegra/source/kernel/, the Axelera PCIe patch, and the ZED X patches in zedx-driver/.
- Output: a CONFIG_* set, extlinux boot args, and a runtime recipe that produce measurable, repeatable jitter reduction.
The isolation set for this build is cores 1 through 5. Core 0 keeps the housekeeping work (timers, RCU callbacks, IRQs). This split is single-sourced in versions.env as RT_ISOLATED_CORES=1-5 and is enforced by the bake and verify scripts.
The automated latency gate
make verify runs scripts/05_post_flash_validate.sh, which SSHes into the flashed board and executes scripts/verify_tuning.sh (baked into the rootfs during stage 3). The jitter step of that script is the project’s go/no-go gate. Note that make verify also includes a Python venv import check against /opt/av-env; that step fails until first-boot provisioning has completed with network access (the first-boot service re-runs on each boot and finishes provisioning once the board is online). The jitter gate itself does not depend on that step.
The gate runs cyclictest exactly as follows, on isolated core 1, for 10 seconds, with no concurrent load:
sudo cyclictest -m -p 99 -t 1 -a 1 -n -D 10s --quiet
It parses the reported Max: latency and passes when that maximum is under 100 microseconds. This is a fast smoke test, not a performance specification. Flags map as: -m lock memory, -p 99 SCHED_FIFO priority 99, -t 1 one measurement thread, -a 1 pin to core 1, -n clock_nanosleep, -D 10s run for ten seconds.
Reproduce the gate by hand on the board:
# Run after the board has reached multi-user and jetson-rt-tune.service has applied.
sudo cyclictest -m -p 99 -t 1 -a 1 -n -D 10s --quiet
A pass means core 1 stayed under 100 microseconds for the run. It does not characterize the long tail. For a deployment sign-off, run the longer sweeps below.
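The pass/fail rule can be sketched as a small shell function. This is a minimal illustration, not the contents of verify_tuning.sh: the name gate_check is hypothetical, and it assumes the cyclictest summary lines carry a Max: field followed by a value in microseconds.

```shell
#!/bin/sh
# Hypothetical helper, not the project's verify_tuning.sh.
# Reads cyclictest summary text on stdin and applies the gate rule:
# exit 0 when the largest "Max:" value is under the limit, 1 when it is
# not, 2 when no Max: field was found at all.
gate_check() {
    limit_us="${1:-100}"
    awk -v limit="$limit_us" '
        # Summary lines look like: T: 0 (1234) P:99 ... Max:      23
        match($0, /Max:[ \t]*[0-9]+/) {
            v = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (!seen || v > max) max = v
            seen = 1
        }
        END {
            if (!seen) { print "no Max: field found"; exit 2 }
            printf "max=%d us limit=%d us\n", max, limit
            exit (max < limit ? 0 : 1)
        }'
}

# Usage on the board (needs rt-tests and root):
#   sudo cyclictest -m -p 99 -t 1 -a 1 -n -D 10s --quiet | gate_check 100
```

Taking the worst Max: across all summary lines means the same function also works when more than one measurement thread is reported.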
Documented conditions for the gate
| Parameter | Value |
|---|---|
| Kernel | 5.15 PREEMPT_RT (-tegra) |
| Power profile | MAXN_SUPER (nvpmodel mode 0) |
| jetson_clocks | locked (all clocks at max) |
| CPU governor | performance |
| Isolated cores | 1-5 (isolcpus=1-5 nohz_full=1-5 rcu_nocbs=1-5) |
| Concurrent load | none (gate runs standalone) |
| cyclictest flags | -m -p 99 -t 1 -a 1 -n -D 10s --quiet |
| Reported metric | Max: latency on core 1 |
| Pass threshold | Max < 100 microseconds |
| Duration | 10 seconds |
| Ambient | ~25 C lab bench, natural convection |
The power profile is MAXN_SUPER by default: scripts/04_flash_nvme.sh sets the nvpmodel_p3767_0000_super.conf as the default profile, and scripts/jetson_rt_tune.sh reasserts nvpmodel -m 0 at boot (NVPMODEL_MODE=0, overridable via power.conf). With the SUPER table installed, the live-verified mode IDs are: 0 = MAXN_SUPER, 1 = 10W, 2 = 15W, 3 = 25W, 4 = 40W. Confirm on the board with nvpmodel -q.
Deployment-grade verification
The 10-second gate proves the build is sane. It does not prove the build is mission-ready. Before relying on the kernel under a real workload, run a multi-core, long-duration sweep with the mission load present.
Run a wider sweep across the whole isolation set:
# All isolated cores, 30 minutes, JSON + histogram for later analysis.
# --histofall takes a bucket ceiling in microseconds; 400 is an assumed starting point.
sudo cyclictest --smp --mlockall --priority=99 --affinity=1-5 \
--threads --interval=200 --duration=30m \
--histofall=400 --json=/tmp/ct.json
Generate concurrent load while the sweep runs so the numbers reflect contention, not an idle board:
# Inference + capture + CPU pressure. See scripts/run_sustained_load.sh.
./scripts/run_sustained_load.sh 1800
# In parallel, CPU/memory pressure on the housekeeping core only:
sudo stress-ng --cpu 1 --taskset 0 --vm 1 --vm-bytes 1G --timeout 1800s
run_sustained_load.sh drives inference and camera capture, so it needs a provisioned /opt/av-env and an attached ZED X camera. Both are live-verified on the reference device (2026-06-11: 29.5 FPS stereo capture, Voyager inference.py at 49.2 FPS end-to-end). On a fresh flash where first-boot provisioning has not yet completed online, or with no camera attached, generate load with stress-ng alone. For a heavier, reproducible mission load, the C++ fusion sample sustains 46 to 53 FPS with all features live using --depth-every; see BENCHMARKS.md for the harness and ZEDX_METIS_CPP.md for the samples.
Sign-off criteria for a deployment build:
- The 10-second gate passes (Max < 100 microseconds on core 1).
- No recurring high-latency outliers after a 30-minute sweep at operating temperature with the mission load present.
- rtla timerlat and rtla osnoise traces show no remaining major kernel-induced latency source.
The long tail (p99.9 and beyond) is always higher than the gate’s single-thread maximum, especially under thermal load. Characterize it; do not assume it.
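One way to put a number on the tail is to read a percentile off the cyclictest histogram. The sketch below is illustrative only: hist_percentile is a hypothetical name, and it assumes histogram rows of the form "latency_us count [count ...]" with the last column holding the count to use (with --histofall that column is the all-thread summary).

```shell
#!/bin/sh
# Hypothetical helper: estimate a latency percentile from cyclictest
# histogram text on stdin. Prints the first latency bucket (in
# microseconds) at which the cumulative count reaches the percentile.
hist_percentile() {
    pct="$1"
    awk -v pct="$pct" '
        # Keep only rows that start with a numeric bucket; this skips
        # "#" comment lines and the per-thread T: summary lines.
        $1 !~ /^[0-9]+$/ || NF < 2 { next }
        {
            n++
            bucket[n] = $1 + 0
            count[n]  = $NF + 0
            total    += count[n]
        }
        END {
            if (total == 0) { print "no histogram rows found"; exit 2 }
            need = total * pct / 100
            for (i = 1; i <= n; i++) {
                seen += count[i]
                if (seen >= need) { print bucket[i]; exit 0 }
            }
        }'
}

# Usage, assuming the sweep output was saved to a file:
#   hist_percentile 99.9 < /tmp/ct_hist.txt
```

Samples above the histogram ceiling are reported by cyclictest as overflows rather than as rows, so a result from this sketch understates the tail whenever overflows are present. The file name in the usage line is a placeholder.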
Tuning strategy
The kernel ships from scripts/02_build_kernel.sh already configured for RT. This loop is for narrowing residual jitter on a specific workload.
- Baseline: run the gate and the wider sweep, save the histograms.
- Trace: run rtla timerlat and rtla osnoise to find the dominant latency sources.
- Change one thing: a boot arg, an IRQ affinity, a governor, or a CONFIG_* value.
- Re-measure against the same baseline.
- Iterate until the criteria above hold. Remove any diagnostic tracing config from the final defconfig to drop its overhead.
Kernel config
These flags are appended to arch/arm64/configs/defconfig by scripts/01_extract_and_patch.sh. KERNEL_OPTIMIZATIONS.md documents the full set with rationale; the latency-relevant subset is:
- CONFIG_PREEMPT_RT=y: the core RT support. Threads IRQ handlers and makes spinlocks sleepable.
- CONFIG_NO_HZ_FULL=y: full dynticks on isolated CPUs, so the timer tick stops firing on cores 1-5.
- CONFIG_HIGH_RES_TIMERS=y and CONFIG_HZ_1000=y: a 1 ms timer base with high-resolution timers.
- CONFIG_RCU_NOCB_CPU=y: lets rcu_nocbs move RCU callbacks off the isolated cores.
- CONFIG_CPU_ISOLATION=y: kernel-level enforcement that the scheduler skips isolated cores.
Diagnostic tracers (CONFIG_OSNOISE_TRACER, CONFIG_TIMERLAT_TRACER, CONFIG_HWLAT_TRACER, and the broader CONFIG_FTRACE family) add runtime overhead. Enable them in a measurement build only, then drop them from the production defconfig.
Verify the RT flags landed in the extracted tree:
DEFCONFIG=latest_jetson/Linux_for_Tegra/source/kernel/kernel-jammy-src/arch/arm64/configs/defconfig
grep -E "PREEMPT_RT|NO_HZ_FULL|RCU_NOCB|HZ_1000|TIMERLAT|OSNOISE" "$DEFCONFIG"
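The grep above prints whatever matches but does not fail when a flag is absent. A stricter check could look like the sketch below; check_config_flags is a hypothetical helper, not one of the project scripts.

```shell
#!/bin/sh
# Hypothetical helper: report which CONFIG_<NAME>=y lines are present in a
# config file. Returns non-zero if any requested flag is missing.
check_config_flags() {
    config_file="$1"
    shift
    missing=0
    for flag in "$@"; do
        # -x matches the whole line, so "# CONFIG_X is not set" never passes.
        if grep -qx "CONFIG_${flag}=y" "$config_file"; then
            echo "ok       CONFIG_${flag}"
        else
            echo "MISSING  CONFIG_${flag}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the DEFCONFIG path from the text:
#   check_config_flags "$DEFCONFIG" PREEMPT_RT NO_HZ_FULL HIGH_RES_TIMERS \
#       HZ_1000 RCU_NOCB_CPU CPU_ISOLATION
```

This only confirms that the line exists in the file. Because the flags are appended, it says nothing about an earlier conflicting line, so the built kernel's final config remains the authoritative check.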
Boot arguments
scripts/03_bake_rootfs.sh injects the RT boot args into the device extlinux on the rootfs. The injected line includes:
root=/dev/nvme0n1p1 rootwait rootfstype=ext4 efi=noruntime pcie_aspm=off \
nohz_full=1-5 isolcpus=1-5 rcu_nocbs=1-5 irqaffinity=0
- isolcpus=1-5 nohz_full=1-5 rcu_nocbs=1-5: dedicate cores 1-5 to RT tasks and keep tick, scheduler, and RCU-callback noise off them.
- irqaffinity=0: default all IRQ affinity to core 0 so device interrupts do not land on the isolated set.
- No cma= argument, deliberately. An earlier revision passed cma=2048M as an “optimization”; on the Orin NX it failed to reserve (“cma: Failed to reserve 2048 MiB”), and because a cmdline cma= bypasses the device tree linux,cma pool, the board ran with zero CMA. nvgpu requires a 64MB physically contiguous comptag allocation at GPU poweron (ga10b_cbc_alloc_comptags); with zero CMA that allocation fails (“DMA alloc FAILED: [sysmem] size=64225280 … PHYSICALLY_ADDRESSED”) and cascades into “Unable to recover GR falcon”, a FECS context switch init error, no CUDA, no GPU devfreq, and nvpmodel unable to set any power mode (it reads the GPU frequency table). With no cma= on the cmdline, the stock-proven 256MB device tree pool takes over. Verified after the fix: CmaTotal: 262144 kB, zero nvgpu errors across the boot, GPU devfreq present, GPU at 918MHz.
- root=/dev/nvme0n1p1: explicit NVMe root. On an NVMe-only Orin NX this is mandatory; see the troubleshooting note below.
The isolation range and IRQ affinity are single-sourced in versions.env. Do not hand-edit them in extlinux; change them at the source and re-bake. CMA sizing belongs to the device tree, not the cmdline; do not reintroduce a cma= boot arg.
Confirm on the running board:
cat /proc/cmdline # expect: no cma= argument
cat /sys/devices/system/cpu/isolated # expect: 1-5
grep CmaTotal /proc/meminfo # expect: 262144 kB (device tree pool)
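The same expectations can be checked mechanically. The function below is a sketch under stated assumptions: check_cmdline is a hypothetical name, the isolation range is passed in by the caller, and the housekeeping core is fixed at 0 as in this build.

```shell
#!/bin/sh
# Hypothetical helper: verify that a kernel command line carries the RT
# isolation arguments for the given core range and no cma= argument.
# Prints one line per problem; returns non-zero if any were found.
check_cmdline() {
    cmdline="$1"
    cores="$2"
    rc=0
    for arg in "isolcpus=$cores" "nohz_full=$cores" "rcu_nocbs=$cores" "irqaffinity=0"; do
        # Pad with spaces so only whole arguments match.
        case " $cmdline " in
            *" $arg "*) ;;
            *) echo "missing: $arg"; rc=1 ;;
        esac
    done
    case " $cmdline " in
        *" cma="*) echo "unexpected: cma= argument present"; rc=1 ;;
    esac
    return "$rc"
}

# Usage on the board:
#   check_cmdline "$(cat /proc/cmdline)" 1-5
```

Passing the command line as a string rather than reading /proc/cmdline inside the function keeps it usable against an extlinux.conf APPEND line on the build host as well.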
Runtime knobs
The boot-time service scripts/jetson_rt_tune.sh (installed as jetson-rt-tune.service) applies these on every boot. Apply them by hand when tuning interactively.
Power profile and clocks:
sudo nvpmodel -m 0 # MAXN_SUPER (SUPER table: 0=MAXN_SUPER, 1=10W, 2=15W, 3=25W, 4=40W)
sudo jetson_clocks # lock all clocks to max
CPU governor (performance on every core):
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
echo performance | sudo tee "$cpu/cpufreq/scaling_governor" > /dev/null
done
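To confirm the loop took effect on every core rather than only cpu0, a read-back along these lines may help. check_governors is a hypothetical helper; the sysfs root is a parameter only so the function can be exercised against a fake tree.

```shell
#!/bin/sh
# Hypothetical helper: list any CPU whose cpufreq governor differs from the
# expected one. Returns non-zero if at least one CPU is off-target.
check_governors() {
    want="${1:-performance}"
    cpu_root="${2:-/sys/devices/system/cpu}"
    rc=0
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        got=$(cat "$f")
        if [ "$got" != "$want" ]; then
            cpu=${f%/cpufreq/scaling_governor}
            echo "${cpu##*/}: $got (expected $want)"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on the board:
#   check_governors performance && echo "all cores on performance"
```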
Disable device autosuspend for interrupt sources that would otherwise wake and add latency:
# Example: disable USB autosuspend globally.
echo -1 | sudo tee /sys/module/usbcore/parameters/autosuspend
CPU isolation and IRQ affinity
isolcpus keeps the scheduler off cores 1-5, and irqaffinity=0 defaults interrupts to core 0. Some drivers re-pin their own IRQs after they load, so verify and correct device IRQs once the Metis and ZED X drivers are up.
# Inspect where each IRQ is currently allowed to run.
grep . /proc/irq/*/smp_affinity_list
# Force a specific IRQ onto core 0 (mask 0x1).
echo 1 | sudo tee /proc/irq/<IRQ_NUMBER>/smp_affinity
Move any stray kernel threads off the isolated cores with tuna or taskset. The Metis NPU enumerates on PCIe at 0004:01:00.0; after its driver binds, confirm its IRQ is not landing on the isolated set.
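Scanning smp_affinity_list by eye gets tedious once the drivers have registered their IRQs. The sketch below flags every IRQ whose allowed CPU list overlaps the isolated set. Both function names are hypothetical, seq from coreutils is assumed to be available, and the /proc/irq root is a parameter only so the logic can be tried against a fake tree.

```shell
#!/bin/sh
# Hypothetical helpers for auditing IRQ affinity against the isolated set.

# Expand a kernel CPU list such as "0,2-4" into "0 2 3 4 ".
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue
        seq "$lo" "${hi:-$lo}"
    done | tr '\n' ' '
}

# Print "<irq> <affinity list>" for each IRQ allowed on an isolated core.
irqs_on_isolated() {
    isolated=" $(expand_cpulist "$1")"
    irq_root="${2:-/proc/irq}"
    for f in "$irq_root"/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        list=$(cat "$f")
        for cpu in $(expand_cpulist "$list"); do
            case "$isolated" in
                *" $cpu "*)
                    irq=${f%/smp_affinity_list}
                    echo "${irq##*/} $list"
                    break ;;
            esac
        done
    done
    return 0
}

# Usage on the board:
#   irqs_on_isolated "$(cat /sys/devices/system/cpu/isolated)"
```

An IRQ listed here is only permitted on an isolated core; whether it actually fires there shows up in the per-CPU columns of /proc/interrupts.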
Trace and diagnose
Install the test and tracing tools on the target (or in the chroot while baking):
sudo apt update
sudo apt install -y rt-tests rtla trace-cmd linux-tools-generic sysstat stress-ng
Run rtla timerlat alongside a cyclictest sweep to capture a kernel stack trace whenever latency exceeds a threshold. Pick the threshold from the cyclictest maximum:
# Stop tracing automatically on the first sample over 250 microseconds.
sudo rtla timerlat top --cpus 1-5 --auto 250
rtla osnoise attributes time lost to OS noise sources (IRQs, NMIs, softirqs, kernel threads); the kernel’s hwlat tracer (CONFIG_HWLAT_TRACER) isolates hardware-induced latency. Use both to separate software causes from firmware or SMI-style stalls.
Common causes and fixes
| Symptom in trace | Likely cause | Fix |
|---|---|---|
| od_dbs_update, dev_pm_opp_set_rate | ondemand governor callbacks | force the performance governor; confirm ondemand is not compiled in |
| kworker spikes on cores 1-5 | workqueue work scheduled on isolated cores | pin the workqueue, or move its trigger off the isolated set |
| Long IRQ handler on an isolated core | a driver re-pinned its IRQ | re-set smp_affinity to core 0; rely on threaded IRQs |
| Periodic spikes tied to frequency changes | regulator or OPP transitions | lock clocks with jetson_clocks; pin OPPs for the RT path |
Verify
After any change, re-run the gate and confirm the tuning state. The fastest check is the host-side gate:
make verify # SSH to the board, run verify_tuning.sh, enforce the gate
On the board directly:
uname -v | grep -Eq 'PREEMPT[_ ]RT' && echo RT-OK # PREEMPT_RT kernel (5.15 reports PREEMPT_RT with an underscore)
cat /sys/kernel/realtime # expect: 1
cat /sys/devices/system/cpu/isolated # expect: 1-5
nvpmodel -q # expect: MAXN_SUPER
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # expect: performance
sudo cyclictest -m -p 99 -t 1 -a 1 -n -D 10s --quiet # Max < 100 us
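The "expect:" comments can be turned into explicit pass/fail lines with a small wrapper. This is a sketch only; expect and checks_failed are hypothetical names and are not part of verify_tuning.sh.

```shell
#!/bin/sh
# Hypothetical wrapper: compare an observed value with the expected one,
# print a PASS or FAIL line, and remember whether anything failed.
checks_failed=0
expect() {   # expect LABEL EXPECTED ACTUAL
    if [ "$2" = "$3" ]; then
        echo "PASS  $1"
    else
        echo "FAIL  $1: expected '$2', got '$3'"
        checks_failed=1
    fi
}

# Usage on the board:
#   expect realtime 1           "$(cat /sys/kernel/realtime)"
#   expect isolated 1-5         "$(cat /sys/devices/system/cpu/isolated)"
#   expect governor performance "$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
#   exit "$checks_failed"
```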
Troubleshoot
- Blank HDMI from “Exiting boot services” to the desktop. Normal on Orin. FRAMEBUFFER_CONSOLE is off, and earlycon=efifb does not help because the GOP framebuffer is torn down at ExitBootServices. The reliable “did it boot” signal is the USB device-mode gadget at 192.168.55.1 (and the ttyACM console) once the board reaches multi-user.
- Board hangs at boot waiting for a root device. The eMMC default root=/dev/mmcblk0p1 does not exist on an NVMe-only Orin NX, so rootwait blocks forever. The bake step writes root=/dev/nvme0n1p1 explicitly; confirm it is present in /proc/cmdline.
- Root will not mount on the RT kernel. The initrd carries its own copy of the early-boot modules. On a PREEMPT_RT kernel those must share the kernel’s preempt_rt vermagic, or the early-boot drivers must be built in. NVMe, PCIe, and PHY are built in for exactly this reason, so no nvme.ko is needed in the initrd.
- The gate fails intermittently under thermal load. The 10-second gate runs standalone; a board that passes idle can still throttle under the mission load. Re-check with nvpmodel -q and the thermal cooling-device states, then run the 30-minute sweep with load before trusting the build.
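For the thermal case, the zone temperatures and cooling-device states can be read from the standard Linux thermal sysfs class. The sketch below assumes that layout (thermal_zone*/temp in millidegrees Celsius, cooling_device*/cur_state and max_state); thermal_snapshot is a hypothetical name, and the zone and device names on a given board will differ.

```shell
#!/bin/sh
# Hypothetical helper: print each thermal zone temperature and each
# cooling-device state. The root is a parameter so the function can be
# tried against a fake tree; on the board the default applies.
thermal_snapshot() {
    thermal_root="${1:-/sys/class/thermal}"
    for z in "$thermal_root"/thermal_zone*; do
        # Some zones refuse reads while offline; skip those quietly.
        t=$(cat "$z/temp" 2>/dev/null) || continue
        name=$(cat "$z/type")
        awk -v name="$name" -v t="$t" 'BEGIN { printf "%s %.1f C\n", name, t / 1000 }'
    done
    for c in "$thermal_root"/cooling_device*; do
        [ -r "$c/cur_state" ] || continue
        echo "$(cat "$c/type") state $(cat "$c/cur_state")/$(cat "$c/max_state")"
    done
    return 0
}

# Usage on the board, sampled once per second during a loaded sweep:
#   while sleep 1; do thermal_snapshot; done
```

A cooling device whose state climbs during the sweep is a hint that throttling, rather than the kernel, is behind the intermittent failures.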
References
- NVIDIA Jetson Linux Developer Guide R36.4.3: nvpmodel and jetson_clocks usage, and the kernel and flash workflow.
- The Good Penguin: practical rtla timerlat and cyclictest recipes for ARM and embedded targets.
- Project docs: KERNEL_OPTIMIZATIONS.md for every CONFIG_* flag, KERNEL_PATCHES.md for the source-tree patches, FINE_TUNING.md for the boot-time tuning service, BENCHMARKS.md for the measured camera-plus-NPU throughput, TROUBLESHOOTING.md for the full boot and flash failure catalog.