ZED X + Metis benchmarks (Orin NX 16GB)
Real, reproducible numbers for the C++ samples in examples/zedx_metis_cpp/. Every row below was measured headless on the reference device, with tegrastats sampled during each run. Regenerate the whole dataset and charts on any device with:
# build samples, stop the ROS mission so the camera/Metis are free, then:
bash scripts/bench_zedx_metis.sh # -> docs/assets/benchmarks/zedx_metis_bench.csv
/opt/av-env/bin/python scripts/plot_zedx_metis_bench.py # -> the PNGs below
See ZED X + Metis C++ for the architecture and Samples & Tests for run commands.
Method
- Device: Jetson Orin NX 16GB, L4T R36.4.3 / JetPack 6.2, MAXN_SUPER. Axelera Metis M.2 (PCIe), Stereolabs ZED X (GMSL2).
- Samples: zedx_metis_infer (detector only: grab, letterbox+quantize, Metis, decode+NMS) and zedx_metis_fusion (adds NEURAL_LIGHT stereo depth, skeleton/body tracking, IMU-fused pose, per-object distance + world position, IoU tracking).
- Measure: headless, ~14 s measured per config in this dataset (the harness passes --seconds $SECS, default 15, configurable). Display/encode cost is excluded; this is the raw compute ceiling. fps is end-to-end (camera grab through decode/fusion). GPU %, power, and temperature are tegrastats averages/peaks over the run.
- Models: yolov5s-v7-coco and yolov8s-coco run batch-4 (one frame per AIPU core); yolov8l-coco is large enough that it runs single-core batch-1. See the aipu-cores explanation.
- Some run-to-run variance (a few fps) is normal: depth cost depends on scene texture, and the board shares one iGPU.
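The GPU, power, and temperature columns are reductions over tegrastats samples. A minimal sketch of that kind of reduction follows. It is not the harness's actual parser: `summarize_tegrastats` is a hypothetical helper, the field formats (`GR3D_FREQ <n>%`, `VDD_IN <mW>mW/...`, `tj@<t>C`) are assumptions that vary across L4T releases, and averaging power while taking the peak of Tj is also an assumption.

```python
import re
from statistics import mean

# Assumed field formats; check a real tegrastats line on your L4T release.
GR3D_RE = re.compile(r"GR3D_FREQ (\d+)%")
POWER_RE = re.compile(r"VDD_IN (\d+)mW")
TJ_RE = re.compile(r"tj@([\d.]+)C")


def summarize_tegrastats(lines):
    """Reduce tegrastats lines to GPU avg/peak %, mean power (W), peak Tj (C)."""
    gpu, power_mw, tj = [], [], []
    for line in lines:
        if (m := GR3D_RE.search(line)):
            gpu.append(int(m.group(1)))
        if (m := POWER_RE.search(line)):
            power_mw.append(int(m.group(1)))
        if (m := TJ_RE.search(line)):
            tj.append(float(m.group(1)))
    if not gpu:
        raise ValueError("no GR3D_FREQ samples found")
    return {
        "gpu_avg": mean(gpu),
        "gpu_peak": max(gpu),
        "power_w": mean(power_mw) / 1000 if power_mw else None,
        "tj_peak": max(tj) if tj else None,
    }


# Synthetic lines for illustration only, not measured data.
sample = [
    "RAM 5000/15656MB GR3D_FREQ 20% tj@60.0C VDD_IN 14000mW/14000mW",
    "RAM 5000/15656MB GR3D_FREQ 60% tj@61.0C VDD_IN 15000mW/14500mW",
]
summary = summarize_tegrastats(sample)
print(summary)
```

Short runs yield few samples, so the peak values in particular are sensitive to the sampling interval.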
Throughput by configuration

| sample | model | mode | depth-every | bodies | FPS | GPU avg/peak (%) | power (W) | Tj (°C) |
|---|---|---|---|---|---|---|---|---|
| infer | yolov5s-v7-coco | HD1200@60 | - | 1 | 56.3 | 24/59 | 14.1 | 60 |
| infer | yolov8s-coco | HD1200@60 | - | 1 | 56.4 | 24/58 | 14.0 | 60 |
| infer | yolov8s-coco | SVGA@120 | - | 1 | 92.1 | 20/60 | 13.6 | 60 |
| infer | yolov8l-coco | HD1200@60 | - | 1 | 36.5 | 24/52 | 13.5 | 60 |
| fusion | yolov8s-coco | HD1200@60 | 1 | 1 | 34.6 | 46/98 | 17.3 | 61 |
| fusion | yolov8s-coco | HD1200@60 | 3 | 1 | 45.8 | 34/86 | 15.7 | 61 |
| fusion | yolov8s-coco | HD1200@60 | 6 | 1 | 53.3 | 28/93 | 15.2 | 61 |
| fusion | yolov8s-coco | HD1200@60 | 3 | 0 | 44.2 | 28/77 | 15.0 | 61 |
| fusion | yolov8s-coco | SVGA@60 | 3 | 1 | 34.7 | 24/69 | 14.2 | 61 |
| fusion | yolov8l-coco | HD1200@60 | 3 | 1 | 38.8 | 32/80 | 15.2 | 61 |
| fusion | yolov8l-coco | SVGA@60 | 3 | 1 | 37.3 | 29/85 | 14.2 | 61 |
Decoupling depth cadence (--depth-every)
The single biggest tuning lever for the full-fusion app. Stereo depth (computed inside grab()) and the skeleton net are the only heavy iGPU consumers; detection runs on the Metis NPU and is nearly free. Running depth + skeleton every Nth frame, while detection/display/pose run every frame, lifts throughput with no feature lost (the IoU tracker carries distance/velocity/time-to-collision forward; depth and skeletons just refresh at rate/N).

yolov8s fusion at HD1200@60: 35 fps (depth every frame) to 53 fps (every 6th), all features live. The live display tuning session recorded the same lever at 45 → 57 fps (--depth-every 1 → 3); depth cost is scene-dependent and this dataset was recorded in short ~14 s windows, so a few fps of offset between sessions is expected. The trend, not the exact row, is the result. Flag semantics: ZED X + Metis C++.
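The cadence logic can be sketched as follows. This is an illustrative Python model, not the C++ sample's code: `run_pipeline`, `detect`, and `measure_depth` are hypothetical stand-ins, and the real app carries values forward through its IoU tracker rather than a single variable.

```python
def run_pipeline(num_frames, depth_every, detect, measure_depth):
    """Run detection every frame; refresh depth only every Nth frame.

    Between refreshes the last measured distance is carried forward, which
    stands in for the tracker holding per-object state.
    """
    if depth_every < 1:
        raise ValueError("depth_every must be >= 1")
    last_distance = None
    depth_calls = 0
    outputs = []
    for frame_idx in range(num_frames):
        boxes = detect(frame_idx)              # cheap: runs every frame
        if frame_idx % depth_every == 0:       # heavy: runs every Nth frame
            last_distance = measure_depth(frame_idx)
            depth_calls += 1
        outputs.append((frame_idx, boxes, last_distance))
    return outputs, depth_calls


# Toy stand-ins so the sketch runs without a camera.
outputs, depth_calls = run_pipeline(
    num_frames=6,
    depth_every=3,
    detect=lambda i: [f"box@{i}"],
    measure_depth=lambda i: 10.0 - i,
)
for row in outputs:
    print(row)
```

In this toy run depth is evaluated on frames 0 and 3 only, while every frame still reports boxes and the most recent distance.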
GPU load and power

The detector-only path leaves the iGPU mostly idle (20 to 24% average, the ZED rectification) because detection runs on the Metis NPU. Full fusion drives the iGPU hard: with depth on every frame, GR3D peaks at 98% and power reaches 17.3 W. Raising --depth-every lowers average GPU load (46% to 28%) and power (17.3 W to 15.2 W), although the GR3D peak stays high. Tj stayed at 60 to 61 C throughout, with no throttling; the Orin NX throttles at a far higher temperature.
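One way to read power and throughput together is energy per frame, that is, power divided by fps. The sketch below applies that to three yolov8s HD1200@60 rows from the throughput table. The metric is an illustration added here, not a column in the dataset, and `joules_per_frame` is a hypothetical helper.

```python
# (label, fps, power in W) copied from the throughput table.
rows = [
    ("infer", 56.4, 14.0),
    ("fusion, depth-every 1", 34.6, 17.3),
    ("fusion, depth-every 6", 53.3, 15.2),
]


def joules_per_frame(fps, power_w):
    """Average energy per processed frame: watts / (frames per second)."""
    return power_w / fps


energy = {label: joules_per_frame(fps, w) for label, fps, w in rows}
for label, j in energy.items():
    print(f"{label:24s} {j * 1000:6.0f} mJ/frame")
```

On these rows, depth on every frame costs roughly twice the energy per frame of the detector-only path, and --depth-every 6 recovers most of that gap.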
What the data says
- Detection is nearly free. Detector-only yolov8s nearly saturates the 60 fps camera at HD1200 (56 fps) and reaches 92 fps at SVGA@120, at 20 to 24% average GPU. The Metis carries the detection load; the GPU is mostly idle. yolov8l (single-core) is the exception at ~37 fps, NPU-inference-bound.
- Depth is the cost, not detection. Adding NEURAL_LIGHT depth + skeleton is what pulls the GPU to saturation. --depth-every is how you buy throughput back without dropping features.
- Best “all features, fast”: yolov8s with --depth-every 3 to 6 gives 46 to 53 fps with depth, skeleton, pose, world-frame, and tracking all live.
- Most accurate: yolov8l caps near 37 to 39 fps regardless of cadence, because its single-core Metis inference (not depth) is the bottleneck there.
- Skeleton cost: at N=3, --no-bodies vs bodies is ~44 vs ~46 fps, a difference within run-to-run variance, but with lower GPU load (28% vs 34% average, 77% vs 86% peak). The body net is a real but modest GPU consumer.
- Power: 13.5 to 14.1 W detector-only, up to 17.3 W for fusion with depth on every frame. All within the module’s MAXN_SUPER envelope.
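The "depth is the cost" reading can be checked by converting fps to milliseconds per frame. A rough sketch using the yolov8s HD1200@60 rows from the throughput table follows. Treating the difference from the detector-only row as fusion overhead is a simplification added here, since pipeline stages can overlap and runs vary by a few fps.

```python
# fps for yolov8s at HD1200@60, copied from the throughput table.
INFER_FPS = 56.4
FUSION_FPS = {1: 34.6, 3: 45.8, 6: 53.3}   # keyed by --depth-every


def frame_ms(fps):
    """Mean time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


base_ms = frame_ms(INFER_FPS)
overhead_ms = {n: frame_ms(fps) - base_ms for n, fps in FUSION_FPS.items()}
for n, extra in overhead_ms.items():
    print(f"depth-every {n}: {frame_ms(FUSION_FPS[n]):5.1f} ms/frame, "
          f"+{extra:4.1f} ms over detector-only ({base_ms:.1f} ms)")
```

Under that simplification, the added time per frame falls from about 11 ms at N=1 to about 4 ms at N=3 and about 1 ms at N=6.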
Raw data: docs/assets/benchmarks/zedx_metis_bench.csv.