ZED X + Metis benchmarks (Orin NX 16GB)

Real, reproducible numbers for the C++ samples in examples/zedx_metis_cpp/. Every row below was measured headless on the reference device, with tegrastats sampled during each run. Regenerate the whole dataset and charts on any device with:

# build samples, stop the ROS mission so the camera/Metis are free, then:
bash scripts/bench_zedx_metis.sh                       # -> docs/assets/benchmarks/zedx_metis_bench.csv
/opt/av-env/bin/python scripts/plot_zedx_metis_bench.py  # -> the PNGs below

See ZED X + Metis C++ for the architecture and Samples & Tests for run commands.

Method

  • Device: Jetson Orin NX 16GB, L4T R36.4.3 / JetPack 6.2, MAXN_SUPER. Axelera Metis M.2 (PCIe), Stereolabs ZED X (GMSL2).
  • Samples: zedx_metis_infer (detector only: grab, letterbox+quantize, Metis, decode+NMS) and zedx_metis_fusion (adds NEURAL_LIGHT stereo depth, skeleton/body tracking, IMU-fused pose, per-object distance + world position, IoU tracking).
  • Measure: headless, ~14 s measured per config in this dataset (the harness passes --seconds $SECS; SECS defaults to 15 and is configurable). Display/encode cost is excluded, so these numbers are the raw compute ceiling. fps is end-to-end (camera grab through decode/fusion). GPU %, power, and temperature are tegrastats averages/peaks over the run.
  • Models: yolov5s-v7-coco and yolov8s-coco run batch-4 (one frame per AIPU core); yolov8l-coco is large enough that it runs single-core batch-1. See the aipu-cores explanation.
  • Some run-to-run variance (a few fps) is normal: depth cost depends on scene texture, and the board shares one iGPU.
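The tegrastats averages and peaks described above could be reduced from a log along these lines. This is a minimal sketch, not the harness's actual code: the sample log lines are illustrative values (not recorded data), `summarize` is a made-up helper name, and the patterns assume the log carries `GR3D_FREQ <n>%`, `tj@<t>C`, and `VDD_IN <now>mW/<avg>mW` fields, which can differ between L4T releases.

```python
import re
from statistics import mean

# Field patterns are assumptions about the tegrastats line format.
GPU_RE = re.compile(r"GR3D_FREQ (\d+)%")
TJ_RE = re.compile(r"tj@([\d.]+)C")
VDD_IN_RE = re.compile(r"VDD_IN (\d+)mW")  # first value: instantaneous draw

def summarize(lines):
    """Return GPU avg/peak (%), mean input power (W), and peak tj (C)."""
    gpu, power_mw, tj = [], [], []
    for line in lines:
        g = GPU_RE.search(line)
        p = VDD_IN_RE.search(line)
        t = TJ_RE.search(line)
        if g:
            gpu.append(int(g.group(1)))
        if p:
            power_mw.append(int(p.group(1)))
        if t:
            tj.append(float(t.group(1)))
    return {
        "gpu_avg": mean(gpu),
        "gpu_peak": max(gpu),
        "power_w": mean(power_mw) / 1000.0,
        "tj_peak": max(tj),
    }

# Illustrative lines only (hypothetical values, trimmed to the fields used).
sample_log = [
    "RAM 5120/15656MB GR3D_FREQ 20% tj@59.5C VDD_IN 13500mW/13400mW",
    "RAM 5121/15656MB GR3D_FREQ 60% tj@60.5C VDD_IN 14500mW/13900mW",
]
stats = summarize(sample_log)
print(stats)
```

A real log has one line per sampling interval, so the same reduction yields the "avg/peak" pairs shown in the table; an empty log would raise an error here, which a production script would need to handle.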

Throughput by configuration

Figure: FPS by configuration

sample  model            mode       depth-every  bodies  FPS   GPU avg/peak (%)  power (W)  Tj (C)
infer   yolov5s-v7-coco  HD1200@60  -            1       56.3  24/59             14.1       60
infer   yolov8s-coco     HD1200@60  -            1       56.4  24/58             14.0       60
infer   yolov8s-coco     SVGA@120   -            1       92.1  20/60             13.6       60
infer   yolov8l-coco     HD1200@60  -            1       36.5  24/52             13.5       60
fusion  yolov8s-coco     HD1200@60  1            1       34.6  46/98             17.3       61
fusion  yolov8s-coco     HD1200@60  3            1       45.8  34/86             15.7       61
fusion  yolov8s-coco     HD1200@60  6            1       53.3  28/93             15.2       61
fusion  yolov8s-coco     HD1200@60  3            0       44.2  28/77             15.0       61
fusion  yolov8s-coco     SVGA@60    3            1       34.7  24/69             14.2       61
fusion  yolov8l-coco     HD1200@60  3            1       38.8  32/80             15.2       61
fusion  yolov8l-coco     SVGA@60    3            1       37.3  29/85             14.2       61

Decoupling depth cadence (--depth-every)

--depth-every is the single biggest tuning lever for the full-fusion app. Stereo depth (computed inside grab()) and the skeleton net are the only heavy iGPU consumers; detection runs on the Metis NPU and is nearly free. Running depth + skeleton only every Nth frame, while detection, display, and pose run every frame, lifts throughput with no feature lost: the IoU tracker carries distance, velocity, and time-to-collision forward, and depth and skeletons simply refresh at 1/N of the frame rate.
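The cadence scheme can be sketched as a loop in which detection runs on every frame and the heavy depth + skeleton stage runs only on every Nth frame, with the last result reused in between. This is a hypothetical illustration in Python, not the sample's C++ code: `run_pipeline`, `fake_detect`, and `fake_depth` are made-up names, and the stubs stand in for the real Metis and ZED calls.

```python
def run_pipeline(num_frames, depth_every, detect, heavy_depth):
    """Run detect() every frame and heavy_depth() every `depth_every`-th frame.

    Between refreshes the most recent depth result is reused, which loosely
    mirrors how a tracker carries per-object distance forward.
    """
    if depth_every < 1:
        raise ValueError("depth_every must be >= 1")
    last_depth = None
    results = []
    for i in range(num_frames):
        boxes = detect(i)                 # cheap stage: runs every frame
        if i % depth_every == 0:
            last_depth = heavy_depth(i)   # expensive stage: every Nth frame
        results.append((i, boxes, last_depth))
    return results

# Stubs that record which frames paid for the heavy stage.
depth_calls = []

def fake_detect(i):
    return f"boxes@{i}"

def fake_depth(i):
    depth_calls.append(i)
    return f"depth@{i}"

out = run_pipeline(7, depth_every=3, detect=fake_detect, heavy_depth=fake_depth)
print(depth_calls)  # frames on which depth was refreshed
print(out[4])       # frame 4 reuses the depth computed on frame 3
```

With N=3 over seven frames, the heavy stage runs on frames 0, 3, and 6 only, while every frame still gets fresh detections.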

Figure: Depth cadence sweep

yolov8s fusion at HD1200@60 rises from 35 fps (depth every frame) to 53 fps (every 6th frame), with all features live. The live display tuning session recorded the same lever at 45 → 57 fps (--depth-every 1 → 3); depth cost is scene-dependent and this dataset was recorded in short ~14 s windows, so an offset of a few fps between sessions is expected. The trend, not the exact row, is the result. Flag semantics: ZED X + Metis C++.
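To see what the sweep means per frame, the fps figures from the table can be converted to frame times and relative gains. The sketch below uses only the three yolov8s fusion HD1200@60 rows with bodies enabled; `frame_stats` is an illustrative helper name, not part of the repo.

```python
# (depth-every, fps) for yolov8s fusion at HD1200@60, bodies on, from the table.
sweep = [(1, 34.6), (3, 45.8), (6, 53.3)]

def frame_stats(rows):
    """Return (N, frame time in ms, speedup vs. the first row) for each row."""
    base_fps = rows[0][1]
    return [(n, 1000.0 / fps, fps / base_fps) for n, fps in rows]

for n, ms, speedup in frame_stats(sweep):
    print(f"--depth-every {n}: {ms:.1f} ms/frame, {speedup:.2f}x vs every frame")
```

This works out to roughly 28.9, 21.8, and 18.8 ms per frame, or about 1.32x and 1.54x the depth-every-frame throughput. Given the run-to-run variance noted in the method, the second decimal place should not be over-read.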

GPU load and power

Figure: GPU and power

The detector-only path leaves the iGPU almost idle (20 to 24% average, the ZED rectification) because detection is on the Metis NPU. Full fusion drives the iGPU hard: depth-every-frame peaks GR3D at 98% and pulls 17.3 W. Raising --depth-every lowers average GPU load (46% → 34% → 28% at N = 1, 3, 6) and power (17.3 → 15.7 → 15.2 W), although the GR3D peak remains high (86% and 93%). Thermals stayed at ~60 to 61 C throughout (no throttling; the Orin NX throttles at a much higher temperature).
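One way to combine the throughput and power columns is energy per frame (watts divided by frames per second). The sketch below assumes the table's power column is a run average, which the method notes do not state explicitly, so treat the results as rough; `joules_per_frame` is an illustrative helper name.

```python
# (label, fps, power in W) taken from the table.
rows = [
    ("infer yolov8s HD1200@60", 56.4, 14.0),
    ("fusion yolov8s depth-every 1", 34.6, 17.3),
    ("fusion yolov8s depth-every 3", 45.8, 15.7),
    ("fusion yolov8s depth-every 6", 53.3, 15.2),
]

def joules_per_frame(fps, watts):
    """Energy per processed frame: W / (frames/s) = J/frame."""
    return watts / fps

for label, fps, watts in rows:
    print(f"{label}: {joules_per_frame(fps, watts):.2f} J/frame")
```

Under that assumption, depth-every-frame fusion costs about 0.50 J per frame, falling to roughly 0.34 J at --depth-every 3 and 0.29 J at --depth-every 6, against about 0.25 J for the detector-only path.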

What the data says

  • Detection is nearly free on the GPU. Detector-only yolov8s runs at 56 fps on the 60 fps HD1200 stream and 92 fps at SVGA@120, at 20 to 24% average GPU. The Metis carries the inference; the GPU is mostly idle. yolov8l (single-core) is the exception at ~37 fps, NPU-inference-bound.
  • Depth is the cost, not detection. Adding NEURAL_LIGHT depth + skeleton is what pulls the GPU to saturation. --depth-every is how you buy throughput back without dropping features.
  • Best “all features, fast”: yolov8s --depth-every 3 to 6 gives 46 to 53 fps with depth, skeleton, pose, world-frame, and tracking all live.
  • Most accurate: yolov8l caps near 37 to 39 fps whether or not depth runs (36.5 fps detector-only; 38.8 and 37.3 fps in fusion at --depth-every 3), because its single-core Metis inference (not depth) is the bottleneck there.
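  • Skeleton cost: at --depth-every 3, disabling body tracking (--no-bodies) leaves fps essentially unchanged (~44 vs ~46 with bodies, within run-to-run variance) but lowers GPU load (28% vs 34% average, 77% vs 86% peak). The body net is a real but modest GPU consumer.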
  • Power: 13.5 to 14.1 W detector-only, up to 17.3 W at full depth-every-frame fusion. All within the module’s MAXN_SUPER envelope.

Raw data: docs/assets/benchmarks/zedx_metis_bench.csv.