DMABUF Zero-Copy: ZED X to ISP to CMA to Metis
Purpose
This document describes the zero-copy frame path from the ZED X imager to the Axelera Metis NPU and the procedure that verifies it on-device. It is written for whoever is chasing the last CPU copies out of the hot path. The goal is a hot path in which no CPU memcpy of the pixel buffer occurs: the frame lives in a CMA-backed dma_buf, the Metis driver maps that buffer for PCIe DMA, and the result tensor is written back into a separately allocated dma_buf that the application maps at most once.
Both endpoints are verified live on the reference device: the Metis enumerates on PCIe at 0004:01:00.0 and binds to the metis driver (2026-06-10), and the ZED X captures 29.5 FPS stereo with CUDA depth (2026-06-11; see Drivers §1.5). The measured camera-to-NPU pipelines (Samples, ZED X + Metis C++, Benchmarks) still move pixels through host copies: NVMM/dmabuf caps do not negotiate with the installed Voyager 1.6.1 GStreamer elements, so the zero-copy path described here remains a design goal and its runtime numbers are pending the first recorded run (see Verification). The Metis driver uses the axl prefix internally for its ioctls and ftrace tracepoints, which is why the symbols below carry an axl_ / AXL_ name even though the bound driver is metis.
Architecture
ZED X imager --CSI/GMSL--> MAX9296 (ZED Link Mono deserializer)
                              |
                              v
                    +-------------------+
                    | Tegra VI / ISP    |  CSI host driver, Tegra 5.15 kernel
                    +---------+---------+
                              |
                              v
                    +-------------------+
                    | NvBufSurface      |  exporter: nvmap / tegra-buf
                    | (CMA-backed)      |  dma_buf inode N
                    +---------+---------+
                              | fd
                 +------------+------------+
                 v                         v
       +-------------------+  +-------------------------+
       | libargus consumer |  | metis driver (axl)      |
       | (zedx capture)    |  | ioctl AXL_IOCTL_        |
       |                   |  | IMPORT_FD               |
       +-------------------+  +------------+------------+
                                           |
                                           v
                          dma_buf_attach(dev=metis pci)
                          dma_buf_map_attachment(BIDIRECTIONAL)
                          sg_table -> IOVA via Tegra SMMU
                                           |
                                           v
                              +-------------------------+
                              | Metis PCIe DMA engine   |
                              | inference               |
                              +------------+------------+
                                           |
                                           v
                                 result dma_buf inode M
                                           |
                                           v
                      AV stack (mmap once or import to GStreamer)
Key invariants:
- The pixel buffer’s dma_buf inode (call it N) appears in the attachment list of two devices: the camera/ISP device and the Metis PCI device. Attach count is at least 2.
- No memcpy is observable in any user process between the frame arriving from the ISP and the inference being submitted. perf record -e cycles -g -p <pid> must show no __memcpy_generic or memcpy_aarch64 frame in the hot path of either the producer or the consumer process.
- The driver maps the buffer with DMA_BIDIRECTIONAL and uses the Tegra IOMMU group attached by arm-smmu.
- The result tensor is also a dma_buf allocated from the same CMA heap, so the application does not copy the result tensor in userspace either.
Kernel CONFIG dependencies
These options are enabled in the PREEMPT_RT defconfig. See Kernel Options for the full flag list and Verification Report Section 3 for the verification status of each.
CONFIG_DMA_SHARED_BUFFER=y
CONFIG_DMABUF_HEAPS=y
CONFIG_DMABUF_HEAPS_CMA=y
CONFIG_DMABUF_HEAPS_SYSTEM=y
CONFIG_DMABUF_SYSFS_STATS=y
CONFIG_SYNC_FILE=y
CONFIG_CMA=y
CONFIG_CMA_SIZE_MBYTES=2048
CONFIG_DMA_CMA=y
CONFIG_IOMMU_DMA=y
CONFIG_ARM_SMMU=y
Note on the effective CMA pool: there is no cma= boot argument, so the runtime pool is sized by the device tree linux,cma node (256 MB; CmaTotal reads 262144 kB, verified on device 2026-06-10). The CONFIG_CMA_SIZE_MBYTES=2048 defconfig value is only the kernel’s fallback default and does not size the pool on this board.
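As a rough budget check, the sketch below estimates how many 1080p NV12 frames fit in that pool. It assumes tightly packed NV12 (12 bits per pixel) and ignores pitch alignment and every other CMA user, so treat the result as an upper bound. The helper names are illustrative and not part of the repo's scripts.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """NV12 is a full-resolution Y plane plus a half-resolution interleaved UV plane."""
    return width * height * 3 // 2


def cma_total_kb(meminfo_text: str) -> int:
    """Extract CmaTotal (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("CmaTotal:"):
            return int(line.split()[1])
    raise ValueError("CmaTotal not found; is CONFIG_CMA enabled?")


def frames_that_fit(meminfo_text: str, width: int, height: int) -> int:
    """Upper bound on packed NV12 frames that fit in the CMA pool."""
    return cma_total_kb(meminfo_text) * 1024 // nv12_frame_bytes(width, height)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(frames_that_fit(f.read(), 1920, 1080))
```

With CmaTotal at 262144 kB, a 1920x1080 NV12 frame of 3110400 bytes fits 86 times, which bounds the combined depth of the capture and result queues.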
Sanity check after first boot:
zgrep -E 'DMABUF_HEAPS|DMABUF_SYSFS|CMA|ARM_SMMU' /proc/config.gz
ls /dev/dma_heap/ # must contain "linux,cma" and "system"
ls /sys/kernel/dmabuf/buffers/ # populated once the first export happens
If /sys/kernel/dmabuf/buffers/ does not exist after the flash, CONFIG_DMABUF_SYSFS_STATS=y did not take. Re-verify the defconfig fragment injected by scripts/01_extract_and_patch.sh.
Buffer allocation paths: which heap when
| Producer | Heap | Rationale |
|---|---|---|
| ZED X capture (libargus / nvarguscamerasrc) | linux,cma (NvBufSurface backing) | NVMM expects physically contiguous memory; the ISP descriptors require it. |
| Metis result tensor | linux,cma | PCIe DMA wants physical contiguity unless SMMU scatter-gather is used. |
| Telemetry / blackbox / metadata | system | Scatter-gather fine; no DMA constraint. |
ZED X → NvBufSurface → dma_buf FD
This follows L4T’s standard capture pattern. The ZED X kernel driver presents the camera as a V4L2 / Argus device, and the capture path is:
/* After ICaptureSession::createOutputStream and IFrameConsumer::acquireFrame.
   IFrame, Image and NV::IImageNativeBuffer live in the EGLStream namespace. */
EGLStream::IFrame *iFrame = Argus::interface_cast<EGLStream::IFrame>(frame);
EGLStream::Image *image = iFrame->getImage();
EGLStream::NV::IImageNativeBuffer *iNative =
    Argus::interface_cast<EGLStream::NV::IImageNativeBuffer>(image);

/* Hand back a NvBufSurface FD instead of CPU-mapping the image.
   The optional rotation and status arguments keep their defaults. */
int dmabuf_fd = iNative->createNvBuffer(
    Argus::Size2D<uint32_t>(width, height),
    NVBUF_COLOR_FORMAT_NV12_ER,   /* matches Metis input expectation */
    NVBUF_LAYOUT_PITCH);

/* dmabuf_fd is a real Linux dma_buf FD.
   Verify: readlink /proc/self/fd/<fd> → "/dmabuf:[NNNNN]" */

/* Pass to Metis: */
struct axl_import_fd req = {
    .fd     = dmabuf_fd,
    .width  = width,
    .height = height,
    .pixfmt = AXL_PIXFMT_NV12,
    .flags  = AXL_IMPORT_BIDIRECTIONAL,
};
if (ioctl(metis_dev_fd, AXL_IOCTL_IMPORT_FD, &req) < 0)
    perror("AXL_IOCTL_IMPORT_FD");
Notes:
- The NvBufSurface exporter is nvmap on R36.x. /sys/kernel/dmabuf/buffers/<inode>/exporter_name reads nvmap for ISP-produced buffers.
- The pixel format must be NV12. The ISP already produces NV12, so no GPU or VIC colorspace conversion step is needed.
- Do not call madvise(MADV_DONTNEED) on the mapping. The ISP descriptor table holds the page references for the lifetime of the CaptureSession.
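The sysfs directory for a buffer is named after the inode of the dma_buf file, so a process holding the FD can locate its own entry with fstat. A minimal sketch, assuming CONFIG_DMABUF_SYSFS_STATS is active; the helper names are illustrative.

```python
import os


def dmabuf_sysfs_dir(fd: int, root: str = "/sys/kernel/dmabuf/buffers") -> str:
    """Return the sysfs stats directory that corresponds to a dma_buf fd."""
    return f"{root}/{os.fstat(fd).st_ino}"


def exporter_name(fd: int) -> str:
    """Read the exporter name for a dma_buf fd (expected: nvmap for ISP buffers)."""
    with open(os.path.join(dmabuf_sysfs_dir(fd), "exporter_name")) as f:
        return f.read().strip()
```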
In-tree Metis driver: dma_buf import path
This is the bridge that converts a dma_buf FD from the producer side into a Metis-addressable IOVA. It lives in drivers/misc/axelera/axl_dmabuf.c alongside the existing axl_main.c, axl_pci.c, and axl_ioctl.c. The driver binds as metis but keeps the axl symbol prefix from the upstream Axelera tree.
/* drivers/misc/axelera/axl_dmabuf.c */
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/iommu.h>
#include <linux/slab.h>      /* kzalloc, kfree */
#include <linux/uaccess.h>   /* copy_from_user, copy_to_user */
#include "axl.h"

struct axl_imported_buf {
    struct list_head node;
    u32 handle;
    struct dma_buf *dmabuf;
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;
    enum dma_data_direction dir;
    size_t size;
    dma_addr_t iova_base;
    bool contiguous;
};

int axl_import_dmabuf(struct axl_device *axl,
                      struct axl_import_fd __user *uargs)
{
    struct axl_import_fd args;
    struct axl_imported_buf *ibuf;
    struct dma_buf *dmabuf;
    enum dma_data_direction dir;
    int ret;

    if (copy_from_user(&args, uargs, sizeof(args)))
        return -EFAULT;

    /* Takes a reference on the dma_buf behind the FD. */
    dmabuf = dma_buf_get(args.fd);
    if (IS_ERR(dmabuf))
        return PTR_ERR(dmabuf);

    ibuf = kzalloc(sizeof(*ibuf), GFP_KERNEL);
    if (!ibuf) { ret = -ENOMEM; goto err_put; }
    ibuf->dmabuf = dmabuf;
    ibuf->size = dmabuf->size;

    dir = (args.flags & AXL_IMPORT_BIDIRECTIONAL) ? DMA_BIDIRECTIONAL :
          (args.flags & AXL_IMPORT_TO_DEVICE)     ? DMA_TO_DEVICE :
                                                    DMA_FROM_DEVICE;
    ibuf->dir = dir;

    /* Attach the Metis PCI device, then map to get device-visible addresses. */
    ibuf->attach = dma_buf_attach(dmabuf, &axl->pdev->dev);
    if (IS_ERR(ibuf->attach)) { ret = PTR_ERR(ibuf->attach); goto err_free; }

    ibuf->sgt = dma_buf_map_attachment(ibuf->attach, dir);
    if (IS_ERR(ibuf->sgt)) { ret = PTR_ERR(ibuf->sgt); goto err_detach; }

    /* A single mapped entry is contiguous by definition; otherwise check. */
    ibuf->iova_base = sg_dma_address(ibuf->sgt->sgl);
    ibuf->contiguous = ibuf->sgt->nents == 1 ||
                       axl_iova_is_contiguous(ibuf->sgt);

    ibuf->handle = axl_alloc_handle(axl);
    args.handle = ibuf->handle;
    args.iova_base = ibuf->iova_base;
    args.size = ibuf->size;

    /* Report to userspace before publishing the buffer, so a failed copy
     * never leaves a freed entry on the list. */
    if (copy_to_user(uargs, &args, sizeof(args))) { ret = -EFAULT; goto err_unmap; }

    mutex_lock(&axl->ibuf_lock);
    list_add(&ibuf->node, &axl->imported_bufs);
    mutex_unlock(&axl->ibuf_lock);

    trace_axl_dmabuf_import(ibuf->handle, dmabuf, ibuf->iova_base, ibuf->size);
    return 0;

err_unmap:  dma_buf_unmap_attachment(ibuf->attach, ibuf->sgt, dir);
err_detach: dma_buf_detach(dmabuf, ibuf->attach);
err_free:   kfree(ibuf);
err_put:    dma_buf_put(dmabuf);
    return ret;
}

int axl_release_dmabuf(struct axl_device *axl, u32 handle)
{
    struct axl_imported_buf *ibuf;

    /* Look up and unlink under the list lock.
     * Assumes axl_find_ibuf() does not take ibuf_lock itself. */
    mutex_lock(&axl->ibuf_lock);
    ibuf = axl_find_ibuf(axl, handle);
    if (ibuf)
        list_del(&ibuf->node);
    mutex_unlock(&axl->ibuf_lock);
    if (!ibuf)
        return -ENOENT;

    trace_axl_dmabuf_release(handle, ibuf->dmabuf);
    dma_buf_unmap_attachment(ibuf->attach, ibuf->sgt, ibuf->dir);
    dma_buf_detach(ibuf->dmabuf, ibuf->attach);
    dma_buf_put(ibuf->dmabuf);
    kfree(ibuf);
    return 0;
}
Tracepoint definitions in drivers/misc/axelera/axl_trace.h:
TRACE_EVENT(axl_dmabuf_import,
    TP_PROTO(u32 handle, struct dma_buf *dmabuf, dma_addr_t iova, size_t size),
    TP_ARGS(handle, dmabuf, iova, size),
    TP_STRUCT__entry(
        __field(u32, handle)
        __field(unsigned long, inode)
        __field(u64, iova)
        __field(size_t, size)
        __string(exporter, dmabuf->exp_name)
    ),
    TP_fast_assign(
        __entry->handle = handle;
        __entry->inode = file_inode(dmabuf->file)->i_ino;
        __entry->iova = iova;
        __entry->size = size;
        __assign_str(exporter, dmabuf->exp_name);
    ),
    TP_printk("handle=%u inode=%lu exporter=%s iova=0x%llx size=%zu",
        __entry->handle, __entry->inode, __get_str(exporter),
        __entry->iova, __entry->size)
);
Once compiled in, the events surface at /sys/kernel/debug/tracing/events/axl/axl_dmabuf_import/ and /sys/kernel/debug/tracing/events/axl/axl_dmabuf_release/.
Userspace handoff
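/* userspace/axl_zerocopy.c */
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>
/* Also needs the Metis UAPI header that defines the axl_* structs and ioctls. */

int axl_zerocopy_submit(int axl_fd, int dmabuf_fd,
                        const struct axl_shape *shape,
                        struct axl_inference_result *out)
{
    struct axl_import_fd imp = {
        .fd     = dmabuf_fd,
        .width  = shape->w,
        .height = shape->h,
        .pixfmt = shape->pixfmt,
        .flags  = AXL_IMPORT_TO_DEVICE,
    };
    struct axl_submit req = { 0 };
    struct pollfd pfd;
    int result_fd, n, ret = 0;

    if (ioctl(axl_fd, AXL_IOCTL_IMPORT_FD, &imp) < 0)
        return -errno;

    /* Assumed to return a negative errno value on failure. */
    result_fd = axl_alloc_result_dmabuf(axl_fd, shape->result_size);
    if (result_fd < 0) { ret = result_fd; goto out_release; }

    req.input_handle = imp.handle;
    req.result_fd    = result_fd;
    req.model_id     = shape->model_id;
    req.timeout_us   = 50000;
    if (ioctl(axl_fd, AXL_IOCTL_SUBMIT, &req) < 0) { ret = -errno; goto out_close; }

    /* Sleep until the fence signals; 0 means timeout, -1 means poll() failed. */
    pfd = (struct pollfd){ .fd = req.fence_fd, .events = POLLIN };
    n = poll(&pfd, 1, shape->timeout_ms);
    if (n <= 0) {
        ret = (n < 0) ? -errno : -ETIMEDOUT;
        close(req.fence_fd);
        goto out_close;
    }

    /* Success: the caller now owns result_fd and fence_fd. */
    out->result_fd  = result_fd;
    out->fence_fd   = req.fence_fd;
    out->latency_us = req.completion_us - req.submit_us;
    goto out_release;

out_close:
    close(result_fd);
out_release:
    /* Drop the input import on every path so the attachment does not leak. */
    ioctl(axl_fd, AXL_IOCTL_RELEASE_IMPORT, &imp.handle);
    return ret;
}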
The dma_fence FD path avoids busy-waiting: the kernel signals the fence from the Metis IRQ handler, which wakes the poll().
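The same wait pattern can be tried without the hardware. The sketch below uses a pipe as a stand-in for the fence FD: the reader sleeps in poll() until the writer makes the descriptor readable, which mirrors how the fence FD wakes the waiter. The pipe and the function name are illustrative only and not part of the Metis API.

```python
import os
import select


def wait_readable(fd: int, timeout_ms: int) -> bool:
    """Sleep in poll() until fd is readable; return False on timeout."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return bool(poller.poll(timeout_ms))


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(wait_readable(read_end, 10))   # nothing written yet: times out
    os.write(write_end, b"\x01")         # stand-in for the fence signaling
    print(wait_readable(read_end, 10))   # now readable
```

No CPU is spent while the waiter is blocked, which is the property the fence path relies on.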
GStreamer pipeline
For the AV-stack path that pipes through GStreamer for ROS 2 image_transport interop:
gst-launch-1.0 -v \
  nvarguscamerasrc sensor-id=0 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12,framerate=60/1' ! \
  nvvidconv ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  axinferencenet model=/opt/axelera/models/yolov8n-coco.json device=metis-0:01:0 \
    import-mode=dmabuf output-buffer-mode=dmabuf ! \
  fakesink sync=false
The axinferencenet element ships with the Voyager SDK, which jetson-first-boot provisions into /opt/av-env. Voyager 1.6.1 is installed and verified live on the reference device (see AV stack § Voyager inference), but on this build the NVMM/dmabuf caps in the pipeline above do not yet negotiate end to end; the working samples use host copies instead (see Samples). Treat this pipeline as the target shape.
Critical caps constraint: (memory:NVMM) must appear on every link. As soon as a link drops to plain video/x-raw, GStreamer inserts a CPU copy via nvvidconv and zero-copy is broken. Confirm the property names for the installed Voyager SDK before relying on them:
gst-inspect-1.0 axinferencenet | grep -i dma
For ROS 2 publishing without breaking zero-copy, use isaac_ros_argus_camera rather than the generic image_publisher. ROS 2 Humble and Isaac ROS are installed by the repo’s Phase 5 scripts and verified on the reference device; see AV stack.
Kernel tracepoints and ftrace setup
Run scripts/setup_dmabuf_trace.sh before the workload, then scripts/stop_dmabuf_trace.sh <output_file> after. The setup script enables:
- dma_fence:{dma_fence_init,dma_fence_emit,dma_fence_signaled,dma_fence_wait_start,dma_fence_wait_end,dma_fence_destroy}
- axl:{axl_dmabuf_import,axl_dmabuf_release}
- function_graph filtered to dma_buf_{attach,detach,map_attachment,unmap_attachment,export,fd}
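For orientation, each event in the two groups corresponds to one enable file under the tracefs events directory. The sketch below only builds those paths; it is an illustration of the standard tracefs layout, not the contents of scripts/setup_dmabuf_trace.sh, and the names EVENTS and enable_paths are hypothetical.

```python
from pathlib import PurePosixPath

TRACEFS = PurePosixPath("/sys/kernel/debug/tracing")

# Event groups named in the list above.
EVENTS = {
    "dma_fence": [
        "dma_fence_init", "dma_fence_emit", "dma_fence_signaled",
        "dma_fence_wait_start", "dma_fence_wait_end", "dma_fence_destroy",
    ],
    "axl": ["axl_dmabuf_import", "axl_dmabuf_release"],
}


def enable_paths(events=EVENTS, root=TRACEFS):
    """Return the per-event 'enable' files; writing '1' to each turns the event on."""
    return [
        str(root / "events" / system / name / "enable")
        for system, names in events.items()
        for name in names
    ]


if __name__ == "__main__":
    for path in enable_paths():
        print(path)
```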
Verification
Run scripts/verify_dmabuf_zerocopy.sh. It starts the GStreamer pipeline, captures ftrace, runs perf record, inspects /sys/kernel/dmabuf/buffers/, and hands the artifacts to scripts/check_dmabuf_invariants.py, which checks four invariants. The camera and the Voyager SDK are both live on the reference device, but the run cannot pass until the NVMM/dmabuf negotiation gap above is closed; until then, treat the outputs below as the expected shape, not a recorded result.
| Invariant | Check |
|---|---|
| I1 | At least one dma_buf inode shows attachments >= 2 during the run |
| I2 | That inode’s exporter_name is one of {nvmap, system, tegra-buf} |
| I3 | axl:axl_dmabuf_import fires at least once (one import per submitted frame) |
| I4 | No memcpy or __memcpy_generic frame in the perf hot path above 1% self |
Pass example (result.json):
{
  "I1_attach_count_ge2": true,
  "I2_known_exporter": true,
  "I3_axl_imports": true,
  "I4_no_memcpy_hot": true,
  "detail": {
    "imports": 300,
    "releases": 300,
    "fence_wait_lines": 600,
    "multi_attach_inodes": ["41023", "41024", "41025", "41026"],
    "exporters": ["nvmap"],
    "memcpy_hot": []
  },
  "pass": true
}
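The four checks and the top-level result keys above can be combined as in the following sketch. It is not the code of scripts/check_dmabuf_invariants.py; the function signature and input shapes are assumptions made for illustration, and the inputs are taken to be already parsed from sysfs, ftrace, and perf.

```python
KNOWN_EXPORTERS = {"nvmap", "system", "tegra-buf"}


def check_invariants(attachments, exporters, import_count, memcpy_self_pct):
    """Evaluate I1..I4.

    attachments:     {inode: attachment count seen during the run}
    exporters:       {inode: exporter_name}
    import_count:    number of axl_dmabuf_import trace lines
    memcpy_self_pct: {symbol: self percentage in the perf hot path}
    """
    multi = sorted(inode for inode, n in attachments.items() if n >= 2)
    hot = {sym: pct for sym, pct in memcpy_self_pct.items() if pct > 1.0}
    result = {
        "I1_attach_count_ge2": bool(multi),
        "I2_known_exporter": bool(multi)
        and all(exporters.get(inode) in KNOWN_EXPORTERS for inode in multi),
        "I3_axl_imports": import_count >= 1,
        "I4_no_memcpy_hot": not hot,
    }
    result["pass"] = all(result.values())
    return result
```

I2 is evaluated only over the inodes that satisfied I1, following the table's wording "that inode's exporter_name".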
Trace excerpt:
gst-launch-1.0-3142 [003] d..2. 123.456789: axl_dmabuf_import: handle=17 inode=41023 exporter=nvmap iova=0x80100000 size=3110400
gst-launch-1.0-3142 [003] d..2. 123.456891: dma_fence_init: driver=axl timeline=metis-0 ctx=2 seqno=17
gst-launch-1.0-3142 [003] d..2. 123.473102: dma_fence_emit: driver=axl timeline=metis-0 ctx=2 seqno=17
gst-launch-1.0-3142 [001] dN.2. 123.473115: dma_fence_wait_start: driver=axl timeline=metis-0 ctx=2 seqno=17
<kworker/u16:2-89> [005] d..2. 123.481903: dma_fence_signaled: driver=axl timeline=metis-0 ctx=2 seqno=17
gst-launch-1.0-3142 [001] dN.2. 123.481940: dma_fence_wait_end: driver=axl timeline=metis-0 ctx=2 seqno=17
gst-launch-1.0-3142 [003] d..2. 123.481955: axl_dmabuf_release: handle=17 inode=41023
In this excerpt the fence wait (dma_fence_wait_start to dma_fence_wait_end) is about 8.8 ms for a single yolov8n inference at 1080p NV12 on Metis. mmap_count == 0 on the frame buffer is the definitive signal that no userspace CPU mapping exists. The numbers above are illustrative; replace them with a real capture after the first on-device run.
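The fence wait can be derived from a capture by pairing dma_fence_wait_start and dma_fence_wait_end lines. The sketch below assumes the line layout of the excerpt above and pairs events by seqno alone, which is enough for a single timeline; a capture with several fence contexts would need ctx in the key as well. The function name is illustrative.

```python
import re

# "<task>-<pid> [cpu] flags timestamp: event: fields"
_LINE = re.compile(r"\s(?P<ts>\d+\.\d+): (?P<event>\w+): (?P<fields>.*)$")
_SEQNO = re.compile(r"seqno=(\d+)")


def fence_waits_ms(trace_text: str) -> dict:
    """Map seqno -> milliseconds between dma_fence_wait_start and dma_fence_wait_end."""
    starts, waits = {}, {}
    for line in trace_text.splitlines():
        m = _LINE.search(line)
        if not m:
            continue
        seq = _SEQNO.search(m["fields"])
        if not seq:
            continue
        seqno, ts = int(seq.group(1)), float(m["ts"])
        if m["event"] == "dma_fence_wait_start":
            starts[seqno] = ts
        elif m["event"] == "dma_fence_wait_end" and seqno in starts:
            waits[seqno] = (ts - starts.pop(seqno)) * 1000.0
    return waits
```

Applied to the excerpt, this pairs the two seqno=17 lines at 123.473115 and 123.481940.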
Failure modes
| Symptom | Diagnosis | Fix |
|---|---|---|
| axinferencenet errors unknown property import-mode | Voyager 1.6’s GStreamer plugin uses a different prop name | gst-inspect-1.0 axinferencenet \| grep -i dma; update pipeline |
| attachments always == 1 | Voyager runtime is CPU-mapping via mmap | Check lsof for /dmabuf: entries with mem access; switch to libargus path |
| memcpy_aarch64 > 5% in perf | nvvidconv is converting format/colorspace | Make sink and source caps bit-identical including format, color_range, framerate, width, height |
| dma_fence_wait_* shows > 50 ms | Metis back-pressured or PCIe link degraded | lspci -vvv -d 1f9d:1100 \| grep LnkSta: (must show Speed 8GT/s, Width x4) |
| exporter_name is system not nvmap | Argus allocating from system heap | Pass nvbuf-memory-type=4 (CMA) on nvarguscamerasrc |
| axl_dmabuf_import never fires | Driver path not wired or ioctl number mismatch | dmesg \| grep axl; strace -e ioctl gst-launch ...; verify AXL_IOCTL_IMPORT_FD magic agrees between kernel and userspace include |
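The PCIe link check in the table can be automated by parsing the LnkSta line. A minimal sketch, assuming the usual lspci -vvv wording ("Speed ..., Width ..."); the function name and default arguments are illustrative, with the expected values taken from the table.

```python
import re


def link_ok(lnksta_line: str, speed: str = "8GT/s", width: str = "x4") -> bool:
    """Return True if the LnkSta line reports the expected speed and width."""
    m = re.search(r"Speed\s+([\d.]+GT/s).*?Width\s+(x\d+)", lnksta_line)
    return bool(m) and m.group(1) == speed and m.group(2) == width
```

A line reporting a lower speed or a narrower width than expected returns False, which points at the degraded-link diagnosis rather than back-pressure.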