Description
To support higher-bandwidth detector data, systems must be able to move data directly to the processing elements with minimal software intervention. For example, operating the ePixUHR 35K detector at LCLS-II will generate data on the order of 250 GB/s at 35 kHz, far more than the existing CPU-based DAQ setup can handle. Using NVIDIA’s GPUDirect RDMA technology, we implemented a low-latency, high-throughput data flow that allows acquired data to be compressed and processed on the GPU with minimal CPU involvement. Our test setup consists of an AMD Kintex KCU1500 FPGA board and an NVIDIA RTX A5000 GPU on the same PCIe root complex. RDMA allows the KCU1500’s custom firmware to write data directly into GPU memory, skipping the intermediate DMA transfer to host main memory that would otherwise be required. We use device-launchable CUDA graphs to initiate DMA transfers and process the incoming data, so that control flow and data processing take place entirely on the GPU while the host processor takes a supervisory role. Confining control flow to the GPU in this way resulted in a significant reduction in measured latency. This approach has the potential to support the next-generation detectors required for future high-energy physics experiments.
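To give a feel for the scale of the quoted figures, the short sketch below turns the approximate data rate and frame rate into a per-frame data volume and a per-frame time budget. It assumes that "GB" means decimal gigabytes (10^9 bytes) and that one frame is read out per 35 kHz cycle. The function name `frame_budget` is purely illustrative and is not part of any DAQ software.

```python
# Back-of-envelope figures derived from the numbers quoted in the abstract.
# Assumption: "GB" is decimal (1e9 bytes) and one frame is read out per cycle.
DATA_RATE_BYTES_PER_S = 250e9  # ~250 GB/s, quoted as an order-of-magnitude figure
FRAME_RATE_HZ = 35e3           # 35 kHz


def frame_budget(data_rate, frame_rate):
    """Return (bytes per frame, seconds available per frame)."""
    return data_rate / frame_rate, 1.0 / frame_rate


bytes_per_frame, seconds_per_frame = frame_budget(
    DATA_RATE_BYTES_PER_S, FRAME_RATE_HZ
)

print(f"{bytes_per_frame / 1e6:.1f} MB per frame")        # 7.1 MB per frame
print(f"{seconds_per_frame * 1e6:.1f} us between frames")  # 28.6 us between frames
```

Under these assumptions, each readout cycle carries roughly 7 MB and leaves under 30 microseconds before the next one arrives, which is why software intervention on the data path is so costly. These are aggregate figures; the abstract does not state how the load would be divided across FPGA and GPU pairs.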