LAB5

Name: Mintu Kumar
Roll no.: 2022296
Group members: Aman Kumar (2022060), Saumya Trivedi (2021198), Pankaj Yadav (2022347)

Introduction:
In this lab, I implemented a Finite Impulse Response (FIR) filter on a Xilinx Zynq board using two different Direct Memory Access (DMA) channels. One DMA uses the Accelerator Coherency Port (ACP) and the other uses the High-Performance (HP) port. The purpose of this experiment is to compare the execution time of the FIR filter when it runs in software on the Processing System (PS) with its execution time when it is accelerated in hardware on the Programmable Logic (PL).

The ACP port provides cache-coherent data transfers between the PS and the PL, which keeps data consistent without software cache maintenance. The HP port, on the other hand, is designed for fast, high-throughput data transfers but is not cache-coherent. By using both, I was able to compare how each method performs in terms of speed and efficiency.

Block Diagram:

Block Diagram with ILA:

Address Mapping:

Code:

    // 2) Configure ACP DMA and send the data
    XAxiDma DMA_instance;
    XAxiDma_Config *DMA_Config = XAxiDma_LookupConfig(XPAR_AXI_DMA_ACP_DEVICE_ID);
    ...
    XAxiDma_CfgInitialize(&DMA_instance, DMA_Config);

XPAR_AXI_DMA_ACP_DEVICE_ID is defined in xparameters.h, which is generated from the Vivado hardware design. XAxiDma_CfgInitialize sets up the DMA driver so we can start transfers.

    // 3) Start timing for the PL-based FIR via ACP
    XTime PL_Start, PL_End;
    XTime_SetTime(0);
    XTime_GetTime(&PL_Start);
    // DMA: PL writes output to FIR_PL_Output
    status = XAxiDma_SimpleTransfer(&DMA_instance, (UINTPTR)FIR_PL_Output, FIR_Data_Size * sizeof(int), XAXIDMA_DEVICE_TO_DMA);
    // DMA: PS sends FIR_input to the PL
    status = XAxiDma_SimpleTransfer(&DMA_instance, (UINTPTR)FIR_input, FIR_Data_Size * sizeof(int), XAXIDMA_DMA_TO_DEVICE);
    // Wait for TX (DMA_TO_DEVICE) to complete
    ...
    // Wait for RX (DEVICE_TO_DMA) to complete
    ...
    XTime_GetTime(&PL_End);

Calls XAxiDma_SimpleTransfer to send the input array to the PL (DMA_TO_DEVICE) and to receive the output array from the PL (DEVICE_TO_DMA).
Uses status register polling to wait for both the transmit (TX) and receive (RX) transfers to finish.
Times this process with XTime_GetTime to measure the PL execution time in microseconds.

The HP test has a similar structure to comp_PSvsACP, but uses the HP port for DMA:

    // 2) Configure HP DMA
    XAxiDma DMA_instanceHP;
    XAxiDma_Config *DMA_ConfigHP = XAxiDma_LookupConfig(XPAR_AXI_DMA_HP_DEVICE_ID);
    XAxiDma_CfgInitialize(&DMA_instanceHP, DMA_ConfigHP);
    // 3) Flush caches, start timing, do DMA transfers
    Xil_DCacheFlushRange((UINTPTR)FIR_input, sizeof(int)*FIR_Data_Size);
    Xil_DCacheFlushRange((UINTPTR)FIR_PL_Output, sizeof(int)*FIR_Data_Size);
    ...
    // Wait for TX/RX to complete
    ...
    Xil_DCacheInvalidateRange((UINTPTR)FIR_PL_Output, sizeof(int)*FIR_Data_Size);

Differences:
We explicitly flush the caches before sending data to the PL, and invalidate them after receiving data back. This is needed for non-coherent ports such as HP, so that the DMA reads up-to-date input from memory and the CPU does not read stale cached output.
The remaining steps (timing, checking for mismatches, printing results) are similar to the ACP test.

Output:

AXI Port | PS (Software FIR) Time | PL (Hardware FIR) Time | Observations / Reason
ACP | 64.984619 µs | 5.073846 µs | ACP is cache-coherent, reducing overhead for small data. Hardware acceleration on the PL is significantly faster than software on the PS.
HP | 64.510773 µs | 7.006154 µs | HP port requires manual cache flush/invalidate. Slightly more overhead compared to ACP in this example, but still much faster than the PS.

Key Points

1. PS (Software) vs. PL (Hardware) Execution:
In both the ACP and HP scenarios, the PL (hardware FIR) is significantly faster than the PS (software FIR): about 13x with ACP and about 9x with HP in these measurements.
This highlights the benefit of offloading the FIR filtering to dedicated hardware in the programmable logic.

2. ACP vs. HP:
The ACP port is cache-coherent, which can reduce data transfer overhead for relatively small data sets, leading to a faster total execution time (about 5 µs in this example).
The HP port is optimized for high throughput, especially for large data transfers, but requires explicit cache maintenance (flush/invalidate), which adds some overhead for small data sets (about 7 µs here).

3. Overall Performance:
Both hardware-accelerated (PL) methods are much faster than software running on the PS.
The choice between ACP and HP depends on data size and cache-coherency needs.
For small data, ACP can be slightly faster due to built-in coherency.
For large streaming data, the HP port may become more efficient.
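
To make the PS vs. PL comparison concrete, the measured times can be turned into a speedup factor. The following is only a minimal sketch and is not taken from the lab code: the helper names ticks_to_us and speedup are hypothetical, and counts_per_second is an assumed parameter standing in for the timer frequency of the platform (the value the timer difference between PL_Start and PL_End would be divided by).

```c
#include <assert.h>
#include <stdint.h>

/* Convert a timer tick interval to microseconds.
 * counts_per_second is an assumed parameter: the tick rate of the
 * timer that produced start and end. */
double ticks_to_us(uint64_t start, uint64_t end, uint64_t counts_per_second)
{
    assert(counts_per_second > 0 && end >= start);
    return (double)(end - start) * 1e6 / (double)counts_per_second;
}

/* Speedup of the PL (hardware) path relative to the PS (software) path.
 * Both arguments must use the same time unit. */
double speedup(double ps_time_us, double pl_time_us)
{
    assert(pl_time_us > 0.0);
    return ps_time_us / pl_time_us;
}
```

With the times reported in the output table, speedup(64.984619, 5.073846) gives roughly 12.8 for the ACP run and speedup(64.510773, 7.006154) gives roughly 9.2 for the HP run. Note that the PL times include the DMA transfers, so these factors describe the whole offload path rather than the filter core alone.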