[MSc] part 2- Implementation of Fast FIR Algorithms based on Partial Results Reuse on Zynq-7000 SoC by mateuszgrzyb

5 Software Application

Once the design has been synthesized, the user can generate a bitstream file. A bitstream is a file containing the FPGA configuration information: the internal logic and digital circuit of the FPGA, as well as device-specific information from other files associated with the FPGA target. When the bitstream is ready, the hardware information can be exported to the Software Development Kit (SDK) tool, where the user specifies the software application for the designed logic.

The Design Information is as follows:

• Target FPGA Device: 7z020

• Part: xc7z020clg484-1

• Target Processor: ps7_cortexa9_0

• Board Support Package OS – standalone

Standalone is a simple, low-level software layer. It provides access to basic processor features such as caches, interrupts, and exceptions, as well as the basic features of a hosted environment, such as standard input and output, profiling, abort, and exit. The BSP contains the drivers for the peripherals present in the design. Most of them relate to the processor unit; the DMA driver stands out among them.

The user has an overview of all IPs present in the design. The version of the implemented Fast FIR block is 1.00a, while the DMA is version 7.1.

Furthermore, the user can inspect the Address Map for the processor ps7_cortexa9. The register space of the AXI DMA starts at 0x40400000 and its High Address is 0x4040ffff. The DMA is a slave to the processor; communication between the CPU and the DMA is established over the AXI4-Lite protocol. Note that the Fast FIR IP is not present in the Address Map since it is not directly connected to the processor. In addition, there are three memory regions: one is the external DDR and two are On-Chip Memory.

The key task of the software application is to initialize the transmission between the designed Fast FIR IP and the external DDR via DMA. The application should also check whether the output of the IP is correct. To achieve this, the main function of the software application performs the following steps:

1. create the array of pixels

2. create the empty array for convolved pixels

3. concatenate 3 pixels in one 32-bit word

4. initialize the AXI DMA device

5. clear the Cache

6. establish transfer between DMA and the device and vice versa

7. polling operations on DMA

8. retrieve convolved pixels from 32-bit words

9. perform correctness test

10. return 0 if all steps end successfully

Each step will be discussed in detail in this chapter.

First, an array of 60 pixels is created. The pixels are stored in the external DDR memory in increasing order. Each pixel is 8 bits wide, so u8, which represents unsigned char, is used as the data type of the array. Second, an empty array for 60 pixels is initialized; it will store the pixels after the convolution process.

Each 3 consecutive pixels are stored in one 32-bit word, as indicated in Figure 7. For testing purposes, 20 32-bit words in total are stored in the array, ready to be sent to the peripheral. This configuration will be called the AXI Stream Data FIFO; the size of the FIFO exactly matches the size of the main loop in the Fast FIR IP. The array is stored in DDR. The 3 pixels are placed in one 32-bit word by concatenation.

The listing presents the concatenation in the software application. NUMBER_PIXEL is a macro equal to 20, a is the array of 32-bit words, and a_pixel is the array of configured pixels. MOVE indicates a left bit shift by 8 positions.
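A minimal sketch of such a concatenation in plain C is given below. It is an illustration under stated assumptions, not the thesis listing: the function name pack_pixels is hypothetical, stdint types stand in for the Xilinx u8/u32 typedefs, and the byte order (earliest pixel of a triple in bits 23..16, latest in bits 7..0) is assumed so that it agrees with the comments in Appendix A.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUMBER_PIXEL 20   /* number of 32-bit words, as stated in the text */
#define MOVE 8            /* shift by one pixel (8 bits) */

/* Pack 3 consecutive 8-bit pixels into each 32-bit word.
 * Assumed layout: the earliest pixel of a triple goes to bits 23..16,
 * the latest to bits 7..0; bits 31..24 stay zero. */
static void pack_pixels(const uint8_t *a_pixel, uint32_t *a, size_t words)
{
    assert(a_pixel != NULL && a != NULL);
    for (size_t i = 0; i < words; i++) {
        a[i] = ((uint32_t)a_pixel[3 * i]     << (2 * MOVE)) |
               ((uint32_t)a_pixel[3 * i + 1] << MOVE) |
                (uint32_t)a_pixel[3 * i + 2];
    }
}
```

With pixels 1, 2, 3 as the first triple, the first word becomes 0x00010203; the upper byte is left unused.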

The next step is to initialize the AXI DMA device. All the information necessary to set up and use the AXI DMA device is in the xaxidma.h reference file. From this file, we can learn that XAxiDma is the struct containing the driver instance data. This data must be allocated for each DMA engine. Another structure, XAxiDma_Config, passes the hardware build information to the driver. It contains information such as the Device ID, the Base Address, and the Address Width. DeviceId is the unique ID of the device to look up. The configuration is obtained with the XAxiDma_LookupConfig() function, which takes the Device ID and returns the configuration parameters. If the configuration is not present, the application returns the XST_FAILURE status, which is treated as an error.

The XAxiDma_CfgInitialize() function initializes a DMA engine. This function must be called before a DMA engine is utilized. To initialize an engine, the register base address and instance data must be set, and the hardware must be in a quiescent state. The function returns the status of initialization.

The last step in the DMA configuration is to disable interrupts. The interrupts are disabled in both directions, that is, from the DMA to the peripheral and vice versa. Polling is used instead to check whether the DMA has completed its transactions with the peripheral.

Before the transaction starts, it is important to flush the processor's data cache. The DMA reads the DDR directly, so without this step it may see stale data while the values written by the CPU still reside only in the cache. For this reason, the cache is explicitly flushed in software before the transaction starts.

When all the above steps are completed, the transaction between the DMA and the Fast FIR peripheral can be established. The DMA engine must not be in Scatter-Gather (SG) mode. The prototype of the transfer function is as follows:

u32 XAxiDma_SimpleTransfer(XAxiDma *InstancePtr, UINTPTR BuffAddr, u32 Length, int Direction);

The function works only in simple mode; if the engine is in SG mode or is busy, the transfer is not submitted. The function takes 4 parameters:

• InstancePtr is the pointer to the driver instance

• BuffAddr is the address of the source/destination buffer

• Length is the length of the transfer

• Direction is DMA transfer direction; it can take two valid values:

XAXIDMA_DMA_TO_DEVICE

XAXIDMA_DEVICE_TO_DMA

The simple transfer is initiated in both directions: from the DMA to the Fast FIR IP and vice versa. The total length of the transfer is 20 packets of 32-bit words, that is, NUMBER_PKT*sizeof(u32) bytes, in both directions.

If the simple transfer function does not succeed, it returns the relevant status and the application ends with a failure result.

After these operations, it is appropriate to check whether the DMA engine is still busy. This method is called polling. It is done with the following instruction:

    // Polling operations
    while ((XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA)) ||
           (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))) {
        // Wait
    }
    sleep(1);

Polling in electronics refers to a method of communication between devices in which a controlling device queries the controlled devices, one at a time, to determine whether they have any data or instructions to transmit. The controlling device, the processor in this case, sends a request to the controlled device (the DMA engine) and waits for a response before sending the next request. The XAxiDma_Busy() function checks whether the specified DMA channel is busy. It returns TRUE if the channel is busy and FALSE otherwise. The polling is built as a while loop with an empty body; thus, the loop exits only when the DMA engine is no longer busy. [23]
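Because the loop body is empty, the loop never exits if the engine stays busy. A bounded variant can be sketched as follows. This is a host-side illustration only and not part of the described application: busy_fn, wait_until_idle, max_polls, and fake_busy are hypothetical names, and on the target the callback would be expected to wrap XAxiDma_Busy() for both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns true while the engine is busy. */
typedef bool (*busy_fn)(void *ctx);

/* Poll until the engine is idle or until max_polls checks were made.
 * Returns 0 when idle, -1 on timeout. */
static int wait_until_idle(busy_fn is_busy, void *ctx, uint32_t max_polls)
{
    assert(is_busy != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (!is_busy(ctx)) {
            return 0;
        }
    }
    return -1;
}

/* Stand-in for the hardware: reports busy for the next *remaining polls. */
static bool fake_busy(void *ctx)
{
    uint32_t *remaining = (uint32_t *)ctx;
    if (*remaining == 0) {
        return false;
    }
    (*remaining)--;
    return true;
}
```

The timeout turns a stalled transfer into an error code that the application can report, instead of a hang.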

When polling is completed, the next step is to retrieve the data from the feedback stream. The objects from the feedback stream are stored in the array called b. There are 20 objects in total, matching the FIFO designed in the Fast FIR IP. Each object consists of 3 convolved pixels stored in one 32-bit word. By applying the 0xFF mask and right shifting, the 3 pixels are extracted from each 32-bit word.
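The mask-and-shift extraction can be sketched in plain C as follows. The function name unpack_pixels and the output array name b_pixel are hypothetical, stdint types stand in for u8/u32, and the byte order (bits 23..16 first, bits 7..0 last) is an assumption chosen to agree with the comments in Appendix A.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOVE 8       /* one pixel is 8 bits wide */
#define MASK 0xFFu   /* selects the lowest 8 bits */

/* Split each 32-bit word into 3 pixels of 8 bits.
 * Assumed layout: bits 23..16 hold the first pixel of the triple,
 * bits 7..0 the last one; bits 31..24 are ignored. */
static void unpack_pixels(const uint32_t *b, uint8_t *b_pixel, size_t words)
{
    assert(b != NULL && b_pixel != NULL);
    for (size_t i = 0; i < words; i++) {
        b_pixel[3 * i]     = (uint8_t)((b[i] >> (2 * MOVE)) & MASK);
        b_pixel[3 * i + 1] = (uint8_t)((b[i] >> MOVE) & MASK);
        b_pixel[3 * i + 2] = (uint8_t)(b[i] & MASK);
    }
}
```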

5.1 AXI Interfaces

In this section, the main transactions between the AXI DMA and the Fast FIR IP will be discussed. The AXI Stream Protocol is applied for the transactions between the DMA and the peripheral. The Full AXI4 Protocol is applied for read/write operations from DDR to stream and vice versa. The AXI4-Lite Interface is applied for configuring the DMA engine.

The AXI Stream protocol is a widely used protocol in digital systems for the efficient transfer of streaming data. It is part of the AXI family of protocols that are commonly used for communication between digital components in an integrated circuit (IC).

The AXI Stream protocol is designed for high-speed, unidirectional data transfer from a master to a slave. The data is transferred as a continuous stream of packets without any addressing information; the flow is regulated only by the valid/ready handshake. The protocol is designed to provide a high-bandwidth, low-latency interface for transferring large amounts of data.

The AXI Stream protocol defines two types of signals: data signals and control signals. The data signals carry the actual data being transferred, while the control signals provide information about the data transfer process.

The data signals consist of a single data channel and optional sideband signals. The data channel carries the actual data being transferred, while the sideband signals provide additional information.

The control signals include TVALID, TREADY, and TLAST. TVALID indicates that the master is driving valid data, TREADY indicates that the receiving component is ready to accept data, and TLAST marks the end of a packet.

The AXI Stream protocol is often used in applications such as video and audio processing, network packet processing, and high-speed data transfer. Its simplicity and high-bandwidth capabilities make it a popular choice for these types of applications.

5.2 Integrated Logic Analyzer

The Integrated Logic Analyzer (ILA) was placed in the system architecture. The ILA is a tool used for digital logic analysis and debugging. It is integrated into the PL part of the device and allows the capture and analysis of digital signals within the device. The main transactions mentioned in this section are reviewed below. The capture window of the ILA holds 1024 samples. Each sample corresponds to a certain point in time, which will be called a timestamp.

Figure 32 The Stream Transmission between DMA and Fast FIR IP. Captured by the Integrated Logic Analyzer

Figure 32 presents the stream transactions that take place between the DMA and the peripheral. Slot 1 presents the data transfer from the DMA to the Fast FIR IP. Slot 0 presents the data transfer from the Fast FIR IP to the DMA. A transfer occurs only when Tvalid and Tready are both high. Tdata in slot 1 contains the prepared values of 3 pixels in one 32-bit word.

On the other hand, Tdata in slot 0 carries the 3 convolved pixels after the Fast FIR computation. There is one bottleneck event, which lasts for 5 timestamps. Thus, the whole transaction of 20 data objects takes 25 timestamps. The first output is computed 3 timestamps after the first input is received. After that, a new output is delivered every timestamp (when there is no bottleneck), which agrees with the main loop interval of 1 clock cycle. The DMA sends the Tlast signal when the length of the transfer is reached. The Fast FIR IP sends the Tlast signal to the DMA after the last sample (#20) is computed; in fact, this is defined by the size of the for loop in the C code.
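The handshake rule and the timestamp count can be reproduced with a small host-side model. This is an illustrative sketch, not a reconstruction of the ILA capture: the function name stream_cycles, the position of the stall, and the assumption that Tvalid stays high for the whole transfer are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Count the clock cycles needed to move `beats` words over a stream in
 * which TVALID is assumed to be always high and TREADY is low for
 * `stall_len` cycles starting at cycle `stall_start`.
 * A word is transferred only in a cycle where both signals are high. */
static int stream_cycles(int beats, int stall_start, int stall_len)
{
    assert(beats >= 0 && stall_start >= 0 && stall_len >= 0);
    int moved = 0;
    int cycle = 0;
    while (moved < beats) {
        bool tvalid = true;
        bool tready = !(cycle >= stall_start &&
                        cycle < stall_start + stall_len);
        if (tvalid && tready) {
            moved++;
        }
        cycle++;
    }
    return cycle;
}
```

For 20 words and a single 5-cycle stall in the middle of the transfer, the model returns 25 cycles, in line with the count read from the capture; without a stall it returns 20.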


Figure 33 AXI4-Full Protocol. Read mode. Captured by the Integrated Logic Analyzer

Figure 33 represents the read data transfer from the external DDR to the DMA engine. It follows the full AXI4 protocol, which is memory mapped. The master sends the address it wants to read on the Read Address (AR) channel. When inspected, the address matches the address obtained in the terminal by the following instruction:

    xil_printf("Memory address of a: %p\n", &a);

The requested data can be inspected in the R channel. Notice that there are 2 channels for the read operation (address and read) and 3 channels for the write operation (address, data, and response, the B channel). The response signal is also included within the R channel. It is observed that the OKAY value is raised, which means that a normal access has been successful. [12]

Figure 34 AXI4-Full Protocol. Write mode. Captured by the Integrated Logic Analyzer

Figure 34 represents the write data transfer from the DMA engine to the external DDR memory. The master sends an address on the Write Address (AW) channel and transfers data on the Write Data (W) channel to the slave. The memory address matches the address retrieved by the following instruction:

    xil_printf("Memory address of b: %p\n", &b);

The data is sent in burst mode. Full AXI4 is best suited for high-performance memory-mapped requirements, such as read/write operations between the DDR and the DMA. Full AXI4 provides 5 channels to meet industry requirements. Overall, the AXI4-Full protocol is the most comprehensive version of the AXI4 protocol, and it defines the complete set of signals, properties, and behavior of the protocol.

5.3 System Output

The FPGA is programmed over JTAG. The communication between the ZedBoard and the PC uses the UART protocol. The port depends on the Device Manager of the user's PC; the other settings are as follows:

• Baud Rate – 115200

• Data Bits – 8

• Stop Bits – 1

• Parity – none

• Flow control – none

When the Fast FIR IP has finished its computation and the DMA engine has completed its task, the output is printed in the SDK terminal. In addition, a correctness test is performed: for reference, another simple convolution is computed by the CPU.

The conv function implements the function described in equation 1. If the output of the Fast FIR IP agrees with the output calculated by the CPU, the terminal prints “Passed!”; otherwise it reports “Results don’t match”. The results from the Fast FIR IP are correct, which is confirmed by the correctness test run by the CPU. Occasionally, the first two results are not correct because of wrong initialization of the static registers in the Fast FIR IP. Nevertheless, this does not impact a real application: only the first two pixels of the first frame can be affected, and the rest will be correct.
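A CPU-side reference of this kind can be sketched as follows. It is not the author's conv function: the names conv_ref and count_mismatches are hypothetical, and the sketch assumes that equation 1 is the standard direct-form FIR sum y(n) = Σ h(k)·x(n−k), with samples before the first input taken as zero and results truncated to 8 bits as in Appendix A.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 3   /* number of taps */

/* Direct N-tap FIR: y[n] = sum over k of h[k] * x[n-k].
 * Samples before x[0] are treated as 0 (assumption); the result is
 * truncated to 8 bits. */
static void conv_ref(const uint8_t *x, const uint8_t h[N],
                     uint8_t *y, size_t len)
{
    assert(x != NULL && h != NULL && y != NULL);
    for (size_t n = 0; n < len; n++) {
        unsigned acc = 0;
        for (size_t k = 0; k < N && k <= n; k++) {
            acc += (unsigned)h[k] * x[n - k];
        }
        y[n] = (uint8_t)acc;
    }
}

/* Return the number of positions where the hardware output differs
 * from the software reference. */
static size_t count_mismatches(const uint8_t *hw, const uint8_t *sw,
                               size_t len)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; i++) {
        if (hw[i] != sw[i]) {
            bad++;
        }
    }
    return bad;
}
```

A result of zero from count_mismatches corresponds to the “Passed!” case; counting mismatches rather than stopping at the first one makes it easy to see whether only the first two outputs are affected.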

Figure 35 The output of Fast FIR IP and correctness evaluation in the PC terminal

Figure 35 presents example pixel results after computation by the Fast FIR IP. According to the correctness test, in this scenario the first two values do not match the desired output.

This chapter summarized the system architecture and software application.

6 Conclusions and future perspectives

In the first part of the experiment, the Fast FIR algorithm based on Partial Results Reuse was successfully implemented as an IP block. The algorithm works in a MIMO architecture and for any generic N-tap FIR filter. In this study, it was implemented for N=3. In this case, in every clock cycle it can read 3 inputs and deliver 3 outputs of the convolution. Moreover, thanks to Partial Results Reuse, the number of atomic multiplications drops from 9 to 6 for a system with three inputs and outputs. In computer hardware, decreasing the number of multiplications or replacing multiplication with addition is significant, since an ADD operation is as fast as or even faster than a MUL operation. Modern HLS tools are based on popular compilers like GCC or Clang. For integer computation, those compilers very frequently avoid the MUL operation, replacing it with a suitable combination of bit shifts and ADD/SUB instructions. In HLS, there is therefore no point in reducing integer multiplications when the kernel is static and known at compile time, because the Partial Results Reuse technique only adds more additions and increases resource utilization. However, in other cases, for example when the kernel is not static and the multiplications are actually performed by a multiplier block, it is still a very worthwhile approach.

A completely different picture emerged when the kernel was replaced with floating-point coefficients. In this case, the multiplication was performed by the FMUL instruction and complex DSP blocks were added to the design during RTL translation. Reducing the number of multiplications decreased the number of FMUL operations and consequently reduced latency and interval. Moreover, the resource utilization was lower and, importantly, the number of DSP blocks dropped significantly. These two scenarios show in which situations applying Partial Results Reuse is beneficial.

In the second part of the research, the designed Fast FIR IP block was implemented on the Zynq-7000 System-on-Chip. In this architecture, the task of the system was to check the data transmission, using the DMA technique, between the external DDR memory and the constructed Fast FIR IP, which is not memory mapped. In this approach, the processor is offloaded from the data transmission since its only task is to configure the DMA engine. Various AXI protocols were exercised in this architecture. The AXI4-Lite protocol was used to configure the DMA engine, and the AXI4-Full protocol was used for data transmission between the DMA and the external DDR. Finally, the AXI4-Stream protocol was applied for data transmission between the DMA and the IP that is not memory mapped.

The protocols were used effectively, and it turned out that they can work simultaneously. This is a sound approach, because the DMA can start streaming to the peripheral almost immediately after it receives data from DDR. The opposite direction works similarly: the peripheral delivers outputs almost instantly after it receives inputs. These advantages reduce the overall latency of the system. An important conclusion is that the AXI4-Stream protocol is the best choice for peripherals that have no memory address and work in stream mode. On the other hand, the AXI4-Full protocol best fits tasks in which data located in a specific memory region has to be retrieved. Both protocols are highly efficient.

Furthermore, the outputs delivered by the Fast FIR IP were checked by the correctness test performed by the CPU, and they are accurate. This indicates that the designed IP is ready for use in a more complex system.

Figure 36 presents an example application of the Fast FIR IP: it can be placed in the position of the CV accelerator. In this setting, the system is intended to process pixels as soon as they become available.

The 32-bit AXI word contains 3 pixels of 8 bpp. This means that 8 bits remain unused. That space can be used for control purposes or for an additional pixel. Although placing an additional pixel can reduce the latency even more (a whole row from DDR will be transmitted faster), it requires more careful handling of the input/output order.

Figure 36 Example of Fast FIR IP application

It was stated that the Partial Results Reuse technique is very useful for floating-point computation. For example, floating-point kernels are used for image blurring. In fact, any kernel can be provided by the software application; it only requires a slight change of the function prototype. [24]

void fast_fir(hls::stream<ap_axis<32,2,5,6>> &x, hls::stream<ap_axis<32,2,5,6>> &y, float h[3]);

The Fast FIR IP implements 1-D convolution; in the future it can be adapted to 2-D convolution.

On the other hand, RGB images consist of three channels. Enabling additional channels in the DMA engine would allow every channel to be convolved independently and concurrently.

In recent years, the AXI5-Stream protocol was established. It introduces a new signal, TWAKEUP. One may consider applying the latest Stream protocol.

Finally, the completion of the DMA task is verified by polling. An interrupt-based approach would be more appropriate.

To sum up, the Partial Results Reuse technique reduces the number of atomic multiplications in a MIMO system for N-tap FIR filter computation. Reducing the number of multiplications is very beneficial, especially in floating-point operations. Besides that, the designed Fast FIR IP was tested and is ready to use on the Zynq-7000 System-on-Chip. The designed IP is not memory-mapped, so the AXI4-Stream protocol, designed specifically for streaming transmission, should be used to establish the connection between the DMA and the designed peripheral.


8 References

[1] W. Hohl, ARM Assembly Language: Fundamentals and Techniques, 1st ed., vol. 1. Helion, 2009.

[2] R. Saleh et al., “System-on-Chip: Reuse and Integration,” Proceedings of the IEEE, vol. 94, no. 6, pp. 1050–1069, Jun. 2006, doi: 10.1109/JPROC.2006.873611.

[3] “Zynq-7000 SoC Data Sheet: Overview,” no. Zynq-7000 SoC. 2018. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ds190-Zynq-7000-Overview

[4] “ZedBoard Product Brief (Datasheet),” no. ZedBoard. 2022.

[5] R. M. R. Yanamala and M. Pullakandam, “An Efficient Configurable Hardware Accelerator Design for CNN on Low Memory 32-Bit Edge Device,” in 2022 IEEE International Symposium on Smart Electronic Systems (iSES), Dec. 2022, pp. 112–117. doi: 10.1109/iSES54909.2022.00033.

[6] R. G. Lyons, “Understanding Digital Signal Processing (2nd Edition),” 2nd ed., 2004. Accessed: Feb. 22, 2023. [Online]. Available: http://proquest.safaribooksonline.com/0131089897/ch05lev1sec2

[7] Y. Sun, G. Wang, B. Yin, J. R. Cavallaro, and T. Ly, “High-level Design Tools for Complex DSP Applications,” in DSP for Embedded and Real-Time Systems, Elsevier, 2012, pp. 133–155. doi: 10.1016/B978-0-12386535-9.00008-1.

[8] Marco Vitone, Nicola Petra. (2021). Reconfigurable Datapath for Hardware Acceleration of Convolutional Neural Network. SIE-2021, the 52nd Annual Meeting of the Associazione Società Italiana di Elettronica (SIE). Trieste.

[9] J. Potsangbam and M. Kumar, Design And Implementation of Combined Pipelining and Parallel Processing Architecture for FIR and IIR Filters Using VHDL, vol. 10. 2019. doi: 10.5121/vlsic.2019.10401.

[10] A. Fog, “Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs,” Technical University of Denmark, 2022. Accessed: Feb. 22, 2023. [Online]. Available: https://www.agner.org/optimize/instruction_tables.pdf

[11] @TheDevelopmentChannel, “VIVADO HLS Training AXI Stream interface #07.” 2015.

[12] “Vitis High-Level Synthesis User Guide,” no. UG1399. 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls

[13] “SDAccel Development Environment Help,” no. UG1188. 2018. [Online]. Available: https://www.xilinx.com/htmldocs/xilinx2018_2/sdaccel_doc/index.html

[14] cathalmccabe, “Tutorial: using a HLS stream IP with DMA (Part 1: HLS design).” 2021.

[15] Y. Uguen, F. de Dinechin, V. Lezaud, and S. Derrien, “Application-Specific Arithmetic in High-Level Synthesis Tools,” ACM Transactions on Architecture and Code Optimization, vol. 17, no. 1, pp. 1–23, Mar. 2020, doi: 10.1145/3377403.

[16] “Processing System 7, LogiCORE IP Product Guide,” no. PG082. 2017. [Online]. Available: https://docs.xilinx.com/v/u/en-US/pg082-processing-system7

[17] Vipin Kizheppatt, “Introduction to Direct Memory Access (DMA).” 2020.

[18] “Zynq UltraScale+ Device Technical Reference Manual,” no. ug1085. 2023. [Online]. Available: https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm

[19] “AXI DMA LogiCORE IP Product Guide,” no. PG021. 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/pg021_axi_dma/AXI-DMA-v7.1-LogiCORE-IP-Product-Guide

[20] “UltraScale Architecture Libraries Guide,” no. UG974. 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/ug974-vivado-ultrascale-libraries/Introduction

[21] M. Armanuzzaman and Z. Zhao, “BYOTee: Towards Building Your Own Trusted Execution Environments Using FPGA.” arXiv, 2022. doi: 10.48550/ARXIV.2203.04214.

[22] “Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide,” no. UG953. 2011. [Online]. Available: https://docs.xilinx.com/r/en-US/ug953-vivado-7series-libraries

[23] @MKS075, “Difference between Interrupt and Polling.” 2023.

[24] Victor Powell, “Image Kernels.” 2015.

[25] Xilinx Inc., “Zynq-7000 Processing System IP.”

[26] Xilinx Inc., “Zynq 7000 SoC.”

Appendix A – Fast FIR based on Partial Results Reuse, C-code logic

The listing below presents the logic implemented for 3-tap FIR filter computation. Note: Actual code used in Vivado HLS is slightly different in order to meet AXI stream protocol requirements.

    #define N 3       // n-tap FIR, n = 3
    #define MOVE 8    // 8 bits per pixel
    #define MASK 0xFF

    typedef unsigned char u8;   // as provided by the Xilinx BSP (xil_types.h)

    // Forward declaration: partial() is defined after fast_fir()
    u8 partial(u8 x[N], u8 h[N], int p0, int p1, int p2, int p3);

    void fast_fir(int &x, int &y)
    {
        u8 h[N] = {1, 2, 3};              // kernel
        u8 x_read[N], y_read[N], p[6];
        static u8 blue_coeff[2] = {0};    // terms carried to the next iteration

        for (int idx = 0; idx < 20; idx++) {
            // Split the 32-bit word into 3 pixels
            x_read[0] = x & MASK;                  // x(n)
            x_read[1] = (x >> MOVE) & MASK;        // x(n-1)
            x_read[2] = (x >> (2*MOVE)) & MASK;    // x(n-2)

            // Partial results: 6 multiplications in total
            p[0] = partial(x_read, h, 0, 1, 1, 2);
            p[1] = partial(x_read, h, 0, 0, 2, 2);
            p[2] = partial(x_read, h, 1, 1, 1, 1);
            p[3] = partial(x_read, h, 1, 2, 0, 1);
            p[4] = partial(x_read, h, 2, 2, 0, 0);
            p[5] = partial(x_read, h, 0, 2, 0, 2);

            y_read[2] = p[4] + blue_coeff[0];                  // y(n-2)
            y_read[1] = p[3] - p[2] - p[4] + blue_coeff[1];    // y(n-1)
            y_read[0] = p[5] - p[0] - p[3] + 2*p[2];           // y(n)

            // Terms used in the next iteration
            blue_coeff[0] = p[0] - p[2] - p[1];
            blue_coeff[1] = p[1];

            // Concatenate the 3 outputs into one 32-bit word
            y = y_read[2] << (2*MOVE) | y_read[1] << MOVE | y_read[0];
        }
    }

    // Returns (x[p0] + ... + x[p1]) * (h[p2] + ... + h[p3])
    u8 partial(u8 x[N], u8 h[N], int p0, int p1, int p2, int p3)
    {
        int k, j;
        u8 sum_x, sum_h;
        sum_x = sum_h = 0;
        for (k = p0; k <= p1; k++)
            sum_x += x[k];
        for (j = p2; j <= p3; j++)
            sum_h += h[j];
        return sum_x * sum_h;
    }
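The listing can be cross-checked on a host PC against the direct form. The sketch below is a plain C restatement under stated assumptions, not the HLS source: fast_fir_block, fir_direct, carry, and mul_count are hypothetical names, int is used instead of u8 to keep the arithmetic free of wrap-around, and the initial state is assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>

#define N 3   /* number of taps */

static unsigned mul_count = 0;   /* counts atomic multiplications */

/* (x[p0] + ... + x[p1]) * (h[p2] + ... + h[p3]): one multiplication. */
static int partial(const int x[N], const int h[N],
                   int p0, int p1, int p2, int p3)
{
    int sum_x = 0, sum_h = 0;
    for (int k = p0; k <= p1; k++) sum_x += x[k];
    for (int j = p2; j <= p3; j++) sum_h += h[j];
    mul_count++;
    return sum_x * sum_h;
}

/* Process one block of 3 samples. xb[0] is the newest sample x(n) and
 * xb[2] the oldest x(n-2), as in the listing. carry[] plays the role of
 * blue_coeff and must start as {0, 0}. */
static void fast_fir_block(const int xb[N], const int h[N],
                           int carry[2], int yb[N])
{
    int p[6];
    p[0] = partial(xb, h, 0, 1, 1, 2);
    p[1] = partial(xb, h, 0, 0, 2, 2);
    p[2] = partial(xb, h, 1, 1, 1, 1);
    p[3] = partial(xb, h, 1, 2, 0, 1);
    p[4] = partial(xb, h, 2, 2, 0, 0);
    p[5] = partial(xb, h, 0, 2, 0, 2);

    yb[2] = p[4] + carry[0];                   /* y(n-2) */
    yb[1] = p[3] - p[2] - p[4] + carry[1];     /* y(n-1) */
    yb[0] = p[5] - p[0] - p[3] + 2 * p[2];     /* y(n)   */

    carry[0] = p[0] - p[2] - p[1];
    carry[1] = p[1];
}

/* Direct form on a time-ordered signal, zero initial state:
 * 3 multiplications per output, 9 per block of 3 outputs. */
static int fir_direct(const int *x, const int h[N], size_t n)
{
    int acc = 0;
    for (size_t k = 0; k < N && k <= n; k++) {
        acc += h[k] * x[n - k];
    }
    return acc;
}
```

Feeding a time-ordered signal in blocks of three (newest sample first within a block) gives the same outputs as fir_direct while mul_count grows by 6 per block, compared with 9 multiplications for the direct form.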