[MSc] part 1- Implementation of Fast FIR Algorithms based on Partial Results Reuse on Zynq-7000 SoC
Università degli Studi di Napoli Federico II
Politechnika Łódzka
Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione
Classe delle Lauree Magistrali in Ingegneria Elettronica, Classe n. LM-29
Wydział Elektrotechniki, Elektroniki, Informatyki i Automatyki
Katedra Przyrządów Półprzewodnikowych i Optoelektronicznych
Corso di Laurea Magistrale in Ingegneria Elettronica
Praca dyplomowa magisterska
Implementation of Fast FIR Algorithms based on Partial Results Reuse on Zynq-7000 System-on-Chip
Implementacja szybkich algorytmów FIR na układzie Zynq-7000 w oparciu o wykorzystanie wyników cząstkowych
Promotor: Prof. Nicola Petra, Dr Janusz Woźny
Student: Mateusz Grzyb, Matr. M61/646, nr albumu 232975
Napoli, Lodz, May 2023
ABSTRACT
2D convolution requires many multiplications, commonly performed by a FIR filter, which can be challenging to implement efficiently on computer hardware. This project presents a novel approach to Fast FIR algorithms by implementing Partial Results Reuse. The presented algorithm works in MIMO mode and can be adapted for any N-tap filter. In this research, the hardware is implemented for N=3. In every iteration, it reads three inputs and produces three outputs. The classic solution requires 9 atomic multiplications; thanks to Partial Results, the number of multiplications drops from 9 to 6. In computer hardware, reducing the number of multiplications is crucial: it reduces the latency and the interval by 8 clock cycles. Furthermore, the proposed method reduces the number of DSPs by 14. The project involves the design of an IP in HLS and the incorporation of DMA to optimize performance. AXI Stream is utilized to enhance data transfer efficiency. The design is deployed on a Zynq-7000 SoC.
Keywords: Fast FIR algorithms, Partial Results Reuse, High-Level Synthesis, Zynq-7000
System-on-Chip, Direct Memory Access
STRESZCZENIE
2D convolution requires many multiplications, commonly performed by a FIR filter, which can be a challenge for efficient implementation on computer hardware. This project presents a novel approach to Fast FIR algorithms through the implementation of partial results. The presented algorithm works in MIMO mode and can be adapted to any filter of order N-1. In this research, the accelerator is implemented for N=3. In every iteration it reads 3 inputs and produces 3 outputs. In the classic solution this requires 9 separate multiplications. Thanks to partial results, the number of multiplications drops from 9 to 6, which is of key importance. This allows the latency and the interval to be reduced by 8 clock cycles. Moreover, the proposed method reduces the number of DSP blocks by 14. The project includes designing an IP in HLS and incorporating DMA to optimize performance. AXI Stream is used to increase data transfer efficiency. The design was deployed on a Zynq-7000 SoC.
Keywords: fast FIR algorithms, Partial Results, High-Level Synthesis, Zynq-7000 System-on-Chip, Direct Memory Access
1 Introduction
A System-on-Chip (SoC) is an integrated circuit that contains all the components needed to power a particular electronic system or device, such as a mobile phone, digital camera, or computer. An SoC typically combines a microprocessor or microcontroller, memory, input/output interfaces, and other specialized components like digital signal processors, graphics processing units, and power management circuits onto a single chip. [1]
SoCs are designed to be highly efficient and compact, allowing manufacturers to create powerful, feature-rich devices while minimizing their size and power consumption. They are used in a wide range of applications, from mobile devices to automotive electronics and Internet of Things (IoT) devices. Because they integrate so many components into a single chip, SoCs can be less expensive, more reliable, and easier to maintain than systems that use separate components. [2]
An example of an SoC is the Zynq-7000, a family of System-on-Chip devices designed by Xilinx. The Zynq-7000 SoC integrates a dual-core ARM Cortex-A9 processor with a field-programmable gate array (FPGA) fabric in a single device. This combination of processing power and programmable logic makes the Zynq-7000 SoC ideal for a wide range of applications that require both hardware and software functionality. [3]
ZedBoard is a development board designed to enable rapid prototyping and development of embedded systems based on the Zynq-7000 SoC. The ZedBoard features a range of connectivity options, including Gigabit Ethernet, USB, HDMI, and an FMC expansion port. It provides an accessible platform for learning and experimenting with advanced hardware and software design concepts, as well as a powerful tool for developing and testing custom hardware designs. Developers can use the Zynq-7000 SoC to create custom hardware accelerators, interface with a wide range of peripherals, and execute complex algorithms in software. [4]
A hardware accelerator is a specialized piece of hardware designed to perform a specific set of operations more efficiently than a general-purpose processor. It can be used to speed up and offload workloads from a CPU, enabling faster performance and improved power efficiency. These devices are designed to handle specific types of computations, such as graphics rendering, machinelearning, cryptography, or signal processing, andare optimizedfor performance and power efficiency.
Using a hardware accelerator can provide significant benefits in terms of processing speed and energy efficiency, especially for workloads that are computationally intensive or require a high degree of parallelism. As such, hardware accelerators are widely used in a variety of applications, including data centers, scientific computing, autonomous vehicles, gaming, and more.
An example of a hardware accelerator task is the computation of Convolutional Neural Networks (CNNs). Those accelerators are designed to handle many parallel operations simultaneously, making them well-suited for performing the computationally intensive matrix multiplication operations required by CNNs. By offloading these operations from the CPU to the hardware accelerator, the overall speed of the training and inference process can be significantly improved. For example, consider the task of training a CNN to classify images. A standard CPU might take a long time to perform the matrix multiplication operations required to train the network, while a well-designed hardware accelerator can perform these operations in a fraction of the time. [5]

Figure 1 Top-view of the ZedBoard
There is a relationship between convolution and FIR filters. That is, FIR filters can be implemented using the convolution operation. The coefficients of the filter correspond to the kernel or mask used in the convolution operation, and the input signal is convolved with this kernel to produce the filtered output. [6]
To meet the high-abstraction specification of designed devices, High-Level Synthesis (HLS) is increasingly applied. HLS is the process of transforming a high-level software description of a digital circuit, typically written in a high-level programming language such as C/C++ or SystemC, into a hardware description language (HDL) such as Verilog or VHDL. The resulting hardware description can be synthesized into a physical circuit using a logic synthesis tool and can be implemented in a Field Programmable Gate Array (FPGA).
HLS provides a way for designers to describe a complex digital system using familiar programming languages and abstractions, rather than low-level HDLs, which are more difficult to use and understand. HLS tools typically provide automated analysis and optimization techniques to improve the performance, power consumption, and area utilization of the resulting circuit.
HLS has become increasingly important as the complexity of digital systems has increased, and designers seek ways to increase productivity, reduce time-to-market, and improve the quality of their designs. HLS has been used in a variety of application domains, including image and video processing, digital signal processing, wireless communication, and machine learning, among others. [7]
1.1 The Aim of the Work
The main goal of the research is the hardware implementation of the Fast FIR algorithm. The hardware will be implemented on the Zynq-7000 System-on-Chip. By this means, the aim of the research is to show that implemented hardware can be successfully utilized in embedded systems.
The scope of the work is divided into three parts. The first part is aimed at explaining the fast FIR algorithm based on the Partial Results Reuse. The most important part will be to find the coefficients of such Partial Results.
In the next stage, the study focuses on the hardware implementation. The logic of the Fast FIR algorithm has to be written in C/C++ and translated to the HDL level by the HLS tool, resulting in the designed Fast FIR IP block. This IP should work in MIMO mode; that is, every clock cycle it reads 3 inputs and delivers 3 outputs. Each data object will be a pixel with a size of 8 bits. This part is done in the Vivado HLS tool.
In the last part of the research, the designed IP should be implemented on the Zynq-7000 System-on-Chip. The designed IP, called a peripheral, won't be memory mapped, i.e., it won't be visible to the CPU directly. Thus, the transmission between the peripheral and the external DDR memory via DMA will be checked. To achieve this, the appropriate system architecture will be realized in the Vivado 2017.4 tool, and the proper logic will be realized in the Software Development Kit (SDK). Finally, a correctness test will be performed.
The thesis explains those 3 stages in detail. Finally, the conclusions and future perspectives are provided.
2 Fast FIR Algorithm
An N-tap filter is expressed by the following equation:

y(n) = x(n)h(0) + x(n-1)h(1) + … + x(n-N+1)h(N-1)
where y(n) is the output signal, x(n) is the input signal, and h(i) is the kernel of size N.
Thus, an FIR filter of order N-1 is realized by the discrete convolution computation. FIR filters are causal, linear, and time-invariant systems. We can speak about an N-tap filter because the x(n-i) terms are frequently called taps. At each iteration of the algorithm, N inputs are read [x(n), x(n-1), …, x(n-N+1)] and N outputs are computed [y(n), y(n-1), …, y(n-N+1)]. For example, taking N=3 leads to the following equations:

y(n) = x(n)h(0) + x(n-1)h(1) + x(n-2)h(2)
y(n-1) = x(n-1)h(0) + x(n-2)h(1) + x(n-3)h(2)
y(n-2) = x(n-2)h(0) + x(n-3)h(1) + x(n-4)h(2)
It appears that one can take taps from the future in the current iteration to calculate taps from the past. Thus, the taps are divided into 3 categories, highlighted in the equations above:
- Highlighted green: computed in the current iteration; used in the current iteration and following iteration,
- Highlighted blue: computed in the previous iteration; used in the current iteration,
- Highlighted yellow: computed in the following iteration.
Overall, to produce the three outputs (y(n), y(n-1), y(n-2)), there are 9 atomic multiplications and 6 separate summations. [8]
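As a reference point for the optimization that follows, a minimal C sketch of one classic MIMO iteration is given below. The naming convention x_read[k] = x(n-k) and y_read follows the code discussed later in the thesis; x_past holding the two oldest samples is an assumption for illustration.

/* Classic 3-tap MIMO iteration (reference sketch).
   x_read[k] = x(n-k); x_past holds x(n-3) and x(n-4). */
y_read[0] = x_read[0]*h[0] + x_read[1]*h[1] + x_read[2]*h[2];     /* y(n)   */
y_read[1] = x_read[1]*h[0] + x_read[2]*h[1] + x_past[0]*h[2];     /* y(n-1) */
y_read[2] = x_read[2]*h[0] + x_past[0]*h[1] + x_past[1]*h[2];     /* y(n-2) */
/* 9 atomic multiplications and 6 additions per iteration. */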
The presented algorithm is a MIMO system. MIMO stands for a multiple-input multiple-output parallel system. The MIMO system can be derived from a SISO (single-input single-output) implementation.
Figure 2 Derivation of a MIMO system from a SISO system
Figure 2 presents the derivation of a MIMO system from a SISO system. In a SISO system, one output is generated every clock cycle, while in an N-parallel MIMO system, N output signals are generated every clock cycle. This indicates that the N-parallel filter runs at N times the original filter's rate. The sampling frequency in a parallel filter is raised while the clock frequency stays the same. As a result, the level of parallelism affects the system's effective sampling speed. [9]
The parallel structure for N-tap FIR filter can be derived from the direct form of N-tap FIR filter.
Let’s consider N = 3.
Figure 3 The direct structure of a 3-tap FIR filter

The difference equation of the system depicted in Figure 3 can be presented as:

y(n) = x(n)h(0) + x(n-1)h(1) + x(n-2)h(2)
This is a single-input single-output system. The x(n) input is delayed twice by the unit delay block. Each delay is equivalent to the z^-1 discrete-time operator in the Z-domain. h(0)–h(2) are the kernel coefficients.
The N-parallel system can be obtained by reproducing the SISO system N-1 times. This results in the following schematic.
Figure 4 Schematic of the FIR filter with the 3rd level of parallelism
Figure 4 shows a 3-tap FIR filter reproduced twice, resulting in the 3rd level of parallelism. Here the unit delay z^-1 is replaced by the tapped delay block with the 'number of delays' parameter. Each delay is equivalent to the z^-1 discrete-time operator, which the Unit Delay block represents.
The set of difference equations of the parallel system is:
y(n) = x(n)h(0) + x(n-1)h(1) + x(n-2)h(2)   ( 4 )

which satisfies the equations presented at the beginning of this chapter. In fact, the x(n-3)h(1), x(n-4)h(2), and x(n-3)h(2) partial taps are calculated in the previous parallel iteration and are replaced by x(n)h(1), x(n-1)h(2), and x(n)h(2), respectively. [9]
2.1 Partial Results Reuse
According to Agner Fog's instruction tables, ADD/SUB instructions are faster than the MUL instruction in terms of latency, as presented in Table 1. [10]
Table 1 Latency comparison between ADD and MUL operations
In the presented architectures, both ADD/SUB and MUL are executed by the arithmetic-logic unit (ALU), which is part of a central processing unit (CPU). Even the progress of recent years, e.g., from the Intel Pentium released in 1993 with a MUL operation of up to 11 clock cycles to Intel Coffee Lake (2017) with up to 4 clock cycles, has not allowed the MUL operation to become as fast as the ADD operation (1 clock cycle vs. usually 3-4 clock cycles). The conclusion is that, to enhance any algorithm, it is convenient practice to avoid MUL operations and replace them with ADD operations whenever this is acceptable for a satisfactory performance of the algorithm.
The proposed FIR algorithm can be improved with respect to partial results, i.e., by reducing the number of multiplications. One can observe that in each output (y(n+k), k = -2, -1, 0, 1, 2) there is a set of inputs x(n+k) and coefficients h that are repeated. For example, x(n) is repeated in the computation of y(n), y(n+1), and y(n+2); x(n-1) is repeated in y(n-1), y(n), and y(n+1), and so on. h(0)–h(2) are repeated everywhere. A proper combination of atomic results avoids unnecessary computation. Such a combination of atomic results is called a partial result.
A partial result is described by the following formula [8]:

P(n, p0, p1, p2, p3) = (x(n-p0) + … + x(n-p1)) · (h(p2) + … + h(p3))   ( 5 )

where p0, p1, p2, p3 are partial coefficients. For instance:
The green values can be computed as x(n)h(1) + x(n-1)h(2) = P(n,0,1,1,2) - P(n,1,1,1,1) - P(n,0,0,2,2).
The best strategy for finding the first partials is to check where only one atomic result is calculated in the current iteration. This happens in the computation of y(n-2) and y(n+2).
Thus, x(n-2)h(0) can be calculated as P(n,2,2,0,0) and x(n)h(2) as P(n,0,0,2,2). These are the first partials found. In the general case of an N-tap filter, one atomic result is calculated in the computation of y(n±(N-1)).
The next step is to find new partials and use known partials to compute y(n-1) and y(n+1).
For example, y(n-1) expands to x(n-1)h(0) + x(n-2)h(1) + x(n-3)h(2). In the current iteration, the x(n-1)h(0) and x(n-2)h(1) taps are calculated. The coefficients that stand next to x are 1 and 2; the coefficients that stand next to h are 0 and 1. Thus, p0 = 1, p1 = 2, p2 = 0, p3 = 1. In fact, p0 and p1 are the lowest and the highest coefficients standing next to x(n-k), respectively; similarly, p2 and p3 are the lowest and the highest coefficients of the kernel h.
The found partial is P(n,1,2,0,1). The expanded form of this partial is:

P(n,1,2,0,1) = (x(n-1) + x(n-2))(h(0) + h(1)) = x(n-1)h(0) + x(n-1)h(1) + x(n-2)h(0) + x(n-2)h(1)
x(n-1)h(0) + x(n-2)h(1) are the atomic results necessary for the y(n-1) computation. P(n,1,2,0,1) also produces unnecessary terms, x(n-1)h(1) and x(n-2)h(0), which have to be subtracted in order to get the right value. x(n-1)h(1) can be written as P(n,1,1,1,1), which is a new partial. x(n-2)h(0) can be written as P(n,2,2,0,0), which is the partial found in the previous operation. In other words, x(n-1)h(1) + x(n-2)h(0) can be calculated as:

P(n,1,1,1,1) + P(n,2,2,0,0)
To sum up, in this calculation two new partials are found, and one old partial is used. P(n,1,2,0,1) is the unique partial for this computation, while P(n,1,1,1,1) is shared with the calculation of y(n+1).
The last step is to find the partial for y(n):

y(n) = x(n)h(0) + x(n-1)h(1) + x(n-2)h(2)
Obviously, the new partial has the form P(n,0,2,0,2). In order to obtain the proper value, the partials that occurred in the previous operations must be subtracted from P(n,0,2,0,2).
Overall, 6 partials are enough to determine y(n), y(n±1), y(n±2). These are:

[P(n,0,0,2,2), P(n,2,2,0,0), P(n,1,1,1,1), P(n,0,1,1,2), P(n,1,2,0,1), P(n,0,2,0,2)]
In other words, the computer calculates each partial only once, places it in the partials array, and accesses the right index when necessary. Each partial requires only one multiplication. Overall, only 6 multiplications are required for the computation of the 9 atomic results. This means that choosing the right set of partial results minimizes the number of multiplications for significant values of N. To reproduce an NxN kernel size, the designed hardware has to be replicated N-1 times in the architecture. [8]
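Putting the derivation together, one possible way to combine the six partials in C is sketched below. The variable names P1–P6, prev_xh2, and prev_green, and the bookkeeping of the stored terms, are assumptions for illustration (the thesis code stores the green/blue atomic results in a static array); the index convention x_read[k] = x(n-k) follows the code shown later.

static int prev_xh2 = 0, prev_green = 0;    /* blue terms from the previous iteration */

int P1 = partial(x_read, h, 0, 0, 2, 2);    /* x(n)h(2)                               */
int P2 = partial(x_read, h, 2, 2, 0, 0);    /* x(n-2)h(0)                             */
int P3 = partial(x_read, h, 1, 1, 1, 1);    /* x(n-1)h(1)                             */
int P4 = partial(x_read, h, 0, 1, 1, 2);    /* (x(n)+x(n-1))(h(1)+h(2))               */
int P5 = partial(x_read, h, 1, 2, 0, 1);    /* (x(n-1)+x(n-2))(h(0)+h(1))             */
int P6 = partial(x_read, h, 0, 2, 0, 2);    /* (x(n)+x(n-1)+x(n-2))(h(0)+h(1)+h(2))   */

y_read[0] = P6 - P4 - P5 + 2*P3;            /* y(n)                                   */
y_read[1] = P5 - P3 - P2 + prev_xh2;        /* y(n-1); prev_xh2 = x(n-3)h(2)          */
y_read[2] = P2 + prev_green;                /* y(n-2); prev_green = x(n-3)h(1)+x(n-4)h(2) */

prev_xh2   = P1;                            /* store x(n)h(2) for the next iteration  */
prev_green = P4 - P3 - P1;                  /* store x(n)h(1)+x(n-1)h(2)              */

Expanding the products confirms that only these 6 multiplications reproduce all 9 atomic results at the cost of a few extra additions.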
Table 2 Number of required multiplications for the NxN kernel size
Figure 6 Outputs vs Multiplications: the relationship between the number of required outputs and the number of performed atomic multiplications
Figure 6 presents a chart describing the relationship between the number of required outputs and the number of performed atomic multiplications after involving partial results. The number of required partials for an N-tap filter can be predicted by the equation y = 0.4309x^2 + 0.614x, where x is the dimension of the kernel.
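As a quick sanity check of this fitted curve, the standalone C program below (not part of the thesis code; the range of kernel sizes is arbitrary) evaluates it for a few kernel dimensions; for x = 3 it gives about 6, matching the 6 partials derived above.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Evaluate the fitted curve y = 0.4309x^2 + 0.614x from Figure 6. */
    for (int x = 3; x <= 9; x += 2) {
        double y = 0.4309 * x * x + 0.614 * x;
        printf("kernel dimension %d -> ~%.0f multiplications\n", x, round(y));
    }
    return 0;
}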
3 High-Level Synthesis
The IP block for the system design was created in the Vivado 2017.4 HLS tool. This software allows the user to transform code in a high-level language like C or C++ into a register-transfer level (RTL) description in Verilog or VHDL. The designed IP performs the 3-tap fast FIR algorithm described in the previous section. The IP is intended to run on the XC7Z020CLG484-1 system-on-chip (found on the ZedBoard). The IP retrieves data from the DDR via the DMA controller under user control. For this purpose, the IP communicates with the DMA via the AXI4-Stream protocol. The CPU doesn't have direct access to the IP since it's not memory mapped.

The IP is adjusted for computer vision applications, especially for convolution computation. In other words, the incoming data is expected to be a pixel representation. Each pixel is considered to be 8 bpp (bits per pixel) in size, i.e., it can represent up to 2^8 = 256 different colors. To satisfy this condition, each pixel is represented by the unsigned char data type, i.e., it can hold values in the range from 0 to 255. The presented algorithm requires 3 data items on the input and 3 outcomes on the output; to achieve this, the AXI word specification is applied. An AXI word here is 32 bits wide, which is enough space for placing 3 pixels. The accommodation of the pixels is depicted in the figure below.
Figure 7 shows how three 8 bpp pixels are placed in one 32-bit word. In simple words, one can say that the 3 pixels occupy space next to each other in one AXI word.
To recover the data from the AXI word, the 0xFF mask is applied three times. After each masking operation, the AXI word is shifted right by 8 bits, which exposes the next pixel. Ultimately, the 3 pixels are saved in the x_read[3] array and used for further computation.
Figure 7 32-bit AXI word representation
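A minimal sketch of this unpacking step is shown below; the word variable and the loop form are assumptions based on the description (word standing for the 32-bit TDATA field of the incoming AXI struct), with x_read as in the text.

unsigned int word = x.data;      /* 32-bit TDATA of the incoming AXI word */
unsigned char x_read[3];
for (int i = 0; i < 3; i++) {
    x_read[i] = word & 0xFF;     /* the 0xFF mask extracts the lowest byte */
    word >>= 8;                  /* shift the next pixel into place        */
}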
After reading the pixels from the incoming stream, the 6 partials are calculated. The partials are determined by a C function that takes the array of read pixels, the kernel coefficients, and the values of the p0–p3 coefficients from the partial equation:
u8 partial(u8 x[N], u8 h[N], int p0, int p1, int p2, int p3);
u8 partial(u8 x[N], u8 h[N], int p0, int p1, int p2, int p3)
{
    int k, j;
    u8 sum_x, sum_h;
    sum_x = sum_h = 0;
    for (k = p0; k <= p1; k++)     /* sum of the selected input taps   */
        sum_x += x[k];
    for (j = p2; j <= p3; j++)     /* sum of the selected coefficients */
        sum_h += h[j];
    return sum_x * sum_h;          /* one multiplication per partial   */
}
The 6 partials are calculated with the coefficients set up as derived in the previous chapter. These partials are used to compute the y(n), y(n-1), y(n-2) outputs. The atomic results used for the calculation of x(n)h(1), x(n-1)h(2), and x(n)h(2) are stored in a static array and used in the following iteration. Finally, the 3 outputs are calculated using the partials estimated in the current and previous runs. Subsequently, the 3 outputs are concatenated, placed back into an AXI word, and written onto the stream object. The concatenation is done by the following instruction:
data = (y_read[2] << 2*MOVE) | (y_read[1] << MOVE) | y_read[0];
where MOVE is a macro for 8*sizeof(char), i.e., 8 bits (the considered pixel size), and the | sign is the bitwise OR operator.
The whole operation discussed above is performed by the fast_fir C function, defined by the following prototype:
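Based on the ap_axis<32,2,5,6> type described below, the prototype takes a form along the following lines (a sketch under the assumption that hls::stream arguments are used; the exact argument style may differ):

#include <ap_axi_sdata.h>
#include <hls_stream.h>

void fast_fir(hls::stream<ap_axis<32,2,5,6> > &x,
              hls::stream<ap_axis<32,2,5,6> > &y);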
It is a void function, i.e., it doesn't return any value. The function expects addresses of streaming data objects. In this case, the ap_axis<> type is defined and used to create a stream variable. ap_axis<> is based on Arbitrary Precision Data Types. In classic C, like C99, the user is limited to data of certain bit widths. For example, the smallest addressable unit is the char type with a minimum size of 8 bits. If the user needs only 7 bits, one unnecessary bit remains. This can be overcome by using a bit field in a structure; however, this method also consumes additional space, a minimum of 4 bytes for structure packing. To solve this issue, Arbitrary Precision Data Types come to help. RTL buses support arbitrary lengths: if 7-bit data is required, one doesn't need to implement data with an 8-bit boundary but can call ap_int<7>.
Such a procedure is used to create the ap_axis<32,2,5,6> variable. It stands for the Arbitrary Precision AXI stream struct, which is defined as shown in Figure 8.
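For reference, the Vivado HLS ap_axi_sdata.h header defines this template along the following lines (paraphrased from the library; the field widths follow the template parameters):

template<int D, int U, int TI, int TD>
struct ap_axis {
    ap_int<D>    data;   /* TDATA: the payload, 32 bits here            */
    ap_uint<D/8> keep;   /* TKEEP: byte qualifiers                      */
    ap_uint<D/8> strb;   /* TSTRB: byte strobes                         */
    ap_uint<U>   user;   /* TUSER: user-defined sideband information    */
    ap_uint<1>   last;   /* TLAST: marks the final transfer of a packet */
    ap_uint<TI>  id;     /* TID: identifies the data source             */
    ap_uint<TD>  dest;   /* TDEST: routing destination                  */
};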
Figure 8 shows the structure that represents the AXI4-Stream interface with side-channels.
Figure 8 Example of IP block with AXI4-Stream with Side-Channels
After synthesis, the arguments are implemented as data ports together with the standard AXI4-Stream TVALID and TREADY protocol ports and all the optional ports described in the struct. The TLAST signal is optional in the AXI stream protocol and may not be utilized directly, as is the case in this example, which would allow the HLS tools to omit it; however, the AXI DMA demands it. TID and TDEST identify the source and the destination device for particular data, respectively: TDEST tells the interconnect where the data should be sent, and TID tells the destination who sent the data. TDATA is the data sent, which in the fast FIR application is a 32-bit AXI word. TVALID and TREADY will be discussed later.
Generally speaking, a stream object behaves like a FIFO (First In, First Out), i.e., the next data is processed only if the previous data was read. For a better understanding of the stream object, it can be imagined as an abstract data type (ADT), for instance a queue. That means each datum in the stream object contains a node with the address of the next object. When the last object is processed, the TLAST value is raised.
The designed IP is a FIFO block. The size of the IP's FIFO is determined by the for instruction:

for (int idx = 0; idx < K; idx++)

where K is the size of the FIFO. K equal to 214 is enough to convolve a whole row from the cheap OV7670 camera with a resolution of 640x480 (because each word carries 3 pixels, and 214*3 = 642 > 640). A FIFO of indefinite size is translated to RTL if the user writes a while (true) instruction with a proper breaking condition. [11]
Note: A FIFO of indefinite size was tried in the application (while(true) {instructions; break;}). Unfortunately, the data transfer between the DDR and the IP was unsuccessful. Thus, the FIFO with a static size was used.
For proper RTL translation, the following pragmas were added to the C code (their placement in the top function is sketched after the list):
#pragma HLS INTERFACE axis port=x
#pragma HLS INTERFACE axis port=y
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS DATAFLOW
#pragma HLS PIPELINE
#pragma HLS UNROLL
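The sketch below shows one plausible placement of these pragmas in the top function (an assumption based on the description; DATAFLOW and UNROLL would sit at the function and subloop level, respectively, and the pass-through body merely stands in for the unpack/compute/pack logic described earlier):

#include <ap_axi_sdata.h>
#include <hls_stream.h>
#define K 214

void fast_fir(hls::stream<ap_axis<32,2,5,6> > &x,
              hls::stream<ap_axis<32,2,5,6> > &y)
{
#pragma HLS INTERFACE axis port=x
#pragma HLS INTERFACE axis port=y
#pragma HLS INTERFACE ap_ctrl_none port=return
    for (int idx = 0; idx < K; idx++) {
#pragma HLS PIPELINE
        ap_axis<32,2,5,6> v = x.read();
        /* ... unpack pixels, compute the 6 partials, pack outputs ... */
        y.write(v);
    }
}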
The INTERFACE pragma specifies how RTL ports are created from the function definition during interface synthesis. In this instance, the ports in the RTL implementation are derived from the function arguments x and y. Both function arguments are implemented using an AXI4-Stream interface. Based on how the streams are used in the code, Vivado HLS automatically identifies whether they are inputs or outputs.
The HLS INTERFACE ap_ctrl_none port=return pragma simply tells the tool not to use a control register, for example to start and stop the IP via an AXI-Lite interface. The key concept of the IP design is to check the data transmission between the IP and the DDR via DMA without CPU involvement, meaning the designed IP shouldn't be memory mapped. [12]
The DATAFLOW pragma allows for task-level pipelining, which increases the concurrency of the RTL implementation and the design's overall throughput by allowing functions and loops to operate concurrently.
Figure 9 depicts the principle of the HLS DATAFLOW pragma. All operations are performed sequentially in a C description. Vivado HLS aims to reduce latency and increase concurrency. Nevertheless, data dependencies can restrict this. For example, functions or loops that access arrays must complete all read/write operations before terminating. This inhibits the execution of the subsequent function or loop that consumes the data. The DATAFLOW optimization permits a function or loop's operations to begin before the previous function or loop has completed all its operations. [12]
Further optimization can be done with the HLS PIPELINE pragma. This pragma must be placed within the body of the function or loop.
Figure 9 HLS Dataflow Pragma principle
Figure 10 HLS Pipeline Pragma principle
Figure 10 illustrates the principle of HLS Pipeline Pragma. This pragma reduces the initiation interval (II) for a function or loop by allowing the concurrent execution of operations. II is the number of clock cycles between the launch of successive loop iterations. Pipelining a loop enables concurrent implementation of the loop's operations, as depicted in the figure above. Figure (A) depicts the default sequential operation with three clock cycles between each input read (II=3) and eight clock cycles required before the final output write is performed. (B) illustrates the pipelined operations with one cycle between reads (II=1) and four cycles until the final write. [12]
By default, loops in C/C++ functions are kept rolled. When loops are rolled, synthesis generates the logic for a single iteration of the loop, and the RTL design executes this logic for each iteration of the loop in sequence. The number of iterations of a loop is determined by the loop induction variable. The UNROLL pragma allows some or all loop iterations to occur in parallel. In conclusion, data access and throughput can be improved thanks to the UNROLL pragma. The operation of this pragma can be demonstrated by the following code, taken from the Vivado HLS user guide:
for(int i = 0; i < X; i++) {
#pragma HLS unroll factor=2
    a[i] = b[i] + c[i];
}
This results in:
for(int i = 0; i < X; i += 2) {
    a[i] = b[i] + c[i];
    if (i+1 >= X) break;
    a[i+1] = b[i+1] + c[i+1];
}
Note: The Vivado HLS tool requires the loop bound to be known at compile time to unroll a loop completely. What is more, the PIPELINE directive can only be applied when all subloops are fully unrolled. Unfortunately, the subfunction partial cannot be pipelined, since its subloops are not being unrolled: the subloops operate on bounds that are input arguments, while the trip count, i.e., the number of times a loop executes, must be a constant. In theory, the loop_tripcount pragma should be beneficial, with the syntax #pragma HLS loop_tripcount min=<int> max=<int>, where min and max are the minimum and maximum numbers of loop iterations, respectively. Unfortunately, Vivado 2017.4 HLS raises a warning even if loop_tripcount is set. Users report that up-to-date versions of Vivado support the UNROLL directive better. The intended use of the hint is sketched below.
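For illustration, the hint would be placed inside the data-dependent subloop of partial; the bounds min=1 and max=3 are assumptions for the N=3 case, not values from the thesis:

for (k = p0; k <= p1; k++) {
#pragma HLS loop_tripcount min=1 max=3
    sum_x += x[k];               /* bounds depend on the p0/p1 arguments */
}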
3.1 Testbench
Writing a testbench is the foremost method for verifying the IP design. Testing the IP design is the last step before generating RTL code. A well-written testbench has several benefits. The most important one is that validating the algorithm through a testbench is much more efficient than developing and debugging RTL code; it takes time to synthesize an incorrect C function and then investigate the implementation details to find why the function is not performing as planned.
The testbench includes the main() function. The main() function evaluates the correctness of the top-level function for synthesis, calling this function and verifying its output. The return value of the main() function can be zero or non-zero, which indicates that the results are correct or incorrect, respectively. [12]
The designed testbench is oriented towards imitating the IP's behaviour in a real application. The main() function of the testbench realizes the following steps (a sketch of them follows the list):
1. Creating stream objects – inputStream and outputStream
2. Generating an array of unsigned char data (pixels)
3. Creating a struct of AXI-4 Stream Interface with side-channels
4. Concatenating 3 pixels into one AXI word and initializing the side-channels, i.e.:
keep = 1
strb = 1
user = 1
id = 0
dest = 0
5. Pushing data onto stream object
6. Calling fast_fir top function when stream object is ready to send
7. Reading output stream
8. Printing results
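A condensed sketch of such a testbench is given below; it follows the listed steps, but the pixel values, the 60-pixel test size, and the absence of a golden-model comparison are assumptions for illustration rather than the thesis code.

#include <ap_axi_sdata.h>
#include <hls_stream.h>
#include <cstdio>

typedef ap_axis<32,2,5,6> axi_t;
void fast_fir(hls::stream<axi_t> &x, hls::stream<axi_t> &y);

int main()
{
    hls::stream<axi_t> inputStream, outputStream;          /* step 1 */

    unsigned char pix[60];
    for (int i = 0; i < 60; i++) pix[i] = (unsigned char)i; /* step 2 */

    for (int w = 0; w < 20; w++) {                          /* steps 3-5 */
        axi_t in;                                           /* AXI struct with side-channels */
        in.data = (pix[3*w+2] << 16) | (pix[3*w+1] << 8) | pix[3*w];
        in.keep = 1; in.strb = 1; in.user = 1; in.id = 0; in.dest = 0;
        in.last = (w == 19);                                /* assert TLAST on the final word */
        inputStream.write(in);
    }

    fast_fir(inputStream, outputStream);                    /* step 6 */

    while (!outputStream.empty()) {                         /* steps 7-8 */
        axi_t out = outputStream.read();
        printf("0x%08x\n", (unsigned)out.data.to_uint());
    }
    return 0;  /* a real testbench would return non-zero on a mismatch */
}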
Figure 11 The terminal output after performing testbench
Figure 11 shows the data output visible in the Vivado HLS terminal after running the testbench.
3.2 C-Synthesis
When the synthesis is completed, the code written in C is transcompiled into a register-transfer level (RTL) design in a hardware description language (HDL). In other words, during synthesis Vivado analyzes the C code and realizes its behaviour as a register-transfer level structure. When the synthesis is completed, the user can check such parameters as:
timing constraints
latency and throughput
resources utilization
interface output
function schedule
dataflow of the design
Each of these parameters will be discussed in this chapter. [12]
Figure 12 Synthesis output for the FIFO size 20 and a kernel with integer coefficients
Figure 12 depicts the IP summary in terms of timing constraints, latency, and resource utilization. The experiment was conducted for a FIFO size of 20 and a kernel with integer coefficients. The estimated timing is the measurement that tells how much time is needed to complete all tasks that Vivado scheduled in one clock cycle. It must fit within the target clock period, otherwise Vivado raises an error; the program raises a warning if the estimated timing exceeds the target period minus the clock uncertainty (12.5%). Thus, the preferable estimated clock period is up to 8.75 ns for a 10.00 ns target clock period. The achieved timing for the fast FIR function is 7.50 ns, which fits the target clock of the device.
Latency is the number of clock cycles it takes to complete the procedure. In terms of latency, the lower the latency, or the closer it is to zero, the better: with a lower latency, there is less delay when sending data within a system, resulting in a faster system. On the other hand, the interval is the number of clock cycles required before the function can accept a new data input. The designed IP processes the fast FIR function in 24 clock cycles and accepts a new stream input every 25 clock cycles. These numbers were obtained for the processing of 60 pixels (which requires 20 iterations, because each AXI word holds 3 pixels, i.e., 60/3 = 20).
The utilization panel presents the basic structures for the FPGA implementation and shows the number of resources required for the block implementation. According to this panel, the basic structures of the FPGA are:
Look-up table (LUT) – the truth table that determines the output for any given input
Flip-Flop (FF) – register that stores the results of the LUT
DSP48 – computational block for arithmetic logic applications
BRAM – embedded memory used as a block of dual-port RAM. In this case a block holds 18 Kb; this amount is device specific. [13]
The synthesized IP demands 426 Flip-Flops and 1045 LUTs to work properly.
Figure 13 Available ports after the C synthesis
Figure 13 presents the list of ports available in the designed IP.
The AXI4-Stream protocol signals were fully implemented in the design, including the side-channels. The TLAST signal is extremely important because it is demanded by the Direct Memory Access device. One can notice that Vivado, based on the function behaviour, correctly assigned the inputs and outputs to the right ports by itself. The approach is to check whether the IP can communicate with the DDR memory via DMA without CPU control; thus, interrupt signals were not added to the design. The completion of the function will be checked by polling the DMA. [14]
The dataflow is presented in Figure 14. During C synthesis and C/RTL co-simulation, Vivado extracted the loop responsible for the computation of the fast FIR algorithm. The initiation interval (II) of this loop is 1, which means that the loop processes new data every clock cycle. The input and output ports are indicated.
3.3 The Hardware Performance
Figure 14 Dataflow region of the designed IP

Figure 15 The performance comparison when pragmas are applied

Figure 15 shows the difference in the performance of the fast FIR function depending on the application of the PIPELINE and DATAFLOW directives. There is a trade-off between latency and the resources used: a slight increase in resource usage allows significantly better performance in terms of latency and interval.
3.3.1 Kernel with integer coefficients
Figure 16 The performance comparison between the classic FIR and Partial Results Reuse for integer kernel coefficients

Figure 16 presents the performance comparison between the classic FIR and Partial Results Reuse for integer kernel coefficients.
The key point of partial results reuse is to reduce the multiplication operations that occur in the FIR algorithm. According to the paper by Y. Uguen et al., "replacing one multiplication with one addition is always a win situation"; better latency and preferable resource utilization should therefore be achieved. One may thus wonder why the latency is one clock cycle larger and the resource utilization is considerably higher, given that partial results reuse decreases the number of performed multiplications from 9 to 6 for a 3-tap FIR filter. The answer lies in understanding how high-level synthesis is implemented. Modern HLS tools are based on widely used compiler projects like GCC or Clang/LLVM; in particular, Vivado HLS supports the GCC compiler. For instance, consider a simple multiplication of an integer by a constant, say 7x, which can be rewritten as 7x = 8x - x = 2^3·x - x and computed as a shift and a subtraction. In fact, a multiplication by 2^N is a left shift by N bits. Thus, this multiplication is not done by a MUL operation but requires one SHL and one SUB operation. (In hardware, ADD is equivalent to SUB.) It can be demonstrated by the following listing:
int mul7(int x){ return x*7; }
a: 89 d0     mov %edx,%eax    ; eax = x
c: c1 e0 03  shl $0x3,%eax    ; eax = 8x
f: 29 d0     sub %edx,%eax    ; eax = 8x - x = 7x
The latency and resources presented in Figure 16 were achieved for a kernel with integer constants, h[3] = {1, 2, 3}. The previous assumptions are confirmed by checking the clock scheduler. For instance, it appears that the calculation of y_read[1] = x_read[1]*h[0] + x_read[2]*h[1] + x_past[0]*h[2] doesn't require 3 multiplications; instead it is performed by SHL, ADD, and SUB operations. The application of partial reuse does reduce the number of multiplications, but it requires more additions to complete. Thus, it results in one more clock cycle to complete and higher resource utilization. This holds when applied to integer multiplication by a constant. [15]
Figure 17 The performance comparison between the classic FIR and Partial Results Reuse for the kernel passed as a function argument
In the previous solution, the values of the kernel were known at compilation time; thus, HLS could provide a proper combination of SHL and SUB/ADD operations to avoid multiplications. In the scenario presented in Figure 17, the values of the kernel are provided as function arguments, which means the values are not known during compilation. In this case, the scheduler indicates 6 or 9 multiplications, for Partial Results Reuse and the classic FIR respectively. Importantly, Partial Results Reuse gives better resource utilization in terms of DSPs (FFs and LUTs are not important compared to DSPs). Moreover, the Partial Results technique shows better timing constraints.
3.3.2 Kernel with floating-point coefficients
Figure 18 The performance comparison between the classic FIR and Partial Results Reuse for floating-point kernel coefficients
Figure 18 shows the performance comparison between the classic FIR and Partial Results Reuse for floating-point kernel coefficients. The utilization and latency are depicted. In this case, the computation consumes more resources than the integer computation.
Things get complicated when it comes to floating-point multiplications. As the logic required to implement a given floating-point arithmetic operation is significantly more complex than for integer arithmetic, floating-point multiplication comes at the expense of increased area and latency from a hardware design point of view. According to the performance panel, in the classic FIR algorithm there are 9 floating-point multiplications, presented as FMUL operations. Each FMUL operation requires 4 clock cycles to process. Applying partial results to the fast 3-tap FIR filter computation results in 6 atomic multiplications, which indeed are indicated in the timing scheduler. In this case, improving the FIR algorithm leads to improved latency and significantly fewer resources used. The main advantage is the decreased number of DSP48 blocks, from 39 to 18. DSP48s are the most complex computational blocks available in a Xilinx FPGA. [13]
In conclusion, avoiding multiplication is crucial in floating-point computation. This section revealed in which situations applying partial results to the fast FIR computation is beneficial and in which it is not. Furthermore, it explained the HLS behaviour based on C compilers.
4 System Architecture
The system design was created in the Vivado 2017.4 software. Each component of the system is represented by a block. The system consists of the processing system (PS) and the programmable logic (PL). The processor core of the PS is a dual-core ARM Cortex-A9 in 28 nm process technology. The PL is equivalent to an FPGA. The PS is best suited for software applications such as dynamic tasks (e.g., operating systems) and general-purpose sequential tasks. On the other hand, the PL is suitable for hardware applications like intensive data computation or peripheral communication. The Advanced eXtensible Interface (AXI) is responsible for the communication between the PS and the PL. The system design is presented in Figure 19. The designed system consists of the following blocks:
1. ZYNQ7 Processing System
2. Processor System Reset
3. AXI Interconnect
4. AXI SmartConnect
5. AXI Direct Memory Access
6. Concat
7. Fast FIR block (elaborated in Vivado HLS)
8. System ILA
The Fast FIR IP was designed in Vivado HLS; the other IPs are provided by Xilinx.
Figure 19 System Architecture
4.1 ZYNQ7 Processing System
The Processing System 7 core is the software interface surrounding the processing system of the Zynq-7000 platform. The Processing System 7 core acts as a logical link between the PS and the PL. The Processing System IP wrapper aids the user in integrating IPs with the processing system.
Figure 20 Zynq-7000 Processing System IP

Figure 20 shows the features of the PS part and the possibilities of connecting the PL part.
Figure 21 Zynq Block Design view in Vivado software
Figure 21 shows the Zynq Block Design window available when the Processing Unit is added to the design.
The following features were enabled for the intended system design:
PL Fabric Clock enabled at 100 MHz frequency
Memory type selected as DDR3 with a 32-bit bus width
PL-PS interrupts enabled
UART enabled at Baud Rate 115200 bits/s
General purpose AXI master interface enabled
High-performance slave interface enabled in order to set up communication with the DMA
The PL fabric clock is generated by the PS and can be used by the PL. The Processor System Reset block allows the user to set certain parameters to enable/disable features. In fact, no kind of reset is checked in the designed application. [16]
4.2 Direct Memory Access
Direct Memory Access (DMA) is a mechanism that allows a device, such as a disk controller or graphics card, to transfer data to or from the main memory, which improves system performance and reduces the load on the CPU. It allows the device to transfer data to or from memory without constant CPU intervention, which saves CPU cycles and reduces overall latency. In fact, the CPU is not completely bypassed: the CPU is responsible for enabling the DMA and starting the data transmission, so one can say that a transfer done by DMA is under CPU control, but the CPU is free to do other activities during this time.
Programmed input/output (PIO) is a method of data transmission between a central processing unit and a peripheral device, indicated in Figure 22. In this scenario, the processor is the one that initializes the data transfer from memory to peripherals; thus, the processor is the only master. First, the processor reads data from a memory location like RAM, stores the data in an internal register, and sends the data to the peripheral. Read/write operations require multiple clock cycles to complete; thus, this method is inefficient when transferring large amounts of data between memory and peripherals. For instance, video frames are stored in memory and must be transferred to the display controller; with the PIO method, a 30 fps requirement simply will not be met. Another disadvantage of PIO is that the processor wastes time on data transfer rather than performing useful processing. The processor is best at performing data processing, not acting as a data-mover agent. [17]
Figure 22 PIO data transmission
In the scenario presented in Figure 23, two operations can happen concurrently, i.e., read/write from memory to DMA and read/write from DMA to the peripheral. In this architecture, the DMA is a master to the peripheral but still a slave to the processor unit. This means that the transmission is done under the control of the processor. Notice that the peripheral is not connected directly to the system bus, i.e., it is not memory mapped; the peripheral is directly connected to the DMA controller. In this case, an address of the peripheral doesn't exist; thus, to start the DMA, the processor configures the following options (a bare-metal sketch of this configuration follows the list):
Starting address of the memory location
Transfer length
Direction of the transfer (memory to peripheral or vice versa)
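The sketch below illustrates this configure-and-poll pattern with the standalone Xilinx SDK driver (xaxidma.h); the function name and the buffers are assumptions, and the AxiDma instance is assumed to be already initialized via XAxiDma_LookupConfig/XAxiDma_CfgInitialize.

#include "xaxidma.h"

XAxiDma AxiDma;  /* assumed initialized elsewhere */

int dma_transfer(u32 *tx_buf, u32 *rx_buf, int len_bytes)
{
    /* The CPU only programs the buffer addresses, the length and the direction... */
    XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)rx_buf, len_bytes,
                           XAXIDMA_DEVICE_TO_DMA);   /* S2MM: IP -> DDR */
    XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)tx_buf, len_bytes,
                           XAXIDMA_DMA_TO_DEVICE);   /* MM2S: DDR -> IP */

    /* ...and then polls until both directions have completed. */
    while (XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA) ||
           XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))
        ;
    return XST_SUCCESS;
}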
Figure 23 Direct Memory Access concept
Figure 24 Zynq-7000 SoC Block Diagram
Figure 24 represents the block diagram of the Zynq-7000 SoC. This SoC is part of the ZedBoard.
On the ZedBoard, the DDR3 has a capacity of 512 MB; therefore this memory will be used for storing frames from a camera. The external DDR is read through the Multiport DRAM Controller, which acts as the arbitrator for memory access. The designed IP and the DMA IP are located in the PL part of the Zynq system. From the schematic in Figure 24 it can be noticed that the memory transfer can be done through a High-Performance AXI port (HP) and the AMBA interconnect. As the name indicates, the performance of these ports is much greater than that of the General-Purpose AXI ports; HPs are designed to push large amounts of data between the PS and the PL. [18]
The AXI Direct Memory Access IP, presented in Figure 25, is designed for the features described in this section. The IP provides high-bandwidth direct memory access between AXI4 memory-mapped and AXI4-Stream target peripherals and offloads data-movement tasks from the Central Processing Unit. Access to the initialization, status, and management registers is provided by an AXI4-Lite slave interface. The IP offers AXI4 memory-map and AXI4-Stream data widths of 32 bits, which will be used to send 3 pixels in one 32-bit word.
In total, the DMA IP has 5 AXI interfaces: two of them are slave interfaces, while the 3 remaining ones are master interfaces. AXIS stands for the AXI-Stream interface, which is a point-to-point interface and thus not memory mapped. It is used to transfer data from the DMA to the peripheral and conversely. On this interface, one 32-bit word is transferred every clock cycle. The data ports of the DMA are as follows:
• M_AXIS_MM2S is used for writing data from DMA controller to the peripheral (Fast FIR IP in the study scenario).
• S_AXIS_S2MM is used for writing data from the peripheral to the DMA controller.
• M_AXI_S2MM is used by the DMA to write data to the DDR. It follows the full AXI protocol.
• M_AXI_MM2S is used to read data from the DDR.
• S_AXI_LITE is used for configuring DMA from the processor. It follows AXI4-Lite protocol. [19]
To sum up this section, the DMA is the data-mover agent between the external DDR memory and the peripheral, which is the Fast FIR IP in the system design. The DMA is configured by the processor in the software application. The main advantage of such an architecture is that the processor is offloaded from moving data from the DDR to the peripheral IP; during this time, the processor can do other valuable work.

Figure 25 AXI DMA IP
4.3 Fast FIR IP

Figure 26 Fast FIR block IP
The Fast FIR block IP, presented in Figure 26, was designed in the Vivado HLS tool. The input x and the output y follow the full AXI4-Stream protocol: x is the incoming stream from the DMA, and y is the outgoing stream to the DMA. The signals are unrolled in the presented figure. When the TREADY signal is high and the signal from the DMA is valid, there is a burst transaction between the modules. The transmission ends when TLAST is asserted. The stream interface is synchronized to the ap_clk signal. All objects in the stream are sampled on the rising edge of the clock signal; that is, every sample, consisting of 3 pixels, is processed every clock cycle. The inside of the Fast FIR IP and the rules of computation were described in the previous sections.
4.4 RTL synthesis and implementation
Synthesis is the transformation of an RTL specification into a gate-level representation. The presented system architecture in Figure 19 was synthesized to RTL. The Vivado synthesis tool supports, inter alia, SystemVerilog (IEEE Std 1800-2012) and Verilog (IEEE Std 1364-2005).
Figure 27 The top-view schematic of the synthesized design
Figure 27 presents the retrieved schematic of the designed concept after RTL analysis. It is visible that the DDR is an external port in the architecture. A more specific view can be achieved by expanding the design_1 block; all netlists between the modules, including the AXI4 protocols and AMBA connections, then become visible.
The expanded view of the Fast FIR IP schematic is presented in Figure 28. The schematic is comparable to the dataflow path presented earlier in Figure 14. That is, the extracted Loop_1_proc16 is the main loop responsible for the computation. Inside this loop, all cells of the FAST_FIR IP are presented, such as the LUTs and FDREs. An FDRE is a single D-type flip-flop with clock enable and synchronous reset. Only LUTs and flip-flops were synthesized for the integer kernel in the fast FIR algorithm; if a float kernel were considered, DSP blocks would also be visible in the schematic. [20]
Figure 28 Fast FIR IP schematic
The FPGA floorplan and the allocated, highlighted resources are presented in Figure 29. It is possible to zoom in and examine the slices to determine what each one does. For instance, in clock region X0Y0 there is Block Memory whose parent is the System ILA; on the other hand, in clock region X1Y0 there is Block Memory whose parent is the AXI DMA. The orange slices represent the PS7 CPU cells. The number of fully implemented nets is 16463. [21]
Figure 30 The total resource utilization of the synthesized design, graph and table
Figure 29 FPGA floorplan of synthesized design
The figures above present the total resource utilization of the synthesized design. The LUT, FF, and BRAM resources were described earlier.
• A LUT, also called a function generator, can additionally be implemented as a synchronous RAM resource, called LUTRAM.
• BUFG is a high-fanout buffer that connects signals to the global routing resources for low-skew distribution of the signal. Clock skew is a phenomenon that occurs in synchronous digital circuits when different components receive the same clock signal from the same source at different times. Thus, BUFGs are typically used on clock nets to keep the skew low. [22]
The power report is presented in Figure 31. The total on-chip power is 1.758 W; this is the sum of the dynamic power and the static power. The static power is consumed when the device is powered but not configured, while the dynamic power comes from user logic activity. The graph shows that most of the power is consumed by the ARM Cortex-A9 CPUs. The highest estimated temperature of the working device is 45.3 °C, and the effective thermal resistance is 11.5 °C/W.
Figure 31 Power analysis of the implemented design
Table 3 Timing constraints for the designed system
Slack is the time difference between the required arrival of a signal and its actual arrival. The Total Negative Slack (TNS) is the sum of the negative slacks in the design; it is reported as 0, so the design meets its timing constraints. The Worst Negative Slack (WNS) has a positive value, which means the worst path passes the test (a path fails if the result is negative). Similarly, the Worst Hold Slack (WHS) has a positive value, which means the hold paths pass. The Total Hold Slack (THS) is 0, which means this value meets the timing constraints. The Total Pulse Width Violation (TPWS), the sum of all pulse width violations, is 0 as expected.
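For reference, these metrics follow the standard static timing analysis definitions (a generic summary, not values from this design):

\[
\text{slack} = t_{\text{required}} - t_{\text{arrival}}, \qquad
\text{WNS} = \min_{\text{paths}} \text{slack}, \qquad
\text{TNS} = \sum_{\text{paths with slack} < 0} \text{slack}
\]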