Xcell Journal issue 82

Page 52

TOOLS OF XCELLENCE

Typical performance and area trade-offs for parallel FFTs on Virtex-7 class devices

| FFT Architecture (Length 1024, 16-bit) | Number of Complex Input Samples | Max System Clock Achieved | FFT Throughput (Samples/s) | Hardware Multiplier Utilization | Latency (System Cycles) |
|---|---|---|---|---|---|
| Streaming | 1 | 500 MHz | 500 MS/s | 32 | 1260 |
| Parallel x2 | 2 | 500 MHz | 1 GS/s | 64 | 630 |
| Parallel x4 | 4 | 490 MHz | 1.968 GS/s | 128 | 360 |
| Parallel x8 | 8 | 490 MHz | 3.92 GS/s | 260 | 220 |
| Parallel x16 | 16 | 440 MHz | 7.088 GS/s | 408 | 145 |

Table 1 – Area scalability is generalized by hardware multiplier utilization. Throughput scalability vs. area is slightly better than linear and generally very usable for increasing throughput to multigigahertz sample rates.
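The scaling figures in Table 1 can be explored with a few lines of arithmetic. The sketch below is only an illustration: the rows are transcribed from the table as printed, while the names `TABLE_1` and `scaling_summary` are hypothetical and not part of any vendor tool. For each architecture it reports the throughput and multiplier counts as multiples of the streaming case, the throughput per hardware multiplier, and the latency converted from system cycles to microseconds.

```python
# Rows transcribed from Table 1 as printed:
# (architecture, samples per clock, clock in Hz, throughput in samples/s,
#  hardware multipliers, latency in system cycles)
TABLE_1 = [
    ("Streaming",    1,  500e6, 500e6,   32,  1260),
    ("Parallel x2",  2,  500e6, 1e9,     64,  630),
    ("Parallel x4",  4,  490e6, 1.968e9, 128, 360),
    ("Parallel x8",  8,  490e6, 3.92e9,  260, 220),
    ("Parallel x16", 16, 440e6, 7.088e9, 408, 145),
]


def scaling_summary(rows):
    """Relate each row to the first (streaming) row of the table."""
    _, _, _, base_throughput, base_mults, _ = rows[0]
    summary = []
    for name, _, clock_hz, throughput, mults, latency_cycles in rows:
        summary.append({
            "name": name,
            # How many times faster than the streaming core.
            "throughput_multiple": throughput / base_throughput,
            # How many times more multipliers than the streaming core.
            "multiplier_multiple": mults / base_mults,
            # Area efficiency: mega-samples per second per multiplier.
            "msps_per_multiplier": throughput / mults / 1e6,
            # Latency in microseconds at the achieved clock.
            "latency_us": latency_cycles / clock_hz * 1e6,
        })
    return summary


if __name__ == "__main__":
    for row in scaling_summary(TABLE_1):
        print(
            f"{row['name']:<13}"
            f" throughput x{row['throughput_multiple']:.2f}"
            f" multipliers x{row['multiplier_multiple']:.2f}"
            f" {row['msps_per_multiplier']:.2f} MS/s per multiplier"
            f" latency {row['latency_us']:.3f} us"
        )
```

Comparing the throughput multiple with the multiplier multiple row by row shows where a given configuration sits relative to strictly linear scaling.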

Looking at the table, a few general features can be seen in the trade-off curve:

1. As parallel throughput increases, multiplier (area) utilization also increases, but by a slightly lower multiple (better than linear).

2. Slower system clocks and timing closure yield sublinear throughput growth as parallelism increases. However, on modern FPGAs this degradation is diminishing.

3. Overall better-than-linear throughput/area growth is realized due to No. 1 and No. 2 above.

4. Latency decreases as parallelism increases.

Note that the specific numbers measured in Table 1 are valid only for a given target and configuration of the FFT. In this case, that is a length of 1024, with 16-bit input, dynamic length programmability (4 through 1024) and flow control. Flow control is very important for applications such as spectral monitoring, where side-channel information is often utilized to change the FFT size (in order to change the resolution bandwidth) or to temporarily stall the FFT while other operations, such as acquisition, are going on. In theory, you can accomplish flow control by inserting buffers before the transform operation. But for acquisition-driven operations like spectral monitoring, it’s not easy to precompute the size of the buffer required, resulting in the need to maintain large, fast and expensive memory banks.

[Figure 1 block diagram labels: Parallel FFT; multiple outputs/clk; optional flow control, synchronization and dynamic length inputs; optional flow control and synchronization outputs; system clock]

Figure 1 – A parallel FFT processes multiple samples at a time to scale throughput beyond achievable system clocks of the target device. Optional features include flow control, synchronization and dynamic length programmability.

IMPLEMENTATION ARCHITECTURE

While there are a number of ways to implement FFTs, a parallelized version of the Radix2 Multi-Path Delay Commutator kernel (Radix2-MDC) [3] works very well as a modular method to create configurable parallel-FFT cores that scale well in advanced FPGA devices. The Radix2-MDC is a classical approach to building pipelined FFTs of varying lengths, as shown in Figure 2a for a length-16 FFT. It breaks the input sequence into two parallel data streams flowing forward, with proper delays scheduling the data so that the elements entering each butterfly (a subelement of FFT algorithms) are the correct “distance” apart. The Radix2-MDC is relatively easy to parallelize using a wider data path and vector operations, as shown in Figure 2b. MDC structures also lend themselves easily to flow control and dynamic length reconfiguration

First Quarter 2013
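The butterfly “distance” mentioned in the Radix2-MDC description can be illustrated with a small software model. The sketch below is not the hardware pipeline from the article: it omits the delay lines and commutators entirely and simply runs a textbook radix-2 FFT, assuming a decimation-in-frequency ordering, so that the pairing distance of each stage is visible (8, 4, 2, 1 for a length-16 transform). The function names `fft_radix2_dif` and `naive_dft` are hypothetical and chosen only for this example.

```python
import cmath


def fft_radix2_dif(samples):
    """Radix-2 decimation-in-frequency FFT for power-of-two lengths."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in samples]

    # Each pass is one butterfly stage. The two inputs of a butterfly are
    # `distance` elements apart, and the distance halves at every stage.
    distance = n // 2
    while distance >= 1:
        span = 2 * distance
        for block in range(0, n, span):
            for k in range(distance):
                a = data[block + k]
                b = data[block + k + distance]
                twiddle = cmath.exp(-2j * cmath.pi * k / span)
                data[block + k] = a + b
                data[block + k + distance] = (a - b) * twiddle
        distance //= 2

    # Decimation in frequency leaves the results in bit-reversed order.
    bits = n.bit_length() - 1
    result = [0j] * n
    for index, value in enumerate(data):
        reversed_index = int(format(index, f"0{bits}b")[::-1], 2) if bits else 0
        result[reversed_index] = value
    return result


def naive_dft(samples):
    """Direct O(N^2) DFT, used here only as a reference for checking."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    x = [complex(i, -0.5 * i) for i in range(16)]
    error = max(abs(a - b) for a, b in zip(fft_radix2_dif(x), naive_dft(x)))
    print(f"max difference from the direct DFT: {error:.2e}")
```

Within one stage the butterflies do not depend on each other, which is the property a wider data path can exploit to process several of them per clock.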

