TOOLS OF XCELLENCE
Typical performance and area trade-offs for parallel FFTs on Virtex-7 class devices

FFT Architecture (Length 1024, 16-bit) | Number of Complex Input Samples | Max System Clock Achieved | FFT Throughput (Samples/s) | Hardware Multiplier Utilization | Latency (System Cycles)
Streaming                              | 1                               | 500 MHz                   | 500 MS/s                   | 32                              | 1260
Parallel x2                            | 2                               | 500 MHz                   | 1 GS/s                     | 64                              | 630
Parallel x4                            | 4                               | 490 MHz                   | 1.968 GS/s                 | 128                             | 360
Parallel x8                            | 8                               | 490 MHz                   | 3.92 GS/s                  | 260                             | 220
Parallel x16                           | 16                              | 440 MHz                   | 7.088 GS/s                 | 408                             | 145
Table 1 – Area scalability is generalized by hardware multiplier utilization. Throughput scalability vs. area is slightly better than linear and generally very usable for increasing throughput to multigigahertz sample rates.
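The scaling trends in Table 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it transcribes the table rows and computes each architecture's throughput gain and multiplier (area) gain relative to the streaming baseline, the throughput delivered per multiplier, and the latency converted from system cycles to microseconds. The names TABLE_1 and scaling_report are hypothetical and are not part of any FFT core or vendor tool.

```python
# Rows transcribed from Table 1:
# (architecture, samples per clock, clock in MHz, throughput in MS/s,
#  hardware multipliers, latency in system cycles)
TABLE_1 = [
    ("Streaming",     1, 500,  500.0,  32, 1260),
    ("Parallel x2",   2, 500, 1000.0,  64,  630),
    ("Parallel x4",   4, 490, 1968.0, 128,  360),
    ("Parallel x8",   8, 490, 3920.0, 260,  220),
    ("Parallel x16", 16, 440, 7088.0, 408,  145),
]


def scaling_report(rows):
    """Return per-architecture scaling figures relative to the first row."""
    _, _, _, base_tput, base_mults, _ = rows[0]
    report = []
    for name, _parallelism, clock_mhz, tput, mults, latency_cycles in rows:
        report.append({
            "name": name,
            "throughput_gain": tput / base_tput,      # vs. streaming
            "area_gain": mults / base_mults,          # vs. streaming
            "msps_per_multiplier": tput / mults,      # throughput per unit area
            "latency_us": latency_cycles / clock_mhz, # cycles / MHz = microseconds
        })
    return report


if __name__ == "__main__":
    for row in scaling_report(TABLE_1):
        print(
            f"{row['name']:<13}"
            f" throughput x{row['throughput_gain']:.3f}"
            f" area x{row['area_gain']:.3f}"
            f" {row['msps_per_multiplier']:.2f} MS/s per multiplier"
            f" latency {row['latency_us']:.3f} us"
        )
```

With the Table 1 figures, throughput per multiplier stays between roughly 15.1 and 15.6 MS/s from the streaming core through Parallel x8, and rises to about 17.4 MS/s at Parallel x16, where a 12.75x multiplier count delivers about 14.2x the streaming throughput.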
Looking at the table, a few general features can be seen in the trade-off curve:
1. As parallel throughput increases, multiplier (area) utilization increases, but by a slightly lower multiple (better than linear).
2. Slower system clocks and timing closure yield sublinear throughput growth as parallelism increases. However, on modern FPGAs this degradation is diminishing.
3. Overall, better-than-linear throughput/area growth is realized due to No. 1 and No. 2 above.
4. Latency decreases as parallelism increases.
Note that the specific numbers measured in Table 1 are valid only for a given target and configuration of the FFT. In this case, that is a length of 1024, with 16-bit input, dynamic length programmability (4 through 1024) and flow control. Flow control is very important for applications such as spectral monitoring, where side-channel information is often utilized to change the FFT size (in order to change the resolution bandwidth) or to temporarily stall the FFT while
[Figure 1 block diagram. Labels: Parallel FFT; multiple outputs/clk; optional flow control, synchronization and dynamic length inputs; optional flow control and synchronization outputs; system clock.]
Figure 1 – A parallel FFT processes multiple samples at a time to scale throughput beyond achievable system clocks of the target device. Optional features include flow control, synchronization and dynamic length programmability.
Xcell Journal
other operations, such as acquisition, are going on. In theory, you can accomplish flow control by inserting buffers before the transform operation. But for acquisition-driven operations like spectral monitoring, it’s not easy to precompute the size of the buffer required, resulting in the need to maintain large, fast and expensive memory banks.

IMPLEMENTATION ARCHITECTURE
While there are a number of ways to implement FFTs, a parallelized version of the Radix2 Multi-Path Delay Commutator kernel (Radix2-MDC) [3] works very well as a modular method to create configurable parallel-FFT cores that scale well in advanced FPGA devices. The Radix2-MDC is a classical approach to building pipelined FFTs of varying lengths, as shown in Figure 2a for a 16-length FFT. It breaks the input sequence into two parallel data streams flowing forward with the correct “distance” between data elements that are entering the butterfly (a subelement of FFT algorithms) and that are scheduled by proper delays. The Radix2-MDC is relatively easy to parallelize using a wider data path and vector operations, as shown in Figure 2b. MDC structures also lend themselves easily to flow control and dynamic length reconfigu
First Quarter 2013