
XCELLENCE IN WIRELESS COMMUNICATIONS

Figure 4 from a reference vector and writes a set of coefficients. Subsequently, we compared these coefficients against a reference implementation written in MATLAB, which visualizes the difference between the two sets of coefficients.

We performed the software profiling at two levels. First, we ran the software on a standard x86 server and used the gprof profiling tool to get a first estimate of the expected bottlenecks. Second, we ran the software on the ARM processor and, guided by the gprof results, instrumented the subfunctions of interest with calls to the global CPU timer of the Zynq All Programmable SoC. This timer runs at half the CPU frequency, giving excellent resolution with an overhead of only a few cycles.

Figure 5 shows the profiling results of the three main function blocks running on the ARM processor, indicating that the AMC block is the bottleneck of the application. It consumes 97 percent of the overall update time, making it a prime candidate for hardware acceleration. Prior to profiling, we expected that the solver used for coefficient computation in Figure 4 would consume a larger part of the update time because, in contrast to the other functionality, it was performing double-precision floating-point operations. However, the ARM's floating-point unit solved the task very efficiently.
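As an illustration of this kind of instrumentation (the article does not show its code), a bare-metal read of the Cortex-A9 global timer might look like the sketch below. The register addresses follow the Zynq-7000 documentation, and read_global_timer is our own helper name, not the authors'.

#include <stdint.h>

/* Cortex-A9 MPCore global timer counter registers (Zynq-7000) */
#define GTIMER_COUNT_LO (*(volatile uint32_t *)0xF8F00200)
#define GTIMER_COUNT_HI (*(volatile uint32_t *)0xF8F00204)

/* Read the 64-bit counter; re-read the upper word to guard
   against a carry between the two 32-bit accesses. */
static uint64_t read_global_timer(void)
{
    uint32_t hi, lo;
    do {
        hi = GTIMER_COUNT_HI;
        lo = GTIMER_COUNT_LO;
    } while (hi != GTIMER_COUNT_HI);
    return ((uint64_t)hi << 32) | lo;
}

void profile_example(void)
{
    uint64_t t0 = read_global_timer();
    /* subfunction under test goes here */
    uint64_t t1 = read_global_timer();
    uint64_t ticks = t1 - t0;  /* the timer ticks at half the CPU clock */
    (void)ticks;
}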

#include "ap_int.h"

// Non-standard bit-widths via ap_int<>
typedef struct { ap_int<32> real; ap_int<32> imag; } CINT32;
typedef struct { ap_int<64> real; ap_int<64> imag; } CINT64;

// Complex multiplication with 3 pre-adders and
// 3 multiplications instead of 4
CINT64 CMULT32(CINT32 x, CINT32 y)
{
    CINT64 res;
    ap_int<33> preAdd1, preAdd2, preAdd3;
    ap_int<65> sharedMul;

    preAdd1 = (ap_int<33>)x.real + x.imag;
    preAdd2 = (ap_int<33>)x.imag - x.real;
    preAdd3 = (ap_int<33>)y.real + y.imag;

    sharedMul = x.real * preAdd3;
    res.real  = sharedMul - y.imag * preAdd1;
    res.imag  = sharedMul + y.real * preAdd2;
    return res;
}

Figure 7 – Optimized complex multiplication code
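To see that the restructured code still computes the ordinary complex product, a quick host-side check such as the following can be used. This little test harness is our illustration, not part of the article, and assumes the CINT32, CINT64 and CMULT32 definitions of Figure 7 are in scope.

#include <cstdio>
#include "ap_int.h"

int main()
{
    CINT32 x, y;
    x.real = 123456; x.imag = -789;
    y.real = -4242;  y.imag = 31415;

    CINT64 fast = CMULT32(x, y);

    // Reference: (xr*yr - xi*yi) + j(xr*yi + xi*yr), 4 multiplications
    ap_int<65> ref_r = (ap_int<65>)(x.real * y.real) - x.imag * y.imag;
    ap_int<65> ref_i = (ap_int<65>)(x.real * y.imag) + x.imag * y.real;

    printf("match: %d\n",
           (int)(fast.real == ref_r && fast.imag == ref_i));
    return 0;
}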

void amc_accelerator_top(...)
{
    // Limit the number of multiplications
#pragma HLS allocation instances=mul limit=3*UNROLL_FACTOR operation

    <function body>

    CINT64 Marray[MSIZE];
    // Partition based on the unrolling factor
#pragma HLS array_partition variable=Marray cyclic factor=UNROLL_FACTOR dim=1
    // Resource mapping
#pragma HLS resource variable=Marray core=RAM_2P

    <function body>

label_compute_M:
    for (int i = 0; i < MSIZE; ++i)
    {
        // Loop pipelining and unrolling
#pragma HLS pipeline
#pragma HLS unroll factor=UNROLL_FACTOR
        <loop body>
    }

    <function body>
}

Figure 8 – Code snippet for loop unrolling
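To make the interplay of these directives concrete (the excerpt does not give actual values, so assume UNROLL_FACTOR = 4 purely for illustration): cyclic partitioning with factor 4 splits Marray into four physical memories, with element i stored in bank i mod 4. When the loop is unrolled by 4, the four parallel iterations each access a different bank in the same cycle, so they never contend for a single RAM port. Likewise, the allocation directive caps the design at 3 × 4 = 12 multiplier instances: three multipliers per optimized complex multiplication (Figure 7) times four unrolled copies.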

Before actually implementing a hardware accelerator for the AMC block, we examined potential software optimizations. The SIMD NEON engine of the ARM processor has a 128-bit-wide data path. Since the AMC algorithm works on 64-bit fixed-point data types, the NEON engine can carry out two parallel computations, as Figure 6 illustrates. Instead of using low-level assembly instructions to access the NEON engine, the compiler provides a set of function-like wrappers for the instructions. These wrappers, called intrinsics, provide type-safe operations while allowing the compiler to automatically schedule the C variables to NEON registers. Applying the intrinsics in the C code results in a speed-up factor of two. Furthermore, during the NEON operations the ARM processor core is free and can continue processing simple non-NEON instructions, such as loop conditions and pointer increments, while the NEON engine runs in parallel. (A minimal intrinsics sketch appears at the end of this section.)

HARDWARE IMPLEMENTATION FOR DIGITAL PREDISTORTION

To improve the overall parameter update time, we implemented an AMC accelerator using the Vivado HLS tool, based on the design flow in Figure 3. Our accelerator's programmable configurations support a number of different predistorter coefficients and allow the flexible selection of nonlinear terms in the Volterra-series-based model. Hence, it is possible to support several DPD configurations using the same AMC accelerator. In addition, you can make changes in the existing C++ code and in the compiler directives to generate a brand-new accelerator in a much shorter time than if you were doing a hand-coded RTL design. Let's take a closer look at some specific examples of code rewriting and compiler directives that we used for the AMC accelerator.
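The article does not show its intrinsics code, so the following is our own minimal sketch of the pattern described in the text, two 64-bit fixed-point lanes per 128-bit NEON register, using standard arm_neon.h intrinsics; acc2_mac and its data layout are illustrative assumptions, not the authors' implementation.

#include <arm_neon.h>
#include <stdint.h>

/* Multiply-accumulate two 64-bit fixed-point accumulators at once.
   vmull_s32 multiplies two pairs of 32-bit values into two 64-bit
   products; vaddq_s64 then performs two 64-bit adds in a single
   instruction. */
void acc2_mac(int64_t acc[2], const int32_t a[2], const int32_t b[2])
{
    int32x2_t va   = vld1_s32(a);        /* load two 32-bit operands */
    int32x2_t vb   = vld1_s32(b);
    int64x2_t prod = vmull_s32(va, vb);  /* two 32x32 -> 64 products */
    int64x2_t vacc = vld1q_s64(acc);     /* load two 64-bit sums     */
    vacc = vaddq_s64(vacc, prod);        /* two 64-bit adds at once  */
    vst1q_s64(acc, vacc);                /* store back               */
}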
