Systolic Algorithm Mapping for Coarse Grained Reconfigurable Array Architectures

Kunjan Patel and C. J. Bleakley

UCD Complex and Adaptive Systems Laboratory, UCD School of Computer Science and Informatics, University College Dublin, Dublin 4, Ireland
Email: http://www.kunjanpatel.co.nr/contact, chris.bleakley@ucd.ie
Abstract. Coarse Grained Reconfigurable Array (CGRA) architectures give high throughput and data reuse for regular algorithms while providing the flexibility to execute multiple algorithms on the same architecture. This paper investigates systolic mapping techniques for mapping biosignal processing algorithms to CGRA architectures. A novel mapping methodology using synchronous data flow (SDF) graphs and control and data flow (CDF) graphs is presented. Mapping signal processing algorithms in this manner is shown to give up to an 88% reduction in memory accesses and significant savings in fetch and decode operations while providing high throughput.
1 Introduction and Related Work
Biosignal processing is widely used in the field of biomedical engineering. Biosignals are generally one dimensional and multichannel. To perform monitoring without interrupting the patient's daily life, the development of portable, low power biosignal processing devices is essential, especially for implantable devices. A Coarse Grained Reconfigurable Array (CGRA) architecture consists of a grid of interconnected reconfigurable processing units which can perform logical or arithmetic operations. CGRA architectures promise low power consumption and high performance while maintaining high flexibility [1].

Mapping applications to array architectures has been a topic of interest to researchers since efficient algorithm mapping is crucial for achieving high performance. Mapping of some DSP algorithms onto the MONTIUM coarse grained reconfigurable architecture was presented in [2]. Applications were mapped specifically for the MONTIUM architecture and the mapping was performance centric. A Synchronous Data Flow (SDF) graph was presented in [3] for mapping and scheduling applications to parallel DSP processors. It showed the usability of the SDF graph for concurrent and automatic scheduling for parallel processors. A Cyclo-Static Data Flow (CSDF) graph was presented in [4]. It allowed static scheduling of high frequency DSP algorithms in multiprocessor environments. However, it is not always possible to find repeatable finite schedules [5].

This paper presents a novel two layer data flow graph approach to map biosignal algorithms in a systolic manner to CGRA architectures. There are two main differences between the proposed approach and previously proposed approaches. First, the mapping of algorithms is done for a CGRA architecture and hence the mapping is constrained by the architecture. Second, the proposed mappings are focused on low power consumption rather than high performance. Reading data from memory
is one of the most power consuming processes in the processor execution cycle [6]. Hence, power consumption is reduced by reducing the number of memory accesses and by eliminating the fetch-decode stages of the execution cycle. The high degree of computational parallelism in CGRAs also allows for aggressive voltage scaling. Case studies of mapping for various biosignal processing algorithms are provided. To the authors’ knowledge, this is the first time that systolic style mapping for CGRA architectures has been described and evaluated.
2 Proposed Algorithm
Mapping an algorithm in a systolic manner onto a CGRA architecture requires spatiotemporal mapping of the algorithm. To address this problem, the algorithm to be mapped is first represented as a Synchronous Data Flow (SDF) graph [3], which is a very abstract view of the application, and then each node of the SDF graph is represented as a Control and Data Flow (CDF) graph [7]. The algorithm is as follows:

Step 1: Prepare the SDF graph for the application.

Step 2: Rearrange the SDF graph for systolic mapping. To map the algorithm in a systolic manner, all of the computing elements should run simultaneously, and so the blocking factor (j), which determines the number of nodes run in parallel, should be equal to the number of nodes (n) present in the SDF graph. A series of rearrangements of the SDF graph needs to be performed until the targeted blocking factor is achieved.

Step 3: Schedule the SDF graph. In systolic arrays, each CFU executes an operation in a single cycle and it executes the same operation during every cycle until the CFU is reconfigured or disabled. If i is the index of a node in the SDF graph, then the schedule (ψ) is:

$$\psi_i = \{1\}, \quad \forall i \qquad (1)$$
Step 4: Prepare a CDF graph for each node in the SDF graph. A CDF graph is prepared for each node in the SDF graph. As mentioned before, each CFU operation must finish in a single cycle, so this phase is dependent on the architecture of the CFU. If a CFU is not able to execute the operation in a single cycle, the operation is divided into smaller operations and the mapping process is repeated from Step 1.

Step 5: Get the topology matrix and delay matrix. The topology matrix (Γ) for the SDF graph is prepared. The operations of nodes are allocated to CFUs according to the connectivity in the topology matrix. This operation allocation task is dependent on the interconnection topology of the array, and the topology matrix acts as a guide for this purpose. Because a systolic mapping
requires synchronization in data injection, the delay matrix is prepared for applications where the data is not injected at the same time in all CFUs. There are no specific rules in the SDF graph paradigm to constrain the number of I/O ports in a node. So, to keep the mapping constrained to the number of I/O ports in the CFU, and since j = n, the following conditions should be satisfied for the topology matrix.

Condition 1 (for binding the number of output ports):

$$\sum_{i=1}^{R} p_{in} \ngtr \text{total number of outputs in a CFU}$$

where R = number of arcs in the SDF graph and p = number of produced tokens at the node.

Condition 2 (for binding the number of input ports):

$$\sum_{i=1}^{R} c_{in} \ngtr \text{total number of inputs in a CFU}$$

where c = number of consumed tokens at the node.
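Conditions 1 and 2 can be checked mechanically once a topology matrix is available. The following is a minimal sketch, not the authors' tool. It assumes the usual SDF convention from [3], in which rows are arcs, columns are nodes, positive entries are produced tokens and negative entries are consumed tokens. The function name `ports_ok` and both example graphs are hypothetical.

```python
from typing import List


def ports_ok(gamma: List[List[int]], max_outputs: int, max_inputs: int) -> bool:
    """Check Conditions 1 and 2 for every node (column) of a topology matrix.

    Assumed convention: gamma[i][n] > 0 means node n produces that many
    tokens on arc i; gamma[i][n] < 0 means node n consumes that many.
    """
    n_nodes = len(gamma[0])
    for n in range(n_nodes):
        produced = sum(row[n] for row in gamma if row[n] > 0)
        consumed = sum(-row[n] for row in gamma if row[n] < 0)
        # Neither sum may exceed the number of ports available on a CFU.
        if produced > max_outputs or consumed > max_inputs:
            return False
    return True


# Hypothetical 3-node chain: arc 0 is node0 -> node1, arc 1 is node1 -> node2.
chain = [
    [1, -1, 0],
    [0, 1, -1],
]

# Hypothetical fan-in: nodes 0, 1 and 2 each feed node 3 over their own arc.
fan_in = [
    [1, 0, 0, -1],
    [0, 1, 0, -1],
    [0, 0, 1, -1],
]

print(ports_ok(chain, max_outputs=1, max_inputs=2))   # chain fits a 1-out, 2-in CFU
print(ports_ok(fan_in, max_outputs=1, max_inputs=2))  # node 3 needs three inputs
```

A graph that fails the check would have to be rearranged, for example by splitting the offending node, before operations are allocated to CFUs.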
3 Application Mapping
The algorithms listed in Table 1 were manually mapped to a CGRA as shown in Figure 1. The CFU model is shown in Figure 1(a). Each CFU in the CGRA can perform one of five operations: multiply accumulate (MAC), multiply subtract (MSUB), addition (ADD), subtraction (SUB) or no operation (NOP). The array architecture has interconnections as shown in Figure 1(b).
Fig. 1: a) A model of the considered CFU, with operations NOP (no operation), ADD (addition), SUB (subtraction), MAC (multiply accumulate) and MSUB (multiply subtract); b) A CGRA architecture example
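To make the idea of a systolic mapping concrete, the sketch below chains MAC-configured CFUs into an FIR filter. It is an illustration under stated assumptions rather than the mapping used in the paper: the arithmetic behind each operation name, the `CFU` class, and the choice of a transposed-form FIR (input sample broadcast to every cell, partial sums passed between neighbours) are all hypothetical.

```python
from typing import Callable, Dict, List

# Assumed semantics for the five operation names; the paper only lists them.
OPS: Dict[str, Callable[[int, int, int], int]] = {
    "MAC":  lambda acc, a, b: acc + a * b,  # multiply accumulate
    "MSUB": lambda acc, a, b: acc - a * b,  # multiply subtract
    "ADD":  lambda acc, a, b: a + b,
    "SUB":  lambda acc, a, b: a - b,
    "NOP":  lambda acc, a, b: acc,
}


class CFU:
    """Hypothetical CFU: one fixed operation per cycle, one output register."""

    def __init__(self, op: str, coeff: int = 0) -> None:
        self.op = op
        self.coeff = coeff
        self.out = 0  # output register, read by the neighbouring CFU

    def step(self, from_neighbour: int, data_in: int) -> int:
        return OPS[self.op](from_neighbour, data_in, self.coeff)


def systolic_fir(coeffs: List[int], samples: List[int]) -> List[int]:
    """Transposed-form FIR: cell k holds coefficient h[k]; cell 0 gives y[n]."""
    cells = [CFU("MAC", h) for h in coeffs]
    outputs = []
    for x in samples:
        # Every cell reads its neighbour's register from the previous cycle,
        # so all cells can compute in the same cycle.
        neighbour_regs = [c.out for c in cells[1:]] + [0]
        new_values = [c.step(r, x) for c, r in zip(cells, neighbour_regs)]
        for cell, value in zip(cells, new_values):
            cell.out = value
        outputs.append(cells[0].out)
    return outputs


print(systolic_fir([1, 2, 3], [1, 0, 0, 0]))  # impulse response: [1, 2, 3, 0]
```

Each cell repeats the same operation on every cycle and the coefficients never leave the cells, which is the property that lets a systolic mapping avoid repeated coefficient fetches from memory.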
4 Results
All the algorithms described above were modelled and simulated using the Configurable Array Modeller and Simulator (CAMS) [8] for CGRA architectures. CAMS is a cycle accurate functional simulator for CGRA architectures written in the Java programming language. For filters, coefficients were determined using Matlab and the results were verified against Matlab. The performance figures for the CGRA were compared with those for the TI C5510. For the TI C5510 DSP processor, the results were derived from manual and mathematical analysis of equations from [9][10]. Table 1 shows a comparison in terms of the number of operations and the number of CFUs required to map the algorithms discussed in the previous section. The number of operations required for the CGRA architecture and the DSP is almost the same in all cases. However, using the CGRA architecture, higher throughput can be achieved for continuous data processing because of parallelism and systolic mapping.

The topology matrix given with Fig. 2 is:

$$\Gamma = \begin{bmatrix}
1 & -2 & 0 & 0 & 0 & 0 \\
0 & 1 & -2 & 0 & 0 & 0 \\
0 & 0 & 1 & -2 & 0 & 0 \\
0 & 0 & 0 & 1 & -2 & 0 \\
0 & 0 & 0 & 0 & 1 & -2 \\
0 & 0 & 0 & 0 & 0 & 1 \\
-1 & -1 & -1 & -1 & -1 & -1
\end{bmatrix}$$

[Figure labels: Input Data; Coefficient; From output register of previous CFU; Store data in output register; To next CFU; Execution cycle; Data input]

Fig. 2: a) 5 taps FIR filter SDF graph; b) FIR filter CDF graph for a single CFU
Table 1: Performance of some common biosignal applications (total operations are for a single iteration)

Algorithm               Iterations   CGRA operations   DSP operations   Number of CFUs
FIR filter              256          6                 6                6
Matrix Multiplication   25           67                64               16
Matrix Determinant      25           15                17               5
FFT Butterfly           25           9                 8                8
Wavelet Filterbank      256          8                 8                10
DFT                     8            61                61               61
Table 2 shows the number of register accesses (RGA) and the number of RAM accesses (RMA) required for the CGRA architecture and the DSP. An improvement in RMA of up to 8.5 times can be seen because of the systolic mapping. Figure 3 shows a comparison of RAM Data Reuse (RDR) for all three CGRA architectures. RDR is given by:

$$\mathrm{RDR} = \frac{\text{Number of unique RAM addresses accessed}}{\text{Number of RAM accesses}} \qquad (2)$$

It is clear from the results that data reuse for CGRA architectures is considerably higher than that of the DSP processor, except for the FFT butterfly.
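Equation (2) can be evaluated directly on a list of RAM addresses in access order. The sketch below is illustrative only: the function name `ram_data_reuse` and the two address traces are hypothetical and are not taken from the CAMS simulations.

```python
from typing import Sequence


def ram_data_reuse(addresses: Sequence[int]) -> float:
    """RDR as in Eq. (2): unique RAM addresses divided by total RAM accesses."""
    if not addresses:
        raise ValueError("empty access trace")
    return len(set(addresses)) / len(addresses)


# Hypothetical trace in which every operand is read from RAM exactly once.
read_once = [0, 1, 2, 3]

# Hypothetical trace in which addresses 0 and 1 are re-read from RAM.
re_read = [0, 1, 0, 1, 0, 1, 2, 3]

print(ram_data_reuse(read_once))  # 1.0
print(ram_data_reuse(re_read))    # 0.5
```

Under this definition a value of 1 means that no address is fetched from RAM more than once, so a mapping that keeps operands inside the array scores higher than one that re-reads them.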
Table 2: Register and RAM accesses comparison for some biosignal algorithms

Algorithm               CGRA RGA   CGRA RMA   DSP RGA   DSP RMA   Memory access reduction (%)
FIR filter              12         2          12        7         71
Matrix Multiplication   208        42         256       192       78
Matrix Determinant      35         16         26        10        73
FFT Butterfly           25         8          68        5         60
Wavelet Filterbank      16         2          16        9         77
DFT                     188        15         183       130       88
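The paper does not state how the memory access reduction column is computed. A plausible reading, assumed here, is the relative drop in RAM accesses, 100 × (1 − RMA_CGRA / RMA_DSP), truncated to an integer. The sketch below applies that assumed formula to four rows of Table 2. The function name is hypothetical, and the Matrix Determinant and FFT Butterfly rows are left out because their printed values do not reproduce the tabulated percentage under this assumption.

```python
def memory_access_reduction(rma_cgra: int, rma_dsp: int) -> float:
    """Assumed formula: percentage drop in RAM accesses relative to the DSP."""
    return 100.0 * (1.0 - rma_cgra / rma_dsp)


# (CGRA RMA, DSP RMA) pairs copied from Table 2.
rows = {
    "FIR filter": (2, 7),
    "Matrix Multiplication": (42, 192),
    "Wavelet Filterbank": (2, 9),
    "DFT": (15, 130),
}

for name, (cgra, dsp) in rows.items():
    # int() truncates towards zero, matching the integer values in the table.
    print(name, int(memory_access_reduction(cgra, dsp)))
```

For these rows the truncated results are 71, 78, 77 and 88, which agree with the last column of Table 2 and with the headline figure of 88% for the DFT.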
Figure 4 shows a comparison of the number of fetch-decodes required on a TI C5510 DSP and the number of reconfigurations required for the CGRA architecture to execute the algorithms described before. The number of iterations is shown in brackets. It can be seen that activity can be reduced by avoiding the fetch and decode steps using systolic CGRA architectures for regular biosignal processing algorithms.
Fig. 3: RDR comparison of some biosignal algorithms
5 Conclusion
This paper proposed mapping biosignal applications in a systolic manner onto a CGRA architecture. To map biosignal processing algorithms onto the CGRA architecture, two types of graphs, the SDF graph and the CDF graph, were integrated in the mapping procedure to capture the structure of the signal processing algorithms. This mapping technique shows up to an 88% reduction in memory accesses for regular algorithms compared to a conventional DSP. The paper illustrates the efficiency of the proposed approach for low power biosignal applications.
Fig. 4: A comparison of the required number of fetches, decodes and configurations
Acknowledgments

This research was funded as a part of the Efficient Embedded Digital Signal Processing for Mobile Digital Health (EEDSP) cluster, grant no. 07/SRC/I1169, by Science Foundation Ireland (SFI).
References

1. Hartenstein, R.: Coarse grain reconfigurable architecture (embedded tutorial). In: Proceedings of the 2001 conference on Asia South Pacific design automation, ACM New York, NY, USA (2001) 564–570
2. Heysters, P., Smit, G.: Mapping of DSP algorithms on the MONTIUM architecture. In: Parallel and Distributed Processing Symposium, 2003. Proceedings. International. (April 2003) 6 pp.
3. Lee, E., Messerschmitt, D.: Synchronous Data Flow. Proceedings of the IEEE 75(9) (Sept. 1987) 1235–1245
4. Bilsen, G., Engels, M., Lauwereins, R., Peperstraete, J.: Cyclo-Static Data Flow. In: Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Volume 5. (May 1995) 3255–3258 vol.5
5. Parks, T., Pino, J., Lee, E.: A Comparison of Synchronous and Cyclo-Static Dataflow. In: Signals, Systems and Computers, 1995. 1995 Conference Record of the Twenty-Ninth Asilomar Conference on. Volume 1. (Oct.–Nov. 1995) 204–210 vol.1
6. Casas-Sanchez, M., Rizo-Morente, J., Bleakley, C.: Power Consumption Characterisation of the Texas Instruments TMS320VC5510 DSP. Lecture Notes in Computer Science 3728 (2005) 561
7. Namballa, R., Ranganathan, N., Ejnioui, A.: Control and Data Flow Graph Extraction for High-Level Synthesis. VLSI, IEEE Computer Society Annual Symposium on 0 (2004) 187
8. Patel, K., Bleakley, C.J.: Rapid Functional Modelling and Architecture Exploration of Coarse Grained Reconfigurable Array Architectures. Submitted, under review (2009)
9. Smith, S.: Digital signal processing: a practical guide for engineers and scientists. Newnes (2003)
10. Meher, P.: Efficient Systolic Implementation of DFT Using a Low-Complexity Convolution-Like Formulation. Circuits and Systems II: Express Briefs, IEEE Transactions on 53(8) (Aug. 2006) 702–706