
G. SOWMYA BALA* et al.

ISSN: 2250–3676



1 PG scholar, Dept. of ECM, K.L.University, Vaddeswaram, A.P, India
2 Assistant Professor, Dept. of ECM, K.L.University, Vaddeswaram, A.P, India
3 Head of Dept. of ECM, K.L.University, Vaddeswaram, A.P, India


Abstract: This manuscript proposes an advanced methodology for modeling a look-up table (LUT), on which an FIR filter is designed for efficient area utilization. Because most DSP processors rely on multiply-and-accumulate structures rather than memory-mapping structures, this paper presents methods to reduce the additional components needed within the DSP cores used in FPGAs. Several methods have been proposed so far to optimize the memory size. Here, a new approach is specified that further eliminates the drawbacks of multiply-and-accumulate structures. A simplified and optimized method is specified that performs the logical and functional operations efficiently. This methodology not only shortens the computation time but also reduces the required memory size. A different form of LUT design process is presented that uses the available LUT efficiently and avoids increasing its size as the number of stored words increases. An FIR filter is designed with this methodology to show that the memory size is smaller than with the earlier methods and that the need for additional components within the DSP cores of an FPGA can be reduced to the desired level. The methodology uses the A-OMS technique and an input coding technique to overcome these drawbacks; it is synthesized using the Xilinx ISE synthesizer and simulated using Xilinx 10i.

Index Terms: Digital Signal Processors, Field Programmable Gate Array, LUT, and FIR filter.

1. INTRODUCTION

The main factors driving present research trends are area, power, performance, and cost. They are related in such a way that reducing the power consumed and the area needed increases performance, which ultimately reduces cost. Therefore, in present-day electronics such as cellular phones and wireless devices, different forms of algorithms are used for several crucial operations to increase speed and to reduce area and power consumption. Owing to the increasing demand for complex DSP applications, low-cost, high-performance SoC implementations of DSP algorithms are receiving increased attention among researchers and design engineers. New methods need to be developed to provide better performance for high-end applications. ASICs and DSP chips are the traditional solution for high performance. However, factors such as the high development cost and time-to-market associated with ASICs can be prohibitive for certain applications, and programmable DSP processors, which have a sequential execution architecture, may be unable to meet the desired performance. An alternative to these two is the embedded FPGA, which offers a very attractive solution by balancing high flexibility, time-to-market, cost and performance [1, 2, 3]. A diverse range of signal processing applications use FPGA-based signal processors for reasons of performance, economics, flexibility and power consumption. FPGA technology has also been embraced by the telecommunication industry: it is reported that around 50% of all FPGA production finds its way into telecommunications and network equipment of one sort or another. The flexibility and performance of FPGAs allow designers to track evolving standards easily. The insertion of FPGAs in DSP hardware is growing exponentially, helped by simple access to intellectual property (IP) cores. In addition, FPGAs provide the flexibility to achieve third and future generation

IJESAT | Jan-Feb 2012 Available online @



[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY Volume-2, Special Issue-1, 133 – 137

communication infrastructure that must support multiple modulation formats and air interface standards while simultaneously providing high levels of performance. As FPGA technology has grown in complexity and sophistication, its uses have spread. FPGAs are the platform of choice for implementing high-performance digital signal processing (DSP) systems. As Kevin Morris wrote in a recent article, “…for high-performance algorithmic design, FPGAs are capable of performance, efficiency, and cost-effectiveness orders of magnitude better than alternative solutions like DSP processors…”. Every designer implements an algorithm in a different way and with different precision requirements. These requirements vary not only from design to design but also within each stage of a design, such as finite impulse response (FIR) filters, fast Fourier transforms (FFTs), detection processing, and adaptive algorithms. When the signal-processing precision requirements are mapped along a continuum, different applications are found to fall naturally across the DSP precision spectrum. A finite impulse response digital filter usually consists only of zeros (no poles) and is generally implemented on a fixed-point DSP processor to produce equiripple digital filters at low cost [5, 6]. The design of a digital filter in an FPGA uses simple methods involving parallel structures and Distributed Arithmetic algorithms to exceed the performance of multiple general-purpose DSP devices. The FPGAs do not contain a separate, dedicated multiplier. Distributed Arithmetic for array multiplication in an FPGA is one technique used to increase the function's data bandwidth and throughput by several orders of magnitude over off-the-shelf DSP solutions.
In this paper, an advanced methodology is defined that overcomes the drawbacks of the earlier methods. An FIR filter is designed based on this methodology and could be used for high-speed signal processing. The LUT specified so far uses a multiplier-less approach that can store only a limited number of words. A different approach is now shown that can handle N bits within the available LUT space. Antisymmetric product coding and odd multiple storage have been used previously to optimize LUTs within DSP cores [6, 7]. The 2's complement operation can be simplified because the input address and the LUT output can always be transformed to odd integer values. Consequently, a different coding scheme is defined here to combine the benefits of these methods. Forming an LUT with this advanced method aims mainly to provide efficient memory-based computation and to perform the operations required for the functional computation. The coding scheme includes a decomposition process and an input-coding method. The method is described briefly in the following sections. The next section presents a digital filter design based on the proposed LUT. Finally, the paper ends with a comparative analysis against the earlier techniques.

2. ADVANCED LUT DESIGN PROCESS

A conventional LUT-based multiplier requires the LUT size to increase with the input word length, which is area inefficient. To provide an area-efficient look-up table for operations on large data, some optimization schemes have been presented. In one of them, only the odd multiples are stored instead of all values; in another, the LUT size is reduced to half of its original size by recording the product words as antisymmetric pairs. In addition to these, a new method is used that further optimizes the results obtained from the methods above.

Figure 1: Block Diagram of computing system with Input Coding technique

The advanced method further reduces the size requirement of the LUT by combining a modified form of odd multiple storage with input coding, as described here.
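The odd-multiple storage idea mentioned above can be illustrated with a short Python sketch. It is a generic illustration of the principle (any nonzero input is an odd number times a power of two), not the authors' circuit; the function names `build_oms_lut` and `oms_multiply` and the 4-bit input width are assumptions made for the example.

```python
def build_oms_lut(a, n=4):
    """Store only the odd multiples of a for an n-bit input: a, 3a, 5a, ..."""
    return {odd: a * odd for odd in range(1, 1 << n, 2)}


def oms_multiply(lut, x):
    """Return a * x using the odd-multiple table followed by a left shift."""
    if x == 0:
        return 0
    shift = 0
    while x % 2 == 0:  # factor x as (odd value) * 2**shift
        x //= 2
        shift += 1
    return lut[x] << shift


a = 13  # hypothetical coefficient
lut = build_oms_lut(a)
print(len(lut))  # number of stored words for a 4-bit input
print(all(oms_multiply(lut, x) == a * x for x in range(16)))
```

For a 4-bit input the table holds 8 words instead of 16; the even multiples are recovered by shifting, which costs no memory.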

2.1. Input Coding Technique

In this coding scheme, the input word is decomposed into groups of four bits, and the product value is computed for each sub-word. Let the input word X consist of N sub-words, each consisting of n bits. It is represented as X = {X_N, X_{N-1}, ..., X_3, X_2, X_1, X_0}. Each sub-word is passed through the encoder, so that the word X is encoded as X' with the same number of sub-words plus a single-bit flag F, written as X' = {F, X'_N, X'_{N-1}, ..., X'_3, X'_2, X'_1, X'_0}. The encoded sub-words consist of n-1 bits, and the one-bit flag represents the overflow condition. Each sub-word is associated with a carry flag whose overflow condition ensures correct functioning: F_i = 1 when the integer equivalent of the binary value X_i is greater than or equal to 2^n/2, and F_i = 0 otherwise.

Figure 3: Line Address Decoder Block Diagram

The internal operation and functional mapping of the input-coding block of Figure 1 are shown in the figure above. The 4-to-9 line address decoder block is shown in Figure 3. This decoder block is mainly used to produce the decoded output. The final product value is computed through address mapping in the LUT block that follows the decoder. A reset signal is used to reset the values in the LUT in correspondence with the input value. In the figure above, the carry flag F is represented by C_i; it indicates the overflow condition, if any, from the previous computation.
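The coding rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' encoder: it assumes that a sub-word with F_i = 1 is replaced by its value minus 2^n (so it enters the product negatively) and passes a carry of 1 to the next sub-word, and that the carry is added before the threshold test. The names `encode_input` and `decode` are hypothetical. The input word is the 16-bit word used in the worked example that follows.

```python
def encode_input(x, width=16, n=4):
    """Recode x into signed n-bit digits, least significant sub-word first.

    Returns (digits, flags, carry_out). Assumption: a sub-word whose value
    (including the incoming carry) is >= 2**(n-1) sets its flag, is replaced
    by value - 2**n, and sends a carry of 1 to the next sub-word.
    """
    base = 1 << n
    half = base >> 1
    digits, flags = [], []
    carry = 0
    for i in range(width // n):
        value = ((x >> (i * n)) & (base - 1)) + carry
        if value >= half:
            digits.append(value - base)
            flags.append(1)
            carry = 1
        else:
            digits.append(value)
            flags.append(0)
            carry = 0
    return digits, flags, carry


def decode(digits, carry, n=4):
    """Rebuild the original word from the signed digits and the final carry."""
    total = carry * (1 << (n * len(digits)))
    for i, d in enumerate(digits):
        total += d * (1 << (n * i))
    return total


x = 0b1011010111000111
digits, flags, carry = encode_input(x)
print(digits, flags, carry)  # [7, -4, 6, -5] [0, 1, 0, 1] 1
print(decode(digits, carry) == x)  # True
```

The magnitudes 7, 4, 6, 5 and the flags 0, 1, 0, 1 agree with the values quoted in the example, and the final carry corresponds to the 2^16 term of the product.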

Figure 2: Input Coding internal operation Block Diagram

Let us consider an example in which the input word X consists of the bits {1011010111000111}. These bits are decomposed into four sub-words of 4 bits each, written as X = {(1011) (0101) (1100) (0111)}. Taken from the least significant sub-word upward, the equivalent integer values of the sub-words are {7, 12, 5, 11}, and the corresponding F bits are {0, 1, 0, 1}. The sub-words of X are then encoded as {7 (0111), 4 (0100), 6 (0110), 5 (0101)}. Here a sub-word with F = 1 is replaced by its complement with respect to 2^4 and enters the product with a negative sign, while passing a carry of 1 to the next sub-word; this is why 5 is encoded as 6 and why the final carry contributes the 2^16 term. The product value computed using this encoded representation is C = A·X = 7A − (4×2^4)A + (6×2^8)A − (5×2^12)A + 2^16 A. The circuit diagram is shown in Figure 1.

3. CONVENTIONAL LUT MULTIPLIER BASED FINITE IMPULSE RESPONSE FILTER

FIR filters are among the most basic building blocks in digital signal processing and are commonly used to test DSP hardware performance [8]. Multiply-accumulate (MAC) operations must be performed at an ever-increasing rate, with demands measured in the number of MACs per second. The filter works with both symmetrical and asymmetrical coefficients and supports internal cascading for better performance. A maximum of 16 bits for the input, 24 bits for the coefficient data and 24 bits for the output is supported. It can perform both unsigned and signed arithmetic operations. A LUT-based design yields a high-performance finite impulse response filter. LUT-based FIR filters are used in a variety of applications:

• High-pass, Low-pass, Band-pass
• Speech synthesis, rcosine & root rcosine filter, Waveform shaping
• Digital decimation in DSL applications
• High speed modems, ADSL, VDSL, SDSL
• Image processing, Digital demodulator, HDTV, DTV
• Communication/Networking DSP, Multimedia, Speech Codecs

Data presented at the filter data input is stored within the filter module in an array of internal registers - one per tap. Filter coefficients provided by the user are stored in internal look-up tables and accessed during filter operation in accordance with the arithmetic algorithm. Partial results from each look-up table are added to form a result at the filter output.

Figure 4: Finite Impulse Response Filter Block Diagram.

The combinatorial logic may be implemented as a small look-up table (LUT) memory or as a set of multiplexers and gates. LUT devices tend to be somewhat more flexible and provide more inputs per cell than multiplexer cells, at the expense of propagation delay. The functional approaches perform Boolean decomposition of the logic functions of the nodes into sub-functions of limited support size that are realizable by individual LUTs. In practice, FPGA mapping for large designs is done using structural mappers, whereas functional mappers are used for resynthesis after technology mapping.

An input coding (IP coding) based LUT multiplier (IP-LUTM) is used, which reduces the additional computation time required for every new input word to the desired level. The IP-LUTM replaces the conventional multiplier scheme used in the earlier techniques. The block diagram in Figure 5 represents the proposed IP-LUTM based finite impulse response filter. After the multiplication is performed by the proposed LUT multipliers, the output of each block is passed to the arithmetic adder unit, where the summation is performed, and the final output Y(n) is obtained as shown in the figure. In the proposed method, five taps are considered, although any number of taps can be handled with this method.

The direct form realization of the FIR filter is shown in Figure 4. The bit-widths of the input data and the coefficients, and the number of filter taps, have to be defined initially. A parameterization window indicates the number of bits necessary at the output to encompass the dynamic range of the filter output. The output may be reduced to fewer bits if desired.
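The data flow described above (one register per tap, one LUT multiplier per coefficient, and an adder that forms Y(n)) can be sketched behaviorally in Python. This is only an illustrative model under assumptions, not the synthesized design: samples are taken as unsigned 16-bit values, each coefficient LUT stores the multiples 0 to 8 of the coefficient (chosen here to be consistent with a 4-to-9 line decoder), and all function names and coefficient values are hypothetical.

```python
def encode_digits(x, width=16, n=4):
    """Signed n-bit digit recoding of x (see the input coding rule).

    Assumption: a sub-word >= 2**(n-1) becomes value - 2**n and carries 1.
    """
    base, half = 1 << n, 1 << (n - 1)
    digits, carry = [], 0
    for i in range(width // n):
        value = ((x >> (i * n)) & (base - 1)) + carry
        carry = 1 if value >= half else 0
        digits.append(value - base if carry else value)
    return digits, carry


def build_coeff_lut(coeff, n=4):
    """LUT with the multiples 0..2**(n-1) of one filter coefficient."""
    return [coeff * m for m in range((1 << (n - 1)) + 1)]


def lut_multiply(lut, x, width=16, n=4):
    """coeff * x using only LUT reads, shifts, and add/subtract."""
    digits, carry = encode_digits(x, width, n)
    product = (lut[1] << width) if carry else 0  # final carry term
    for i, d in enumerate(digits):
        term = lut[abs(d)] << (n * i)
        product += term if d >= 0 else -term
    return product


def fir_filter(samples, coeffs):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k)."""
    luts = [build_coeff_lut(h) for h in coeffs]
    taps = [0] * len(coeffs)  # one register per tap
    output = []
    for sample in samples:
        taps = [sample] + taps[:-1]  # shift the delay line
        output.append(sum(lut_multiply(l, t) for l, t in zip(luts, taps)))
    return output


coeffs = [1, -2, 3, -2, 1]  # five hypothetical tap coefficients
samples = [0, 1, 7, 8, 255, 46535, 65535]
print(fir_filter(samples, coeffs))
```

The model checks only the arithmetic; timing, word-length truncation at the output, and the reset behaviour of the LUT are outside its scope.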



Look-Up Table multipliers are simply a block of memory containing a complete multiplication table of all possible input combinations. The large table sizes needed for even modest input widths make these impractical for FPGAs. Their features are as follows:
• Complete times table of all possible input combinations
• One address bit for each bit in each input
• Table size grows exponentially
• Very limited use
• Fast: the result is just a memory access away
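The growth in table size can be made concrete with a small Python sketch. It assumes two unsigned n-bit inputs, so the table has 2n address bits and each product needs 2n bits; the function names are hypothetical.

```python
def full_lut_entries(n):
    """Entries in a complete times table for two n-bit inputs (2n address bits)."""
    return 1 << (2 * n)


def full_lut_bits(n):
    """Total storage in bits, with each product stored in 2n bits."""
    return full_lut_entries(n) * 2 * n


def build_full_lut(n):
    """Complete multiplication table, addressed by (a << n) | b."""
    return [a * b for a in range(1 << n) for b in range(1 << n)]


def full_lut_multiply(table, a, b, n):
    """The product is a single memory read."""
    return table[(a << n) | b]


table = build_full_lut(4)
print(full_lut_multiply(table, 11, 13, 4))  # 143

for n in (4, 8, 16):
    print(n, full_lut_entries(n), full_lut_bits(n))
```

Each additional input bit on both operands multiplies the number of entries by four, which is why the full table is practical only for very short words.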

Figure 5: Proposed IP-LUTM based Finite Impulse Response Filter Block Diagram.

5. CONCLUSION

This paper describes an advanced method that provides an efficient way to exploit FPGAs with effective area utilization. The methodology reduces the additional components needed to perform the logical and functional operations. The approach is used to design a finite impulse response filter. It is shown that the LUT designed with this method can store the required number of words and can compute the desired result within the limited LUT space. This reduces the computation time, and the delay is reduced to the desired extent. The comparison with the earlier method with respect to the area-delay product is shown in Table 1. The area-delay product of the proposed design is reduced, as tested by designing a filter circuit of a DSP core that can be used in FPGAs. The applicable areas are mentioned in the section above.

Table 1: Comparison of the area-delay product

Multiplier             Input word length   Area (um2)   Delay (ns)   Area-delay product (um2·ns)
LUTM                   8-bit               654.09       2.97         1942.7
LUTM                   16-bit              2650.10      5.80         15370.5
LUTM                   32-bit              8971.90      9.54         85591.9
IP-coding based LUTM   8-bit               610.12       2.30         1403.3
IP-coding based LUTM   16-bit              2096.70      5.10         10693.2
IP-coding based LUTM   32-bit              7017.28      9.16         64278.3
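As a check on the arithmetic, the area-delay product is the area multiplied by the delay, and the relative improvement follows from the two products at each word length. The Python sketch below recomputes both from the area and delay figures of Table 1; the dictionary layout and function names are assumptions made for the illustration.

```python
# Area (um2) and delay (ns) copied from Table 1, keyed by input word length.
TABLE_1 = {
    "LUTM": {8: (654.09, 2.97), 16: (2650.10, 5.80), 32: (8971.90, 9.54)},
    "IP-coding based LUTM": {8: (610.12, 2.30), 16: (2096.70, 5.10), 32: (7017.28, 9.16)},
}


def area_delay_product(area_um2, delay_ns):
    """Area-delay product in um2*ns."""
    return area_um2 * delay_ns


def percent_reduction(baseline, proposed):
    """Relative reduction of the proposed value with respect to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


for bits in (8, 16, 32):
    base = area_delay_product(*TABLE_1["LUTM"][bits])
    prop = area_delay_product(*TABLE_1["IP-coding based LUTM"][bits])
    print(f"{bits}-bit: {base:.1f} -> {prop:.1f} um2*ns "
          f"({percent_reduction(base, prop):.1f}% lower)")
```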

The design process enhances system performance not only in terms of speed and area; it also increases the overall throughput by increasing the signal transmission rate. Hence, it is applicable to signal processing tasks. It requires N times fewer decoders, and the memory requirement is reduced to half that of the usual design method. As a result, the implementation of an N-tap FIR filter with the same throughput per cycle achieves more than 30% reduction in area and around 15% reduction in delay compared with the usual design methods (DA, conventional multiplier). The approach could be used for memory-based implementation of cyclic and linear convolutions, sinusoidal transforms, and inner-product computation.

ACKNOWLEDGEMENT

The authors gratefully acknowledge the support provided by S. Balaji, Head of Department (Electronics and Computer Engineering), K.L.University, Vaddeswaram, Guntur District, A.P, India, for carrying out this work.

REFERENCES

[1] A. K. Sharma, Advanced Semiconductor Memories: Architectures, Designs, and Applications. Piscataway, NJ: IEEE Press, 2003.
[2] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “A systolic array architecture for the discrete sine transform,” IEEE Trans. Signal Process., Sep. 2002.
[3] Elias Ahmed and Jonathan Rose, “The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. XX, no. Y, month 2003.
[4] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “A memory-efficient realization of cyclic convolution and its application to discrete cosine transform,” IEEE Trans. Circuits Syst. Video Technol., Mar. 2005.
[5] M. Mehendale, S. D. Sherlekar, “Area-delay tradeoff in distributed arithmetic based implementation of FIR filters,” in Proc. 10th Int. Conf.
[6] P. K. Meher, “New approach to LUT implementation and accumulation for memory-based multiplication,” May 2009.
[7] P. K. Meher, “New look-up-table optimizations for memory-based multiplication,” in Proc. ISIC, Dec. 2009.
[8] Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner, “FPGA Implementation of High Speed FIR Filters Using Add and Shift Method,” IEEE Trans. Circuits Syst., 2006.

6. FUTURE SCOPE

In this paper, an area-efficient optimization scheme has been proposed for effective utilization of the LUT. The design process reduces the required area and increases the overall system performance. In future work, the power consumed by the decomposition can be considered and reduced, which would further increase the overall system performance.



