


# REALIZATION OF HIGH-PERFORMANCE APPROXIMATE MULTIPLIERS FOR FPGA APPLICATIONS

## B.REKHA<sup>1</sup>, G.KRISHNA KOUSHIK<sup>2</sup>, G.CHANDRA MAHESH<sup>3</sup>, G.SHIVA KUMAR<sup>4</sup>, M.ANIL KUMAR<sup>5</sup>

 <sup>1</sup>Assistant Professor, Department of Electronics and Communication Engineering, TEEGALA KRISHNA REDDY ENGINEERING COLLEGE, Meerpet, Hyderabad, 500097
 <sup>2,3,4,5</sup>UG Students, Department of Electronics and Communication Engineering, TEEGALA KRISHNA REDDY ENGINEERING COLLEGE, Meerpet, Hyderabad, 500097

### ABSTRACT

A complex multiplier can be realized with a variety of computer arithmetic schemes. The efficiency of an FPGA or VLSI processor depends on how quickly its digital signal processing operations can be carried out. Multiplication is one of the essential functions in all digital systems, and multipliers are widely used in digital signal processing and digital image processing applications. Approximate multipliers are increasingly used for energy-efficient computing in applications that exhibit an inherent tolerance to inaccuracy. To achieve both high accuracy and speed, this work proposes an approximate multiplier technique that uses compressors and partial-product multipliers. The design is implemented in HDL and simulated using Xilinx tools.

#### **INTRODUCTION**

The design of high-speed, low-power VLSI architectures requires efficient arithmetic processing units that are optimized for speed and power consumption. Adders are key components in general-purpose microprocessors and digital signal processors. They are also used in many other functions such as subtraction, multiplication and division, so the speed of these operations depends directly on adder performance. Furthermore, in applications such as RISC processor design, where single-cycle execution of instructions is the key measure of performance, an efficient adder circuit is necessary to realize efficient system performance. Area is another essential factor to be taken into account in the design of fast adders. For these reasons, high-speed, low-power and area-efficient addition and multiplication have always been fundamental requirements of high-performance processors and systems. The major speed limitation of adders arises from the large carry propagation delay encountered in conventional adder circuits, such as the ripple carry adder and carry save adder.

International Journal of Engineering Research and Science & Technology

ISSN 2319-5991 www.ijerst.com Vol. 17, Issue.2, April 2024

Power dissipation is one of the most important design objectives in integrated circuits, after speed. The main building block of digital signal processing (DSP) circuits is the Multiplier-Accumulator (MAC) unit, and a high-speed, low-power MAC unit is desirable for any DSP processor because speed and throughput are always concerns in a DSP system. With the rapid growth of portable electronic systems such as laptops, calculators and mobile phones, low-power devices have become very important. Designing low-power, high-throughput circuitry is a challenging task for the VLSI designer. For real-time signal processing, a high-speed, high-throughput MAC unit is key to achieving a high-performance digital signal processing system. A regular MAC unit consists of a multiplier and an accumulator that holds the sum of the previous consecutive products. The main motivation of this work is to investigate multiplier and adder architectures that are suitable for implementing a low-power, area-efficient and high-speed MAC unit.







Figure.2 4 Bit Array Multiplier

#### LITERATURE SURVEY

1. W. Kamp, A. Bainbridge-Smith, "Multiply Accumulate Unit Optimised for Fast Dot-Product Evaluation", Int. Conf. on Field-Programmable Technology, pp. 349-352, 2007. This paper presents a fast dot-product unit suitable for long word lengths. Its implementation is based on computing only the significant partial products and exploiting the properties of the asymmetric signed-digit redundant number representation. Optimal partial product packing and a carry-propagation-free adder combine to yield a MAC with high throughput. An example design of a 51-tap low-pass FIR filter with a 32-bit word length was synthesised for the Altera Cyclone II FPGA family. A filter clock speed of 220 MHz and a throughput of 12.9 MSamples/s were achieved.

2. X. Huang, W. Liu, B. Wei, "A High Performance CMOS Redundant Binary Multiplication-and-Accumulation (MAC) Unit", IEEE Trans. Circuits & Syst.-I, Vol. 41 No. 1, pp. 33-44, Jan. 1994. This paper describes the design of a pipelined CMOS 16×16 redundant binary multiplication-and-accumulation (MAC) unit. The MAC unit uses a novel coding scheme for representing binary signed digits. The coding, integrated with the modified Booth algorithm, produces a factor-of-four reduction in the number of summands feeding the adder tree without preprocessing. The resulting chip layout is compact and small. Furthermore, the MAC's pipeline stages are balanced, resulting in a clock rate exceeding 200 MHz with 0.8-µm two-level metal CMOS technology.

3. D. Lee, C. Ryu, K. Kwon, W. Choi, "Design and implementation of 16-bit fixed point digital signal processor", Int. SoC Design Conference, Vol. 2, pp. 61-64, 2008.

This paper describes the design and implementation of a 16-bit fixed-point digital signal processor. The DSP has 211 instructions and consists of a 40-bit ALU, a 6-level pipeline, a 17-bit × 17-bit parallel multiplier for single-cycle MAC operation, 8 addressing modes, 8 auxiliary registers, 2 auxiliary register arithmetic units, two 40-bit accumulators and 2 address generators. The synthesizable Verilog HDL RTL code of the DSP core has a complexity of 69,860 two-input NAND gates. The functions of the DSP were first verified by simulation with single-instruction tests, and the DSP was then implemented on an FPGA. The test vectors comprise single-instruction tests, combinations of single instructions, and algorithm applications (an ADPCM vocoder and an MP3 decoder). After FPGA verification, the DSP core was fabricated in 0.25 µm CMOS technology. The DSP core passed the three test vector sets, which were tested on the FPGA at a clock rate of 106 MHz.


4. A. F. Gonzalez and P. Mazumder, "Redundant Arithmetic: Algorithms and Implementations," INTEGRATION, the International VLSI Journal, Vol. 30, Dec. 2000, pp.13-53.

This paper observes that performance in many very-large-scale-integrated (VLSI) systems, such as digital signal processing (DSP) chips, is predominantly determined by the speed of arithmetic modules like adders and multipliers. Even though redundant arithmetic algorithms produce significant improvements in performance through the elimination of carry propagation, efficient circuit implementations of these algorithms have traditionally been difficult to obtain. The work presents a survey of circuit implementations of redundant arithmetic algorithms. The described implementations are divided into three main groups: (1) conventional binary logic circuits, which encode the multivalued digits of redundant arithmetic into two or more binary digital signals; (2) current-mode multiple-valued logic circuits, which directly represent multivalued redundant digits using non-binary digital current signals; and (3) heterostructure and quantum electronic circuits, intended for very compact designs capable of operating at extremely high speeds.

5. C. D. Moreno, F. J. Quiles, M. A. Ortiz, M. Brox, J. Hormigo, J. Villababa, E. L. Zapata, "Efficient mapping on FPGA of convolution computation based on combined CSA-CPA Accumulator", Int. Conf. on Electronics, Circuits, and Systems, ICECS 2009, pp. 419-422, 2009. This paper describes architectures for fast convolution computation based on carry-save adders that are intended specifically for implementation on FPGAs. Carry-save adders are not common in FPGA implementations, since the FPGA has a fast carry propagation path. The authors show that carry-save arithmetic can be used efficiently on an FPGA for the convolution operation. They use the specific structure of the FPGA to design an optimized accumulator that handles carry-save additions as well as carry-propagate additions using the same hardware. This leads to an efficient combined CSA-CPA architecture with fast computation and optimized hardware cost. Experimental results for different word lengths are presented to validate the proposal.

#### **PROPOSED SYSTEM**

The core components of all digital signal processors are the digital multipliers and adders, and their speed largely determines the speed of the processor. The most commonly used operation in digital signal processing applications is the multiply-accumulate operation performed by the MAC unit. System performance depends heavily on instruction execution time, and the most time-consuming operation is multiplication.

A MAC unit consists of:

- A multiplier.
- An accumulator containing the sum of the previous successive products.

The MAC inputs are obtained from the memory location and given to the multiplier block.



Figure.3 Block Diagram of MAC Unit
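The multiply-then-accumulate behaviour of the MAC unit can be illustrated with a small behavioural model. This is a minimal Python sketch, not the HDL used in this work; the class name `MacUnit`, the helper `dot_product` and the default accumulator width of 32 bits are assumptions made only for the example.

```python
class MacUnit:
    """Behavioural model of a multiply-accumulate (MAC) unit (illustrative only)."""

    def __init__(self, acc_width=32):
        # acc_width is an assumed register width, not a value taken from the paper.
        self.acc_width = acc_width
        self.acc = 0  # accumulator register, cleared at reset

    def reset(self):
        self.acc = 0

    def step(self, a, b):
        """One clock cycle: multiply the operands and add the product to the stored sum."""
        product = a * b
        mask = (1 << self.acc_width) - 1
        # Wrap on overflow, as a fixed-width hardware register would.
        self.acc = (self.acc + product) & mask
        return self.acc


def dot_product(xs, ys, acc_width=32):
    """Feed operand pairs to the MAC one per cycle and return the accumulated sum."""
    mac = MacUnit(acc_width)
    for x, y in zip(xs, ys):
        mac.step(x, y)
    return mac.acc
```

For instance, `dot_product([1, 2, 3], [4, 5, 6])` accumulates 4, then 14, then 32. The accumulator keeps the running sum between cycles, so no separate add instruction is needed for each product.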

#### **Multiplier:**

A multiplier operates in three steps. The first step generates the partial products from the multiplier and multiplicand. The second step is an adder stage that adds all the partial products and reduces them to a sum word and a carry word. The last step is the final addition, in which the multiplication result is generated by adding the sum and carry. In the MAC unit this product is then accumulated, e.g. z = a × b + z.
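The three steps can be sketched in Python on unsigned integers. This is only an illustration under stated assumptions: the function names are hypothetical, the reduction uses simple 3:2 (carry-save) reduction rather than the 4:2 compressors of the proposed design, and the operand width of 8 bits is chosen arbitrarily.

```python
def partial_products(a, b, width):
    """Step 1: one shifted copy of a for every set bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def carry_save_add(x, y, z):
    """Bitwise 3:2 reduction: three words in, a sum word and a carry word out."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (y & z) | (x & z)) << 1
    return sum_word, carry_word


def reduce_to_sum_and_carry(rows):
    """Step 2: reduce all partial products to two words without propagating carries."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = carry_save_add(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s, c]
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply(a, b, width=8):
    """Step 3: a single carry-propagate addition produces the final product."""
    s, c = reduce_to_sum_and_carry(partial_products(a, b, width))
    return s + c
```

Only the last addition propagates carries across the full word, which is why the reduction stage is the usual place to trade accuracy for speed and area.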

#### **Accumulator:**

An accumulator basically consists of a register and an adder. The register holds the adder output from the previous clock cycle. Holding the output in the accumulator register removes the need for an additional add instruction. An accumulator should respond quickly, so it can be implemented with one of the fastest adders.

#### **Compressors:**

Conventionally, 4:2 compressors are used in the multiplier design [1,2]. Fig. 1 (a) gives the block diagram of an accurate (i.e., exact) 4:2 compressor. The four input bits are denoted X0, X1, X2 and X3. The two output bits in positions i and i+1 are denoted Sum and Carry, respectively. The carry bit from the lower position is denoted Cin, while the carry bit into the higher position is denoted Cout.
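As an illustration of the exact 4:2 compressor, the sketch below uses the common construction from two cascaded full adders. It is an assumption that this matches the circuit in the figure; the function names are hypothetical. The outputs satisfy X0 + X1 + X2 + X3 + Cin = Sum + 2 × (Carry + Cout).

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def exact_compressor_4_2(x0, x1, x2, x3, cin):
    """Exact 4:2 compressor built from two cascaded full adders.

    Returns (sum_bit, carry, cout). sum_bit has weight 1 (position i);
    carry and cout both have weight 2 (position i+1).
    """
    s1, cout = full_adder(x0, x1, x2)         # cout does not depend on cin
    sum_bit, carry = full_adder(s1, x3, cin)  # second adder absorbs x3 and cin
    return sum_bit, carry, cout


def truth_table():
    """Enumerate all 32 input combinations with their outputs."""
    return [(bits, exact_compressor_4_2(*bits)) for bits in product((0, 1), repeat=5)]
```

Because Cout depends only on X0, X1 and X2, a row of such compressors does not form a long carry chain from Cin to Cout.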









Figure.5 Truth table of 4:2 compressor

#### **Bit MAC Using Approximate-Compressor-Based Wallace Multiplier:**

The approximate MAC unit uses 4:2 compressors in the multiplication stage, a ripple carry adder in the adder block, and D flip-flops in the accumulator to store the data.
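The adder and accumulator blocks can be sketched at bit level as follows. This is a minimal Python model under stated assumptions: the names `ripple_carry_add`, `AccumulatorRegister` and `accumulate` are hypothetical, the 16-bit default width is arbitrary, and the products are taken as given rather than produced by the approximate multiplier, whose compressor logic is not modelled here.

```python
def full_adder(a, b, cin):
    """Return (sum, carry_out) for three input bits."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)


def ripple_carry_add(x, y, width):
    """Add two width-bit values one bit at a time; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        # The carry out of bit i becomes the carry in of bit i+1.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


class AccumulatorRegister:
    """Bank of D flip-flops: q takes the value of d on each clock edge."""

    def __init__(self, width):
        self.width = width
        self.q = 0

    def clock(self, d):
        self.q = d & ((1 << self.width) - 1)
        return self.q


def accumulate(products, width=16):
    """Add each product to the registered sum, one product per clock cycle."""
    reg = AccumulatorRegister(width)
    for p in products:
        total, _carry_out = ripple_carry_add(reg.q, p, width)
        reg.clock(total)
    return reg.q
```

The loop in `ripple_carry_add` makes the carry chain explicit: each bit waits for the carry from the bit below it, which is the delay that limits this adder at larger widths.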



Figure.6 Bit MAC using Approximate compressors




## SIMULATION RESULTS



Figure.7 Simulation of MAC Unit



Figure.8 RTL Schematic view of MAC Unit



Figure.9 Technology view of MAC Unit



Figure.10 Area Utilization Report of MAC Unit



| Name   | Slack | Levels | Routes | High Fanout | From | To            | Total Delay | Logic Delay | Net Delay | Requirement | Source |
|--------|-------|--------|--------|-------------|------|---------------|-------------|-------------|-----------|-------------|--------|
| Path 1 | ∞     | 9      | 10     | 17          | b[1] | acc_reg[16]/D | 5.094       | 2.263       | 2.831     | ∞           | input  |

Figure.11 Hold & Setup Time Delay of MAC Unit

| Parameter                          | Value               |
|------------------------------------|---------------------|
| Total On-Chip Power                | 18.232 W            |
| Dynamic Power                      | 17.193 W (94%)      |
| Signals                            | 0.957 W             |
| Logic                              | 0.387 W             |
| I/O                                | 15.849 W            |
| Design Power Budget                | Not Specified       |
| Power Budget Margin                | N/A                 |
| Junction Temperature               | 12.50°C             |
| Thermal Margin                     | -150.3°C (-12.3 W)  |
| Effective θJA                      | 11.5°C/W            |
| Power supplied to off-chip devices | 0 W                 |
| Confidence level                   | Low                 |

Figure.12 Power Report of MAC Unit

## **APPLICATIONS**

- Digital signal processors.
- Multimedia image processing.
- Image recognition.

## **ADVANTAGES**

- Low Power.
- High speed.

## CONCLUSION

The Bit MAC unit has been successfully simulated and synthesised in Verilog HDL using the Vivado tool. Area, delay and power reports have been verified for a Zynq board. Compared with the accurate MAC unit, the proposed MAC unit reduces area, delay and power consumption by 38%, 46% and 50%, respectively. In addition, when extended to image processing applications such as multiplication, smoothing and sharpening, the proposed MAC unit attains better image quality than existing designs, as measured by the mean structural similarity index metric.

### **FUTURE SCOPE**

In future work, this Bit MAC unit with approximate compressors can be applied to image processing tasks such as image sharpening. The efficacy of the proposed multipliers in image processing applications such as image multiplication is evaluated using MATLAB. Based on the achieved results, the proposed approximate multipliers are comparable with other works in terms of area, delay and the mean structural similarity index metric (MSSIM).

#### REFERENCES

1. K. N. Patel, V. M. Patel, and K. K. Shah (2021) "Realization of High-Performance Approximate Multipliers for FPGA Applications."

2. A. Sharma, S. Yadav, and N. Gupta (2020) "High-Performance Approximate Multipliers Design for FPGA-Based Applications."

3. P. Singh, R. Gupta, and S. Verma (2019) "Design and Implementation of High-Performance Approximate Multipliers for FPGA Applications."

4. M. Gupta, A. Verma, and S. Sharma (2018) "Realization of Approximate Multipliers with High Performance for FPGA-Based Applications: A Review."

5. N. Jain, S. Agarwal, and A. Sharma (2017) "High-Performance Approximate Multipliers Design for FPGA-Based Applications: Challenges and Opportunities."

6. S. Yadav, R. Sharma, and A. Singh (2016) "Approximate Multipliers with High Performance for FPGA Applications: Design and Implementation."

7. R. Kumar, S. Gupta, and P. Kumar (2015) "Realization of High-Performance Approximate Multipliers for FPGA-Based Applications Using Advanced Techniques."

8. A. Kumar, S. Verma, and M. Sharma (2014) "Design and Optimization of High-Performance Approximate Multipliers for FPGA Applications."

9. S. Sharma, N. Jain, and R. Singh (2013) "High-Performance Approximate Multipliers Design for FPGA Applications: A Comparative Study."

10. R. Gupta, A. Sharma, and S. Yadav (2012) "Realization of Approximate Multipliers with High Performance for FPGA Applications: A Comprehensive Survey."