Low power residue number system using lookup table decomposition and finite state machine based post computation

Balaji Morasa¹, Padmaja Nimmagadda²
¹Department of Electronics and Communication Engineering, Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, India
²Department of Electronics and Communication Engineering, SriVidyanikethan Engineering College, Tirupati, India

ABSTRACT

In this paper, memory optimization and architectural-level modifications are introduced to realize a low-power residue number system (RNS) with improved flexibility for electroencephalograph (EEG) signal classification. The proposed RNS framework is intended to maximize the reconfigurability of RNS for high-performance finite impulse response (FIR) filter design. The existing power-hungry RAM-based reverse conversion model is replaced with a highly decomposed lookup table (LUT) model that can produce results without any post-accumulation process. The reverse conversion block is modified with an appropriate functional unit to accommodate FIR convolution results. The proposed approach develops and executes pre-calculated inverters for various moduli sets. Therefore, the proposed LUT decomposition with RNS multiplication-based post-accumulation technology provides a high-performance FIR filter architecture that allows different frequency response configuration elements. Experimental results show the superior performance of decomposed LUT-based direct reverse conversion over other existing reverse conversion techniques adopted for energy-efficient RNS FIR implementations. Compared with the conventional RNS FIR design, the proposed finite state machine (FSM) based decomposed RNS FIR reduces the logic elements (LEs) by 4.57%, increases the maximum operating frequency by 31.79%, reduces the LUT size by 42.85%, and reduces the power dissipation by 13.83%.

1. INTRODUCTION

Electroencephalograph (EEG)-based signal measurements are investigated not only for brain-computer interfaces (BCI) but also as a diagnostic channel for many brain-related problems during clinical measurements [1]-[2]. The signal activity measured by non-invasive EEG recordings from the scalp is contaminated by other sources and artifacts [3]. To overcome this contamination during EEG signal classification, several signal processing and feature extraction methods have been investigated for both high-performance BCI and diagnostic measurements. Feature extraction includes spatial filtering to generate different forms of spatial patterns and incorporates covariance analysis to maximize the class differences in spatial scale. However, the rhythmic activities of EEG signals are highly correlated with their associated frequency bands [4]. To account for this, raw EEG signals are pre-processed with narrow-band spectral filters just before the spatial filtering [5]. Selecting the most prominent frequency band manually for each class, together with the associated optimization models, introduces a computational burden. In recent years, the investigation of spatial and spectral filters has steadily emerged as a way of reducing the false rate. The time-domain information of EEG signals alone is not sufficient for classification and evaluation. In real time, the EEG signal is evaluated in different forms by individuals for clinical diagnosis [6].

In general, the analysis of EEG signals involves formulating the spatial weights obtained from the electrodes, which are known as a common spatial pattern (CSP). For signal classification, neural networks are most commonly preferred because they provide a better framework for characterizing these spatial patterns. To process the weights using a multilayer perceptron (MLP), all static weights can be replaced with finite impulse response (FIR) filters, as an extension of existing neural networks, to accumulate the frequency bands that relate to the most appropriate signal measures. Higashi and Tanaka [7] developed a discriminative filter bank using FIR filter design by optimizing an objective function that is a natural extension of CSP. Meng et al. [8] utilized both spatial and spectral features, and the learning task is carried out so as to maximize the discrimination among different class labels. By estimating the parametric distribution of these spatial-spectral features, mutual information (MI) is derived and the cost function is optimized during the iterative learning process.

As opposed to optimizing the cost function of spatio-temporal features, several models have been proposed [9] that optimize the filters for cost-effective signal measurement and analysis. On the other hand, the success of common spatial pattern (CSP) analysis depends directly on the order of the FIR filter. The core objective of this paper is to develop a new optimized FIR core for spatio-temporal filtering. For better discriminability, FIR temporal filters require improved flexibility and a higher filter order. In this paper, a high-performance FIR filter design using a residue number system (RNS) is proposed, which combines the benefits of parallel processing, complexity reduction, and energy efficiency. In this context, the computational intensity of the FIR filter processing blocks is reduced using RNS arithmetic.

2. RESEARCH METHOD

Several attempts have been made [10]-[11] to optimize the accumulator and multiplication unit so that the complete system requirements of filter design are met with appropriate hardware units. In existing works, the methodologies developed for RNS computation fall broadly into two types: LUT-based models and conventional binary modules. A lookup table based RNS system offers improved performance over smaller moduli sets, whereas a binarized RNS model performs better for large moduli. In most cases, the accuracy of the FIR filter depends largely on the number of FIR coefficients and their precision levels.

However, LUT-based reverse computation for RNS arithmetic has a large computational cost and takes a long time. A binary coded structure for calculating residues and a thermometer coded style for generating modular inner products were proposed in [12]. When designing FIR filters, this distributed arithmetic uses carry-free accumulation and pre-computed LUT blocks in order to maximize operating speed and minimize hardware complexity [13]. A high-performance Booth multiplier is incorporated into the RNS accumulator to provide FIR filter design flexibility and low complexity. Because it depends on a redundant residue number system [14], the low-cost fault-tolerant FIR filter does not require any additional hardware. A proper down-converted moduli set is used for FIR calculations to eliminate faults generated by MAC computations under single event upset (SEU) [15]. A binary-to-residue converter with fewer modular multiplications was presented in order to reduce hardware complexity and power consumption. In this technique, a pre-loaded product block reduces the computational cost and latency of generating partial products for each FIR tap. As discussed in [16], end-around carry (EAC) units eliminate the performance trade-off inherent in any RNS FIR filter design that includes additional taps. Touil et al. [17] carried out FIR filter design optimization using non-recursive filtering algorithms and an appropriate mathematical model. A structural implementation was developed for a modular digital filter by using residue ring measures and the symmetric characteristics of FIR filter responses for optimization. Compared with the existing modular filter, that FIR design reduces the impulse response by half of the FIR length. The hardware complexity is considerably reduced using finite field algebra, and the accuracy of computation is also improved in modular arithmetic-based digital FIR filter implementations.

The performance of a hardware implementation of the RNS system depends on the reverse conversion process, which is complex in nature. To mitigate this problem, the diagonal function (DF) is introduced in [18] for RNS construction, which allows efficient hardware implementations of modulo $2^n$, $2^n-1$, or $2^n+1$. Different approaches are constructed using the DF to accommodate the different forms of magnitude comparison and the reverse conversion process. Jaberipur et al. [19] introduced diminished-1 (D1) encoding in the RNS system and its potential metrics on modulo-($2^n+1$) addition, subtraction,

Convolutional neural networks (CNNs) are widely used in many pattern recognition systems, but they require a large amount of memory to hold weights during the learning process. To reduce the hardware cost of CNN implementation, the residue number system (RNS) is used in [20] for each layer of the convolutional neural network. The hardware implementation of the RNS-based CNN showed that the use of residue arithmetic saves 7.86%–37.78% of hardware resources as compared to the conventional two's complement method. The RNS CNN also reduces the overall recognition time by 41.17%. Valueva et al. [21] decomposed the CNN architecture into hardware and software parts to maximize the system performance of the RNS hardware components. The inclusion of software parts offers significant memory efficiency during the learning process. Cardarilli et al. [22] analyzed the characteristics of RNS and the conventional two's complement system (TCS) in the different stages of DSP applications and introduced a design space exploration (DSE) methodology for high-performance digital FIR filter implementation. This DSE offers energy efficiency for several emerging applications such as machine learning and the internet of things.

Vinitha and Sharma [23] proposed a distributed arithmetic (DA) based FIR implementation using an efficient lookup table (LUT) design. By utilizing the even multiple storage (EMS) scheme, the size of the LUT-based multiplier is reduced by half, which reduces the path delay and the computational complexity overhead in FIR design. In [24], a memory-efficient, ROM-free reverse converter design is proposed for different sizes of moduli sets to perform high-speed arithmetic with a highly balanced modulus and appropriate adder units. The memory-less RNS offers significant energy efficiency, at the cost of some notable path delay accumulation due to dynamic post computation.

Pontarelli et al. [25] presented a comparative analysis of FIR filter design using the RNS against other well-known models that use the conventional positional number system. The RNS-based FPGA hardware implementation shows that the operating frequency of FIR filters is increased by about 4 times, and the computational complexity is reduced by 3 times, when the RNS is used for FIR computation instead of the traditional binary number system. In addition to the path delay and area efficiency, the RNS-based FIR design also offers energy efficiency of up to 23%. In an RNS, inter-modulo computation consumes the most resources, and these are directly associated with the complex reverse conversion process.

NavaeiLavasani et al. [26] used the mixed-radix conversion (MRC) algorithm to implement an RNS reverse converter design for ternary computation with the moduli set $\{3^n-2, 3^n-1, 3^n\}$. The integration of the RNS in ternary DSP applications offers effective number representation and maximizes the inherent parallelism for a high throughput rate. For hardware-efficient RNS implementation in applications such as edge computing and FIR filter design, these reverse converters need to be embedded as a simplified arithmetic unit for each class of moduli sets. Effective hardware resource utilization [26] also saves a considerable amount of area and power consumption in the hardware realization of the RNS, with a negligible penalty in path delay. The experimental results show the superiority of reverse converter optimization using different design methodologies.
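
To make the mixed-radix conversion idea concrete, the following is a minimal software sketch of textbook MRC, not the hardware converter of [26]. It assumes the ternary moduli set with $n = 2$, that is $(7, 8, 9)$; the function names `mixed_radix_digits` and `mrc_to_int` are hypothetical and chosen only for illustration.

```python
from math import prod

def mixed_radix_digits(residues, moduli):
    """Return mixed-radix digits a_i with x = a_0 + a_1*m_0 + a_2*m_0*m_1 + ..."""
    r = list(residues)
    digits = []
    for i, mi in enumerate(moduli):
        digits.append(r[i])
        # Remove the digit just found from every remaining channel,
        # then divide by m_i using its modular inverse (Python 3.8+).
        for j in range(i + 1, len(moduli)):
            mj = moduli[j]
            r[j] = (r[j] - digits[i]) * pow(mi, -1, mj) % mj
    return digits

def mrc_to_int(residues, moduli):
    """Weighted sum of the mixed-radix digits gives the integer in [0, M-1]."""
    value, weight = 0, 1
    for d, m in zip(mixed_radix_digits(residues, moduli), moduli):
        value += d * weight
        weight *= m
    return value

MODULI = (7, 8, 9)  # assumed instance of the ternary set for n = 2
x = 38
residues = tuple(x % m for m in MODULI)
print(residues, mrc_to_int(residues, MODULI))
```

MRC works digit by digit, so each digit depends on the previous one; this sequential dependency is one reason reverse conversion tends to dominate the path delay.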

Using less hardware, systolic array architecture [27] lowers the critical path, allowing for faster processing times. Because of the use of parallel processing and pipelines, the overall chip size and power consumption have been reduced significantly. Optimizing hardware resources for higher-order filters presents a major difficulty.

3. PROPOSED RNS SYSTEM IN FIR FILTER DESIGN

For FIR filter design as shown in Figure 1, the input signal samples and filter coefficients take real values, including negative numbers. In an RNS, only non-negative integers are used for arithmetic computation, and the dynamic range is constrained to \([0, M−1]\), where \(M\) is the product of the moduli. To accommodate negative numbers in an RNS-based FIR filter implementation, number conversions are applied to the binary representation.
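
One common way to handle negative values, shown here only as an assumed convention since the paper does not specify its conversion, is to let the upper half of \([0, M−1]\) represent negative numbers. The sketch below uses the moduli set (7, 8, 9) from the experiments; `encode_signed` and `decode_signed` are hypothetical helper names.

```python
from math import prod

MODULI = (7, 8, 9)   # moduli set used for the 8-bit case in this paper
M = prod(MODULI)     # dynamic range M = 504

def encode_signed(x: int) -> tuple:
    """Map a signed integer to residues; negative x wraps to M + x."""
    half = M // 2
    if not -half <= x < half:
        raise OverflowError("value outside the signed dynamic range")
    # Python's % already returns a non-negative result for positive moduli.
    return tuple(x % m for m in MODULI)

def decode_signed(value: int) -> int:
    """Interpret a reverse-converted value in [0, M-1] as a signed integer."""
    return value - M if value >= M // 2 else value

print(encode_signed(-5))   # residues of 499, since -5 is stored as M - 5
print(decode_signed(499))
```

With this convention the usable signed range is \([-M/2, M/2 - 1]\), so the moduli must be chosen with the halved range in mind.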

In recent years, DSP applications have dealt with high-bit-size samples for enhanced precision, which necessitated a huge amount of hardware resources for computation. The word length size is always a trade-off, regardless of the arithmetic models employed for data calculation. To keep the performance metrics in terms of speed, some mathematics based on modular fields for arithmetic computations must be invented.

3.1. Residue number system

In many real-time applications, RNS proves to be a potential alternative to traditional 2's complement arithmetic due to its inherent properties of parallel computation and predetermined parameter measurements using modulo-set formulations. However, a fully automated RNS digital system implementation is still not possible because of its complex post-processing steps, which include performing pre-computation during RNS arithmetic for real-time applications and storing the results in memory. In addition, the modular and distributed nature of RNS provides additional performance benefits, and RNS is widely used in many fields such as cloud computing, wireless communication systems, and DSP applications [28]. Its tolerance for soft errors and its power efficiency make RNS-based data computing a prominent choice for optimizing digital designs. The throughput achieved through RNS-based arithmetic depends on the degree of computational parallelism and on the decomposition levels used during hardware implementation.
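
The parallel, carry-free nature of RNS arithmetic can be illustrated with a short sketch. It assumes the moduli set (7, 8, 9) and unsigned operands whose results stay below \(M = 504\); the helper names `to_rns`, `rns_add`, and `rns_mul` are illustrative and not taken from the paper.

```python
from math import prod

MODULI = (7, 8, 9)
M = prod(MODULI)  # 504

def to_rns(x: int) -> tuple:
    """Forward conversion: one small residue per channel."""
    return tuple(x % m for m in MODULI)

def rns_add(a: tuple, b: tuple) -> tuple:
    """Channel-wise addition; no carry crosses between channels."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a: tuple, b: tuple) -> tuple:
    """Channel-wise multiplication; each channel works on narrow operands."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

x, y = 13, 21
print(rns_add(to_rns(x), to_rns(y)) == to_rns(x + y))
print(rns_mul(to_rns(x), to_rns(y)) == to_rns(x * y))
```

Each channel handles operands of only three or four bits here, which is where the parallelism and shorter carry chains come from.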

3.2. Modulo $m_i$ multiplier

In an RNS-based system, the selected moduli are constants, and the residues remain within those limits even after residue computation. Residue computation uses integer arithmetic, which is performed in parallel. The efficiency of this integer arithmetic also influences the overall RNS system performance. The advantages of an RNS system are high-performance computation, energy efficiency, and reduced hardware complexity.
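
Because the moduli are constants, a modulo $m_i$ multiplier can in principle be realized as a small precomputed table. The sketch below is one possible illustration under that assumption, not the multiplier architecture used in this paper; `build_mul_lut` and `mod_mul` are hypothetical names.

```python
MODULI = (7, 8, 9)

def build_mul_lut(m: int) -> list:
    """Precompute (a * b) mod m for every residue pair of one channel."""
    return [[(a * b) % m for b in range(m)] for a in range(m)]

# One table per channel; sizes are m*m entries (49, 64 and 81 here).
LUTS = {m: build_mul_lut(m) for m in MODULI}

def mod_mul(a: int, b: int, m: int) -> int:
    """Modular multiplication as a pure table lookup."""
    return LUTS[m][a][b]

print(mod_mul(6, 5, 7))                       # 30 mod 7
print(sum(m * m for m in MODULI), "entries")  # total table size
```

The table size grows with the square of the modulus, which is why LUT-based designs favor small moduli.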

As shown in Figure 2, both the input sample values ($x_n$) and the FIR coefficients ($h_n$) are converted into residues using the moduli conversion block, and the final FIR convolution result ($y_n$) is generated using the reverse conversion block. Dynamic range overflow is avoided by adopting moduli of different sizes and different sets of moduli components. The required dynamic range must be determined in advance and the moduli selected accordingly; otherwise the RNS will produce erroneous results. To accomplish this, some statistical measures are required to pre-determine the bit sizes of the moduli and the number of moduli, so that all possible ranges of tap results during FIR computation can be accommodated.
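
As a rough illustration of the dynamic range requirement, the sketch below bounds the worst-case FIR output and checks it against the product of the moduli, then computes one output sample channel by channel. The coefficient values and sample limits are hypothetical, unsigned data is assumed, and the function names are not from the paper.

```python
from math import prod

def worst_case_output(coeffs, x_max):
    """Upper bound on |y_n| for samples bounded by x_max."""
    return sum(abs(h) for h in coeffs) * x_max

def range_is_sufficient(moduli, coeffs, x_max):
    """True if every possible tap sum fits in [0, M-1]."""
    return worst_case_output(coeffs, x_max) < prod(moduli)

def rns_fir_tap_sum(window, coeffs, moduli):
    """One output sample, accumulated independently in every residue channel."""
    out = []
    for m in moduli:
        acc = 0
        for x, h in zip(window, coeffs):
            acc = (acc + (x % m) * (h % m)) % m
        out.append(acc)
    return tuple(out)

coeffs = [1, 2, 3, 1]      # hypothetical 4-tap filter
window = [15, 7, 0, 9]     # hypothetical input window
print(range_is_sufficient((7, 8, 9), coeffs, x_max=15))
print(range_is_sufficient((7, 8, 9), coeffs, x_max=100))
print(rns_fir_tap_sum(window, coeffs, (7, 8, 9)))
```

If the bound exceeds \(M\), different inputs can map to the same residue tuple, so the reverse converter can no longer recover the true output.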

3.3. Reverse converter

As per the Chinese remainder theorem (CRT), the reverse converter module requires some unified computations to convert the resultant residues into an actual integer. Although a direct digital implementation of the reverse converter is difficult because of its iterative division units and numerous multiplication units, many efficient models have been developed to generate the results. In most cases, memory-based pre-computation is used for reverse converter design. The reduction of inter-module carry propagation and the residue-level multi-channel arithmetic in RNS significantly influence system performance in intensive DSP applications. There is no carry propagation between residue channels, and the residue computation is performed concurrently in each channel, which shortens the critical path delay.
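
The CRT-based reverse conversion can be sketched with one small precomputed table per residue channel instead of a single memory indexed by the full residue tuple. This is only an illustrative reading of LUT decomposition under the moduli set (7, 8, 9); it still ends with a modular addition and is not the FSM-based architecture proposed in this paper. The names `crt_weights`, `CHANNEL_LUTS`, and `reverse_convert` are hypothetical.

```python
from math import prod

MODULI = (7, 8, 9)
M = prod(MODULI)  # 504

def crt_weights(moduli):
    """CRT weight for channel i: (M / m_i) * inverse(M / m_i mod m_i) mod M."""
    total = prod(moduli)
    weights = []
    for m in moduli:
        Mi = total // m
        weights.append(Mi * pow(Mi, -1, m) % total)  # pow(..., -1, m): Python 3.8+
    return weights

WEIGHTS = crt_weights(MODULI)

# One table per channel with m_i entries each (7 + 8 + 9 = 24 entries),
# versus 7 * 8 * 9 = 504 entries for a single undecomposed table.
CHANNEL_LUTS = [[(r * w) % M for r in range(m)] for m, w in zip(MODULI, WEIGHTS)]

def reverse_convert(residues):
    """Look up each channel's contribution and add them modulo M."""
    return sum(lut[r] for lut, r in zip(CHANNEL_LUTS, residues)) % M

print(WEIGHTS)
print(reverse_convert((3, 6, 2)))
```

The multiplications and modular inverses are all absorbed into the tables at design time, so only lookups and additions remain at run time.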

4. EXPERIMENTAL RESULTS

4.1. Performance analysis

The FSM-based RNS FIR design is evaluated over different numbers of FIR taps to examine the performance trade-off. As shown in Tables 1 and 2, the proposed LUT-decomposition-driven, DA-based RNS FIR model outperforms the other state-of-the-art methods in terms of achievable throughput, since it benefits both from parallel computation and from a reduced critical path during reverse computation. The suggested RNS system takes advantage of the inherent concurrency within residue channels as well as FPGA device capabilities. In Table 1, for the 8-bit word length with moduli set (7, 8, 9) the area is decreased by 4.5%, whereas for the 16-bit word length with moduli set (31, 32, 33) the area is decreased by 15.36%. For the 8-bit word length with moduli set (7, 8, 9) the maximum frequency is increased by 24.08%, whereas for the 16-bit word length with moduli set (31, 32, 33) the frequency is improved by 16.18%. In Table 2, for the 4-tap FIR filter the area of the proposed method is decreased by 4.15% and the frequency is increased by 13.30%, whereas for the 16-tap FIR filter the area is decreased by 4.98% and the frequency is increased by 19.57% when compared with the conventional reverse computation method.

<table>
<thead>
<tr>
<th>Input word length</th>
<th>Moduli set ($2^n-1$, $2^n$, $2^n+1$)</th>
<th>RNS with conventional reverse computation [29]: area (LEs)</th>
<th>RNS with conventional reverse computation [29]: $F_{\text{max}}$ (MHz)</th>
<th>RNS with LUT decomposed reverse computation (proposed method): area (LEs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>8 bit</td>
<td>(7, 8, 9)</td>
<td>4281</td>
<td>57.3</td>
<td>4088</td>
</tr>
<tr>
<td>16 bit</td>
<td>(31, 32, 33)</td>
<td>14374</td>
<td>24.96</td>
<td>12165</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>FIR length</th>
<th>RNS with conventional reverse computation [29]: area (LEs)</th>
<th>RNS with conventional reverse computation [29]: $F_{\text{max}}$ (MHz)</th>
<th>RNS with LUT decomposed reverse computation (proposed method): area (LEs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 tap</td>
<td>2096</td>
<td>63.46</td>
<td>2009</td>
</tr>
<tr>
<td>16 tap</td>
<td>8623</td>
<td>57.3</td>
<td>8193</td>
</tr>
</tbody>
</table>

4.2. Trade-off analysis

Figure 1 illustrates an architecture that is compatible with all potential dynamic word length variations. Each RAM is resized based on its hierarchical moduli information, and linearization is performed over the aspect ratios to maximize the operand bit size. The aspect ratios are derived analytically using the reverse conversion technique outlined in the previous section. The aspect ratios of certain moduli sets are chosen to be close to the values of the moduli sets.

As demonstrated in Table 3, the growth of RAM size with moduli set bit size is exponential, which is convenient in terms of the attainable frequency response. With the parallel modulo FIR implementation, this table displays the FPGA synthesis results in terms of the number of LEs and delay measures for different moduli sets. The maximum frequency of the filter for the 8-bit word length is 57.3 MHz, while the maximum frequency for the 12-bit and 16-bit word lengths is limited by the accumulation of reverse conversion results. In the state-of-the-art comparison with other FPGA RNS FIR methods, for a 32-bit unsigned input on a Xilinx Spartan 3E FPGA device, the proposed three-moduli FSM decomposed RNS FIR design reduces the power dissipation by more than 50% (from 175 mW to 73.49 mW), and the operating frequency rises from 230 MHz to 362.58 MHz, a difference of 36.56% of the proposed design's frequency [30], [31].

4.3. Critical path retention performance measure

Figure 3 compares the performance trade-off of the conventional RNS reverse computation with that of the proposed model as the number of filter taps, and hence the number of logic elements, grows; the total performance loss of the proposed model is smaller when tested at higher filter orders. For the 4-tap and 16-tap filters, the proposed design uses 2009 and 8193 logic elements, compared with the conventional values of 2096 and 8623, respectively. Figure 4 shows that, during filter convolution, path delay management in the FIR MAC network is achieved using LUT-driven RNS networks, which operate as delay optimization models. The total time necessary to form the convolution output is reduced in this fashion. It is worth noting that the filter length must be sufficient to keep the majority of the finite filter coefficients. As a result, executing high-order filter lengths incurs a lower computational trade-off with the proposed model than with the conventional reverse computation. For the 4-tap, 8-tap, and 16-tap filters, the conventional design uses 2096, 4281, and 8623 logic elements, respectively.

4.4. Comparison with other state-of-the-art RNS FIR models

Owing to the concurrent FSM-based LUT transformation, the proposed RNS system achieves a significant path delay optimization margin over known RNS FIR designs, and the post-accumulation-driven reverse conversion helps address power management and energy-related issues in the RNS FIR system. Compared with the existing RNS FIR model, the proposed RNS consumes fewer logic resources and has the lowest path delay because of its simplified reverse conversion operations. Moreover, the dynamic range of the RNS, which depends largely on the moduli sizes, can be extended without causing a performance trade-off. The performance metrics of the proposed FSM decomposed RNS, which show improved system performance and energy efficiency, are given in Table 4. Compared with the conventional RNS FIR design, the proposed FSM-based decomposed RNS FIR reduces the logic elements (LEs) by 4.57%, increases the maximum frequency by 31.79%, reduces the LUT size by 42.85%, reduces the number of transitions by 5.15%, and reduces the power dissipation by 13.83%.
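
The percentages quoted above can be reproduced from the Table 4 entries with the short check below. The helper names are illustrative. Note that the 31.79% frequency figure matches the difference expressed relative to the proposed design's frequency; relative to the conventional design the same difference is about 46.6%.

```python
def reduction_pct(conventional: float, proposed: float) -> float:
    """Reduction relative to the conventional value, in percent."""
    return (conventional - proposed) / conventional * 100

def gain_vs_proposed_pct(conventional: float, proposed: float) -> float:
    """Increase expressed relative to the proposed value, in percent."""
    return (proposed - conventional) / proposed * 100

def gain_vs_conventional_pct(conventional: float, proposed: float) -> float:
    """Increase expressed relative to the conventional value, in percent."""
    return (proposed - conventional) / conventional * 100

# Values copied from Table 4 (conventional, proposed).
print(f"LEs:         {reduction_pct(3016, 2878):.2f}%")
print(f"LUT size:    {reduction_pct(7, 4):.2f}%")
print(f"Transitions: {reduction_pct(72.521, 68.779):.2f}%")
print(f"Power:       {reduction_pct(200.17, 172.48):.2f}%")
print(f"Fmax (rel. proposed):     {gain_vs_proposed_pct(46.02, 67.47):.2f}%")
print(f"Fmax (rel. conventional): {gain_vs_conventional_pct(46.02, 67.47):.2f}%")
```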

### Table 3. State of the art comparison of proposed FIR design with other FPGA RNS FIR models

<table>
<thead>
<tr>
<th>Methods</th>
<th>RNS Model used</th>
<th>Input/Coefficient size</th>
<th>FPGA device used</th>
<th>Speed (MHz)</th>
<th>Power dissipation (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core functional decimal equivalent binary conversion RNS FIR design [29]</td>
<td>Five moduli set</td>
<td>32-bit unsigned</td>
<td>Xilinx Spartan 3E</td>
<td>230</td>
<td>175</td>
</tr>
<tr>
<td>Proposed FSM decomposed RNS FIR design</td>
<td>Three moduli set</td>
<td>32-bit unsigned</td>
<td>Xilinx Spartan 3E</td>
<td>362.58</td>
<td>73.49</td>
</tr>
</tbody>
</table>

### Table 4. Performance analysis of the proposed FIR design with Quartus II hardware synthesis

<table>
<thead>
<tr>
<th>Multiplier Model</th>
<th>Hardware used (LEs)</th>
<th>$F_{\text{max}}$ (MHz)</th>
<th>Number of LUTs /size</th>
<th>Number of transitions (millions/sec)</th>
<th>Power dissipation (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Conventional RNS FIR [29]</td>
<td>3016</td>
<td>46.02</td>
<td>3/7</td>
<td>72.521</td>
<td>200.17</td>
</tr>
<tr>
<td>FSM based decomposed RNS FIR</td>
<td>2878</td>
<td>67.47</td>
<td>3/4</td>
<td>68.779</td>
<td>172.48</td>
</tr>
</tbody>
</table>

5. CONCLUSION

In this paper, a memory-efficient reverse conversion-based RNS system and an associated FSM-based architectural optimization are presented to reduce the energy consumption of RNS-based FIR filters for EEG signal classification. Design optimization is carried out only in the reverse conversion stage, while the other processing units are kept generic to allow parametric variations within the RNS system. The decomposed LUT-based reverse conversion and FSM ordering techniques offer a significant reduction in hardware complexity and considerable energy efficiency as compared to a direct, single compound memory-based reverse conversion realization. As stated earlier, this alternative form of RNS FIR filter structure shows the smallest performance trade-off for higher-order FIR filters. The area is decreased by 4.5% and the frequency is improved by 24.08% for the 8-bit word length, and the area is decreased by 15.36% and the frequency is improved by 16.18% for the 16-bit word length. For the 4-tap FIR filter, the area of the proposed method is decreased by 4.15% and the frequency is increased by 13.30%, whereas for the 16-tap FIR filter the area is decreased by 4.98% and the frequency is increased by 19.57% when compared with the conventional reverse computation method. Compared with the conventional RNS FIR design, the proposed FSM-based decomposed RNS FIR reduces the logic elements (LEs) by 4.57%, increases the maximum frequency by 31.79%, reduces the LUT size by 42.85%, reduces the number of transitions by 5.15%, and reduces the power dissipation by 13.83%.

REFERENCES


BIOGRAPHIES OF AUTHORS

Balaji Morasa completed the B.Tech degree in Electronics and Communication Engineering and the M.Tech degree in Embedded Systems from JNT University, Hyderabad, India. He is currently pursuing his Doctor of Philosophy degree at JNT University Anantapur, with specialization in VLSI signal processing. He has authored or co-authored national and international research publications. His research interests include advanced filter designs using novel VLSI design architectures. He can be contacted at email: balajichaitra3@gmail.com.

Dr. Nimmagadda Padmaja received her B.E. from Mumbai University and her M.Tech and Ph.D. from S V University in the area of atmospheric radar signal processing. She is presently working as a Professor at SreeVidyanikethan Engineering College (Autonomous), Tirupati, India. She has 50 technical papers to her credit in various reputed international peer-reviewed journals and conferences. Her areas of interest include VLSI signal processing, image processing, and communication systems. She is a Life Member of IETE, IAENG, ISCA, ISTE and IACSIT. She can be contacted at email: padmaja202@gmail.com.

Indonesian J Elec Eng & Comp Sci, Vol. 26, No. 1, April 2022: 127-134