Elysium Technologies Private Limited
Singapore | Madurai | Chennai | Trichy | Coimbatore | Ramnad | Pondicherry | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com

ETPL NT-001  Answering “What-If” Deployment and Configuration Questions With WISE: Techniques and Deployment Experience
ETPL NT-002  Complexity Analysis and Algorithm Design for Advance Bandwidth Scheduling in Dedicated Networks
ETPL NT-003  Diffusion Dynamics of Network Technologies With Bounded Rational Users: Aspiration-Based Learning
ETPL NT-004  Delay-Based Network Utility Maximization
ETPL NT-005  A Distributed Control Law for Load Balancing in Content Delivery Networks
ETPL NT-006  Efficient Algorithms for Neighbor Discovery in Wireless Networks
ETPL NT-007  Stochastic Game for Wireless Network Virtualization
ETPL NT-008  ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification
ETPL NT-009  A Utility Maximization Framework for Fair and Efficient Multicasting in Multicarrier Wireless Cellular Networks
ETPL NT-010  Achieving Efficient Flooding by Utilizing Link Correlation in Wireless Sensor Networks
ETPL NT-011  Random Walks and Green's Function on Digraphs: A Framework for Estimating Wireless Transmission Costs
ETPL NT-012  A Flexible Platform for Hardware-Aware Network Experiments and a Case Study on Wireless Network Coding
ETPL NT-013  Exploring the Design Space of Multichannel Peer-to-Peer Live Video Streaming Systems
ETPL NT-014  Secondary Spectrum Trading—Auction-Based Framework for Spectrum Allocation and Profit Sharing
ETPL NT-015  Towards Practical Communication in Byzantine-Resistant DHTs
ETPL NT-016  Semi-Random Backoff: Towards Resource Reservation for Channel Access in Wireless LANs
ETPL NT-017  Entry and Spectrum Sharing Scheme Selection in Femtocell Communications Markets
ETPL NT-018  On Replication Algorithm in P2P VoD
ETPL NT-019  Back-Pressure-Based Packet-by-Packet Adaptive Routing in Communication Networks
ETPL NT-020  Scheduling in a Random Environment: Stability and Asymptotic Optimality
ETPL NT-021  An Empirical Interference Modeling for Link Reliability Assessment in Wireless Networks
ETPL NT-022  On Downlink Capacity of Cellular Data Networks With WLAN/WPAN Relays
ETPL NT-023  Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management
ETPL NT-024  Localization of Wireless Sensor Networks in the Wild: Pursuit of Ranging Quality
ETPL NT-025  Control of Wireless Networks With Secrecy
ETPL NT-026  ICTCP: Incast Congestion Control for TCP in Data-Center Networks
ETPL NT-027  Context-Aware Nanoscale Modeling of Multicast Multihop Cellular Networks
ETPL NT-028  Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information
ETPL NT-029  Internet-Scale IPv4 Alias Resolution With MIDAR
ETPL NT-030  Time-Bounded Essential Localization for Wireless Sensor Networks
ETPL NT-031  Stability of FIPP p-Cycles Under Dynamic Traffic in WDM Networks
ETPL NT-032  Cooperative Carrier Signaling: Harmonizing Coexisting WPAN and WLAN Devices
ETPL NT-033  Mobility Increases the Connectivity of Wireless Networks
ETPL NT-034  Topology Control for Effective Interference Cancellation in Multiuser MIMO Networks
ETPL NT-035  Distortion-Aware Scalable Video Streaming to Multinetwork Clients
ETPL NT-036  Combined Optimal Control of Activation and Transmission in Delay-Tolerant Networks
ETPL NT-037  A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless

ETPL VLSI - 001
Efficient VLSI Architecture For Interpolation Decoding Of Hermitian Codes
A fast, area-efficient very large scale integration (VLSI) architecture is proposed [...] of interpolation modules. The algorithm has a regular structure, which makes it suitable for VLSI implementation. The circuitry is simplified because the decoding algorithm directly gives the message word at the end of decoding, without separate [...]. Further speed improvements can be achieved by combining the main idea of Guruswami list decoding with the Lee-O'Sullivan algorithm. In terms of hardware, the addition of this concept will further reduce the running time of the algorithm and make the circuitry abo[...] [...]-O'Sullivan algorithms on Xilinx Virtex-5 shows that the proposed decoder can be operated at a higher clock frequency with almost the same area complexity.

ETPL VLSI - 002
Designing Hardware-Efficient Fixed-Point FIR Filters In An Expanding Subexpression Space
This paper presents a practical method for designing fixed-point FIR filters. The proposed method takes both the filter's magnitude response and its hardware cost into consideration in the design process. The method constructs a basis set from the fixed-point coefficients that have already been synthesized. The elements of the basis set are then used to synthesize the coefficients that are still undetermined, so the basis set expands gradually as the coefficient design progresses. The method employs several strategies to speed up the design process. For example, a complexity estimation strategy stops the search from digging deeper into some branches of the search tree, and a solution prediction strategy for high-order FIR filters makes it possible to design fixed-point FIR filters with lengths of a few hundred. Applying the proposed method to twenty benchmark cases, we obtain hardware-efficient results in a reasonable design time.
In two long filter design cases, our design results are better than those obtained by the other methods.

ETPL VLSI - 003
Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching Memories
Emerging non-volatile memories (NVMs) based on resistive switching (RS) mechanisms, such as STT-MRAM, OxRRAM and CBRAM, are under intense R&D investigation in both academia and industry. They provide high write/read speed, low power and good endurance (e.g., > 10^12) beyond mainstream NVMs, which allows them to be embedded directly with logic units for computing purposes. This integration could significantly increase power and die-area efficiency, and thereby overcome the power/speed bottlenecks of modern VLSIs. This paper first presents a theoretical investigation of synchronous NV logic gates based on RS memories (RS-NVL). Special design techniques and strategies are proposed to optimize the structure according to the different resistive characteristics of NVMs. To validate this study, we simulated a non-volatile full adder (NVFA) with two types of NVMs, STT-MRAM and OxRRAM, using a CMOS 40 nm design kit and compact models that include the related physics and experimental parameters. The results show interesting power, speed and area gains compared with a synchronized CMOS FA while keeping good reliability.

ETPL VLSI - 004
An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work, we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth (MB) form.
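As background for this entry, the conventional baseline that fused designs aim to improve on first forms the sum A + B and only then recodes it into radix-4 Modified Booth digits. The Python sketch below shows that textbook baseline only, not the direct sum recoding studied in the paper; the function names and the 8-bit default width are illustrative assumptions.

```python
def mb_recode(value, bits):
    """Radix-4 Modified Booth digits (least significant first) of a
    two's-complement integer that fits in `bits` bits (bits must be even)."""
    if bits % 2 != 0:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    pattern = value & ((1 << bits) - 1)   # two's-complement bit pattern
    digits = []
    prev = 0                              # implicit bit y[-1] = 0
    for j in range(bits // 2):
        b0 = (pattern >> (2 * j)) & 1
        b1 = (pattern >> (2 * j + 1)) & 1
        # Each digit is -2*y[2j+1] + y[2j] + y[2j-1], i.e. in {-2, ..., 2}.
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def mb_value(digits):
    """Integer represented by a list of radix-4 MB digits."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


def add_multiply_baseline(a, b, x, bits=8):
    """Compute x * (a + b): add first, then MB-recode the sum and
    accumulate one partial product per MB digit."""
    digits = mb_recode(a + b, bits)
    return sum(d * x * 4 ** j for j, d in enumerate(digits))
```

In this baseline the adder sits in front of the recoder, so its carry chain is part of the path before any partial product can be formed; recoding the sum directly is a way to shorten that path.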
We introduce a structured and efficient recoding technique and explore three different schemes by incorporating them in FAM designs. Compared with FAM designs that use existing recoding schemes, the proposed technique yields considerable reductions in critical delay, hardware complexity and power consumption of the FAM unit.

ETPL VLSI - 005
Low-Power Pulse-Triggered Flip-Flop Design Based on a Signal Feed-Through
In this brief, a low-power flip-flop (FF) design featuring an explicit-type pulse-triggered structure and a modified true single-phase clock latch based on a signal feed-through scheme is presented. The proposed design successfully solves the long discharging path problem in conventional explicit-type pulse-triggered FF (P-FF) designs and achieves better speed and power performance. Based on post-layout simulation results using TSMC CMOS 90-nm technology, the proposed design outperforms the conventional P-FF design data-close-to-output (ep-DCO) by 8.2% in data-to-Q delay. Meanwhile, the performance edges on power and power-delay-product metrics are 22.7% and 29.7%, respectively.

ETPL VLSI - 006
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square adaptive filter. To achieve lower adaptation delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure.
From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient fixed-point implementation scheme for the proposed architecture and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we propose a bit-level pruning of the architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the structure before pruning, without noticeable degradation of steady-state-error performance.

ETPL VLSI - 007
Rate-0.96 LDPC Decoding VLSI for Soft-Decision Error Correction of NAND Flash Memory
The reliability of data stored in high-density Flash memory devices tends to decrease rapidly because of the reduced cell size and multilevel cell technology. Soft-decision error correction algorithms that use multiple-precision sensing for reading memory can solve this problem; however, they require very complex hardware for high-throughput decoding. In this paper, we present a rate-0.96 (68254, 65536) shortened Euclidean geometry low-density parity-check code and its VLSI implementation for high-throughput NAND Flash memory systems. The design employs the normalized a posteriori probability (APP)-based algorithm, serial schedule, and conditional update, which lead to simple functional units, halved decoding iterations, and low power consumption, respectively. A pipelined-parallel architecture is adopted for high-throughput decoding, and memory-reduction techniques are employed to minimize the chip size.
The proposed decoder is implemented in 0.13-μm [...]tion of the decoder are compared with those of a BCH (Bose-Chaudhuri-Hocquenghem) decoding circuit showing comparable error-correcting performance and throughput.

ETPL VLSI - 008
A Generalized Lattice Filter for Finite Wordlength Implementation With Reduced Number of Multipliers
The excellent finite wordlength (FWL) property of lattice digital filters is well known. The four-multiplier normalized lattice, with signal power at all delay elements normalized to unity, has a particular advantage in its overflow property. However, when used to implement an Nth-order digital filter, the normalized lattice implementation requires 5N+1 multipliers. There exists another lattice structure with excellent FWL property, called the injected numerator lattice structure. In this paper, we combine the injected numerator lattice and the tapped numerator lattice to form a new hybrid lattice structure, which is not only canonic in the number of multipliers, resulting in a significant reduction in overall implementation cost, but also exhibits much better FWL properties than the normalized la[...] for applications where the input signal has a strong time-varying sinusoidal component. The new structure requires a few additional adders; it can be used to implement any causal and stable z-transform transfer function. Two numerical examples are presented to demonstrate the performance of the proposed structure.

ETPL VLSI - 009
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square adaptive filter.
To achieve lower adaptation delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient fixed-point implementation scheme for the proposed architecture and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we propose a bit-level pruning of the architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the structure before pruning, without noticeable degradation of steady-state-error performance.

ETPL VLSI - 010
Finite Alphabet Iterative Decoders for LDPC Codes: Optimization, Architecture and Analysis
Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit-approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that occurs at very low error rates. Recently, a new type of decoder, termed the finite alphabet iterative decoder (FAID), was introduced. FAIDs use simple Boolean maps for variable node processing and can surpass BP-based decoders in the error-floor region with very short word lengths. We restrict the scope of this paper to regular dv=3 LDPC codes on the BSC channel. This paper develops a low-complexity implementation architecture for FAIDs by making use of their properties.
In particular, an innovative bit-serial check node unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% of the area to reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807, 7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with 45% of the area, and still leads to possible performance improvement in the error-floor region.

ETPL VLSI - 011
Precise VLSI Architecture for AI Based 1-D/2-D Daub-6 Wavelet Filter Banks With Low Adder-Count
A multiplier-less architecture based on algebraic integer representation for computing the Daubechies 6-tap wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on previous designs in the sense that it minimizes the number of parallel 2-input adder circuits. The algorithm was obtained using brute-force numerical optimization of the algebraic integer representation. The proposed architecture furnishes exact computation up to the final reconstruction step, which is the operation that maps the exactly computed filtered results from algebraic integer representation to fixed-point. Compared to our recent work, this architecture shows a reduction of $27\cdot n-16$ adder circuits, where $n$ is the number of wavelet decomposition levels.
The design is physically implemented for a 4-level 1-D/2-D decomposition using a Xilinx Virtex-6 vcx240t1ff1156 field programmable gate array (FPGA) device operating at a maximum clock frequency of up to 344/168 MHz. The 1-D/2-D FPGA implementations are tested by hardware co-simulation on an ML605 board with a 100 MHz clock. A 45 nm CMOS synthesis of the 2-D designs shows an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.

ETPL VLSI - 012
Design of Efficient Binary Comparators in Quantum-Dot Cellular Automata
Quantum-dot cellular automata (QCA) are an attractive emerging technology suitable for the development of ultra-dense low-power high-performance digital circuits. Efficient solutions have recently been proposed for several arithmetic circuits, such as adders, multipliers, and comparators. Nevertheless, since the design of digital circuits in QCA still poses several challenges, novel implementation strategies and methodologies are highly desirable. This paper proposes a new design approach oriented to the implementation of binary comparators in QCA. New formulations of the basic logic equations required to perform the comparison function are proposed. The new strategy has been exploited in the design of two different comparator architectures and for several operand word lengths. With respect to existing counterparts, the comparators proposed here exhibit significantly higher speed and reduced overall area.

ETPL VLSI - 013
Improved matrix multiplier design for high-speed digital signal processing applications
A transistor-level implementation of an improved matrix multiplier for high-speed digital signal processing applications, based on matrix element transformation and multiplication, is reported in this study.
The improvement in speed was achieved by rearranging the matrix elements into a two-dimensional array of processing elements interconnected as a mesh. The edges of each row and column were interconnected in a torus structure, facilitating simultaneous execution of several multiplications. The functionality of the circuitry was verified, and performance parameters such as propagation delay and dynamic switching power consumption were calculated using SPICE Spectre with 90 nm CMOS technology. The proposed methodology ensures a substantial reduction in propagation delay compared with the conventional algorithm, systolic array and pseudo number theoretic transformation (PNTT)-based implementations, which are the most commonly used techniques for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix [...] ~2 μ[...] × [...] ~3.12 mW only. The improvement in speed compared with earlier reported matrix multipliers, for example the conventional algorithm, systolic array and PNTT-based implementation, was found to be ~67, ~56 and ~65%, respectively.

ETPL VLSI - 014
High-Speed Experimental Demonstration of Adiabatic Quantum-Flux-Parametron Gates Using Quantum-Flux-Latches
We experimentally demonstrated high-speed logic operations of adiabatic quantum-flux-parametron (AQFP) gates through the use of quantum-flux-latches (QFLs). In QFL-based high-speed test circuits (QHTCs), the output data of the circuits under test (CUTs), which are driven by high-speed excitation currents, are stored in QFLs and are slowly read out using low-speed excitation currents. We designed and fabricated three types of QHTCs using QFLs with different circuit parameters, where the CUTs were buffer gates and AND gates. We confirmed the correct operation of buffer gates and AND gates at 1 GHz.
The obtained bias margins of the 1 GHz excitation currents were more than ±30% for each QHTC, which is wide enough for high-speed logic operations of AQFP gates.

ETPL VLSI - 015
Spin Orbit Torque Non-Volatile Flip-Flop for High Speed and Low Energy Applications
A novel nonvolatile flip-flop based on spin-orbit torque magnetic tunnel junctions (SOT-MTJs) is proposed for fast and ultralow-energy applications. A case study of this nonvolatile flip-flop is considered. In addition to the independence between the writing and reading paths, which offers high reliability, the low-resistance writing path enables high-speed and energy-efficient WRITE operations. We compare the performance metrics of the SOT-MTJ with those of the spin transfer torque (STT)-MTJ. Based on accurate compact models, simulation results show an improvement of up to 20× in WRITE energy per bit cell. At the same writing current and supply voltage, the SOT-MTJ achieves a writing frequency 4× higher than the STT-MTJ.

ETPL VLSI - 016
Critical-Path Analysis and Low-Complexity Implementation of the LMS Adaptive Algorithm
This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no pipelining is required for implementing a direct-form LMS adaptive filter in most practical cases, and that the filter can be realized with a very small adaptation delay in cases where a very high sampling rate is required.
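For reference, the weight update that such architectures implement can be sketched in a few lines of Python. This is the textbook direct-form LMS recursion with an optional adaptation delay m (m = 0 gives plain LMS, m > 0 the delayed LMS variant), not the hardware structure of the paper; the names `dlms` and `simulate_unknown_system`, the step size and the test signal are illustrative assumptions.

```python
import random
from collections import deque


def dlms(x, d, taps, mu, delay=0):
    """Direct-form LMS with `delay` adaptation delays.

    Update rule: w(n+1) = w(n) + mu * e(n-m) * x_vec(n-m), with m = delay.
    Returns the final weights and the error sequence.
    """
    w = [0.0] * taps
    window = [0.0] * taps            # x(n), x(n-1), ..., x(n-taps+1)
    pending = deque()                # (error, input window) awaiting use
    errors = []
    for xn, dn in zip(x, d):
        window = [xn] + window[:-1]  # new list each step, safe to store
        y = sum(wi * xi for wi, xi in zip(w, window))
        e = dn - y
        errors.append(e)
        pending.append((e, window))
        if len(pending) > delay:
            e_old, x_old = pending.popleft()
            w = [wi + mu * e_old * xi for wi, xi in zip(w, x_old)]
    return w, errors


def simulate_unknown_system(h, n, seed=0):
    """White input x and the noise-free output d of an FIR system h."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    d = [sum(h[k] * x[i - k] for k in range(len(h)) if i - k >= 0)
         for i in range(n)]
    return x, d
```

Calling `dlms(x, d, taps=3, mu=0.05, delay=2)` on a signal from `simulate_unknown_system([0.5, -0.3, 0.2], 5000)` identifies the three coefficients. The delay only postpones each correction, which is why a small adaptation delay can be traded for a shorter critical path.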
Based on these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays. Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of the existing direct-form structures requires 80.4% more area and 41.9% more EPS than Design 1. Designs 2 and 3 involve slightly more EPS than Design 1 but offer nearly twice and thrice the MUF at a cost of 55.0% and 60.6% more area, respectively.

ETPL VLSI - 017
Exploiting the Incomplete Diffusion Feature: A Specialized Analytical Side-Channel Attack Against the AES and Its Application to Microcontroller Implementations
Algebraic side-channel attack (ASCA) is a typical technique that relies on a general solver to solve the equations of a cipher and its side-channel leaks. It falls under analytical side-channel attacks and can recover the entire key at once. Many ASCAs have been proposed against the AES, and they utilize Gröbner basis-based, SAT-based, or optimizer-based solvers. The advantage of the general solver approach is its generic nature, which means it can easily be applied to different cryptographic algorithms. The disadvantage is that it is difficult to take into account the specialized properties of the targeted cryptographic algorithms. The results vary depending on what type of solver is used, and the time complexity is quite high when considering error-tolerant attack scenarios. Thus, we were motivated to find a new approach that would lessen the influence of the general solver and reduce the time complexity of ASCA. This paper proposes a new analytical side-channel attack on AES by exploiting the incomplete diffusion feature in one AES round.
We named our technique incomplete diffusion analytical side-channel analysis (IDASCA). Different from previous ASCAs, IDASCA adopts a specialized approach to recover the secret key of AES instead of the general solver. Extensive attacks are performed against a software implementation of AES on an 8-bit microcontroller. Experimental results show that: 1) IDASCA can exploit the side-channel leaks in all AES rounds using a single power trace; 2) it has lower time complexity and more robustness than previous ASCAs, especially when considering error-tolerant attack scenarios; and 3) it can calculate the reduced key search space of AES for a given amount of side-channel leaks. IDASCA can also interpret the mechanism behind previous ASCAs on AES from a quantitative perspective, such as why ASCA can work under unknown plaintext/ciphertext scenarios and what the extreme cases in ASCAs are.

ETPL VLSI - 018
Efficient Register Renaming and Recovery for High-Performance Processors
Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memory (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings.
On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.

ETPL VLSI - 019
Design and simulation of power efficient traffic light controller (PTLC)
This paper presents the design and simulation of a power-efficient traffic light controller (PTLC). The main focus is on simulating and optimizing the PTLC design and computing its speed of operation. Conventional systems have high power consumption and are expensive. The PTLC design is better than the conventional one in terms of LUTs (number of gates), complexity, size and cost. A novel PTLC is presented with a minimum number of LEDs, which improves its performance and makes the design efficient in power and speed relative to the conventional design. Conventional traffic light controllers have been implemented using microcontrollers and FPGAs. A 2013 paper by Parasmani used an FPGA to design an advanced traffic light controller that relies on a sensor to maintain continuous traffic flow; its power consumption is therefore high, and the PTLC design can reduce it. The novel PTLC design is economical and offers high integration, low power and flexibility. The PTLC has been implemented using an FPGA, which has many advantages, such as speed, number of input/output ports and performance.
This system has been successfully tested and implemented in hardware using the Xilinx v 10.1 software packages and the Very High Speed Integrated Circuit Hardware Description Language (VHDL); RTL and technology schematics are included to validate the simulation results.

ETPL VLSI - 020
Energy Efficient Exact Matching for Flow Identification with Cuckoo Affinity Hashing
Energy efficiency has become an important design goal for networking equipment. Traditionally, routers and switches have been designed to minimize peak power consumption, but they operate most of the time with settings and traffic that are far from that peak. Therefore, many elements and functions of networking equipment are being redesigned to improve energy efficiency. A common function in networking is flow identification, which is needed in many applications. Flow identification can be implemented with Content Addressable Memories (CAMs) or alternatively with several data structures. Among those, one efficient option is Cuckoo hashing, which enables fast searches and high memory utilization at the cost of complicating the insertion procedure. In this letter, the energy efficiency of exact matching using Cuckoo hashing is first analyzed, and then a technique is presented to improve the energy efficiency of Cuckoo hashing. The proposed scheme is evaluated using a traffic monitoring application and compared with traditional Cuckoo hashing. The results show that significant energy savings can be obtained by using the proposed technique.

ETPL VLSI - 021
Ultra Low Power Magnetic Flip-Flop Based on Checkpointing/Power Gating and Self-Enable Mechanisms
Advanced computing systems suffer from high static power due to the rapidly rising leakage currents in deep sub-micron MOS technologies.
Fast-access non-volatile memories (NVMs) are under intense investigation for integration in flip-flops or computing memories to allow system power-off in the standby state and save power. Spin Transfer Torque MRAM (STT-MRAM) is considered the most promising NVM to address this issue thanks to its high speed, low power, and infinite endurance. However, one of the disadvantages of STT-MRAM for computing purposes is its relatively high write energy when building a magnetic flip-flop (MFF). In this paper, we propose a power-efficient MFF design architecture to address this challenge, based on the combination of checkpointing operation, power gating and self-enable mechanisms. Multiple non-volatile storage elements can be integrated locally in a conventional FF without significant area overhead, benefiting from the 3-D implementation of STT-MRAM. We performed electrical simulations (i.e., transient and statistical) to validate its functional behavior and evaluate its performance using an accurate SPICE model of STT-MRAM and an industrial 40 nm CMOS design kit. The simulation results confirm its lower power consumption compared to a conventional CMOS FF and the other structures.

ETPL VLSI - 022
A joint encryption and error correction method used in satellite communications
Due to the ubiquitous open-air links and complex electromagnetic environment in satellite communications, how to ensure the security and reliability of information sent over satellite links is an urgent problem. This paper combines AES (Advanced Encryption Standard) with LDPC (Low Density Parity Check) codes to design a secure and reliable error correction method, SEEC (Satellite Encryption and Error Correction).
This method selects LDPC codes that are suitable for satellite communications, uses the AES round key to control the encoding process, and proposes a new round-key generation algorithm. Building on good error correction performance in satellite communications, the method improves the security of the system and achieves a shorter key size, which makes key management easier. MATLAB simulation shows that the method has strong error correction capability and encryption effect.

ETPL VLSI - 023 Dynamic ternary CAM for hardware search engine
A five-transistor dynamic ternary content addressable memory (CAM) is presented for high-density data search applications. The data path and the search path are separated to avoid unwanted capacitive coupling at the storage node. To increase the data retention time, the data lines are grounded and dummy search lines are implemented for refresh operations. The proposed CAM cell is fabricated using a 130 nm CMOS process and occupies 8.99 μm². A 64 × 128 search memory has a retention time of 2.84 ms at room temperature with a 1.2 V supply voltage. The hardware search performance is compared with a conventional software-based search scheme running on two different systems with clock frequencies more than an order of magnitude higher. The hardware search engine exhibits comparable search speeds while dissipating only 149 mW.

ETPL VLSI - 024 High-Throughput Low-Energy Self-Timed CAM Based on Reordered Overlapped Search Mechanism
This paper introduces a reordered overlapped search mechanism for high-throughput low-energy content-addressable memories (CAMs). Most mismatches can be found by searching a few bits of a search word.
To lower power dissipation, a word circuit is often divided into two sections that are sequentially searched or even pipelined. Because of this process, most of the match lines in the second section are unused. Since searching the last few bits is very fast compared to searching the rest of the bits, we propose to increase throughput by asynchronously initiating second-stage searches on the unused match lines as soon as a first-stage search is complete. In our circuit implementation, each word circuit is independently controlled by a locally generated timing signal rather than a global signal. This allows each circuit to be in the phase required for its own local operation (evaluate or precharge) instead of having to synchronize its phase with the rest of the word circuits, which greatly reduces the cycle time. As a design example, a 128 × 64-bit CAM is implemented and evaluated by HSPICE simulation in a 90 nm CMOS technology. The proposed asynchronous CAM operates 5.98 times faster than a synchronous CAM with 14.2% lower energy dissipation. The post-layout proposed CAM achieves a 385-ps cycle delay time and 0.773 fJ/bit/search, and is also evaluated under different corner conditions and PVT variations to guarantee that it operates properly.

ETPL VLSI - 025 A Single-Bit Double-Adjacent Error Correcting Parallel Decoder for Multiple-Bit Error Correcting BCH Codes
This paper presents a novel high-speed BCH decoder that corrects double-adjacent and single-bit errors in parallel and serially corrects multiple-bit errors other than double-adjacent errors. Its operation is based on extending an existing parallel BCH decoder that can only correct single-bit errors in parallel and serially corrects double-adjacent errors at low speed.
The proposed decoder is built on a novel design and is suitable for nanoscale memory systems, in which multiple-bit errors occur at a probability comparable to single-bit errors and double-adjacent errors occur at a higher probability (by nearly two orders of magnitude) than other multiple-bit errors. Extensive simulation results are reported. Compared with the existing scheme, the area and delay time of the proposed decoder are on average 11% and 6% higher, but its power consumption is reduced by 9% on average. This paper also shows that the area, delay, and power overheads incurred by the proposed scheme are significantly lower than those of traditional fully parallelized BCH decoders capable of correcting any double-bit errors in parallel.

ETPL VLSI - 026 Design and implementation of high speed and high accuracy fixed-width modified Booth multiplier for DSP application
This paper presents an error compensation bias circuit added to a modified Booth-encoded multiplier to produce a high-accuracy fixed-width multiplier. Fixed-width multipliers are employed in many digital signal processing applications, as most of these systems employ iterative structures with fixed precision. The design has been implemented in TSMC 180 nm technology. The design is 14.6% faster than existing fixed-width multipliers. The design has 37.2% less truncation error compared to the direct truncated fixed-width multiplier (DTFM). The design is embedded with an operand isolation technique to ensure low-power operation when employed in DSP applications.

ETPL VLSI - 027 A new design of low power high speed hybrid CMOS full adder
We have designed a full adder using the hybrid-CMOS logic style by dividing it into three modules so that it can be optimized at various levels.
The first module is an XOR-XNOR circuit, which generates full-swing XOR and XNOR outputs simultaneously and has good driving capability. It also consumes minimum power and provides better delay performance. The second module is a sum circuit, which is also an XOR circuit and uses the carry input and the output of the first module as inputs to generate the sum output. The third module is a carry circuit, which uses the output of the first stage and the other inputs to generate the carry output. The proposed full adder circuit reduces the power consumption, the carry-in to carry-out delay, and the PDP by 12 to 100% [...] 0.18 μm [...]

ETPL VLSI - 028 Design for reliability for low power digital circuits
Low-power digital circuits in cellular phones, laptops, or tablet computers have critical power consumption limitations. Power consumption at process corners can vary by as much as 50%. In order to optimize high-speed logic circuit designs for low power needs, we need to accurately predict device-to-product aging across process, temperature, and voltage corners. In this talk, we focus on the impact of BTI aging at corners, the Fmax guardband, and its trade-off with power and performance.

ETPL VLSI - 029 High speed Vedic multiplier designs: A review
Multipliers are the key block in high-speed arithmetic logic units, multiply-and-accumulate units, digital signal processing units, etc. With the increasing constraints on delay, more and more emphasis is being laid on the design of faster multipliers. To enhance speed, many modifications of the standard modified Booth algorithm and Wallace tree methods for multiplier design have been made, and several new techniques are being worked on.
Among these, Vedic multipliers based on Vedic mathematics are presently in focus because they are among the fastest and lowest-power multipliers. There are sixteen sutras in Vedic multiplication, of which “Urdhva Tiryakbhyam” is the most efficient in terms of speed. A large number of high-speed Vedic multipliers have been proposed using the Urdhva Tiryakbhyam sutra. A few of them are presented in this paper, giving an insight into their methodology, merits, and demerits. Compressor-based Vedic multipliers show considerable improvements in speed and area efficiency over the conventional ones.

ETPL VLSI - 030 Design of an energy efficient, high speed, low power full subtractor using GDI technique
This paper proposes the design of an energy-efficient, high-speed, low-power full subtractor using the Gate Diffusion Input (GDI) technique. The entire design has been carried out in 150 nm technology. Compared with full subtractors employing conventional CMOS transistors, transmission gates, and Complementary Pass-Transistor Logic (CPL), it shows a considerable reduction in average power consumption (Pavg), delay time, and Power Delay Product (PDP). Pavg is as low as 13.96 nW and the delay time is 18.02 ps, giving a PDP as low as 2.51×10⁻¹⁹ J for a 1 V power supply. In addition, there is a significant reduction in transistor count compared to traditional full subtractors employing CMOS transistors, transmission gates, and CPL, implying a smaller area. The simulation of the proposed design has been carried out in Tanner SPICE and the layout has been designed in Microwind.
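The figures quoted for the GDI full subtractor can be cross-checked, because the power-delay product is simply the average power multiplied by the delay. A minimal Python sketch, using only the two values given in the abstract (the function name `power_delay_product` is illustrative and does not come from the paper):

```python
def power_delay_product(p_avg_watts: float, delay_seconds: float) -> float:
    """Return the power-delay product (PDP) in joules."""
    return p_avg_watts * delay_seconds


P_AVG = 13.96e-9   # average power quoted in the abstract: 13.96 nW
DELAY = 18.02e-12  # delay quoted in the abstract: 18.02 ps

pdp = power_delay_product(P_AVG, DELAY)
print(f"PDP = {pdp:.3e} J")
```

The product comes out at roughly 2.5×10⁻¹⁹ J, which agrees with the quoted PDP to within rounding of the last digit.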
ETPL VLSI - 031 Ultrafast All-Optical Flip-Flops, Simultaneous Comparator-Decoder and Reconfigurable Logic Unit With Silicon Microring Resonator Switches
We present designs of all-optical SR, clocked-SR, D and T flip-flops, a simultaneous single-bit comparator-decoder, and a reconfigurable logic unit based on all-optical switching by two-photon-absorption-induced free-carrier injection in silicon 2 × 2 add-drop microring resonators. The proposed circuits have been theoretically analyzed using time-domain coupled-mode theory, and all-optical switching has been optimized for ultrafast (~25 ps), low-power (~25 mW) operation with high modulation (> 85%), enabling logic operations at 40 Gb/s. The designs are attractive due to the advantages of high Q-factor, tunability, compactness, cascadability, scalability, reconfigurability, simplicity, and a minimal number of switches and inputs for realization of the desired logic.

ETPL VLSI - 032 Implementation of high speed low power combinational and sequential circuits using reversible logic
Reversible logic has presented itself as a prominent technology that plays an important role in quantum computing. Quantum computing devices theoretically operate at ultra-high speed and consume infinitesimally little power. The research in this paper aims to use reversible logic to break the conventional speed-power trade-off, thereby getting a step closer to realizing quantum computing devices. To validate this research, various combinational and sequential circuits are implemented using reversible gates, such as a 4-bit ripple-carry adder, an (8-bit × 8-bit) Wallace tree multiplier, and the control unit of an 8-bit GCD processor.
The power and speed parameters of the circuits are reported and compared with those of their conventional non-reversible counterparts. The comparative study shows that circuits employing reversible logic are faster and more power efficient. The designs presented in this paper were simulated using Xilinx 9.2 software.

ETPL VLSI - 033 An all-digital delay-locked loop for high-speed memory interface applications
This paper presents an all-digital delay-locked loop (ADDLL) with a novel digital delay line for high-speed memory interface applications. The proposed digital delay line has a smaller tuning step and better tuning linearity than the prior art. The proposed ADDLL, used inside the DDR3 PHY for the 90-degree phase shift and read leveling, is fabricated in a 40 nm low-power CMOS process. The test chip is successfully verified at data rates of 800 to 1600 Mbps. The measured peak-to-peak and rms jitter of the write DQS are 60 ps and 10 ps, respectively, at a data rate of 1600 Mbps.

ETPL VLSI - 034 Low Power Square and Cube Architectures Using Vedic Sutras
In this paper, low-power square and cube architectures are proposed using Vedic sutras. The low-power, low-area square and cube architectures use the Dwandwa yoga (duplex combination) properties of the Urdhva Tiryagbhyam sutra and the Anurupyena sutra of Vedic mathematics. Simulation results for the 8-bit square and 8-bit cube show that the proposed architectures lower the total power consumption by 45% and the area by 63% compared to the conventional architecture. The reduction in power consumption also increases with increasing bit width. A comparison is made between conventional and Vedic implementations of the square and cube architectures.
Implementation results show a significant improvement in terms of area, power, and delay. The proposed square and cube architectures can be used for high-speed and low-power applications. Synthesis is done on a Xilinx FPGA device (family: Spartan 3E, speed grade: -4). The propagation delay of the proposed 8-bit square is 4 ns and its area is 22 slices; for the 8-bit cube, the propagation delay is 7.72 ns and the area is 58 slices. The dynamic power estimates for the square and cube are 13 mW and 16 mW, respectively.

ETPL VLSI - 035 Performance analysis of a high speed, energy efficient 4×4 dynamic RAM cell array using 32nm fully depleted SOI/SON and CNFET
The objective of this paper is to design a power-efficient, high-performance 4×4 1T DRAM cell array using conventional MOS, fully depleted SOI/SON, and CNFET devices. As CMOS technology is scaled down, there is a major need to improve the performance and robustness of the memory extensively used in today's hand-held devices. Dynamic Random Access Memory (DRAM) is the main memory used for all desktop and larger computers. In modern VLSI circuit design, power dissipation is also a crucial issue. New emerging devices with improved technology hold promise for low-power applications. In this paper, we present a comparative circuit-level analysis of Metal Oxide Semiconductor (MOS), fully depleted Silicon on Insulator (FD-SOI), fully depleted Silicon on Nothing (FD-SON), and Carbon Nanotube Field Effect Transistor (CNFET) devices at the 32 nm technology node using the HSPICE tool.

ETPL VLSI - 036 High-performance 64-bit binary comparator
A high-performance 64-bit binary comparator is proposed in this brief.
Comparison is the most basic arithmetic operation; it determines whether one number is greater than, equal to, or less than another. The comparator is the most fundamental component that performs the comparison operation. This brief presents a comparison of modified and existing 64-bit binary comparator designs, concentrating on power consumption and delay. Some modifications have been made to the existing 64-bit binary comparator design to improve the performance of the circuit. The modified and existing 64-bit binary comparator designs are compared by simulation at the 90 nm technology node in the Tanner EDA tool.

ETPL VLSI - 037 FPGA based partial reconfigurable FIR filter design
This paper proposes a partially reconfigurable FIR filter design using a systolic Distributed Arithmetic (DA) architecture optimized for FPGAs. To implement a computationally efficient, low-power, high-speed Finite Impulse Response (FIR) filter, a two-dimensional fully pipelined structure is used. To reduce the partial reconfiguration time, a new architecture for the Look-Up Table (LUT) in distributed arithmetic is proposed. The FIR filter is dynamically reconfigured to realize low-pass and high-pass filter characteristics by changing the filter coefficients in the partial reconfiguration module. The design is implemented using the XUP Virtex 5 LX110T FPGA kit. The FIR filter design shows improvement in configuration time and efficiency.

ETPL VLSI - 038 Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator
The need for ultra low-power, area-efficient, and high-speed analog-to-digital converters is pushing toward the use of dynamic regenerative comparators to maximize speed and power efficiency.
In this paper, an analysis of the delay of dynamic comparators is presented and analytical expressions are derived. From the analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic comparator is proposed, in which the circuit of a conventional double-tail comparator is modified for low-power and fast operation even at small supply voltages. Without complicating the design and by adding a few transistors, the positive feedback during regeneration is strengthened, which results in a remarkably reduced delay time. Post-layout simulation results in a 0.18-μm CMOS technology show that in the proposed dynamic comparator both the power consumption and delay time are significantly reduced. The maximum clock frequency of the proposed comparator can be increased to 2.5 and 1.1 GHz at supply voltages of 1.2 and 0.6 V, while consuming 1.4 mW and 153 μW, respectively. The standard deviation of the input-referred offset is 7.8 mV at 1.2 V supply.

ETPL VLSI - 039 A Blind Dynamic Fingerprinting Technique for Sequential Circuit Intellectual Property Protection
Design fingerprinting is a means to trace illegally redistributed intellectual property (IP) by creating a unique IP instance with a different signature for each user. Existing fingerprinting techniques for hardware IP protection focus on lowering the design effort to create a large number of different IP instances without paying much attention to the ease of fingerprint detection upon IP integration. This paper presents the first dynamic fingerprinting technique on sequential circuit IPs to enable both the owner and legal buyers of an IP embedded in a chip to be readily identified in the field. The proposed fingerprint is an oblivious ownership watermark independently endorsed by each user through a blind signature protocol.
Thus, the authorship can also be proved through the detection of different users' fingerprints without the need to separately embed an identical IP owner's signature in all fingerprinted instances. The proposed technique is applicable to both application-specific integrated circuit and field-programmable gate array IPs. Our analyses show that the fingerprint is immune to collusion attacks and can withstand all perceivable attacks, with a lower probability of removal than state-of-the-art FSM watermarking schemes. The probability of coincidence of a 32-bit fingerprint is on the order of 10⁻¹⁰, and up to 10³⁵ 32-bit fingerprinted instances can be generated for a small design of 100 flip-flops.

ETPL VLSI - 040 Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square adaptive filter. To achieve lower adaptation delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N = 8, 16, and 32. We propose an efficient fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result.
Moreover, we propose a bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without noticeable degradation of steady-state-error performance.

ETPL VLSI - 041 Reduced-Complexity Min-Sum Algorithm for Decoding LDPC Codes With Low Error-Floor
This paper proposes a low-complexity min-sum algorithm for decoding low-density parity-check codes. It is an improved version of the single-minimum algorithm, where the two-minimum calculation is replaced by one minimum calculation and a second minimum emulation. In the proposed algorithm, variable correction factors that depend on the iteration number are introduced and the second minimum emulation is simplified, thereby reducing the decoder complexity. This proposal improves the performance of the single-minimum algorithm, approaching the normalized min-sum performance in the waterfall region. Also, the error-floor region is analyzed for the code of the IEEE 802.3an standard, showing that the trapping sets are decoded due to a slowdown of the convergence of the algorithm. Error-floor-free operation below BER = 10⁻¹⁵ is shown for this code by means of a field-programmable gate array (FPGA)-based hardware emulator. A layered decoder is implemented in a 90-nm CMOS technology, achieving 12.8 Gbps with an area of 3.84 mm².

ETPL VLSI - 042 Improved 8-Point Approximate DCT for Image and Video Compression Requiring Only 14 Additions
The need for low energy consumption in video processing systems such as HEVC for the multimedia market has led to extensive development of fast algorithms for the efficient approximation of 2-D DCT transforms.
The DCT is employed in a multitude of compression standards due to its remarkable energy compaction properties. Multiplier-free approximate DCT transforms have been proposed that offer superior compression performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware using additions and subtractions only, leading to significant reductions in chip area and power consumption compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-point DCT approximation that requires only 14 addition operations and no multiplications. The proposed transform possesses low computational complexity and is compared to state-of-the-art DCT approximations in terms of both algorithm complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for reconfigurable video standards such as HEVC. The proposed transform and several other DCT approximations are mapped to systolic-array digital architectures and physically realized as digital prototype circuits using FPGA technology and mapped to 45 nm CMOS technology.

ETPL VLSI - 043 Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors With Hardware-Efficient Architecture
This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput requirements for advanced wireless communication systems. The proposed efficient folding scheme strikes a suitable balance between complexity and throughput. This paper also presents a discussion on the scalability of this architecture with respect to the setting of QAM size, K values, and antenna number.
One 4×4 MIMO detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08 Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized power and normalized hardware efficiencies.

ETPL VLSI - 044 Partial Access Mode: New Method for Reducing Power Consumption of Dynamic Random Access Memory
Demands have been placed on dynamic random access memory (DRAM) not only to increase memory capacity and data transfer speed, but also to reduce operating and standby currents. When a system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage current. Power consumption for the refresh operation increases in proportion to the memory capacity. We propose a new method to reduce the refresh power consumption by effectively extending the memory cell retention time. Conversion from 1 cell/bit to 2^N cells/bit reduces the variation in the retention time among memory cells. Although active power increases by a factor of 2^N, the refresh time increases by more than 2^N as a consequence of the fact that the majority decision does better than averaging for the tail distribution of retention time.
The conversion can be realized very simply from the structure of the DRAM array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude. On the basis of this conversion method, we propose a partial access mode to reduce power consumption dynamically when the full memory capacity is not required.

ETPL VLSI - 045 Pulsed-Latch Utilization for Clock-Tree Power Optimization
Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing tradeoff, this paper investigates pulsed-latch utilization in a clock tree for further power savings. This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed latches and flip-flops. The proposed method is based on a minimum-cost maximum-flow formulation to globally determine the tree topology, which maintains load balance and considers the wirelength between pulse generators and pulsed latches. Experimental results indicate that the proposed migration approach can improve the power consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.

ETPL VLSI - 046 Scalable Montgomery Modular Multiplication Architecture with Low-Latency and Low-Memory Bandwidth Requirement
Montgomery modular multiplication is widely used in public-key cryptosystems.
This work shows how to relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling scheme to reduce the number of memory accesses in the developed scalable architecture. Analytical results show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times that of conventional scalable architectures, where w denotes the word size. The proposed architecture also retains a latency of exactly one cycle between the operations of the same words in two consecutive iterations of the Montgomery modular multiplication algorithm when employing enough processing elements. Compared to the design in the related work, experimental results demonstrate that the proposed architecture achieves an almost 54 percent reduction in power consumption with no degradation in throughput. The reduced number of memory accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for any operand precision.

ETPL VLSI - 047 Reconfigurable CORDIC-Based Low-Power DCT Architecture Based on Data Priority
This paper presents a low-power coordinate rotation digital computer (CORDIC)-based reconfigurable discrete cosine transform (DCT) architecture. The main idea of this paper is based on the interesting fact that not all the computations in the DCT are equally important in generating the frequency-domain outputs. Considering the difference in importance of the DCT coefficients, the number of CORDIC iterations can be dynamically changed to efficiently trade off image quality for power consumption. Thus, the computational energy can be significantly reduced without seriously compromising the image quality.
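The trade-off between CORDIC iteration count and accuracy described above can be illustrated with a small floating-point model. The sketch below is not the paper's architecture: `cordic_rotate`, `rotation_error`, and the iteration counts (4 and 16) are illustrative assumptions, and a hardware design would use fixed-point shifts and adds rather than floats. It rotates a unit vector by 3π/16, chosen here as a representative DCT rotation angle, and reports the error for a short and a long iteration budget.

```python
import math


def cordic_rotate(x, y, angle, iterations):
    """Rotate (x, y) by `angle` radians using `iterations` CORDIC micro-rotations."""
    # Each micro-rotation stretches the vector by sqrt(1 + 2^(-2i));
    # accumulate the inverse of that gain and apply it once at the end.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle that still has to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        # Shift-and-add step: multiply by 2^-i instead of using sin/cos.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x * gain, y * gain


def rotation_error(angle, iterations):
    """Distance between the CORDIC result and the exact rotation of (1, 0)."""
    x, y = cordic_rotate(1.0, 0.0, angle, iterations)
    return math.hypot(x - math.cos(angle), y - math.sin(angle))


ANGLE = 3 * math.pi / 16
for n in (4, 16):
    print(f"{n:2d} iterations: error = {rotation_error(ANGLE, n):.2e}")
```

Fewer iterations mean fewer shift-and-add steps, and so less switching activity, at the price of a larger rotation error; this is the knob that a data-priority scheme can turn per coefficient.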
The proposed CORDIC-based 2-D DCT architecture is implemented using a 0.13 μm CMOS process, and the experimental results show that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the CORDIC-based Loeffler DCT at the cost of minor image quality degradation.

ETPL VLSI - 048 Achieving High-Performance On-Chip Networks With Shared-Buffer Routers
On-chip routers typically have buffers dedicated to their input or output ports for temporarily storing packets in case contention occurs on output physical channels. Buffers, unfortunately, consume significant portions of router area and power budgets. While running a traffic trace, however, not all input ports of routers have incoming packets that need to be transferred simultaneously. Therefore, a large number of buffer queues in the network are empty while other queues are mostly busy. This observation motivates us to design the router architecture with shared queues (RoShaQ), a router architecture that maximizes buffer utilization by allowing multiple buffer queues to be shared among input ports. Sharing queues, in fact, makes buffer use more efficient and hence achieves higher throughput when the network load becomes heavy. On the other hand, at light traffic load, our router achieves low latency by allowing packets to effectively bypass these shared queues. Experimental results on a 65-nm CMOS standard-cell process show that over synthetic traffic patterns RoShaQ has 17% lower latency and 18% higher saturation throughput than a typical virtual-channel (VC) router. Because of its higher performance, RoShaQ consumes 9% less energy per transferred packet than a VC router given the same buffer space capacity.
Over real multitask applications and E3S embedded benchmarks using the near-optimal NMAP mapping algorithm, RoShaQ has 32% lower latency than the VC router and, when targeting the same application throughput, 30% lower energy per packet.

ETPL VLSI - 049 Area-Delay Efficient Binary Adders in QCA
As transistors decrease in size, more and more of them can be accommodated in a single die, thus increasing chip computational capabilities. However, transistors cannot get much smaller than their current size. The quantum-dot cellular automata (QCA) approach represents one of the possible solutions for overcoming this physical limit, even though the design of logic modules in QCA is not always straightforward. In this brief, we propose a new adder that outperforms all state-of-the-art competitors and achieves the best area-delay tradeoff. The above advantages are obtained by using an overall area similar to that of the cheaper designs known in the literature. The 64-bit version of the novel adder spans over 18.72 μm² of active area and shows a delay of only nine clock cycles, that is, just 36 clock phases.

ETPL VLSI - 050 Low-Complexity Reconfigurable Fast Filter Bank for Multi-Standard Wireless Receivers
This brief presents a new low-complexity reconfigurable fast filter bank (RFFB) for wireless communication applications such as spectrum sensing and channelization. In RFFB, the bandwidth and center frequency of sub-bands can be varied with high frequency resolution without hardware reimplementation. This is achieved with an improved modified frequency transformation-based variable digital filter (MFT-VDF) at the first stage of the proposed multistage implementation. Existing second-order frequency transformation-based low-pass VDFs have a limited cutoff frequency range, which is approximately 12.5% of the sampling frequency.
The proposed low-pass MFT-VDF offers unabridged control over the cutoff frequency over a wide frequency range, thereby improving on the cutoff frequency range of existing VDFs. The design example shows that the RFFB is easy to design and offers substantial savings in gate counts over other filter banks.

ETPL VLSI - 051 Input Vector Monitoring Concurrent BIST Architecture Using SRAM Cells
Input vector monitoring concurrent built-in self test (BIST) schemes perform testing during the normal operation of the circuit without imposing a need to set the circuit offline to perform the test. These schemes are evaluated based on the hardware overhead and the concurrent test latency (CTL), i.e., the time required for the test to complete while the circuit operates normally. In this brief, we present a novel input vector monitoring concurrent BIST scheme, which is based on the idea of monitoring a set (called window) of vectors reaching the circuit inputs during normal operation, and the use of a static-RAM-like structure to store the relative locations of the vectors that reach the circuit inputs in the examined window; the proposed scheme is shown to perform significantly better than previously proposed schemes with respect to the hardware overhead and CTL tradeoff.

ETPL VLSI - 052 Low-Complexity Low-Latency Architecture for Matching of Data Encoded With Hard Systematic Error-Correcting Codes
A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this brief to reduce latency and complexity.
Based on the fact that the codeword of an ECC is usually represented in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed architecture parallelizes the comparison of the data and that of the parity information. To further reduce the latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a (40, 33) code, the proposed architecture reduces the latency and the hardware complexity by $\sim 32\%$ and 9%, respectively, compared with the most recent implementation.

ETPL VLSI - 053 Layout-Based Refined NPSF Model for DRAM Characterization and Testing
As dynamic random access memories (DRAMs) are becoming denser with technology scaling, more complex fault behaviors emerge; examples are leakage, coupling effects, and cell-neighborhood interactions. The neighborhood pattern sensitive fault (NPSF) model is suitable to address such faulty behaviors and identify them during the characterization and/or test of new DRAM chips. However, NPSF test algorithms are extremely time-consuming and therefore not economically affordable. In this brief, we show how layout information can be used to refine and significantly simplify the NPSF model and reduce the test time complexity. As a case study, the folded DRAM array is considered. A realistic NPSF model, the $\Delta$-type neighborhood, is introduced together with a time-efficient test algorithm which is more than two times cheaper than traditional ones. Even when incorporating bit-line influence and word-line coupling effects, along with NPSFs, the test algorithm time complexity remains almost unaltered.
Therefore, the proposed approach makes NPSF testing economically affordable, and hence, suitable for the characterization/test of dense DRAMs in the nanoera.

ETPL VLSI - 054 Parasitics-Aware Design of Symmetric and Asymmetric Gate-Workfunction FinFET SRAMs
Multigate FET technology is the most viable successor to planar CMOS technology at the 22-nm node and beyond. Prior research on multigate SRAMs is generally confined to the optimization of DC targets. However, on account of the nonplanar nature of multigate FETs, it is highly questionable whether multigate SRAM DC metrics can guide bitcell designers, as parasitic capacitances for two topologically equivalent bitcells can be very different, due to various issues such as fin pitches, resulting in widely varying transient characteristics. In this paper, we evaluate several known symmetric gate-workfunction (Symm-Φ) and, for the first time, asymmetric gate-workfunction (Asymm-Φ) […] SRAM […] head-to-head in a 22-nm silicon-on-insulator process, from the perspective of transient behavior, using a unified 3-D/mixed-mode 2-D TCAD technology-circuit co-design methodology. We accomplish the latter by capturing bitcell parasitics accurately through transport analysis-based 3-D TCAD capacitance extractions that leverage automated layout-3-D TCAD structure synthesis algorithms. Mixed-mode transient device simulations (incorporating back-annotated 3-D TCAD parasitics) indicate that a design guided by DC metrics alone can lead to erroneous conclusions and suboptimal bitcell choices. Overall, from the perspective of area and performance, in single-Φ […]-gate (or vanilla) configurations are superior to topologies employing independent-gate configurations, even though the latter often have better DC metrics.
In a larger design space encompassing dual/Asymm-Φ […] -Φ […] SRAM topologies in terms of DC metrics and have better dynamic write-ability, even at low VDD.

ETPL VLSI - 055 Simplifying Clock Gating Logic by Matching Factored Forms
Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size must then be synthesized. We show how to do this by factored form matching, in which gating functions in factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating functions that are not matched. Strong matching identifies matches that are explicitly present in the factored forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover. Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean division only achieves an average reduction of 8%.

ETPL VLSI - 056 Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating
Clock gating is a predominant technique used for power saving. It is observed that the commonly used synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to disable these. To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a common clock enabling signal. The question of which group size maximizes the power savings was answered in a previous paper. Here we answer the question of which FFs should be placed in a group to maximize the power reduction.
We propose a practical solution based on the toggling activity correlations of FFs and their physical position proximity constraints in the layout. Our data-driven clock gating is integrated into a commercial Electronic Design Automation (EDA) backend design flow, achieving total power reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40- and 65-nanometer process technologies. These savings are achieved on top of the savings obtained by clock gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer level design.
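To make the idea of grouping by toggling correlation under a proximity constraint more concrete, here is a minimal Python sketch. It is not the design flow from the paper: the function name `pair_flip_flops`, the input dictionaries, the Manhattan-distance limit `max_distance`, the restriction to groups of two, and the greedy selection are all assumptions chosen for illustration. The cost of a pair is the number of cycles in which exactly one of the two FFs toggles, since in those cycles a shared enable would clock one FF redundantly.

```python
from itertools import combinations


def pair_flip_flops(toggles, positions, max_distance):
    """Greedily pair flip-flops whose toggle traces differ least.

    toggles:      dict name -> list of 0/1 flags, one per clock cycle
                  (1 means the FF changes state in that cycle)
    positions:    dict name -> (x, y) placement coordinates
    max_distance: largest allowed Manhattan distance inside a pair
    Returns a list of (name_a, name_b, mismatching_cycles) tuples.
    """
    candidates = []
    for a, b in combinations(sorted(toggles), 2):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        if abs(xa - xb) + abs(ya - yb) > max_distance:
            continue  # too far apart to share one gating cell
        # Cycles where only one FF toggles are redundant pulses for the other.
        mismatch = sum(ta != tb for ta, tb in zip(toggles[a], toggles[b]))
        candidates.append((mismatch, a, b))

    candidates.sort()  # best-correlated pairs first
    paired, groups = set(), []
    for mismatch, a, b in candidates:
        if a not in paired and b not in paired:
            groups.append((a, b, mismatch))
            paired.update((a, b))
    return groups


if __name__ == "__main__":
    toggles = {
        "ff0": [1, 0, 1, 0],
        "ff1": [1, 0, 1, 0],
        "ff2": [0, 1, 0, 1],
        "ff3": [0, 1, 0, 0],
    }
    positions = {"ff0": (0, 0), "ff1": (1, 0), "ff2": (0, 1), "ff3": (1, 1)}
    print(pair_flip_flops(toggles, positions, max_distance=2))
```

A greedy choice is simple but not optimal in general; an optimal pairing would need a matching algorithm over the same cost values.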
ETPL VLSI - 057 An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes Decoder Implementation
Polar codes have recently received a lot of attention because of their capacity-achieving performance and low encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of the polar codes highly depends on that of the partial-sum network (PSN) implementation. Hence, in this work, an efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum updating algorithm and the corresponding PSN architecture are introduced, which achieve a delay performance independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-performance and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented to integrate seamlessly with the folded processing element architecture. This is achieved by using a novel folded decoding schedule. As a result, both the critical path delay and the area (excluding the memory for folding) of the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the area is reduced by as much as 50.4%, compared with the state-of-the-art designs.

ETPL VLSI - 058 An Analog VLSI Implementation of the Inner Hair Cell and Auditory Nerve Using a Dual AGC Model
An analog inner hair cell and auditory nerve circuit using a dual AGC model has been implemented using 0.35 micron mixed-signal technology.
A fully-differential current-mode architecture is used and the ability to correct channel mismatch is evaluated with matched layouts as well as with digital current tuning. The Meddis test paradigm is used to examine the analog implementation's auditory processing capabilities and investigate the circuit's ability to correct DC mismatch. The correction techniques used demonstrate the analog inner hair cell and auditory nerve circuit's potential use in low-power, multiple-sensor analog biomimetic systems with highly reproducible signal processing blocks on a single massively parallel integrated circuit.

ETPL VLSI - 059 Improved Accuracy Current-Mode Multiplier Circuits With Applications in Analog Signal Processing
This brief presents two original implementations of improved accuracy current-mode multiplier/divider circuits. Besides the advantage of their simplicity, these original multiplier/divider structures present the advantage of very small linearity errors that can be obtained as a result of the proposed design techniques (0.75% and 0.9%, respectively, for an extended range of the input currents). The original multiplier/divider circuits permit easy reconfiguration, and the presented structures represent the functional basis for implementing complex function synthesizer circuits. The proposed computational structures are designed for implementation in a 0.18-μm process with low-voltage operation (a supply voltage of 1.2 V). The circuits' power consumptions are 60 and 75 μW, while their bandwidths are 79.6 and 59.7 MHz, respectively.

ETPL VLSI - 060 Low-Complexity Low-Latency Architecture for Matching of Data Encoded With Hard Systematic Error-Correcting Codes
A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this brief to reduce latency and complexity.
Based on the fact that the codeword of an ECC is usually represented in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed architecture parallelizes the comparison of the data and that of the parity information. To further reduce the latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a (40, 33) code, the proposed architecture reduces the latency and the hardware complexity by $\sim 32\%$ and 9%, respectively, compared with the most recent implementation.

ETPL VLSI - 061 AWARE (Asymmetric Write Architecture With REdundant Blocks): A High Write Speed STT-MRAM Cache Architecture
Spin-transfer torque magnetic RAM (STT-MRAM) is a promising memory technology for lower level caches because of its high density and nonvolatile nature. However, the high write latency is a bottleneck to its widespread adoption as the future on-chip memory. In this paper, we propose a new cache architecture, asymmetric write architecture with redundant blocks (AWARE), that can improve the write latency by taking advantage of the asymmetric write characteristics of 1T-1MTJ STT-MRAM bit-cells. Due to the nature of the storage element in STT-MRAM, the time required for the two state transitions (1→0 and 0→1) is not identical. In other words, one of the state transitions is slower than the other. In conventional cache architecture, the overall write latency is limited by the slower transition.
However, the AWARE cache design introduces redundant blocks in each row, and they are preset to the initial state that enables the faster transition. Hence the write operations performed in these redundant blocks are much faster than in the conventional write scheme. The write latency in AWARE is improved by 30% over conventional cache architecture with no area penalty in the data array. However, the additional tag bits introduced in this technique result in a penalty on the total cache area. In addition, the write energy increases modestly by 7% in the proposed cache design. This write-energy increase can be mitigated by sacrificing cache capacity.

ETPL VLSI - 062 Design of a Low-Voltage Low-Dropout Regulator
A low-voltage low-dropout (LDO) regulator that converts an input of 1 V to an output of 0.85–0.5 V in 90-nm CMOS technology is proposed. A simple symmetric operational transconductance amplifier is used as the error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly 50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 $\mathrm{mm}^{2}$, because of the compact architecture.
ETPL VLSI - 063 An Analytical Delay Model for Mechanical Stress Induced Systematic Variability Analysis in Nanoscale Circuit Design
Strain engineering for performance enhancement is an integral part of a state-of-the-art CMOS process flow. However, use of stressors makes the performance of CMOS devices layout dependent. Performance variability arising due to the use of stressor materials is often referred to as Layout Dependent Effect (LDE) variability. The existing delay models do not take LDE into consideration and therefore result in unaccounted changes in performance and degraded design robustness. In this paper we propose an analytical delay model for inverter, 2-input NAND, and NOR gates while considering LDE variability due to the use of strain-engineered devices. We compare our derived model with TCAD-calibrated HSPICE simulation results and observe that our model estimates delay well for varying transistor sizes, load capacitances, and input signal transition times.

ETPL VLSI - 064 Energy Efficient Programmable MIMO Decoder Accelerator Chip in 65-nm CMOS
This paper presents an energy efficient programmable hardware accelerator that targets multiple-input multiple-output (MIMO) decoding tasks of orthogonal frequency-division multiplexing (OFDM) systems. The work is motivated by the adoption of MIMO and OFDM by almost all existing and emerging high-speed wireless data communication systems. The accelerator was fabricated in 65-nm CMOS technology and occupies a core area of 2.48 $\mathrm{mm}^{2}$.
It delivers full programmability across different wireless standards (i.e., WiFi, 3G-long term evolution, and WiMax) as well as different MIMO decoding algorithms (i.e., minimum mean square error, singular value decomposition, and maximum likelihood) with extreme energy efficiency. The energy efficiency of our MIMO accelerator chip was compared against dedicated application specific integrated circuits for 4 $\times$ 4 QR decomposition, 4 $\times$ 4 singular value decomposition, and 2 $\times$ 2 minimum mean square error decoding. Despite the programmable nature of our design, it delivered energy efficiencies that were 18% to 28% better than the dedicated solutions reported in the literature. This paper presents the VLSI implementation of the architecture discussed in [14]–[16]. It discusses the implementation decisions and tradeoffs used to ensure minimum overall energy consumption of the resulting accelerator chip without sacrificing programmability. Given its programmability and extreme energy efficiency, the accelerator is an ideal solution for today's smart phones that implement multiple MIMO-OFDM waveforms on the same platform.

ETPL VLSI - 065 Iterative Linear Interpolation Based on Fuzzy Gradient Model for Low-Cost VLSI Implementation
In this paper, we propose an iterative linear interpolation (ILI) algorithm, which produces quadratic ILI polynomials to perform the most cost-effective interpolation among state-of-the-art quadratic and cubic methods. Unlike traditional point and area pixel models, the ILI adopts the fuzzy gradient model to estimate gradients of the target point according to its neighbor sample points in different directions.
By weighting the gradients using fuzzy membership grades, the ILI estimates the difference between the target point and its neighbor sample points and finally obtains the target point. In 1-D signal reconstructions, using only three multipliers, the ILI clearly outperforms both conventional quadratic Lagrange interpolation and cubic interpolation. To approximate 2-D signals, we use five 1-D ILIs, which costs only eight multipliers to obtain similar peak signal-to-noise ratio (PSNR) performance but better robustness compared with bi-cubic interpolation. Reusing the ILI polynomials of the previous target point, we further reduce the cost of ILI to three multipliers and eight adders. The VLSI implementation using TSMC 0.18-$\mu\mathrm{m}$ technology shows that only 7256 gates are required for running a 200-MHz, 8-bit input/output, 15-bit fixed-point data path, and 10-stage pipelined 2-D ILI, which is the quadratic interpolation of lowest cost but with PSNR performance closest to state-of-the-art bi-cubic methods.

ETPL VLSI - 066 Simplifying Clock Gating Logic by Matching Factored Forms
Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size must then be synthesized. We show how to do this by factored form matching, in which gating functions in factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating functions that are not matched. Strong matching identifies matches that are explicitly present in the factored forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover.
Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean division only achieves an average reduction of 8%.

ETPL VLSI - 067 Use of SSTA Tools for Evaluating BTI Impact on Combinational Circuits
This paper presents an extensive statistical study on the impact of bias temperature instability (BTI) on digital circuits. A statistical framework for the evaluation of BTI at the electrical (SPICE) level, enhanced by an atomistic model for BTI, is introduced. This framework is then employed to perform the timing analysis of different combinational paths using cells from a given library, aiming to statistically model BTI at the higher abstraction level. A statistical static timing analysis (SSTA) method is then performed and the results are compared to detailed simulations using atomistic models based on experimental data. The comparison between the two methods shows that for large paths both methods converge to the same distribution for the delay, while for short paths the delay distributions are different, causing the SSTA method to generate misleading results. An analysis is then performed in order to understand and formalize the results.

ETPL VLSI - 068 Precise VLSI Architecture for AI Based 1-D/ 2-D Daub-6 Wavelet Filter Banks With Low Adder-Count
A multiplier-less architecture based on algebraic integer representation for computing the Daubechies 6-tap wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on previous designs in the sense that it minimizes the number of parallel 2-input adder circuits. The algorithm was achieved using brute-force numerical optimization of the algebraic integer representation.
The proposed architecture furnishes exact computation up to the final reconstruction step, which is the operation that maps the exactly computed filtered results from algebraic integer representation to fixed-point. Compared to our recent work, this architecture shows a reduction of $27 \cdot n - 16$ adder circuits, where $n$ is the number of wavelet decomposition levels. The design is physically implemented for a 4-level 1-D/2-D decomposition using a Xilinx Virtex-6 vcx240t-1ff1156 field programmable gate array (FPGA) device operating at up to a maximum clock frequency of 344/168 MHz. The 1-D/2-D FPGA implementations are tested using hardware cosimulation on an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs shows an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.

ETPL VLSI - 069 An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work, we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a structured and efficient recoding technique and explore three different schemes by incorporating them in FAM designs. Compared with FAM designs that use existing recoding schemes, the proposed technique yields considerable reductions in terms of critical delay, hardware complexity and power consumption of the FAM unit.
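For readers unfamiliar with the Modified Booth form mentioned in this abstract, the following Python sketch shows standard radix-4 MB recoding and uses it to evaluate (A + B)·X. It illustrates the conventional baseline only, in which the sum is formed first and then recoded; the direct sum-to-MB recoder proposed in the paper is not reproduced here. The function names and the default `width` parameter are assumptions made for this example.

```python
def modified_booth_digits(value: int, width: int) -> list[int]:
    """Radix-4 Modified Booth digits (least significant first).

    value must fit in `width` bits of two's complement and width must be
    even. Each digit d_j = -2*b[2j+1] + b[2j] + b[2j-1] lies in
    {-2, -1, 0, 1, 2}, with b[-1] taken as 0.
    """
    if width % 2 != 0:
        raise ValueError("width must be even")
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    bits = value & ((1 << width) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # b[-1]
    for j in range(width // 2):
        b0 = (bits >> (2 * j)) & 1
        b1 = (bits >> (2 * j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def add_multiply(a: int, b: int, x: int, width: int = 8) -> int:
    """Compute (a + b) * x by MB-recoding the sum into partial products."""
    # The sum of two width-bit operands needs width + 1 bits; width + 2
    # keeps the bit count even for radix-4 recoding.
    digits = modified_booth_digits(a + b, width + 2)
    # One partial product per digit, weighted by 4**j.
    return sum(d * x * 4**j for j, d in enumerate(digits))


if __name__ == "__main__":
    print(modified_booth_digits(7, 4))
    print(add_multiply(5, -3, 11))
```

Recoding halves the number of partial products relative to a bit-by-bit multiplier, which is why the recoder sits on the critical path of a fused add-multiply unit.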
ETPL VLSI - 070 Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching Memories
Emerging non-volatile memories (NVM) based on resistive switching mechanism (RS), such as STT-MRAM, OxRRAM, and CBRAM, are under intense R&D investigation by both academia and industry. They provide high write/read speed, low power and good endurance (e.g., $> 10^{12}$) beyond mainstream NVMs, which allows them to be embedded directly with logic units for computing purposes. This integration could significantly increase the power/die area efficiency, and thereby overcome the power/speed bottlenecks of modern VLSIs. This paper first presents a theoretical investigation of synchronous NV logic gates based on RS memories (RS-NVL). Special design techniques and strategies are proposed to optimize the structure according to different resistive characteristics of NVMs. To validate this study, we simulated a nonvolatile full-adder (NVFA) with two types of NVMs, STT-MRAM and OxRRAM, by using a CMOS 40 nm design kit and compact models, which include related physics and experimental parameters. They show interesting power, speed and area gains compared with a synchronized CMOS FA while keeping good reliability.

ETPL VLSI - 071 High-Throughput Multistandard Transform Core Supporting MPEG/H.264/VC-1 Using Common Sharing Distributed Arithmetic
This paper proposes a low-cost high-throughput multistandard transform (MST) core, which can support MPEG-1/2/4 (8 × 8), H.264 (8 × 8, 4 × 4), and VC-1 (8 × 8, 8 × 4, 4 × 8, 4 × 4) transforms. Common sharing distributed arithmetic (CSDA) combines factor sharing and distributed arithmetic sharing techniques, efficiently reducing the number of adders for high hardware-sharing capability.
This achieves a 44.5% reduction in adders in the proposed MST, compared with the direct implementation method. With eight parallel computation paths, the proposed MST core has an eightfold operation frequency throughput rate. Measurements show that the proposed CSDA-MST core achieves a high-throughput rate of 1.28 G-pels/s, supporting the (4928 × 2048@24 Hz) digital cinema or ultrahigh resolution format. This is possible with only 30 k gate counts when implemented in a TSMC 0.18-μm process. The CSDA-MST core thus achieves a high-throughput rate supporting multistandard transformations at low cost.

ETPL VLSI - 072 Efficient VLSI Implementation of Neural Networks With Hyperbolic Tangent Activation Function
Nonlinear activation function is one of the main building blocks of artificial neural networks. Hyperbolic tangent and sigmoid are the most used nonlinear activation functions. Accurate implementation of these transfer functions in digital networks faces certain challenges. In this paper, an efficient approximation scheme for hyperbolic tangent function is proposed. The approximation is based on a mathematical analysis considering the maximum allowable error as design parameter. Hardware implementation of the proposed approximation scheme is presented, which shows that the proposed structure compares favorably with previous architectures in terms of area and delay. The proposed structure requires fewer output bits for the same maximum allowable error when compared to the state-of-the-art. The number of output bits of the activation function determines the bit width of multipliers and adders in the network. Therefore, the proposed activation function results in reduction in area, delay, and power in VLSI implementation of artificial neural networks with hyperbolic tangent activation function.
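The idea of treating the maximum allowable error as a design parameter can be illustrated with a small Python sketch. This is a generic lookup-table scheme, not the approximation proposed in the paper: the function names, the midpoint-sampled table, and the choice of step size are assumptions for illustration. It relies on two facts about tanh: its slope never exceeds 1, so a table step of 2ε keeps midpoint samples within ε, and for |x| ≥ atanh(1 − ε) the output can saturate to ±1 with error at most ε.

```python
import math


def build_tanh_table(max_error: float):
    """Return (step, x_sat, table) sized from the allowed error."""
    if not 0.0 < max_error < 1.0:
        raise ValueError("max_error must lie in (0, 1)")
    x_sat = math.atanh(1.0 - max_error)  # beyond this, output saturates
    step = 2.0 * max_error               # |d tanh/dx| <= 1 bounds the error
    entries = math.ceil(x_sat / step)
    # Sample tanh at the midpoint of each interval [i*step, (i+1)*step).
    table = [math.tanh((i + 0.5) * step) for i in range(entries)]
    return step, x_sat, table


def tanh_approx(x: float, step: float, x_sat: float, table) -> float:
    """Approximate tanh(x) using odd symmetry, a table, and saturation."""
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= x_sat:
        return sign  # saturation region
    index = min(int(mag / step), len(table) - 1)
    return sign * table[index]


if __name__ == "__main__":
    step, x_sat, table = build_tanh_table(0.01)
    worst = max(
        abs(tanh_approx(i / 100.0, step, x_sat, table) - math.tanh(i / 100.0))
        for i in range(-600, 601)
    )
    print(len(table), worst)
```

Tightening the error bound enlarges the table and widens each stored value, which mirrors the link the abstract draws between allowable error, output bit width, and the cost of downstream multipliers and adders.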
ETPL VLSI - 073 Simultaneous Low-Pass Filtering and Total Variation Denoising
This paper seeks to combine linear time-invariant (LTI) filtering and sparsity-based denoising in a principled way in order to effectively filter (denoise) a wider class of signals. LTI filtering is most suitable for signals restricted to a known frequency band, while sparsity-based denoising is suitable for signals admitting a sparse representation with respect to a known transform. However, some signals cannot be accurately categorized as either band-limited or sparse. This paper addresses the problem of filtering noisy data for the particular case where the underlying signal comprises a low-frequency component and a sparse or sparse-derivative component. A convex optimization approach is presented and two algorithms derived: one based on majorization-minimization (MM), and the other based on the alternating direction method of multipliers (ADMM). It is shown that a particular choice of discrete-time filter, namely zero-phase noncausal recursive filters for finite-length data formulated in terms of banded matrices, makes the algorithms effective and computationally efficient. The efficiency stems from the use of fast algorithms for solving banded systems of linear equations. The method is illustrated using data from a physiological-measurement technique (i.e., near infrared spectroscopic time series imaging) that in many cases yields data that is well-approximated as the sum of low-frequency, sparse or sparse-derivative, and noise components.

ETPL VLSI - 074 Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal Filters
The implementation of transversal filters requires basic circuit elements such as adders, multipliers and (unit) delay elements.
Filters designed assuming infinite precision of these elements may behave differently when implemented with components of limited accuracy. In fact, the effects of the coefficient inaccuracies in analog and digital transversal filters have been investigated extensively in the literature [1], [2]. On the other hand, the effects of the unit delays with limited precision have not received similar attention. In this paper, we find that such effects, especially in very high frequency continuous-time semi-digital transversal filters, may not be ignored. As an example, we analyze the impact of delay errors in the implementation of the direct modulation transmitter. Specifically, we provide the analytical statistical performance bounds and confirm the results with simulations.

ETPL VLSI - 075 Two Polynomial FIR Filter Structures With Variable Fractional Delay and Phase Shift
This paper introduces two polynomial finite-length impulse response (FIR) digital filter structures with simultaneously variable fractional delay (VFD) and phase shift (VPS). The structures are reconfigurable (adaptable) online without redesign and do not exhibit transients when the VFD and VPS parameters are altered. The structures can be viewed as generalizations of VFD structures in the sense that they offer a VPS in addition to the regular VFD. The overall filters are composed of a number of fixed subfilters and a few variable multipliers whose values are determined by the desired FD and PS values. A systematic design algorithm, based on iterative reweighted ℓ1-norm minimization, is proposed. It generates fixed subfilters with many zero-valued coefficients, typically located in the impulse response tails.
The paper considers two different structures, referred to as the basic structure and the common-subfilters structure, and compares these proposals, as well as the existing cascaded VFD and VPS structures, in terms of arithmetic complexity, delay, memory cost, and transients. In general, the common-subfilters structure is superior when all of these aspects are taken into account. Further, the paper shows and exemplifies that the VFDPS filters under consideration can be used for simultaneous resampling and frequency shift of signals.
ETPL VLSI - 076 Algorithms and Architectures of Energy-Efficient Error-Resilient MIMO Detectors for Memory-Dominated Wireless Communication Systems
In a broadband MIMO-OFDM wireless communication system, embedded buffering memories occupy a large portion of the chip area and account for a significant share of the power consumption. Due to process variations of advanced CMOS technologies, it becomes both challenging and costly to maintain perfectly functioning memories under all anticipated operating conditions. Thus, Voltage over Scaling (VoS) has emerged as a means to achieve energy-efficient systems, resulting in a tradeoff between energy efficiency and reliability. In this paper we present the algorithm and VLSI architecture of a novel error-resilient K-Best MIMO detector based on the combined distribution of channel noise and induced errors due to VoS. The simulation results show that, compared with a conventional MIMO detector design, the proposed algorithm provides up to 4.5 dB gain to achieve the near-optimal Packet Error Rate (PER) performance in the 4 × 4 64-QAM system. Furthermore, based on experimental results, when jointly considering the detector and memory power consumption, the proposed resilient scheme with VoS memory can achieve up to 32.64% savings compared to the conventional K-Best detector with perfect memory.
ETPL VLSI - 077 A Methodology for Optimized Design of Secure Differential Logic Gates for DPA Resistant Circuits
Cryptocircuits can be attacked by third parties using differential power analysis (DPA), which uses power consumption dependence on data being processed to reveal critical information. To protect security devices against this issue, differential logic styles with (almost) constant power dissipation are widely used. However, to use such circuits effectively for secure applications it is necessary to eliminate any security flaw, in the form of memory effects, that could leak information. This paper proposes a design methodology to improve the pull-down logic configuration for secure differential gates by redistributing the charge stored in internal nodes, thus removing memory effects that represent a significant threat to security. To evaluate the methodology, it was applied to the design of AND/NAND and XOR/XNOR gates in a 90 nm technology, adopting the sense amplifier based logic (SABL) style for the pull-up network. The proposed solutions leak less information than typical SABL gates, increasing security by at least two orders of magnitude and with negligible performance degradation. A simulation-based DPA attack on the Sbox9 cryptographic module used in the Kasumi algorithm, implemented with complementary metal–oxide–semiconductor, SABL and proposed gates, was performed. The results obtained illustrate that the number of measurements needed to disclose the key increased by much more than one order of magnitude when using our proposal. This paper also discusses how the effectiveness of DPA attacks is influenced by operating temperature and details how to ensure energy-secure operations in the new proposals.
ETPL VLSI - 078 Reliability-Oriented Placement and Routing Algorithm for SRAM-Based FPGAs
As the feature size shrinks to the nanometer scale, SRAM-based FPGAs will become increasingly vulnerable to soft errors. Existing reliability-oriented placement and routing approaches primarily focus on reducing the fault occurrence probability (node error rate) of soft errors. However, our analysis shows that, besides the fault occurrence probability, the propagation probability (error propagation probability) plays an important role and should be taken into consideration. In this paper, we first propose a cube-based analysis algorithm to efficiently and accurately estimate the error propagation probability. Based on such a model, we propose a novel reliability-oriented placement and routing algorithm that combines the fault occurrence probability and the error propagation probability to enhance system-level robustness against soft errors. Experimental results show that, compared with the baseline versatile place and route technique, the proposed scheme can reduce the failure rate by 20.73%, and increase the mean time between failures by 39.44%.
ETPL VLSI - 079 Eliminating Synchronization Latency Using Sequenced Latching
Modern multicore systems have a large number of components operating in different clock domains and communicating through asynchronous interfaces. These interfaces use synchronizer circuits, which guard against metastability failures but introduce latency in processing the asynchronous input. We propose a speculative method that hides synchronization latency by overlapping it with computation cycles.
We verify the correctness of our approach through a field programmable gate array implementation and apply it to a number of synthesized benchmarks. Synthesis results reveal that our approach achieves average savings of 135% and 204% in area costs and nearly 100% in power costs compared to two similar speculative techniques.
ETPL VLSI - 080 Partial Access Mode: New Method for Reducing Power Consumption of Dynamic Random Access Memory
Demands have been placed on a dynamic random access memory (DRAM) to not only have increased memory capacity and data transfer speed, but also have reduced operating and standby currents. When a system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage current. Power consumption for the refresh operation increases in proportion to the memory capacity. We propose a new method to reduce the refresh power consumption by effectively extending the memory cell retention time. Conversion from 1 cell/bit to $2^{N}$ cells/bit reduces the variation in the retention time among memory cells. Although active power increases by a factor of $2^{N}$, the refresh time increases by more than $2^{N}$ as a consequence of the fact that the majority decision does better than averaging for the tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude. On the basis of this conversion method, we propose a partial access mode to reduce power consumption dynamically when the full memory capacity is not required.
ETPL VLSI - 081 Pulsed-Latch Utilization for Clock-Tree Power Optimization
Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine the tree topology, which maintains load balance and considers the wirelength between pulse generators and pulsed latches. Experimental results indicate that the proposed migration approach can improve the power consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.
ETPL VLSI - 082 Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors With Hardware-Efficient Architecture
This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput requirements for advanced wireless communication systems.
The proposed efficient folding scheme strikes a suitable balance between complexity and throughput. This paper also presents a discussion on the scalability of this architecture with respect to the setting of QAM size, K values, and antenna number. One 4×4 MIMO detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08 Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized power and normalized hardware efficiencies.
ETPL VLSI - 083 Finite Alphabet Iterative Decoders for LDPC Codes: Optimization, Architecture and Analysis
Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that occurs at very low error rates. Recently, a new type of decoder, termed finite alphabet iterative decoders (FAIDs), was introduced. The FAIDs use simple Boolean maps for variable node processing, and can surpass the BP-based decoders in the error floor region with very short word length. We restrict the scope of this paper to regular dv = 3 LDPC codes on the binary symmetric channel (BSC). This paper develops a low-complexity implementation architecture for the FAIDs by making use of their properties.
In particular, an innovative bit-serial check node unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% area to reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807, 7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with 45% area, and still leads to possible performance improvement in the error-floor region.
ETPL VLSI - 084 Constructing Sub-Arrays with Short Interconnects from Degradable VLSI Arrays
Reducing the interconnection length of VLSI arrays leads to less capacitance, power dissipation and dynamic communication cost between the processing elements (PEs). This paper develops efficient algorithms for constructing tightly-coupled subarrays from the mesh-connected VLSI arrays with faulty PEs. For a given size r·s of the target (logical) array, the proposed algorithm searches and reroutes a physical r×s subarray that has the least number of faults, resulting in an approximate target array, which is subsequently extended to the desired target array. Experimental results show that over 65 percent redundant interconnects can be reduced for a 64×64 target array on the 512×512 host array with no more than 1 percent faults. In addition, we propose a recursive divide-and-conquer algorithm for constructing the maximum target array (MTA). The lower bound of the total interconnection length of the MTA has been established.
Experimental results show that the proposed algorithm is capable of reducing the long interconnects by over 33 percent for the MTA derived from the 512×512 host array with no more than 1 percent faults. Moreover, the total interconnection length of the target array produced by the proposed algorithm is close to the lower bound for cases with a relatively small number of faults.
ETPL VLSI - 085 A Lattice Reduction-Aided MIMO Channel Equalizer in 90 nm CMOS Achieving 720 Mb/s
In this paper, a VLSI implementation of a complete MIMO channel equalization ASIC based on lattice reduction-aided linear detection is presented. The architecture performs preprocessing steps at channel rate and low-complexity linear data detection at symbol rate. Preprocessing is based on Seysen's algorithm for lattice reduction. We present algorithmic improvements of the lattice reduction preprocessing in terms of area and throughput of the VLSI implementation with minor impact on the error-rate. Due to the low-complexity implementation of the lattice reduction-aided data detection stage, our architecture is able to achieve very low power in typical packet-based MIMO wireless data transmission scenarios. The final 90 nm CMOS ASIC achieves an energy efficiency for the detection of 24 pJ/bit at a throughput of 720 Mbps with near-optimal error-rate performance.
ETPL VLSI - 086 Evaluation of Leakage Reduction Alternatives for Deep Submicron Dynamic Nonuniform Cache Architecture Caches
Wire delays and leakage energy consumption are both growing problems in designing large on-chip caches. Nonuniform cache architecture (NUCA) is a wire-delay aware design paradigm based on the sub-banking of a cache, which allows the banks closer to the controller to be accessed with reduced latencies with respect to the other banks.
This feature is leveraged by dynamic NUCA (D-NUCA) caches via a migration mechanism which speeds up frequently used data access, further reducing the effect wire delays have on performance. To reduce leakage power consumption of static random access memory caches, various micro-architectural techniques have been proposed. In this brief, we compare the benefits and limits of the application of some of these techniques to a D-NUCA cache memory, and propose a novel hybrid scheme based on the Drowsy and Way Adaptable techniques. Such a scheme allows further improvement in leakage reduction and limits the impact of process variation on the effectiveness of the Drowsy technique.
ETPL VLSI - 087 Split-SAR ADCs: Improved Linearity With Power and Speed Optimization
This paper presents the linearity analysis of successive approximation register (SAR) analog-to-digital converters (ADCs) with a split DAC structure based on two switching methods: conventional charge-redistribution and Vcm-based switching. The static linearity performance, namely the integral nonlinearity and differential nonlinearity, as well as the parasitic effects of the split DAC, are analyzed. In addition, a code-randomized calibration technique is proposed to correct the conversion nonlinearity in the conventional SAR ADC, which is verified by behavioral simulations as well as measured results. Performances of both switching methods are demonstrated in 90 nm CMOS. Measurement results of power, speed, and linearity clearly show the benefits of using Vcm-based switching.
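The two static linearity metrics named in the ETPL VLSI - 087 abstract, differential nonlinearity (DNL) and integral nonlinearity (INL), can be illustrated with a short Python sketch. This is not code from the paper: the function name `dnl_inl`, the endpoint-fit definition of the LSB, and the sample transition voltages are assumptions chosen only to show how the two quantities relate.

```python
from typing import List, Tuple


def dnl_inl(transitions: List[float]) -> Tuple[List[float], List[float]]:
    """Return (DNL, INL) in LSB from measured code-transition voltages.

    transitions[k] is the input voltage at which the output code changes
    from k to k + 1 (hypothetical measurement data). The LSB is taken as
    the average code width between the first and last transition, i.e. an
    endpoint fit, so the final INL value is zero by construction.
    """
    if len(transitions) < 3:
        raise ValueError("need at least three transition levels")
    # Width of each code bin, in volts.
    widths = [hi - lo for lo, hi in zip(transitions, transitions[1:])]
    lsb = (transitions[-1] - transitions[0]) / len(widths)
    # DNL: deviation of each code width from the ideal 1 LSB.
    dnl = [w / lsb - 1.0 for w in widths]
    # INL: running sum of the DNL values.
    inl: List[float] = []
    acc = 0.0
    for d in dnl:
        acc += d
        inl.append(acc)
    return dnl, inl


if __name__ == "__main__":
    dnl, inl = dnl_inl([0.0, 1.0, 2.5, 3.0, 4.0])
    print("DNL:", dnl)
    print("INL:", inl)
```

In this toy data set the second code bin is 0.5 LSB too wide and the third is 0.5 LSB too narrow, so the INL peaks after the wide bin and returns to zero at the end point.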
ETPL VLSI - 088 An Event-Based Neural Network Architecture With an Asynchronous Programmable Synaptic Memory
We present a hybrid analog/digital very large scale integration (VLSI) implementation of a spiking neural network with programmable synaptic weights. The synaptic weight values are stored in an asynchronous Static Random Access Memory (SRAM) module, which is interfaced to a fast current-mode event-driven DAC for producing synaptic currents with the appropriate amplitude values. These currents are further integrated by current-mode integrator synapses to produce biophysically realistic temporal dynamics. The synapse output currents are then integrated by compact and efficient integrate-and-fire silicon neuron circuits with spike-frequency adaptation and adjustable refractory period and spike-reset voltage settings. The fabricated chip comprises a total of 32 × 32 SRAM cells, 4 × 32 synapse circuits and 32 × 1 silicon neurons. It acts as a transceiver, receiving asynchronous events as input, performing neural computation with hybrid analog/digital circuits on the input spikes, and eventually producing digital asynchronous events as output. Input, output, and synaptic weight values are transmitted to/from the chip using a common communication protocol based on the Address Event Representation (AER). Using this representation it is possible to interface the device to a workstation or a micro-controller and explore the effect of different types of Spike-Timing Dependent Plasticity (STDP) learning algorithms for updating the synaptic weight values in the SRAM module. We present experimental results demonstrating the correct operation of all the circuits present on the chip.
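The integrate-and-fire behaviour described in the ETPL VLSI - 088 abstract (integration of the synaptic current, a firing threshold, a spike-reset voltage and a refractory period) can be sketched in software as a leaky integrate-and-fire model. The sketch below is an illustration only, not the chip's circuit: the function name `simulate_lif` and all parameter values are assumptions, and spike-frequency adaptation is deliberately left out to keep the example short.

```python
from typing import List, Sequence


def simulate_lif(
    input_current: Sequence[float],
    dt: float = 1e-4,        # time step in seconds (assumed)
    tau: float = 0.02,       # membrane time constant in seconds (assumed)
    r_m: float = 1e7,        # membrane resistance in ohms (assumed)
    v_rest: float = 0.0,     # resting potential in volts (assumed)
    v_thresh: float = 0.02,  # firing threshold in volts (assumed)
    v_reset: float = 0.0,    # spike-reset voltage in volts (assumed)
    t_refrac: float = 0.002, # refractory period in seconds (assumed)
) -> List[float]:
    """Return spike times (seconds) of a leaky integrate-and-fire neuron."""
    refrac_steps = round(t_refrac / dt)
    v = v_rest
    hold = 0  # remaining refractory steps
    spikes: List[float] = []
    for step, i_in in enumerate(input_current):
        if hold > 0:
            # During the refractory period the membrane stays at reset.
            hold -= 1
            v = v_reset
            continue
        # Forward-Euler update of tau * dv/dt = -(v - v_rest) + r_m * i_in.
        v += dt / tau * (-(v - v_rest) + r_m * i_in)
        if v >= v_thresh:
            spikes.append(step * dt)
            v = v_reset
            hold = refrac_steps
    return spikes


if __name__ == "__main__":
    one_second = 10_000  # number of steps at dt = 1e-4 s
    for amps in (1e-9, 3e-9, 6e-9):
        n = len(simulate_lif([amps] * one_second))
        print(f"{amps:.0e} A -> {n} spikes")
```

With these assumed values a 1 nA input stays below threshold and produces no spikes, while larger inputs fire at a rate that grows with the current and is capped by the refractory period.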
ETPL VLSI - 089 Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given their limited energy supply from batteries or scavenging, these nodes must trade data communication for on-the-node computation. Currently, they are designed around off-the-shelf low-power microcontrollers. But by employing a more appropriate processing element, the energy consumption can be significantly reduced. This paper describes the design and implementation of the newly proposed folded-tree architecture for on-the-node data processing in wireless sensor networks, using parallel prefix operations and data locality in hardware. Measurements of the silicon implementation show an improvement of 10–20× in terms of energy as compared to traditional modern micro-controllers found in sensor nodes.
ETPL VLSI - 090 Partial Access Mode: New Method for Reducing Power Consumption of Dynamic Random Access Memory
Demands have been placed on a dynamic random access memory (DRAM) to not only have increased memory capacity and data transfer speed, but also have reduced operating and standby currents. When a system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage current. Power consumption for the refresh operation increases in proportion to the memory capacity. We propose a new method to reduce the refresh power consumption by effectively extending the memory cell retention time. Conversion from 1 cell/bit to $2^{N}$ cells/bit reduces the variation in the retention time among memory cells.
Although active power increases by a factor of $2^{N}$, the refresh time increases by more than $2^{N}$ as a consequence of the fact that the majority decision does better than averaging for the tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude. On the basis of this conversion method, we propose a partial access mode to reduce power consumption dynamically when the full memory capacity is not required.
ETPL VLSI - 091 Optimization Scheme to Minimize Reference Resistance Distribution of Spin-Transfer-Torque MRAM
Spin-transfer-torque magnetoresistive random access memory (STT-MRAM) is an emerging type of nonvolatile memory with compelling advantages in endurance, scalability, speed, and energy consumption. As the process technology shrinks, STT-MRAM has limited sensing margin due to the decrease in supply voltage and increase in process variation. Furthermore, the relatively small resistance difference between the two states in STT-MRAM poses challenges for its read/write circuit design to maintain an acceptable sensing margin. The proposed reference-circuit optimization scheme solves the reference resistance distribution issue to maximize the sensing margin and minimize the read disturbance, with low power consumption. Simulation results show that the optimization scheme is able to significantly improve the read reliability in the presence of one or a few cases of reference cell failure, thus eliminating the requirement of additional circuits for failure detection of reference cells or referencing to neighboring blocks.
ETPL VLSI - 092 Towards Low-Power High-Efficiency RF and Microwave Energy Harvesting
Since the very beginning of RF and microwave integrated techniques and energy harvesting, Schottky diodes have been widely used. In μW-harvesting applications, however, the Schottky diode technique fails to provide a satisfactory RF-dc conversion efficiency, mainly because of its high zero-bias junction resistance. This paper examines the state-of-the-art low-power microwave-to-dc energy conversion techniques. A comprehensive picture of the state of the art in this area is given graphically, comparing different technologies such as transistor, diode, and CMOS schemes. Subsequent to the highlighted limitations of current devices, this work introduces, for the first time, a nonlinear component for low-power rectification based on a recent discovery in spintronics, namely, the spindiode. Along with an analysis of the role of nonlinearity and zero-bias resistance in the rectification process of the spindiode, it is shown how the spindiode could enhance the rectification efficiency even at a very low power level and how this technique would shift the design paradigms of diode-based devices and circuits.
ETPL VLSI - 093 Access Time and Power Dissipation of a Model 256-Bit Single Flux Quantum RAM
Superconductor electronics offers logic circuits for high-speed data processing and high-performance computing. The main barrier to practical application is the lack of high-speed and low-power memory. It is widely believed that the most reliable and functional bit cell for superconducting memory is the vortex transitional bit cell, which was successfully used by Nagasawa in a 4-kb memory.
This paper reviews existing challenges in this type of Josephson memory device and discusses engineering issues in implementing a model single flux quantum random access memory. We evaluate the contributions that various components of the memory system make to delay and power dissipation. The 256-bit memory provides an experimentally confirmed read access time of 190 ps. We found that delay and power dissipation arise largely in the address decoder, line drivers, bit-selection scheme, and the data readout circuitry. With these circuits being similar for various magnetic memory devices, our findings provide essential data for a comprehensive assessment of new concepts for bit cells, readout, and write in superconducting memories.
ETPL VLSI - 094 All-Optical Ultrafast Switching in 2 × 2 Silicon Microring Resonators and its Application to Reconfigurable DEMUX/MUX and Reversible Logic Gates
We present a theoretical model to analyze all-optical switching by two-photon absorption induced free-carrier injection in silicon 2 × 2 add-drop microring resonators. The theoretical simulations are in good agreement with experimental results. The results have been used to design all-optical ultrafast (i) reconfigurable Demultiplexer/Multiplexer logic circuits using three microring resonator switches and (ii) universal, conservative and reversible Fredkin and Toffoli logic gates with only one and two microring resonator switches respectively. Switching has been optimized for low-power (25 mW) ultrafast (25 ps) operation with high modulation depth (85%) to enable logic operations at 40 Gb/s. The combined advantages of high Q-factor, tunability, compactness, cascadability, reversibility and reconfigurability make the designs favorable for practical applications. The proposed designs provide a new paradigm for ultrafast CMOS-compatible all-optical reversible computing circuits in silicon.
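The Fredkin and Toffoli gates mentioned in the ETPL VLSI - 094 abstract have simple bit-level definitions, and their reversibility can be checked exhaustively. The Python sketch below models only the Boolean behaviour, not the optical switching; the helper names `is_reversible` and `is_conservative` are introduced here for illustration and do not come from the paper.

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, int, int]
Gate = Callable[[int, int, int], Bits]


def toffoli(a: int, b: int, c: int) -> Bits:
    """Controlled-controlled-NOT: flip c only when a and b are both 1."""
    return a, b, c ^ (a & b)


def fredkin(c: int, x: int, y: int) -> Bits:
    """Controlled swap: exchange x and y only when the control c is 1."""
    return (c, y, x) if c else (c, x, y)


def is_reversible(gate: Gate) -> bool:
    """True if the gate maps the 8 input triples to 8 distinct outputs."""
    inputs = list(product((0, 1), repeat=3))
    return len({gate(*bits) for bits in inputs}) == len(inputs)


def is_conservative(gate: Gate) -> bool:
    """True if every output has the same number of 1s as its input."""
    return all(sum(gate(*bits)) == sum(bits)
               for bits in product((0, 1), repeat=3))


if __name__ == "__main__":
    for name, gate in (("Toffoli", toffoli), ("Fredkin", fredkin)):
        print(name, "reversible:", is_reversible(gate),
              "conservative:", is_conservative(gate))
    # Both gates compute AND on their third output when the last input is 0.
    for a, b in product((0, 1), repeat=2):
        print(a, b, toffoli(a, b, 0)[2], fredkin(a, b, 0)[2])
```

In this bit-level model both gates are their own inverse, and fixing the last input to 0 yields a AND b on the third output, which is the usual starting point for universality arguments. The conservation check holds for the Fredkin gate; the Toffoli gate, being a controlled inversion, changes the number of 1s for some inputs.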
ETPL VLSI - 095 Using Lifetime-Aware Progressive Programming to Improve SLC NAND Flash Memory Write Endurance
This paper advocates a lifetime-aware progressive programming concept to improve single-level-per-cell (SLC) NAND flash memory write endurance. NAND flash memory program/erase (P/E) cycling gradually degrades memory cell storage noise margin, and sufficiently strong fault tolerance must be used to ensure the memory P/E cycling endurance. As a result, the relatively large cell storage noise margin in early memory lifetime is essentially wasted in conventional design practice. This paper proposes to always fully utilize the available cell storage noise margin by adaptively adjusting the number of storage levels per cell, and progressively use these levels to realize multiple 1-bit programming operations between two consecutive erase operations. This simple progressive programming design concept is realized by two different implementation strategies, which are discussed and compared in detail. On the basis of an approximate NAND flash memory device model, we carried out simulations to quantitatively evaluate this design concept. The results show that it can improve the write endurance by 35.9% while also improving the average programming speed by 12% without sacrificing read speed.
ETPL VLSI - 096 Design of a Low-Voltage Low-Dropout Regulator
A low-voltage low-dropout (LDO) regulator in 90-nm CMOS technology that converts an input of 1 V to an output of 0.85–0.5 V is proposed.
A simple symmetric operational transconductance amplifier is used as the error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly 50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 mm², because of the compact architecture.
ETPL VLSI - 097 Test Compaction by Sharing of Transparent-Scan Sequences Among Logic Blocks
An approach to test application called transparent scan provides an opportunity to share tests among different logic blocks whose primary inputs and outputs are included in scan chains even if the blocks have different numbers of state variables. A transparent-scan sequence for one block is likely to detect faults in other blocks since transparent scan does not distinguish between functional and scan clock cycles, and allows faults to be detected at all the clock cycles of the sequence. Such sharing of tests is not meaningful with conventional scan-based tests, especially when the blocks have different numbers of state variables. Transparent scan thus enhances the ability to produce a compact test set for a group of logic blocks. The static test compaction procedure described in this paper uses transparent-scan sequences that follow the application of conventional scan-based tests precisely. The procedure obtains a set of transparent-scan sequences for a group of logic blocks from compacted test sets for the logic blocks in the group.
From this set, it selects a subset that detects all the target faults that are detected by the complete set.
ETPL VLSI - 098 Multifunction Residue Architectures for Cryptography
A design methodology for incorporating Residue Number System (RNS) and Polynomial Residue Number System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n) respectively, as well as a VLSI architecture of a dual-field residue arithmetic Montgomery multiplier are presented in this paper. An analysis of input/output conversions to/from residue representation, along with the proposed residue Montgomery multiplication algorithm, reveals common multiply-accumulate data paths both between the converters and between the two residue representations. A versatile architecture is derived that supports all operations of Montgomery multiplication in GF(p) and GF(2^n), input/output conversions, Mixed Radix Conversion (MRC) for integers and polynomials, dual-field modular exponentiation and inversion in the same hardware. Detailed comparisons with state-of-the-art implementations prove the potential of residue arithmetic exploitation in dual-field modular multiplication.
ETPL VLSI - 099 Scalable Montgomery Modular Multiplication Architecture with Low-Latency and Low-Memory Bandwidth Requirement
Montgomery modular multiplication is widely used in public-key cryptosystems. This work shows how to relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling scheme to reduce the number of memory accesses in the developed scalable architecture.
Analytical results show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times that of conventional scalable architectures, where w denotes the word size. The proposed architecture also retains a latency of exactly one cycle between the operations of the same words in two consecutive iterations of the Montgomery modular multiplication algorithm when enough processing elements are employed. Compared to the design in the related work, experimental results demonstrate that the proposed architecture achieves an almost 54 percent reduction in power consumption with no degradation in throughput. The reduced number of memory accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for any operand precision.

ETPL VLSI - 100 An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes Decoder Implementation

Polar codes have recently received a lot of attention because of their capacity-achieving performance and low encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of polar codes depends heavily on that of the partial-sum network (PSN) implementation. Hence, in this work, an efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum updating algorithm and the corresponding PSN architecture are introduced, which achieve a delay performance independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-performance and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented that integrates seamlessly with the folded processing element architecture. This is achieved by using a novel folded decoding schedule.
As a result, both the critical path delay and the area (excluding the memory for folding) of the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that, for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the area is reduced by as much as 50.4%, compared with the state-of-the-art designs.

ETPL VLSI - 101 Improved Matching-Pursuit Implementation for LTE Channel Estimation

An implementation of a reduced-complexity matching pursuit channel estimator for LTE is presented. The design contains an FFT/IFFT module with non-radix-2 units and a core estimator. The module is flexible enough to perform FFT and IFFT at the different resolutions needed, using the same hardware. The required internal word lengths are determined based on prior work. Internal shifts are employed to maximize the use of available resources. The design is implemented in a 65 nm low-power process from STMicroelectronics. The total area of the design is 1 mm², including input pads and extra control logic. The algorithmic improvements reduce the complexity by up to 56% compared to prior art. At the same time, the estimator shows a large improvement in speed, allowing more than six times as many estimations in the same time. The simulated power consumption of the estimator is approximately 20 mW when running at 70 MHz.

ETPL VLSI - 102 A Multicast Tree Router for Multichip Neuromorphic Systems

We present a tree router for multichip systems that guarantees deadlock-free multicast packet routing without dropping packets or restricting their length.
Multicast routing is required to efficiently connect the computational units of massively parallel systems when each unit is connected to thousands of others residing on multiple chips, which is the case in neuromorphic systems. Our tree router implements this one-to-many routing by branching recursively, broadcasting the packet within a specified subtree. Within this subtree, the packet is only accepted by chips that have been programmed to do so. This approach boosts throughput because memory look-ups are avoided en route, and keeps the header compact because it only specifies the route to the subtree's root. Deadlock is avoided by routing in two phases (an upward phase and a downward phase) and by restricting branching to the downward phase. This design is the first fully implemented wormhole router with packet branching that can never deadlock. The design's effectiveness is demonstrated in Neurogrid, a million-neuron neuromorphic system consisting of sixteen chips. Each chip has a 256 × 256 silicon-neuron array integrated with a full-custom asynchronous VLSI implementation of the router that delivers up to 1.17 G words/s across the sixteen-

ETPL VLSI - 103 1μ VLSI implementation of high-throughput parallel H.264/AVC baseline intrapredictor

This study presents a parallel very-large-scale integration (VLSI) architecture for an intra-predictor based on a fast 4 × 4 algorithm. For real-time scheduling, the proposed algorithm overcomes the data dependency between intra-prediction and intra-coding, thereby improving coding performance and reducing the number of coding cycles. The high-speed architecture for intra-prediction includes configurable computation cores that process YUV components using 10-pixel parallelism. Prediction for one macro-block (MB) coding (luminance: 4 × 4 and 16 × 16 block modes; chrominance: 8 × 8 block modes) can be completed within 256 cycles.
The proposed architecture achieves a throughput of 410 kMB/s, suitable for a 1920 × 1080/35 Hz 4:2:0 HDTV encoder, at a working frequency of 105 MHz.

Thank You!
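
The throughput figures quoted in the ETPL VLSI - 103 abstract can be cross-checked with a little arithmetic. The sketch below is not from the paper; it assumes the standard 16 × 16 H.264 macro-block, a frame height padded up to a whole number of macro-block rows, and the worst case of 256 cycles per macro-block stated in the abstract. The function names are illustrative only.

```python
import math

MB_SIZE = 16  # H.264 macro-block is 16 x 16 luma samples


def required_mb_rate(width, height, fps, mb_size=MB_SIZE):
    """Macro-blocks per second a video format demands (assumes padded frames)."""
    mbs_per_row = math.ceil(width / mb_size)   # 1920 / 16 = 120
    mb_rows = math.ceil(height / mb_size)      # 1080 / 16 = 67.5, padded to 68
    return mbs_per_row * mb_rows * fps


def available_mb_rate(clock_hz, cycles_per_mb):
    """Macro-blocks per second the predictor sustains at a given clock."""
    return clock_hz / cycles_per_mb


# Figures taken from the abstract: 105 MHz clock, 256 cycles per macro-block,
# 1920 x 1080 video at 35 Hz.
required = required_mb_rate(1920, 1080, 35)
available = available_mb_rate(105e6, 256)

print(f"required : {required / 1e3:.1f} kMB/s")
print(f"available: {available / 1e3:.1f} kMB/s")
```

Under these assumptions, 105 MHz divided by 256 cycles gives roughly 410 kMB/s, which matches the quoted throughput and exceeds the roughly 286 kMB/s that a 1920 × 1080 stream at 35 Hz requires.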