Elysium Technologies Private Limited

Singapore | Madurai | Chennai | Trichy | Coimbatore | Ramnad
Pondicherry | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
ETPL NT-001  Answering “What-If” Deployment and Configuration Questions With WISE: Techniques and Deployment Experience
ETPL NT-002  Complexity Analysis and Algorithm Design for Advance Bandwidth Scheduling in Dedicated Networks
ETPL NT-003  Diffusion Dynamics of Network Technologies With Bounded Rational Users: Aspiration-Based Learning
ETPL NT-004  Delay-Based Network Utility Maximization
ETPL NT-005  A Distributed Control Law for Load Balancing in Content Delivery Networks
ETPL NT-006  Efficient Algorithms for Neighbor Discovery in Wireless Networks
ETPL NT-007  Stochastic Game for Wireless Network Virtualization
ETPL NT-008  ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification
ETPL NT-009  A Utility Maximization Framework for Fair and Efficient Multicasting in Multicarrier Wireless Cellular Networks
ETPL NT-010  Achieving Efficient Flooding by Utilizing Link Correlation in Wireless Sensor Networks
ETPL NT-011  Random Walks and Green's Function on Digraphs: A Framework for Estimating Wireless Transmission Costs
ETPL NT-012  A Flexible Platform for Hardware-Aware Network Experiments and a Case Study on Wireless Network Coding
ETPL NT-013  Exploring the Design Space of Multichannel Peer-to-Peer Live Video Streaming Systems
ETPL NT-014  Secondary Spectrum Trading—Auction-Based Framework for Spectrum Allocation and Profit Sharing
ETPL NT-015  Towards Practical Communication in Byzantine-Resistant DHTs
ETPL NT-016  Semi-Random Backoff: Towards Resource Reservation for Channel Access in Wireless LANs
ETPL NT-017  Entry and Spectrum Sharing Scheme Selection in Femtocell Communications Markets
ETPL NT-018  On Replication Algorithm in P2P VoD
ETPL NT-019  Back-Pressure-Based Packet-by-Packet Adaptive Routing in Communication Networks
ETPL NT-020  Scheduling in a Random Environment: Stability and Asymptotic Optimality
ETPL NT-021  An Empirical Interference Modeling for Link Reliability Assessment in Wireless Networks
ETPL NT-022  On Downlink Capacity of Cellular Data Networks With WLAN/WPAN Relays
ETPL NT-023  Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management
ETPL NT-024  Localization of Wireless Sensor Networks in the Wild: Pursuit of Ranging Quality
ETPL NT-025  Control of Wireless Networks With Secrecy
ETPL NT-026  ICTCP: Incast Congestion Control for TCP in Data-Center Networks
ETPL NT-027  Context-Aware Nanoscale Modeling of Multicast Multihop Cellular Networks
ETPL NT-028  Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information
ETPL NT-029  Internet-Scale IPv4 Alias Resolution With MIDAR
ETPL NT-030  Time-Bounded Essential Localization for Wireless Sensor Networks
ETPL NT-031  Stability of FIPP p-Cycles Under Dynamic Traffic in WDM Networks
ETPL NT-032  Cooperative Carrier Signaling: Harmonizing Coexisting WPAN and WLAN Devices
ETPL NT-033  Mobility Increases the Connectivity of Wireless Networks
ETPL NT-034  Topology Control for Effective Interference Cancellation in Multiuser MIMO Networks
ETPL NT-035  Distortion-Aware Scalable Video Streaming to Multinetwork Clients
ETPL NT-036  Combined Optimal Control of Activation and Transmission in Delay-Tolerant Networks
ETPL NT-037  A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless
ETPL
VLSI - 001
Efficient VLSI Architecture For Interpolation Decoding Of Hermitian
Codes
A fast, area efficient very large scale integration (VLSI) architecture is proposed [...]
of interpolation modules. The algorithm has a regular structure which makes it
suitable for VLSI implementation. The circuitry is simplified as the decoding algorithm directly gives the
message word at the end of the decoding algorithm without separate [...].
Further speed improvements can be achieved by combining the main idea of Guruswami list decoding with
the Lee-O'Sullivan algorithm. In
terms of hardware, the addition of this concept will further reduce the running time of the algorithm and
make the circuitry abo[...]
[...]-O'Sullivan algorithms on Xilinx Virtex-5 shows that the proposed
decoder can be operated at a higher clock frequency with almost the same area complexity.
ETPL
VLSI - 002
Designing Hardware-Efficient Fixed-Point FIR Filters In An Expanding
Subexpression Space
This paper presents a practical method for designing fixed-point FIR filters. The proposed method takes both
the filter's magnitude response and its hardware cost into consideration in the design process. The method
constructs a basis set based on the fixed-point coefficients that have been synthesized already. The elements in
the basis set are used to synthesize the undetermined fixed-point coefficients later. Thus, this basis set expands
gradually along with the progress of the coefficient design. The method employs some strategies to speed up
the design process. For example, a complexity estimation strategy helps us stop digging deeper in some
branches of the search tree, and a solution prediction strategy for high-order FIR filters helps us design
fixed-point FIR filters with lengths of a few hundred. Applying the proposed method to twenty
benchmark cases, we obtain hardware-efficient results in a reasonable design time. In two long filter
design cases, our results are better than those produced by the other methods.
ETPL
VLSI - 003
Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching
Memories
Emerging non-volatile memories (NVMs) based on resistive switching (RS) mechanisms, such as STT-MRAM, OxRRAM
and CBRAM, are under intense R&D investigation by both academia and industry. They provide high
write/read speed, low power and good endurance (e.g., > 10^12) beyond mainstream NVMs, which allows them
to be embedded directly with logic units for computing purposes. This integration could significantly increase
the power/die-area efficiency and definitively overcome the power/speed bottlenecks of modern VLSIs.
This paper first presents a theoretical investigation of synchronous NV logic gates based on RS memories
(RS-NVL). Special design techniques and strategies are proposed to optimize the structure according to the different
resistive characteristics of NVMs. To validate this study, we simulated a non-volatile full-adder (NVFA) with two
types of NVMs, STT-MRAM and OxRRAM, using a CMOS 40 nm design kit and compact models that include
related physics and experimental parameters. They show interesting power, speed and area gains compared
with a synchronized CMOS FA while keeping good reliability.
ETPL
VLSI - 004
An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work,
we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We
investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth
(MB) form. We introduce a structured and efficient recoding technique and explore
three different schemes by incorporating them in FAM designs. Comparing them with the FAM designs which
use existing recoding schemes, the proposed technique yields considerable reductions in terms of critical
delay, hardware complexity and power consumption of the FAM unit.
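
As a rough illustration of the Modified Booth (MB) form mentioned above, the sketch below computes conventional radix-4 MB digits and uses them in an add-multiply. It is not the direct sum-to-MB recoding technique of the paper: the sum is formed first with an ordinary addition. The function names and the `bits` parameter are assumptions made for this example.

```python
def modified_booth_digits(value: int, bits: int) -> list[int]:
    """Radix-4 Modified Booth digits (least significant first) of a
    two's-complement integer that fits in `bits` bits (`bits` must be even)."""
    if bits <= 0 or bits % 2 != 0:
        raise ValueError("bits must be a positive even number")
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    word = value & ((1 << bits) - 1)  # two's-complement bit pattern

    def bit(i: int) -> int:
        return 0 if i < 0 else (word >> i) & 1  # x[-1] is defined as 0

    # Each digit d_j = -2*x[2j+1] + x[2j] + x[2j-1] lies in {-2, -1, 0, 1, 2}.
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(bits // 2)]


def booth_value(digits: list[int]) -> int:
    """Reassemble the integer from its radix-4 digits."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


def add_multiply(a: int, b: int, x: int, bits: int = 8) -> int:
    """Compute (a + b) * x from the MB digits of the sum.
    The sum of two `bits`-wide operands needs one extra bit, so recode at
    bits + 2 to keep the width even."""
    digits = modified_booth_digits(a + b, bits + 2)
    # Each digit selects one partial product from {0, +-x, +-2x}, weighted by 4^j.
    return sum(d * x * 4 ** j for j, d in enumerate(digits))


print(modified_booth_digits(7, 4))   # digits of 7 in 4 bits
print(add_multiply(3, 4, 5, bits=4))
```

Because each radix-4 digit covers two bits, the multiplier needs only half as many partial products as a plain binary multiplier, which is why the recoder sits on the critical path of an add-multiply unit.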
ETPL
VLSI - 005
Low-Power Pulse-Triggered Flip-Flop Design Based on a Signal Feed-Through
In this brief, a low-power flip-flop (FF) design featuring an explicit type pulse-triggered structure and a
modified true single phase clock latch based on a signal feed-through scheme is presented. The proposed
design successfully solves the long discharging path problem in conventional explicit type pulse-triggered FF
(P-FF) designs and achieves better speed and power performance. Based on post-layout simulation results
using TSMC CMOS 90-nm technology, the proposed design outperforms the conventional P-FF design
data-close-to-output (ep-DCO) by 8.2% in data-to-Q delay. Meanwhile, the performance edges on power and
power-delay-product metrics are 22.7% and 29.7%, respectively.
ETPL
VLSI - 006
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low
Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square
adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient implementation, we use a
novel partial product generator and propose a strategy for optimized balanced pipelining across the
time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design
offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the
best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient
fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state
error. We show that the steady-state mean squared error obtained from the analytical result matches with the
simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which
provides nearly 20% saving in ADP and 9% saving in EDP over the
proposed structure before pruning without noticeable degradation of steady-state-error performance.
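
The delayed least mean square update described above can be sketched in floating point as follows. This is only a behavioural model under stated assumptions: the function name, the step size `mu` and the test signal are illustrative, and the fixed-point arithmetic, pipelining and bit-level pruning of the proposed architecture are not modelled.

```python
import random
from collections import deque


def delayed_lms(x, d, num_taps, mu, adaptation_delay):
    """Delayed LMS: w(n+1) = w(n) + mu * e(n-m) * x(n-m), m = adaptation_delay.
    With m = 0 this reduces to the ordinary LMS update."""
    w = [0.0] * num_taps
    taps = [0.0] * num_taps   # x(n), x(n-1), ..., newest sample first
    pending = deque()         # (error, input vector) pairs waiting m samples
    errors = []
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]                       # shift in the new sample
        y = sum(wi * ti for wi, ti in zip(w, taps))   # filter output
        e = dn - y
        errors.append(e)
        pending.append((e, taps))
        if len(pending) > adaptation_delay:
            e_old, taps_old = pending.popleft()       # error from m samples ago
            w = [wi + mu * e_old * ti for wi, ti in zip(w, taps_old)]
    return w, errors


# Illustrative use: identify a hypothetical 3-tap system from noise-free data.
rng = random.Random(1)
unknown = [0.5, -0.25, 0.125]
signal = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
desired = [sum(unknown[k] * signal[n - k] for k in range(3) if n - k >= 0)
           for n in range(len(signal))]
weights, _ = delayed_lms(signal, desired, num_taps=3, mu=0.05, adaptation_delay=2)
print([round(wi, 3) for wi in weights])
```

A larger adaptation delay lets the hardware pipeline the error computation more deeply, at the cost of a tighter stability limit on the step size.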
ETPL
VLSI - 007
Rate-0.96 LDPC Decoding VLSI for Soft-Decision Error Correction of NAND
Flash Memory
The reliability of data stored in high-density Flash memory devices tends to decrease rapidly because of the
reduced cell size and multilevel cell technology. Soft-decision error correction algorithms that use
multiple-precision sensing for reading memory can solve this problem; however, they require very complex hardware
for high-throughput decoding. In this paper, we present a rate-0.96 (68254, 65536) shortened Euclidean
geometry low-density parity-check code and its VLSI implementation for high-throughput NAND Flash
memory systems. The design employs the normalized a posteriori probability (APP)-based algorithm, serial
schedule, and conditional update, which lead to simple functional units, halved decoding iterations, and
low-power consumption, respectively. A pipelined-parallel architecture is adopted for high-throughput decoding,
and memory-reduction techniques are employed to minimize the chip size. The proposed decoder is
implemented in 0.13-μm [...] of the decoder are
compared with those of a BCH (Bose-Chaudhuri-Hocquenghem) decoding circuit showing comparable
error-correcting performance and throughput.
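
The "rate-0.96" figure follows directly from the (68254, 65536) code parameters quoted above. The short check below reproduces that arithmetic; the function name is an assumption for this example.

```python
def code_rate(codeword_bits: int, message_bits: int) -> float:
    """Return k/n, the fraction of each codeword that carries user data."""
    if not 0 < message_bits <= codeword_bits:
        raise ValueError("need 0 < k <= n")
    return message_bits / codeword_bits


N, K = 68254, 65536      # (n, k) taken from the abstract
parity_bits = N - K      # redundancy stored alongside each data block
print(f"rate = {code_rate(N, K):.4f}, "
      f"parity bits = {parity_bits}, payload = {K // 8} bytes")
```

The rate works out to about 0.9602, with 2718 parity bits protecting an 8192-byte payload.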
ETPL
VLSI - 008
A Generalized Lattice Filter for Finite Wordlength Implementation With
Reduced Number of Multipliers
The excellent finite wordlength (FWL) property of lattice digital filters is well known. The four-multiplier
normalized lattice, with signal power at all delay elements normalized to unity, has particular advantage in its
overflow property. However, when used to implement an Nth-order digital filter, the normalized lattice
implementation requires 5N+1 multipliers. There exists another lattice structure with excellent FWL property
called the injected numerator lattice structure. In this paper, we combine the injected numerator lattice and
tapped numerator lattice to form a new hybrid lattice structure, which is not only canonic in the number of
multipliers resulting in a significant reduction in overall implementation cost but also exhibits much better
FWL properties than the normalized lattice [...]
for applications where the input signal has a strong time-varying sinusoidal component. The new structure
requires a few additional adders; it can be used to implement any causal and stable z-transform transfer
function. Two numerical examples are presented to demonstrate the performance of the proposed structure.
ETPL
VLSI - 009
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low
Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square
adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient implementation, we use a
novel partial product generator and propose a strategy for optimized balanced pipelining across the
time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design
offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the
best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient
fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state
error. We show that the steady-state mean squared error obtained from the analytical result matches with the
simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which
provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning without
noticeable degradation of steady-state-error performance.
ETPL
VLSI - 010
Finite Alphabet Iterative Decoders for LDPC Codes: Optimization,
Architecture and Analysis
Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit
approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these
codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that
occurs at very low error rates. Recently, a new type of decoders termed finite alphabet iterative decoders
(FAIDs) were introduced. The FAIDs use simple Boolean maps for variable node processing, and can surpass
the BP-based decoders in the error floor region with very short word length. We restrict the scope of this paper
to regular dv=3 LDPC codes on the BSC channel. This paper develops a low-complexity implementation
architecture for the FAIDs by making use of their properties. Particularly, an innovative bit-serial check node
unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in
the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware
utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% area to
reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807,
7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to
an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with
45% area, and still leads to possible performance improvement in the error-floor region.
ETPL
VLSI - 011
Precise VLSI Architecture for AI Based 1-D/ 2-D Daub-6 Wavelet Filter Banks
With Low Adder-Count
A multiplier-less architecture based on algebraic integer representation for computing the Daubechies
6-tap wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on
previous designs in the sense that it minimizes the number of parallel 2-input adder circuits. The
algorithm was achieved using brute-force numerical optimization of the algebraic integer
representation. The proposed architecture furnishes exact computation up to the final reconstruction
step, which is the operation that maps the exactly computed filtered results from algebraic integer
representation to fixed-point. Compared to our recent work, this architecture shows a reduction of
$27 \cdot n - 16$ adder circuits, where $n$ is the number of wavelet decomposition levels. The design
is physically implemented for a 4-level 1-D/2-D decomposition using a Xilinx Virtex-6 vcx240t-1ff1156
field programmable gate array (FPGA) device operating at up to a maximum clock
frequency of 344/168 MHz. The 1-D/2-D FPGA implementations are tested using hardware
co-simulation on an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs
shows an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.
ETPL
VLSI - 012
Design of Efficient Binary Comparators in Quantum-Dot Cellular Automata
Quantum-dot cellular automata (QCA) are an attractive emerging technology suitable for the development of
ultra-dense low-power high-performance digital circuits. Efficient solutions have recently been proposed for
several arithmetic circuits, such as adders, multipliers, and comparators. Nevertheless, since the design of
digital circuits in QCA still poses several challenges, novel implementation strategies and methodologies are
highly desirable. This paper proposes a new design approach oriented to the implementation of binary
comparators in QCA. New formulations of basic logic equations required to perform the comparison function
are proposed. The new strategy has been exploited in the design of two different comparator architectures and
for several operand word lengths. With respect to existing counterparts, the comparators proposed here
exhibit significantly higher speed and reduced overall area.
ETPL
VLSI - 013
Improved matrix multiplier design for high-speed digital signal processing
applications
A transistor level implementation of an improved matrix multiplier for high-speed digital signal processing
applications based on matrix element transformation and multiplication is reported in this study. The
improvement in speed was achieved by rearranging the matrix element into a two-dimensional array of
processing elements interconnected as a mesh. The edges of each row and column were interconnected in
torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the
circuitry was verified and the performance parameters, for example, propagation delay and dynamic switching
power consumption, were calculated using SPICE Spectre with 90 nm CMOS technology. The proposed
methodology ensures substantial reduction in propagation delay compared with the conventional algorithm,
systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most
commonly used techniques for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix
[...] ~2 μ[...] × [...] ~3.12 mW only. Improvement in speed compared with earlier reported matrix multipliers, for example,
conventional algorithm, systolic array and PNTT-based implementation, was found to be ~67, ~56 and ~65%,
respectively.
ETPL
VLSI - 014
High-Speed Experimental Demonstration of Adiabatic Quantum-Flux-Parametron Gates Using Quantum-Flux-Latches
We experimentally demonstrated high-speed logic operations of adiabatic quantum-flux-parametron (AQFP)
gates through the use of quantum-flux-latches (QFLs). In QFL-based high-speed test circuits (QHTCs), the
output data of the circuits under test (CUTs), which are driven by high-speed excitation currents, are stored in
QFLs and are slowly read out using low-speed excitation currents. We designed and fabricated three types of
QHTCs using QFLs with different circuit parameters, where the CUTs were buffer gates and AND gates. We
confirmed the correct operation of buffer gates and AND gates at 1 GHz. The obtained bias margins of the 1
GHz excitation currents were more than ±30% for each QHTC, which is wide enough for high-speed logic
operations of AQFP gates.
ETPL
VLSI - 015
Spin Orbit Torque Non-Volatile Flip-Flop for High Speed and Low Energy
Applications
A novel nonvolatile flip-flop based on spin-orbit torque magnetic tunnel junctions (SOT-MTJs) is proposed
for fast and ultralow energy applications. A case study of this nonvolatile flip-flop is considered. In addition to
the independence between writing and reading paths, which offers a high reliability, the low resistive writing
path enables a high-speed and energy-efficient WRITE operation. We compare the SOT-MTJ performance
metrics with those of the spin transfer torque (STT)-MTJ. Based on accurate compact models, simulation results show
an improvement that reaches 20× in terms of WRITE energy per bit cell. At the same writing current and
supply voltage, the SOT-MTJ achieves a writing frequency 4× higher than the STT-MTJ.
ETPL
VLSI - 016
Critical-Path Analysis and Low-Complexity Implementation of the LMS
Adaptive Algorithm
This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for
deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form
LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much
faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no
pipelining is required for implementing a direct-form LMS adaptive filter for most practical cases, and can be
realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on
these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no
adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays.
Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of existing
direct-form structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2 and 3
involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF) at a cost of 55.0% and
60.6% more area, respectively.
ETPL
VLSI - 017
Exploiting the Incomplete Diffusion Feature: A Specialized Analytical Side-Channel
Attack Against the AES and Its Application to Microcontroller Implementations
Algebraic side-channel attack (ASCA) is a typical technique that relies on a general solver to solve the
equations of a cipher and its side-channel leaks. It falls under analytical side-channel attack and can recover
the entire key at once. Many ASCAs are proposed against the AES, and they utilize the Gröbner basis-based,
SAT-based, or optimizer-based solver. The advantage of the general solver approach is its generic feature,
which can be easily applied to different cryptographic algorithms. The disadvantage is that it is difficult to take
into account the specialized properties of the targeted cryptographic algorithms. The results vary depending on
what type of solver is used, and the time complexity is quite high when considering the error-tolerant attack
scenarios. Thus, we were motivated to find a new approach that would lessen the influence of the general
solver and reduce the time complexity of ASCA. This paper proposes a new analytical side-channel attack on
AES by exploiting the incomplete diffusion feature in one AES round. We named our technique incomplete
diffusion analytical side-channel analysis (IDASCA). Different from previous ASCAs, IDASCA adopts a
specialized approach to recover the secret key of AES instead of the general solver. Extensive attacks are
performed against the software implementation of AES on an 8-bit microcontroller. Experimental results show
that: 1) IDASCA can exploit the side-channel leaks in all AES rounds using a single power trace; 2) it has less
time complexity and more robustness than previous ASCAs, especially when considering the error-tolerant
attack scenarios; and 3) it can calculate the reduced key search space of AES for the given amount of
side-channel leaks. IDASCA can also interpret the mechanism behind previous ASCAs on AES from a quantitative
perspective, such as why ASCA can work under unknown plaintext/ciphertext scenarios and what are the
extreme cases in ASCAs.
ETPL
VLSI - 018
Efficient Register Renaming and Recovery for High-Performance Processors
Modern superscalar processors implement register renaming using either random access memory (RAM) or
content-addressable memories (CAM) tables. The design of these structures should address both access time
and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are
more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents
them from scaling with the number of physical registers and pipeline width, negatively impacting
performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–
CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM
provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM
enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar
processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme,
while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less
energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid
RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.
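
To make the renaming and misprediction-recovery problem concrete, here is a minimal behavioural sketch of a RAM-style map table with checkpoint-based recovery. It is not the hybrid RAM–CAM scheme of the paper; the class and method names are hypothetical, and register release at commit is left out.

```python
class RenameTable:
    """RAM-style rename map: indexed by logical register, holds a physical register."""

    def __init__(self, num_logical: int, num_physical: int):
        if num_physical < num_logical:
            raise ValueError("need at least one physical register per logical register")
        self.map = list(range(num_logical))                 # logical -> physical
        self.free = list(range(num_logical, num_physical))  # unallocated physical registers
        self.checkpoints = {}

    def rename(self, dest: int, sources: list[int]):
        """Translate the sources, then give the destination a fresh physical register."""
        physical_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register; a real pipeline would stall")
        physical_dest = self.free.pop(0)
        self.map[dest] = physical_dest
        return physical_dest, physical_sources

    def checkpoint(self, branch_id) -> None:
        """Snapshot the mappings when a branch is predicted."""
        self.checkpoints[branch_id] = (list(self.map), list(self.free))

    def recover(self, branch_id) -> None:
        """Restore the snapshot after a misprediction, discarding younger mappings."""
        self.map, self.free = self.checkpoints.pop(branch_id)


table = RenameTable(num_logical=4, num_physical=8)
print(table.rename(dest=1, sources=[2, 3]))  # r1 <- r2 op r3
table.checkpoint("branch0")
print(table.rename(dest=2, sources=[1]))     # speculative instruction
table.recover("branch0")                     # misprediction: undo the speculative mapping
print(table.map)
```

Copying the whole table per branch is what makes pure RAM schemes costly to recover, which is the penalty the abstract attributes to them.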
ETPL
VLSI - 019
Design and simulation of power efficient traffic light controller (PTLC)
This paper presents the design and simulation of a power efficient traffic light controller (PTLC). The main focus
is on simulating and optimizing the PTLC design and computing its speed of operation. The conventional
system has high power consumption and is expensive. The PTLC design is better than the conventional one in terms of
LUTs (number of gates), complexity, size and cost. This research paper presents a novel PTLC with a
minimum number of LEDs, which improves its performance and makes the design efficient in terms of
power and speed with respect to the conventional design. The conventional traffic light controller has been
implemented using microcontrollers and FPGAs. The research paper by Parasmani in 2013 described the use of
an FPGA to design an advanced traffic light controller that uses sensors to maintain continuous traffic
flow; its power consumption is therefore high, and it can be reduced by the PTLC design. The novel
PTLC design is economical and offers high integration, low power and flexibility. The PTLC
has been implemented using an FPGA, which has many advantages, such as speed, number of input/output ports
and performance. The system has been successfully tested and implemented in hardware using Xilinx v 10.1
software packages and the Very High Speed Integrated Circuit Hardware Description Language (VHDL); RTL and
technology schematics are included to validate the simulation results.
ETPL
VLSI - 020
Energy Efficient Exact Matching for Flow Identification with Cuckoo Affinity
Hashing
Energy efficiency has become an important design goal for networking equipment. Traditionally, routers and
switches have been designed to minimize peak power consumption, but they operate most of the time with
settings and traffic that are far from that peak. Therefore, many elements and functions of networking
equipment are being redesigned to improve energy efficiency. A common functionality in networking is flow
identification that is needed in many applications. Flow identification can be implemented with Content
Addressable Memories (CAMs) or alternatively with several data structures. Among those, one efficient
option is Cuckoo hashing that enables fast searches and high memory utilization at the cost of complicating
the insertion procedure. In this letter, first the energy efficiency of exact matching using Cuckoo hashing is
analyzed and then a technique is presented to improve the energy efficiency of Cuckoo hashing. The proposed
scheme is evaluated using a traffic monitoring application and compared with the traditional Cuckoo hashing.
The results show that significant energy savings can be obtained by using the proposed technique.
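
A minimal sketch of plain two-table Cuckoo hashing for exact flow matching is shown below. It illustrates the fast lookup and the more involved insertion mentioned above, not the energy-saving technique proposed in the letter. The class name, the hash construction, the kick limit and the overflow stash are assumptions for this example.

```python
class CuckooTable:
    """Two-table cuckoo hash: every key lives in one of two candidate buckets."""

    def __init__(self, buckets_per_table: int = 16, max_kicks: int = 32):
        self.size = buckets_per_table
        self.max_kicks = max_kicks
        self.tables = [[None] * self.size for _ in range(2)]
        self.stash = {}  # overflow store used when the kick limit is reached

    def _index(self, which: int, key) -> int:
        # Illustrative hash functions: salt Python's built-in hash with the table number.
        return hash((which, key)) % self.size

    def lookup(self, key):
        """Return the stored value or None; reads at most two buckets plus the stash."""
        for which in range(2):
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return self.stash.get(key)

    def insert(self, key, value) -> None:
        # Update in place if the key is already stored.
        for which in range(2):
            i = self._index(which, key)
            slot = self.tables[which][i]
            if slot is not None and slot[0] == key:
                self.tables[which][i] = (key, value)
                return
        if key in self.stash:
            self.stash[key] = value
            return
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._index(which, item[0])
            # Place the item and pick up whatever occupied the bucket.
            item, self.tables[which][i] = self.tables[which][i], item
            if item is None:
                return
            which = 1 - which  # the evicted item tries its other table
        self.stash[item[0]] = item[1]  # give up kicking and park the last evicted item


flows = CuckooTable()
five_tuple = (0x0A000001, 0x0A000002, 1234, 80, 6)  # hypothetical flow key
flows.insert(five_tuple, "counter-0")
print(flows.lookup(five_tuple))
```

Each lookup touches a small, fixed number of memory locations, so the number of table reads per search is a natural proxy for the energy cost that the letter analyzes.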
ETPL
VLSI - 021
Ultra Low Power Magnetic Flip-Flop Based on Checkpointing/Power Gating
and Self-Enable Mechanisms
Advanced computing systems suffer from high static power due to the rapidly rising leakage currents in deep
sub-micron MOS technologies. Fast access non-volatile memories (NVM) are under intense investigation to
be integrated in Flip-Flops or computing memories to allow system power-off in standby state and save power.
Spin Transfer Torque MRAM (STT-MRAM) is considered the most promising NVM to address this issue
thanks to its high speed, low power, and infinite endurance. However, one of the disadvantages of
STT-MRAM for computing purposes is the relatively high write energy needed to build up a Magnetic Flip-Flop (MFF). In
this paper, we propose a power-efficient MFF design architecture to address this challenge based on the
combination of checkpointing operation, power gating and self-enable mechanisms. Multiple non-volatile
storage elements can be integrated locally in a conventional FF without significant area overhead, benefiting from the
3-D implementation of STT-MRAM. We performed electrical simulations (i.e. transient and statistical) to
validate its functional behavior and evaluate its performance by using an accurate SPICE model of
STT-MRAM and an industrial 40 nm CMOS design kit. The simulation results confirm its lower power
consumption compared to conventional CMOS FF and the other structures.
ETPL
VLSI - 022
A joint encryption and error correction method used in satellite
communications
Due to the ubiquitous open air links and the complex electromagnetic environment in satellite
communications, ensuring the security and reliability of information sent over satellite
links is an urgent problem. This paper combines AES (Advanced Encryption Standard) with
LDPC (Low Density Parity Check) codes to design a secure and reliable error correction method,
SEEC (Satellite Encryption and Error Correction). The method selects LDPC codes suited to
satellite communications, uses the AES round key to control the encoding process, and
proposes a new round key generation algorithm. While retaining good error correction performance in
satellite communications, the method improves the security of the system and achieves a shorter key size,
which makes key management easier. MATLAB simulation shows that the method delivers strong error
correction capability and encryption effect.
ETPL
VLSI - 023
Dynamic ternary CAM for hardware search engine
A five-transistor dynamic ternary content addressable memory (CAM) is presented for high-density data
search applications. The data path and the search path are separated to avoid unwanted capacitive coupling at
the storage node. To increase the data retention time, the data lines are grounded and dummy search lines are
implemented for refresh operations. The proposed CAM cell is fabricated using a 130 nm CMOS process, and
the cell area is 8.99 μm². A 64 × 128 search memory has a retention time of 2.84 ms at
room temperature with a 1.2 V supply voltage. The hardware search performance is compared with a
conventional software-based search scheme, running on two different systems with clock frequencies of more
than an order of magnitude faster. The hardware search engine exhibits comparable search speeds while
dissipating only 149 mW.
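For readers unfamiliar with ternary matching, the behaviour that a ternary CAM evaluates in parallel across all rows can be sketched in software. This is a functional model only; the `(value, care_mask)` entry layout and the function name `tcam_search` are assumptions for illustration and say nothing about the five-transistor cell itself.

```python
def tcam_search(entries, key):
    """Return the indices of all entries that match `key`.

    Each entry is a (value, care_mask) pair. A 0 bit in care_mask marks a
    "don't care" position that matches either key bit.
    """
    return [
        index
        for index, (value, care_mask) in enumerate(entries)
        if ((key ^ value) & care_mask) == 0
    ]


# Three 4-bit entries: exact 1010, pattern 10xx, and pattern xxx1.
entries = [(0b1010, 0b1111), (0b1000, 0b1100), (0b0001, 0b0001)]
```

A software search visits the entries one after another, whereas the CAM compares every row in a single search cycle, which is the basis of the speed comparison in the abstract.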
ETPL
VLSI - 024
High-Throughput Low-Energy Self-Timed CAM Based on Reordered
Overlapped Search Mechanism
This paper introduces a reordered overlapped search mechanism for high-throughput low-energy content-addressable memories (CAMs). Most mismatches can be found by searching a few bits of a search word. To
lower power dissipation, a word circuit is often divided into two sections that are sequentially searched or even
pipelined. Because of this process, most of the match lines in the second section are unused. Since searching the
last few bits is very fast compared to searching the rest of the bits, we propose to increase throughput by
asynchronously initiating second-stage searches on the unused match lines as soon as a first-stage search is
complete. In our circuit implementation, each word circuit is independently controlled by a locally generated
timing signal rather than a global signal. This allows the circuits to be in the required phase for their own local
operation: evaluate or precharge, instead of having to synchronize their phase to the rest of the word circuits,
which greatly reduces the cycle time. As a design example, a 128 × 64-bit CAM is implemented and evaluated
by HSPICE simulation under a 90 nm CMOS technology. The proposed asynchronous CAM operates 5.98
times faster than a synchronous CAM with 14.2% smaller energy dissipation. The post-layout proposed CAM
achieves 385-ps cycle delay time and 0.773 fJ/bit/search and is also evaluated under different corner
conditions and PVT variations to guarantee it operates properly.
ETPL
VLSI - 025
A Single-Bit and Double-Adjacent Error Correcting Parallel Decoder for
Multiple-Bit Error Correcting BCH Codes
This paper presents a novel high-speed BCH decoder that corrects double-adjacent and single-bit errors in
parallel and serially corrects multiple-bit errors other than double-adjacent errors. Its operation is based on
extending an existing parallel BCH decoder that can only correct single-bit errors and serially corrects double-adjacent errors at low speed. The proposed decoder is constructed by a novel design and is suitable for
nanoscale memory systems, in which multiple-bit errors occur at a probability comparable to single-bit errors
and double-adjacent errors occur at a higher probability (nearly two orders of magnitude) than other multiple-bit errors. Extensive simulation results are reported. Compared with the existing scheme, the area and delay
time of the proposed decoder are on average 11% and 6% higher, but its power consumption is reduced by 9%
on average. This paper also shows that the area, delay, and power overheads incurred by the proposed scheme
are significantly lower than traditional fully parallelized BCH decoders capable of correcting any double-bit
errors in parallel.
ETPL
VLSI - 026
Design and implementation of high speed and high accuracy fixed-width
modified Booth multiplier for DSP application
This paper presents an error compensation bias circuit added to a modified Booth-encoded multiplier to
produce a high accuracy fixed-width multiplier. Fixed-width multipliers are employed in many digital signal
processing applications, as most of these systems employ iterative structures with fixed precision. The design
has been implemented in TSMC 180 nm technology. The design is 14.6% faster than the fixed-width
multipliers. The design has 37.2% less truncation error than the direct truncated fixed-width multiplier
(DTFM). The design embeds an operand isolation technique to ensure low power operation when
employed in DSP applications.
ETPL
VLSI - 027
A new design of low power high speed hybrid CMOS full adder
We have designed a full adder using the hybrid-CMOS logic style by dividing it into three modules so that it can
be optimized at various levels. The first module is an XOR-XNOR circuit, which generates full swing XOR and
XNOR outputs simultaneously and has good driving capability. It also consumes minimum power and
provides better delay performance. The second module is a sum circuit, which is also an XOR circuit and uses the carry
input and the output of the first module as inputs to generate the sum output. The third module is a carry circuit, which
uses the output of the first stage and the other inputs to generate the carry output. The proposed full adder
circuit reduces the power consumption, the delay between carry in and carry out, and the
PDP by 12 to 100%.
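The three-module partitioning can be checked at the logic level with a short sketch. The abstract does not give the transistor-level carry circuit, so the multiplexer form used in `carry_module` below is an assumption (a common choice in hybrid adders), and all function names are hypothetical.

```python
from itertools import product


def xor_xnor(a, b):
    """Module 1: produce XOR and XNOR of the two operand bits together."""
    h = a ^ b
    return h, h ^ 1


def sum_module(h, cin):
    """Module 2: a second XOR combines the module 1 output with carry in."""
    return h ^ cin


def carry_module(h, a, cin):
    """Module 3 (assumed multiplexer form): when the operands differ the
    carry in propagates, otherwise the carry equals either operand."""
    return cin if h else a


def full_adder(a, b, cin):
    h, _h_bar = xor_xnor(a, b)
    return sum_module(h, cin), carry_module(h, a, cin)


# Truth table as (a, b, cin, sum, carry) rows.
truth_table = [(a, b, c) + full_adder(a, b, c) for a, b, c in product((0, 1), repeat=3)]
```

Because the sum and carry modules both depend only on the module 1 output and the remaining inputs, each module can be sized and optimized separately, which is the point of the partitioning.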
ETPL
VLSI - 028
Design for reliability for low power digital circuits
Low power digital circuits in cellular phones, laptops, or tablet computers have critical power consumption
limitations. Power consumption at process corners can vary as much as 50%. In order to optimize high-speed
logic circuit designs for low power needs, we need to accurately predict device to product aging across
process, temperature and voltage corners. In this talk, we focus on the impact of BTI aging at corners, the
Fmax guardband and its trade-off with power and performance.
ETPL
VLSI - 029
High speed vedic multiplier designs-A review
Multipliers are key blocks in high speed arithmetic logic units, multiply and accumulate units, digital
signal processing units, etc. With the increasing constraints on delay, more and more emphasis is being laid on
the design of faster multipliers. To enhance speed, many modifications of the standard modified Booth
algorithm and Wallace tree methods for multiplier design have been made, and several new techniques are being
worked upon. Amongst these, Vedic multipliers based on Vedic mathematics are presently under focus because
they are among the fastest and lowest power multipliers. There are sixteen sutras in Vedic multiplication, of
which “Urdhva Tiryakbhyam” is the most efficient one in terms of speed. A large number
of high speed Vedic multipliers have been proposed with the Urdhva Tiryakbhyam sutra. A few of them are
presented in this paper, giving an insight into their methodology, merits and demerits. Compressor based Vedic
Multipliers show considerable improvements in speed and area efficiency over the conventional ones.
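The Urdhva Tiryakbhyam ("vertically and crosswise") idea can be illustrated on binary operands: every result column is the sum of the crosswise bit products whose positions add up to that column, plus the carry from the previous column. The sketch below is a behavioural model under that reading; the function name and the bit-serial carry handling are assumptions and do not correspond to any particular design reviewed in the paper.

```python
def urdhva_multiply(a, b, width=8):
    """Multiply two unsigned `width`-bit integers column by column."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry = 0, 0
    for col in range(2 * width - 1):
        # Crosswise products: all bit pairs (i, col - i) inside both operands.
        col_sum = carry
        for i in range(max(0, col - width + 1), min(col, width - 1) + 1):
            col_sum += a_bits[i] & b_bits[col - i]
        result |= (col_sum & 1) << col  # low bit stays in this column
        carry = col_sum >> 1            # the rest moves to the next column
    return result | (carry << (2 * width - 1))
```

In hardware all column sums can be formed at the same time, so the delay is dominated by how the column sums are compressed and added, which is where compressor based designs gain their speed.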
ETPL
VLSI - 030
Design of an energy efficient, high speed, low power full subtractor using GDI
technique
This paper proposes the design of an energy efficient, high speed and low power full subtractor using Gate
Diffusion Input (GDI) technique. The entire design has been performed in 150nm technology and on
comparison with a full subtractor employing the conventional CMOS transistors, transmission gates and
Complementary Pass-Transistor Logic (CPL), respectively it has been found that there is a considerable
amount of reduction in Average Power consumption (Pavg), delay time as well as Power Delay Product
(PDP). Pavg is as low as 13.96 nW while the delay time is found to be 18.02 ps, thereby giving a PDP
as low as $2.51\times10^{-19}$ J for a 1 V power supply. In addition to this there is a significant reduction in
transistor count compared to traditional full subtractor employing CMOS transistors, transmission gates and
CPL, accordingly implying minimization of area. The simulation of the proposed design has been carried out
in Tanner SPICE and the layout has been designed in Microwind.
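The quoted power delay product follows directly from the two reported figures, since PDP is the product of average power and delay. A quick check, with variable names chosen only for this example:

```python
p_avg_watts = 13.96e-9     # reported average power, 13.96 nW
delay_seconds = 18.02e-12  # reported delay, 18.02 ps

# Power delay product in joules: energy spent per switching event.
pdp_joules = p_avg_watts * delay_seconds
print(f"PDP = {pdp_joules:.3e} J")
```

The product is about 2.52 × 10^-19 J, which agrees with the abstract's figure of 2.51 × 10^-19 J up to rounding.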
ETPL
VLSI - 031
Ultrafast All-Optical Flip-Flops, Simultaneous Comparator-Decoder and
Reconfigurable Logic Unit With Silicon Microring Resonator Switches
We present designs of all-optical SR, clocked-SR, D and T flip-flops, simultaneous single-bit comparator-decoder and reconfigurable logic unit based on all-optical switching by two-photon absorption induced free-carrier injection in silicon 2 × 2 add-drop microring resonators. The proposed circuits have been theoretically
analyzed using time-domain coupled-mode theory and all-optical switching has been optimized for ultrafast
(~25 ps), low-power operation (~25 mW) and high modulation (> 85%), enabling logic operations at 40 Gb/s.
The designs are attractive due to advantages of high Q-factor, tunability, compactness, cascadability,
scalability, reconfigurability, simplicity and minimal number of switches and inputs for realization of the
desired logic.
ETPL
VLSI - 032
Implementation of high speed low power combinational and sequential circuits
using reversible logic
Reversible logic has presented itself as a prominent technology that plays an important role in quantum
computing. Quantum computing devices theoretically operate at ultra high speed and consume very little
power. The research in this paper aims to use reversible logic to break the conventional
speed-power trade-off, thereby getting a step closer to realising quantum computing devices. To validate
this research, various combinational and sequential circuits are implemented using reversible gates, such as a
4-bit ripple-carry adder, an (8-bit × 8-bit) Wallace tree multiplier, and the control unit of an 8-bit GCD
processor. The power and speed parameters of the circuits are reported and compared with those of
their conventional non-reversible counterparts. The comparative statistical study shows that circuits
employing reversible logic are faster and more power efficient. The designs presented in this paper were
simulated using Xilinx 9.2 software.
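What makes a gate "reversible" is that its input-to-output mapping is a bijection, so the inputs can always be recovered from the outputs. The sketch below checks this property for two well-known reversible gates, Feynman (CNOT) and Toffoli. The abstract does not say which gates the paper uses, so these are illustrative choices, and the helper `is_reversible` is hypothetical.

```python
from itertools import product


def feynman(a, b):
    """Feynman (CNOT) gate: passes a through, outputs a XOR b."""
    return a, a ^ b


def toffoli(a, b, c):
    """Toffoli gate: flips c only when both control bits are 1."""
    return a, b, c ^ (a & b)


def is_reversible(gate, num_inputs):
    """A gate is reversible when distinct inputs give distinct outputs."""
    outputs = {gate(*bits) for bits in product((0, 1), repeat=num_inputs)}
    return len(outputs) == 2 ** num_inputs
```

An ordinary AND gate fails this test because three different input pairs collapse onto the output 0, which is the information loss that reversible designs avoid.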
ETPL
VLSI - 033
An all-digital delay-locked loop for high-speed memory interface applications
This paper presents an all-digital delay-locked loop (ADDLL) with a novel digital delay line for high-speed memory
interface applications. The proposed digital delay line has a smaller tuning step and better tuning linearity than
prior art. The proposed ADDLL, used inside the DDR3 PHY for 90-degree phase shifting and
read leveling, is fabricated in a 40 nm low-power CMOS process. The test chip is successfully verified at
data rates of 800 to 1600 Mbps. The measured peak-to-peak and rms jitter of the write DQS are 60 ps and 10 ps,
respectively, at a data rate of 1600 Mbps.
ETPL
VLSI - 034
Low Power Square and Cube Architectures Using Vedic Sutras
In this paper, low power square and cube architectures are proposed using Vedic sutras. The low power, low
area square and cube architectures use the Dwandwa yoga (Duplex combination) properties of the Urdhva
Tiryagbhyam sutra and the Anurupyena sutra of Vedic mathematics. Simulation results for an 8-bit square and an 8-bit
cube show that the proposed architectures lower the total power consumption by 45% and the area by 63% when
compared to the conventional architecture. The reduction in power consumption also increases with
increasing bit width. A comparison is made between conventional and Vedic implementations of the square
and cube architectures. Implementation results show a significant improvement in terms of area, power and
delay. The proposed square and cube architectures can be used for high speed and low power applications.
Synthesis is done on a Xilinx FPGA device (family: Spartan 3E, speed grade: -4). The propagation
delay of the proposed 8-bit square is 4 ns and its area is 22 slices; for the 8-bit cube, the
propagation delay is 7.72 ns and the area is 58 slices. The dynamic power estimates for the square
and cube are 13 mW and 16 mW, respectively.
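The Duplex (Dwandwa yoga) property exploits the symmetry of squaring: in each result column the crosswise products appear in mirrored pairs, so each pair is computed once and doubled, and only a middle digit is squared. A behavioural sketch on binary digits follows; the function names are hypothetical and the code models the arithmetic only, not the proposed hardware.

```python
def duplex(bits):
    """Duplex of a digit group: twice the products of mirrored pairs,
    plus the square of the middle digit when the group length is odd."""
    n = len(bits)
    total = 2 * sum(bits[i] * bits[n - 1 - i] for i in range(n // 2))
    if n % 2:
        total += bits[n // 2] ** 2
    return total


def duplex_square(x, width=8):
    """Square an unsigned `width`-bit integer by summing column duplexes."""
    bits = [(x >> i) & 1 for i in range(width)]
    result = 0
    for col in range(2 * width - 1):
        lo = max(0, col - width + 1)
        hi = min(col, width - 1)
        # Digits lo..hi are exactly those whose mirrored indices sum to col.
        result += duplex(bits[lo:hi + 1]) << col
    return result
```

Compared with a general multiplier, roughly half of the partial products are never generated, which is consistent with the area and power savings the abstract reports.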
ETPL
VLSI - 035
Performance analysis of a high speed, energy efficient 4×4 dynamic RAM cell
array using 32nm fully depleted SOI/SON and CNFET
The objective of this paper is to design a power efficient, high performance 4×4 1T
DRAM cell array using conventional MOS, fully depleted SOI/SON and CNFET devices. As CMOS
technology is scaled down, there is a major need to improve the performance and robustness of
the memory extensively used in today's hand-held devices. Dynamic Random Access Memory (DRAM) is the
main memory used for all desktop and larger computers. In modern VLSI circuit design, power dissipation
is also a crucial issue. New emerging devices with improved technology show promise for low power
applications. In this paper, we present a comparative circuit level analysis of Metal Oxide
Semiconductor (MOS), fully depleted Silicon on Insulator (FD-SOI), fully depleted Silicon on Nothing (FD-SON) and Carbon Nanotube Field Effect Transistor (CNFET) devices at the 32 nm technology node using the HSpice tool.
ETPL
VLSI - 036
High-performance 64-bit binary comparator
A high-performance 64-bit binary comparator is proposed in this brief. Comparison is the most basic arithmetic
operation; it determines whether one number is greater than, equal to, or less than another number. The comparator is
the most fundamental component that performs the comparison operation. This brief presents a comparison of
modified and existing 64-bit binary comparator designs, concentrating on power consumption and delay.
Some modifications have been made to the existing 64-bit binary comparator design to improve the
performance of the circuit. The modified and existing 64-bit binary comparator designs are
compared by simulation at 90 nm technology in the Tanner EDA tool.
ETPL
VLSI - 037
FPGA based partial reconfigurable FIR filter design
This paper proposes a partially reconfigurable FIR filter design using a systolic Distributed Arithmetic (DA)
architecture optimized for FPGAs. To implement a computationally efficient, low power, high speed Finite
Impulse Response (FIR) filter, a two dimensional fully pipelined structure is used. To reduce the partial
reconfiguration time, a new architecture for the Look-Up Table (LUT) in distributed arithmetic is proposed.
The FIR filter is dynamically reconfigured to realize low pass and high pass filter characteristics by changing
the filter coefficients in the partial reconfiguration module. The design is implemented using XUP Virtex 5
LX110T FPGA kit. The FIR filter design shows improvement in configuration time and efficiency.
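Distributed arithmetic replaces the multipliers of an FIR tap sum with a look-up table addressed by one bit from each input sample, followed by shift and accumulate. The sketch below shows the basic bit-serial form for unsigned samples; it is not the systolic, pipelined architecture of the paper, and the names `build_lut` and `da_inner_product` are hypothetical.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    taps = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]


def da_inner_product(lut, samples, bits=8):
    """Compute sum(coeffs[k] * samples[k]) without multiplications.

    Assumes unsigned samples that fit in `bits` bits.
    """
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        acc += lut[addr] << b  # weight the table output by 2**b
    return acc


lowpass_lut = build_lut([1, 2, 2, 1])     # example coefficient sets only
highpass_lut = build_lut([1, -2, 2, -1])
```

Switching between filter characteristics amounts to swapping the table contents, which is why the LUT organization determines the partial reconfiguration time.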
ETPL
VLSI - 038
Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator
The need for ultra low-power, area efficient, and high speed analog-to-digital converters is pushing toward the
use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis
on the delay of the dynamic comparators will be presented and analytical expressions are derived. From the
analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay
and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic
comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power
and fast operation even in small supply voltages. Without complicating the design and by adding few
transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced
delay time. Post-layout simulation results in a 0.18-μm CMOS technology confirm the analysis results. It is
shown that in the proposed dynamic comparator both the power consumption and delay time are significantly
reduced. The maximum clock frequency of the proposed comparator can be increased to 2.5 and 1.1 GHz at
supply voltages of 1.2 and 0.6 V, while consuming 1.4 mW and 153 μW, respectively. The standard deviation
of the input-referred offset is 7.8 mV at 1.2 V supply.
ETPL
VLSI - 039
A Blind Dynamic Fingerprinting Technique for Sequential Circuit Intellectual
Property Protection
Design fingerprinting is a means to trace illegally redistributed intellectual property (IP) by creating a
unique IP instance with a different signature for each user. Existing fingerprinting techniques for hardware IP
protection focus on lowering the design effort to create a large number of different IP instances without paying
much attention to the ease of fingerprint detection upon IP integration. This paper presents the first dynamic
fingerprinting technique on sequential circuit IPs to enable both the owner and legal buyers of an IP embedded
in a chip to be readily identified in the field. The proposed fingerprint is an oblivious ownership watermark
independently endorsed by each user through a blind signature protocol. Thus, the authorship can also be
proved through the detection of different user's fingerprints without the need to separately embed an identical
IP owner's signature in all fingerprinted instances. The proposed technique is applicable to both application-specific integrated circuit and field-programmable gate array IPs. Our analyses show that the fingerprint is
immune to collusion attack and can withstand all perceivable attacks, with a lower probability of removal than
state-of-the-art FSM watermarking schemes. The probability of coincidence of a 32-bit fingerprint is in the
order of $10^{-10}$ and up to $10^{35}$ 32-bit fingerprinted instances can be generated for a small design of 100 flip-flops.
ETPL
VLSI - 040
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low
Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square
adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient implementation, we use a
novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design
offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the
best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient
fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state
error. We show that the steady-state mean squared error obtained from the analytical result matches with the
simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which
provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning without
noticeable degradation of steady-state-error performance.
ETPL
VLSI - 041
Reduced-Complexity Min-Sum Algorithm for Decoding LDPC Codes With Low
Error-Floor
This paper proposes a low-complexity min-sum algorithm for decoding low-density parity-check codes. It is
an improved version of the single-minimum algorithm, in which the two-minimum calculation is replaced by one
minimum calculation and a second minimum emulation. In the proposed algorithm, variable correction factors that
depend on the iteration number are introduced and the second minimum emulation is simplified, thereby
reducing the decoder complexity. This proposal improves the performance of the single-minimum algorithm,
approaching the normalized min-sum performance in the water-fall region. Also, the error-floor region is
analyzed for the code of the IEEE 802.3an standard, showing that the trapping sets are decoded due to a
slow-down of the convergence of the algorithm. Error-floor-free operation below $\mathrm{BER}=10^{-15}$ is
shown for this code by means of a field-programmable gate array (FPGA)-based hardware emulator. A layered
decoder is implemented in a 90-nm CMOS technology achieving 12.8 Gbps with an area of 3.84 mm$^2$.
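The difference between a two-minimum and a single-minimum check node update can be seen in a few lines. The sketch below is a generic min-sum check node, not the algorithm of the paper: the abstract does not specify the correction factors, so the fixed `second_min_offset` used to emulate the second minimum is purely an assumption, as are the function and parameter names.

```python
def check_node_update(messages, second_min_offset=None):
    """Min-sum check node update for one parity check.

    With second_min_offset=None the true second minimum is searched for.
    Otherwise it is emulated as (first minimum + second_min_offset),
    which removes the second comparison tree in hardware.
    """
    signs = [1 if m >= 0 else -1 for m in messages]
    total_sign = 1
    for s in signs:
        total_sign *= s
    mags = [abs(m) for m in messages]
    idx1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx1]
    if second_min_offset is None:
        min2 = min(m for i, m in enumerate(mags) if i != idx1)
    else:
        min2 = min1 + second_min_offset  # emulated second minimum
    # Each edge gets the minimum over the *other* edges and their sign product.
    return [
        total_sign * signs[i] * (min2 if i == idx1 else min1)
        for i in range(len(messages))
    ]
```

Only the edge that supplied the first minimum ever receives the second minimum, so an approximation of it affects a single outgoing message per check node.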
ETPL
VLSI - 042
Improved 8-Point Approximate DCT for Image and Video Compression
Requiring Only 14 Additions
The demand from the multimedia market for video processing systems with low energy consumption, such as HEVC,
has led to extensive development of fast algorithms for the efficient approximation of 2-D DCT transforms.
The DCT is employed in a multitude of compression standards due to its remarkable energy compaction
properties. Multiplier-free approximate DCT transforms have been proposed that offer superior compression
performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware
using additions and subtractions only, leading to significant reductions in chip area and power consumption
compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-point DCT
approximation that requires only 14 addition operations and no multiplications. The proposed transform
possesses low computational complexity and is compared to state-of-the-art DCT approximations in terms of
both algorithm complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for
reconfigurable video standards such as HEVC. The proposed transform and several other DCT approximations
are mapped to systolic-array digital architectures and physically realized as digital prototype circuits using
FPGA technology and mapped to 45 nm CMOS technology.
ETPL
VLSI - 043
Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors
With Hardware-Efficient Architecture
This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The
adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce
computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined
architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput
requirements for advanced wireless communication systems. The proposed efficient folding scheme strikes a
suitable balance between complexity and throughput. This paper also presents a discussion on the scalability
of this architecture with respect to the setting of QAM size, K values, and antenna number. One 4×4 MIMO
detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm
CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08
Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout
simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core
voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized
power and normalized hardware efficiencies.
ETPL
VLSI - 044
Partial Access Mode: New Method for Reducing Power Consumption of
Dynamic Random Access Memory
Demands have been placed on a dynamic random access memory (DRAM) to not only have increased
memory capacity and data transfer speed, but also have reduced operating and standby currents. When a
system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of
the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage
current. Power consumption for the refresh operation increases in proportion to the memory capacity. We
propose a new method to reduce the refresh power consumption by effectively extending the memory cell
retention time. Conversion from 1 cell/bit to $2^{N}$ cells/bit reduces the variation in the retention time
among memory cells. Although active power increases by a factor of $2^{N}$, the refresh time increases by
more than $2^{N}$ as a consequence of the fact that the majority decision does better than averaging for the
tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM
array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude.
On the basis of this conversion method, we propose a partial access mode to reduce power consumption
dynamically when the full memory capacity is not required.
ETPL
VLSI - 045
Pulsed-Latch Utilization for Clock-Tree Power Optimization
Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern
circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on
the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing
tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the
first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and
flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine
the tree topology, which maintains load balance and considers the wirelength between pulse generators and
pulsed latches. Experimental results indicate that the proposed migration approach can improve the power
consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most
recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.
ETPL
VLSI - 046
Scalable Montgomery Modular Multiplication Architecture with Low-Latency
and Low-Memory Bandwidth Requirement
Montgomery modular multiplication is widely used in public-key cryptosystems. This work shows how to
relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the
current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling
scheme to reduce the number of memory accesses in the developed scalable architecture. Analytical results
show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times
that of conventional scalable architectures, where w denotes word size. The proposed one also retains a latency
of exactly one cycle between the operations of the same words in two consecutive iterations of the
Montgomery modular multiplication algorithm when employing enough processing elements. Compared to the
design in the related work, experimental results demonstrate that the proposed one achieves an almost 54
percent reduction in power consumption with no degradation in throughput. The reduced number of memory
accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for
any precision of operands.
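As background, the operation being accelerated can be written compactly for whole integers. The sketch below is the textbook Montgomery reduction (REDC), not the word-based scalable algorithm or the scheduling scheme of the paper; function names are hypothetical, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def montgomery_multiply(a_bar, b_bar, n, r_bits):
    """Return a_bar * b_bar * R^-1 mod n, where R = 2**r_bits.

    Assumes n is odd, n < R, and both operands are already below n.
    """
    r_mask = (1 << r_bits) - 1
    # n_prime satisfies n * n_prime = -1 (mod R).
    n_prime = (-pow(n, -1, 1 << r_bits)) & r_mask
    t = a_bar * b_bar
    m = ((t & r_mask) * n_prime) & r_mask
    u = (t + m * n) >> r_bits  # exact division: low r_bits of t + m*n are zero
    return u - n if u >= n else u


def modmul(a, b, n, r_bits):
    """Modular product via the Montgomery domain (conversion included)."""
    r = 1 << r_bits
    a_bar, b_bar = (a * r) % n, (b * r) % n
    product_bar = montgomery_multiply(a_bar, b_bar, n, r_bits)
    return montgomery_multiply(product_bar, 1, n, r_bits)  # leave the domain
```

The reduction needs only multiplications, masks and shifts, with no division by `n`. Word-based hardware versions process `t` one word at a time, which is where the data dependencies and memory traffic discussed in the abstract arise.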
ETPL
VLSI - 047
Reconfigurable CORDIC-Based Low-Power DCT Architecture Based on Data
Priority
This paper presents a low-power coordinate rotation digital computer (CORDIC)-based reconfigurable
discrete cosine transform (DCT) architecture. The main idea of this paper is based on the observation that
not all the computations in DCT are equally important in generating the frequency domain outputs.
Considering the difference in importance among the DCT coefficients, the number of CORDIC iterations can be
dynamically changed to efficiently trade off image quality for power consumption. Thus, the computational
energy can be significantly reduced without seriously compromising the image quality. The proposed
CORDIC-based 2-D DCT architecture is implemented in a 0.13 μm CMOS process, and experimental
results show that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the
CORDIC-based Loeffler DCT at the cost of minor image quality degradations.
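The accuracy-versus-work trade-off behind the variable iteration count can be demonstrated with a plain rotation-mode CORDIC. This is a floating-point sketch of the general technique, not the fixed-point datapath of the paper; the function name and the choice of rotation angle are assumptions for illustration.

```python
import math


def cordic_rotate(x, y, angle, iterations):
    """Rotate (x, y) by `angle` radians using shift-and-add micro-rotations.

    Converges for |angle| up to roughly 1.74 rad. Fewer iterations mean
    less work and a larger residual angle error.
    """
    gain = 1.0
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        step = 2.0 ** -i  # a right shift by i bits in hardware
        x, y = x - d * y * step, y + d * x * step
        z -= d * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    return x / gain, y / gain  # compensate the accumulated CORDIC gain


target = math.cos(math.pi / 8)
coarse_error = abs(cordic_rotate(1.0, 0.0, math.pi / 8, 4)[0] - target)
fine_error = abs(cordic_rotate(1.0, 0.0, math.pi / 8, 16)[0] - target)
```

Each iteration adds about one bit of angular precision, so rotations that feed less important DCT coefficients can stop early and save the switching energy of the remaining stages.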
ETPL
VLSI - 048
Achieving High-Performance On-Chip Networks With Shared-Buffer Routers
On-chip routers typically have buffers dedicated to their input or output ports for temporarily storing packets
in case contention occurs on output physical channels. Buffers, unfortunately, consume significant portions of
router area and power budgets. While running a traffic trace, however, not all input ports of routers have
incoming packets that need to be transferred simultaneously. Therefore, a large number of buffer queues in the
network are empty while other queues are mostly busy. This observation motivates us to design RoShaQ, a router
architecture with shared queues that maximizes buffer utilization by allowing
multiple buffer queues to be shared among input ports. Sharing queues makes buffer use more
efficient and hence achieves higher throughput when the network load becomes heavy. On the other hand,
at light traffic load, our router achieves low latency by allowing packets to effectively bypass these shared
queues. Experimental results on a 65-nm CMOS standard-cell process show that over synthetic traffic
RoShaQ has 17% lower latency and 18% higher saturation throughput than a typical virtual-channel (VC) router.
Because of its higher performance, RoShaQ consumes 9% less energy per transferred packet than a VC router
given the same buffer space capacity. Over real multitask applications and E3S embedded benchmarks using the
near-optimal NMAP mapping algorithm, RoShaQ has 32% lower latency than a VC router and, when targeting the
same application throughput, 30% lower energy per packet.
ETPL
VLSI - 049
Area-Delay Efficient Binary Adders in QCA
As transistors decrease in size, more and more of them can be accommodated in a single die, thus increasing
chip computational capabilities. However, transistors cannot get much smaller than their current size. The
quantum-dot cellular automata (QCA) approach represents one of the possible solutions in overcoming this
physical limit, even though the design of logic modules in QCA is not always straightforward. In this brief, we
propose a new adder that outperforms all state-of-the-art competitors and achieves the best area-delay tradeoff.
The above advantages are obtained by using an overall area similar to the cheaper designs known in the literature.
The 64-bit version of the novel adder spans 18.72 μm² of active area and shows a delay of only nine clock
cycles, that is, just 36 clock phases.
ETPL
VLSI - 050
Low-Complexity Reconfigurable Fast Filter Bank for Multi-Standard Wireless
Receivers
This brief presents a new low-complexity reconfigurable fast filter bank (RFFB) for wireless communication
applications such as spectrum sensing and channelization. In RFFB, the bandwidth and center frequency of
sub-bands can be varied with high frequency resolution without hardware reimplementation. This is achieved
with an improved modified frequency transformation-based variable digital filter (MFT-VDF) at the first stage
of the proposed multistage implementation. Existing second-order frequency transformation-based low-pass
VDFs have a limited cutoff frequency range, which is approximately 12.5% of the sampling frequency. The
proposed low-pass MFT-VDF offers unabridged control over the cutoff frequency on a wide frequency range,
thereby improving on the cutoff frequency range of existing VDFs. The design example shows that the RFFB is
easy to design and offers substantial savings in gate counts over other filter banks.
ETPL
VLSI - 051
Input Vector Monitoring Concurrent BIST Architecture Using SRAM Cells
Input vector monitoring concurrent built-in self test (BIST) schemes perform testing during the normal
operation of the circuit without imposing a need to set the circuit offline to perform the test. These schemes are
evaluated based on the hardware overhead and the concurrent test latency (CTL), i.e., the time required for the
test to complete while the circuit operates normally. In this brief, we present a novel input vector
monitoring concurrent BIST scheme, which is based on the idea of monitoring a set (called window) of
vectors reaching the circuit inputs during normal operation, and the use of a static-RAM-like structure to store
the relative locations of the vectors that reach the circuit inputs in the examined window; the proposed scheme
is shown to perform significantly better than previously proposed schemes with respect to the hardware
overhead and CTL tradeoff.
ETPL
VLSI - 052
Low-Complexity Low-Latency Architecture for Matching of Data Encoded
With Hard Systematic Error-Correcting Codes
A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this
brief to reduce latency and complexity. Based on the fact that the codeword of an ECC is usually represented
in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed
architecture parallelizes the comparison of the data and that of the parity information. To further reduce the
latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the
efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines
whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a
(40, 33) code, the proposed architecture reduces the latency and the hardware complexity by approximately 32%
and 9%, respectively, compared with the most recent implementation.
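The matching idea can be sketched in Python as follows: instead of decoding the stored codeword, the incoming data is encoded and compared against the stored word, with the data part and the parity part compared separately and their Hamming distances added. This is a minimal sketch under stated assumptions: it uses a small systematic Hamming(7,4) code rather than the (40, 33) code of the brief, a plain population count rather than the butterfly-formed weight accumulator, and the names `parity_bits`, `popcount` and `matches` are hypothetical.

```python
def parity_bits(data):
    """Parity of a systematic Hamming(7,4) code; `data` is a 4-bit int, d1 = MSB."""
    d1, d2, d3, d4 = [(data >> s) & 1 for s in (3, 2, 1, 0)]
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1 << 2) | (p2 << 1) | p3


def popcount(value):
    """Number of set bits (Hamming weight) of a non-negative int."""
    return bin(value).count("1")


def matches(stored_data, stored_parity, incoming_data, t=1):
    """True if the stored word equals the incoming codeword after fixing <= t bits.

    The two distances are independent, so the data comparison can start
    before the parity of the incoming data is available.
    """
    data_distance = popcount(stored_data ^ incoming_data)
    parity_distance = popcount(stored_parity ^ parity_bits(incoming_data))
    return data_distance + parity_distance <= t


if __name__ == "__main__":
    data = 0b1011
    parity = parity_bits(data)
    print(matches(data, parity, 0b1011))          # clean stored word
    print(matches(data ^ 0b0001, parity, 0b1011))  # one stored bit flipped
    print(matches(data, parity, 0b1010))           # different data
```

Because this code has minimum distance 3, a stored word with at most one flipped bit stays within distance 1 of its own codeword and at distance 2 or more from every other codeword, so the threshold t = 1 separates matches from mismatches.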
ETPL
VLSI - 053
Layout-Based Refined NPSF Model for DRAM Characterization and Testing
As dynamic random access memories (DRAMs) are becoming denser with technology scaling, more complex
fault behaviors emerge; examples are leakage, coupling effects, and cell-neighborhood interactions. The
neighborhood pattern sensitive fault (NPSF) model is suitable to address such faulty behaviors and identify
them during the characterization and/or test of new DRAM chips. However, NPSF test algorithms are
extremely time-consuming and therefore not economically affordable. In this brief, we show how layout
information can be used to refine and significantly simplify the NPSF model and reduce the test time
complexity. As a case study, the folded DRAM array is considered. A realistic NPSF model, the Δ-type
neighborhood, is introduced together with a time-efficient test algorithm which is more than two times cheaper
than traditional ones. Even when incorporating bit-line influence and word-line coupling effects, along with
NPSFs, the test algorithm time complexity almost remains unaltered. Therefore, the proposed approach makes
NPSF testing economically affordable, and hence, suitable for the characterization/test of dense DRAMs in the
nanoera.
ETPL
VLSI - 054
Parasitics-Aware Design of Symmetric and Asymmetric Gate-Workfunction
FinFET SRAMs
Multigate FET technology is the most viable successor to planar CMOS technology at the 22-nm node and
beyond. Prior research on multigate SRAMs is generally confined to the optimization of DC targets. However,
on account of the nonplanar nature of multigate FETs, it is highly questionable whether multigate SRAM DC
metrics can guide bitcell designers, as parasitic capacitances for two topologically equivalent bitcells can be
very different - due to various issues such as fin pitches - resulting in widely varying transient characteristics.
In this paper, we evaluate several known symmetric gate-workfunction (Symm-Φ) and,
for the first time, asymmetric gate-workfunction (Asymm-Φ) FinFET SRAM bitcells head-to-head in a 22-nm
silicon-on-insulator process, from the perspective of transient behavior, using a unified 3-D/mixed-mode 2-D
TCAD technology-circuit co-design methodology. We accomplish the latter by capturing bitcell parasitics
accurately through transport analysis-based 3-D TCAD capacitance extractions that leverage automated
layout-3-D TCAD structure synthesis algorithms. Mixed-mode transient device simulations (incorporating
back-annotated 3-D TCAD parasitics) indicate that a design guided by DC metrics alone can lead to erroneous
conclusions and suboptimal bitcell choices. Overall, from the perspective of area and performance, in single-Φ
[...], shorted-gate (or vanilla) configurations are superior to topologies employing independent-gate
configurations, even though the latter often have better DC metrics. In a larger design space encompassing
dual/Asymm-Φ [...] SRAM
topologies in terms of DC metrics and have better dynamic write-ability, even at low VDD.
ETPL
VLSI - 055
Simplifying Clock Gating Logic by Matching Factored Forms
Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected
for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size
must then be synthesized. We show how to do this by factored form matching, in which gating functions in
factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing
combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating
functions that are not matched. Strong matching identifies matches that are explicitly present in the factored
forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover.
Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean
division only achieves an average reduction of 8%.
ETPL
VLSI - 056
Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating
Clock gating is a predominant technique used for power saving. It is observed that the commonly used
synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to
disable these. To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a
common clock enabling signal. The question of what is the group size maximizing the power savings is
answered in a previous paper. Here we answer the question of which FFs should be placed in a group to
maximize the power reduction. We propose a practical solution based on the toggling activity correlations of
FFs and their physical position proximity constraints in the layout. Our data-driven clock gating is integrated
into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power
reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40
and 65 nanometer process technologies. These savings are achieved on top of the savings obtained by clock
gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer
level design.
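To make the grouping question concrete, here is a small Python sketch that pairs flip-flops whose toggling patterns are most alike, subject to a placement distance limit. It is a simple greedy heuristic written for illustration and should not be read as the paper's algorithm; the names `redundant_pulses` and `greedy_pairs`, the Manhattan distance limit, and the example data are all assumptions.

```python
from itertools import combinations


def redundant_pulses(a, b):
    """Clock pulses wasted when two FFs share one enable (the OR of their toggles).

    `a` and `b` are per-cycle toggle flags (1 = the FF changes state). With a
    shared enable, an FF is clocked needlessly whenever only its partner toggles.
    """
    return sum(x ^ y for x, y in zip(a, b))


def greedy_pairs(toggles, positions, max_distance):
    """Pair FFs with the fewest wasted pulses first, respecting a distance limit."""
    candidates = []
    for u, v in combinations(sorted(toggles), 2):
        (x1, y1), (x2, y2) = positions[u], positions[v]
        if abs(x1 - x2) + abs(y1 - y2) <= max_distance:  # Manhattan distance
            candidates.append((redundant_pulses(toggles[u], toggles[v]), u, v))
    candidates.sort()

    used, pairs = set(), []
    for _, u, v in candidates:
        if u not in used and v not in used:
            pairs.append((u, v))
            used.update((u, v))
    return pairs


if __name__ == "__main__":
    toggles = {
        "a": [1, 0, 1, 0],
        "b": [1, 0, 1, 0],
        "c": [0, 1, 0, 1],
        "d": [0, 1, 0, 0],
    }
    positions = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
    print(greedy_pairs(toggles, positions, max_distance=5))
```

Greedy pairing is not guaranteed to minimize the total number of wasted pulses, and it only forms groups of two; it is used here because it keeps the cost function (toggle mismatch) and the proximity constraint easy to see.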
ETPL
VLSI - 057
An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes
Decoder Implementation
Polar codes have recently received a lot of attention because of their capacity-achieving performance and low
encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of the
polar codes highly depends on that of the partial-sum network (PSN) implementation. Hence, in this work, an
efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum
updating algorithm and the corresponding PSN architecture are introduced which achieve a delay performance
independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-performance
and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented to
integrate seamlessly with the folded processing element architecture. This is achieved by using a novel folded
decoding schedule. As a result, both the critical path delay and the area (excluding the memory for folding) of
the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are
implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that
for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the
area is reduced by as much as 50.4%, compared with the state-of-the-art designs.
ETPL
VLSI - 058
An Analog VLSI Implementation of the Inner Hair Cell and Auditory Nerve
Using a Dual AGC Model
An analog inner hair cell and auditory nerve circuit using a dual AGC model has been implemented using 0.35
micron mixed-signal technology. A fully-differential current-mode architecture is used and the ability to
correct channel mismatch is evaluated with matched layouts as well as with digital current tuning. The Meddis
test paradigm is used to examine the analog implementation's auditory processing capabilities and investigate
the circuit's ability to correct DC mismatch. The correction techniques used demonstrate the analog inner hair
cell and auditory nerve circuit's potential use in low-power, multiple-sensor analog biomimetic systems with
highly reproducible signal processing blocks on a single massively parallel integrated circuit.
ETPL
VLSI - 059
Improved Accuracy Current-Mode Multiplier Circuits With Applications in
Analog Signal Processing
This brief presents two original implementations of improved accuracy current-mode multiplier/divider
circuits. Besides the advantage of their simplicity, these original multiplier/divider structures present the
advantage of very small linearity errors that can be obtained as a result of the proposed design techniques
(0.75% and 0.9%, respectively, for an extended range of the input currents). The original multiplier/divider
circuits permit a facile reconfiguration, the presented structures representing the functional basis for
implementing complex function synthesizer circuits. The proposed computational structures are designed for
implementing in 0.18-μm CMOS technology, with a low-voltage operation (a supply voltage of 1.2 V). The
circuits' power consumptions are 60 and 75 μW, while their frequency bandwidths are 79.6 and
59.7 MHz, respectively.
ETPL
VLSI - 060
Low-Complexity Low-Latency Architecture for Matching of Data Encoded
With Hard Systematic Error-Correcting Codes
A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this
brief to reduce latency and complexity. Based on the fact that the codeword of an ECC is usually represented
in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed
architecture parallelizes the comparison of the data and that of the parity information. To further reduce the
latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the
efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines
whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a
(40, 33) code, the proposed architecture reduces the latency and the hardware complexity by approximately 32%
and 9%, respectively, compared with the most recent implementation.
ETPL
VLSI - 061
AWARE (Asymmetric Write Architecture With REdundant Blocks): A High
Write Speed STT-MRAM Cache Architecture
Spin-transfer torque magnetic RAM (STT-MRAM) is a promising memory technology for lower level caches
because of its high density and nonvolatile nature. However, the high write latency is a bottleneck to its
widespread adoption as the future on-chip memory. In this paper, we propose a new cache architecture,
asymmetric write architecture with redundant blocks (AWARE), that can improve the write latency by taking
advantage of the asymmetric write characteristics of 1T-1MTJ STT-MRAM bit-cells. Due to the nature of the
storage element in STT-MRAM, the time required for the two state transitions (1→0 and 0→1) is not
identical. In other words, one of the state transitions is slower than the other. In conventional cache
architecture, the overall write latency is limited by the slower transition. However, the AWARE cache design
introduces redundant blocks in each row, and they are preset to the initial state that enables the faster
transition. Hence the write operations performed in these redundant blocks are much faster than the
conventional write scheme. The write latency in AWARE is improved by 30% over conventional cache
architecture with no area penalty in the data array. Moreover, the additional tag bits introduced in this
technique result in a penalty on the total cache area. In addition, the write energy increases modestly by 7% in
the proposed cache design. However, this write-energy increase can be mitigated by sacrificing the cache
capacity.
ETPL
VLSI - 062
Design of a Low-Voltage Low-Dropout Regulator
A low-voltage low-dropout (LDO) regulator that converts an input of 1 V to an output of 0.85–0.5 V in 90-nm
CMOS technology is proposed. A simple symmetric operational transconductance amplifier is used as the
error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the
closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise
cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast
responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the
proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current
efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly
50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 mm², because of
the compact architecture.
ETPL
VLSI - 063
An Analytical Delay Model for Mechanical Stress Induced Systematic
Variability Analysis in Nanoscale Circuit Design
Strain engineering for performance enhancement is an integral part of a state-of-the-art CMOS process flow.
However, use of stressors makes the performance of CMOS devices layout dependent. Performance variability
arising due to the use of stressor materials is often referred to as Layout Dependent Effect (LDE) variability.
The existing delay models do not take LDE into consideration and therefore result in unaccounted changes
in performance and degraded design robustness. In this paper we propose an analytical delay model for
Inverter, 2-input NAND and NOR gates while considering LDE variability due to the use of strain engineered
devices. We compare our derived model with TCAD calibrated HSPICE simulation results and observe that
our model estimates delay well for varying transistor sizes, load capacitances and input signal transition times.
ETPL
VLSI - 064
Energy Efficient Programmable MIMO Decoder Accelerator Chip in 65-nm
CMOS
This paper presents an energy efficient programmable hardware accelerator that targets multiple-input
multiple-output (MIMO) decoding tasks of orthogonal frequency-division multiplexing (OFDM) systems. The
work is motivated by the adoption of MIMO and OFDM by almost all existing and emerging high-speed
wireless data communication systems. The accelerator was fabricated in 65-nm CMOS technology and
occupies a core area of 2.48 mm². It delivers full programmability across different wireless
standards (i.e., WiFi, 3G-long term evolution, and WiMax) as well as different MIMO decoding algorithms
(i.e., minimum mean square error, singular value decomposition, and maximum likelihood) with extreme
energy efficiency. The energy efficiency of our MIMO accelerator chip was compared against dedicated
application specific integrated circuits for 4 × 4 QR decomposition, 4 × 4 singular value
decomposition, and 2 × 2 minimum mean square error decoding. Despite the programmable nature of
our design, it delivered energy efficiencies that were 18% to 28% better than the dedicated solutions reported
in the literature. This paper presents the VLSI implementation of the architecture discussed in [14]–[16]. It
discusses the implementation decisions and tradeoffs used to ensure minimum overall energy consumption of
the resulting accelerator chip without sacrificing programmability. Given its programmability and extreme
energy efficiency, the accelerator is an ideal solution for today's smart phones that implement multiple
MIMO-OFDM waveforms on the same platform.
ETPL
VLSI - 065
Iterative Linear Interpolation Based on Fuzzy Gradient Model for Low-Cost
VLSI Implementation
In this paper, we propose an iterative linear interpolation (ILI) algorithm, which produces quadratic ILI
polynomials to perform the most cost-effective interpolation among state-of-the-art quadratic and cubic
methods. Unlike traditional point and area pixel models, the ILI adopts the fuzzy gradient model to estimate
gradients of the target point according to its neighbor sample points in different directions. By weighing the
gradients using fuzzy membership grades, the ILI estimates the difference between the target point and its
neighbor sample points and finally obtains the target point. In 1-D signal reconstructions, using only three
multipliers, the ILI obviously outperforms both conventional quadratic Lagrange interpolation and cubic
interpolation. To approximate 2-D signals, we use five 1-D ILIs, which costs only eight multipliers to obtain
similar peak signal-to-noise ratio (PSNR) performance but better robustness compared with bi-cubic
interpolation. Reusing the ILI polynomials of the previous target point, we further reduce the cost of ILI to
three multipliers and eight adders. The VLSI implementation using TSMC 0.18-μm technology
shows that only 7256 gates are required for running a 200-MHz, 8-bit input/output, 15-bit fixed-point data path,
and 10-stage pipelined 2-D ILI, which is the quadratic interpolation of lowest cost but with PSNR
performance closest to state-of-the-art bi-cubic methods.
ETPL
VLSI - 066
Simplifying Clock Gating Logic by Matching Factored Forms
Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected
for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size
must then be synthesized. We show how to do this by factored form matching, in which gating functions in
factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing
combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating
functions that are not matched. Strong matching identifies matches that are explicitly present in the factored
forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover.
Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean
division only achieves an average reduction of 8%.
ETPL
VLSI - 067
Use of SSTA Tools for Evaluating BTI Impact on Combinational Circuits
This paper presents an extensive statistical study on the impact of bias temperature instability (BTI) on digital
circuits. A statistical framework for the evaluation of BTI at the electrical (SPICE) level, enhanced by an
atomistic model for BTI, is introduced. This framework is then employed to perform the timing analysis of
different combinational paths using cells from a given library, aiming to statistically model BTI at the higher
abstraction level. A statistical static timing analysis (SSTA) method is then performed and the results are
compared to detailed simulations using atomistic models based on experimental data. The comparison between
the two methods shows that for large paths both methods converge to the same distribution for the delay while
for short paths the delay distributions are different causing the SSTA method to generate misleading results.
An analysis is then performed in order to understand and formalize the results.
ETPL
VLSI - 068
Precise VLSI Architecture for AI Based 1-D/ 2-D Daub-6 Wavelet Filter Banks
With Low Adder-Count
A multiplier-less architecture based on algebraic integer representation for computing the Daubechies 6-tap
wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on previous designs
in the sense that it minimizes the number of parallel 2-input adder circuits. The algorithm was achieved using
brute-force numerical optimization of the algebraic integer representation. The proposed architecture furnishes
exact computation up to the final reconstruction step, which is the operation that maps the exactly computed
filtered results from algebraic integer representation to fixed-point. Compared to our recent work, this
architecture shows a reduction of 27·n − 16 adder circuits, where n is the number of wavelet
decomposition levels. The design is physically implemented for a 4-level 1-D/2-D decomposition using a
Xilinx Virtex-6 vcx240t-1ff1156 field programmable gate array (FPGA) device operating at up to a maximum
clock frequency of 344/168 MHz. The FPGA implementations of the 1-D/2-D designs are tested using hardware
co-simulation on an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs shows
an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.
ETPL
VLSI - 069
An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply
Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work,
we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We
investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth
(MB) form. We introduce a structured and efficient recoding technique and explore three different schemes by
incorporating them in FAM designs. Comparing them with the FAM designs which use existing recoding
schemes, the proposed technique yields considerable reductions in terms of critical delay, hardware
complexity and power consumption of the FAM unit.
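For readers unfamiliar with the Modified Booth form, the following Python sketch shows the standard radix-4 recoding of a two's-complement number and uses it to evaluate (A + B)·X. Note the assumption: the sketch first computes the sum and then recodes it, which is the conventional approach; the direct recoding of the sum proposed in this work is not reproduced here. The function names are hypothetical.

```python
def modified_booth_digits(value, bits):
    """Radix-4 Modified Booth digits (least significant first).

    `value` must fit in a `bits`-wide two's-complement word and `bits` must
    be even. Each digit lies in {-2, -1, 0, 1, 2}.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")

    word = value & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit b[-1] = 0
    for j in range(0, bits, 2):
        b0 = (word >> j) & 1
        b1 = (word >> (j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_value(digits):
    """Reconstruct the integer represented by radix-4 Booth digits."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


def add_multiply(a, b, x, bits):
    """Compute (a + b) * x from the Booth digits of the sum (conventional flow)."""
    # Two extra bits keep the width even and leave room for the carry of a + b.
    digits = modified_booth_digits(a + b, bits + 2)
    # One partial product per digit: 0, +/-x or +/-2x, shifted by two bits each.
    return sum((d * x) * 4 ** j for j, d in enumerate(digits))


if __name__ == "__main__":
    print(modified_booth_digits(100 + 27, 10))
    print(add_multiply(100, 27, -5, 8))
```

Recoding halves the number of partial products compared with a bit-by-bit multiplier, since every digit covers two bits of the recoded operand.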
ETPL
VLSI - 070
Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching
Memories
Emerging non-volatile memories (NVM) based on resistive switching mechanism (RS) such as STT-MRAM,
OxRRAM and CBRAM etc., are under intense R&D investigation by both academics and industries. They
provide high write/read speed, low power and good endurance (e.g., > 10^12) beyond mainstream NVMs,
which allow them to be embedded directly with logic units for computing purpose. This integration could
increase significantly the power/die area efficiency, and then overcome definitively the power/speed
bottlenecks of modern VLSIs. This paper presents firstly a theoretical investigation of synchronous NV logic
gates based on RS memories (RS-NVL). Special design techniques and strategies are proposed to optimize the
structure according to different resistive characteristics of NVMs. To validate this study, we simulated a
non-volatile full-adder (NVFA) with two types of NVMs, STT-MRAM and OxRRAM, using a CMOS 40 nm
design kit and compact models, which include related physics and experimental parameters. They show
interesting power, speed and area gain compared with synchronized CMOS FA while keeping good reliability.
ETPL
VLSI - 071
High-Throughput Multistandard Transform Core Supporting
MPEG/H.264/VC-1 Using Common Sharing Distributed Arithmetic
This paper proposes a low-cost high-throughput multistandard transform (MST) core, which can support
MPEG-1/2/4 (8 × 8), H.264 (8 × 8, 4 × 4), and VC-1 (8 × 8, 8 × 4, 4 × 8, 4 × 4) transforms. Common sharing
distributed arithmetic (CSDA) combines factor sharing and distributed arithmetic sharing techniques,
efficiently reducing the number of adders for high hardware-sharing capability. This achieves a 44.5%
reduction in adders in the proposed MST, compared with the direct implementation method. With eight
parallel computation paths, the proposed MST core has an eightfold operation frequency throughput rate.
Measurements show that the proposed CSDA-MST core achieves a high-throughput rate of 1.28 G-pels/s,
supporting the (4928 × 2048@24 Hz) digital cinema or ultrahigh resolution format. This is possible with only
30 k gate counts when implemented in a TSMC 0.18-μm CMOS process. The CSDA-MST core thus achieves
a high-throughput rate supporting multistandard transformations at low cost.
ETPL
VLSI - 072
Efficient VLSI Implementation of Neural Networks With Hyperbolic Tangent
Activation Function
The nonlinear activation function is one of the main building blocks of artificial neural networks. Hyperbolic
tangent and sigmoid are the most used nonlinear activation functions. Accurate implementation of these
transfer functions in digital networks faces certain challenges. In this paper, an efficient approximation scheme
for hyperbolic tangent function is proposed. The approximation is based on a mathematical analysis
considering the maximum allowable error as design parameter. Hardware implementation of the proposed
approximation scheme is presented, which shows that the proposed structure compares favorably with
previous architectures in terms of area and delay. The proposed structure requires fewer output bits for the same
maximum allowable error when compared to the state-of-the-art. The number of output bits of the activation
function determines the bit width of multipliers and adders in the network. Therefore, the proposed activation
function results in a reduction in area, delay, and power in VLSI implementation of artificial neural networks
with hyperbolic tangent activation function.
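To illustrate how a maximum allowable error can act as a design parameter, the sketch below derives two quantities from it: the input beyond which tanh can simply be clamped to 1, and the fewest fractional output bits whose step size stays within the error. This is a generic illustration, not the approximation scheme of the paper; the function names and the rule "step size no larger than the error" are assumptions.

```python
import math


def saturation_input(max_error):
    """Smallest x >= 0 beyond which tanh(x) may be replaced by 1 within max_error."""
    return math.atanh(1.0 - max_error)


def fractional_output_bits(max_error):
    """Fewest fractional bits b whose step 2**-b does not exceed max_error (assumed rule)."""
    return math.ceil(math.log2(1.0 / max_error))


max_error = 2 ** -6                 # example design parameter
x_sat = saturation_input(max_error)
bits = fractional_output_bits(max_error)

print(f"clamp tanh to 1 for |x| >= {x_sat:.3f}")
print(f"fractional output bits: {bits}")
```

A tighter error bound pushes the saturation point outward and adds output bits, which in turn widens the multipliers and adders downstream, as the abstract notes.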
ETPL
VLSI - 073
Simultaneous Low-Pass Filtering and Total Variation Denoising
This paper seeks to combine linear time-invariant (LTI) filtering and sparsity-based denoising in a principled
way in order to effectively filter (denoise) a wider class of signals. LTI filtering is most suitable for signals
restricted to a known frequency band, while sparsity-based denoising is suitable for signals admitting a sparse
representation with respect to a known transform. However, some signals cannot be accurately categorized as
either band-limited or sparse. This paper addresses the problem of filtering noisy data for the particular case
where the underlying signal comprises a low-frequency component and a sparse or sparse-derivative
component. A convex optimization approach is presented and two algorithms derived: one based on
majorization-minimization (MM), and the other based on the alternating direction method of multipliers
(ADMM). It is shown that a particular choice of discrete-time filter, namely zero-phase noncausal recursive
filters for finite-length data formulated in terms of banded matrices, makes the algorithms effective and
computationally efficient. The efficiency stems from the use of fast algorithms for solving banded systems of
linear equations. The method is illustrated using data from a physiological-measurement technique (i.e., near
infrared spectroscopic time series imaging) that in many cases yields data that is well-approximated as the sum
of low-frequency, sparse or sparse-derivative, and noise components.
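The role of banded systems can be seen in a minimal sketch of plain total variation denoising by majorization-minimization. This is not the combined low-pass/TV method of the paper: it omits the filter entirely and only shows that each MM iteration reduces to one tridiagonal solve, which costs linear time. The function names, test signal, and regularization value are illustrative assumptions.

```python
def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal system with constant off-diagonal."""
    m = len(diag)
    c = [0.0] * m
    d = [0.0] * m
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def tv_denoise_mm(y, lam, iterations=30):
    """Minimise 0.5*||y - x||^2 + lam*||Dx||_1, with D the first-difference operator."""
    n = len(y)
    dy = [y[i + 1] - y[i] for i in range(n - 1)]           # D y
    x = list(y)
    for _ in range(iterations):
        dx = [x[i + 1] - x[i] for i in range(n - 1)]       # D x
        # System matrix: diag(|D x|)/lam + D D^T, which is tridiagonal (2 on the
        # diagonal from D D^T, -1 on the off-diagonals).
        diag = [2.0 + abs(v) / lam for v in dx]
        z = solve_tridiagonal(diag, -1.0, dy)
        # x = y - D^T z, where (D^T z)[i] = z[i-1] - z[i]
        x = [
            y[i] - ((z[i - 1] if i > 0 else 0.0) - (z[i] if i < n - 1 else 0.0))
            for i in range(n)
        ]
    return x


def total_variation(x):
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


noisy = [0.1, -0.1, 0.05, -0.05, 1.1, 0.9, 1.05, 0.95]    # noisy step, made up
smooth = tv_denoise_mm(noisy, lam=0.5)
print([round(v, 3) for v in smooth])
```

Because the update never divides by |Dx|, the iteration stays well defined even when neighboring samples become equal.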
ETPL
VLSI - 074
Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal
Filters
The implementation of transversal filters requires basic circuit elements such as adders, multipliers and (unit)
delay elements. The filters designed under infinite precision of these elements may behave differently when
implemented with components with limited accuracy. In fact, the effects of the coefficient inaccuracies in
analog and digital transversal filters have been investigated extensively in the literature [1], [2]. On the other
hand, the effects of the unit delays with limited precision have not received similar attention. In this paper, we
find that such effects especially in very high frequency continuous-time semi-digital transversal filters may not
be ignored. As an example, we analyze the impact of delay errors in the implementation of the direct
modulation transmitter. Specifically, we provide the analytical statistical performance bounds and confirm the
results with simulations.
ETPL
VLSI - 075
Two Polynomial FIR Filter Structures With Variable Fractional Delay and
Phase Shift
This paper introduces two polynomial finite-length impulse response (FIR) digital filter structures with
simultaneously variable fractional delay (VFD) and phase shift (VPS). The structures are reconfigurable
(adaptable) online without redesign and do not exhibit transients when the VFD and VPS parameters are
altered. The structures can be viewed as generalizations of VFD structures in the sense that they offer a VPS in
addition to the regular VFD. The overall filters are composed of a number of fixed subfilters and a few
variable multipliers whose values are determined by the desired FD and PS values. A systematic design
algorithm, based on iterative reweighted ℓ1-norm minimization, is proposed. It generates fixed subfilters
with many zero-valued coefficients, typically located in the impulse response tails. The paper considers two
different structures, referred to as the basic structure and common-subfilters structure, and compares these
proposals as well as the existing cascaded VFD and VPS structures, in terms of arithmetic complexity, delay,
memory cost, and transients. In general, the common-subfilters structure is superior when all of these aspects
are taken into account. Further, the paper shows and exemplifies that the VFDPS filters under consideration
can be used for simultaneous resampling and frequency shift of signals.
ETPL
VLSI - 076
Algorithms and Architectures of Energy-Efficient Error-Resilient MIMO
Detectors for Memory-Dominated Wireless Communication Systems
In a broadband MIMO-OFDM wireless communication system, embedded buffering memories occupy a large
portion of the chip area and a significant amount of power consumption. Due to process variations of advanced
CMOS technologies, it becomes both challenging and costly to maintain perfectly functioning memories under
all anticipated operating conditions. Thus, Voltage over Scaling (VoS) has emerged as a means to achieve
energy efficient systems resulting in a tradeoff between energy efficiency and reliability. In this paper we
present the algorithm and VLSI architecture of a novel error-resilient K-Best MIMO detector based on the
combined distribution of channel noise and induced errors due to VoS. The simulation results show that,
compared with a conventional MIMO detector design, the proposed algorithm provides up to 4.5 dB gain to
achieve the near-optimal Packet Error Rate (PER) performance in the 4 × 4 64-QAM system.
Furthermore, based on experimental results, when jointly considering the detector and memory power
consumption, the proposed resilient scheme with VoS memory can achieve up to 32.64% savings compared to
the conventional K-Best detector with perfect memory.
ETPL
VLSI - 077
A Methodology for Optimized Design of Secure Differential Logic Gates for
DPA Resistant Circuits
Cryptocircuits can be attacked by third parties using differential power analysis (DPA), which uses power
consumption dependence on data being processed to reveal critical information. To protect security devices
against this issue, differential logic styles with (almost) constant power dissipation are widely used. However,
to use such circuits effectively for secure applications it is necessary to eliminate any energy-secure flaw in
security in the shape of memory effects that could leak information. This paper proposes a design
methodology to improve pull-down logic configuration for secure differential gates by redistributing the
charge stored in internal nodes and thus, removing memory effects that represent a significant threat to
security. To evaluate the methodology, it was applied to the design of AND/NAND and XOR/XNOR gates in
a 90 nm technology, adopting the sense amplifier based logic (SABL) style for the pull-up network. The
proposed solutions leak less information than typical SABL gates, increasing security by at least two orders of
magnitude and with negligible performance degradation. A simulation-based DPA attack on the Sbox9
cryptographic module used in the Kasumi algorithm, implemented with complementary metal–oxide–
semiconductor, SABL and proposed gates, was performed. The results obtained illustrate that the number of
measurements needed to disclose the key increased by much more than one order of magnitude when using
our proposal. This paper also discusses how the effectiveness of DPA attacks is influenced by operating
temperature and details how to ensure energy-secure operations in the new proposals.
ETPL
VLSI - 078
Reliability-Oriented Placement and Routing Algorithm for SRAM-Based
FPGAs
As the feature size shrinks to the nanometer scale, SRAM-based FPGAs will become increasingly vulnerable
to soft errors. Existing reliability-oriented placement and routing approaches primarily focus on reducing the
fault occurrence probability (node error rate) of soft errors. However, our analysis shows that, besides the fault
occurrence probability, the propagation probability (error propagation probability) plays an important role and
should be taken into consideration. In this paper, we first propose a cube-based analysis algorithm to
efficiently and accurately estimate the error propagation probability. Based on such a model, we propose a
novel reliability-oriented placement and routing algorithm that combines both the fault occurrence probability
and the error propagation probability together to enhance system-level robustness against soft errors.
Experimental results show that, compared with the baseline versatile place and route technique, the proposed
scheme can reduce the failure rate by 20.73%, and increase the mean time between failures by 39.44%.
ETPL
VLSI - 079
Eliminating Synchronization Latency Using Sequenced Latching
Modern multicore systems have a large number of components operating in different clock domains and
communicating through asynchronous interfaces. These interfaces use synchronizer circuits, which guard
against metastability failures but introduce latency in processing the asynchronous input. We propose a
speculative method that hides synchronization latency by overlapping it with computation cycles. We verify
the correctness of our approach through a field programmable gate array implementation and apply it to a
number of synthesized benchmarks. Synthesis results reveal that our approach achieves average savings of
135% and 204% in area costs and nearly 100% in power costs compared to two similar speculative techniques.
ETPL
VLSI - 080
Partial Access Mode: New Method for Reducing Power Consumption of
Dynamic Random Access Memory
Demands have been placed on a dynamic random access memory (DRAM) to not only have increased
memory capacity and data transfer speed, but also have reduced operating and standby currents. When a
system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of
the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage
current. Power consumption for the refresh operation increases in proportion to the memory capacity. We
propose a new method to reduce the refresh power consumption by effectively extending the memory cell
retention time. Conversion from 1 cell/bit to 2^N cells/bit reduces the variation in the retention time
among memory cells. Although active power increases by a factor of 2^N, the refresh time increases by
more than 2^N as a consequence of the fact that the majority decision does better than averaging for the
tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM
array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude.
On the basis of this conversion method, we propose a partial access mode to reduce power consumption
dynamically when the full memory capacity is not required.
ETPL
VLSI - 081
Pulsed-Latch Utilization for Clock-Tree Power Optimization
Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern
circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on
the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing
tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the
first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and
flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine
the tree topology, which maintains load balance and considers the wirelength between pulse generators and
pulsed latches. Experimental results indicate that the proposed migration approach can improve the power
consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most
recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.
ETPL
VLSI - 082
Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors
With Hardware-Efficient Architecture
This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The
adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce
computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined
architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput
requirements for advanced wireless communication systems. The proposed efficient folding scheme strikes a
suitable balance between complexity and throughput. This paper also presents a discussion on the scalability
of this architecture with respect to the setting of QAM size, K values, and antenna number. One 4×4 MIMO
detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm
CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08
Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout
simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core
voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized
power and normalized hardware efficiencies.
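The idea of keeping more survivors in the top layers of the search tree than in the bottom layers can be sketched as a breadth-first search with a per-layer K. The code below is a simplified real-valued illustration on an already triangularized system, not the detector architecture of the paper; the matrix, symbol alphabet, K values, and function name are made-up assumptions.

```python
def k_best_detect(R, y, symbols, k_per_layer):
    """Breadth-first K-best search for s minimising ||y - R s||^2, R upper triangular.

    k_per_layer[layer] is the number of survivors kept after expanding that layer.
    The search starts at the last row (top of the tree) and ends at row 0.
    """
    n = len(y)
    # Each candidate is (partial distance, symbols for layers layer+1 .. n-1).
    candidates = [(0.0, [])]
    for layer in range(n - 1, -1, -1):
        expanded = []
        for dist, tail in candidates:
            # Contribution of the already decided symbols to this row.
            interference = sum(
                R[layer][j] * tail[j - layer - 1] for j in range(layer + 1, n)
            )
            for s in symbols:
                residual = y[layer] - interference - R[layer][layer] * s
                expanded.append((dist + residual ** 2, [s] + tail))
        expanded.sort(key=lambda cand: cand[0])
        candidates = expanded[: k_per_layer[layer]]
    return candidates[0][1]


# Hypothetical 4-layer example with a 4-PAM alphabet and no noise.
R = [
    [2, 1, 0, 1],
    [0, 3, 1, 0],
    [0, 0, 2, 1],
    [0, 0, 0, 1],
]
sent = [1, -1, 3, -3]
y = [sum(R[i][j] * sent[j] for j in range(4)) for i in range(4)]

# More survivors near the top of the tree (layer 3), fewer near the bottom (layer 0).
detected = k_best_detect(R, y, symbols=[-3, -1, 1, 3], k_per_layer=[1, 2, 3, 4])
print(detected)
```

Each layer only depends on the survivors of the previous one, which is what makes a pipelined, one-output-per-cycle hardware mapping plausible.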
ETPL
VLSI - 083
Finite Alphabet Iterative Decoders for LDPC Codes: Optimization,
Architecture and Analysis
Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit
approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these
codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that
occurs at very low error rates. Recently, a new type of decoders termed finite alphabet iterative decoders
(FAIDs) were introduced. The FAIDs use simple Boolean maps for variable node processing, and can surpass
the BP-based decoders in the error floor region with very short word length. We restrict the scope of this paper
to regular dv=3 LDPC codes on the BSC channel. This paper develops a low-complexity implementation
architecture for the FAIDs by making use of their properties. Particularly, an innovative bit-serial check node
unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in
the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware
utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% area to
reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807,
7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to
an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with
45% area, and still leads to possible performance improvement in the error-floor region.
ETPL
VLSI - 084
Constructing Sub-Arrays with Short Interconnects from Degradable VLSI
Arrays
Reducing the interconnection length of VLSI arrays leads to less capacitance, power dissipation and dynamic
communication cost between the processing elements (PEs). This paper develops efficient algorithms for
constructing tightly-coupled subarrays from the mesh-connected VLSI arrays with faulty PEs. For a given size
r·s of the target (logical) array, the proposed algorithm searches and reroutes a physical r×s subarray that has
the least number of faults, resulting in an approximate target array, which is subsequently extended to the
desired target array. Experimental results show that redundant interconnects can be reduced by over 65 percent
for a 64×64 target array on the 512×512 host array with no more than 1 percent faults. In addition, we propose
a recursive divide-and-conquer algorithm for constructing the maximum target array (MTA). The lower bound
of the total interconnection length of the MTA has been established. Experimental results show that the
proposed algorithm is capable of reducing the long interconnects by over 33 percent for the MTA derived
from the 512×512 host array with no more than 1 percent faults. Moreover, the total interconnection
length of the resulting target array is close to the lower bound in cases with relatively few faults.
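The first step described above, locating a physical r×s window with the fewest faulty PEs, can be sketched with a summed-area table so that each window is scored in constant time. This covers only the search step, not the rerouting or extension to the full target array, and it is one possible approach rather than the algorithm of the paper; the host array and function name are hypothetical.

```python
def least_faulty_subarray(faulty, r, s):
    """Return (fault_count, top_row, left_col) of an r x s window with the fewest faults.

    faulty[i][j] is 1 if the PE at row i, column j is faulty, else 0.
    """
    rows, cols = len(faulty), len(faulty[0])
    # prefix[i][j] = number of faults in the rectangle faulty[0:i][0:j]
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (
                faulty[i][j] + prefix[i][j + 1] + prefix[i + 1][j] - prefix[i][j]
            )
    best = None
    for top in range(rows - r + 1):
        for left in range(cols - s + 1):
            count = (
                prefix[top + r][left + s]
                - prefix[top][left + s]
                - prefix[top + r][left]
                + prefix[top][left]
            )
            if best is None or count < best[0]:
                best = (count, top, left)
    return best


# Hypothetical 4 x 5 host array with five faulty PEs.
host = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1],
]
print(least_faulty_subarray(host, 2, 2))   # window at row 1, column 1 is fault-free
```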
ETPL
VLSI - 085
A Lattice Reduction-Aided MIMO Channel Equalizer in 90 nm CMOS
Achieving 720 Mb/s
In this paper, a VLSI implementation of a complete MIMO channel equalization ASIC based on lattice
reduction-aided linear detection is presented. The architecture performs preprocessing steps at channel rate and
low-complexity linear data detection at symbol rate. Preprocessing is based on Seysen's algorithm for lattice
reduction. We present algorithmic improvements of the lattice reduction preprocessing in terms of area and
throughput of the VLSI implementation with minor impact on the error-rate. Due to the low-complexity
implementation of the lattice reduction-aided data detection stage, our architecture is able to achieve very low
power in typical packet-based MIMO wireless data transmission scenarios. The final 90 nm CMOS ASIC
achieves an energy efficiency for the detection of 24 pJ/bit at a throughput of 720 Mbps with near-optimal
error-rate performance.
ETPL
VLSI - 086
Evaluation of Leakage Reduction Alternatives for Deep Submicron Dynamic
Nonuniform Cache Architecture Caches
Wire delays and leakage energy consumption are both growing problems in designing large on-chip caches.
Nonuniform cache architecture (NUCA) is a wire-delay aware design paradigm based on the sub-banking of a
cache, which allows the banks closer to the controller to be accessed with reduced latencies with respect to the
other banks. This feature is leveraged by dynamic NUCA (D-NUCA) caches via a migration mechanism
which speeds up frequently used data access, further reducing the effect wire delays have on performance. To
reduce leakage power consumption of static random access memory caches, various micro-architectural
techniques have been proposed. In this brief, we compare the benefits and limits of the application of some of
these techniques to a D-NUCA cache memory, and propose a novel hybrid scheme based on the Drowsy and
Way Adaptable techniques. Such a scheme allows further improvement in leakage reduction and limits the
impact of process variation on the effectiveness of the Drowsy technique.
ETPL
VLSI - 087
Split-SAR ADCs: Improved Linearity With Power and Speed Optimization
This paper presents the linearity analysis of successive approximation register (SAR) analog-to-digital
converters (ADCs) with a split DAC structure based on two switching methods: conventional charge-redistribution and Vcm-based switching. The static linearity performance, namely the integral nonlinearity and
differential nonlinearity, as well as the parasitic effects of the split DAC, are analyzed hereunder. In addition, a
code-randomized calibration technique is proposed to correct the conversion nonlinearity in the conventional
SAR ADC, which is verified by behavioral simulations, as well as measured results. Performances of both
switching methods are demonstrated in 90 nm CMOS. Measurement results of power, speed, and linearity
clearly show the benefits of using Vcm-based switching.
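For readers unfamiliar with the two static linearity metrics named above, the sketch below computes differential and integral nonlinearity from a list of code transition levels using the common endpoint-fit textbook definitions. It is independent of the split-DAC analysis in the paper, and the transition levels are made-up values.

```python
def dnl_inl(transitions):
    """DNL and INL in LSB from measured code transition levels (endpoint fit).

    DNL[k] is the deviation of code width k from the average width;
    INL[k] is the running sum of DNL up to and including code k.
    """
    widths = [b - a for a, b in zip(transitions, transitions[1:])]
    lsb = (transitions[-1] - transitions[0]) / len(widths)   # average code width
    dnl = [w / lsb - 1.0 for w in widths]
    inl = []
    running = 0.0
    for d in dnl:
        running += d
        inl.append(running)
    return dnl, inl


# Hypothetical transition levels (volts): one code is too wide, the next too narrow.
levels = [0.0, 1.0, 2.5, 3.0, 4.0]
dnl, inl = dnl_inl(levels)
print("DNL:", dnl)
print("INL:", inl)
```

With the endpoint fit the DNL values sum to zero, so the INL returns to zero at the last code by construction.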
ETPL
VLSI - 088
An Event-Based Neural Network Architecture With an Asynchronous
Programmable Synaptic Memory
We present a hybrid analog/digital very large scale integration (VLSI) implementation of a spiking neural
network with programmable synaptic weights. The synaptic weight values are stored in an asynchronous Static
Random Access Memory (SRAM) module, which is interfaced to a fast current-mode event-driven DAC for
producing synaptic currents with the appropriate amplitude values. These currents are further integrated by
current-mode integrator synapses to produce biophysically realistic temporal dynamics. The synapse output
currents are then integrated by compact and efficient integrate-and-fire silicon neuron circuits with spike-frequency adaptation and adjustable refractory period and spike-reset voltage settings. The fabricated chip
comprises a total of 32 × 32 SRAM cells, 4 × 32 synapse circuits and 32 × 1 silicon neurons. It acts as a
transceiver, receiving asynchronous events in input, performing neural computation with hybrid analog/digital
circuits on the input spikes, and eventually producing digital asynchronous events in output. Input, output, and
synaptic weight values are transmitted to/from the chip using a common communication protocol based on the
Address Event Representation (AER). Using this representation it is possible to interface the device to a
workstation or a micro-controller and explore the effect of different types of Spike-Timing Dependent
Plasticity (STDP) learning algorithms for updating the synaptic weight values in the SRAM module. We
present experimental results demonstrating the correct operation of all the circuits present on the chip.
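The behavior of an integrate-and-fire neuron with a refractory period and a reset voltage can be sketched in a few lines of discrete-time simulation. This is a generic software model with dimensionless, made-up parameters, not a model of the fabricated circuits, and it leaves out spike-frequency adaptation.

```python
def simulate_integrate_and_fire(input_drive, dt=1e-4, tau=0.02, threshold=1.0,
                                v_reset=0.0, refractory_steps=20):
    """Leaky integrate-and-fire neuron, forward-Euler integration.

    Returns the indices of the time steps at which the neuron spikes.
    All parameter values are illustrative assumptions.
    """
    v = v_reset
    hold = 0                     # remaining refractory steps
    spike_steps = []
    for step, drive in enumerate(input_drive):
        if hold > 0:             # input is ignored during the refractory period
            hold -= 1
            v = v_reset
            continue
        v += dt * (drive - v / tau)   # leaky integration of the input
        if v >= threshold:
            spike_steps.append(step)
            v = v_reset          # spike-reset voltage
            hold = refractory_steps
    return spike_steps


strong = simulate_integrate_and_fire([100.0] * 1000)   # settles above threshold
weak = simulate_integrate_and_fire([40.0] * 1000)      # settles below threshold
silent = simulate_integrate_and_fire([0.0] * 1000)
print("spikes for strong input:", len(strong))
```

Only the strong input drives the membrane variable past the threshold; the weak input saturates below it and produces no output events.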
ETPL
VLSI - 089
Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given their limited
energy supply from batteries or scavenging, these nodes must trade data communication for on-the-node
computation. Currently, they are designed around off-the-shelf low-power microcontrollers. But by employing
a more appropriate processing element, the energy consumption can be significantly reduced. This paper
describes the design and implementation of the newly proposed folded-tree architecture for on-the-node data
processing in wireless sensor networks, using parallel prefix operations and data locality in hardware.
Measurements of the silicon implementation show an improvement of 10-20× in terms of energy as compared
to traditional modern micro-controllers found in sensor nodes.
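A parallel prefix operation computes all running results of an associative operator in a logarithmic number of passes. The sketch below uses a simple doubling-stride scan to show the idea in software; it is not the folded-tree architecture of the paper, and the function name and sample readings are assumptions.

```python
import operator


def parallel_prefix(values, op):
    """Inclusive prefix scan with doubling strides.

    Within one pass every element update is independent of the others,
    so hardware could perform a whole pass in parallel.
    """
    result = list(values)
    stride = 1
    while stride < len(result):
        result = [
            op(result[i - stride], result[i]) if i >= stride else result[i]
            for i in range(len(result))
        ]
        stride *= 2
    return result


readings = [3, 1, 4, 1, 5, 9, 2, 6]                  # hypothetical sensor samples
running_sum = parallel_prefix(readings, operator.add)
running_max = parallel_prefix(readings, max)
print(running_sum)
print(running_max)
```

Any associative operator works, so the same structure yields running sums, maxima, or minima over a window of sensor data.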
ETPL
VLSI - 090
Partial Access Mode: New Method for Reducing Power Consumption of
Dynamic Random Access Memory
Demands have been placed on a dynamic random access memory (DRAM) to not only have increased
memory capacity and data transfer speed, but also have reduced operating and standby currents. When a
system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of
the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage
current. Power consumption for the refresh operation increases in proportion to the memory capacity. We
propose a new method to reduce the refresh power consumption by effectively extending the memory cell
retention time. Conversion from 1 cell/bit to 2^N cells/bit reduces the variation in the retention time
among memory cells. Although active power increases by a factor of 2^N, the refresh time increases by
more than 2^N as a consequence of the fact that the majority decision does better than averaging for the
tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM
array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude.
On the basis of this conversion method, we propose a partial access mode to reduce power consumption
dynamically when the full memory capacity is not required.
ETPL
VLSI - 091
Optimization Scheme to Minimize Reference Resistance Distribution of Spin-Transfer-Torque MRAM
Spin-transfer-torque magnetoresistive random access memory (STT-MRAM) is an emerging type of
nonvolatile memory with compelling advantages in endurability, scalability, speed, and energy consumption.
As the process technology shrinks, STT-MRAM has limited sensing margin due to the decrease in supply
voltage and increase in process variation. Furthermore, the relatively smaller resistance difference of two
states in STT-MRAM poses challenges for its read/write circuit design to maintain an acceptable sensing
margin. The proposed reference circuits optimization scheme solves the reference resistance distribution issue
to maximize the sensing margin and minimize the read disturbance, with low power consumption. Simulation
results show that the optimization scheme is able to significantly improve the read reliability in the presence
of one or a few cases of reference cell failure, thus eliminating the requirement of additional circuits for failure
detection of reference cell or referencing to neighboring blocks.
ETPL
VLSI - 092
Towards Low-Power High-Efficiency RF and Microwave Energy Harvesting
Since the very beginning of RF and microwave integrated techniques and energy harvesting, Schottky diodes
[...] μW-harvesting applications,
the Schottky diode technique fails to provide a satisfactory RF-dc conversion efficiency mainly because of its
high zero-bias junction resistance. This paper examines the state-of-the-art low-power microwave-to-dc
energy conversion techniques. A comprehensive picture of the state-of-the-art on this aspect is given
graphically, which compares different technologies such as transistor, diode, and CMOS schemes. Subsequent
to the highlighted limitations of current devices, this work introduces, for the first time, a nonlinear component
for low-power rectification based on a recent discovery in spintronics, namely, the spindiode. Along with an
analysis of the role of nonlinearity and zero bias resistance in the rectification process of the spindiode, it is
shown how the spindiode could enhance the rectification efficiency even at a very low-power level and how
this technique would shift the design paradigms of diode-based devices and circuits.
ETPL
VLSI - 093
Access Time and Power Dissipation of a Model 256-Bit Single Flux Quantum
RAM
Superconductor electronics offers logic circuits for high-speed data processing and high-performance
computing. The main barrier to practical application is the lack of high-speed and low-power memory. It is
widely believed that the most reliable and functional bit cell for superconducting memory is the vortex
transitional bit cell, which was successfully used by Nagasawa in a 4-kb memory. This paper reviews existing
challenges in this type of Josephson memory devices and discusses engineering issues in implementing a
model single flux quantum random access memory. We evaluate the contributions that various components of
the memory system make to delay and power dissipation. The 256-bit memory provides an experimentally
confirmed read access time of 190 ps. We found that delay and power dissipation arise largely
in the address decoder, line drivers, bit-selection scheme, and the data readout circuitry. With these circuits
being similar for various magnetic memory devices, our findings provide essential data for a comprehensive
assessment of new concepts for bit cells, readout, and write in superconducting memories.
ETPL
VLSI - 094
All-Optical Ultrafast Switching in 2 × 2 Silicon Microring Resonators and its
Application to Reconfigurable DEMUX/MUX and Reversible Logic Gates
We present a theoretical model to analyze all-optical switching by two-photon absorption induced free-carrier
injection in silicon 2 × 2 add-drop microring resonators. The theoretical simulations are in good agreement
with experimental results. The results have been used to design all-optical ultrafast (i) reconfigurable Demultiplexer/Multiplexer logic circuits using three microring resonator switches and (ii) universal, conservative
and reversible Fredkin and Toffoli logic gates with only one and two microring resonator switches
respectively. Switching has been optimized for low-power (25 mW) ultrafast (25 ps) operation with high
modulation depth (85%) to enable logic operations at 40 Gb/s. The combined advantages of high Q-factor,
tunability, compactness, cascadability, reversibility and reconfigurability make the designs favorable for
practical applications. The proposed designs provide a new paradigm for ultrafast CMOS-compatible all-optical reversible computing circuits in silicon.
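The logical behavior of the two gates named above can be checked in software. The sketch below encodes the standard Boolean definitions of the Fredkin (controlled swap) and Toffoli (controlled-controlled NOT) gates and verifies that both are reversible and that the Fredkin gate is conservative; it says nothing about the optical implementation.

```python
from itertools import product


def fredkin(c, a, b):
    """Controlled swap: exchange a and b when the control bit c is 1."""
    return (c, b, a) if c else (c, a, b)


def toffoli(a, b, c):
    """Controlled-controlled NOT: flip c when both a and b are 1."""
    return (a, b, c ^ (a & b))


inputs = list(product((0, 1), repeat=3))

# Reversible: each gate maps the 8 input patterns onto 8 distinct outputs.
fredkin_is_reversible = len({fredkin(*bits) for bits in inputs}) == 8
toffoli_is_reversible = len({toffoli(*bits) for bits in inputs}) == 8

# Conservative: the number of 1s is the same at input and output.
fredkin_is_conservative = all(sum(fredkin(*bits)) == sum(bits) for bits in inputs)

print(fredkin_is_reversible, toffoli_is_reversible, fredkin_is_conservative)
```

Both gates are their own inverse, so applying either one twice returns the original bits.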
ETPL
VLSI - 095
Using Lifetime-Aware Progressive Programming to Improve SLC NAND Flash
Memory Write Endurance
This paper advocates a lifetime-aware progressive programming concept to improve single-level per cell
NAND flash memory write endurance. NAND flash memory program/erase (P/E) cycling gradually degrades
memory cell storage noise margin, and sufficiently strong fault tolerance must be used to ensure the memory
P/E cycling endurance. As a result, the relatively large cell storage noise margin in early memory lifetime is
essentially wasted in conventional design practice. This paper proposes to always fully utilize the available
cell storage noise margin by adaptively adjusting the number of storage levels per cell, and progressively use
these levels to realize multiple 1-bit programming operations between two consecutive erase operations. This
simple progressive programming design concept is realized by two different implementation strategies, which
are discussed and compared in detail. On the basis of an approximate NAND flash memory device model, we
carried out simulations to quantitatively evaluate this design concept. The results show that it can improve the
write endurance by 35.9% and at the same time improve the average programming speed by 12% without
sacrificing read speed.
ETPL VLSI-096
Design of a Low-Voltage Low-Dropout Regulator
A low-voltage low-dropout (LDO) regulator that converts a 1-V input to an output of 0.85–0.5 V in 90-nm CMOS technology is proposed. A simple symmetric operational transconductance amplifier is used as the
error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the
closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise
cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast
responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the
proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current
efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly
50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 mm², owing to
the compact architecture.
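To put the 99.94% current efficiency in perspective, the short calculation below uses the standard definition, efficiency = I_load / (I_load + I_q), and solves for the quiescent current I_q. It assumes that the figure is quoted at the 100 mA maximum load; the abstract does not state this, so the result is only indicative.

```python
def quiescent_current_ma(load_ma, efficiency):
    """Quiescent current implied by efficiency = I_load / (I_load + I_q)."""
    return load_ma * (1.0 / efficiency - 1.0)


if __name__ == "__main__":
    # Assumption: the 99.94% figure applies at the 100 mA full load.
    i_q = quiescent_current_ma(100.0, 0.9994)
    print(f"implied quiescent current: {i_q * 1000:.1f} uA")
```

Under that assumption, the regulator itself would draw on the order of 60 µA.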
ETPL VLSI-097
Test Compaction by Sharing of Transparent-Scan Sequences Among Logic
Blocks
An approach to test application called transparent scan provides an opportunity to share tests among different
logic blocks whose primary inputs and outputs are included in scan chains even if the blocks have different
numbers of state variables. A transparent-scan sequence for one block is likely to detect faults in other blocks
since transparent scan does not distinguish between functional and scan clock cycles, and allows faults to be
detected at all the clock cycles of the sequence. Such sharing of tests is not meaningful with conventional
scan-based tests, especially when the blocks have different numbers of state variables. Transparent scan thus
enhances the ability to produce a compact test set for a group of logic blocks. The static test compaction
procedure described in this paper uses transparent-scan sequences that follow the application of conventional
scan-based tests precisely. The procedure obtains a set of transparent-scan sequences for a group of logic
blocks from compacted test sets for the logic blocks in the group. From this set, it selects a subset that detects
all the target faults that are detected by the complete set.
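The final selection step is a covering problem: choose few sequences that together detect every fault the full set detects. The abstract does not say how the subset is chosen, so the sketch below substitutes a generic greedy set-cover heuristic. The function name, sequence names and fault labels are hypothetical.

```python
def select_sequences(detects, targets=None):
    """Greedy set cover (a stand-in heuristic, not the paper's procedure).

    detects: dict mapping a sequence name to the set of faults it detects.
    targets: faults that must be covered; defaults to every fault detected
             by the complete set of sequences.
    Returns the names of the selected sequences, in selection order.
    """
    if targets is None:
        targets = set().union(*detects.values())
    remaining = set(targets)
    chosen = []
    while remaining:
        # Pick the sequence that detects the most still-uncovered faults;
        # ties go to the alphabetically first name for determinism.
        best = max(sorted(detects), key=lambda s: len(detects[s] & remaining))
        gained = detects[best] & remaining
        if not gained:
            raise ValueError("some target faults are not detected by any sequence")
        chosen.append(best)
        remaining -= gained
    return chosen


if __name__ == "__main__":
    example = {
        "T1": {"f1", "f2", "f3"},
        "T2": {"f3", "f4"},
        "T3": {"f4", "f5"},
        "T4": {"f2"},
    }
    print(select_sequences(example))
```

Greedy selection does not guarantee the smallest possible subset, but it keeps the property the abstract requires: no fault detected by the complete set is lost.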
ETPL VLSI-098
Multifunction Residue Architectures for Cryptography
A design methodology for incorporating Residue Number System (RNS) and Polynomial Residue Number
System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n), respectively, as well as a VLSI
architecture of a dual-field residue arithmetic Montgomery multiplier are presented in this paper. An analysis
of input/output conversions to/from residue representation, along with the proposed residue Montgomery
multiplication algorithm, reveals common multiply-accumulate data paths both between the converters and
between the two residue representations. A versatile architecture is derived that supports all operations of
Montgomery multiplication in GF(p) and GF(2^n), input/output conversions, Mixed Radix Conversion (MRC)
for integers and polynomials, dual-field modular exponentiation and inversion in the same hardware. Detailed
comparisons with state-of-the-art implementations prove the potential of residue arithmetic exploitation in
dual-field modular multiplication.
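For readers unfamiliar with residue arithmetic, the sketch below shows the integer case only: conversion into RNS, channel-wise multiplication, and recovery of the result by Mixed Radix Conversion. It is a plain illustration and not the paper's residue Montgomery algorithm. The moduli and function names are hypothetical, and the moduli are assumed to be pairwise coprime.

```python
def to_rns(x, moduli):
    """Represent x by its residues modulo each (pairwise coprime) modulus."""
    return [x % m for m in moduli]


def rns_mul(a, b, moduli):
    """Multiply two RNS numbers channel by channel, with no carries between channels."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]


def mixed_radix_convert(residues, moduli):
    """Recover the integer in [0, product of moduli) using Mixed Radix Conversion.

    The result is x = d0 + d1*m0 + d2*m0*m1 + ..., where the digits d_i
    are computed one at a time from the residues.
    """
    digits = list(residues)
    k = len(moduli)
    for i in range(k):
        for j in range(i + 1, k):
            inv = pow(moduli[i], -1, moduli[j])  # modular inverse (Python 3.8+)
            digits[j] = ((digits[j] - digits[i]) * inv) % moduli[j]
    x, weight = 0, 1
    for d, m in zip(digits, moduli):
        x += d * weight
        weight *= m
    return x


if __name__ == "__main__":
    moduli = (7, 11, 13)  # dynamic range 7 * 11 * 13 = 1001
    product_rns = rns_mul(to_rns(123, moduli), to_rns(5, moduli), moduli)
    print(product_rns, mixed_radix_convert(product_rns, moduli))
```

The product is recovered correctly only while it stays below the dynamic range, here 1001.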
ETPL VLSI-099
Scalable Montgomery Modular Multiplication Architecture with Low-Latency
and Low-Memory Bandwidth Requirement
Montgomery modular multiplication is widely used in public-key cryptosystems. This work shows how to
relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the
current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling
scheme to reduce the number of memory accesses in the developed scalable architecture. Analytical results
show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times
that of conventional scalable architectures, where w denotes word size. The proposed architecture also retains a latency
of exactly one cycle between the operations of the same words in two consecutive iterations of the
Montgomery modular multiplication algorithm when employing enough processing elements. Compared to the
design in the related work, experimental results demonstrate that the proposed design achieves an almost 54
percent reduction in power consumption with no degradation in throughput. The reduced number of memory
accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for
any precision of operands.
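As background, the sketch below shows a conventional word-serial Montgomery multiplication in which one w-bit word of the multiplier is consumed per iteration. It illustrates the baseline algorithm only, not the relaxed data dependency or the scheduling scheme proposed in the paper. The function name, default word size and example operands are assumptions made for this example.

```python
def montgomery_multiply(a, b, n, w=8, num_words=None):
    """Return a * b * R^(-1) mod n, with R = 2^(w * num_words).

    Requires an odd modulus n and operands a, b < n. One w-bit word of a
    is processed per iteration (radix-2^w Montgomery multiplication).
    """
    if n % 2 == 0:
        raise ValueError("the modulus must be odd")
    if num_words is None:
        num_words = (n.bit_length() + w - 1) // w
    mask = (1 << w) - 1
    n_prime = (-pow(n, -1, 1 << w)) & mask  # -n^(-1) mod 2^w (Python 3.8+)
    s = 0
    for i in range(num_words):
        a_i = (a >> (w * i)) & mask         # i-th word of a
        s += a_i * b
        q = ((s & mask) * n_prime) & mask   # makes s + q*n divisible by 2^w
        s = (s + q * n) >> w
    return s - n if s >= n else s           # single final correction


if __name__ == "__main__":
    n, w, words = 239, 4, 2
    R = 1 << (w * words)
    r2 = (R * R) % n
    a, b = 123, 77
    a_m = montgomery_multiply(a, r2, n, w, words)    # a * R mod n
    b_m = montgomery_multiply(b, r2, n, w, words)    # b * R mod n
    ab_m = montgomery_multiply(a_m, b_m, n, w, words)
    print(montgomery_multiply(ab_m, 1, n, w, words), (a * b) % n)
```

The usage example converts both operands into the Montgomery domain, multiplies them there, and converts back, so the two printed values should agree.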
ETPL VLSI-100
An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes
Decoder Implementation
Polar codes have recently received a lot of attention because of their capacity-achieving performance and low
encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of the
polar codes highly depends on that of the partial-sum network (PSN) implementation. Hence, in this work, an
efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum
updating algorithm and the corresponding PSN architecture are introduced which achieve a delay performance
independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-performance and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented to
integrate seamlessly with the folded processing element architecture. This is achieved by using a novel folded
decoding schedule. As a result, both the critical path delay and the area (excluding the memory for folding) of
the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are
implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that
for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the
area is reduced by as much as 50.4%, compared with the state-of-the-art designs.
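In successive cancellation decoding, partial sums are XOR combinations of already-decoded bits, formed with the same butterfly structure as the polar encoder. The sketch below shows only that underlying transform (natural bit order, no bit reversal) in software; it does not model the proposed updating algorithm or the folded architecture. The function name is hypothetical.

```python
def polar_transform(u):
    """Compute x = u * F^(tensor n) over GF(2), with F = [[1, 0], [1, 1]].

    The length of u must be a power of two. Each stage XORs the upper half
    of every block with its lower half (the polar butterfly).
    """
    x = list(u)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for k in range(start, start + step):
                x[k] ^= x[k + step]
        step *= 2
    return x


if __name__ == "__main__":
    print(polar_transform([1, 0, 1, 1]))
```

Because the transform is its own inverse over GF(2), applying it twice returns the input, which is a convenient sanity check.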
ETPL VLSI-101
Improved Matching-Pursuit Implementation for LTE Channel Estimation
An implementation of a reduced complexity matching pursuit channel estimator for LTE is presented. The
design contains an FFT/IFFT module with non-radix-2 units and a core estimator. The module is flexible
enough to perform FFT and IFFT at the different resolutions needed, using the same hardware. The required
internal word lengths are determined from prior work. Internal shifts are employed to maximize the use of available
resources. The design is implemented in a 65 nm low-power process from STMicroelectronics. The total area
of the implementation is 1 mm², including input pads and extra control logic. The algorithmic
improvements reduce the complexity by up to 56% compared to prior art. At the same time, the estimator shows
a large improvement in speed, allowing over 6 times the number of estimations in the same time. The power
consumption of the estimator is simulated to be approximately 20 mW when running at 70 MHz.
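For context, the sketch below implements the generic matching pursuit loop on real-valued data: correlate the residual with every dictionary atom, keep the strongest match, subtract it, and repeat. It is a textbook version, not the reduced-complexity estimator described above, and it omits the FFT/IFFT processing. The function name and the toy dictionary are hypothetical, and atoms are assumed to have unit norm.

```python
def matching_pursuit(y, dictionary, num_iters):
    """Greedy sparse approximation of y over a dictionary of unit-norm atoms.

    y: list of floats. dictionary: list of atoms, each a list of floats.
    Returns (coeffs, residual), where coeffs[k] is the weight given to atom k.
    """
    residual = list(y)
    coeffs = [0.0] * len(dictionary)
    for _ in range(num_iters):
        # Correlate the current residual with every atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        # Keep the atom with the largest correlation magnitude.
        k = max(range(len(corr)), key=lambda i: abs(corr[i]))
        coeffs[k] += corr[k]
        # Remove that atom's contribution from the residual.
        residual = [r - corr[k] * a for r, a in zip(residual, dictionary[k])]
    return coeffs, residual


if __name__ == "__main__":
    atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(matching_pursuit([0.0, 3.0, -1.0], atoms, num_iters=2))
```

In a channel estimation setting, each atom would typically correspond to one candidate path delay and each selected coefficient to the estimated gain of that path.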
ETPL VLSI-102
A Multicast Tree Router for Multichip Neuromorphic Systems
We present a tree router for multichip systems that guarantees deadlock-free multicast packet routing without
dropping packets or restricting their length. Multicast routing is required to efficiently connect massively
parallel systems' computational units when each unit is connected to thousands of others residing on multiple
chips, which is the case in neuromorphic systems. Our tree router implements this one-to-many routing by
branching recursively, broadcasting the packet within a specified subtree. Within this subtree, the packet is
only accepted by chips that have been programmed to do so. This approach boosts throughput because
memory look-ups are avoided en route, and keeps the header compact because it only specifies the route to the
subtree's root. Deadlock is avoided by routing in two phases (an upward phase and a downward phase) and by
restricting branching to the downward phase. This design is the first fully implemented wormhole router with
packet-branching that can never deadlock. The design's effectiveness is demonstrated in Neurogrid, a million-neuron neuromorphic system consisting of sixteen chips. Each chip has a 256 × 256 silicon-neuron array
integrated with a full-custom asynchronous VLSI implementation of the router that delivers up to 1.17 G
words/s across the sixteen-
ETPL VLSI-103
VLSI implementation of high-throughput parallel H.264/AVC baseline intra-predictor
This study presents a parallel very-large-scale integration (VLSI) architecture for an intra-predictor based on a
fast 4 × 4 algorithm. For real-time scheduling, the proposed algorithm overcomes the data dependency
between intra-prediction and intra-coding, thereby improving coding performance and reducing the number of
coding cycles. The high-speed architecture for intra-prediction includes configurable computation cores to
process YUV components using 10 pixel parallelism. Prediction for one macro-block (MB) coding
(luminance: 4 × 4 and 16 × 16 block modes; chrominance: 8 × 8 block modes) can all be completed within 256
cycles. The proposed architecture achieves a throughput of 410 kMB/s, suitable for a 1920 × 1080/35 Hz 4:2:0
HDTV encoder at a working frequency of 105 MHz.
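The quoted throughput follows from the cycle budget: 105 MHz divided by 256 cycles per macroblock gives about 410 thousand macroblocks per second. The short check below reproduces that figure and compares it with the macroblock rate of the stated video format. It assumes 16 × 16 macroblocks with the frame height rounded up to a whole number of macroblock rows; the function names are hypothetical.

```python
import math


def macroblocks_per_second(clock_hz, cycles_per_mb):
    """Macroblock throughput when each macroblock takes a fixed number of cycles."""
    return clock_hz / cycles_per_mb


def macroblocks_per_frame(width, height, mb_size=16):
    """Macroblocks per frame, rounding partial rows and columns up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


if __name__ == "__main__":
    available = macroblocks_per_second(105e6, 256)
    required = macroblocks_per_frame(1920, 1080) * 35
    print(f"available: {available:.0f} MB/s, required at 35 Hz: {required} MB/s")
```

Under these assumptions, a 1920 × 1080 frame holds 8160 macroblocks, so 35 frames per second require 285,600 macroblocks per second, which is within the available rate.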
Thank You !