Logic Synthesis – New Decomposition Algorithm for Threshold Synthesis

The paper describes an algorithm that decomposes a Boolean function into threshold functions. Threshold functions are important because a threshold function models a neuron, which gives them important applications in neural networks. To improve speed, the algorithm makes use of unate functions: checking for unateness is faster than checking whether a function is a threshold function, and every threshold function is unate.

The algorithm works as follows. The original function is added to the workset. Each function taken from the workset is simplified if necessary, then checked for unateness. If it is not unate, it is split on the variable with the highest influence using Shannon's decomposition, and the resulting functions are added back to the workset. If it is unate, it is checked to see whether it is a threshold function; if not, it is split and the resulting functions are added to the workset; if so, it is added to the solution set. The process repeats until the workset is empty. All functions in the solution set are then combined with either an OR or an AND, depending on whether the truth table contains more 1s than 0s or not; this choice also influences how the functions are split.

The algorithm is tested against decision tree algorithms, feedforward neural networks, and an implementation of the nearest-neighbour algorithm, using 60% of examples to train and 40% to test for each function. The generalization ability of the algorithm was found to be comparable to the other methods, though with an increased amount of interconnect.

Necessary Definitions

Threshold Function: a function f(x1, ..., xn) that is 1 exactly when w1*x1 + ... + wn*xn >= T for some fixed weights wi and threshold T.
Positive Unate Variable: a variable xi such that changing xi from 0 to 1 never changes f from 1 to 0 (f is monotonically non-decreasing in xi).
Negative Unate Variable: a variable xi such that changing xi from 0 to 1 never changes f from 0 to 1 (f is monotonically non-increasing in xi).
Unate Function: a function in which every variable is either positive or negative unate.

ALU – Bridge Floating-Point Fused Multiply-Add Design

A fused multiply-add (FMA) is a unit designed to perform (A×B)+C in a single instruction.
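The worklist loop in the Logic Synthesis summary above can be sketched over explicit truth tables. This is a toy reconstruction, not the paper's implementation: the brute-force threshold search, the tie-breaking in the influence heuristic, and the OR-only recombination (the paper also uses AND, chosen by the 1s/0s majority) are all simplifications.

```python
from itertools import product

def idx(bits):
    """Row index of an input assignment (matches product((0, 1), ...) order)."""
    return int("".join(map(str, bits)), 2)

def is_unate(tt, n):
    """Every variable must be positive or negative unate -- the cheap filter."""
    for i in range(n):
        pos = neg = True
        for bits in product((0, 1), repeat=n):
            if bits[i]:
                continue
            lo = tt[idx(bits)]
            hi = tt[idx(bits[:i] + (1,) + bits[i + 1:])]
            pos &= lo <= hi          # raising x_i never lowers f
            neg &= lo >= hi          # raising x_i never raises f
        if not (pos or neg):
            return False
    return True

def is_threshold(tt, n, bound=2):
    """Brute-force search for weights w and threshold t with
    f(x) = 1 iff sum(w_i * x_i) >= t.  Exponential -- sketch only."""
    rng = range(-bound, bound + 1)
    return any(
        all((sum(w * b for w, b in zip(ws, bits)) >= t) == bool(tt[idx(bits)])
            for bits in product((0, 1), repeat=n))
        for ws in product(rng, repeat=n) for t in rng)

def influence(tt, n, i):
    """How often flipping x_i flips f -- the split-variable heuristic."""
    return sum(tt[idx(bits)] != tt[idx(bits[:i] + (1,) + bits[i + 1:])]
               for bits in product((0, 1), repeat=n) if not bits[i])

def decompose(tt, n):
    """Worklist loop: returns truth tables of threshold functions whose
    OR equals tt.  Each split fixes one variable, so depth is at most n."""
    work, solution = [(list(tt), frozenset(range(n)))], []
    while work:
        f, free = work.pop()
        if is_unate(f, n) and is_threshold(f, n):
            if any(f):                          # drop all-zero pieces
                solution.append(f)
            continue
        i = max(free, key=lambda v: influence(f, n, v))
        rows = list(product((0, 1), repeat=n))
        # Shannon split: f = (~x_i AND f) OR (x_i AND f)
        work.append(([v if not bits[i] else 0 for v, bits in zip(f, rows)],
                     free - {i}))
        work.append(([v if bits[i] else 0 for v, bits in zip(f, rows)],
                     free - {i}))
    return solution
```

For example, XOR (not unate, not threshold) splits into the two threshold pieces ¬x0∧x1 and x0∧¬x1, which are then combined with an OR.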
Because it is designed for this one instruction, it is both faster and more precise than two consecutive instructions through a standard multiplier and adder. The unit can also perform standard addition and multiplication by setting B to 1 or C to 0 respectively, so it could replace the adder and multiplier altogether, though addition and multiplication would then have greater latency and could no longer execute in parallel.

This paper implements a bridge between the FADD and FMUL units, offering the capabilities of an FADD, an FMUL, and an FMA without needing an entirely separate FMA unit. An FMA consists of a series of steps: A and B are multiplied while C is aligned; a carry-save adder then adds the product A×B and C; the result is normalized and rounded. The bridge unit reuses the multiplier array from the FMUL, which must therefore send out extra outputs, and the rounding unit from the FADD, whose multiplexer must be able to select the additional FMA path. Clock-gating shuts down the unnecessary parts for each instruction.

The bridge FMA suffers additional latency and power consumption compared to a standalone FMA, while the FADD and FMUL units within the bridge have a negligible increase in latency and no increase in power consumption. For FMA support, the alternatives are either a standalone FMA or an FADD, FMUL, and FMA all in parallel. The bridge's latency and power savings on standard add and multiply instructions over the former are significant, as are its area savings over the latter.

DFP – A DFP Adder with Decoded Operands and a Decimal Leading-Zero Anticipator

Decimal floating-point arithmetic has become important for many commercial and monetary applications, where the inaccuracies of binary floating-point arithmetic begin to add up.
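Both ideas in these two sections, the FMA's single rounding and decimal arithmetic for monetary code, can be illustrated with Python's stdlib decimal module (a software illustration only, unrelated to the hardware in either paper):

```python
from decimal import Decimal, getcontext

getcontext().prec = 3                 # 3 significant digits makes rounding visible
a, b, c = Decimal("1.01"), Decimal("1.01"), Decimal("-1.02")

fused = a.fma(b, c)                   # (a*b)+c with a single final rounding
unfused = a * b + c                   # a*b is rounded to 3 digits first

print(fused)                          # 0.0001 -- the true result survives
print(unfused)                        # 0.00   -- the intermediate rounding lost it
```

The fused form keeps the exact product 1.0201 through the addition, so only one rounding error can occur; the unfused form rounds 1.0201 to 1.02 first and the small difference cancels away entirely.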
Naturally, improving the speed of the basic operations is important for high overall performance. This paper improves the latency of the DFP adder through a new internal format and a leading-zero anticipator (LZA).

Leading-zero detection is the act of finding the location of the most significant non-zero digit. A leading-zero anticipator finds that location from the operator and operands without actually computing the result. The leading-zero count (LZC) is the numerical value defining the location of the most significant non-zero digit.

The DFP adder uses a new internal format that keeps track of the leading-zero count, which removes the need for leading-zero detection on the critical path; the LZC is instead used as an input signal. The leading-zero anticipator works in parallel with the addition to find the leading-zero count of the output. For addition, the preliminary LZC is the minimum of the LZCs of the two significands being added; in the event of a carry, the final LZC is that value reduced by one. For subtraction, the preliminary LZC is found as in addition, and the incoming values produce a variety of flags which, through Boolean logic based on predetermined string patterns, determine how the preliminary LZC should be modified.

The proposed adder is tested on randomly generated cases and IBM's corner cases to verify correctness. Compared against the previous adder design, it is 14% faster but uses 18% more area. The LZA has a maximum delay of 24 FO4 inverter delays.

Cache – Efficient Use of Multibit Error-Correcting Codes in L2 Cache

As CMOS technology shrinks, the number of random defects increases. These defects are traditionally handled with redundant replacement rows, columns, and words. However, as random defects increase, redundancy alone may no longer be sufficient.
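The addition rule from the DFP summary above (preliminary LZC is the minimum operand LZC, reduced by one on a carry) can be checked with a small model over fixed-width decimal significands. This is a sketch only: the subtraction flag logic is omitted, and the carry detection here uses a plain comparison rather than real anticipation hardware.

```python
def lzc(digits):
    """Leading-zero count: index of the most significant non-zero digit."""
    return next((i for i, d in enumerate(digits) if d), len(digits))

def to_int(digits):
    return int("".join(map(str, digits)))

def predicted_lzc(a, b):
    """LZA rule for addition: minimum of the operand LZCs, reduced by
    one when the sum carries into the next digit position."""
    prelim = min(lzc(a), lzc(b))
    carry = to_int(a) + to_int(b) >= 10 ** (len(a) - prelim)
    return prelim - 1 if carry else prelim

def actual_lzc(a, b):
    """Ground truth: LZC of the fixed-width sum."""
    s = str(to_int(a) + to_int(b)).rjust(len(a), "0")
    return lzc([int(ch) for ch in s])
```

For example, 0725 + 0600 = 1325 carries into the thousands digit, so the predicted LZC drops from 1 to 0, while 0125 + 0050 = 0175 does not carry and keeps the preliminary LZC of 1.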
This paper extends the role of error-correcting codes (ECC) to compensate for these defects. ECC are codes that can correct errors in stored data; they are traditionally used to correct soft (transient) errors. Multi-bit ECC (M-ECC) can correct multiple errors in data, though at larger latency. This makes it unusable in the L1 cache, though selective use in the L2 cache may still be reasonable.

Cache blocks with multiple defects are identified during memory testing, and content-addressable memory (CAM) is then used to identify these blocks at runtime. The traditional L2 cache is used for subblocks with at most one defect. An M-ECC cache core is used for subblocks with two defects (referred to as m-blocks). Subblocks with three or more defects are repaired through redundancy.

Because M-ECC has larger latency and energy consumption, two additional buffers reduce how often it is accessed. The pre-decoding buffer is a cache that keeps copies of recently accessed m-blocks. The FLU buffer is a small CAM that keeps the addresses of recently accessed blocks that are not m-blocks. Both use an LRU policy and effectively reduce M-ECC cache accesses.

With ECC devoted to defect tolerance, data integrity becomes an issue due to soft errors. Clean blocks can be restored from memory, so a solution is needed only for dirty blocks: a dirty-replication (DR) cache. When a block is made dirty, its data is duplicated in the DR cache, and data leaving the DR cache is written back to main memory, so a data backup is always available.

The pre-decoding buffer, FLU buffer, and DR cache must all be 64 blocks for consistent results across platforms. IPC performance is nearly the same as with a defect-free L2 cache. Power is on average 36% higher than a defect-free L2 cache, which is reasonable as the L2 cache's power consumption is only 10% of the entire system's. The scheme uses 2.5% more area than the L2 cache alone.
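The access path above can be sketched as a dispatch function. The lookup order, return labels, and buffer behavior here are assumptions for illustration, not the paper's exact microarchitecture; the real pre-decoding buffer also holds decoded data, which is omitted.

```python
from collections import OrderedDict

class LRUBuffer:
    """Small LRU set of block addresses, standing in for the 64-entry
    pre-decoding and FLU buffers described above."""
    def __init__(self, capacity=64):
        self.cap, self.entries = capacity, OrderedDict()

    def hit(self, addr):
        if addr in self.entries:
            self.entries.move_to_end(addr)    # refresh LRU position
            return True
        return False

    def insert(self, addr):
        self.entries[addr] = True
        self.entries.move_to_end(addr)
        if len(self.entries) > self.cap:
            self.entries.popitem(last=False)  # evict least recently used

def route_access(addr, m_blocks, flu, predecode):
    """Decide which path serves a block: the buffers filter most accesses
    so the slower CAM lookup and M-ECC decode run only on double misses."""
    if flu.hit(addr):                 # recently seen non-m-block
        return "normal L2 path"
    if predecode.hit(addr):           # recently seen m-block
        return "M-ECC path (buffered)"
    if addr in m_blocks:              # CAM lookup of the defect map
        predecode.insert(addr)
        return "M-ECC path"
    flu.insert(addr)
    return "normal L2 path"
```

A repeated access to an m-block hits the pre-decoding buffer on the second try, and a repeated access to a normal block hits the FLU buffer, so neither pays the CAM/M-ECC cost twice in a row.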