Part II: Circuit-Level Parallelism
Winter 2014 — Parallel Processing, Circuit-Level Parallelism

Sorting and Searching; Numerical Computations
7. Sorting and Selection Networks
8A. Search Acceleration Circuits
8B. Arithmetic and Counting Circuits
8C. Fourier Transform Circuits

About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition history: first edition released Spring 2005; revised Spring 2006, Fall 2008, Fall 2010, Winter 2013, and Winter 2014.

II. Circuit-Level Parallelism
Circuit-level specs: the most realistic parallel computation model
• Concrete circuit model; incorporates hardware details
• Allows realistic speed and cost comparisons
• Useful for stand-alone systems or acceleration units
Topics in This Part
Chapter 7: Sorting and Selection Networks
Chapter 8A: Search Acceleration Circuits
Chapter 8B: Arithmetic and Counting Circuits
Chapter 8C: Fourier Transform Circuits

7. Sorting and Selection Networks
Become familiar with the circuit model of parallel processing:
• Go from algorithm to architecture, not vice versa
• Use a familiar problem to study various trade-offs
Topics in This Chapter
7.1 What Is a Sorting Network?
7.2 Figures of Merit for Sorting Networks
7.3 Design of Sorting Networks
7.4 Batcher Sorting Networks
7.5 Other Classes of Sorting Networks
7.6 Selection Networks

7.1 What Is a Sorting Network?
An n-input sorting network, or n-sorter, takes inputs x0, x1, x2, . . . , x_n–1 and produces outputs y0, y1, y2, . . . , y_n–1 that are a permutation of the inputs satisfying y0 ≤ y1 ≤ . . . ≤ y_n–1 (non-descending order).
Fig. 7.1 An n-input sorting network or an n-sorter.
The basic building block is the 2-sorter: output 0 = min(input 0, input 1) and output 1 = max(input 0, input 1).
Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.

Building Blocks for Sorting Networks
With bit-parallel inputs a and b, a 2-sorter is a comparator ("b < a?") driving two multiplexers that route min(a, b) and max(a, b) to the outputs. With MSB-first bit-serial inputs, a pair of set-reset flip-flops ("a < b?" and "b < a?") records which operand wins at the first unequal bit and steers the remaining bits accordingly; the flip-flops are reset between words.
Fig. 7.3 Parallel and bit-serial hardware realizations of a 2-sorter.

Proving a Sorting Network Correct
Fig. 7.4 Block diagram and schematic representation of a 4-sorter (shown with the input 3, 2, 5, 1 traced through the network).
Method 1: Exhaustive test – try all n! possible input orders.
Method 2: Ad hoc proof – for the example above, note that y0 is the smallest, y3 is the largest, and the last comparator sorts the other two outputs.
Method 3: Use the zero-one principle – a comparison-based sorting network is correct iff it correctly sorts all 0-1 sequences (2^n tests).
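The zero-one principle makes Method 3 mechanical. A quick Python sketch (an illustration, not part of the slides) that represents a network as comparator levels, checks the 5-comparator 4-sorter against all 2^4 zero-one inputs, and counts its figures of merit:

```python
from itertools import product

# The 4-sorter of Fig. 7.4 as comparator levels; comparator (i, j)
# leaves min on line i and max on line j.
LEVELS = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]

def run_network(levels, values):
    """Apply a comparator network to a list of values."""
    v = list(values)
    for level in levels:
        for i, j in level:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

# Zero-one principle: checking all 2^4 = 16 bit patterns suffices,
# instead of all 4! = 24 permutations.
assert all(run_network(LEVELS, bits) == sorted(bits)
           for bits in product([0, 1], repeat=4))

cost = sum(len(level) for level in LEVELS)   # number of comparators
delay = len(LEVELS)                          # number of levels
print(cost, delay, cost * delay)             # 5 3 15
```

The same `run_network` helper works for every network in this chapter, since they differ only in their comparator lists.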
Elaboration on the Zero-One Principle
Example: an invalid 6-sorter maps 3, 6, 9, 1, 8, 5 to 1, 3, 6, 5, 8, 9, with the pair 6, 5 out of order; the corresponding 0-1 input 0, 1, 1, 0, 1, 0 maps to 0, 0, 1, 0, 1, 1, which is likewise unsorted. In general, to derive a 0-1 sequence that is not correctly sorted from an arbitrary sequence that is not correctly sorted: let outputs yi and yi+1 be out of order, that is, yi > yi+1; replace inputs that are strictly less than yi with 0s and all others with 1s. The resulting 0-1 sequence will not be correctly sorted either.

7.2 Figures of Merit for Sorting Networks
Cost: the number of comparators. The 4-sorter of Fig. 7.4 has 5 comparators.
Delay: the number of levels. The same 4-sorter has 3 comparator levels on its critical path.
The cost-delay product for this example is 15.
Fig. 7.4 Block diagram and schematic representation of a 4-sorter.

Cost as a Figure of Merit
Optimal size is known for n = 1 to 8: 0, 1, 3, 5, 9, 12, 16, 19.
Fig. 7.5 Some low-cost sorting networks: n = 6, 12 modules, 5 levels; n = 9, 25 modules, 9 levels; n = 10, 29 modules, 9 levels; n = 12, 39 modules, 9 levels; n = 16, 60 modules, 10 levels.

Delay as a Figure of Merit
Optimal delay is known for n = 1 to 10: 0, 1, 3, 3, 5, 5, 6, 6, 7, 7. (Comparators acting on disjoint lines constitute one level.)
Fig. 7.6 Some fast sorting networks: n = 6, 12 modules, 5 levels; n = 9, 25 modules, 8 levels; n = 10, 31 modules, 7 levels; n = 12, 40 modules, 8 levels; n = 16, 61 modules, 9 levels.

Cost-Delay Product as a Figure of Merit
Low-cost 10-sorter from Fig. 7.5: Cost × Delay = 29 × 9 = 261.
Fast 10-sorter from Fig. 7.6: Cost × Delay = 31 × 7 = 217.
The most cost-effective n-sorter may be neither the fastest design nor the lowest-cost design.

7.3 Design of Sorting Networks
Brick-wall n-sorter based on odd-even transposition (rotate the diagram by 90 degrees to see the odd-even exchange patterns):
C(n) = n(n – 1)/2    D(n) = n    Cost × Delay = n²(n – 1)/2 = Θ(n³)
Fig. 7.7 Brick-wall 6-sorter based on odd-even transposition.

Insertion Sort and Selection Sort
An n-sorter can be built from an (n–1)-sorter plus a chain of 2-sorters, either inserting x_n–1 into the sorted result (insertion sort) or extracting the extreme value before sorting the rest (selection sort). Parallel insertion sort = parallel selection sort = parallel bubble sort!
C(n) = n(n – 1)/2    D(n) = 2n – 3    Cost × Delay = Θ(n³)
Fig. 7.8 Sorting network based on insertion sort or selection sort.

Theoretically Optimal Sorting Networks
The AKS sorting network (Ajtai, Komlós, Szemerédi, 1983) achieves O(log n) depth and O(n log n) size. Note that even for these optimal networks the cost-delay product is suboptimal, but this is the best we can do. Existing practical sorting networks have O(log² n) latency and O(n log² n) cost; given that log2 n is only 20 for n = 1,000,000, the latter are more practical. Unfortunately, AKS networks are not practical owing to the large (4-digit) constant factors involved; improvements since 1983 have not been enough.
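Returning to the brick-wall design of Fig. 7.7: its comparator levels are easy to generate for any n. A small Python sketch (illustrative, not from the text) that builds the levels and checks them with the zero-one principle:

```python
from itertools import product

def brick_wall(n):
    """Comparator levels of the odd-even transposition (brick-wall) n-sorter:
    n levels that alternate between comparators on even-odd and odd-even
    adjacent line pairs."""
    return [[(i, i + 1) for i in range(level % 2, n - 1, 2)]
            for level in range(n)]

def run_network(levels, values):
    v = list(values)
    for level in levels:
        for i, j in level:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

levels = brick_wall(6)
assert sum(map(len, levels)) == 6 * 5 // 2      # C(n) = n(n-1)/2 = 15
assert len(levels) == 6                         # D(n) = n
# Zero-one principle: 2^6 = 64 tests verify the 6-sorter.
assert all(run_network(levels, bits) == sorted(bits)
           for bits in product([0, 1], repeat=6))
```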
7.4 Batcher Sorting Networks
Batcher's even-odd merge of a first sorted sequence x (here of length 4) and a second sorted sequence y (here of length 7): the even-indexed elements of x and y enter a (2, 4)-merger and the odd-indexed elements a (2, 3)-merger; the merged outputs v and w are then combined by a final row of comparators. The sub-mergers decompose the same way, bottoming out in (1, 1)- and (1, 2)-mergers.
Fig. 7.9 Batcher's even-odd merging network for 4 + 7 inputs.

Proof of Batcher's Even-Odd Merge
Use the zero-one principle. Assume x has k 0s and y has k′ 0s. Then sub-merger output v receives k_even = ⌈k/2⌉ + ⌈k′/2⌉ 0s and sub-merger output w receives k_odd = ⌊k/2⌋ + ⌊k′/2⌋ 0s.
Case a: k_even = k_odd – interleaving v and w already yields a sorted 0-1 sequence.
Case b: k_even = k_odd + 1 – the interleaving is again sorted.
Case c: k_even = k_odd + 2 – exactly one adjacent pair is out of order, and the final row of comparators fixes it.

Batcher's Even-Odd Merge Sorting
Batcher's (m, m) even-odd merger, for m a power of 2:
D(m) = D(m/2) + 1 = log2 m + 1
C(m) = 2C(m/2) + m – 1 = (m – 1) + 2(m/2 – 1) + 4(m/4 – 1) + . . . = m log2 m + 1
Cost × Delay = Θ(m log² m)
Fig. 7.10 The recursive structure of Batcher's even-odd merge sorting network: two n/2-sorters followed by an (n/2, n/2)-merger.
Batcher sorting networks based on the even-odd merge technique:
C(n) = 2C(n/2) + (n/2) log2(n/2) + 1 ≅ n (log2 n)² / 2
D(n) = D(n/2) + log2(n/2) + 1 = D(n/2) + log2 n = log2 n (log2 n + 1)/2
Cost × Delay = Θ(n log⁴ n)

Example: Batcher's Even-Odd 8-Sorter
Two 4-sorters sort the halves; an even (2, 2)-merger and an odd (2, 2)-merger then merge them, with a final row of comparators.
Fig. 7.11 Batcher's even-odd merge sorting network for eight inputs.
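The recursive construction above can be transcribed directly into code that emits the comparator list. A Python sketch (following the standard Batcher construction for n a power of 2; the 19-comparator count for n = 8 matches the network of Fig. 7.11):

```python
from itertools import product

def batcher_sorter(n):
    """Comparator list of Batcher's even-odd merge sorter (n a power of 2)."""
    comps = []
    def sort(lo, length):
        if length > 1:
            m = length // 2
            sort(lo, m)
            sort(lo + m, m)
            merge(lo, length, 1)
    def merge(lo, length, stride):
        step = 2 * stride
        if step < length:
            merge(lo, length, step)           # merge the even-indexed lines
            merge(lo + stride, length, step)  # merge the odd-indexed lines
            for i in range(lo + stride, lo + length - stride, step):
                comps.append((i, i + stride)) # final fix-up comparators
        else:
            comps.append((lo, lo + stride))   # (1, 1)-merger base case
    sort(0, n)
    return comps

def apply(comps, values):
    v = list(values)
    for i, j in comps:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

comps = batcher_sorter(8)
assert len(comps) == 19      # the known optimal size for n = 8
assert all(apply(comps, bits) == sorted(bits)
           for bits in product([0, 1], repeat=8))
```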
Bitonic-Sequence Sorter
A bitonic sequence rises then falls (e.g., 1 3 3 4 6 6 6 2 2 1 0 0), falls then rises (e.g., 8 7 7 6 6 6 5 4 6 8 8 9), or is obtained from such a sequence by cyclic rotation (e.g., 8 9 8 7 7 6 6 6 5 4 6 8, the previous sequence right-rotated by 2). To sort a bitonic sequence, shift the right half of the data onto the left half (superimposing the two halves); in each position, keep the smaller value of each pair and ship the larger value to the right half. Each half is then a bitonic sequence that can be sorted independently.
Fig. 14.2 Sorting a bitonic sequence on a linear array.

Batcher's Bitonic Sorting Networks
2-input sorters feed 4-input bitonic-sequence sorters, which feed an 8-input bitonic-sequence sorter, and so on.
Fig. 7.12 The recursive structure of Batcher's bitonic sorting network.
Fig. 7.13 Batcher's bitonic sorting network for eight inputs.

7.5 Other Classes of Sorting Networks
Fig. 7.14 Periodic balanced sorting network for eight inputs.
Desirable properties:
a. Regular and modular (easier VLSI layout)
b. Simpler circuits via reuse of the blocks
c. With an extra block, tolerates some faults (missed exchanges)
d. With 2 extra blocks, provides tolerance to single faults (a missed or incorrect exchange)
e. Multiple passes through a faulty network (graceful degradation)
f. A single-block design becomes fault-tolerant by using an extra stage

Shearsort-Based Sorting Networks (1)
Snake-like row sorts alternate with column sorts, mimicking shearsort on the corresponding 2-D mesh.
Fig. 7.15 Design of an 8-sorter based on shearsort on a 2 × 4 mesh.

Shearsort-Based Sorting Networks (2)
Left and right column sorts alternate with snake-like row sorts; this variant shares some of the advantages of periodic balanced sorting networks.
Fig. 7.16 Design of an 8-sorter based on shearsort on a 2 × 4 mesh.
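The bitonic recursion above (build an ascending and a descending half, then repeatedly compare-exchange the two halves) fits in a few lines of Python. A software model of the network, not a hardware description:

```python
def bitonic_sort(values, ascending=True):
    """Batcher's bitonic sort (len(values) a power of 2): sort the halves in
    opposite directions to form a bitonic sequence, then clean it."""
    n = len(values)
    if n == 1:
        return list(values)
    first = bitonic_sort(values[:n // 2], True)
    second = bitonic_sort(values[n // 2:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence: keep the smaller of each superimposed pair on
    the left, ship the larger right, then recurse on each (bitonic) half."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

# The rotated bitonic example from the slide:
assert bitonic_merge([8, 9, 8, 7, 7, 6, 6, 5], True) == [5, 6, 6, 7, 7, 8, 8, 9]
assert bitonic_sort([3, 1, 4, 1, 5, 9, 2, 6]) == [1, 1, 2, 3, 4, 5, 6, 9]
```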
7.6 Selection Networks
Direct design may yield simpler/faster selection networks. Example: to obtain the 3rd smallest element (or the smallest three values), take Batcher's even-odd merge 8-sorter (two 4-sorters feeding an even and an odd (2, 2)-merger) and delete the parts that cannot affect the first three outputs: an entire sub-block can be removed if only the smallest three inputs are needed, along with four comparators.
Deriving an (8, 3)-selector from Batcher's even-odd merge 8-sorter.

Categories of Selection Networks
Unfortunately, we know even less about selection networks than we do about sorting networks. One can define three selection problems [Knut81]:
I. Select the k smallest values and present them in sorted order
II. Select the kth smallest value
III. Select the k smallest values and present them in any order
Circuit and time complexity: type I is the hardest, type III the easiest.
For 8 inputs and k = 4: a type-I (8, 4)-selector outputs the smallest, 2nd smallest, 3rd smallest, and 4th smallest; a type-II selector outputs only the 4th smallest; a type-III selector outputs the 4 smallest in any order.

Type-III Selection Networks
Figure 7.17 A type-III (8, 4)-selector.

Classifier Networks
Classifiers: selectors that separate the smaller half of the values from the larger half; an 8-classifier takes 8 inputs and outputs the smaller 4 values and the larger 4 values.
Use of classifiers for building sorting networks: an 8-classifier feeds two 4-classifiers, which feed four 2-classifiers.
Problem: Given O(log n)-time and O(n log n)-cost n-classifier designs, what are the delay and cost of the resulting sorting network?
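The pruning that derives an (8, 3)-selector can be automated with a backward dataflow pass: a comparator is kept only if one of its output lines still matters below it. A Python sketch (the hardcoded 19-comparator list is one sequential ordering of Batcher's 8-sorter; the pass does indeed find four removable comparators, as claimed above):

```python
from itertools import permutations

# Batcher's even-odd merge 8-sorter (19 comparators, applied in sequence).
SORTER_8 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2),
            (4, 5), (6, 7), (4, 6), (5, 7), (5, 6),
            (0, 4), (2, 6), (2, 4), (1, 5), (3, 7), (3, 5),
            (1, 2), (3, 4), (5, 6)]

def prune_to_selector(comps, k):
    """Keep only comparators that can influence output lines 0..k-1,
    scanning the network backward."""
    needed, kept = set(range(k)), []
    for i, j in reversed(comps):
        if i in needed or j in needed:
            kept.append((i, j))
            needed.update((i, j))   # both input lines now matter
    kept.reverse()
    return kept

def apply(comps, values):
    v = list(values)
    for i, j in comps:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

selector = prune_to_selector(SORTER_8, 3)
assert len(SORTER_8) - len(selector) == 4    # the four removable comparators
# The pruned network still delivers the 3 smallest, in order, on lines 0-2.
assert all(apply(selector, p)[:3] == [0, 1, 2]
           for p in permutations(range(8)))
```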
8A Search Acceleration Circuits
Much of sorting is done to facilitate/accelerate searching.
• Simple search can be sped up via special circuits
• More complicated searches: range, approximate-match
Topics in This Chapter
8A.1 Systolic Priority Queues
8A.2 Searching and Dictionary Operations
8A.3 Tree-Structured Dictionary Machines
8A.4 Associative Memories
8A.5 Associative Processors
8A.6 VLSI Trade-offs in Search Processors

8A.1 Systolic Priority Queues
Problem: We want to maintain a large list of keys, so that we can add new keys to it (insert operation) and obtain the smallest key (extract operation) whenever desired.
Unsorted list: constant-time insertion / linear-time extraction
Sorted list: linear-time insertion / constant-time extraction
Can both insert and extract operations (priority-queue operations) be performed in constant time, independent of the size of the list?

First Attempt: Via a Linear-Array Sorter
Insertion of new keys and read-out of the smallest key value can be done in constant time, but the "hole" created by an extracted value cannot be filled in constant time.
Fig.
2.9 (the example inserts the keys 5, 2, 8, 6, 3, 7, 9, 1, 4 into the linear array and then reads them out in order).

A Viable Systolic Priority Queue
Operating on every other clock cycle allows the holes left by extracted values to be filled: insertions enter at the left end and ripple right, while extractions take the smallest key from the left end and pull replacement keys leftward.

Systolic Data Structures
Each node holds the smallest (S), median (M), and largest (L) value in its subtree. Each subtree is balanced or has one fewer element on the left (a flag in the root records which). Example: 20 elements, with 3 in the root, 8 in the left subtree, and 9 in the right subtree; an 8-element subtree splits as 3 + 2 + 3. If the root holds S = 5, M = 87, L = 176, the left subtree stores values in [5, 87] and the right subtree values in [87, 176], so inserts (e.g., of 2, 20, 127, or 195) are routed by these ranges, while extract-min, extract-med, and extract-max are all served at the root.
Fig. 8.3 Systolic data structure for minimum, maximum, and median finding, with update/access examples.
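A functional Python model of the linear-array priority queue may help fix the idea. This is a sketch of the data movement only: the hardware performs one neighbor exchange on every other clock cycle, whereas this model lets each ripple run to completion, and the two-keys-per-cell capacity is an assumption of the model.

```python
class SystolicPQ:
    """Functional model of a systolic priority queue on a linear array of
    cells. Cell 0 always holds the smallest keys, so insert and extract-min
    both happen at the left end in constant time per cell visited."""
    CAP = 2                                   # keys held per cell (assumed)

    def __init__(self, ncells):
        self.cells = [[] for _ in range(ncells)]

    def insert(self, key):
        carry = key
        for i, cell in enumerate(self.cells):
            cell.append(carry)
            cell.sort()
            nxt = self.cells[i + 1] if i + 1 < len(self.cells) else []
            if len(cell) <= self.CAP and (not nxt or cell[-1] <= nxt[0]):
                return                        # key has found its resting cell
            carry = cell.pop()                # ship the largest key rightward

    def extract_min(self):
        smallest = self.cells[0].pop(0)
        for left, right in zip(self.cells, self.cells[1:]):
            while len(left) < self.CAP and right:
                left.append(right.pop(0))     # fill the hole from the right
        return smallest

pq = SystolicPQ(8)
for key in [5, 2, 8, 6, 3, 7, 9, 1, 4]:       # the key sequence of Fig. 2.9
    pq.insert(key)
assert [pq.extract_min() for _ in range(9)] == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```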
8A.2 Searching and Dictionary Operations
Parallel (p + 1)-ary search on the PRAM takes log_p+1(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p) steps, for a speedup of about log p. This is optimal: no comparison-based search algorithm can be faster.
Example: n = 26, p = 2 – in each step, processors P0 and P1 probe two of the remaining keys, splitting the candidates into three equal parts; three steps suffice.
A single search in a sorted list can't be significantly sped up through parallel processing, but all hope is not lost:
• Dynamic data (sorting overhead)
• Batch searching (multiple lookups)

Dictionary Operations
Basic dictionary operations on records with keys x0, x1, . . . , x_n–1:
search(y): find the record with key y; return its associated data
insert(y, z): augment the list with a record having key y and data z
delete(y): remove the record with key y; return its associated data
Some or all of the following operations might also be of interest:
findmin / findmax: find the record with the smallest / largest key; return its data
findmed: find the record with the median key; return its data
findbest(y): find the record with key "nearest" to y
findnext(y): find the record whose key is right after y in sorted order
findprev(y): find the record whose key is right before y in sorted order
extractmin / extractmax / extractmed: remove the record(s) with the smallest / largest / median key; return the data
Priority-queue operations: findmin, extractmin (or findmax, extractmax)

8A.3 Tree-Structured Dictionary Machines
Searches are pipelined: successive queries enter at the input root of the "circle" tree and fan out to the leaves x0 through x7, while results are combined in the triangular nodes of the "triangle" tree toward the output root.
search(y): pass the OR of the match signals and the data from the "yes" side.
Fig. 8.1 A tree-structured dictionary machine.
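The (p + 1)-ary search above can be modeled directly. A Python sketch in which p "processors" probe p roughly equally spaced positions per step (the probe-placement details are one reasonable choice, not the only one):

```python
def parallel_search(a, target, p):
    """(p+1)-ary search in sorted list a: each step, p processors probe p
    positions at once, keeping the one segment that can hold the target.
    Returns (index or -1, number of parallel steps taken)."""
    lo, hi = 0, len(a)                  # active range a[lo:hi]
    steps = 0
    while lo < hi:
        steps += 1
        width = hi - lo
        probes = [lo + width * (i + 1) // (p + 1) for i in range(p)]
        for pos in probes:              # all probes happen in one step
            if a[pos] == target:
                return pos, steps
        new_lo, new_hi = lo, hi
        for pos in probes:
            if a[pos] < target:
                new_lo = pos + 1        # target lies right of this probe
            else:
                new_hi = min(new_hi, pos)
        lo, hi = new_lo, new_hi
    return -1, steps

# n = 26, p = 2: every key is found in at most ceil(log3 27) = 3 steps.
a = list(range(26))
assert all(parallel_search(a, t, 2) == (t, parallel_search(a, t, 2)[1])
           and parallel_search(a, t, 2)[1] <= 3 for t in a)
```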
findmin / findmax: pass the smaller / larger of the two keys along with its data.
findmed: not supported here.
findbest(y): pass the larger of the two match-degree indicators along with the associated record.

Insertion and Deletion in the Tree Machine
insert(y, z): counters in the nodes keep track of the vacancies in each subtree, so an insertion can be routed toward a free leaf. Deletion needs a second pass to update the vacancy counters. Redundant insertion (treat as an update?) and redundant deletion (treat as a no-op?) must also be decided.
Implementation: merge the circle and triangle trees by folding.
Figure 8.2 Tree machine storing 5 records and containing 3 free slots.

Physical Realization of a Tree Machine
In folded form, each inner node of the circle tree is co-located with the corresponding node of the triangle tree, and the leaf nodes hold the records.

VLSI Layout of a Tree
H-tree layout (used, e.g., for the clock distribution network in high-performance microchips).

8A.4 Associative Memories
Associative or content-addressable memories (AMs, CAMs): binary (BCAM) vs.
ternary (TCAM).
[Image source: http://www.pagiamtzis.com/cam/camintro.html]
A mismatch in any cell connects the match line (ml) to ground; if all cells in the word match the input pattern, a word match is indicated.

Word Match Circuitry
The match line is precharged and then pulled down by any mismatching cell. Note that each CAM cell is nearly twice as complex as an SRAM cell: more transistors, more wires.

CAM Array Operation
[Image source: http://www.pagiamtzis.com/cam/camintro.html]

Current CAM Applications
Packet forwarding: routing tables specify the path to be taken by matching an incoming destination address with stored address prefixes; prefixes must be stored in order of decreasing length (difficult updating).
Packet classification: determine the packet's category based on information in multiple fields; different classes of packets may be treated differently.
Associative caches / TLBs: main processor caches are usually not fully associative (too large), but smaller specialized caches and TLBs benefit from full associativity.
Data compression: frequently used substrings are identified and replaced by short codes; substring matching is accelerated by CAM.
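A software model of TCAM-based packet forwarding shows why storing prefixes in decreasing length order makes the first match the longest match. Illustrative only: the 8-bit addresses, table entries, and port names below are made up.

```python
# Each entry is (value, mask, next_hop); a 0 bit in the mask means
# "don't care", as in a ternary CAM cell.  Entries appear in order of
# decreasing prefix length, so the FIRST matching word -- the one a
# priority encoder would report -- is the longest matching prefix.
TABLE = [
    (0b10110000, 0b11110000, "port A"),   # 4-bit prefix 1011
    (0b10100000, 0b11100000, "port B"),   # 3-bit prefix 101
    (0b10000000, 0b10000000, "port C"),   # 1-bit prefix 1
]

def lookup(addr):
    # In hardware, all words are compared in parallel; a mismatch in any
    # cared-about bit pulls that word's match line down.
    for value, mask, hop in TABLE:
        if (addr ^ value) & mask == 0:
            return hop
    return "default"

assert lookup(0b10111010) == "port A"   # matches 1011 (and the shorter prefixes)
assert lookup(0b10100001) == "port B"
assert lookup(0b11000000) == "port C"
assert lookup(0b01010101) == "default"
```

This also illustrates the updating difficulty noted above: inserting a new prefix may require shifting entries to preserve the length ordering.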
History of Associative Processing
Associative memory: parallel masked search of all words, using a comparand, a mask, and a memory array with comparison logic; also bit-serial implementations with RAM. Associative processor: add more processing logic to the PEs.
Table 4.1 Entering the second half-century of associative processing
Decade   Events and Advances                   Technology           Performance
1940s    Formulation of need & concept         Relays
1950s    Emergence of cell technologies        Magnetic, cryogenic  Mega-bit-OPS
1960s    Introduction of basic architectures   Transistors
1970s    Commercialization & applications      ICs                  Giga-bit-OPS
1980s    Focus on system/software issues       VLSI                 Tera-bit-OPS
1990s    Scalable & flexible architectures     ULSI, WSI            Peta-bit-OPS

8A.5 Associative Processors
Associative or content-addressable memories/processors constituted early forms of SIMD parallel processing. A control unit broadcasts the comparand, mask, data, and commands to cells 0 through m–1; each cell sets a tag bit (t0 . . . t_m–1) in the response store, and a global tag operations unit combines the responses.
Fig. 23.1 Functional view of an associative memory/processor.

Search Functions in Associative Devices
Exact match: locating data based on partial knowledge of contents
Inexact match: finding numerically or logically proximate values
Membership: identifying all members of a specified set
Relational: determining values that are less than, less than or equal, etc.
Interval: marking items that are between or outside given limits
Extrema: finding the maximum, minimum, next higher, or next lower
Rank-based: selecting the kth or the k largest/smallest elements
Ordered retrieval: repeated max- or min-finding with elimination (sorting)

Classification of Associative Devices
Classified by the handling of words and of bits within words, each either parallel or serial:
WPBP: fully parallel    WPBS: bit-serial    WSBP: word-serial    WSBS: fully serial

WSBP: Word-Serial Associative Devices
Strictly speaking, this is not a parallel processor, but with superhigh-speed shift registers and deeply pipelined processing logic (one word at a time circulating from the shift registers through the logic and back), it behaves like one.

WPBS: Bit-Serial Associative Devices
One bit of every word is processed in one device cycle.
Advantages:
1. Can be implemented with conventional memory
2. Easy to add other capabilities beyond search
Example: adding field A to field B in every word, storing the sum in field S:
Loop:  read the next bit slice of A
       read the next bit slice of B (the carry from the previous slice is in PE flag C)
       find the sum bits; store them in the next bit slice of S
       find the new carries; store them in PE flag C
Endloop

Goodyear STARAN Associative Processor
The first computer based on associative memory (1972), with 256 PEs; aimed at air traffic control applications. Aircraft conflict detection is an O(n²) operation; an AM can do it in O(n) time.

Flip Network Permutations in the Goodyear STARAN
The 256 bits in a bit slice could be routed to the 256 PEs in different arrangements (permutations). (Figures in this slide are from J. Potter, "The STARAN Architecture and Its Applications …," 1978 NCC.)
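The WPBS bit-slice addition loop above can be sketched in Python. The field width and the operand values are made up for the example; in hardware, the inner per-PE loop is a single parallel step.

```python
WIDTH = 8                              # assumed field width
A = [23, 100, 255, 0]                  # field A of each word
B = [42, 55, 1, 77]                    # field B of each word

S = [0] * len(A)                       # field S, built one bit slice at a time
C = [0] * len(A)                       # per-PE carry flag
for bit in range(WIDTH):               # one device cycle per bit slice
    a_slice = [(a >> bit) & 1 for a in A]
    b_slice = [(b >> bit) & 1 for b in B]
    for pe in range(len(A)):           # all PEs operate simultaneously in hardware
        s = a_slice[pe] ^ b_slice[pe] ^ C[pe]                # sum bit
        C[pe] = (a_slice[pe] & b_slice[pe]) | \
                (C[pe] & (a_slice[pe] ^ b_slice[pe]))        # new carry
        S[pe] |= s << bit              # store into the next bit slice of S

assert S == [(a + b) % 2**WIDTH for a, b in zip(A, B)]
```

Note that the number of device cycles depends only on the field width, not on the number of words: that is the point of bit-serial, word-parallel operation.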
Distributed Array Processor (DAP)
Each bit-serial PE communicates with its four neighbors (N, E, W, S) and with the control unit; multiplexers select the inputs to a full adder, whose sum and carry feed the Q and C registers, the local memory, and the row/column response lines.
Fig. 23.6 The bit-serial processor of DAP.

DAP's High-Level Structure
A master control unit with program memory drives the processor array. Register planes (Q, C, A, D) sit above the array memory (at least 32K planes; one plane of memory holds one bit of every processor's local memory), with fast I/O and a host interface unit connecting to a host workstation.
Fig. 23.7 The high-level architecture of the DAP system.

8A.6 VLSI Trade-offs in Search Processors
This section has not been written yet.
References:
[Parh90] B. Parhami, "Massively Parallel Search Processors: History and Modern Trends," Proc. 4th Int'l Parallel Processing Symp., pp. 91-104, 1990.
[Parh91] B. Parhami, "Scalable Architectures for VLSI-Based Associative Memories," in Parallel Architectures, ed. by N. Rishe, S. Navathe, and D. Tal, IEEE Computer Society Press, 1991, pp. 181-200.
8B Arithmetic and Counting Circuits
Many parallel processing techniques originate from, or find applications in, designing high-speed arithmetic circuits.
• Counting, addition/subtraction, multiplication, division
• Limits on performance and various VLSI trade-offs
Topics in This Chapter
8B.1 Basic Addition and Counting
8B.2 Circuits for Parallel Counting
8B.3 Addition as a Prefix Computation
8B.4 Parallel Prefix Networks
8B.5 Multiplication and Squaring Circuits
8B.6 Division and Square-Rooting Circuits

8B.1 Basic Addition and Counting
Fig. 5.3 (in Computer Arithmetic) Using full adders in building bit-serial and ripple-carry adders: (a) a bit-serial adder, in which one full adder plus a carry flip-flop processes one bit pair (xi, yi) per clock; (b) a ripple-carry adder, in which k full adders chain the carries from c0 = cin up to ck = cout.
Ideal cost: O(k). Ideal latency: O(log k). Can these be achieved simultaneously?

Constant-Time Counters
Any fast adder design can be specialized and optimized to yield a fast counter (carry-lookahead, carry-skip, etc.). One can use a redundant representation to build a constant-time counter, but a conversion penalty must be paid during read-out. The count register is divided into three stages, with incrementers and control logic between the stages. Counting is fundamentally simpler than addition.
Fig. 5.12 (in Computer Arithmetic) Fast (constant-time) three-stage up counter.

8B.2 Circuits for Parallel Counting
A 1-bit full adder is a (3; 2)-counter. A circuit reducing 7 bits to their 3-bit sum is a (7; 3)-counter: a layer of full adders followed by a small ripple-carry adder of full and half adders. In general, a circuit reducing n bits to their ⌈log2(n + 1)⌉-bit sum is an (n; ⌈log2(n + 1)⌉)-counter.
Fig. 8.16 (in Computer Arithmetic) A 10-input parallel counter, also known as a (10; 4)-counter.
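A (7; 3)-counter assembled from four (3; 2)-counters, as a Python sketch (one possible arrangement consistent with the description above), verified exhaustively over all 2^7 inputs:

```python
from itertools import product

def full_adder(a, b, c):
    """A (3; 2)-counter: reduces three weight-w bits to one weight-w bit
    and one weight-2w bit."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry

def count7(bits):
    """(7; 3)-counter: four full adders reduce 7 bits to their 3-bit sum."""
    b = list(bits)
    s0, c0 = full_adder(b[0], b[1], b[2])   # weight 1 -> weights (1, 2)
    s1, c1 = full_adder(b[3], b[4], b[5])   # weight 1 -> weights (1, 2)
    s2, c2 = full_adder(s0, s1, b[6])       # weight 1 -> weights (1, 2)
    s3, c3 = full_adder(c0, c1, c2)         # weight 2 -> weights (2, 4)
    return c3 * 4 + s3 * 2 + s2             # the 3-bit count

assert all(count7(bits) == sum(bits) for bits in product([0, 1], repeat=7))
```

Each full adder conserves the total (s + 2·carry equals the sum of its three inputs), which is why the final weighted combination equals the population count.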
Accumulative Parallel Counters
Given n increment signals vi, with 2^(q–1) < n ≤ 2^q: a parallel counter built of full adders produces a q-bit tally of up to 2^q – 1 of the increment signals, and a parallel incrementer adds this tally to the q-bit initial count x held in the count register, yielding the final count y = x + Σvi. This is the true generalization of sequential counters.
Latency: ignore it, or use O(log n) for the decision. Cost: O(n).
Possible application: comparing the Hamming weight of a vector to a constant.

8B.3 Addition as a Prefix Computation
Example: prefix sums – from inputs x0, x1, x2, . . . , xi, compute s0 = x0, s1 = x0 + x1, s2 = x0 + x1 + x2, . . . , si = x0 + x1 + . . . + xi.
Sequential time with one processor is O(n); simple pipelining of a single function unit does not help.
Fig. 8.4 Prefix computation using a latched or pipelined function unit.

Improving the Performance with Pipelining
With a four-stage pipelined function unit and suitably placed delays, one result emerges per cycle. Ignoring pipelining overhead, it appears that we have achieved a speedup of 4 with 3 "processors." Can you explain this anomaly? (Problem 8.6a)
Fig. 8.5 High-throughput prefix computation using a pipelined function unit.

Carry Determination as a Prefix Computation
With gi = xi yi (generate) and pi = xi ⊕ yi (propagate), the carry behavior of position i is:
(gi, pi) = (0, 0): the incoming carry is annihilated or killed
(gi, pi) = (0, 1): the incoming carry is propagated
(gi, pi) = (1, 0): a carry is generated
((gi, pi) = (1, 1) is impossible)
Combining a lower block with signals (g′, p′) and the upper block above it with signals (g″, p″) gives the block signals (g″ + g′p″, p′p″); this operator is associative, so computing all the carries is a parallel prefix computation.
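A Python sketch of carry determination as a prefix computation. The scan is written serially here, but because the operator is associative, any parallel prefix network computes the same carries in O(log k) levels:

```python
def carry_op(lower, upper):
    """The associative carry operator: combine a lower block (g1, p1)
    with the upper block (g2, p2) stacked above it."""
    g1, p1 = lower
    g2, p2 = upper
    return g2 | (p2 & g1), p2 & p1

def add_via_prefix(x, y, k, c0=0):
    """k-bit addition with all carries obtained by a prefix scan over (g, p)."""
    gp = [((x >> i) & (y >> i) & 1, ((x >> i) ^ (y >> i)) & 1)
          for i in range(k)]
    acc, carries = (c0, 0), []      # seed: g(-1) = c0, p(-1) = 0
    for pair in gp:
        acc = carry_op(acc, pair)   # prefix of the first i+1 positions
        carries.append(acc[0])      # its generate signal IS c_(i+1)
    s = 0
    for i in range(k):
        c_i = c0 if i == 0 else carries[i - 1]
        s |= (gp[i][1] ^ c_i) << i  # s_i = p_i xor c_i
    return s

assert add_via_prefix(23, 42, 8) == 65
```

Seeding the scan with (c0, 0) folds the carry-in into the same mechanism, matching the g–1 = c0, p–1 = 0 convention.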
Fig. 5.15 (ripple-carry network) superimposed on Fig. 5.14 (generic adder), both in Computer Arithmetic: a carry network takes the (gi, pi) signals and c0 and produces all the carries; the sum bits are then si = pi ⊕ ci. (Setting g–1 = c0 and p–1 = 0 folds the carry-in into the prefix computation.)

8B.4 Parallel Prefix Networks
Build a prefix sum network from one n/2-input network and n – 1 adders:
T(n) = T(n/2) + 2 = 2 log2 n – 1
C(n) = C(n/2) + n – 1 = 2n – 2 – log2 n
This is the Brent-Kung parallel prefix network (its delay is actually 2 log2 n – 2).
Fig. 8.6 Prefix sum network built of one n/2-input network and n – 1 adders.

Example of Brent-Kung Parallel Prefix Network
Originally developed by Brent and Kung as part of a VLSI-friendly carry-lookahead adder.
T(n) = 2 log2 n – 2    C(n) = 2n – 2 – log2 n
Fig. 8.8 Brent-Kung parallel prefix graph for n = 16.

Another Divide-and-Conquer Design
Ladner-Fischer construction, from two n/2-input networks and n/2 adders:
T(n) = T(n/2) + 1 = log2 n
C(n) = 2C(n/2) + n/2 = (n/2) log2 n
The simple Ladner-Fischer parallel prefix network's delay is optimal, but it has fan-out issues if implemented directly.
Fig. 8.7 Prefix sum network built of two n/2-input networks and n/2 adders.

Example of Kogge-Stone Parallel Prefix Network
T(n) = log2 n
C(n) = (n – 1) + (n – 2) + (n – 4) + . . . + n/2 = n log2 n – n + 1
Optimal in delay, but too complex in number of cells and wiring pattern.
Fig. 8.9 Kogge-Stone parallel prefix graph for n = 16.
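A Python sketch of the Kogge-Stone recurrence that also counts cells, confirming C(16) = 16 · 4 – 16 + 1 = 49 (the cell count quoted on the comparison slide):

```python
import operator

def kogge_stone_scan(xs, op=operator.add):
    """Kogge-Stone parallel prefix: log2(n) levels; at the level with span d,
    every position i >= d combines position i-d into position i.  All
    combinations within one level are independent (one time step in hardware)."""
    v = list(xs)
    n, cells, d = len(v), 0, 1
    while d < n:
        v = [v[i] if i < d else op(v[i - d], v[i]) for i in range(n)]
        cells += n - d                  # one cell per combination at this level
        d *= 2
    return v, cells

s, cells = kogge_stone_scan(list(range(1, 17)))
assert s == [i * (i + 1) // 2 for i in range(1, 17)]   # prefix sums of 1..16
assert cells == 16 * 4 - 16 + 1                        # C(n) = n log2 n - n + 1
```

The same scan with the carry operator of the previous slide turns this directly into a Kogge-Stone carry network.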
Comparison and Hybrid Parallel Prefix Networks
For n = 16: Brent/Kung – 6 levels, 26 cells; Kogge/Stone – 4 levels, 49 cells; the Han/Carlson hybrid – 5 levels, 32 cells, with Brent-Kung stages at the top and bottom and Kogge-Stone stages in the middle.
Fig. 8.10 A hybrid Brent-Kung / Kogge-Stone parallel prefix graph for n = 16.

Linear-Cost, Optimal Ladner-Fischer Networks
Define a type-x parallel prefix network as one that:
• produces the leftmost output in optimal log2 n time;
• yields all other outputs with at most x additional delay.
Note that even the Brent-Kung network produces the leftmost output in optimal time. We are interested in building a type-0 overall network, but can use type-x networks (x > 0) as component parts: the fastest possible (type-0) parallel prefix network for n inputs is built recursively from a type-0 network and a type-1 network for n/2 inputs, plus a row of adders.

Examples of Type-0, 1, 2 Parallel Prefix Networks
Brent/Kung: a 16-input type-2 network. Kogge/Stone: a 16-input type-0 network. Han/Carlson: a 16-input type-1 network.
8.10 A hybrid Brent–Kung / Kogge–Stone parallel prefix graph for n = 16.
Han/Carlson: 16-input type-1 network

8B.5 Multiplication and Squaring Circuits
Notation for our discussion of multiplication algorithms:
a Multiplicand ak–1ak–2 . . . a1a0
x Multiplier xk–1xk–2 . . . x1x0
p Product (a × x) p2k–1p2k–2 . . . p3 p2 p1 p0
Initially, we assume unsigned operands.
Partial products bit-matrix: p = x0 a 2^0 + x1 a 2^1 + x2 a 2^2 + x3 a 2^3
Sequential: O(k) circuit complexity, O(k) time with carry-save additions
Parallel: O(k^2) circuit complexity, O(log k) time
Fig. 9.1 (in Computer Arithmetic) Multiplication of 4-bit binary numbers.

Divide-and-Conquer (Recursive) Multipliers
Building a wide multiplier from narrower ones: a = 2^b aH + aL and x = 2^b xH + xL, with b-bit halves.
Rearranged partial products in 2b-by-2b multiplication: aL xL, aL xH, aH xL, and aH xH, shifted and added to form the product p.
Fig. 12.1 (in Computer Arithmetic) Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers.
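The recursive construction just described can be sketched in software (a behavioral model, not a circuit); the four half-width recursive calls correspond to the four b × b multipliers, matching C(k) = 4C(k/2) + O(k). The base-case width of 8 bits is an arbitrary choice for illustration.

```python
def mul_dc(a, x, k):
    """Divide-and-conquer multiplication of two k-bit unsigned integers
    (k a power of 2), mirroring the 2b-by-2b-from-b-by-b construction."""
    if k <= 8:                      # base case: use a small direct multiplier
        return a * x
    b = k // 2
    aH, aL = a >> b, a & ((1 << b) - 1)   # a = 2^b * aH + aL
    xH, xL = x >> b, x & ((1 << b) - 1)   # x = 2^b * xH + xL
    # Four b-by-b products, shifted and added:
    return ((mul_dc(aH, xH, b) << (2 * b))
            + ((mul_dc(aH, xL, b) + mul_dc(aL, xH, b)) << b)
            + mul_dc(aL, xL, b))

print(mul_dc(0xBEEF, 0xCAFE, 16) == 0xBEEF * 0xCAFE)  # → True
```

Replacing the three recursive calls for aH xL, aL xH, and aL xL by the Karatsuba combination (aH + aL)(xH + xL) – aH xH – aL xL would reduce the recursion to three multiplications.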
C(k) = 4C(k/2) + O(k) = O(k^2)
T(k) = T(k/2) + O(log k) = O(log^2 k)

Karatsuba Multiplication
2b × 2b multiplication requires four b × b multiplications:
(2^b aH + aL) × (2^b xH + xL) = 2^{2b} aH xH + 2^b (aH xL + aL xH) + aL xL
Karatsuba noted that one of the four multiplications can be removed at the expense of introducing a few additions:
(2^b aH + aL) × (2^b xH + xL) = 2^{2b} aH xH + 2^b [(aH + aL)(xH + xL) – aH xH – aL xL] + aL xL
Three multiplications suffice: Mult 1 = aH xH, Mult 2 = (aH + aL)(xH + xL), Mult 3 = aL xL.
The benefit is quite significant for extremely wide operands.
C(k) = 3C(k/2) + O(k) = O(k^1.585), where 1.585 = log2 3
T(k) = T(k/2) + O(log k) = O(log^2 k)

Divide-and-Conquer Squarers
Building wide squarers from narrower ones: with x = 2^b xH + xL,
x^2 = 2^{2b} xH^2 + 2^{b+1} xH xL + xL^2
Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b squarer from b × b squarers and a b × b multiplier.

VLSI Complexity Issues and Bounds
Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints:
AT grows at least as fast as k^{3/2}
AT^2 is at least proportional to k^2
Array multipliers: O(k^2) gate count and area, O(k) time → AT = O(k^3), AT^2 = O(k^4)
Simple recursive multipliers: O(k^2) gate count, O(log^2 k) time → AT = O(k^2 log^2 k)? AT^2 = O(k^2 log^4 k)?
Karatsuba multipliers: O(k^1.585) gate count, O(log^2 k) time → AT = O(k^1.585 log^2 k)? AT^2 = O(k^1.585 log^4 k)???
The discrepancy is due to the fact that interconnect area is not taken into account in our previous analyses.

Theoretically Best Multipliers
Schonhage and Strassen (via FFT); best result until 2007:
O(log k) time
O(k log k log log k) complexity
In 2007, M.
Furer managed to replace the log log k term with an asymptotically smaller term.
It is an open problem whether there exist logarithmic-delay multipliers with linear cost (it is widely believed that there are not).
In the absence of a linear-cost multiplication circuit, multiplication must be viewed as a more difficult problem than addition.

8B.6 Division and Square-Rooting Circuits
Division via Newton's method: O(log k) multiplications.
Using Schonhage and Strassen's FFT-based multiplication, this leads to:
O(log^2 k) time
O(k log k log log k) complexity

Theoretically Best Dividers
Best known bounds; they cannot (yet) be achieved at the same time:
O(log k) time
O(k log k log log k) complexity
In 1966, S. A. Cook established these simultaneous bounds: O(log^2 k) time, O(k log k log log k) complexity.
In 1983, J. H. Reif reduced the time complexity to the current best: O(log k (log log k)^2) time.
In 1984, Beame/Cook/Hoover established these simultaneous bounds: O(log k) time, O(k^4) complexity.
Given our current state of knowledge, division must be viewed as a more difficult problem than multiplication.

Implications for Ultrawide High-Radix Arithmetic
Arithmetic results with k-bit binary operands hold with no change when the k bits are processed as g radix-2^h digits (gh = k), i.e., when the k bits are divided into g groups of h bits each.

Another Circuit Model: Artificial Neural Nets
Artificial neuron: inputs, weights, threshold, activation function; trained by supervised learning.
Networks are characterized by connection topology and learning method.
Feedforward network: three layers (input, hidden, output); no feedback.
Recurrent network: simple version due to Elman; feedback from hidden nodes to special nodes at the input layer.
Hopfield network: all connections are bidirectional.
Diagrams from http://www.learnartificialneuralnetworks.com/

8C Fourier Transform Circuits
The Fourier transform is quite important, and it also serves as a template for other types of arithmetic-intensive computations.
• FFT; properties that allow efficient implementation
• General methods of mapping flow graphs to hardware
Topics in This Chapter
8C.1 The Discrete Fourier Transform
8C.2 Fast Fourier Transform (FFT)
8C.3 The Butterfly FFT Network
8C.4 Mapping of Flow Graphs to Hardware
8C.5 The Shuffle-Exchange Network
8C.6 Other Mappings of the FFT Flow Graph

8C.1 The Discrete Fourier Transform
An n-point DFT maps x in the time domain to y in the frequency domain; the inverse DFT maps back.
Some operations are easier in the frequency domain; hence the need for the transform.
Other important transforms for discrete signals:
z-transform (generalized form of the Fourier transform)
Discrete cosine transform (used in JPEG image compression)
Haar transform (a wavelet transform which, like the DFT, has a fast version)

Defining the DFT and Inverse DFT
The DFT yields output sequence yi based on input sequence xi (0 ≤ i < n):
yi = ∑j=0 to n–1 wn^{ij} xj [O(n^2)-time naive algorithm]
where wn is the nth primitive root of unity; wn^n = 1, wn^j ≠ 1 (1 ≤ j < n).
Examples: w4 = i, w3 = (–1 + i√3)/2, w8 = √2 (1 + i)/2
The inverse DFT is almost exactly the same computation:
xi = (1/n) ∑j=0 to n–1 wn^{–ij} yj
Input seq. xi (0 ≤ i < n) is said to be in the time domain; output seq.
yi (0 ≤ i < n) is the input's frequency-domain characterization.

DFT of a Cosine Waveform
DFT of a cosine with a frequency 1/10 the sampling frequency fs.

DFT of a Cosine with Varying Resolutions
DFT of the same cosine (frequency fs/10), computed at varying resolutions.

DFT as Vector-Matrix Multiplication
The DFT and inverse DFT are computable via matrix-by-vector multiplication:
yi = ∑j=0 to n–1 wn^{ij} xj, i.e., Y = W X, where W is the n × n DFT matrix with entries wn^{ij}.

Application of DFT to Smoothing or Filtering
Input signal with noise → DFT → low-pass filter → inverse DFT → recovered smooth signal.

DFT Application Example
Signal corrupted by 0-mean random noise; the FFT shows strong frequency components at 50 and 120 Hz.
The uncorrupted signal was: x = 0.7 sin(2π 50t) + sin(2π 120t)
Source of images: http://www.mathworks.com/help/techdoc/ref/fft.html

Application of DFT to Spectral Analysis
Tone frequency assignments for touch-tone dialing:
Row frequencies: 697 Hz (1 2 3 A), 770 Hz (4 5 6 B), 852 Hz (7 8 9 C), 941 Hz (* 0 # D)
Column frequencies: 1209 Hz, 1336 Hz, 1477 Hz, 1633 Hz
The received tone is identified from its frequency spectrum, obtained via the DFT.

8C.2 Fast Fourier Transform
The DFT yields output sequence yi based on input sequence xi (0 ≤ i < n): yi = ∑j=0 to n–1 wn^{ij} xj
Fast Fourier Transform (FFT): the Cooley-Tukey algorithm, an O(n log n)-time DFT algorithm that derives y from the half-length sequences u and v, the DFTs of the even- and odd-indexed inputs, respectively:
yi = ui + wn^i vi (0 ≤ i < n/2)
yi+n/2 = ui + wn^{i+n/2} vi = ui – wn^i vi
This pair of operations is known as a butterfly operation.
Sequentially: T(n) = 2T(n/2) + n = n log2 n
In parallel: T(n) = T(n/2) + 1 = log2 n
(Image from Wikipedia)
More General Factoring-Based Algorithm
(Image from Wikipedia)

8C.3 The Butterfly FFT Network
u: DFT of even-indexed inputs; v: DFT of odd-indexed inputs
yi = ui + wn^i vi (0 ≤ i < n/2)
yi+n/2 = ui + wn^{i+n/2} vi
Fig. 8.11 Butterfly network for an 8-point FFT.

Butterfly Processor
Performs a pair of multiply-add operations, where the multiplication is by a constant.
The design can be optimized by merging the adder and subtractor, as they receive the same inputs.

Computation Scheme for 16-Point FFT
Inputs are applied in bit-reversal permutation order; each butterfly operation takes a pair (a, b) and produces a + b w^j and a – b w^j for the appropriate twiddle-factor exponent j.

8C.4 Mapping of Flow Graphs to Hardware
Given a computation flow graph (here, one combining additions, multiplications, a division, and a square root to compute an output z from inputs a, b, c, d, e, f), it can be mapped to hardware.
Direct one-to-one mapping (possibly with pipelining): latch positions define a four-stage pipeline; the pipelining period is set by the slowest stage, and the latency determines when the output becomes available.
Fig. 25.6 of Parhami's textbook on computer arithmetic.
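The butterfly recursion of this chapter (yi = ui + wn^i vi, yi+n/2 = ui – wn^i vi) can be modeled in software as a quick functional check, using the slides' convention wn = e^{2πi/n} (so that w4 = i):

```python
import cmath

def fft(x):
    """Recursive Cooley-Tukey FFT; len(x) must be a power of 2.
    Computes y_i = u_i + w_n^i v_i and y_{i+n/2} = u_i - w_n^i v_i,
    where u, v are the DFTs of the even- and odd-indexed inputs."""
    n = len(x)
    if n == 1:
        return list(x)
    u = fft(x[0::2])                 # DFT of even-indexed inputs
    v = fft(x[1::2])                 # DFT of odd-indexed inputs
    y = [0] * n
    for i in range(n // 2):
        w = cmath.exp(2j * cmath.pi * i / n)   # w_n^i
        y[i] = u[i] + w * v[i]                 # butterfly: sum output
        y[i + n // 2] = u[i] - w * v[i]        # butterfly: difference output
    return y
```

The two recursive calls plus the length-n combining loop give the T(n) = 2T(n/2) + n = n log2 n sequential operation count from the slide; a circuit evaluates each combining level in one step, giving log2 n levels.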
Ad-hoc Scheduling on a Given Set of Resources
Given a computation flow graph, it can also be scheduled on a given set of hardware resources.
Assume: tadd = 1, tmult = 3, tdiv = 8, tsqrt = 10
In the example schedule, the additions complete by time 1, the multiplications by time 6, the division by time 14, and the square root at time 24, when the output becomes available (latency 24).

Mapping through Projection
Given a flow graph, it can be projected in various directions to obtain corresponding hardware realizations.
Multiple nodes of a flow graph may map onto a single hardware node.
That one hardware node then performs the computations associated with the flow graph nodes one by one, according to some timing arrangement (schedule).
Projecting the 8-point FFT flow graph along the projection direction yields a linear array, with each cell acting for one butterfly-network row.

8C.5 The Shuffle-Exchange Network
(Figure: two drawings of the 8-point FFT flow graph, highlighting the shuffle-exchange interconnection pattern.)

Variants of the Butterfly Architecture
Fig. 8.12 FFT network variant and its shared-hardware realization.

8C.6 Other Mappings of the FFT Flow Graph
This section is incomplete at this time.

More Economical FFT Hardware
Fig. 8.13 Linear array of log2 n cells for n-point FFT computation.
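The bit-reversal permutation that orders the inputs of these FFT networks can be computed as follows (a software sketch); for n = 8 it yields the x0, x4, x2, x6, x1, x5, x3, x7 input order seen in Fig. 8.11:

```python
def bit_reverse_indices(n):
    """Index order after the bit-reversal permutation used to feed
    an n-point FFT network (n a power of 2): reverse the log2(n)-bit
    binary representation of each index."""
    bits = n.bit_length() - 1                 # log2(n)
    return [int(format(i, f'0{bits}b')[::-1], 2) for i in range(n)]

print(bit_reverse_indices(8))  # → [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice returns the identity, which is why the same wiring can be used at either end of the network.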
(Cell detail: a butterfly processor computing a + b and a – b, with a twiddle-factor (wn^i) multiplier and control multiplexers selecting between the external input x and fed-back values.)

Space-Time Diagram for the Feedback FFT Array
(Figure: feedback butterfly processor and its space-time diagram.)
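For reference, the defining O(n^2) DFT and inverse DFT of Section 8C.1 can be modeled directly (a behavioral sketch, not a circuit); a round-trip through both recovers the input, illustrating that the inverse differs only in the sign of the exponent and the 1/n scaling:

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT per the definition y_i = sum_j w_n^{ij} x_j,
    with w_n = e^{2*pi*i/n} as in Section 8C.1."""
    n = len(x)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (i * j) * x[j] for j in range(n)) for i in range(n)]

def idft(y):
    """Inverse DFT: x_i = (1/n) sum_j w_n^{-ij} y_j."""
    n = len(y)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (i * j) * y[j] for j in range(n)) / n for i in range(n)]
```

This naive form is useful as a correctness oracle when checking an FFT implementation or a hardware mapping against the definition.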