Rethinking Parallel Execution

Guri Sohi (with Matthew Allen, Srinath Sridharan, and Gagan Gupta)
University of Wisconsin-Madison
Mason Wells, April 27, 2010

Outline
• From sequential to multicore
• Reminiscing: instruction-level parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic serialization
• Consequences of dynamic serialization
• Wrap up

Microprocessor Generations
• Generation 1: Sequential (1970s)
• Generation 2: Pipelined (1980s)
• Generation 3: Instruction-level parallel (ILP) (1990s)
• Generation 4: Multiple processing cores (2000s)

From One Generation to the Next
• Significant debate and research
  – New solutions proposed
  – Old solutions adapt in interesting ways to become viable, or even better than the new solutions
• Solutions that make their changes "under the hood" end up winning over the others

From One Generation to the Next
• From sequential to pipelined
  – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)
  – CISC architectures learned from and employed RISC innovations
• From pipelined to instruction-level parallel
  – Statically scheduled VLIW/EPIC
  – Dynamically scheduled superscalar

From One Generation to the Next
• From ILP to multicore
  – Parallelism based on the canonical parallel execution model
  – Overcoming constraints to canonical parallelization:
    • Thread-level speculation (TLS)
    • Transactional memory (TM)

Reminiscing about ILP
• Late 1980s to mid-1990s
• Search for the "post-RISC" architecture
  – More accurately, instruction processing model
• Desire to execute more than one instruction per cycle, i.e., exploit ILP
• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar

VLIW/EPIC School
• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by the compiler)
• Parallel execution expressed in the static program
• Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism

VLIW/EPIC School
• Creating effective parallel representations (statically) introduces several problems:
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
• Lots of research addressing these problems
• Intel and HP pushed it as their future (Itanium)

OOO Superscalar
• Create dynamic parallel execution from a sequential static representation
  – Dynamic dependence information is accurate
  – Execution schedule is flexible
• None of the problems associated with trying to create a parallel representation statically
• Natural growth path with no demands on software

Lessons from the ILP Generation
• Trying to statically detect and express parallelism has significant consequences
• Techniques that make "under the hood" changes are the winners
  – Even though they may have some drawbacks and overheads

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• The solution is critical to the long-term health of the computer and information technology industry
  – And thus to the economy and society as we know it

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Over four decades of conventional wisdom in parallel processing
  – Mostly from the scientific application/HPC arena
  – Use this as the basis
• Parallel Execution Requires a Parallel Representation

Canonical Parallel Execution Model
A: Analyze the program to identify independence
  – Independent portions are executed in parallel
B: Create a static representation of that independence
  – Synchronization to satisfy the independence assumption
C: Dynamic parallel execution unwinds as per the static representation
  – Potential consequences due to the static assumptions

Canonical Parallel Execution Model
• Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research:
  – Identifying independence
  – Creating the static representation
  – Dynamic unwinding

Identifying Independence
• Static program analysis
  – Over four decades of work
• Hard to identify statically
  – Inherently dynamic properties
  – Must be conservative statically
• Need to identify dependence in order to identify independence

Creating the Static Representation
• Parallel representation for guaranteed-independent work
• Insert synchronization for potential dependences
  – Conservative synchronization moves parallel execution towards sequential execution

Dynamic Unwinding
• Non-determinism
  – Changes to program state may not be repeatable
• Race conditions
• Several startup companies have formed to deal with this problem

Conventional Wisdom
• Parallel Execution Requires a Parallel Representation
• Consequences:
  – Must create a parallel representation
  – For correct execution, must statically identify:
    • Independence, for the parallel representation
    • Dependence, for synchronization
  – Source of enormous difficulty and complexity
    • Generally functions of the program's input
    • Inherently dynamic properties

Current Approaches
• Stick with the canonical model and try to overcome its limitations
• Thread-level speculation (TLS) and transactional memory (TM)
• Techniques that let the programmer program sequentially but automatically generate a parallel representation
• Techniques to handle non-determinism and race conditions

TLS and TM
• Overcome a major constraint to creating a static parallel representation
• Likely in several upcoming microprocessors
  – Our work from the mid-1990s is a key enabler
• Already in Sun MAJC, NEC Merlot, Sun Rock

Static Program Representation Issues

    Issue                Sequential    Parallel
    Bugs                 Yes           Yes (more)
    Data races           No            Yes
    Locks/synch          No            Yes
    Deadlock             No            Yes
    Nondeterminism       No            Yes
    Parallel execution   ?             Yes

• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
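To make the table's hazards concrete, here is a minimal sketch (my illustration, not from the talk; Account, worker, and the chunk split are all hypothetical) of the canonical model's bind: since the static representation must synchronize every potential dependence, one conservative lock serializes even independent account updates, while removing it creates a data race.

    // A "parallel" loop under conservative static synchronization.
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Account { long balance = 0; };

    int main() {
        std::vector<Account> accounts(4);
        std::mutex m;  // conservative: guards ALL accounts, not just the one touched

        auto worker = [&](int begin, int end) {
            for (int i = begin; i < end; ++i) {
                // Serializes even updates to different accounts; dropping
                // the lock instead produces a data race on balance.
                std::lock_guard<std::mutex> g(m);
                accounts[i % 4].balance += 100;
            }
        };

        std::thread t0(worker, 0, 500), t1(worker, 500, 1000);
        t0.join();
        t1.join();
    }

Serialization sets, introduced next, sidestep this choice by deriving the partition from the data each computation actually touches at run time.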
Serialization Sets: What?
• Sequential program representation with dynamic parallel execution
  – No static representation of independence
  – No locks and no explicit synchronization
• An "under the hood" runtime system dynamically determines and orders dependent computations
  – Independence, and thus parallelism, falls out as a side effect
• Comparable or better performance than conventional parallel models

How? Big Picture
• Write the program in a well-structured object-oriented style
  – A method operates on the data of its associated object (ver. 1)
• Identify parts of the program for potential parallel execution
  – Make suitable annotations as needed
• Dynamically determine the data object touched by the selected code
  – Identify dependence
• The program thread assigns the selected code to bins

How? Big Picture
• Serialize computations to the same object
  – Enforce dependence
  – Assign them to the same bin; a delegate thread executes the computations in a bin sequentially
• Do not look for or represent independence
  – It falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
• Updates to a given piece of state occur in the same order as in the sequential program
  – Determinism
  – No races
  – If the sequential program is correct, the parallel execution is correct (for the same input)

Big Picture
[Figure: a program thread delegating computations to Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2.]

Serialization Sets: How?
• Sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with objects to express dependence
• A serializer groups dependent method invocations into a serialization set
  – The runtime executes them in order, honoring dependences
• Independent method invocations land in different sets
  – The runtime opportunistically parallelizes their execution

Example: Debit/Credit Transactions

    trans_t* trans;
    while ((trans = get_trans()) != NULL) {      // # of transactions?
        account_t* account = trans->account;     // points to?
        if (trans->type == DEPOSIT)
            account->deposit(trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw(trans->amount);
    }                                            // loop-carried dependence?

• Several static unknowns!

Multithreading Strategy
1) Read all transactions into an array
2) Divide chunks of the array among multiple threads

    // per-thread code: oblivious to what accounts each thread may access!
    account_t* account = trans[i]->account;
    if (trans[i]->type == DEPOSIT)
        account->deposit(trans[i]->amount);      // methods must lock the account
    else if (trans[i]->type == WITHDRAW)         // to ensure mutual exclusion
        account->withdraw(trans[i]->amount);

Example with Serialization Sets

    private<account_t> private_account_t;        // declare wrapped account type

    begin_nest();                                // initiate nesting level
    trans_t* trans;
    while ((trans = get_trans()) != NULL) {
        private_account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->delegate(deposit, trans->amount);
        else if (trans->type == WITHDRAW)
            account->delegate(withdraw, trans->amount);
    }
    end_nest();                                  // end nesting level, implicit barrier

• delegate indicates potentially independent operations
• At execution, delegate:
  1) Creates a method-invocation structure
  2) Gets the serializer pointer from the base class
  3) Enqueues the invocation in the serialization set

Program Context and Delegate Context
The program thread's delegate calls enqueue each invocation into the serialization set of the account it touches:
• SS #100: deposit acct=100 $2000; withdraw acct=100 $50; withdraw acct=100 $20; deposit acct=100 $300
• SS #200: withdraw acct=200 $1000; withdraw acct=200 $1000
• SS #300: withdraw acct=300 $350; deposit acct=300 $5000

Delegate threads (Delegate 0 and Delegate 1) then drain the sets: invocations within a set execute sequentially, in program order, while different sets execute in parallel across the delegate threads. Race-free, determinate execution without synchronization!
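The execution above can be reproduced with a small "bins" runtime. The following is a minimal sketch of that idea written for this text, not code from the actual system; Bin, push, close, drain, and the thread structure are hypothetical stand-ins. Work delegated to the same bin runs in program order on one delegate thread; work in different bins runs in parallel.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class Bin {                        // one serialization set's work queue
        std::deque<std::function<void()>> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
    public:
        void push(std::function<void()> f) {   // program thread enqueues
            { std::lock_guard<std::mutex> g(m); q.push_back(std::move(f)); }
            cv.notify_one();
        }
        void close() {                         // no more work will arrive
            { std::lock_guard<std::mutex> g(m); done = true; }
            cv.notify_one();
        }
        void drain() {                         // delegate thread runs invocations in order
            for (;;) {
                std::unique_lock<std::mutex> g(m);
                cv.wait(g, [&] { return !q.empty() || done; });
                if (q.empty()) return;
                auto f = std::move(q.front());
                q.pop_front();
                g.unlock();
                f();                           // sequential within the bin: no races on the object
            }
        }
    };

    int main() {
        std::vector<Bin> bins(2);              // e.g., one bin per serializer (account)
        std::vector<std::thread> delegates;
        for (auto& b : bins) delegates.emplace_back([&b] { b.drain(); });

        long balance0 = 0, balance1 = 0;
        bins[0].push([&] { balance0 += 2000; });   // same object -> same bin -> ordered
        bins[0].push([&] { balance0 -= 50; });
        bins[1].push([&] { balance1 -= 1000; });   // different object -> runs in parallel

        for (auto& b : bins) b.close();        // close bins and wait: plays the role
        for (auto& t : delegates) t.join();    // of end_nest()'s implicit barrier
    }

Binding each bin to a single delegate thread is what makes locks unnecessary: mutual exclusion on an object is implied by the bin's sequential drain, and the enqueue order preserves the sequential program's update order.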
Prometheus: C++ Library for SS
• Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
• Runtime orchestrates the parallel execution
• Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris

Prometheus Runtime
• Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
• Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled
    • Work-stealing scheduler
  – Supports nested parallelism

Network Packet Classification

    packet_t* packet;
    classify_t* classifier;
    vector<int> ruleCount(num_rules);
    vector<packet_queue_t> packet_queues;
    int packetCount = 0;

    for (i = 0; i < packet_queues.size(); i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            ruleID = classifier->softClassify(packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }

Example with Serialization Sets

    private<classify_t> private_classify_t;   // declare wrapped classifier type
    vector<private_classify_t> classifiers;
    int packetCount = 0;
    vector<int> ruleCount(num_rules, 0);
    int size = packet_queues.size();

    begin_nest();
    for (i = 0; i < size; i++) {
        classifiers[i].delegate(&classify_t::softClassify, packet_queues[i]);
    }
    end_nest();

    for (i = 0; i < size; i++) {               // combine the per-classifier counts
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }
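Both examples lean on the private<T> wrapper and delegate. As a rough illustration of the "metaprogramming for static type checking" point above, here is a hypothetical, single-threaded sketch, not the actual Prometheus interface: private_obj, delegate, and run_all are names invented for this example. delegate() accepts only a member function of the wrapped type whose parameters match the supplied arguments, packages the invocation, and appends it to the object's own serialization set.

    #include <deque>
    #include <functional>
    #include <tuple>
    #include <utility>

    template <typename T>
    class private_obj {
        T obj;                                    // the wrapped object's state
        std::deque<std::function<void()>> ss;     // this object's serialization set
    public:
        template <typename... Ctor>
        explicit private_obj(Ctor&&... a) : obj(std::forward<Ctor>(a)...) {}

        // Statically type-checked: method must be a member of T taking Params.
        template <typename... Params, typename... Args>
        void delegate(void (T::*method)(Params...), Args&&... args) {
            ss.push_back([this, method,
                          tup = std::make_tuple(std::forward<Args>(args)...)]() mutable {
                std::apply([&](auto&&... a) { (obj.*method)(a...); }, tup);
            });
        }

        void run_all() {                          // stand-in for a delegate thread
            while (!ss.empty()) { ss.front()(); ss.pop_front(); }
        }
    };

    struct account_t {
        long balance = 0;
        void deposit(long amt)  { balance += amt; }
        void withdraw(long amt) { balance -= amt; }
    };

    int main() {
        private_obj<account_t> acct;
        acct.delegate(&account_t::deposit, 2000L);   // mismatched types fail to compile
        acct.delegate(&account_t::withdraw, 50L);
        acct.run_all();                              // in-order, race-free by construction
    }

Because the member-function pointer fixes the parameter types at compile time, a malformed delegate call is rejected by the compiler rather than failing at run time, which is the benefit the template/metaprogramming design is after.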
Packet Classification (No Locks!)
[Figure: speedup relative to sequential execution vs. computation granularity (1 to 2048 packets per grain) on AMD Barcelona (4 and 8 sockets) and Intel Nehalem (1 and 2 sockets); y-axis up to 26x.]

Network Intrusion Detection
• Very common networking application
• Most commonly used program: Snort
  – Open-source version (like Linux)
  – But also commercial versions (Sourcefire)
• The basic structure of the computation is also found in many other deep-packet-inspection applications
  – E.g., packet de-duplication (Riverbed)

Snort Speedup
[Figure: Snort speedup relative to sequential execution on 4-core Intel Nehalem, 4-core AMD Barcelona, 8-core Intel Nehalem, and 16-core AMD Barcelona systems.]

Other Applications
• Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional parallelization
  – pthreads, OpenMP
• Prometheus versions
  – Port the program to a sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets

Statically Scheduled Results
[Figure: per-benchmark speedup relative to the original sequential program, conventional parallel vs. serialization sets, on a 4-socket AMD Barcelona (4-way multicore, 16 total cores); speedups range from about 3.9x to 18.1x, with serialization sets comparable or better.]

Statically Scheduled Results
[Figure: mean speedup, conventional parallel vs. serialization sets, on AMD Barcelona multicore (4), AMD Barcelona ccNUMA (16), Sun UltraSPARC T1 multicore (32), and Sun UltraSPARC III+ SMP (8); speedups range from about 3.8x to 8.7x, with serialization sets comparable or better on each system.]

Summary
• Sequential program with annotations
  – No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
• Dependence-based model
  – Determinate, race-free parallel execution
• Does as well as or better than the incumbents, but without their negatives
• Can do things that are very hard for the incumbents