Rethinking Parallel Execution

Guri Sohi (with Matthew Allen, Srinath Sridharan, and Gagan Gupta)
University of Wisconsin-Madison
Mason Wells, April 27, 2010

Outline
• From sequential to multicore
• Reminiscing: instruction-level parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic serialization
• Consequences of dynamic serialization
• Wrap up

Microprocessor Generations
• Generation 1: Sequential (1970s)
• Generation 2: Pipelined (1980s)
• Generation 3: Instruction-level parallel (ILP) (1990s)
• Generation 4: Multiple processing cores (2000s)

From One Generation to the Next
• Significant debate and research
  – New solutions proposed
  – Old solutions adapt in interesting ways to become viable, or even better than the new solutions
• Solutions that make their changes "under the hood" end up winning over the others

From One Generation to the Next
• From sequential to pipelined
  – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)
  – CISC architectures learned from and employed RISC innovations
• From pipelined to instruction-level parallel
  – Statically scheduled VLIW/EPIC
  – Dynamically scheduled superscalar

From One Generation to the Next
• From ILP to multicore
  – Parallelism based on the canonical parallel execution model
  – Overcoming constraints to canonical parallelization:
    • Thread-level speculation (TLS)
    • Transactional memory (TM)

Reminiscing about ILP
• Late 1980s to mid-1990s
• Search for the "post-RISC" architecture
  – More accurately, instruction processing model
• Desire to execute more than one instruction per cycle, i.e., exploit ILP
• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar

VLIW/EPIC School
• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by the compiler)
• Parallel execution expressed in the static program
• Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism

VLIW/EPIC School
• Creating effective parallel representations (statically) introduces several problems:
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
• Lots of research addressing these problems
• Intel and HP pushed it as their future (Itanium)

OOO Superscalar
• Create dynamic parallel execution from a sequential static representation
  – Dynamic dependence information is accurate
  – Execution schedule is flexible
• None of the problems associated with trying to create a parallel representation statically
• Natural growth path with no demands on software

Lessons from the ILP Generation
• Trying to statically detect and express parallelism has significant consequences
• Techniques that make "under the hood" changes are the winners
  – Even though they may have some drawbacks and overheads

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• The solution is critical to the long-term health of the computer and information technology industry
  – And thus to the economy and society as we know it

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Over four decades of conventional wisdom in parallel processing
  – Mostly from the scientific application/HPC arena
  – Use this as the basis
• Parallel Execution Requires a Parallel Representation

Canonical Parallel Execution Model
A: Analyze the program to identify independence
  – Independent portions are executed in parallel
B: Create a static representation of that independence
  – Synchronization to satisfy the independence assumption
C: Dynamic parallel execution unwinds as per the static representation
  – Potential consequences due to the static assumptions

Canonical Parallel Execution Model
• Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research:
  – Identifying independence
  – Creating the static representation
  – Dynamic unwinding

Identifying Independence
• Static program analysis
  – Over four decades of work
• Hard to identify statically
  – Inherently dynamic properties
  – Must be conservative statically
• Need to identify dependence in order to identify independence

Creating the Static Representation
• Parallel representation for guaranteed-independent work
• Insert synchronization for potential dependences
  – Conservative synchronization moves parallel execution towards sequential execution

Dynamic Unwinding
• Non-determinism
  – Changes to program state may not be repeatable
• Race conditions
• Several startup companies have formed to deal with this problem

Conventional Wisdom
• Parallel Execution Requires a Parallel Representation
• Consequences:
  – Must create a parallel representation
  – For correct execution, must statically identify:
    • Independence, for the parallel representation
    • Dependence, for synchronization
  – Source of enormous difficulty and complexity
    • Generally functions of the program's input
    • Inherently dynamic properties

Current Approaches
• Stick with the canonical model and try to overcome its limitations
• Thread-level speculation (TLS) and transactional memory (TM)
• Techniques that let the programmer program sequentially but automatically generate a parallel representation
• Techniques to handle non-determinism and race conditions

TLS and TM
• Overcome a major constraint to creating a static parallel representation
• Likely in several upcoming microprocessors
  – Our work from the mid-1990s is a key enabler
• Already in Sun MAJC, NEC Merlot, Sun Rock

Static Program Representation Issues

    Issue                Sequential    Parallel
    Bugs                 Yes           Yes (more)
    Data races           No            Yes
    Locks/synch          No            Yes
    Deadlock             No            Yes
    Nondeterminism       No            Yes
    Parallel execution   ?             Yes

• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
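To make the table's hazards concrete, here is a minimal sketch (my illustration, not from the talk; Account, worker, and the chunk split are all hypothetical) of the canonical model's bind: since the static representation must synchronize every potential dependence, one conservative lock serializes even independent account updates, while removing it creates a data race.

    // A "parallel" loop under conservative static synchronization.
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Account { long balance = 0; };

    int main() {
        std::vector<Account> accounts(4);
        std::mutex m;  // conservative: guards ALL accounts, not just the one touched

        auto worker = [&](int begin, int end) {
            for (int i = begin; i < end; ++i) {
                // Serializes even updates to different accounts; dropping
                // the lock instead produces a data race on balance.
                std::lock_guard<std::mutex> g(m);
                accounts[i % 4].balance += 100;
            }
        };

        std::thread t0(worker, 0, 500), t1(worker, 500, 1000);
        t0.join();
        t1.join();
    }

Serialization sets, introduced next, sidestep this choice by deriving the partition from the data each computation actually touches at run time.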
Serialization Sets: What?
• Sequential program representation with dynamic parallel execution
  – No static representation of independence
  – No locks and no explicit synchronization
• An "under the hood" runtime system dynamically determines and orders dependent computations
  – Independence, and thus parallelism, falls out as a side effect
• Comparable or better performance than conventional parallel models

How? Big Picture
• Write the program in a well-structured object-oriented style
  – A method operates on the data of its associated object (ver. 1)
• Identify parts of the program for potential parallel execution
  – Make suitable annotations as needed
• Dynamically determine the data object touched by the selected code
  – Identify dependence
• The program thread assigns the selected code to bins

How? Big Picture
• Serialize computations to the same object
  – Enforce dependence
  – Assign them to the same bin; a delegate thread executes the computations in a bin sequentially
• Do not look for or represent independence
  – It falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
• Updates to a given piece of state occur in the same order as in the sequential program
  – Determinism
  – No races
  – If the sequential program is correct, the parallel execution is correct (for the same input)

Big Picture
[Figure: a program thread delegating computations to Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2.]

Serialization Sets: How?
• Sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with objects to express dependence
• A serializer groups dependent method invocations into a serialization set
  – The runtime executes them in order, honoring dependences
• Independent method invocations land in different sets
  – The runtime opportunistically parallelizes their execution

Example: Debit/Credit Transactions

    trans_t* trans;
    while ((trans = get_trans()) != NULL) {      // # of transactions?
        account_t* account = trans->account;     // points to?
        if (trans->type == DEPOSIT)
            account->deposit(trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw(trans->amount);
    }                                            // loop-carried dependence?

• Several static unknowns!

Multithreading Strategy
1) Read all transactions into an array
2) Divide chunks of the array among multiple threads

    // per-thread code: oblivious to what accounts each thread may access!
    account_t* account = trans[i]->account;
    if (trans[i]->type == DEPOSIT)
        account->deposit(trans[i]->amount);      // methods must lock the account
    else if (trans[i]->type == WITHDRAW)         // to ensure mutual exclusion
        account->withdraw(trans[i]->amount);

Example with Serialization Sets

    private<account_t> private_account_t;        // declare wrapped account type

    begin_nest();                                // initiate nesting level
    trans_t* trans;
    while ((trans = get_trans()) != NULL) {
        private_account_t* account = trans->account;
        if (trans->type == DEPOSIT)
            account->delegate(deposit, trans->amount);
        else if (trans->type == WITHDRAW)
            account->delegate(withdraw, trans->amount);
    }
    end_nest();                                  // end nesting level, implicit barrier

• delegate indicates potentially independent operations
• At execution, delegate:
  1) Creates a method-invocation structure
  2) Gets the serializer pointer from the base class
  3) Enqueues the invocation in the serialization set

Program Context and Delegate Context
The program thread's delegate calls enqueue each invocation into the serialization set of the account it touches:
• SS #100: deposit acct=100 $2000; withdraw acct=100 $50; withdraw acct=100 $20; deposit acct=100 $300
• SS #200: withdraw acct=200 $1000; withdraw acct=200 $1000
• SS #300: withdraw acct=300 $350; deposit acct=300 $5000

Delegate threads (Delegate 0 and Delegate 1) then drain the sets: invocations within a set execute sequentially, in program order, while different sets execute in parallel across the delegate threads. Race-free, determinate execution without synchronization!
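The execution above can be reproduced with a small "bins" runtime. The following is a minimal sketch of that idea written for this text, not code from the actual system; Bin, push, close, drain, and the thread structure are hypothetical stand-ins. Work delegated to the same bin runs in program order on one delegate thread; work in different bins runs in parallel.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class Bin {                        // one serialization set's work queue
        std::deque<std::function<void()>> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
    public:
        void push(std::function<void()> f) {   // program thread enqueues
            { std::lock_guard<std::mutex> g(m); q.push_back(std::move(f)); }
            cv.notify_one();
        }
        void close() {                         // no more work will arrive
            { std::lock_guard<std::mutex> g(m); done = true; }
            cv.notify_one();
        }
        void drain() {                         // delegate thread runs invocations in order
            for (;;) {
                std::unique_lock<std::mutex> g(m);
                cv.wait(g, [&] { return !q.empty() || done; });
                if (q.empty()) return;
                auto f = std::move(q.front());
                q.pop_front();
                g.unlock();
                f();                           // sequential within the bin: no races on the object
            }
        }
    };

    int main() {
        std::vector<Bin> bins(2);              // e.g., one bin per serializer (account)
        std::vector<std::thread> delegates;
        for (auto& b : bins) delegates.emplace_back([&b] { b.drain(); });

        long balance0 = 0, balance1 = 0;
        bins[0].push([&] { balance0 += 2000; });   // same object -> same bin -> ordered
        bins[0].push([&] { balance0 -= 50; });
        bins[1].push([&] { balance1 -= 1000; });   // different object -> runs in parallel

        for (auto& b : bins) b.close();        // close bins and wait: plays the role
        for (auto& t : delegates) t.join();    // of end_nest()'s implicit barrier
    }

Binding each bin to a single delegate thread is what makes locks unnecessary: mutual exclusion on an object is implied by the bin's sequential drain, and the enqueue order preserves the sequential program's update order.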
Prometheus: C++ Library for SS
• Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
• Runtime orchestrates the parallel execution
• Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris

Prometheus Runtime
• Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
• Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled
    • Work-stealing scheduler
  – Supports nested parallelism

Network Packet Classification

    packet_t* packet;
    classify_t* classifier;
    vector<int> ruleCount(num_rules);
    vector<packet_queue_t> packet_queues;
    int packetCount = 0;

    for (i = 0; i < packet_queues.size(); i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            ruleID = classifier->softClassify(packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }

Example with Serialization Sets

    private<classify_t> private_classify_t;   // declare wrapped classifier type
    vector<private_classify_t> classifiers;
    int packetCount = 0;
    vector<int> ruleCount(num_rules, 0);
    int size = packet_queues.size();

    begin_nest();
    for (i = 0; i < size; i++) {
        classifiers[i].delegate(&classify_t::softClassify, packet_queues[i]);
    }
    end_nest();

    for (i = 0; i < size; i++) {               // combine the per-classifier counts
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }
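Both examples lean on the private<T> wrapper and delegate. As a rough illustration of the "metaprogramming for static type checking" point above, here is a hypothetical, single-threaded sketch, not the actual Prometheus interface: private_obj, delegate, and run_all are names invented for this example. delegate() accepts only a member function of the wrapped type whose parameters match the supplied arguments, packages the invocation, and appends it to the object's own serialization set.

    #include <deque>
    #include <functional>
    #include <tuple>
    #include <utility>

    template <typename T>
    class private_obj {
        T obj;                                    // the wrapped object's state
        std::deque<std::function<void()>> ss;     // this object's serialization set
    public:
        template <typename... Ctor>
        explicit private_obj(Ctor&&... a) : obj(std::forward<Ctor>(a)...) {}

        // Statically type-checked: method must be a member of T taking Params.
        template <typename... Params, typename... Args>
        void delegate(void (T::*method)(Params...), Args&&... args) {
            ss.push_back([this, method,
                          tup = std::make_tuple(std::forward<Args>(args)...)]() mutable {
                std::apply([&](auto&&... a) { (obj.*method)(a...); }, tup);
            });
        }

        void run_all() {                          // stand-in for a delegate thread
            while (!ss.empty()) { ss.front()(); ss.pop_front(); }
        }
    };

    struct account_t {
        long balance = 0;
        void deposit(long amt)  { balance += amt; }
        void withdraw(long amt) { balance -= amt; }
    };

    int main() {
        private_obj<account_t> acct;
        acct.delegate(&account_t::deposit, 2000L);   // mismatched types fail to compile
        acct.delegate(&account_t::withdraw, 50L);
        acct.run_all();                              // in-order, race-free by construction
    }

Because the member-function pointer fixes the parameter types at compile time, a malformed delegate call is rejected by the compiler rather than failing at run time, which is the benefit the template/metaprogramming design is after.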
Packet Classification (No Locks!)
[Figure: speedup relative to sequential execution vs. computation granularity (1 to 2048 packets per grain) on AMD Barcelona (4 and 8 sockets) and Intel Nehalem (1 and 2 sockets); y-axis up to 26x.]

Network Intrusion Detection
• Very common networking application
• Most commonly used program: Snort
  – Open-source version (like Linux)
  – But also commercial versions (Sourcefire)
• The basic structure of the computation is also found in many other deep-packet-inspection applications
  – E.g., packet de-duplication (Riverbed)

Snort Speedup
[Figure: Snort speedup relative to sequential execution on 4-core Intel Nehalem, 4-core AMD Barcelona, 8-core Intel Nehalem, and 16-core AMD Barcelona systems.]

Other Applications
• Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional parallelization
  – pthreads, OpenMP
• Prometheus versions
  – Port the program to a sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets

Statically Scheduled Results
[Figure: per-benchmark speedup relative to the original sequential program, conventional parallel vs. serialization sets, on a 4-socket AMD Barcelona (4-way multicore, 16 total cores); speedups range from about 3.9x to 18.1x, with serialization sets comparable or better.]

Statically Scheduled Results
[Figure: mean speedup, conventional parallel vs. serialization sets, on AMD Barcelona multicore (4), AMD Barcelona ccNUMA (16), Sun UltraSPARC T1 multicore (32), and Sun UltraSPARC III+ SMP (8); speedups range from about 3.8x to 8.7x, with serialization sets comparable or better on each system.]

Summary
• Sequential program with annotations
  – No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
• Dependence-based model
  – Determinate, race-free parallel execution
• Does as well as or better than the incumbents, but without their negatives
• Can do things that are very hard for the incumbents