Course  : H0344/Organisasi dan Arsitektur Komputer
Year    : 2005
Version : 1/1

Session 26
Parallel Processing

Learning Outcomes
By the end of this session, students are expected to be able to:
• Explain the working principles of parallel processing

Outline
• Multiple Processor Organization
• Symmetric Multiprocessor
• Cache Coherence and the MESI Protocol
• Clusters
• Non-uniform Memory Access
• Vector Computation

Cache coherence and the MESI protocol
The cache coherence problem: multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result. The solutions fall into two classes:
• Software solutions
• Hardware solutions
  • Directory protocols
  • Snoopy protocols

Cache coherence and the MESI protocol
MESI cache line states

State         This line valid?   The memory copy is ...   Copies in other caches?   A write to this line ...
M Modified    Yes                Out of date              No                        Does not go to bus
E Exclusive   Yes                Valid                    No                        Does not go to bus
S Shared      Yes                Valid                    Maybe                     Goes to bus and updates cache
I Invalid     No                 -                        Maybe                     Goes directly to bus

Cache coherence and the MESI protocol
MESI state transition diagram
[Figure: two state machines over the states Invalid, Shared, Exclusive, and Modified. (a) The line in the cache at the initiating processor, driven by the events RH (read hit), RMS (read miss, shared), RME (read miss, exclusive), WH (write hit), and WM (write miss). (b) The same line in a snooping cache, driven by SHR (snoop hit on read) and SHW (snoop hit on write or read-with-intent-to-modify).] A minimal C sketch of these transitions appears after the clustering material below.

Clusters
Four benefits that can be achieved with clustering:
• Absolute scalability
• Incremental scalability
• High availability
• Superior price/performance

Clusters
Cluster configurations
[Figure: (a) standby server with no shared disk: two multiprocessor servers, each with its own memory and I/O, connected only by a high-speed message link; (b) shared disk: the same arrangement, with both servers also cabled through their I/O subsystems to a common RAID array.]

Clustering methods: benefits and limitations

Passive standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost, because the secondary server is unavailable for other processing tasks.

Active secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost, because secondary servers can be used for processing.
  Limitations: Increased complexity.

Separate servers
  Description: Separate servers have their own disks; data are continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to the copying operations.

Servers connected to disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead, since the copying operations are eliminated.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Servers share disks
  Description: Multiple servers simultaneously share access to the same disks.
  Benefits: Low network and server overhead; reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.
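The passive-standby method above hinges on one mechanism: the standby must notice that the primary has stopped. Below is a minimal, hypothetical C sketch of heartbeat-based failure detection; the function names and the timeout value are illustrative assumptions, not taken from any particular cluster product, and a real cluster would run this over redundant links to avoid a "split brain".

    /* Hypothetical heartbeat monitor run on the standby server. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define HEARTBEAT_TIMEOUT 5            /* seconds of silence => failover */

    static time_t last_heartbeat;

    void on_heartbeat(void)                /* called for each message from the primary */
    {
        last_heartbeat = time(NULL);
    }

    bool primary_alive(void)
    {
        return time(NULL) - last_heartbeat < HEARTBEAT_TIMEOUT;
    }

    void standby_tick(void)                /* polled periodically on the standby */
    {
        if (!primary_alive())
            printf("primary silent for %ds: taking over its services\n",
                   HEARTBEAT_TIMEOUT);
    }

    int main(void)
    {
        on_heartbeat();                    /* primary announces itself once */
        standby_tick();                    /* still alive: nothing happens  */
        return 0;
    }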
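As noted under the MESI state transition diagram, here is a minimal C sketch of the next-state logic for a single cache line. It only restates the table and diagram above, not a full coherence controller: the associated bus actions (write-back when a Modified line is snooped, invalidate broadcast on a write hit to a Shared line) appear solely as comments, and the event names are our own rendering of the diagram's legend.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
    typedef enum {
        READ_HIT, READ_MISS_SHARED, READ_MISS_EXCLUSIVE,
        WRITE_HIT, WRITE_MISS, SNOOP_HIT_READ, SNOOP_HIT_WRITE
    } event_t;

    mesi_t mesi_next(mesi_t s, event_t e)
    {
        switch (e) {
        case READ_HIT:            return s;          /* RH: no state change */
        case READ_MISS_SHARED:    return SHARED;     /* RMS: other caches hold the line */
        case READ_MISS_EXCLUSIVE: return EXCLUSIVE;  /* RME: no other copies exist */
        case WRITE_HIT:           return MODIFIED;   /* WH: from S, broadcast invalidate first */
        case WRITE_MISS:          return MODIFIED;   /* WM: read-with-intent-to-modify */
        case SNOOP_HIT_READ:                         /* SHR: an M line is written back, then shared */
            return (s == INVALID) ? INVALID : SHARED;
        case SNOOP_HIT_WRITE:     return INVALID;    /* SHW: another processor is writing */
        }
        return s;
    }

    int main(void)
    {
        mesi_t s = EXCLUSIVE;
        s = mesi_next(s, WRITE_HIT);       /* E -> M without any bus traffic */
        printf("state = %d (3 = MODIFIED)\n", s);
        return 0;
    }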
Clusters
Operating system design issues:
• Failure management
• Load balancing
• Parallel computation, achieved through:
  • Parallelizing compilers
  • Parallelized applications
  • Parametric computing

Non-uniform memory access
• Uniform memory access (UMA): all processors see the same access time to all regions of main memory.
• Non-uniform memory access (NUMA): the access time a processor sees depends on which region of memory it references.
• Cache-coherent NUMA (CC-NUMA): a NUMA system in which cache coherence is maintained among the caches of the various processors.

Non-uniform memory access
CC-NUMA organization
[Figure: N nodes, each holding m processors (Processor 1-1 through Processor N-m) with their L1 caches, a local main memory, a directory, and I/O, all joined by an interconnection network.]

Vector computation
Three ways to organize the matrix multiplication C = A * B:

(a) Scalar processing
          DO 100 I = 1, N
            DO 100 J = 1, N
              C(I, J) = 0.0
              DO 100 K = 1, N
                C(I, J) = C(I, J) + A(I, K) * B(K, J)
      100 CONTINUE

(b) Vector processing
          DO 100 I = 1, N
            C(I, J) = 0.0 (J = 1, N)
            DO 100 K = 1, N
              C(I, J) = C(I, J) + A(I, K) * B(K, J) (J = 1, N)
      100 CONTINUE

(c) Parallel processing (a C/pthreads rendering of this FORK pattern appears at the end of this section)
          DO 50 J = 1, N - 1
            FORK 100
       50 CONTINUE
          J = N
      100 DO 200 I = 1, N
            C(I, J) = 0.0
            DO 200 K = 1, N
              C(I, J) = C(I, J) + A(I, K) * B(K, J)
      200 CONTINUE

Vector computation
[Figure: (a) a pipelined ALU placed between an input register and an output register, fed from memory; (b) several parallel ALUs sharing input and output registers, fed from memory.]

Vector computation
[Figure: floating-point addition zi = xi + yi broken into four pipeline stages: C (compare exponents), S (shift significand), A (add significands), N (normalize). (a) With a pipelined ALU, successive operand pairs x1,y1; x2,y2; ... enter one cycle apart, so results z1, z2, ... emerge every cycle once the pipeline fills. (b) With four parallel ALUs, operand pairs are dealt out four at a time and each ALU delivers every fourth result.]

Vector computation
The loop below multiplies 50 complex numbers, with real parts in AR, BR, CR and imaginary parts in AI, BI, CI:

          DO 100 J = 1, 50
            CR(J) = AR(J) * BR(J) - AI(J) * BI(J)
      100   CI(J) = AR(J) * BI(J) + AI(J) * BR(J)

Four instruction-set styles need different numbers of cycles per vector element:

(a) Storage to storage
  Operation                           Cycles
  AR(J) * BR(J) -> T1(J)              3
  AI(J) * BI(J) -> T2(J)              3
  T1(J) - T2(J) -> CR(J)              3
  AR(J) * BI(J) -> T3(J)              3
  AI(J) * BR(J) -> T4(J)              3
  T3(J) + T4(J) -> CI(J)              3
  TOTAL                               18

(b) Register to register
  Operation                           Cycles
  AR(J) -> V1(J)                      1
  BR(J) -> V2(J)                      1
  V1(J) * V2(J) -> V3(J)              1
  AI(J) -> V4(J)                      1
  BI(J) -> V5(J)                      1
  V4(J) * V5(J) -> V6(J)              1
  V3(J) - V6(J) -> V7(J)              1
  V7(J) -> CR(J)                      1
  V1(J) * V5(J) -> V8(J)              1
  V4(J) * V2(J) -> V9(J)              1
  V8(J) + V9(J) -> V0(J)              1
  V0(J) -> CI(J)                      1
  TOTAL                               12

(c) Storage to register
  Operation                           Cycles
  AR(J) -> V1(J)                      1
  V1(J) * BR(J) -> V2(J)              1
  AI(J) -> V3(J)                      1
  V3(J) * BI(J) -> V4(J)              1
  V2(J) - V4(J) -> V5(J)              1
  V5(J) -> CR(J)                      1
  V1(J) * BI(J) -> V6(J)              1
  V3(J) * BR(J) -> V7(J)              1
  V6(J) + V7(J) -> V8(J)              1
  V8(J) -> CI(J)                      1
  TOTAL                               10

(d) Compound instruction
  Operation                           Cycles
  AR(J) -> V1(J)                      1
  V1(J) * BR(J) -> V2(J)              1
  AI(J) -> V3(J)                      1
  V2(J) - V3(J) * BI(J) -> V2(J)      1
  V2(J) -> CR(J)                      1
  V1(J) * BI(J) -> V4(J)              1
  V4(J) + V3(J) * BR(J) -> V5(J)      1
  V5(J) -> CI(J)                      1
  TOTAL                               8
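The drop from 10 cycles in (c) to 8 in (d) comes entirely from compound multiply-and-add/subtract instructions. As a hedged C analogue, the C99 fma() library call fuses a multiply with an add in the same way, so each part of the complex product takes two arithmetic steps instead of three; the array names mirror the FORTRAN loop, while the rest of the sketch is our own.

    #include <math.h>
    #include <stdio.h>

    #define N 50
    static double AR[N], AI[N], BR[N], BI[N], CR[N], CI[N];

    void complex_multiply(void)
    {
        for (int j = 0; j < N; j++) {
            /* CR = AR*BR - AI*BI: one multiply plus one fused multiply-add */
            CR[j] = fma(-AI[j], BI[j], AR[j] * BR[j]);
            /* CI = AR*BI + AI*BR: likewise */
            CI[j] = fma(AI[j], BR[j], AR[j] * BI[j]);
        }
    }

    int main(void)
    {
        AR[0] = 1; AI[0] = 2; BR[0] = 3; BI[0] = 4;  /* (1+2i)(3+4i) = -5+10i */
        complex_multiply();
        printf("C(1) = %g + %gi\n", CR[0], CI[0]);   /* prints C(1) = -5 + 10i */
        return 0;
    }

Compilers with floating-point contraction enabled (for example GCC's -ffp-contract option) can emit the same fused operation even for the plain expression, just as a compound vector instruction folds the two steps in hardware.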
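The FORK-based parallel sequence (c) under Vector computation maps naturally onto threads. Below is a hypothetical C/pthreads rendering in which each forked iteration becomes a thread computing one column of C = A * B; unlike the FORTRAN original, where the main program keeps column N for itself, this sketch simply forks all N columns and joins them.

    #include <pthread.h>
    #include <stdio.h>

    #define N 4
    static double A[N][N], B[N][N], C[N][N];

    static void *column_worker(void *arg)
    {
        int j = (int)(long)arg;              /* column index, like J in the DO loop */
        for (int i = 0; i < N; i++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        for (int i = 0; i < N; i++)          /* test data: A = identity, so C = B */
            for (int j = 0; j < N; j++) {
                A[i][j] = (i == j);
                B[i][j] = i + j;
            }
        for (int j = 0; j < N; j++)          /* one "FORK 100" per column */
            pthread_create(&t[j], NULL, column_worker, (void *)(long)j);
        for (int j = 0; j < N; j++)          /* joining replaces the implicit merge */
            pthread_join(t[j], NULL);
        printf("C[2][3] = %g (expect 5)\n", C[2][3]);
        return 0;
    }

The columns are disjoint, so the threads need no locking; this is the same independence that lets the vector form in (b) update all values of J at once.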