OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING
Indiana University Computer Science Dept.
Seung-Hee Bae

OUTLINE
- Motivation
- Multicore
- Parallel Computing
- Data Mining

MOTIVATION
- According to the "How Much Information?" project at UC Berkeley, print, film, and magnetic and optical storage media produced about 5 exabytes (a billion billion bytes) of new information in 2002.
  - 5 exabytes is roughly 37,000 times the Library of Congress print collection (17 million books).
- The rate of data growth will continue to accelerate through weblogs, digital photos and video, surveillance monitors, scientific instruments (sensors), instant messaging, etc.
- Thus, we need more powerful computing platforms to deal with this much data.
- To take advantage of multicore chips, it is critical to build software with scalable parallelism.
- To deal with huge amounts of data and to utilize multicore, it is essential to develop data mining tools built on highly scalable parallel programming.

RECOGNITION, MINING, AND SYNTHESIS (RMS)
(from P. Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel Magazine, Feb. 2005)
- Intel points out that this three-stage processing cycle will be necessary to deal with the most generalized decision support (data mining).
- Examples: medicine (a tumor), business (hiring), investment.

OUTLINE: Motivation - Multicore (Toward Concurrency; What Is Multicore?) - Parallel Computing - Data Mining

TOWARD CONCURRENCY IN SOFTWARE
- Previous CPU performance gains:
  - Clock speed: getting more cycles. It has become harder to exploit higher clock speeds (2 GHz in 2001, 3.4 GHz in 2004; now?).
  - Execution optimization: more work per cycle (pipelining, branch prediction, multiple instructions per clock, instruction reordering).
  - Cache: increasing the size of the on-chip cache, since main memory is much slower than the cache (a cache miss costs 10 to 50 times as much as a cache hit).
- Exponential growth (Moore's Law) will change. Is Moore's Law over? Not yet: the number of transistors keeps increasing.
- Current CPU performance gains:
  - Hyperthreading: running two or more threads in parallel inside a single CPU. It does not help single-threaded applications.
  - Multicore: running two or more actual CPUs on one chip. It will boost reasonably well-written multi-threaded applications, but not single-threaded applications (a small threaded sketch follows the parallel computing introduction below).
  - Cache: only this will broadly benefit most existing applications.

WHAT IS MULTICORE?
- A single chip containing multiple distinct processing engines.
- Example: a shared-cache dual-core architecture, in which Core 0 and Core 1 each have their own CPU and private L1 cache and share a common L2 cache.

OUTLINE: Motivation - Multicore - Parallel Computing (Parallel Architectures: Shared Memory vs. Distributed Memory; Decomposing Programs: Data Parallelism vs. Task Parallelism; MPI and OpenMP) - Data Mining

PARALLEL COMPUTING: INTRODUCTION
- Parallel computing is more than just a strategy for achieving good performance; it is a vision of how computation can seamlessly scale from a single processor to virtually limitless computing power.
- Parallel computing software systems: the goal is to make parallel programming easier and the resulting applications more portable and scalable, while still achieving good performance.
- Component parallel paradigm (explicit parallelism): one explicitly programs the different parts of a parallel application. Examples: MPI, PGAS, CCR & DSS, workflow, DES.
- Program parallel paradigm (implicit parallelism): one writes a single program to describe the whole application, and the compiler and runtime break the program into multiple parts that execute in parallel. Examples: OpenMP, HPF, HPCS languages, MapReduce.
- Parallel computing challenges:
  - Concurrency and communication.
  - Scalability and portability are difficult to achieve.
  - Diversity of architectures.
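The deck contains no code, so the following is a minimal sketch, written for these notes, of the kind of multi-threaded program that a multicore chip can actually speed up: a POSIX-threads sum in C in which each thread processes its own block of an array. The thread count NTHREADS, the array size N, and the helper sum_chunk are illustrative choices, not anything from the original slides.

    /* Minimal sketch: a multi-threaded sum using POSIX threads.
       NTHREADS and N are arbitrary illustrative values. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000

    static double data[N];
    static double partial[NTHREADS];   /* one partial sum per thread */

    static void *sum_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[t] = s;                /* each thread writes its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;

        /* Fork: the OS can schedule each thread on a different core. */
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, sum_chunk, (void *)t);

        /* Join, then combine the partial results sequentially. */
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("sum = %f\n", total);
        return 0;
    }

Note that the adjacent slots of the partial array can end up on the same cache line; the false-sharing discussion in the memory-management slide later in the deck explains why padding such per-thread data can matter.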
PARALLEL ARCHITECTURE 1
- Shared-memory machines
  - Have a single shared address space that can be accessed by any processor.
  - Uniform Memory Access (UMA): access time is independent of the data's location.
  - Typically use a bus or a fully connected network, which makes scalability hard to achieve.
  - Examples: multicore chips, symmetric multiprocessors (SMPs).
- Distributed-memory machines
  - The system memory is packaged with individual nodes of one or more processors (cf. separate computers connected by a network).
  - Communication is required to provide data from one processor to a different processor.
  - Example: a cluster.

PARALLEL ARCHITECTURE 2
- Shared memory
  - Pros: lower latency and higher bandwidth; data are available to all of the CPUs through load and store instructions; a single address space.
  - Cons: cache coherency must be handled carefully; synchronization is explicitly needed to access shared data; scalability is an issue.
- Distributed memory
  - Pros: scalable, if a scalable interconnection network is used; quite fast local data access.
  - Cons: communication is required to access data on a different processor; communication management is a problem (long latency and long transmission time).

PARALLEL ARCHITECTURE 3
- Hybrid systems
  - Distributed shared memory (DSM): a distributed-memory machine that allows a processor to directly access a datum in a remote memory. Latency varies with the distance to the remote memory, which emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
  - SMP clusters: a distributed-memory system with an SMP as the unit.

PARALLEL PROGRAMMING MODEL
- Shared-memory programming model
  - Needs synchronization to preserve the integrity of shared data.
  - More appropriate for shared-memory machines.
  - Example: Open Specifications for MultiProcessing (OpenMP).
- Message-passing programming model
  - Uses send-receive communication steps; communication is used to access a remote data location.
  - More appropriate for distributed-memory machines.
  - Example: Message Passing Interface (MPI).
- A shared-memory programming model can be used on distributed-memory machines, just as a message-passing programming model can be used on shared-memory architectures; however, the efficiency of the programming model differs.

PARALLEL PROGRAM: DECOMPOSITION 1
- Data parallelism
  - Subdivides the data domain of a problem into multiple regions and assigns the regions to different processors.
  - Exploits the parallelism inherent in many large data structures: the same task on different data (SPMD).
  - More commonly used in scientific problems.
  - Features: a natural form of scalability; hard to express when the geometry is irregular or dynamic; can be expressed by all parallel programming models (MPI, HPF-like, OpenMP-like).
- Functional parallelism
  - Different processors carry out different functions: coarse-grain parallelism, with different tasks on the same or different data.
  - Features: parallelism limited in size (tens of tasks, not millions); synchronization is probably good (manageable).
  - Parallelism and decomposition can be derived from the problem structure, e.g., a workflow.

PARALLEL PROGRAM: DECOMPOSITION 2
- Load balance and scalability
  - Scalable: the running time is inversely proportional to the number of processors used; speedup(n) = T(1)/T(n), and the program is scalable if speedup(n) ≈ n.
  - Second definition of scalability (scaled speedup): scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n.
- Why is scalability not achieved?
  - A region that must be run sequentially: total speedup ≤ T(1)/Ts, where Ts is the sequential time (Amdahl's Law); a worked sketch follows this slide.
  - A requirement for a high degree of communication or coordination.
  - Poor load balance (achieving good load balance is a major goal of parallel programming): if one of the processors takes half of the parallel work, the speedup is limited to a factor of two.
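As a worked illustration of Amdahl's Law (not part of the original slides), the following C sketch evaluates speedup(n) = 1 / (serial + (1 - serial)/n) for a program with a 10% serial fraction; the function name amdahl_speedup and the example numbers are assumptions made here for illustration.

    /* Sketch of Amdahl's Law: speedup on n processors when a fraction
       `serial` of the work cannot be parallelized.
       speedup(n) = 1 / (serial + (1 - serial)/n), bounded above by 1/serial. */
    #include <stdio.h>

    static double amdahl_speedup(double serial, int n) {
        return 1.0 / (serial + (1.0 - serial) / n);
    }

    int main(void) {
        /* Example: with 10% serial work, 16 processors give about 6.4x,
           and no processor count can exceed 10x. */
        printf("n = 16:   %.2f\n", amdahl_speedup(0.10, 16));
        printf("n = 1024: %.2f\n", amdahl_speedup(0.10, 1024));
        return 0;
    }

With a 10% sequential region the speedup approaches, but never exceeds, a factor of 10, which is exactly the T(1)/Ts bound quoted above.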
MEMORY MANAGEMENT
- Memory-hierarchy management
  - Blocking: ensuring that data remain in cache between subsequent accesses to the same memory location.
  - Elimination of false sharing: false sharing occurs when two different processors access distinct data items that reside on the same cache line. Ensure that data used by different processors reside on different cache lines, e.g., by padding (inserting empty bytes into a data structure); a padding sketch appears after the OpenMP notes below.
- Communication minimization and placement: move send and receive commands far enough apart that the time spent on communication can be overlapped with computation.
- Stride-one access: programs whose loops access contiguous data items are much more efficient than those that do not.

MESSAGE PASSING INTERFACE (MPI)
- A specification for a set of functions for managing the movement of data among sets of communicating processes.
- The dominant scalable parallel computing paradigm for scientific problems.
- Explicit message send and receive using a rendezvous model; point-to-point communication; collective communication.
- Commonly implemented in terms of an SPMD model: all processes execute essentially the same logic.
- Pros: scalable and portable; race conditions are avoided (implicit synchronization with the completion of the copy).
- Cons: the programmer must implement the details of communication.

MPI: 6 KEY FUNCTIONS
- MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, MPI_FINALIZE (a minimal sketch using these six functions appears after the OpenMP notes below).
- Collective communications: barrier, broadcast, gather, scatter, all-to-all, exchange; general reduction operations (sum, minimum, scan).
- Blocking, nonblocking, buffered, and synchronous messaging.

OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1
- Appropriate for uniform-access, shared-memory machines.
- A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes that helps compilers produce parallel code.
- Provides parallel loops and collective operations such as summation over loop indices.
- Provides lock variables to allow fine-grain synchronization between threads.
- The programmer specifies where multiple threads should be applied and how to assign work to those threads.
- Pros: an excellent programming interface for uniform-access, shared-memory machines.
- Cons: no way to specify locality on machines with non-uniform shared memory or distributed memory; cannot express all parallel algorithms.

OpenMP 2
- Directives instruct the compiler to create threads, perform synchronization operations, and manage shared memory.
- Examples: PARALLEL DO ... END PARALLEL DO, SCHEDULE (STATIC), SCHEDULE (DYNAMIC), REDUCTION(+: x), PARALLEL SECTIONS.
- OpenMP synchronization primitives: critical sections, atomic updates, barriers.
- A C sketch of a parallel loop with a reduction follows the MPI sketch below.
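The padding idea from the memory-management slide can be made concrete with a short C sketch written for these notes (not taken from the deck); the 64-byte cache-line size and the struct name padded_counter are assumptions.

    /* Sketch: padding per-thread counters to avoid false sharing.
       64 bytes is a common, but not universal, cache-line size. */
    #include <stdio.h>

    #define CACHE_LINE 64

    /* Without padding, counters for different threads would sit on the
       same cache line and ping-pong between cores on every update. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* empty bytes push the next
                                                 element onto its own line */
    };

    struct padded_counter counters[4];   /* one per thread/core */

    int main(void) {
        printf("element size = %zu bytes\n", sizeof(struct padded_counter));
        return 0;
    }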
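Next, a minimal SPMD sketch using only the six key MPI functions listed above. It is an illustration written for these notes, not code from the slides: rank 0 collects one integer from every other rank with point-to-point MPI_Send/MPI_Recv; the tag value 0 and the rank*rank payload are arbitrary choices.

    /* Sketch: the six key MPI functions in one SPMD program.
       Compile with mpicc and run with mpirun/mpiexec. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                    /* MPI_INIT */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* MPI_COMM_SIZE */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* MPI_COMM_RANK */

        if (rank == 0) {
            int msg;
            for (int src = 1; src < size; src++) {
                MPI_Recv(&msg, 1, MPI_INT, src, 0, /* MPI_RECV */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank %d\n", msg, src);
            }
        } else {
            int msg = rank * rank;                 /* some local result */
            MPI_Send(&msg, 1, MPI_INT, 0, 0,       /* MPI_SEND */
                     MPI_COMM_WORLD);
        }

        MPI_Finalize();                            /* MPI_FINALIZE */
        return 0;
    }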
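Finally, a minimal OpenMP sketch, again written for these notes rather than taken from the slides, showing the C form of the directives quoted above: a parallel loop with SCHEDULE(STATIC) and REDUCTION. The array size and fill values are arbitrary.

    /* Sketch: an OpenMP parallel loop with a reduction, in C rather than
       the Fortran-style directives quoted above. Compile with -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double x[N];

    int main(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++) x[i] = 0.001 * i;

        /* The directive asks the compiler to split the iterations among
           threads; schedule(static) assigns fixed contiguous blocks, and
           reduction(+:sum) gives each thread a private partial sum that is
           combined at the implicit barrier at the end of the loop. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("threads available: %d, sum = %f\n",
               omp_get_max_threads(), sum);
        return 0;
    }

The reduction clause is the C counterpart of the collective "summation over loop indices" mentioned on the OpenMP slide.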
OUTLINE: Motivation - Multicore - Parallel Computing - Data Mining (Expectation Maximization (EM); Deterministic Annealing (DA); Hidden Markov Model (HMM); Other Important Algorithms)

EXPECTATION MAXIMIZATION (EM)
- A general algorithm for maximum-likelihood (ML) estimation when the data are "incomplete" or the likelihood function involves latent variables.
- An efficient iterative procedure whose goal is to estimate the unknown parameters given the measurements.
- A hill-climbing approach, guaranteed to reach a maximum (possibly only a local maximum).
- Two steps:
  - E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.
  - M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known (the estimated missing data from the E-step are used in lieu of the actual missing data).
- The two steps are repeated until the likelihood converges.

DETERMINISTIC ANNEALING (DA)
- Purpose: avoid local minima (optimization).
- Simulated annealing (SA): a sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the cost of the current state (a Monte Carlo method).
- Deterministic annealing (DA): uses expectations instead of stochastic simulation (random moves).
  - Deterministic: makes incremental progress on average by minimizing the free energy F directly.
  - Annealing: still avoids local minima by retaining a certain level of uncertainty, minimizing the cost at a prescribed level of randomness (Shannon entropy).
  - F = D - TH (T: temperature, H: Shannon entropy, D: cost). At large T the entropy term H dominates, while at small T the cost dominates.
  - Annealing lowers the temperature so that the solution tracks continuously.

DA FOR CLUSTERING
- An extended K-means algorithm (a sketch of one annealing iteration appears after the HMM notes below).
- Start with a single cluster, whose centroid Y1 is the solution.
- For some annealing schedule for T, iterate the algorithm, testing the covariance matrix of the Xi about each cluster center to see whether the cluster is "elongated".
- Split a cluster if the elongation is "long enough" (a phase transition).
- You do not need to assume the number of clusters, only a final resolution T (or equivalent).
- At T = 0 the uninteresting solution is N clusters, one at each point xi.

DA CLUSTERING RESULTS (GIS)
[Figures: DA clusters of census data, age under 5 vs. age 25 to 34, and age under 5 vs. age 75 and up.]

HIDDEN MARKOV MODEL (HMM) 1
- Markov model: a system that is in one of a set of N distinct states S1, S2, ..., SN at any time.
- State transition probability: the special case of a discrete, first-order Markov chain is
  P[q_t = S_j | q_t-1 = S_i, q_t-2 = S_k, ...] = P[q_t = S_j | q_t-1 = S_i]   (1)
  If the right-hand side of (1) is independent of time, we obtain the set of state transition probabilities a_ij of the form
  a_ij = P[q_t = S_j | q_t-1 = S_i], 1 ≤ i, j ≤ N, with a_ij ≥ 0 and ∑_j a_ij = 1.
- Initial state probabilities.
- Hidden Markov model (HMM): the observation is a probabilistic function of the state, and the state itself is hidden. Applications: speech recognition, bioinformatics, etc.
- Elements of an HMM:
  - N, the number of states.
  - M, the number of symbols.
  - A = {a_ij}, the state transition probability distribution.
  - B = {b_j(k)}, the symbol emission probability distribution in state j: b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M.
  - π = {π_i}, the initial state distribution: π_i = P[q_1 = S_i], 1 ≤ i ≤ N.
  - Compact notation: λ = (A, B, π).

HIDDEN MARKOV MODEL (HMM) 2
- Three basic problems
  1. Probability of the observation sequence given the model: given O = O_1 O_2 ... O_T and λ = (A, B, π), how do we efficiently compute P(O | λ)?
  2. Finding the optimal state sequence: given O = O_1 O_2 ... O_T and λ, how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T that is optimal in some meaningful sense (i.e., best "explains" the observations)?
  3. Finding the optimal model parameters: how do we adjust λ = (A, B, π) to maximize P(O | λ)?
- Solutions to those problems
  1. P(O | λ): direct enumeration is computationally infeasible; the forward procedure computes α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ) (see the forward-procedure sketch below).
  2. Optimal state sequence: the Viterbi algorithm, a dynamic programming method, computes δ_t(i) = max over q_1 ... q_t-1 of P[q_1 q_2 ... q_t = S_i, O_1 O_2 ... O_t | λ], followed by path backtracking.
  3. Optimal model parameters: the Baum-Welch method chooses λ = (A, B, π) such that P(O | λ) is locally maximized; it is essentially an iterative EM method, built on ξ_t(i, j) = P(q_t = S_i, q_t+1 = S_j | O, λ).
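To make the DA-for-clustering slide concrete, here is a small C sketch, written for these notes and not taken from the deck, of a single EM-style iteration at a fixed temperature T on one-dimensional toy data: soft (Gibbs) assignments proportional to exp(-d(x_i, y_k)/T) act as the E-step, and a weighted centroid update acts as the M-step. The point set, the two starting centroids, and T = 0.5 are made-up illustrative values; the full DA algorithm would also lower T on an annealing schedule and split elongated clusters, which this sketch omits.

    /* Sketch: one EM-style iteration of deterministic-annealing clustering
       at a fixed temperature T, for 1-D toy points. Link with -lm. */
    #include <math.h>
    #include <stdio.h>

    #define NPTS 6
    #define K    2

    int main(void) {
        double x[NPTS] = {0.0, 0.2, 0.4, 5.0, 5.2, 5.4};  /* toy data */
        double y[K]    = {1.0, 4.0};                      /* current centroids */
        double T = 0.5;                                   /* temperature */
        double p[NPTS][K];

        /* E-step: Gibbs (soft) assignment of each point to each centroid. */
        for (int i = 0; i < NPTS; i++) {
            double z = 0.0;
            for (int k = 0; k < K; k++) {
                double d = (x[i] - y[k]) * (x[i] - y[k]);
                p[i][k] = exp(-d / T);
                z += p[i][k];
            }
            for (int k = 0; k < K; k++) p[i][k] /= z;     /* normalize */
        }

        /* M-step: move each centroid to the p-weighted mean of the data. */
        for (int k = 0; k < K; k++) {
            double num = 0.0, den = 0.0;
            for (int i = 0; i < NPTS; i++) {
                num += p[i][k] * x[i];
                den += p[i][k];
            }
            y[k] = num / den;
            printf("centroid %d -> %f\n", k, y[k]);
        }
        return 0;
    }

As T approaches zero the soft assignments become hard nearest-centroid assignments, which is how DA clustering reduces to ordinary K-means.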
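The forward procedure mentioned in the HMM solutions can be sketched in a few lines of C. This example is written for these notes; the particular A, B, and π values and the observation sequence are invented toy numbers. It computes α_t(i) by induction and sums α_T(i) to obtain P(O | λ).

    /* Sketch: the forward procedure for P(O | lambda) on a tiny HMM
       with N = 2 states and M = 2 symbols. All numbers are toy values. */
    #include <stdio.h>

    #define N 2   /* states */
    #define M 2   /* symbols */
    #define T 4   /* observation length */

    int main(void) {
        double A[N][N] = {{0.7, 0.3}, {0.4, 0.6}};   /* transitions a_ij */
        double B[N][M] = {{0.9, 0.1}, {0.2, 0.8}};   /* emissions b_j(k) */
        double pi[N]   = {0.6, 0.4};                 /* initial distribution */
        int    O[T]    = {0, 1, 1, 0};               /* observed symbols */
        double alpha[T][N];

        /* Initialization: alpha_1(i) = pi_i * b_i(O_1). */
        for (int i = 0; i < N; i++)
            alpha[0][i] = pi[i] * B[i][O[0]];

        /* Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(O_{t+1}). */
        for (int t = 0; t < T - 1; t++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int i = 0; i < N; i++)
                    s += alpha[t][i] * A[i][j];
                alpha[t + 1][j] = s * B[j][O[t + 1]];
            }

        /* Termination: P(O | lambda) = sum_i alpha_T(i). */
        double prob = 0.0;
        for (int i = 0; i < N; i++)
            prob += alpha[T - 1][i];
        printf("P(O | lambda) = %g\n", prob);
        return 0;
    }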
OTHER IMPORTANT ALGORITHMS
- Other data mining algorithms: Support Vector Machine (SVM); K-means (a special case of DA clustering); nearest neighbor; decision trees; neural networks; etc.
- Dimension reduction: GTM (Generative Topographic Mapping), MDS (MultiDimensional Scaling), SOM (Self-Organizing Map).

SUMMARY
- The era of multicore: parallelism is essential.
- There is an explosion of information from many kinds of sources.
- We are interested in scalable parallel data mining algorithms:
  - Clustering algorithms (DA clustering): GIS with demographic (census) data, where visualization is natural; cheminformatics, where dimension reduction is necessary for visualization.
  - Visualization (dimension reduction).
  - Hidden Markov Models, ...

THANK YOU! QUESTIONS?