Distributed and Parallel Computing Research and Education at Northwestern University (with a focus on clouds)

Peter Dinda
Associate Professor
Head of Computer Engineering and Systems Division
pdinda@northwestern.edu
Department of Electrical Engineering and Computer Science
Northwestern University
http://www.eecs.northwestern.edu

Northwestern EECS
• Research university
• EE, CE, CS degrees at the BS/BA, MS, and Ph.D. levels
• ~50 faculty

Highlights
• Virtuoso virtualized distributed computing environment – one of the first "IaaS clouds"
• Palacios virtual machine monitor – virtualizing a supercomputer at scale
• PLT Scheme with futures – multicore parallelism in a widely used functional language
• NU-MineBench – widely used data-mining benchmark suite
• Ono and P2P research – one million users of a research tool
• P2P as CDN – Akamaizing BitTorrent
• VLab – VM-based educational lab for systems
• HPDC 2010

Virtuoso Project (virtuoso.cs.northwestern.edu)
• "Infrastructure as a Service" distributed grid/cloud virtual computing system
  – Particularly for HPC and multi-VM scalable apps
• First adaptive virtual computing system
  – Drives virtualization mechanisms to increase the performance of existing, unmodified apps running in collections of VMs
• R. Figueiredo, P. Dinda, J. Fortes, "A Case for Grid Computing on Virtual Machines," Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS 2003), May 2003. Tech report version: August 2002.

The Illusion
• Your machines are sitting next to you.
[Diagram: the user's LAN, with VMs appearing as local machines]
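The adaptation idea behind Virtuoso — use observed inter-VM traffic to drive VM-to-host mapping — can be illustrated with a toy greedy mapper. This is a sketch under assumed inputs (a measured traffic matrix, host-pair latencies, and per-host capacities), not Virtuoso's actual patented algorithms; all names here are hypothetical.

```python
# Toy sketch of traffic-driven VM-to-host mapping (NOT Virtuoso's actual
# algorithm): place the most heavily communicating VM pairs onto the
# lowest-latency host pairs, co-locating them when capacity allows.

def greedy_map(traffic, latency, capacity):
    """traffic[(a, b)]: observed bytes/sec between VMs a and b.
    latency[(h1, h2)]: measured latency between hosts (keys sorted;
    latency[(h, h)] == 0 represents co-location on one host).
    capacity[h]: how many VMs host h can hold.
    Assumes total capacity is sufficient for all VMs."""
    placement = {}
    load = {h: 0 for h in capacity}
    # Consider VM pairs in order of decreasing traffic volume.
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for vm in (a, b):
            if vm in placement:
                continue
            # If the partner VM is already placed, prefer hosts close
            # to (ideally identical to) the partner's host.
            anchor = placement.get(b if vm == a else a)
            def cost(h):
                if anchor is None:
                    return 0
                return latency[tuple(sorted((anchor, h)))]
            free = [h for h in capacity if load[h] < capacity[h]]
            best = min(free, key=cost)
            placement[vm] = best
            load[best] += 1
    return placement

# Heavy vm1<->vm2 traffic: the mapper co-locates them on one host,
# and the lightly communicating vm3 spills to the other host.
demo = greedy_map({("vm1", "vm2"): 100, ("vm2", "vm3"): 10},
                  {("h1", "h1"): 0, ("h2", "h2"): 0, ("h1", "h2"): 5},
                  {"h1": 2, "h2": 2})
```

A real system would also re-run such a mapper periodically as the monitored traffic matrix changes, which is the "continuously adapts" behavior the Virtuoso slides describe.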
Virtuoso: A Virtualized Computing Infrastructure
• Providers sell computational and communication bandwidth
• Users run collections of virtual machines (VMs) interconnected by overlay networks
• A replacement for buying machines
• Continuously adapts to increase the performance of your existing, unmodified applications and operating systems
• See virtuoso.cs.northwestern.edu for many papers, talks, and movies

Core Results (Patent Covered)
• Monitor application traffic: use the application's own traffic to automatically and cheaply produce a view of the application's network and CPU demands and of parallel load imbalance
• Monitor the physical network: use the application's own traffic to automatically and cheaply probe it, then use the probes to produce characterizations
• Formalize performance optimization problems in clean, simple ways that facilitate understanding their asymptotic difficulty
• Adapt the application to the network to make it run faster or more cost-effectively, with algorithms that use monitoring information to drive mechanisms such as VM-to-host mapping, scheduling of VMs, and overlay network topology and routing
• Adapt the network to the application through automatic reservations of CPU (including gang scheduling) and optical network paths
• Transparently add network services to unmodified applications and OSes to fix design problems

Palacios VMM (http://www.v3vee.org/palacios)
• OS-independent embeddable virtual machine monitor
• Developed at Northwestern and the University of New Mexico
  – Dinda is project lead
  – His student Jack Lange is lead Ph.D.
    student and development lead
• Open source and freely available
  – Downloaded over 1000 times as of July
• Users:
  – Kitten: lightweight supercomputing OS from Sandia National Labs
  – MINIX 3
  – Modified Linux versions
• Successfully used on supercomputers, clusters (InfiniBand and Ethernet), and commodity servers

Palacios as an HPC VMM
• Minimalist host OS interface
  – Suitable for a lightweight kernel (LWK) or a type-I VMM
• Compile-time and runtime configurability
  – Create a VMM tailored to specific environments
• Low noise
• Contiguous memory pre-allocation
• Passthrough resources and resource partitioning

HPC Performance Evaluation
• Virtualization is very useful for HPC, but only if it doesn't hurt performance
• Virtualized Red Storm with Palacios
  – Evaluated with Sandia's system evaluation benchmarks
  – Red Storm: 17th-fastest supercomputer, a Cray XT3 with 38,208 cores, ~3,500 sq ft, 2.5 megawatts, $90 million

Large Scale Study
• Evaluation on the full Red Storm system
  – 12 hours of dedicated system time on the full machine
  – Largest virtualization performance scaling study to date
• Measured performance at exponentially increasing scales
  – Up to 4096 nodes
• Publicity: New York Times, Slashdot, HPCWire, Communications of the ACM, PC World

Scalability at Large Scale (Catamount guest OS)
• Virtualized performance within 3% of native, and scalable
• CTH: multi-material, large deformation, strong shockwave simulation

PLT Scheme
• Most widely used Scheme implementation
• ~400 downloads per day

Scheme with Futures: Incremental Parallelization of Sequential Run-time Systems
• Adding parallelism to a large, sequential C code base is nearly impossible
• Runtime systems have a special structure that lends itself to an easily parallelizable "fast path"
• We were able to exploit that structure to add parallel futures to PLT Scheme
• Graphs (in the original slides) show our performance improvements on two benchmarks

NU-MineBench: Widely Used Benchmark Suite for Data Mining
http://cucis.ece.northwestern.edu/projects/DMS

MineBench Benchmark Suite
Overview

Application | Category | Description
ScalParC | Classification | Decision-tree classification
Naive Bayesian | Classification | Statistical classification
K-means | Clustering | Mean-based data partitioning method
Fuzzy K-means | Clustering | Fuzzy-logic-based data partitioning method
HOP | Clustering | Density-based grouping method
BIRCH | Clustering | Hierarchical clustering method
Eclat | ARM | Vertical database; lattice traversal techniques used
Apriori | ARM | Horizontal database; level-wise mining based on the Apriori property
Utility | ARM | Utility-based association rule mining
SNP | Classification | Hill-climbing search method for DNA dependency extraction
GeneNet | Classification | Gene relationship extraction using a microarray-based method
SEMPHY | Classification | Gene sequencing using a phylogenetic-tree-based method
Rsearch | Classification | RNA sequence search using stochastic context-free grammars
SVM-RFE | Classification | Gene expression classifier using recursive feature elimination
AFI* | ARM | Approximate frequent itemsets association rule application
GETI* | ARM | Greedy error-tolerant itemsets (ETI) association rule application
GETIpp* | ARM | Greedy ETI with strong post-processing association rule application (ARP)
RW* | ARM | Recursive weak ETI ARP
RWpp* | ARM | Recursive weak ETI with strong post-processing ARP
ParETI* | ARM | Parallel implementation of the ETI application

*Contributed by the University of Minnesota (Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries)

Uniqueness of Data Mining Apps
• Performance metrics gathered from VTune were fed into the Clementine data mining software; data for the various benchmark suites were run through Kohonen clustering:
  – Other benchmarks tend to fall into one or two clusters
  – Data mining applications span multiple clusters
  – Most importantly, mining apps have their own cluster
[Figure: Kohonen cluster assignments (cluster numbers 0-11) for applications from SPEC INT, SPEC FP, MediaBench, TPC-H, and MineBench]

Ono and NEWS
• Northwestern's Ono collects and shares the perspective of one million BitTorrent peers worldwide
• NEWS provides warnings of network problems or neutrality violations based on 50,000 peers worldwide

P2P as a CDN (Akamaizing BitTorrent)
Apply CDN design principles to P2P:
• Closest-node selection
• Controlled content replication

P2P as a CDN: Effects
• Dramatic reduction of inter-AS traffic
[Figure: inter-AS traffic under Baseline, AS-biased, and P2P-CDN peer selection across ISPs including British Telecom, Shaw Communications (Canada), Telecom Italia, France Telecom, and Easynet UK]

EECS 395/495: Networking Problems in Cloud Computing (offered this quarter)
• Provides a solid survey of cloud computing research
  – New applications and requirements
  – Datacenter architectures
  – Novel security issues/solutions
  – Performance issues
  – TCP incast
  – Energy efficiency

Related Pedagogy
• Networking problems in cloud computing (this quarter)
• Data-intensive computing (this quarter)
• Resource virtualization course since 2004
  – Smaller version in our professional master's program
• Palacios VMM-focused OSDI course for grads and undergrads since 2008
• Distributed systems courses at undergrad and grad levels
• Parallel computing course since the '80s
• Wide range of undergrad and grad courses in systems/computer engineering areas: networking, databases, compilers, operating systems, architecture, security, etc.

VLab
• Virtual computing environment supporting education in experimental computer systems and networks

HPDC 2010 (hpdc.org)
• June 20-25, downtown Chicago
• Main conference, 8 workshops, OGF meeting

For More Information
• Northwestern EECS – http://www.eecs.northwestern.edu
• Computer Engineering and Systems – http://ces.eecs.northwestern.edu/
• Prescience Lab – http://presciencelab.org
• Peter Dinda – http://pdinda.org
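The "closest node selection" principle from the P2P-as-CDN slides — which drives the inter-AS traffic reduction shown there — can be sketched as a biased peer-ranking rule. This is an illustration under assumed inputs (peer AS numbers and RTT estimates), not Ono's actual mechanism; all names are hypothetical.

```python
# Sketch of CDN-style peer selection for P2P (NOT Ono's actual code):
# prefer peers in our own AS (cutting inter-AS traffic), then break
# ties by estimated round-trip time, so the "closest" peers win.

def select_peers(candidates, my_asn, asn, rtt, k):
    """candidates: iterable of peer IDs.
    my_asn: our autonomous system number.
    asn[p]: AS number of peer p; rtt[p]: estimated RTT to p (ms).
    Returns the k best peers: same-AS peers first, nearest first."""
    return sorted(candidates, key=lambda p: (asn[p] != my_asn, rtt[p]))[:k]

peers = ["p1", "p2", "p3", "p4"]
asn = {"p1": 100, "p2": 200, "p3": 100, "p4": 200}
rtt = {"p1": 50.0, "p2": 5.0, "p3": 80.0, "p4": 12.0}
print(select_peers(peers, 100, asn, rtt, 3))  # ['p1', 'p3', 'p2']
```

In practice a system like Ono infers "closeness" indirectly (for example, from which CDN replicas peers are redirected to) rather than from raw RTT alone, but the ranking step has this shape.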