PARALLEL PROGRAMMING
Pingpeng Yuan

PARALLEL PROGRAMMING
• What
• Why
• How
• Goal
• Exam

WHAT IS PARALLEL PROGRAMMING?
• Coordinating multiple processing elements to solve a problem

PARALLELISM - A SIMPLISTIC UNDERSTANDING
• Multiple tasks at once.
• Distribute work into multiple execution units.
• Two approaches
  • Data Parallelism: split the data into blocks; each computing unit processes its own block.
  • Functional or Control Parallelism: divide the problem into different tasks; each processing unit handles its own task.

WHY
• Technology Trend
• Application Needs

HUMAN ARCHITECTURE!
[Figure: growth performance versus age (5 to 45 and beyond), with regions labeled Vertical and Horizontal]

COMPUTATIONAL POWER IMPROVEMENT
[Figure: C.P.I. versus number of processors (1, 2, ...), with curves for Multiprocessor and Uniprocessor]

GENERAL TECHNOLOGY TRENDS
• Microprocessor performance increases 50% to 100% per year
• Clock frequency doubles every 3 years
• Transistor count quadruples every 3 years

CLOCK FREQUENCY GROWTH RATE (INTEL FAMILY)
• 30% per year

INTEL MANY INTEGRATED CORE (MIC)
[Figure: 32-core version of MIC]

TILERA’S 100 CORES (JUNE 2011)
• Tilera has introduced a range of processors (64-bit Gx family: 36 cores, 64 cores and 100 cores), aiming to take on Intel in servers that handle high-throughput web applications
• 64-bit cores running up to 1.5GHz
• Manufactured in 40nm technology

….
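
The two approaches named on the "Parallelism - a simplistic understanding" slide can be made concrete with a minimal sketch. This is an illustration, not course material: the function names (chunked, sum_of_squares, data_parallel_sum_of_squares, functional_parallel_stats) and the choice of workload are assumptions made here. Threads from Python's standard concurrent.futures module are used only to keep the sketch short; in standard CPython, threads generally do not speed up CPU-bound work, so a real program would more likely use processes or a dedicated parallel framework.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into n_chunks contiguous blocks of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    blocks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` blocks get one additional element.
        end = start + size + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def sum_of_squares(block):
    """The same operation, applied independently to one data block."""
    return sum(x * x for x in block)


def data_parallel_sum_of_squares(data, n_workers=4):
    """Data parallelism: every worker runs the SAME function on a DIFFERENT block."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunked(data, n_workers))
        # Combine the per-block results into the final answer.
        return sum(partial_sums)


def functional_parallel_stats(data):
    """Functional parallelism: each worker runs a DIFFERENT task on the same input."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    numbers = list(range(1, 11))
    print(data_parallel_sum_of_squares(numbers))
    print(functional_parallel_stats(numbers))
```

In the data-parallel function the work is divided by splitting the input, so the number of workers can grow with the data size. In the functional-parallel function the work is divided by task, so the available parallelism is bounded by the number of distinct tasks (three here).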
TOP500: NUMBER OF CORES
[Figure: number of cores of the No. 1 system on the Top500 list, Jun-93 to Jun-11; y-axis from 0 to 600,000; annotated "Paradigm Change in HPC"]

GPU ARCHITECTURE
• NVIDIA Fermi, 512 Processing Elements (PEs)

THE GAP BETWEEN CPU AND GPU
ref: Tesla GPU Computing Brochure

GPU WILL TOP THE LIST IN NOV 2010

TRANSISTOR COUNT GROWTH RATE (INTEL FAMILY)
• Transistor count grows much faster than clock rate
  • 40% per year, order of magnitude more contribution in 2 decades

HOW TO USE MORE TRANSISTORS
• Improve single-threaded performance via architecture
  • Not keeping up with potential given by technology
• Use transistors for memory structures to improve data locality
• Use parallelism
  • Instruction-level
  • Thread-level

SIMILAR STORY FOR STORAGE (TRANSISTOR COUNT)

TRENDS IN DRAM CAPABILITIES
• DRAM densities to double every 3 years
• Projections for DRAM densities revised downwards over time
• Current densities at 4Gb/die
[Figures: DRAM I/O rate (Gb/s) and DRAM density (Gbits/die) by year. Source: ITRS ITWG]
• DRAM data rates to double every 4-5 years
• Projections for DRAM data rates revised upwards over time
• Current data rates at 2.2 Gb/s

SIMILAR STORY FOR STORAGE
• The gap between memory capacity and memory access speed is even more pronounced
  • From 1980 to 1995 memory capacity grew 1000x, growing 50% per year
  • Latency fell by only 3% per year (only 2x from 1980-95)
  • Memory bandwidth increased 2x
• Processors get faster and memory gets larger, so memory becomes relatively slower
  • Need to transfer more data in parallel
  • Need more levels of cache

MEMORY HIERARCHY
Level               Capacity    Access time
CPU registers       100 bytes   < 1 ns
L1 cache            32KB        1 ns
L2 cache            256KB       4 ns
Primary Memory      1GB         60 ns
Secondary Storage   1TB         10 ms
Tertiary Storage    1PB         1s-1hr
• Each level can be viewed as a cache for the level below it

SIMILAR STORY FOR STORAGE
• Parallelism increases the efficiency of each level without increasing its access time
• Parallelism and locality apply inside the memory system as well
  • Memory chips fetch many bits at once, then pipeline them over a narrower channel
  • Buffers hold the most recently accessed data

DISK TRENDS
• Disks too: parallel disks plus caching
• Disk capacity, 1975-1989
  • doubled every 3+ years
  • 25% improvement each year
  • factor of 10 every decade
  • Still exponential, but far less rapid than processor performance
• Disk capacity, 1990-recently
  • doubling every 12 months
  • 100% improvement each year
  • factor of 1000 every decade
  • Capacity growth 10x as fast as processor performance!

DISK TRENDS
• Only a few years ago, we purchased disks by the megabyte
• Today, 1 GB (a billion bytes) costs $1 → $0.50 → $0.05 from Dell
  • => 1 TB costs $1K → $500 → $50, 1 PB costs $1M → $500K → $50K
• Technology is amazing
  • Flying a 747 6” above the ground
  • Reading/writing a strip of postage stamps

IN SUMMARY: RAPID GROWTH
• Processor speed
• Storage capacity
• The gap between bandwidth on one side and latency and clock frequency on the other
• Parallelism is the inevitable trend in the development of computer architecture

COMMODITY COMPUTER SYSTEMS
• 1946 → 2003: General-purpose computing: serial. 5 kHz → 4 GHz.
• 2004: General-purpose computing goes parallel. Clock frequency growth flat.
  • #Transistors/chip, 1980 → 2011: 29K → 30B!
  • #”cores”: ~d^(y-2003)
• If you want your program to run significantly faster … you’re going to have to parallelize it

DRIVERS OF PARALLEL COMPUTING: APPLICATION NEEDS
ref: http://www.nvidia.com/object/tesla_computing_solutions.html

APPLICATIONS OF PARALLEL PROCESSING

WHY DO WE NEED PARALLEL PROCESSING?
• Reasonable running time = fraction of an hour to several hours (10^3 to 10^4 s)
• In this time, a TIPS/TFLOPS machine can perform 10^15 to 10^16 operations
• Example 1: Southern oceans heat modeling (10-minute iterations)
  • 300 GFLOP per iteration × 300 000 iterations per 6 yrs ~ 10^16 FLOP
• Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice)
  • 10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP
• Example 3: Monte Carlo simulation of nuclear reactor
  • 10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP
• Decentralized supercomputing (from Mathworld News, 2006/4/7):
  • A grid of tens of thousands of networked computers discovers 2^30 402 457 - 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)

THE BIG DATA ERA
• According to an IDC report, the world's total data volume was 2.7 ZB in 2012 and is projected to reach 35 ZB by 2020.
• Categories of big data:
  • Internet data
  • Scientific data
  • Multimedia data
  • Industry application data, such as financial data

WHAT MAKES IT BIG DATA?
• Sources: social, blog, smart meter
• Volume
• Velocity
• Variety
• Value

NUMBERS
• How much data is there in the world?
  • 800 Terabytes, 2000
  • 160 Exabytes, 2006
  • 500 Exabytes (Internet), 2009
  • 2.7 Zettabytes, 2012
  • 35 Zettabytes by 2020
• How much data is generated in ONE day?
  • 7 TB, Twitter
  • 10 TB, Facebook
Source: Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011

BIG DATA USE CASES
• Healthcare
  • Today’s challenge: Expensive office visits
  • New data: Remote patient monitoring
  • What’s possible: Preventive care, reduced hospitalization
• Manufacturing
  • Today’s challenge: In-person support
  • New data: Product sensors
  • What’s possible: Automated diagnosis, support
• Location-Based Services
  • Today’s challenge: Based on home zip code
  • New data: Real-time location data
  • What’s possible: Geo-advertising, traffic, local search
• Public Sector
  • Today’s challenge: Standardized services
  • New data: Citizen surveys
  • What’s possible: Tailored services, cost reductions
• Retail
  • Today’s challenge: One-size-fits-all marketing
  • New data: Social media
  • What’s possible: Sentiment analysis, segmentation

HOW
• Practice is the sole criterion for testing truth

PARALLEL PROGRAMMING: COURSE STRUCTURE
• Parallel Architectures
• Parallel Algorithms
• Parallel Programming

GOAL
• Most people in the research community agree that there are at least two kinds of parallel programmers that will be important to the future of computing
  • Programmers that understand how to write software, but are naïve about parallelization and mapping to architecture
  • Programmers that are knowledgeable about parallelization and mapping to architecture, and so can achieve high performance

TEACHING PLAN
• 32 class hours in total
  • 4 hours: course introduction + architecture of parallel computing systems
  • 4 hours: fundamentals of parallel algorithms
  • 24 hours: parallel programming

ASSESSMENT REQUIREMENTS
• Grading: coursework (attendance + 1 doc) + exam (weighting 20:80)
• 1 doc
  • For a chosen technical problem in parallel computing, review the relevant solution techniques and propose improvements
  • The review should focus on the novel contributions, the existing problems, and possible next steps for research