Massive Data Processing
http://net.pku.edu.cn/~course/cs402/2014
Yan Hongfei (闫宏飞)
School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University
7/1/2014

Outline
• What is MDP?
• MDP course schedule and content

Massive Data Processing
• Data-intensive information processing
  – The relevant datasets are too large to fit in memory and must be held on disk.
  – Data-intensive processing is beyond the capability of any individual machine and requires clusters.
• Big data problems
• Focus on MapReduce programming
• An entry-level course

Characteristics of Big Data
• Volume refers to the complexity of the data
  – Many small datasets with complex structure are considered big data even though they occupy little physical space.
  – Large databases that occupy a lot of storage but have a simple structure are not considered big data.
• Variety refers to the range of data structures
  – For example: text, audio, and video held as mixed-structure, semi-structured, and unstructured data.
• Velocity refers to the rate at which data is generated and analyzed
  – Some applications require real-time or near-real-time processing.
• Veracity, Value

What is MapReduce?
• Programming model for expressing distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Open-source implementation called Hadoop

Course Organization and Arrangements
• Class time
  – Tuesdays and Thursdays (starting at 8:30), Room 201, Teaching Building 3 (三教201)
  – Instructors: Yan Hongfei (闫宏飞), Peng Bo (彭博)
  – Teaching assistants: Li Sui (李睢), Jiang Han (江翰)
• Teaching components
  – Lectures, assignments, hands-on lab guidance, Q&A sessions
• Grading
  – The course is centered on assignments; grades are based on assignments and reports.
• Course website
  – Web http://net.pku.edu.cn/~course/cs402/2014
  – Group http://groups.google.com/group/cs402pku

Textbooks
• [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2013.1.
• [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 3rd edition, 2012.5.

Schedule
This schedule is tentative and subject to change without notice.
• ID: Week1, Week2, Week3, Week4
• Columns: Topics, Contents, Reading
• Introduction to MapReduce
  – Contents: Why large data? Cloud computing. Value of big data.
  – Reading: [Lin] Ch1: Introduction; [Tom] Ch1: Meet Hadoop
• MapReduce Basics
  – Contents: How do we scale up? MapReduce. HDFS.
  – Reading: [Lin] Ch2: MapReduce Basics; [Tom] Ch6: How MapReduce Works; [GFS & MapReduce Paper]
• MapReduce Program Development
  – Contents: Basic MapReduce algorithm design and design patterns
  – Reading: [Tom] Ch5: Developing a MapReduce Application; [Lin] Ch3: MapReduce Algorithm Design
• Introduction to Information Retrieval
  – Contents: Inverted index on MapReduce. Retrieval problems.
  – Reading: [Lin] Ch4: Inverted Indexing for Text Retrieval
• Graph Algorithms and MapReduce
  – Contents: Parallel breadth-first search. PageRank.
  – Reading: [Lin] Ch5: Graph Algorithms
• Topics: MapReduce Algorithm Design, Text Retrieval, Graph Algorithms

Course Registration
• Individual course registration, completed through a web browser
  – http://net.pku.edu.cn/~course/cs402/2014/regcourse.html

Thank You!
Q&A
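
Appendix: a word-count sketch of the programming model
The "What is MapReduce?" slide describes a programming model plus an execution framework. The sketch below illustrates that split with word counting, under loud assumptions: it runs in a single Python process, not on Hadoop, and the names map_fn, reduce_fn, and run_mapreduce are hypothetical helpers invented for this illustration, not part of any Hadoop API.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Mapper (hypothetical name): emit (word, 1) for each word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer (hypothetical name): sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Single-process stand-in for the execution framework (an assumption,
    not how Hadoop is implemented)."""
    # Map + shuffle: group every intermediate value under its key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    # Reduce: one reducer call per distinct key, in sorted key order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


docs = {
    "d1": "big data needs clusters",
    "d2": "big clusters process big data",
}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only the mapper and the reducer; grouping values by key is the framework's job. On a real cluster that grouping happens across machines, which this sketch does not attempt to model.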