Adopting Big-Data Computing Across the Undergraduate Curriculum
Bina Ramamurthy (Bina)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
This talk is partially funded by NSF grant NSF-TUES-0920335 and by an AWS in Education Coursework Grant award.
Symposium on Big Data Science and Engineering, 10/19/2012

Outline of the talk
• Golden era in computing
• Big Data computing curriculum
• Data-intensive/Big Data Computing Certificate program at University at Buffalo
• Outcome evaluation
• Important findings
• Recommendations for adoption into the undergraduate curriculum
• Demos
• Useful links and project web page
• Questions and answers

A Golden Era in Computing
• Heavy societal involvement
• Explosion of domain applications
• Proliferation of devices
• Wider bandwidth for communication
• Powerful multi-core processors
• Superior software methodologies
• Virtualization leveraging the powerful hardware

Top Ten Largest Databases
[Bar chart: top ten largest databases (2007), size in terabytes (axis 0 to 7000): LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/

Top Ten Largest Databases in 2007 vs. Facebook's Cluster in 2010
[Bar chart: the same ten databases (2007) plus Facebook's cluster, annotated "21 petabytes in 2010"]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world

Data Deluge: smallest to largest
• Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
• The internet: web logs, Facebook, Twitter, maps, blogs, etc.: Analytics …
• Financial applications that
analyze volumes of data for trends and other deeper knowledge
• Health care: huge amounts of patient data, drug data, and treatment data
• The universe: the Hubble Ultra Deep Field image shows thousands of galaxies, each with billions of stars

Different Type of Storage
• The Internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”
• But observe that this type of data has a uniquely different characteristic from your transactional data, such as “customer order” data or “bank account” data:
o The data type is “write once read many” (WORM)
o Privacy-protected healthcare and patient information
o Historical financial data
o Other historical data
• Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management systems.
• Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
• Clusters of distributed nodes for storage and computing.
• Computing is inherently parallel.

Big-data Concepts
• Originated from Google: the Google File System (GFS) is the special <key, value> store
• Hadoop Distributed File System (HDFS) is the open source version of this (currently an Apache project)
• Parallel processing of the data using the MapReduce (MR) programming model
• Challenges
o Formulation of MR algorithms
o Proper use of the features of the infrastructure (Ex: sort)
o Best practices in using MR and HDFS
• An extensive ecosystem consisting of other components such as column-based stores (HBase, BigTable), big data warehousing (Hive), workflow languages, etc.

Data & Analytics
• We have witnessed an explosion in algorithmic solutions.
• “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox.
We shouldn’t be trying for bigger computers, but for more systems of computers.” (Grace Hopper)
• What you cannot achieve with an algorithm can be achieved with more data.
• Big data, if analyzed right, gives you better answers: Centers for Disease Control (CDC) prediction of flu vs. prediction of flu through “search” data 2 full weeks before the onset of flu season! (see the flu prediction reference)

Cloud Computing
• The cloud is a facilitator for Big Data computing and is indispensable in this context
• The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters and other requirements as a service
• The cloud offers accessibility to Big Data computing
• Cloud computing models:
o platform (PaaS): Microsoft Azure, Google App Engine (GAE)
o software (SaaS)
o infrastructure (IaaS): Amazon Web Services (AWS)
o services-based application programming interface (API)

Big-data Courses
• We introduced the concepts in a two-course sequence:
• Course 1 (has become a core course): CSE 486
o Foundational concepts of MapReduce and the Hadoop Distributed File System are introduced as a part of the Distributed Systems course
o The last project/lab in the Distributed Systems course is based on MapReduce (MR) concepts and is implemented on HDFS
• Course 2 (has become an elective course): CSE 487
o The second course focuses completely on big-data issues, mostly MR algorithm formulation and best practices
o Analytics on large clusters and on the cloud
o The textbook we use for this course deals with algorithms and data structures for MR and the best ways to leverage the parallelism in the MR family of operations (map, reduce, combine, partition, etc.)
o Other big databases such as HBase and Hive, and workflows, are also introduced

Big-data Certificate Program
• Official name is the Data-intensive Computing Certificate
o Initiated with support from the NSF TUES program
o Approved by the SUNY system in Fall 2011
o Offered by the University
o For enrolled undergraduates in any major!
• Details of the program
o CS1, CS2
o Distributed systems (CSE 486): Pre-req CS2
o Data-intensive computing system (CSE 487): Pre-req CS2
o An elective in the discipline of choice (Ex: BIO4XY or MGS4XY)
o A capstone project applying data-intensive computing (Ex: BIO499 or MGS499)

Evaluation

Findings
• There is high demand from students for “data-intensive” and big-data computation related courses
• The certificate program is hard for non-CSE majors
• High demand for big-data skills from employers
• Educators and administrators need to be educated about big data (remember the times we educated people on object orientation, Java, web-enabling, etc.)
• It is imperative that we improve the preparedness of our workforce at all levels for Big Data skills for global competitiveness.
• Just one course or a single certificate is NOT enough: we need continuous and repeated exposure to Big Data in various contexts.
• It is often very hard to create and sustain a new curriculum
• How can we address these challenges?

Recommendations
• Introduce big-data concepts as an integral part of the UG curriculum
o For example, for CS: simple word count on big data in CS1, the MapReduce algorithm in CS2, cloud storage and BigTable in Database Systems, Hadoop in Distributed Systems, and the entire big-data analytics in other elective courses such as Machine Learning and Data Mining.
• Use compelling examples based on real-world datasets
• Train the educators: big-data professional development for the academic core is critical
• Expose the administrators to the use of Big Data applications/tools in all possible areas, such as institutional analysis: data collected at various educational institutions is a gold mine for macro-level analytics (“What is the trend?” “Are they learning?”)
• Train the counselors who advise high school students, and college entry-level counselors
• Include the community colleges and four-year colleges
• Investment is needed from major industries (mentoring, educator days, etc.)

Demos
• Simple word count using the MR model on HDFS on the local machine
o Foundation for many algorithms such as word cloud
o Simple and easy to understand
o Project Gutenberg
• Simple co-occurrence analysis of Twitter data
o Twitter has donated the entire collection of tweets to the Library of Congress
• Amazon MR workflow and working with AWS facilities
• Finally, a sample run of a 10-million-node tree on a compute cluster at the Center for Computational Research (CCR) at Buffalo

Summary
• We explored the need for data-intensive or big-data computing
• We illustrated Big Data concepts and demonstrated the cloud capabilities through simple applications
• Data-intensive computing on the cloud is an essential and indispensable skill for the workforce of today and tomorrow
• University at Buffalo has implemented a SUNY-wide Certificate Program in Data-intensive Computing
• An actionable step is to form a group of people passionate about Big Data who will work at introducing it in their courses/projects

References & useful links
• Flu prediction reference: J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant.
Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (19 February 2009). http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
• Twitter and Library of Congress: A. Watters. How Library of Congress is Building the Twitter Archive. http://radar.oreilly.com/2011/06/library-of-congress-twitter-archive.html, last viewed July 2012.
• Project web page for all the project material, including course description, course material, project description, several presentations, useful links, and references: http://www.cse.buffalo.edu/faculty/bina/DataIntensive
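
Illustrative sketch: word count in the MR style
The word-count demo listed under Demos runs on HDFS; its code is not part of these slides. As a rough, hedged illustration of the same idea, the Python sketch below imitates the map, shuffle/sort, and reduce steps entirely in memory. It does not use Hadoop, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample text are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain

# In-memory imitation of MapReduce word count. All names here are
# hypothetical; a real Hadoop job would distribute these steps over HDFS.


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Made-up sample input standing in for a large text collection.
    sample = ["big data big ideas", "data on the cloud"]
    for word, count in word_count(sample).items():
        print(word, count)
```

Keeping the mapper and reducer as separate small functions mirrors how an MR algorithm is formulated: each function sees only one line or one key at a time, which is what lets a cluster run many copies in parallel.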