Understanding Big Data and its Relevance to Undergraduate

Adopting Big-Data Computing Across the Undergraduate Curriculum
Bina Ramamurthy (Bina)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
This talk is partially funded by NSF grant NSF-TUES-0920335
& by AWS in Education Coursework Grant award
Symposium on Big Data Science
and Engineering
10/19/2012
Outline of the talk
• Golden era in computing
• Big Data computing curriculum
• Data-intensive/Big Data Computing Certificate program at
University at Buffalo
• Outcome Evaluation
• Important Findings
• Recommendations for adoption into undergraduate
curriculum
• Demos
• Useful links and project web page
• Questions and Answers
A Golden Era in Computing
• Heavy societal involvement
• Explosion of domain applications
• Proliferation of devices
• Wider bandwidth for communication
• Powerful multi-core processors
• Superior software methodologies
• Virtualization leveraging the powerful hardware
Top Ten Largest Databases
[Bar chart: top ten largest databases (2007). Vertical axis: terabytes, 0 to 7000. Bars: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate.]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
Top Ten Largest Databases in 2007 vs Facebook’s cluster in 2010
[Bar chart: top ten largest databases (2007) plus Facebook. Vertical axis: terabytes, 0 to 7000. Bars: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate, Facebook. Annotation on the Facebook bar: 21 petabytes in 2010.]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
Data Deluge: smallest to largest
• Bioinformatics data: from about 3.3 billion base pairs in a
human genome to a huge number of protein sequences
and the analysis of their behaviors
• The Internet: web logs, Facebook, Twitter, maps, blogs, etc.:
Analytics …
• Financial applications that analyze volumes of data for
trends and other deeper knowledge
• Health care: huge amounts of patient data, drug and
treatment data
• The universe: the Hubble Ultra Deep Field image shows thousands
of galaxies, each with billions of stars
A Different Type of Storage
The Internet introduced a new challenge in the form of web logs and web-crawler data:
large-scale, “petascale” data
• But observe that this type of data has a uniquely different characteristic from
your transactional data, such as “customer order” data or “bank account” data:
• The data type is “write once, read many” (WORM);
• Privacy-protected healthcare and patient information;
• Historical financial data;
• Other historical data
• Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management systems are needed.
• Built-in features for fault tolerance, load balancing, data transfer and
aggregation, …
• Clusters of distributed nodes for storage and computing.
• Computing is inherently parallel
Big-data Concepts
• Originated from the Google File System (GFS), a special
<key, value> store
• The Hadoop Distributed File System (HDFS) is the open-source
version of this (currently an Apache project)
• Parallel processing of the data using the MapReduce (MR)
programming model
• Challenges
o Formulation of MR algorithms
o Proper use of the features of the infrastructure (Ex: sort)
o Best practices in using MR and HDFS
• An extensive ecosystem consisting of other components such as
column-based stores (HBase, BigTable), big data warehousing
(Hive), workflow languages, etc.
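The MapReduce programming model named on this slide can be illustrated with a minimal, single-machine sketch in Python. This is not Hadoop code: the function names (`map_phase`, `reduce_phase`, `run_word_count`) and the sample input are hypothetical, and the in-memory sort only stands in for the shuffle-and-sort step that a framework would perform across a cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_phase(word, counts):
    """Reduce: sum all counts that share the same key."""
    return (word, sum(counts))


def run_word_count(lines):
    # Map every input record to intermediate <key, value> pairs.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    # Shuffle and sort: bring equal keys next to each other.
    # (Assumption: a real framework does this step between map and reduce.)
    intermediate.sort(key=itemgetter(0))
    # Reduce each group of values that share a key.
    return [
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    ]


if __name__ == "__main__":
    sample = ["big data big clusters", "data on the cloud"]  # hypothetical input
    print(run_word_count(sample))
```

The sort step matters: `groupby` only merges adjacent equal keys, which mirrors why the slide lists proper use of infrastructure features such as sort among the challenges.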
Data & Analytics
• We have witnessed an explosion in algorithmic solutions.
• “In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a
larger ox. We shouldn’t be trying for bigger computers,
but for more systems of computers.” (Grace Hopper)
• What you cannot achieve by an algorithm can be
achieved by more data.
• Big data, if analyzed right, gives you better answers:
compare the Centers for Disease Control prediction of flu with
the prediction of flu through “search” data, 2 full weeks before the
onset of flu season! (see the reference)
Cloud Computing
• The cloud is a facilitator for Big Data computing and is
indispensable in this context
• The cloud provides processors, software, operating systems,
storage, monitoring, load balancing, clusters and other
requirements as a service
• The cloud offers accessibility to Big Data computing
• Cloud computing models:
o Platform as a service (PaaS), Ex: Microsoft Azure, Google App Engine (GAE)
o Software as a service (SaaS)
o Infrastructure as a service (IaaS), Ex: Amazon Web Services (AWS)
o Services-based application programming interface (API)
Big-data Courses
• We introduced the concepts in a two-course sequence:
• Course 1 (has become a core course): CSE 486
o Foundational concepts of MapReduce and the Hadoop Distributed
File System are introduced as part of the Distributed Systems course
o The last project/lab in the Distributed Systems course is based on
MapReduce (MR) concepts, and is implemented on HDFS
• Course 2 (has become an elective course): CSE 487
o The second course focuses completely on Big-data issues,
mostly MR algorithm formulation and best practices
o Analytics on large clusters, and on the cloud
o The textbook we use for this course deals with algorithms and data
structures for MR and the best ways to leverage the parallelism in the MR
family of operations (map, reduce, combine, partition, etc.)
o Other Big Data stores such as HBase and Hive, and workflows, are
also introduced
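The roles of combine and partition in the MR family of operations can be sketched as follows. This is a single-process Python illustration under stated assumptions, not the course's lab code and not the Hadoop API: `map_with_combiner`, `partition`, and `run_job` are hypothetical names, each input split stands in for one mapper, and a CRC32 hash is just one possible stable way to assign keys to reducers.

```python
import zlib
from collections import Counter, defaultdict


def map_with_combiner(lines):
    """Map plus combine: pre-aggregate word counts locally before the shuffle."""
    local = Counter()
    for line in lines:
        local.update(line.lower().split())
    return list(local.items())


def partition(key, num_reducers):
    """Partition: choose a reducer for a key using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # One bucket of grouped partial counts per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split stands in for one mapper's input
        for word, count in map_with_combiner(split):
            buckets[partition(word, num_reducers)][word].append(count)
    # Reduce: each reducer sums the partial counts for its own keys.
    return [
        {word: sum(counts) for word, counts in bucket.items()}
        for bucket in buckets
    ]


if __name__ == "__main__":
    splits = [["a b a"], ["b c"]]  # hypothetical input splits
    print(run_job(splits))
```

The combiner reduces how many pairs cross the shuffle (one pair per distinct word per split instead of one per occurrence), and the partitioner guarantees that every occurrence of a key reaches the same reducer.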
Big-data Certificate Program
• Official name is Data-intensive Computing Certificate
o Initiated with support from the NSF TUES program
o Approved by the SUNY system in Fall 2011
o Offered by the University
o For enrolled undergraduates in any major!
• Details of the program
o CS1, CS2
o Distributed systems (CSE 486): Pre-req CS2
o Data-intensive computing systems (CSE 487): Pre-req CS2
o An elective in the discipline of choice (Ex: BIO4XY or MGS4XY)
o A capstone project applying data-intensive computing (Ex: BIO499 or
MGS499)
Evaluation
Findings
• There is high demand from students for “data-intensive” and
big data computation related courses
• The certificate program is hard for non-CSE majors
• High demand for big-data skills from employers
• Educators and administrators need to be educated about big
data (remember the times we educated people on object
orientation, Java, web-enabling, etc.)
• It is imperative that we improve the preparedness of our workforce
at all levels for Big Data skills for global competitiveness.
• Just one course or a single certificate is NOT enough: we need
continuous and repeated exposure to Big Data in various
contexts.
• It is often very hard to create and sustain a new curriculum
• How can we address these challenges?
Recommendations
• Introduce big-data concepts as an integral part of the UG curriculum
o For example, for CS: simple word count of big data in CS1, map-reduce
algorithms in CS2, cloud storage and big-table in Database Systems,
Hadoop in Distributed Systems, the entire big-data analytics in other
elective courses such as Machine Learning and Data Mining.
• Use compelling examples with real-world datasets
• Train the educators: big-data professional development for the academic core
is critical
• Expose the administrators to the use of Big Data applications/tools in all possible
areas: institutional analysis; data collected at various educational institutions
is a gold mine for macro-level analytics; “What is the trend?” “Are they
learning?”
• Train the counselors who advise high school students, and college entry-level
counselors
• Include the community colleges and four-year colleges
• Need investment from major industries (mentoring, educator days, etc.)
Demos
• Simple word count using the MR model on HDFS on the local machine
o Foundation for many algorithms such as word cloud
o Simple and easy to understand
o Project Gutenberg
• Simple co-occurrence analysis of Twitter data
o Twitter has donated the entire collection of tweets to the Library of
Congress
• Amazon MR workflow and working with AWS facilities
• Finally, a sample run of a 10-million-node tree on a compute cluster at the
Center for Computational Research (CCR) at Buffalo
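A co-occurrence analysis like the one in the Twitter demo can be sketched in the same map-and-reduce style. The slides do not say which formulation the demo used, so the version below is an assumption: it emits one key per unordered word pair within a tweet and sums the counts. The function names and the sample tweets are hypothetical, and the code runs in a single Python process rather than on HDFS.

```python
from collections import Counter
from itertools import combinations


def map_pairs(tweet):
    """Map: emit ((word_a, word_b), 1) for each distinct word pair in one tweet."""
    # Sorting gives every unordered pair a single canonical key.
    words = sorted(set(tweet.lower().split()))
    for pair in combinations(words, 2):
        yield (pair, 1)


def cooccurrence(tweets):
    """Reduce: sum the counts emitted for every pair key."""
    totals = Counter()
    for tweet in tweets:
        for pair, count in map_pairs(tweet):
            totals[pair] += count
    return totals


if __name__ == "__main__":
    sample = ["big data cloud", "big data"]  # hypothetical tweets
    for pair, count in cooccurrence(sample).most_common():
        print(pair, count)
```

Emitting one key per pair keeps the mapper simple at the cost of many intermediate pairs; grouping all neighbors of a word into one record is a common alternative when the shuffle volume becomes a bottleneck.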
Summary
• We explored the need for data-intensive or big-data
computing
• We illustrated Big Data concepts and demonstrated the
cloud capabilities through simple applications
• Data-intensive computing on the cloud is an essential and
indispensable skill for the workforce of today and
tomorrow
• University at Buffalo has implemented a SUNY-wide
Certificate Program in Data-intensive Computing
• An actionable step is to form a group of people
passionate about Big Data and work on introducing it in
their courses/projects
References & useful links
• Flu prediction reference:
J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant.
Detecting influenza epidemics using search engine query data, Nature 457, 1012-1014
(19 February 2009):
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
• Twitter and Library of Congress:
A. Watters. How Library of Congress is Building the Twitter Archive.
http://radar.oreilly.com/2011/06/library-of-congress-twitter-archive.html, last viewed
July 2012.
• Project web page for all the project material including
course description, course material, project description,
several presentations, useful links, and references
http://www.cse.buffalo.edu/faculty/bina/DataIntensive