Big Data
a small introduction
Prabhakar TV
IIT Kanpur, India
tvp@iitk.ac.in
Much of this content is generously borrowed
from all over the Internet
Let us start with a story
 Can we predict that a customer is expecting a baby?
2
“As Pole’s (statistician at Target) computers crawled
through the data, he was able to identify about 25
products that, when analyzed together, allowed him to
assign each shopper a “pregnancy prediction” score.
More important, he could also estimate her due date to
within a small window, so Target could send coupons
timed to very specific stages of her pregnancy”
3
What is Big Data?
 How big is Big?
 Constantly moving target
 More than 100 petabytes in 2012
4
Big in What?
 Big in Volume
 Big in Velocity
 Big in Variety
5
Big Data
Dimensions
Michael Schroeck et al.
IBM Executive Report
Analytics: The real-world use of big data
6
Big Data Dimensions – add more Vs
Michael Schroeck et al.
IBM Executive Report
Analytics: The real-world use of big data
7
Gartner’s Definition
 "Big data are high-volume, high-velocity and high-variety information assets that require new forms of processing to enable
 enhanced decision making,
 insight discovery and
 process optimization."
8
Big data can be very small
 Thousands of sensors in planes, power stations, trains…
 These sensors have errors (tolerances)
 Monitor everything from engine efficiency to passenger safety
 The size of the dataset is not very large – several gigabytes – but the number of permutations in the sources is very large
http://mike2.openmethodology.org/wiki/Big_Data_Definition
9
Large datasets that ain’t big
 Media streaming is generating very large volumes with
increasing amounts of structured metadata.
 Telephone calls and internet connections
 Petabytes of data, but content is extremely structured.
 Relational databases can handle well-structured data very well
10
Who coined the term
Big Data?
 Not clear
 An economist has a claim to it (Prof. Francis Diebold of the University of Pennsylvania)
 There is even a NYTimes article
http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/
11
But generally speaking …
 Originated as a tag for a class of technology with roots in high-performance computing, pioneered by Google in the early 2000s
 Includes technologies such as distributed file and database management tools, led by the Apache Hadoop project;
 big data analytic platforms, also led by Apache; and
 integration technology for exposing data to other systems and services.
12
Big data Toolkit
 A/B testing
 association rule learning
 classification
 cluster analysis
 genetic algorithms
 machine learning
 natural language processing
 neural networks
 pattern recognition
 anomaly detection
 predictive modeling
 regression
 sentiment analysis
 signal processing
 supervised and unsupervised learning
 simulation
 time series analysis
 visualisation
13
What is special about
big data processing?
14
Big Volume - Little Analytics
 Well addressed by data warehouse crowd
 Who are pretty good at SQL analytics on
 Hundreds of nodes
 Petabytes of data
From Stonebraker
15
Big Data - Big Analytics
 Complex math operations (machine learning, clustering,
trend detection, ….)
 In the market, the world of the “quants”
 Mostly specified as linear algebra on array data
16
Big Data - Big Analytics
An Example
 Consider the closing price on all trading days over the last 5 years for two stocks A and B
 What is the covariance between the two time series? (see the sketch below)
17
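The covariance computation from this slide can be sketched in a few lines of NumPy. The price arrays below are synthetic stand-ins rather than real market data; cov(A, B) is the mean of the product of the two de-meaned series.

```python
# Covariance of two stock price time series -- a minimal NumPy sketch.
# The arrays below are synthetic stand-ins for ~5 years of daily closing prices.
import numpy as np

rng = np.random.default_rng(0)
n_days = 1250                      # ~250 trading days/year * 5 years
prices_a = 100 + rng.standard_normal(n_days).cumsum()   # hypothetical stock A
prices_b = 50 + rng.standard_normal(n_days).cumsum()    # hypothetical stock B

# cov(A, B) = mean((A - mean(A)) * (B - mean(B)))
cov_ab = np.mean((prices_a - prices_a.mean()) * (prices_b - prices_b.mean()))

# np.cov returns the full 2x2 covariance matrix; entry [0, 1] is cov(A, B)
print(cov_ab, np.cov(prices_a, prices_b, bias=True)[0, 1])
```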
Now Make It Interesting …
 Do this for all pairs of 4000 stocks
 The data is the following 4000 × 1000 matrix: one row per stock (S1 … S4000), one column per time point (t1 … t1000), each cell holding a closing price (see the sketch below)
Hourly data?
All securities?
18
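Numerically, doing this for all pairs is just the covariance matrix of the 4000 × 1000 price matrix. A minimal NumPy sketch with random data; swiss_rows is a hypothetical index list standing in for the Switzerland filter on the next slide.

```python
# All-pairs covariance for a 4000 x 1000 matrix of closing prices
# (rows = stocks S1..S4000, columns = times t1..t1000) -- synthetic data.
import numpy as np

rng = np.random.default_rng(1)
prices = rng.standard_normal((4000, 1000)).cumsum(axis=1)  # hypothetical price paths

# np.cov treats each row as one variable, so this yields a 4000 x 4000
# covariance matrix: entry [i, j] is the covariance of stock i and stock j.
cov_matrix = np.cov(prices)
print(cov_matrix.shape)   # (4000, 4000)

# Restricting to a subset (e.g. stocks headquartered in Switzerland) is just
# row selection before the same computation; swiss_rows is a hypothetical index list.
swiss_rows = [0, 17, 42]
print(np.cov(prices[swiss_rows]).shape)   # (3, 3)
```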
And
 Now try it for companies headquartered in Switzerland!
19
Goal of Big Data
 Good data management
 Integrated with complex analytics
20
How to manage big data?
 While big data technology may be quite advanced,
everything else surrounding it – best practices,
methodologies, organizational structures, etc. – is
nascent.
21
What is wrong with Big Data?
 End of theory
 Traditional statistics starts with a model – a distribution, say normal
 Compute mean and variance
 Here there is no a priori model – it is discovered
 Like how many clusters? (see the sketch below)
22
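To make the "how many clusters?" point concrete, here is a minimal sketch that discovers rather than assumes the cluster count, using scikit-learn's KMeans and the common inertia ("elbow") heuristic on synthetic data. This is one heuristic among many and not part of the original slides.

```python
# Discovering structure rather than assuming a model: scan candidate cluster
# counts and watch the within-cluster sum of squares ("elbow" heuristic).
# A sketch using scikit-learn; the data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Synthetic data with 3 latent groups -- in a real big-data setting we would
# not know this number in advance.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 1))   # inertia drops sharply until k reaches the true count
```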
How companies learn your secrets
 Privacy issues
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp
23
Will now talk about
 Map reduce
 Hadoop
 Big Data in India – the academic scene
24
Map reduce
Map Reduce
 Inspired by the map and reduce primitives of the Lisp programming language
 programming model for processing large data sets with
a parallel, distributed algorithm on a cluster
 Many problems can be phrased this way
 Easy to distribute across nodes
 Google has a patent!!
 Will it hurt me?
26
The MapReduce Paradigm
 Platform for reliable, scalable parallel computing
 Abstracts issues of distributed and parallel environment
from programmer.
 Runs over distributed file systems
 Google File System
 Hadoop File System (HDFS)
Adapted from S. Sudarshan, IIT
Bombay
27
MapReduce
 Consider the problem of counting the number of
occurrences of each word in a large collection of
documents
 How would you do it in parallel ?
 Solution:
 Divide documents among workers
 Each worker parses document to find all words, outputs (word,
count) pairs
 Partition (word, count) pairs across workers based on word
 For each word at a worker, locally add up counts
28
Map - Reduce
 Iterate over a large number of records
 Map: extract something of interest from each
 Shuffle and sort intermediate results
 Reduce: aggregate intermediate results
 Generate final output
29
MapReduce Programming Model
 Input: a set of key/value pairs
 User supplies two functions:
 map(k, v) → list(k1, v1)
 reduce(k1, list(v1)) → v2
 (k1,v1) is an intermediate key/value pair
 Output is the set of (k1,v2) pairs
30
MapReduce: Execution overview
31
MapReduce: The Map Step
[Figure: the map step – input key-value pairs, e.g. (doc-id, doc-content), are passed through map to produce intermediate key-value pairs, e.g. (word, wordcount-in-a-doc).]
Adapted from Jeff Ullman’s course slides
32
MapReduce: The Reduce Step
[Figure: the reduce step – intermediate key-value pairs, e.g. (word, wordcount-in-a-doc), are grouped by key (~ SQL GROUP BY) into key-value groups, e.g. (word, list-of-wordcounts), and each group is reduced to an output pair, e.g. (word, final-count) (~ SQL aggregation).]
Adapted from Jeff Ullman’s course slides
33
Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The group-by step is done by the system on the key of the intermediate
// Emit above, and reduce is called on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
34
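The pseudo-code above can be exercised end to end on a single machine by letting a dictionary play the role of the system's group-by/shuffle step. A minimal Python sketch of the same word-count logic, not Hadoop itself:

```python
# A single-process simulation of the pseudo-code above: map over documents,
# let a dictionary play the role of the system's group-by/shuffle step,
# then reduce each (key, list-of-values) group.
from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield word, "1"

def reduce_fn(output_key, intermediate_values):
    # output_key: a word, intermediate_values: a list of counts (as strings)
    return str(sum(int(v) for v in intermediate_values))

documents = {"doc1": "big data is big", "doc2": "data about data"}

groups = defaultdict(list)                 # the "group by" done by the system
for name, contents in documents.items():
    for key, value in map_fn(name, contents):
        groups[key].append(value)

for word, values in groups.items():
    print(word, reduce_fn(word, values))   # e.g. big 2, data 3, ...
```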
Distributed Execution Overview
[Figure: the user program forks a master and worker processes; the master assigns map and reduce tasks; map workers read input splits (Split 0, Split 1, Split 2) from the distributed file system and write intermediate results to local disk; reduce workers perform remote reads and sort by key, then write the output files (Output File 0, Output File 1).]
From Jeff Ullman’s course slides
35
Map Reduce vs. Parallel
Databases
 Map Reduce widely used for parallel processing
 Google, Yahoo, and 100’s of other companies
 Example uses: compute PageRank, build keyword indices, do
data analysis of web click logs, ….
 Database people say: but parallel databases have
been doing this for decades
 Map Reduce people say:
 we operate at scales of 1000’s of machines
 We handle failures seamlessly
 We allow procedural code in map and reduce and allow data of
any type
36
Implementations
 Google
 Not available outside Google
 Hadoop
 An open-source implementation in Java
 Uses HDFS for stable storage
 Aster Data
 Cluster-optimized SQL Database that also implements
MapReduce
 And several others, such as Cassandra at
Facebook, ..
37
Reading
 Jeffrey Dean and Sanjay Ghemawat, MapReduce:
Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html
 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak
Leung, The Google File System,
http://labs.google.com/papers/gfs.html
38
Map reduce in English
 Map: In this phase, a User Defined Function (UDF), also called Map, is executed on each record in a given file. The file is typically striped across many computers, and many processes (called Mappers) work on the file in parallel. The output of each call to Map is a list of <KEY, VALUE> pairs.
 Shuffle: This is a phase that is hidden from the programmer. All the <KEY, VALUE> pairs are sent to another group of computers, such that all <KEY, VALUE> pairs with the same KEY go to the same computer, chosen uniformly at random from this group, and independently of all other keys. At each destination computer, <KEY, VALUE> pairs with the same KEY are aggregated together. So if <x, y1>, <x, y2>, …, <x, yK> are all the key-value pairs produced by the Mappers with the same key x, then at the destination computer for key x these get aggregated into one large <KEY, VALUE> pair <x, {y1, y2, …, yK}>; observe that there is no ordering guarantee. The aggregated <KEY, VALUE> pair is typically called a Reduce Record, and its key is referred to as the Reduce Key. (A sketch of this routing step follows below.)
 Reduce: In this phase, a UDF, also called Reduce, is applied to each Reduce Record, often by many parallel processes. Each process is called a Reducer. For each invocation of Reduce, one or more records may get written into a local output file.
39
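The routing rule described under Shuffle (all pairs with the same KEY go to the same destination) is typically realised by hashing the key modulo the number of reducers. A small Python sketch of that step; the number of reducers and the mapper output here are made up for illustration.

```python
# A sketch of the shuffle step: every (KEY, VALUE) pair is routed to a
# destination chosen by hashing the key, so all pairs with the same key end
# up on the same "machine", where they are aggregated into one Reduce Record.
from collections import defaultdict

NUM_REDUCERS = 4   # hypothetical number of destination computers

def partition(key, num_reducers=NUM_REDUCERS):
    # same key -> same destination (within one run); keys spread roughly uniformly
    return hash(key) % num_reducers

mapper_output = [("x", 1), ("y", 7), ("x", 2), ("z", 5), ("x", 3)]

# destination -> {reduce key -> list of values}; no ordering guarantee on the values
reduce_records = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for key, value in mapper_output:
    reduce_records[partition(key)][key].append(value)

for dest, records in enumerate(reduce_records):
    for key, values in records.items():
        print(f"reducer {dest}: Reduce Record <{key}, {values}>")
```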
Hadoop
Why is Hadoop exciting?
 Blazing speed at low cost on commodity hardware
 Linear Scalability
 Highly scalable data store with a good parallel
programming model, MapReduce
 Doesn't solve all problems, but it is a strong solution for
many tasks.
41
What is Hadoop?
For the executives:
 Hadoop is an Apache open source software project
 Gives you value from the volume/velocity/variety of
data you have
42
What is Hadoop?
Technical managers
 An open source suite of software that
 mines the structured and unstructured BigData
43
What is Hadoop?
Legal
 An open source suite of software that is packaged and
supported by multiple suppliers.
 licensed under the Apache v2 license
44
Apache V2
A licensee of Apache License V2 software can:
 copy, modify and distribute the covered software in source and/or binary forms
 exercise patent rights that would normally only extend to the licensor
provided that:
 all copies, modified or unmodified, are accompanied by a copy of the license
 all modifications are clearly marked as being the work of the modifier
 all notices of copyright, trademark and patent rights are reproduced accurately in distributed copies
 the licensee does not use any trademarks that belong to the licensor
Furthermore, the grant of patent rights specifically is withdrawn if:
 the licensee starts legal action against the licensor(s) over patent infringements within the covered software
45
What is Hadoop?
Engineering
 A massively parallel, shared-nothing, Java-based map-reduce execution environment.
 hundreds to thousands of computers working on the
same problem, with built-in failure resilience
 Projects in the Hadoop ecosystem provide data
loading, higher-level languages, automated cloud
deployment, and other capabilities.
 Kerberos-secured software suite
46
What are the components of
Hadoop?
Two core components,
 File store called Hadoop Distributed File System
(HDFS)
 Programming framework called MapReduce
47
 HDFS
 FlumeNG
 MapReduce
 Whirr
 Hadoop Streaming
 Mahout
 Hive and Hue
 Fuse
 Pig
 Zookeeper
 Sqoop
 HBase
48
Hadoop Components
 HDFS: spreads data over thousands of nodes
 The Datanodes store your data, and the Namenode
keeps track of where stuff is stored.
49
Hadoop Components
 Pig: A higher-level programming environment to do
MapReduce coding
 Sqoop: data transfer between Hadoop and relational databases
 HBase: highly scalable key-value store
 Whirr: Cloud provisioning for Hadoop
50
51
[Figure: the Hadoop stack, bottom to top – HDFS, mappers reading 64+ MB blocks, shuffle/sort of the mapper output, reduce.]
 HDFS, the bottom layer, sits on a cluster of commodity hardware.
 For a map-reduce job, the mapper layer reads from the disks at very high speed.
 The mapper emits key-value pairs that are sorted and presented to the reducer, and
 the reducer layer summarizes the key-value pairs. (A Hadoop Streaming sketch of this mapper/reducer pair follows below.)
53
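One common way to plug your own code into this mapper/reducer pipeline is Hadoop Streaming (listed among the ecosystem components earlier), which runs any executable that reads stdin and writes stdout as a mapper or reducer. Below is a minimal word-count pair in Python; the file names mapper.py and reducer.py are hypothetical, and this is a sketch rather than production code.

```python
# mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so equal words arrive on consecutive lines:

```python
# reducer.py -- sums the counts of consecutive identical words and
# emits "word<TAB>total" for each distinct word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

In practice these two scripts would be handed to the streaming jar via its -mapper and -reducer options, with -input and -output pointing at HDFS paths.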
54
Hadoop and relational
databases?
 Hadoop integrates very well with relational databases
 Apache Sqoop
 Used for moving data between Hadoop and relational
databases
55
Some elementary references
 Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL
 How to get started with Hadoop and your favorite databases. Marty Lurie (marty@cloudera.com), Systems Engineer, Cloudera.
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
56
Bigdata India Scene
Big data India
 Will restrict myself to the academic scene
 Almost every institute has courses and researchers in
this space
 But not with the label Big Data
 Found only one ‘course’ with this title
58
Big data Toolkit
 A/B testing
 association rule learning
 classification
 cluster analysis
 genetic algorithms
 machine learning
 natural language processing
 neural networks
 pattern recognition
 anomaly detection
 predictive modeling
 regression
 sentiment analysis
 simulation
 time series analysis
 visualisation
59
Courses
 Machine Learning
 Natural Language Processing
 Data Mining
 Soft computing
 Statistics
60
MOOC on Big Data
Coursera
24 March 2013
10 weeks
Dr. Gautam Shroff
Department of Computer Science and
Engineering
Indian Institute of Technology Delhi
61
 http://www.mu-sigma.com/
 Mu Sigma, one of the world’s largest Decision Sciences and
analytics firms, helps companies institutionalize data-driven
decision making and harness Big Data
 http://www.veooz.com/
62
63
64
What is Veooz?
• pronounced as "views"
• helps you get a quick overview of, understand, and gain insights from the views/opinions expressed by users on different social media platforms like Facebook, Twitter, Google+, LinkedIn, News Sites, Blogs, ...
• Track views/opinions expressed by social media users on people, places, products, movies, events, brands …
• billions of views on millions of topics at one place
65
Goal: Organize thoughts and interactions in Social media in real
time
66
veooz: Real time Social Media Search & Analytics Engine
67
Social media is a
Good Proxy
for the
Real World
68
Social media: from monitoring, to listening, to Social Intelligence – Social is the new power …
69
Most Social Media data is Noisy
Not easy, because…
70
Noisy Data in Social Media:
 spelling errors and variations
 short forms and abbreviations
 semantic equivalents
 #HashTag mapping
 social media SPAM
 irony, sarcasm, negation
71
Why call it noise? Because it makes sentiment computation and deeper text processing very hard:
 context/semantic text analysis and processing vs. text-level processing
 topic-level aggregation
 fine-grained detection of variations of a topic
 using prior global sentiment in computing current sentiment
72
[Figure: Sentiment Expression Axis – dimensions include Literal, Special Symbols, Non-Literal, Transgression, SPAM and User Engagement, covering non-opinions, opinions, intensity/graded expressions, emoticons, punctuation, grapheme stretching, abbreviations, metaphor, sarcasm, irony, oxymoron, incorrect/ill-intentioned content, reputation/influence, content, user actions, user reactions and social relations.]
73
http://www.bda2013.net/
Important Dates (Research, Tutorial, Industry):
Abstract submission deadline: June 30 2013
Paper submission deadline: July 7 2013
Notification to authors: August 23 2013
Camera ready submission: September 4 2013
74
Conferences on Big Data
75
Indian Institutes of Technology
76
Indian Institutes of Technology (IITs)
 IITs are a group of fifteen autonomous engineering and technology-oriented institutes of higher education, established and declared as Institutes of National Importance by the Parliament of India.
77
 IITs were created to train scientists and
engineers, with the aim of developing a skilled
workforce to support the economic and social
development of India after independence in 1947.
78
Original IITs
1. As a step towards this direction,
the first IIT was established in
1951, in Kharagpur (near
Kolkata) in the state of West
Bengal.
79
2. IIT Bombay was founded in
1958 at Powai, Mumbai with
assistance from UNESCO and
the Soviet Union, which
provided technical expertise.
80
3. IIT Madras is located in the city of Chennai in
Tamil Nadu. It was established in 1959 with
technical assistance from the Government of
West Germany.
81
4. IIT Kanpur was established in 1959 in the city of Kanpur, Uttar Pradesh. During its first 10 years, IIT Kanpur benefited from the Kanpur Indo-American Programme (KIAP), under which a consortium of nine US universities provided assistance.
82
5. Established as the College of Engineering in 1961 and located in Hauz Khas, it was later renamed IIT Delhi.
6. IIT Guwahati
was established in
1994 near the city
of Guwahati
(Assam) on the bank
of the
Brahmaputra River.
83
7. IIT Roorkee, originally known as the University of Roorkee, was established in 1847 as the first engineering college of the British Empire. Located in Uttarakhand, the college was renamed The Thomson College of Civil Engineering in 1854. It became the first technical university of India in 1949, when it was renamed the University of Roorkee, and it was included in the IIT system in 2001.
84
New IITs
1. Patna (Bihar)
2. Jodhpur(Rajasthan)
3. Hyderabad (Andhra Pradesh)
4. Mandi(Himachal Pradesh)
5. Bhubaneshwar (Orissa)
6. Indore (Madhya Pradesh)
7. Gandhinagar (Gujarat)
8. Ropar (Punjab)
85
Admission
 Admission to the undergraduate B.Tech., M.Sc., and dual degree (BT-MT) programs is through the Joint Entrance Examination (JEE)
 About 1 out of 100 applicants gets in
86
Features
• IITs receive large grants compared to other
engineering colleges in India.
• About Rs. 1,000 million per year for each IIT.
87
Features (cont.)
 The availability of resources has translated into superior infrastructure and qualified faculty at the IITs, and consequently higher competition among students to gain admission into the IITs.
88
Features (cont.)
 The government has no direct control over internal
policy decisions of IITs (such as faculty
recruitment) but has representation on the IIT
Council.
89
Features (cont.)
 IIT degrees are respected all over the world, largely due to the prestige created by very successful alumni.
90
Success story
 Other factors contributing to the success of IITs are
stringent faculty recruitment procedures and industry
collaboration.
 This combination of success factors has led to the
concept of the IIT Brand.
91
Success story (cont.)
 IIT brand was reaffirmed when the United States
House of Representatives passed a resolution
honouring Indian Americans and especially
graduates of IIT for their contributions to the
American society.
 Similarly, China also recognised the value of IITs
and has planned to replicate the model.
92
Indian Institute of Technology Kanpur
 Indian Institute of Technology, Kanpur is one of the
premier institutions established in 1959 by the
Government of India.
93
IITK (Cont.)
 “to provide meaningful education, to conduct
original research of the highest standard and to
provide leadership in technological innovation for
the industrial growth of the country”
94
IITK (Cont.)
 Under the guidance of eminent economist John
Kenneth Galbraith, IIT Kanpur was the first
Institute in India to start Computer Science
education.
 The Institute now has its own residential campus spread over 420 hectares of land.
95
Statistics
 Undergraduate: 3679
 Postgraduate: 2039
 Ph.D.: 1064
 Faculty: 351
 Research Staff: 30
 Supporting Staff: 900
 Alumni: 26900
96
Departments
 Sciences: Chemistry, Physics, Mathematics & Statistics
 Engineering: Aerospace, Bio-Sciences and Bioengineering,
Chemical, Civil, Computer Science & Engineering, Electrical,
Industrial & Management Engineering, Mechanical, Material
Science & Engineering
 Humanities and Social Sciences
 Interdisciplinary: Environmental Engineering & Management,
Laser Technology, Master of Design, Materials Science
Programme, Nuclear Engineering & Technology
97
Thank you
98