Pang-Ning Tan - Michigan State University

Mining Big Data
Pang-Ning Tan
Associate Professor
Dept of Computer Science & Engineering
Michigan State University
Website: http://www.cse.msu.edu/~ptan
Google Trends
Big Data
Smart Cities
Big Data and Smart Cities
Outline
Smart Cities
Big Data and Its Challenges
Mining Big Data
Smart Cities
“Cities are growing steadily, and the process of urbanization is a
common trend in the world. Although cities are getting bigger, they
are not necessarily getting better… smart cities, founded on the use
of information and communication technologies, aim at tackling
many local problems, from local economy and transportation to
quality of life and e-governance.”
[Martínez-Ballesté et al. IEEE Communications 2013]
Examples of Smart Cities
E-Governance
Smart Buildings
Healthcare
Transportation
$$$$
DATA
Energy
Education
Water
Waste management
Public safety
What are the key resources needed to realize this?
Types of Data from Smart Cities
Sensor time series
Smart card
Surveillance video streams
Social media
GPS trajectories from mobile devices
Structured data
Why Mine/Analyze the Data?
The data contains useful information that can be
harnessed for various purposes:
Monitoring/surveillance
Event detection
Adaptation
Decision making
Planning
Forecasting
Etc.
Big Data: How Much Data is Out There?
Source: http://www.emc.com/leadership/digital-universe/index.htm
How much is a Zettabyte?
1 ZettaByte = 1000 ExaBytes = $10^6$ PetaBytes
= $10^9$ TeraBytes = $10^{12}$ GigaBytes
A DVD stores about 5 GB of data and its case is ~1 cm thick
1 ZettaByte ≈ $10^{21} / (5 \times 10^{9})$
= $2 \times 10^{11}$ = 200 billion DVDs to store it
Distance from Earth to the moon
= 384,000 km = $3.84 \times 10^{10}$ cm
** If you stack all the DVDs that contain 1 ZB of data, the stack is
about 3 times the distance to the moon and back
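The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted on the slide (5 GB per DVD, a 1 cm case, 384,000 km to the moon); the variable names are illustrative.

```python
# Back-of-the-envelope check of the DVD-stack estimate.
ZETTABYTE = 10**21                      # bytes in 1 ZB
DVD_CAPACITY = 5 * 10**9                # bytes per DVD (about 5 GB)
CASE_THICKNESS_CM = 1                   # thickness of one DVD case, in cm
EARTH_MOON_CM = 384_000 * 1000 * 100    # 384,000 km converted to cm

dvds = ZETTABYTE // DVD_CAPACITY        # number of DVDs needed for 1 ZB
stack_cm = dvds * CASE_THICKNESS_CM     # height of the stack in cm
round_trips = stack_cm / (2 * EARTH_MOON_CM)

print(f"{dvds:,} DVDs")
print(f"{round_trips:.1f} trips to the moon and back")
```

The script reports 200 billion DVDs and a stack of roughly 2.6 round trips, consistent with the rough figure quoted above.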
Challenges of Big Data
Volume: large amount of data that is continuously growing
Velocity: rapid streams of data collected
Variety: structured and unstructured data obtained from
(potentially) multiple data sources
Veracity: messiness or trustworthiness of the data
Value: usefulness of the data; needs a careful cost/benefit
analysis before embarking on a big data project
What is Data Mining?
A collection of computer algorithms and techniques
to automatically extract useful information from
large data repositories
Big Data Analytics Pipeline
Garbage In, Garbage Out
Quality of output information depends on quality of input data
Data Preprocessing
Helps to alleviate many of the data quality issues
Noise
Outliers
Missing values (incomplete data)
Duplicate data
Data with irrelevant attributes
Data with redundant attributes
Data of varying formats, scales, etc.
Types of Data Analysis
Simple, descriptive statistics
Mean/Median/Mode
Standard deviation/Mean absolute deviation
Quartiles, percentiles, top-k
Example: Heavy-hitter problem
Find the hot topics (e.g., trending hashtags) used over the
past 24 hours
TrendMap
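The simple descriptive statistics listed above can be computed directly with Python's standard `statistics` module. This is a minimal sketch; the `readings` values are made up for illustration, and the population standard deviation is used as one possible choice.

```python
import statistics

# Hypothetical readings, e.g. hourly counts from one sensor.
readings = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(readings)
median = statistics.median(readings)
mode = statistics.mode(readings)
std_dev = statistics.pstdev(readings)  # population standard deviation

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = sum(abs(x - mean) for x in readings) / len(readings)

# Three cut points that split the data into four equally likely groups.
quartiles = statistics.quantiles(readings, n=4)

print(mean, median, mode, std_dev, mean_abs_dev, quartiles)
```

On a small in-memory list these are trivial to compute; the slides that follow show why even counting becomes harder once the data arrives as an unbounded stream.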
Finding Hot Topics (Unbounded storage)
Data stream: 2013, discount, holiday, 2013, MSU
Memory (associative array, f): 2013 → 2, discount → 1, holiday → 1, MSU → 1
Naïve algorithm; assume storage space is unbounded
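The naïve approach can be sketched as follows, assuming the associative array `f` is an ordinary in-memory counter. The function name `count_topics` is illustrative; the stream is the one shown on the slide.

```python
from collections import Counter

def count_topics(stream):
    """Count every distinct item; memory grows with the number of distinct items."""
    f = Counter()
    for item in stream:
        f[item] += 1
    return f

stream = ["2013", "discount", "holiday", "2013", "MSU"]
f = count_topics(stream)
print(f.most_common(1))  # the single hottest topic with its count
```

The array needs one entry per distinct item, which is what makes this approach impractical when the number of distinct hashtags is very large.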
Finding Hot Topics (Limited Storage)
Data stream: 2013, discount, holiday
Memory (associative array, f): 2013 → 1, discount → 1
New item: holiday → 1 (?)
Which one to replace?
Any theoretical guarantees that solution will always be in the array?
Misra-Gries Algorithm
Data stream: 2013, discount, holiday, 2013, MSU
Memory (associative array, f): [figure showing the counters kept for 2013, discount, holiday, and MSU]
Algorithm guarantees that all “hot items” that appear more than m/(k+1)
times will be in the buffer (where m is the length of the data stream
and k is the number of buffers)
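A minimal sketch of the Misra-Gries summary follows, assuming the common formulation with k counters: when a new item arrives and no counter is free, every counter is decremented and zeroed counters are dropped. Under this formulation, any item occurring more than m/(k+1) times is still present at the end. The function name and the choice k = 2 are illustrative.

```python
def misra_gries(stream, k):
    """Summarize a stream using at most k counters."""
    f = {}
    for item in stream:
        if item in f:
            f[item] += 1
        elif len(f) < k:
            f[item] = 1
        else:
            # No free counter: decrement all counters and
            # drop the ones that reach zero.
            for key in list(f):
                f[key] -= 1
                if f[key] == 0:
                    del f[key]
    return f

stream = ["2013", "discount", "holiday", "2013", "MSU"]
summary = misra_gries(stream, k=2)
print(summary)
```

Here m = 5 and k = 2, so any item appearing more than 5/3 times must survive; "2013" appears twice and is retained. The stored counts are underestimates, and infrequent items (such as "MSU" here) may also remain, so a second pass over the data is needed if exact counts matter.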
Summary
Even simple analysis becomes harder to compute
when you have big data
Need for fast and scalable algorithms that can
produce good, approximate solutions
Advanced Data Mining Analysis
Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Ranking/Recommendation
Predictive Modeling: Classification
To infer the value of a nominal attribute based on the
values of other observed attributes
Examples:
Autonomous driving
• Traffic sign recognition
• Open lane detection
Smart Home/Building:
• Appliance identification based on electricity utilization
Predictive Modeling: Regression
To infer the value of a continuous attribute based on
the values of other observed attributes
Examples:
mHealth
• Monitoring heart rate and body temperature using wearable devices
Intelligent Transportation System
• Traffic volume prediction
Smart Building
• Electricity/Water demand prediction
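As one concrete instance of regression, the sketch below fits a straight line by ordinary least squares. The choice of a linear model, the function name `fit_line`, and the temperature/demand numbers are all assumptions made for illustration; the slides do not prescribe a particular method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: outdoor temperature (deg C) vs. electricity demand (kWh).
temps = [20, 25, 30, 35]
demand = [100, 120, 140, 160]

slope, intercept = fit_line(temps, demand)
predicted = slope * 28 + intercept  # predicted demand at 28 deg C
print(slope, intercept, predicted)
```

The continuous attribute (demand) is inferred from the observed attribute (temperature); with more attributes the same idea extends to multiple regression.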
Framework for Predictive Modeling
Labeled examples (congestion / no congestion) → Training Set → Train Model → Model
Unlabeled examples → Test Set → Model
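The framework above can be sketched with a very simple classifier. This example assumes a 1-nearest-neighbor model, where "training" only stores the labeled examples. The feature values, the `train` and `predict` helpers, and the choice of classifier are illustrative assumptions, not part of the original slides.

```python
import math

def train(training_set):
    """'Training' a 1-nearest-neighbor model just stores the labeled examples."""
    return list(training_set)

def predict(model, features):
    """Return the label of the stored example closest to `features`."""
    nearest = min(model, key=lambda example: math.dist(example[0], features))
    return nearest[1]

# Hypothetical labeled examples: (vehicles per minute, average speed km/h) -> label.
training_set = [
    ((50, 20), "congestion"),
    ((45, 25), "congestion"),
    ((10, 80), "no congestion"),
    ((15, 70), "no congestion"),
]
test_set = [(48, 22), (12, 75)]  # unlabeled examples

model = train(training_set)
predictions = [predict(model, x) for x in test_set]
print(predictions)
```

The same train-then-apply structure holds for other classifiers; only the `train` and `predict` steps change.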
Cluster Analysis
Find groups of observations such that the observations in the same
group are more similar to each other than to those in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized
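One common way to find such groups is k-means, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its cluster. The slides do not name a specific algorithm, so the sketch below is only one possible instance; the points and starting centroids are made up.

```python
import math

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Each centroid update reduces the total squared distance of points to their own centroid, which is the "intra-cluster distances are minimized" objective in this particular formulation.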
Applications of Cluster Analysis
Crime hotspot detection
GPS trajectory segmentation
Association Analysis
Extract patterns of frequently co-occurring events
Time                 Sensor ID  State
3/1/2015 07:48:05    BR1        OFF
3/1/2015 07:48:07    LR1        ON
3/1/2015 07:48:10    LR6        ON
3/1/2015 07:48:20    BT1        ON
3/1/2015 07:48:40    LR6        OFF
3/1/2015 07:49:30    BT3        ON
Weekday, 7-8am, BR2 = OFF, BR1 = OFF, LR6 = ON → LR1 = ON
Weekday, 10-11pm, BR1 = ON, BR2 = ON, LR6 = OFF → LR1 = OFF
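Rules like these are commonly scored by support (how often the whole pattern occurs) and confidence (how often the right-hand side holds when the left-hand side does). These two measures are standard in association analysis but are not named on the slide, and the snapshot data below is hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, antecedent | consequent)
    return both / support(transactions, antecedent)

# Hypothetical sensor snapshots taken during weekday 7-8am windows.
transactions = [
    {"BR1=OFF", "BR2=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=OFF", "BR2=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=OFF", "BR2=OFF", "LR6=ON", "LR1=OFF"},
    {"BR1=ON", "BR2=OFF", "LR6=OFF", "LR1=OFF"},
]
antecedent = {"BR1=OFF", "BR2=OFF", "LR6=ON"}
consequent = {"LR1=ON"}

rule_support = support(transactions, antecedent | consequent)
rule_confidence = confidence(transactions, antecedent, consequent)
print(rule_support, rule_confidence)
```

In this toy data the rule holds in 2 of 4 snapshots (support 0.5) and in 2 of the 3 snapshots where its left-hand side is true (confidence about 0.67).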
Applications of Association Analysis
Traffic Accident Analysis
Smart Health
Adverse drug interactions
Anomaly Detection
Detect significant deviations from normal observations
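A simple way to make "significant deviation" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0, the function name, and the readings below are assumptions for illustration; real detectors are usually more robust than this sketch.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical hourly water-flow readings with one burst-like spike.
flow = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(flow)
print(anomalies)
```

Only the reading of 50 is flagged. Note that a large outlier inflates the mean and standard deviation it is compared against, which is one reason median-based variants are often preferred.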
Applications of Anomaly Detection
Smart Transportation
Congestion detection
Sensor fault detection
Smart Home/Building
Water theft detection
Pipe burst detection
Ranking (Recommendation)
Given a query q, recommend items in specific rank
order based on their relevance to q
Examples:
Location-aware services
Smart home assistant
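For a location-aware service, one simple notion of relevance is proximity to the user. The sketch below ranks places by straight-line distance to a query location q; treating distance as the relevance score, along with the place names and coordinates, is an assumption made for illustration.

```python
import math

def rank_by_distance(query, items):
    """Rank (name, location) pairs by distance to the query, nearest first."""
    return sorted(items, key=lambda item: math.dist(query, item[1]))

# Hypothetical places with (x, y) coordinates in km.
places = [
    ("cafe", (2.0, 1.0)),
    ("library", (0.5, 0.5)),
    ("pharmacy", (5.0, 5.0)),
]
q = (0.0, 0.0)  # the user's current location

ranking = [name for name, _ in rank_by_distance(q, places)]
print(ranking)
```

Practical recommenders combine several signals (ratings, past behavior, context) into the relevance score, but the final step is the same: sort items by score and return them in rank order.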
Other Challenges: Privacy
http://techland.time.com/2012/02/17/how-target-knew-a-high-school-girl-was-pregnant-before-her-parents/
Other Challenges: Security
http://arstechnica.com/security/2012/12/how-an-internet-connected-samsung-tv-can-spill-your-deepest-secrets/
Summary
Mining big data is both a challenge and an opportunity
CSE Courses on Data Mining
CSE 491/891: Computational Techniques for Large-Scale Data Analysis
CSE 881: Data Mining
References
Pang-Ning Tan, Knowledge Discovery from Sensor Data, Feature Article in Sensors Magazine, March 1, 2006
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006