Computational Limitations & Cloud Area Derek McAuley Big Data workshop 7 January 2015

Big data hype
Big data is an all-encompassing term for any collection of
data sets so large and complex that it becomes difficult to
process using on-hand data management tools or traditional
data processing applications.
“The rule is, jam tomorrow and jam yesterday - but never jam today.”
“It must come sometime to jam today”, Alice objected
“No it can't”, said the Queen.
“As our sample set approaches n=all..
The first big data problem
• US Census Bureau had taken 8 years to tabulate the 1880 census
• Issued a challenge in 1888 to capture and tabulate data
• Hollerith wins and gains the 1890 contract
– So IBM is born
• Punched cards in use in the census until 1950
Big science – big data
Big data – old news?
On the importance of sampling
George Gallup v. Literary digest
• 50,000 v. 2,000,000 samples
Big data dream / myth:
“The sample is the population”
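The Gallup v. Literary Digest point can be illustrated with a small simulation. This is only a sketch under invented assumptions: the two voter groups, their 30%/70% split and their support rates are hypothetical and are not taken from the 1936 polls. Only the two sample sizes come from the slide. It shows a large sample drawn from the wrong sampling frame losing to a much smaller random one.

```python
import random

def draw_voter(rng, biased):
    """Return True if one sampled voter supports candidate X."""
    # Hypothetical electorate: 30% of voters are in group A (the only group
    # the big poll can reach); the remaining 70% are in group B.
    in_group_a = True if biased else rng.random() < 0.30
    support_rate = 0.35 if in_group_a else 0.65  # assumed rates
    return rng.random() < support_rate

def poll(rng, n, biased):
    """Estimated support for X from n sampled voters."""
    return sum(draw_voter(rng, biased) for _ in range(n)) / n

rng = random.Random(2015)
true_support = 0.30 * 0.35 + 0.70 * 0.65           # 0.56 by construction
big_biased = poll(rng, 2_000_000, biased=True)     # huge, but group A only
small_random = poll(rng, 50_000, biased=False)     # small, but representative

print(f"true support      : {true_support:.3f}")
print(f"2,000,000 biased  : {big_biased:.3f}")
print(f"50,000 random     : {small_random:.3f}")
```

More samples shrink the random error of the biased poll, but not its bias: it converges ever more precisely on the wrong answer.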
Energy monitoring…
• Rollout across UK by 2020
• Utility?
– Reductions of between 5 and 15%
– No systematic examinations of why and when
• Importance of interface
– Making it usable by real users
• Importance of information displayed
– e.g. display units
– Carbon emissions and/or Costs?
– Behavioural spillover?
Privacy
Clustering doesn’t work
Daily usage and traditional demographics (e.g. age, gender,
socio-economic class, etc.) gave a correlation of 0.1!
[Figure: Demographic Profiles of Clusters. Share (0% to 100%) of each demographic group within clusters 1 to 10. Groups: Welfare Borderline, Urban Intelligence, Twilight Subsistence, Ties of Community, Symbols of Success, Suburban Comfort, Rural Isolation, Municipal Dependency, Happy Families, Grey Perspectives, Blue Collar Enterprise.]
Clustering on daily usage and projecting results onto
segmentation products (e.g. Experian Mosaic) highlight the
problem – no strong relationships exist between those groups
and actual behaviour.
Spectral analysis
• Categorize people’s energy use by the temporal patterns inherent in their daily consumption behaviours
• Identify common temporal behaviours that appear across the population.
• Identify uncommon behaviours…
• Not Big Data, Small Data
– Still personal, but less invasive
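As a rough illustration of categorizing households by temporal pattern, the sketch below takes a discrete Fourier transform of one day of hourly readings and reports the dominant number of cycles per day. The two households and their readings are hypothetical, and a naive DFT is used so that nothing beyond the standard library is needed. It is not the analysis used in the study behind these slides.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(n^2))."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        mags.append(abs(coeff) / n)
    return mags

def dominant_cycles_per_day(samples):
    """Strongest non-constant frequency, in cycles per day (bin 0 is the mean)."""
    mags = dft_magnitudes(samples)
    return max(range(1, len(mags)), key=lambda k: mags[k])

# Hypothetical hourly readings (kWh) for a single day.
hours = range(24)
# Household A: one broad evening peak around 19:00.
single_peak = [0.5 + 0.4 * math.cos(2 * math.pi * (h - 19) / 24) for h in hours]
# Household B: morning and evening peaks around 07:00 and 19:00.
double_peak = [0.5 + 0.4 * math.cos(2 * math.pi * 2 * (h - 7) / 24) for h in hours]

print(dominant_cycles_per_day(single_peak))  # one cycle per day
print(dominant_cycles_per_day(double_peak))  # two cycles per day
```

Both households have the same mean consumption, so a daily total cannot tell them apart, while the spectrum can. Only a handful of coefficients per household would need to leave the home.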
Lots of “small data”?
• Do you need all your data in one
place at one time?
– may be costly
– may be a risk
– may be difficult…
Aggregate
Analyze
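One way to read "aggregate, then analyze" is that each site reduces its own data to a small mergeable summary and only the summaries travel. The sketch below is a minimal example of that idea. The `Summary` class, its fields and the three sites are hypothetical names introduced for illustration, and the sum-of-squares variance formula is chosen for brevity rather than numerical robustness.

```python
from dataclasses import dataclass
import math

@dataclass
class Summary:
    """Mergeable sufficient statistics for the mean and variance."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @classmethod
    def of(cls, values):
        values = list(values)
        return cls(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Merging is just addition, so it can happen in any order.
        return Summary(self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def variance(self):
        # Population variance: E[x^2] - (E[x])^2.
        return self.total_sq / self.count - self.mean ** 2

# Hypothetical raw data that never leaves its own site.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
site_c = [6.0]

# Only three small summaries are sent to the aggregator.
combined = Summary.of(site_a).merge(Summary.of(site_b)).merge(Summary.of(site_c))
print(combined.count, combined.mean, round(combined.variance, 4))
```

The combined figures match what a central computation over all six values would give, without the raw data ever being in one place at one time.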
Examples…
Solve for voltages…
DAR [1]
[1] “Teletraffic Science”, ITC12. Elsevier, 1989.
Engine monitoring
• 25 sensors per engine
• high data capture rates
• low bandwidth in flight
• can’t backhaul all data
• need distributed implementation
• look for anomalies…
• what needs to be real time?
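The constraint above (high capture rate, low in-flight bandwidth) suggests filtering at the sensor and sending only what looks unusual. The following is a minimal sketch of that pattern using a rolling mean and standard deviation. The `EdgeMonitor` class, the window size, the threshold and the readings are all hypothetical, and real engine health monitoring will use far more sophisticated models.

```python
from collections import deque
import statistics

class EdgeMonitor:
    """Flags readings that sit far outside a rolling baseline."""

    def __init__(self, window=20, threshold=4.0):
        self.history = deque(maxlen=window)  # recent normal readings only
        self.threshold = threshold           # allowed deviation, in std devs

    def observe(self, value):
        """Return True if this reading should be transmitted as an anomaly."""
        anomalous = False
        if len(self.history) == self.history.maxlen:
            mean = statistics.mean(self.history)
            spread = statistics.pstdev(self.history)
            anomalous = spread > 0 and abs(value - mean) > self.threshold * spread
        if not anomalous:
            # Anomalies are kept out of the baseline so they do not inflate it.
            self.history.append(value)
        return anomalous

# Hypothetical sensor trace: a small repeating wobble with one spike.
readings = [600.0 + (i % 5) for i in range(200)]
readings[150] = 700.0

monitor = EdgeMonitor()
reported = [(i, v) for i, v in enumerate(readings) if monitor.observe(v)]
print(reported)
print(f"transmitted {len(reported)} of {len(readings)} readings")
```

Because flagged readings never enter the baseline, a genuine step change in the signal would be reported continuously. Whether that is desirable is part of the "what needs to be real time?" question.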
BIG DATA
Problem
• There’s always more data, so:
– Clearly understand the data you need to aggregate
– Determine statistics with highest relative information
content
• e.g. not the mean over time…
– Triage
• CERN have been doing it for years…
– Computation at the edge is cheap and abundant
• …and often paid for by the customer!
Prepare for success
http://www.horizon.ac.uk
Questions?
derek.mcauley@nottingham.ac.uk