Computational Limitations & Cloud Area

Derek McAuley
Big Data workshop
7th January 2015

Big data hype

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.

“The rule is, jam tomorrow and jam yesterday - but never jam today.”
“It must come sometime to jam today”, Alice objected.
“No it can't”, said the Queen.

“As our sample set approaches n=all..”

The first big data problem

• US Census Bureau had taken 8 years to tabulate the 1880 census
• Issues challenge in 1888 to capture and tabulate data
• Hollerith wins and gains 1890 contract
  – So IBM is born
• Punched cards in use in the census until 1950

Big science – big data

Big data – old news?

On the importance of sampling

George Gallup v. Literary Digest
• 50,000 v. 2,000,000 samples
Big data dream / myth: “The sample is the population”

Energy monitoring…

Rollout across UK by 2020
• Utility?
  – Reductions of between 5 and 15%
  – No systematic examinations of why and when
• Importance of interface
  – Making it usable by real users
• Importance of information displayed, e.g. display units
  – Carbon emissions and/or costs?
  – Behavioural spillover?
• Privacy

Clustering doesn’t work

Daily usage and traditional demographics (e.g. age, gender, socio-economic class, etc.) gave a correlation of 0.1!

[Figure: Demographic Profiles of Clusters. Share (0% to 100%) of each segment within clusters 1 to 10. Segments: Welfare Borderline, Urban Intelligence, Twilight Subsistence, Ties of Community, Symbols of Success, Suburban Comfort, Rural Isolation, Municipal Dependency, Happy Families, Grey Perspectives, Blue Collar Enterprise.]

Clustering on daily usage and projecting the results onto segmentation products (e.g. Experian Mosaic) highlights the problem: no strong relationships exist between those groups and actual behaviour.
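The projection described on the clustering slide can be sketched as a simple cross-tabulation: for each usage cluster, compute the share of households falling in each demographic segment. This is a minimal illustration, not the analysis behind the slide; the function name `cluster_profiles` and the toy data are hypothetical. If every cluster shows roughly the same segment mix, the segmentation says little about actual behaviour.

```python
from collections import Counter, defaultdict


def cluster_profiles(cluster_ids, segments):
    """Return {cluster: {segment: share}}; shares sum to 1 within each cluster."""
    counts = defaultdict(Counter)
    for cluster, segment in zip(cluster_ids, segments):
        counts[cluster][segment] += 1

    profiles = {}
    for cluster, seg_counts in counts.items():
        total = sum(seg_counts.values())
        profiles[cluster] = {seg: n / total for seg, n in seg_counts.items()}
    return profiles


# Hypothetical toy data: one usage-cluster label and one segment label per household.
cluster_ids = [1, 1, 1, 1, 2, 2, 2, 2]
segments = [
    "Happy Families", "Rural Isolation", "Happy Families", "Rural Isolation",
    "Happy Families", "Rural Isolation", "Rural Isolation", "Happy Families",
]

profiles = cluster_profiles(cluster_ids, segments)
for cluster in sorted(profiles):
    print(cluster, profiles[cluster])
```

In this toy case both clusters have an identical 50/50 segment mix, which is the pattern the slide describes: knowing the segment tells you nothing about which usage cluster a household belongs to.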
Spectral analysis

• Categorize people’s energy use by the temporal patterns inherent in their daily consumption behaviours
• Identify common temporal behaviours that appear across the population
• Identify uncommon behaviours…
• Not Big Data, Small Data
  – Still personal, but less invasive

Lots of “small data”?

• Do you need all your data in one place at one time?
  – may be costly
  – may be a risk
  – may be difficult…

[Figure labels: Aggregate, Analyze]

Examples…

Solve for voltages…
DAR [1]

[1] “Teletraffic Science”, ITC12. Elsevier, 1989.

Engine monitoring

• 25 sensors per engine
• high data capture rates
• low bandwidth in flight
• can’t backhaul all data
• need distributed implementation
• look for anomalies…
• what needs to be real time?

BIG DATA

Problem

• There’s always more data, so:
  – Clearly understand the data you need to aggregate
  – Determine statistics with highest relative information content
    • e.g. not the mean over time…
  – Triage
    • CERN have been doing it for years…
  – Computation at the edge is cheap and abundant
    • …and often paid for by the customer!

Prepare for success

http://www.horizon.ac.uk

Questions?
derek.mcauley@nottingham.ac.uk
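Backup sketch for the engine monitoring and “Problem” slides: one way to triage at the edge is to keep running statistics on the device and send back only the readings that look anomalous. This is an illustrative assumption, not the method used in any system mentioned in the talk; the class `EdgeTriage`, the 3-sigma threshold, the warm-up length and the synthetic readings are all hypothetical.

```python
import math


class EdgeTriage:
    """Keeps a running baseline locally; reports only anomalous readings."""

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0            # readings seen so far
        self.mean = 0.0       # running mean (stays on the device)
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.threshold = threshold
        self.warmup = warmup  # readings to observe before flagging anything
        self.anomalies = []

    def _std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def observe(self, t, x):
        # Compare the new reading against the baseline built from earlier readings.
        if self.n >= self.warmup:
            std = self._std()
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                self.anomalies.append((t, x))
        # Welford's online update. Note: flagged readings are still folded
        # into the baseline here, which is a simplification.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def report(self):
        # Only this small summary would need to be backhauled.
        return {"n": self.n, "anomalies": list(self.anomalies)}


# Synthetic sensor trace (hypothetical): steady readings with one spike.
readings = [100.0 + (i % 5) for i in range(50)]
readings[30] = 250.0

triage = EdgeTriage()
for t, x in enumerate(readings):
    triage.observe(t, x)

report = triage.report()
print(report)
```

The mean is used only as a local baseline; what leaves the device is the count and the exceptions, in the spirit of sending the statistics with the highest information content rather than the mean over time.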