Mining Big Data
Pang-Ning Tan
Associate Professor
Dept of Computer Science & Engineering
Michigan State University
Website: http://www.cse.msu.edu/~ptan

Google Trends
[Figure: Google Trends chart; series labels: Big Data, Smart Cities]

Big Data and Smart Cities

Outline
• Smart Cities
• Big Data and Its Challenges
• Mining Big Data

Smart Cities
“Cities are growing steadily, and the process of urbanization is a common trend in the world. Although cities are getting bigger, they are not necessarily getting better… smart cities, founded on the use of information and communication technologies, aim at tackling many local problems, from local economy and transportation to quality of life and e-governance.” [Martínez-Ballesté et al. IEEE Communications 2013]

Examples of Smart Cities
• E-Governance
• Smart Buildings
• Healthcare
• Transportation
• Energy
• Education
• Water
• Waste management
• Public safety
What are the key resources needed to realize this?
• Money ($$$$)
• DATA

Types of Data from Smart Cities
• Sensor time series
• Smart card
• Surveillance video streams
• Social media
• GPS trajectories from mobile devices
• Structured data

Why Mine/Analyze the Data?
The data contains useful information that can be harnessed for various purposes:
• Monitoring/surveillance
• Event detection
• Adaptation
• Decision making
• Planning
• Forecasting
• etc.

Big Data: How Much Data is Out There?
Source: http://www.emc.com/leadership/digital-universe/index.htm

How much is a Zettabyte?
1 ZettaByte = 1000 ExaBytes = $10^{6}$ PetaBytes = $10^{9}$ TeraBytes = $10^{12}$ GigaBytes
• A DVD stores about 5 GB of data, and its case is ~1 cm thick
• Storing 1 ZettaByte takes about $10^{21} / (5 \times 10^{9}) = 2 \times 10^{11}$ DVDs, i.e. 200 billion DVDs
• Distance from Earth to the moon = 384,000 km = $3.84 \times 10^{10}$ cm
• A stack of all the DVDs that hold 1 ZB of data is about $2 \times 10^{11}$ cm tall: roughly 5 times the distance to the moon, or about 2.6 times the distance to the moon and back

Challenges of Big Data
• Volume: large amount of data that is continuously growing
• Velocity: rapid streams of data collected
• Variety: structured and unstructured data obtained from (potentially) multiple data sources
• Veracity: messiness or trustworthiness of the data
• Value: usefulness of the data; a careful cost/benefit analysis is needed before embarking on a big data project

What is Data Mining?
A collection of computer algorithms and techniques to automatically extract useful information from large data repositories

Big Data Analytics Pipeline

Garbage In, Garbage Out
Quality of output information depends on quality of input data

Data Preprocessing
Helps to alleviate many of the data quality issues:
• Noise
• Outliers
• Missing values (incomplete data)
• Duplicate data
• Data with irrelevant attributes
• Data with redundant attributes
• Data of varying formats, scales, etc.

Types of Data Analysis
Simple, descriptive statistics
• Mean/Median/Mode
• Standard deviation/Mean absolute deviation
• Quartiles, percentiles, top-k
Example: Heavy-hitter problem
• Find the hot topics (e.g., trending hashtags) used over the past 24 hours
[Figure: TrendMap]

Finding Hot Topics (Unbounded Storage)
Data stream: 2013, discount, holiday, 2013, MSU
Memory (associative array, f):
  2013      2
  discount  1
  holiday   1
  MSU       1
Naïve algorithm; assumes storage space is unbounded

Finding Hot Topics (Limited Storage)
Data stream: 2013, discount, holiday
Memory (associative array, f):
  2013      1
  discount  1
  ?         (no free slot for holiday, count 1)
• Which one to replace?
• Are there any theoretical guarantees that the solution will always be in the array?
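The two settings above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `count_all` and `misra_gries` are hypothetical, the bounded version uses the decrement rule of the Misra-Gries algorithm named on the next slide, and `k` is assumed to be the number of counters that fit in memory.

```python
from collections import Counter


def count_all(stream):
    """Naive approach: one counter per distinct item (memory grows without bound)."""
    return Counter(stream)


def misra_gries(stream, k):
    """Bounded approach: keep at most k counters (Misra-Gries decrement rule)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free slot: instead of choosing one victim, decrement every
            # counter and drop the ones that reach zero. The new item is
            # discarded as well.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Sample stream taken from the slides.
stream = ["2013", "discount", "holiday", "2013", "MSU"]
exact = count_all(stream)            # exact counts, one entry per distinct item
summary = misra_gries(stream, k=2)   # at most 2 entries, counts are lower bounds
print(exact)
print(summary)
```

With k = 2 and m = 5 items, "2013" occurs twice, which is more than m/(k+1) ≈ 1.67, so it survives in the summary. Its stored count (1) is only a lower bound on the true count (2); a second pass over the data would be needed for exact counts.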
Misra-Gries Algorithm
[Figure: Misra-Gries example on the data stream 2013, discount, holiday, 2013, MSU with a bounded associative array f]
The algorithm guarantees that every “hot item” that appears more than $m/(k+1)$ times will be in the buffer, where $m$ is the length of the data stream and $k$ is the number of buffers.

Summary
• Even simple analysis becomes harder to compute when you have big data
• Need for fast and scalable algorithms that can produce good, approximate solutions

Advanced Data Mining Analysis
Data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
  11   No      Married         60K             No
  12   Yes     Divorced        220K            No
  13   No      Single          85K             Yes
  14   No      Married         75K             No
  15   No      Single          90K             Yes
• Ranking/Recommendation

Predictive Modeling: Classification
To infer the value of a nominal attribute based on the values of other observed attributes
Examples:
Autonomous driving
• Traffic sign recognition
• Open lane detection
Smart Home/Building
• Appliance identification based on electricity utilization

Predictive Modeling: Regression
To infer the value of a continuous attribute based on the values of other observed attributes
Examples:
mHealth
• Monitoring heart rate and body temperature using wearable devices
Intelligent Transportation System
• Traffic volume prediction
Smart Building
• Electricity/Water demand prediction

Framework for Predictive Modeling
[Figure labels: Labeled examples, Unlabeled examples, Congestion, No congestion, Training Set, Test Set, Train Model, Model]

Cluster Analysis
Find groups of observations such that the observations in the same group are more similar to each other than to those in other groups
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized

Applications of Cluster Analysis
• Crime hotspot detection
• GPS trajectory segmentation

Association Analysis
Extract patterns of frequently co-occurring
events

  Time               Sensor ID  State
  3/1/2015 07:48:05  BR1        OFF
  3/1/2015 07:48:07  LR1        ON
  3/1/2015 07:48:10  LR6        ON
  3/1/2015 07:48:20  BT1        ON
  3/1/2015 07:48:40  LR6        OFF
  3/1/2015 07:49:30  BT3        ON

Example patterns:
• Weekday, 7-8am, BR2 = OFF, BR1 = OFF, LR6 = ON → LR1 = ON
• Weekday, 10-11pm, BR1 = ON, BR2 = ON, LR6 = OFF → LR1 = OFF

Applications of Association Analysis
Traffic accident analysis
Smart Health
• Adverse drug interactions

Anomaly Detection
Detect significant deviations from normal observations

Applications of Anomaly Detection
Smart Transportation
• Congestion detection
• Sensor fault detection
Smart Home/Building
• Water theft detection
• Pipe burst detection

Ranking (Recommendation)
Given a query q, recommend items in a specific rank order based on their relevance to q
Examples:
• Location-aware services
• Smart home assistant

Other Challenges: Privacy
http://techland.time.com/2012/02/17/how-target-knew-a-high-school-girl-was-pregnant-before-her-parents/

Other Challenges: Security
http://arstechnica.com/security/2012/12/how-an-internet-connected-samsung-tv-can-spill-your-deepest-secrets/

Summary
Mining big data is both a challenge and an opportunity

CSE Courses on Data Mining
• CSE 491/891: Computational Techniques for Large-Scale Data Analysis
• CSE 881: Data Mining

References
• Pang-Ning Tan, Knowledge Discovery from Sensor Data, feature article in Sensors Magazine, March 1, 2006
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006