Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2a, January 28, 2014, SAGE 3101

Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
• Contact hours: Monday 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri, chenil@rpi.edu
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.

Contents
• Back to the data sources
  – Cyber
  – Human
• "Munging"
• Beginning with hypothesis -> synthesis
• Distributions…
• Scoping out analysis and model choices

Lower layers in the Analytics Stack

"Cyber Data" …

"Human Data" …

Descriptive / Inferential
• Descriptive statistics: numerical summaries of samples
  – i.e., what was observed, and its distributions
  – The "sample" may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
  – i.e., what could have been, or will be, observed in a larger population
• Going from descriptive (report) to inferential (model suggestion) is a key process in analytics
• It is very often NOT a linear process…
• Sample bias – choice and awareness
(Adapted from Marshall Ma and other sources)

Populations and samples
• A population is defined
  – We must be able to say, for every object, whether it is in the population or not
  – We must be able, in principle, to find every individual of the population
  – A geographic example of a population: all pixels in a multi-spectral satellite image
• A sample is a subset of a population
  – We must be able to say, for every object in the population, whether it is in the sample or not
  – Sampling is the process of selecting a sample from a population
• E.g. 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)

Election prediction
• Exit polls versus election results
  – Human versus cyber
• How is the "population" defined here?
• What is the sample, and how is it chosen?
  – What is described, and how is that used to predict?
  – Are results categorized? (where from, M/F, age)
• What is the uncertainty?
  – It is reflected in the "sample distribution"
  – And controlled/constrained by "sampling theory"

Bias differences between cyber and human data
• Election results and exit polls
  – What are examples of bias in election results?
  – In exit polls?

Hypothesis
• What are you exploring?
• Regular data analytics features well-defined hypotheses
  – Big Data messes that up
• E.g. stock market performance/trends versus unusual events (crash/boom):
  – Populations versus samples – which is which?
  – Why?
• E.g. election results are predictable from exit polls

Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)

Plotting these distributions
• Histograms and binning
• Getting used to log scales
• Going beyond 2-D
• More of this on Friday (in more detail)

In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!
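The populations-and-samples and histogram/binning slides above can be made concrete in a few lines. This is a minimal illustrative sketch, not course material: the class tools are R/Scipy/Matlab, the "population" here is synthetic (not the EPI spreadsheet), and only the Python standard library is used:

```python
import math
import random
import statistics

random.seed(42)

# A synthetic "population": every member is enumerable, as the slides require
population = [random.gauss(50, 10) for _ in range(5000)]

# A sample is a subset of the population; here a simple random sample
sample = random.sample(population, 100)

# Descriptive statistics: numerical summaries of what was observed
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)

# Inferential step: use the sample mean as an estimate of the population mean
population_mean = statistics.mean(population)

# Histogram binning: 10 equal-width bins across the sample's range
lo, hi = min(sample), max(sample)
width = (hi - lo) / 10
counts = [0] * 10
for x in sample:
    counts[min(int((x - lo) / width), 9)] += 1

# Log-scaled counts keep sparse tail bins visible next to the peak
log_counts = [math.log10(c + 1) for c in counts]

print(round(sample_mean, 1), round(population_mean, 1), sum(counts))
```

In R the analogous steps are sample(), mean(), median() and hist(); the available distribution families are listed at the Scipy and R links on the slide.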
Heavy-tail distributions
• Probability distributions whose tails are not exponentially bounded
• Common – the long tail… human v. cyber…
  – Figure: a few values that dominate, many more that add up, with equal areas under each part of the curve
• http://en.wikipedia.org/wiki/Heavy-tailed_distribution

Spatial example

Spatial roughness…

Compare median, mean, mode

Huh, we have Big Data?
• Why would we care about samples?
  – Let's just take it all?
• It gets messy == quality, gaps, …
• Very often it goes beyond known patterns, i.e. out of the range of previous values
  – Anyone remember the financial crisis in 2008?
• Data becomes more subjective than objective, especially human v. cyber…
• To start: let's take a look at the EPI data that you started to explore last week (cyber)

Munging
• Missing values, null values, etc.
• E.g. in EPI_data they use "--"
  – Most data applications provide built-ins for handling these; in R, "NA" is used, and functions such as is.na(var) provide powerful filtering options (we'll cover these on Friday)
• Of course, different variables are often missing "different" values
• In R, higher-order functions such as Reduce, Filter, Map, Find, Position and Negate will become your enemies and then your friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/

Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis
• I.e.
pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines; non-parametric models
• Relations – associations between/among populations
• Outcome: a model and an evaluation of its fitness for purpose

More munging
• Bad values, outliers, corrupted entries, thresholds, …
• Noise reduction – low-pass filtering, binning
• A few examples today, but the labs will bring this into view soon
• REMEMBER: when you munge you MUST record what you did (and why) and save copies of the pre- and post-operation data…

Populations within populations
• In the EPI example:
  – Geographic regions (GEO_subregion)
  – EPI_regions
  – Eco-regions (MEDC v. LEDC – know what that is?)
  – Primary industry(ies)
  – Climate region
• What would you do to start exploring?

Or, a twist – n=1 but many attributes?
• The item of interest in relation to its attributes

Summary: explore
• Going from preliminary to initial analysis…
• Determining whether one or more common distributions are involved
  – i.e. parametric statistics (assumes or asserts a probability distribution)
• Fitting that distribution
• Or NOT:
  – A hybrid approach, or
  – Non-parametric (statistics) approaches are needed – more on this to come

Models
• Assumptions are often made when considering models, e.g. treating a model as representative of the population even though it is so often derived from a sample – this should be starting to make sense (a bit)
• Two key topics:
  – N=all and the open world assumption
  – Model of the thing of interest versus model of the data (data model; structural form)
• "All models are wrong but some are useful" (generally attributed to the statistician George Box)

Conceptual, logical and physical models
• These terms are usually applied to a database; our models, however, will be mathematical, statistical, or a combination
• The concept of the model comes from the hypothesis
• The implementation of the physical model comes from the data ;-)

Art or science?
• The form of the model incorporates the hypothesis, so the hypothesis determines that "form"
• Thus it is as much art as science, because it depends both on your world view and on what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

Goodness of fit
• We cannot take models at face value; we must assess how fit they may be:
  – Chi-square tests
  – One-sided and two-sided Kolmogorov-Smirnov tests
  – Lilliefors tests
  – Ansari-Bradley tests
  – Jarque-Bera tests
• Just a preview…

Summary
• Cyber and human data: quality, uncertainty and bias
• Distributions – the common and not-so-common ones, and how cyber and human data can have distinct distributions
• How simple statistical distributions can mislead us
• Populations and samples, and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail)
• Big Data and some consequences
• Munging toward exploratory analysis
• Toward models!

Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual)
• Assignment 3: Preliminary and Statistical Analysis. Due ~week 4. 15% (15% written and 0% oral; individual)
• Assignment 4: Patterns, trends, relations: model development and evaluation. Due ~week 5. 15% (10% written and 5% oral; individual)
• Assignment 5: Term project proposal. Due ~week 6. 5% (0% written and 5% oral; individual)
• Assignment 6: Predictive and Prescriptive Analytics. Due ~week 8. 15% (15% written and 5% oral; individual)
• Term project. Due ~week 13. 30% (25% written, 5% oral; individual)

How are the software installs going?
• R / Scipy (et al.) / Matlab
• Data infrastructure
• Exercises?
• More on Friday…

Assignment 1 – how is it going?
• Choose a DA case study from a) the readings, or b) your own choice (which must be approved by me)
• Read it and provide a short written review/critique (business case, area of application, approach/methods, tools used, results, actions, benefits)
• Be prepared to discuss it in class this Friday the 31st, and hand in the written report by 5pm that day
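As a hands-on preview of the goodness-of-fit tests listed earlier, here is a minimal, standard-library-only Python sketch of the one-sample Kolmogorov-Smirnov statistic: the largest vertical gap between a sample's empirical CDF and a hypothesised CDF. The data are synthetic; in practice you would call scipy.stats.kstest or R's ks.test rather than rolling your own:

```python
import math
import random

def norm_cdf(x):
    # CDF of the standard normal, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample, cdf):
    # Largest vertical gap between the empirical CDF and the model CDF
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF steps from i/n to (i+1)/n at x
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(500)]
uniform_sample = [random.uniform(-3, 3) for _ in range(500)]

# The normal sample should fit the normal CDF far better than the uniform one
d_good = ks_statistic(normal_sample, norm_cdf)
d_bad = ks_statistic(uniform_sample, norm_cdf)
print(d_good < d_bad)
```

The other tests on that slide probe different departures from the hypothesised distribution (e.g. Lilliefors adapts KS to estimated parameters; Jarque-Bera looks at skewness and kurtosis) – more on these when we assess model fitness.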