Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2a, January 28, 2014, SAGE 3101

Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
• Contact hours: Monday 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri, chenil@rpi.edu
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.

Contents
• Back to the data sources
  – Cyber
  – Human
• "Munging"
• Beginning with hypothesis -> synthesis
• Distributions…
• Scoping out analysis and model choices

Lower layers in the Analytics Stack

"Cyber Data" …

"Human Data" …

Descriptive / Inferential
• Descriptive statistics: numerical summaries of samples
  – i.e., what was observed, and its distributions
  – The "sample" may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
  – i.e., what could have been, or will be, observed in a larger population
• Going from descriptive (report) to inferential (model suggestion) is a key process in analytics
• It is very often NOT a linear process…
• Sample bias – choice and awareness
(Adapted from Marshall Ma and other sources)

Populations and samples
• A population is defined
  – We must be able to say, for every object, whether it is in the population or not
  – We must be able, in principle, to find every individual of the population
  – A geographic example of a population: all pixels in a multi-spectral satellite image
• A sample is a subset of a population
  – We must be able to say, for every object in the population, whether it is in the sample or not
  – Sampling is the process of selecting a sample from a population
• E.g. 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)

Election prediction
• Exit polls versus election results
  – Human versus cyber
• How is the "population" defined here?
• What is the sample, and how is it chosen?
  – What is described, and how is that used to predict?
  – Are results categorized? (where from, M/F, age)
• What is the uncertainty?
  – It is reflected in the "sample distribution"
  – And controlled/constrained by "sampling theory"

Bias differences between cyber and human data
• Election results and exit polls
  – What are examples of bias in election results?
  – In exit polls?

Hypothesis
• What are you exploring?
• Regular data analytics features well-defined hypotheses
  – Big Data messes that up
• E.g. stock market performance/trends versus unusual events (crash/boom):
  – Populations versus samples – which is which?
  – Why?
• E.g. election results are predictable from exit polls

Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)

Plotting these distributions
• Histograms and binning
• Getting used to log scales
• Going beyond 2-D
• More of this on Friday (in more detail)

In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!
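The populations-and-samples and histogram/binning slides above can be made concrete in a few lines. This is a minimal illustrative sketch, not course material: the class tools are R/Scipy/Matlab, the "population" here is synthetic (not the EPI spreadsheet), and only the Python standard library is used:

```python
import math
import random
import statistics

random.seed(42)

# A synthetic "population": every member is enumerable, as the slides require
population = [random.gauss(50, 10) for _ in range(5000)]

# A sample is a subset of the population; here a simple random sample
sample = random.sample(population, 100)

# Descriptive statistics: numerical summaries of what was observed
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)

# Inferential step: use the sample mean as an estimate of the population mean
population_mean = statistics.mean(population)

# Histogram binning: 10 equal-width bins across the sample's range
lo, hi = min(sample), max(sample)
width = (hi - lo) / 10
counts = [0] * 10
for x in sample:
    counts[min(int((x - lo) / width), 9)] += 1

# Log-scaled counts keep sparse tail bins visible next to the peak
log_counts = [math.log10(c + 1) for c in counts]

print(round(sample_mean, 1), round(population_mean, 1), sum(counts))
```

In R the analogous steps are sample(), mean(), median() and hist(); the available distribution families are listed at the Scipy and R links on the slide.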
Heavy-tail distributions
• Probability distributions whose tails are not exponentially bounded
• Common – the long tail… human v. cyber…
  – Figure: a few values that dominate, many more that add up, with equal areas under each part of the curve
• http://en.wikipedia.org/wiki/Heavy-tailed_distribution

Spatial example

Spatial roughness…

Compare median, mean, mode

Huh, we have Big Data?
• Why would we care about samples?
  – Let's just take it all?
• It gets messy == quality, gaps, …
• Very often it goes beyond known patterns, i.e. out of the range of previous values
  – Anyone remember the financial crisis in 2008?
• Data becomes more subjective than objective, especially human v. cyber…
• To start: let's take a look at the EPI data that you started to explore last week (cyber)

Munging
• Missing values, null values, etc.
• E.g. in EPI_data they use "--"
  – Most data applications provide built-ins for handling these; in R, "NA" is used, and functions such as is.na(var) provide powerful filtering options (we'll cover these on Friday)
• Of course, different variables are often missing "different" values
• In R, higher-order functions such as Reduce, Filter, Map, Find, Position and Negate will become your enemies and then your friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/

Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis
• I.e.
pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines; non-parametric models
• Relations – associations between/among populations
• Outcome: a model and an evaluation of its fitness for purpose

More munging
• Bad values, outliers, corrupted entries, thresholds, …
• Noise reduction – low-pass filtering, binning
• A few examples today, but the labs will bring this into view soon
• REMEMBER: when you munge you MUST record what you did (and why) and save copies of the pre- and post-operation data…

Populations within populations
• In the EPI example:
  – Geographic regions (GEO_subregion)
  – EPI_regions
  – Eco-regions (MEDC v. LEDC – know what that is?)
  – Primary industry(ies)
  – Climate region
• What would you do to start exploring?

Or, a twist – n=1 but many attributes?
• The item of interest in relation to its attributes

Summary: explore
• Going from preliminary to initial analysis…
• Determining whether one or more common distributions are involved
  – i.e. parametric statistics (assumes or asserts a probability distribution)
• Fitting that distribution
• Or NOT:
  – A hybrid approach, or
  – Non-parametric (statistics) approaches are needed – more on this to come

Models
• Assumptions are often made when considering models, e.g. treating a model as representative of the population even though it is so often derived from a sample – this should be starting to make sense (a bit)
• Two key topics:
  – N=all and the open world assumption
  – Model of the thing of interest versus model of the data (data model; structural form)
• "All models are wrong but some are useful" (generally attributed to the statistician George Box)

Conceptual, logical and physical models
• These terms are usually applied to a database; our models, however, will be mathematical, statistical, or a combination
• The concept of the model comes from the hypothesis
• The implementation of the physical model comes from the data ;-)

Art or science?
• The form of the model incorporates the hypothesis, so the hypothesis determines that "form"
• Thus it is as much art as science, because it depends both on your world view and on what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

Goodness of fit
• We cannot take models at face value; we must assess how fit they may be:
  – Chi-square tests
  – One-sided and two-sided Kolmogorov-Smirnov tests
  – Lilliefors tests
  – Ansari-Bradley tests
  – Jarque-Bera tests
• Just a preview…

Summary
• Cyber and human data: quality, uncertainty and bias
• Distributions – the common and not-so-common ones, and how cyber and human data can have distinct distributions
• How simple statistical distributions can mislead us
• Populations and samples, and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail)
• Big Data and some consequences
• Munging toward exploratory analysis
• Toward models!

Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual)
• Assignment 3: Preliminary and Statistical Analysis. Due ~week 4. 15% (15% written and 0% oral; individual)
• Assignment 4: Patterns, trends, relations: model development and evaluation. Due ~week 5. 15% (10% written and 5% oral; individual)
• Assignment 5: Term project proposal. Due ~week 6. 5% (0% written and 5% oral; individual)
• Assignment 6: Predictive and Prescriptive Analytics. Due ~week 8. 15% (15% written and 5% oral; individual)
• Term project. Due ~week 13. 30% (25% written, 5% oral; individual)

How are the software installs going?
• R / Scipy (et al.) / Matlab
• Data infrastructure
• Exercises?
• More on Friday…

Assignment 1 – how is it going?
• Choose a DA case study from a) the readings, or b) your own choice (which must be approved by me)
• Read it and provide a short written review/critique (business case, area of application, approach/methods, tools used, results, actions, benefits)
• Be prepared to discuss it in class this Friday the 31st, and hand in the written report by 5pm that day
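As a hands-on preview of the goodness-of-fit tests listed earlier, here is a minimal, standard-library-only Python sketch of the one-sample Kolmogorov-Smirnov statistic: the largest vertical gap between a sample's empirical CDF and a hypothesised CDF. The data are synthetic; in practice you would call scipy.stats.kstest or R's ks.test rather than rolling your own:

```python
import math
import random

def norm_cdf(x):
    # CDF of the standard normal, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample, cdf):
    # Largest vertical gap between the empirical CDF and the model CDF
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF steps from i/n to (i+1)/n at x
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(500)]
uniform_sample = [random.uniform(-3, 3) for _ in range(500)]

# The normal sample should fit the normal CDF far better than the uniform one
d_good = ks_statistic(normal_sample, norm_cdf)
d_bad = ks_statistic(uniform_sample, norm_cdf)
print(d_good < d_bad)
```

The other tests on that slide probe different departures from the hypothesised distribution (e.g. Lilliefors adapts KS to estimated parameters; Jarque-Bera looks at skewness and kurtosis) – more on these when we assess model fitness.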