CS 277, Data Mining
Exploratory Data Analysis

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

Outline
Assignment 1: Questions?
Today’s Lecture: Exploratory Data Analysis
– Analyzing single variables
– Analyzing pairs of variables
– Higher-dimensional visualization techniques
Next Lecture: Clustering and Dimension Reduction
– Dimension reduction methods
– Clustering methods

Padhraic Smyth, UC Irvine: CS 277, Winter 2014

What are the main Data Mining Techniques?
• Descriptive Methods
– Exploratory Data Analysis, Visualization
– Dimension reduction (principal components, factor models, topic models)
– Clustering
– Pattern and Anomaly Detection
– …and more
• Predictive Modeling
– Classification
– Ranking
– Regression
– Matrix completion (recommender systems)
– …and more

The Typical Data Mining Process
• Problem Definition
– Defining the Goal
– Understanding the Problem Domain
• Data Definition
– Defining and Understanding Features
– Creating Training and Test Data
• Data Exploration
– Exploratory Data Analysis
• Data Mining
– Running Data Mining Algorithms
– Evaluating Results/Models
• Model Deployment
– System Implementation and Testing
– Evaluation “in the field”
• Model in Operations
– Model Monitoring
– Model Updating

Exploratory Data Analysis: Single Variables

Summary Statistics
Mean: “center of data”
Mode: location of highest data density
Variance: “spread of data”
Skew: indication of non-symmetry
Range: max - min
Median: 50% of values below, 50% above
Quantiles: e.g., values such that 25%, 50%, 75% are smaller
Note that some of these statistics can be misleading
E.g., mean for data with 2 clusters may be in a region with zero data

Histogram of Unimodal Data
1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins
[Figure: histogram]

Histograms: Unimodal Data
100 data points from a Normal, mean 10, variance 1, with 5, 10, 30 bins
[Figure: three histograms of the same data, one per bin count]

Histogram of Multimodal Data
15000 data points simulated from a mixture of 3 Normal distributions, 300 bins
[Figure: histogram]

Histogram of Multimodal Data
15000 data points simulated from a mixture of 3 Normal distributions, 300 bins
[Figure: four histograms]

Skewed Data
5000 data points simulated from an exponential distribution, 100 bins
[Figure: histogram]

Another Skewed Data Set
10000 data points simulated from a mixture of 2 exponentials, 100 bins
[Figure: histogram]

Same Skewed Data after taking Logs (base 10)
10000 data points simulated from a mixture of 2 exponentials, 100 bins
[Figure: histogram]

What will the mean or median tell us about this data?
[Figure: histogram]

Issues with Histograms
• For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Can smooth histogram using a variety of techniques
– E.g., kernel density estimation, which avoids bins, but requires some notion of “scale”
• Histograms effectively only work with 1 variable at a time
– Difficult to extend to 2 dimensions, not possible for >2
– So histograms tell us nothing about the relationships among variables

US Zipcode Data: Population by Zipcode
[Figure: three histograms labeled K = 50, K = 500, and K = 50; two have x-axes in units of 10^4, the third has an x-axis from 0 to 5000]

Histogram with Outliers
Pima Indians Diabetes Data, From UC Irvine Machine Learning Repository
[Figure: histogram; x-axis: X values, y-axis: Number of Individuals]

Histogram with Outliers
Pima Indians Diabetes Data, From UC Irvine Machine Learning Repository
[Figure: histogram; x-axis: Diastolic Blood Pressure, y-axis: Number of Individuals; annotation: “blood pressure = 0 ?”]

Box Plots: Pima Indians Diabetes Data
Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set
[Figure: box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals]

Box Plots: Pima Indians Diabetes Data
Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set
[Figure: box plots of Body Mass Index for Healthy Individuals and Diabetic Individuals, with annotations:]
– Box = middle 50% of data, from Q1 to Q3, with Q2 (median) marked
– Upper Whisker and Lower Whisker, annotated with 1.5 x (Q3 - Q1)
– Plots all data points outside “whiskers”

Box Plots: Pima Indians Diabetes Data
[Figure: four pairs of box plots, healthy versus diabetic: Diastolic Blood Pressure, 24-hour Serum Insulin, Plasma Glucose Concentration, Body Mass Index]

Exploratory Data Analysis
Analyzing more than 1 variable at a time…

Relationships between Pairs of Variables
• Say we have a variable Y we want to predict and many variables X that we could use to predict Y
• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y
• Options?
– Linear correlation
– Scatter plot: plot Y values versus X values

Linear Dependence between Pairs of Variables
• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:
$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$$
• The covariance is a measure of how X and Y vary together.
– it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
• (Linear) Correlation = scaled covariance, varies between -1 and 1
$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)^2 \, \sum_{i=1}^{n} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$$

Correlation coefficient
• Covariance depends on ranges of X and Y
• Standardize by dividing by the standard deviations of X and Y
• Linear correlation coefficient is defined as:
$$\rho(X, Y) = \frac{\sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)^2 \, \sum_{i=1}^{n} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$$

Data Set on Housing Prices in Boston
(widely used data set in research on regression models)
 1  CRIM     per capita crime rate by town
 2  ZN       proportion of residential land zoned for lots over 25,000 ft^2
 3  INDUS    proportion of non-retail business acres per town
 4  NOX      Nitrogen oxide concentration (parts per 10 million)
 5  RM       average number of rooms per dwelling
 6  AGE      proportion of owner-occupied units built prior to 1940
 7  DIS      weighted distances to five Boston employment centres
 8  RAD      index of accessibility to radial highways
 9  TAX      full-value property-tax rate per $10,000
10  PTRATIO  pupil-teacher ratio by town
11  MEDV     Median value of owner-occupied homes in $1000's

Matrix of Pairwise Linear Correlations
Data on characteristics of Boston housing
[Figure: matrix of pairwise correlations on a scale from -1 through 0 to +1; variables: Crime Rate, Percentage of large residential lots, Industry, Nitrous oxide, Average # rooms, Proportion of old houses, Distance to employment centers, Highway accessibility, Property tax rate, Student-teacher ratio, Median house value]

Dangers of searching for correlations in high-dimensional data
Simulated 50 random Gaussian/normal data vectors, each with 100 variables
Results in a 50 x 100 data matrix
Below is a histogram of the correlation coefficients for all (100 choose 2) = 4950 pairs of variables
[Figure: histogram of correlation coefficients]
Even if data are entirely random (no dependence) there is a very high probability some variables will appear dependent just by chance.

Examples of X-Y plots and linear correlation values
[Figures: three slides of example X-Y plots; the last one contrasts Linear Dependence with Non-Linear Dependence]
Lack of linear correlation does not imply lack of dependence

[Figure: four scatter plots, DATA SET 1 to DATA SET 4, Y values versus X values]
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Guess the Linear Correlation Values for each Data Set
[Figure: the same four scatter plots]

Actual Correlation Values
[Figure: the same four scatter plots, each labeled Correlation = 0.82]
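The correlation values quoted for the four Anscombe data sets can be checked with a few lines of code. The sketch below implements the linear correlation coefficient as defined earlier in the lecture. The data values are not listed on the slides; they are the values commonly tabulated from Anscombe (1973) and should be treated as an assumption, as should the names `pearson`, `ANSCOMBE`, and `correlations`.

```python
import math

def pearson(xs, ys):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (numerator).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (used in the denominator).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Assumed data: Anscombe's quartet as commonly tabulated (not given on the slides).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Correlation of each data set, rounded to two decimals as on the slide.
correlations = {k: round(pearson(xs, ys), 2) for k, (xs, ys) in ANSCOMBE.items()}

for k, r in correlations.items():
    print(f"Data set {k}: correlation = {r}")
```

All four data sets give the same rounded value even though their scatter plots look very different, which is the point of the example.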

Summary Statistics for each Data Set
             N    Mean of X   Mean of Y   Intercept   Slope   Correlation
Data Set 1   11   9.0         7.5         3           0.5     0.82
Data Set 2   11   9.0         7.5         3           0.5     0.82
Data Set 3   11   9.0         7.5         3           0.5     0.82
Data Set 4   11   9.0         7.5         3           0.5     0.82
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

Conclusions so far?
• Summary statistics are useful…up to a point
• Linear correlation measures can be misleading
• There really is no substitute for plotting/visualizing the data

Scatter Plots
• Plot the value of one variable against the other
• Simple…but can be very informative, can reveal more than summary statistics
• For example, we can…
– See if variables are dependent on each other (beyond linear dependence)
– Detect if outliers are present
– Color-code to overlay group information (e.g., color points by class label for classification problems)

[Figure: scatter plot from US Zip code data, each point = 1 Zip code; y-axis: MEDIAN HOUSEHOLD INCOME (units = dollars, scale x 10^5); x-axis: MEDIAN PERCAPITA INCOME (scale x 10^4)]

Constant Variance versus Changing Variance
[Figure: two scatter plots]
– variation in Y does not depend on X
– variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income

Problems with Scatter Plots of Large Data
96,000 bank loan applicants
appears: later apps older; reality: downward slope (more apps, more variance)
scatter plot degrades into black smudge ...

Contour Plots (based on local density) can help
recall: (same 96,000 bank loan apps as before)
shows variance(y) with x is indeed due to horizontal skew in density
[Figure: contour plot; annotations: unimodal, skewed, skewed]

Scatter-Plot Matrices
Pima Indians Diabetes data
[Figure: scatter-plot matrix]

Another Scatter-Plot Matrix
For interactive visualization the concept of “linked plots” is generally useful, i.e., clicking on 1 or more points in 1 window and having these same points highlighted in other windows

Using Color to Show Group Information in Scatter Plots
Iris classification data set, 3 classes
Figure from www.originlab.com

Another Example with Grouping by Color
Figure from hci.stanford.edu

Outlier Detection
• Definition of an outlier?
– No precise definition
– Generally…“A data point that is significantly different to the rest of the data”
– But how do we define “significantly different”? (many answers to this…)
– Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense
[Figure: two plots, “Outliers in 1 dimension” and “Outlier in 2 dimensions” (Y VALUES versus X VALUES)]

Example 1: The Effect of Outliers on Regression
[Figure: scatter plots labeled DATA SET 1 to DATA SET 4, with two further panels titled “Least Squares Fit with the Outlier” and “Least Squares Fit without the Outlier”]

Example 2: Least Square Fitting is Sensitive to Outliers
[Figure: least squares fit with an inset plot of the cost; annotations: “16^2 cost for this one datum” and “Heavy penalty for large errors”]
Slide courtesy of Alex Ihler

More Robust Cost Functions for Training Regression Models
[Equations not recoverable from the source; the three cost functions are labeled (MSE), (MAE), and “Something else entirely, e.g.,” (Blue Line)]
Slide courtesy of Alex Ihler

L1 is more Robust to Outliers than L2
[Figure: regression fits; legend: L2, original data; L1, original data; L2, outlier data; L1, outlier data]
Slide courtesy of Alex Ihler

Detection of Outliers in Multiple Dimensions
[Figure: scatter plot of Y VALUES versus X VALUES]
• Detecting “multi-dimensional outliers” is generally difficult
• In the example above, the blue point would not look like an outlier if we plotted 1-dimensional histograms of Y or X; it only stands out in the 2d plot
• Now consider the same situation but in 3 or more dimensions

Some Advice on Outliers
• Use visualization (e.g., in 1d, 2d) to spot obvious outliers
• Use domain knowledge and known constraints
– E.g., Age in years should be between 0 and 120
• Use the model itself to help detect outliers
– E.g., in regression, data points with errors much larger than the others may be outliers
• Use robust techniques that are not overly sensitive to outliers
– E.g., median is more robust than the mean, L1 is more robust than L2, etc.
• Automated outlier detection algorithms? …not always useful
– E.g., fit probability density model to N-1 points and determine how likely the Nth point is
– May not work well in high dimensions and/or if there are multiple outliers
• In general: for large data sets you can probably assume that outliers are present and proceed with caution…

Multivariate Visualization
• Multivariate -> multiple variables
• 2 variables: scatter plots, etc.
• 3 variables:
– 3-dimensional plots
– Look impressive, but often not that useful
– Can be cognitively challenging to interpret
– Alternatives: overlay color-coding (e.g., categorical data) on 2d scatter plot
• 4 variables:
– 3d with color or time
– Can be effective in certain situations, but tricky
• Higher dimensions
– Generally difficult
– Scatter plots, icon plots, parallel coordinates: all have weaknesses
– Alternative: “map” data to lower dimensions, e.g., PCA or multidimensional scaling
– Main problem: high-dimensional structure may not be apparent in low-dimensional views

Using Icons to Encode Information, e.g., Star Plots
Each star represents a single observation. Star plots are used to examine the relative values for a single data point.
The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
Useful for small data sets with up to 10 or so variables
[Figure: star plots with spokes numbered by variable]
1 Price
2 Mileage (MPG)
3 1978 Repair Record (1 = Worst, 5 = Best)
4 1977 Repair Record (1 = Worst, 5 = Best)
5 Headroom
6 Rear Seat Room
7 Trunk Space
8 Weight
9 Length
Limitations?
– Small data sets, small dimensions
– Ordering of variables may affect perception

Another Example of Icon Plots
Figure from statsoft.com

Combining Scatter Plots and Icon Plots
Figure from statsoft.com

Chernoff Faces
• Variable values associated with facial characteristic parameters, e.g., head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
• Limitations?
– Only up to 10 or so dimensions
– Overemphasizes certain variables because of our perceptual biases

Parallel Coordinates Method
Epileptic Seizure Data
[Figure: parallel coordinates plot; one line is annotated “1 (of n) cases”]
(this case is a “brushed” one, with a darker line, to stand out from the n-1 other cases)
Interactive “brushing” is useful for seeing distinctions

More elaborate parallel coordinates example (from E. Wegman, 1999).
12,000 bank customers with 8 variables
Additional “dependent” variable is profit (green for positive, red for negative)

Interactive “Grand Tour” Techniques
• “Grand Tour” idea
– Cycle continuously through multiple projections of the data
– Cycles through all possible projections (depending on time constraints)
– Projections can be 1, 2, or 3d typically (often 2d)
– Can link with scatter plot matrices (see following example)
– Asimov (1985)
• Example on following 2 slides
– 7 dimensional physics data, color-coded by group, shown with (a) Standard scatter matrix (b) 2 static snapshots of grand tour
[Figures: two example slides]

Visualization of an email network using 2-dimensional graph drawing or “embedding”
Data from 500 researchers at Hewlett-Packard over approximately 1 year.
Various structural elements of the network are apparent

Exploratory Data Analysis
Visualizing Time-Series Data

Time-Series Data: Example 1
Historical data on millions of miles flown by UK airline passengers
…note a number of different systematic effects
[Figure: time series; annotations: Summer peaks; Summer “double peaks” (favor early or late); steady growth trend; New Year bumps]

Time-Series Data: Example 2
Data from study on weight measurements over time of children in Scotland
Experimental Study: More milk -> better health?
20,000 children: 5k raw, 5k pasteurized, 10k control (no supplement)
[Figure: mean weight vs mean age for 10k control group; y-axis: Weight, x-axis: Age]
Would expect smooth weight growth plot.
Plot shows an unexpected pattern (steps), not apparent from raw data table.
Why do the children appear to grow in spurts?

Time-Series Data: Example 3
(Google Trends) Search Query = whiskey

Time-Series Data: Example 4
(Google Trends) Search Query = NSA

Spatial Distribution of the Same Data
(Google Trends) Search Query = whiskey

Summary on Exploration/Visualization
• Always useful and worthwhile to visualize data
– human visual system is excellent at pattern recognition
– gives us a general idea of how data is distributed, e.g., extreme skew
– detect “obvious outliers” and errors in the data
– gain a general understanding of low-dimensional properties
• Many different visualization techniques
• Limitations
– generally only useful up to 3 or 4 dimensions
– massive data: only so many pixels on a screen, but subsampling is useful
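
The remark on subsampling massive data can be illustrated with a short sketch: draw a uniform random subset of the points and plot only that subset. The function name `subsample`, the target of 2,000 points, and the synthetic Gaussian data are all assumptions made for illustration; they do not come from the lecture.

```python
import random

def subsample(points, max_points, seed=0):
    """Return at most max_points items, drawn uniformly without replacement."""
    if len(points) <= max_points:
        # Small enough already: keep everything.
        return list(points)
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return rng.sample(points, max_points)

# Hypothetical example: 96,000 synthetic (x, y) pairs reduced to 2,000 for plotting.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(96_000)]
plot_points = subsample(points, 2_000)

print(len(points), "points reduced to", len(plot_points))
```

A uniform subsample keeps the overall shape of the density but may drop rare points, so outliers are better checked on the full data.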