CS 277, Data Mining
Exploratory Data Analysis
Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine
Outline
Assignment 1: Questions?
Today’s Lecture: Exploratory Data Analysis
– Analyzing single variables
– Analyzing pairs of variables
– Higher-dimensional visualization techniques
Next Lecture: Clustering and Dimension Reduction
– Dimension reduction methods
– Clustering methods
What are the main Data Mining Techniques?
• Descriptive Methods
  – Exploratory Data Analysis, Visualization
  – Dimension reduction (principal components, factor models, topic models)
  – Clustering
  – Pattern and Anomaly Detection
  – …and more
• Predictive Modeling
  – Classification
  – Ranking
  – Regression
  – Matrix completion (recommender systems)
  – …and more
The Typical Data Mining Process
• Problem Definition: defining the goal, understanding the problem domain
• Data Definition: defining and understanding features, creating training and test data
• Data Exploration: exploratory data analysis
• Data Mining: running data mining algorithms, evaluating results/models
• Model Deployment: system implementation and testing, evaluation "in the field"
• Model in Operations: model monitoring, model updating
Exploratory Data Analysis: Single Variables
Summary Statistics
Mean: “center of data”
Mode: location of highest data density
Variance: “spread of data”
Skew: indication of non-symmetry
Range: max - min
Median: 50% of values below, 50% above
Quantiles: e.g., values such that 25%, 50%, 75% are smaller
Note that some of these statistics can be misleading
E.g., the mean of data with 2 clusters may lie in a region containing no data at all
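As a concrete illustration (not from the original slides), here is a minimal Python sketch computing these statistics for a simulated sample with NumPy/SciPy; for continuous data the mode is approximated from a histogram.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=1000)       # example data

print("mean     :", np.mean(x))                      # center of data
print("variance :", np.var(x))                       # spread of data
print("skew     :", stats.skew(x))                   # indication of non-symmetry
print("range    :", np.max(x) - np.min(x))           # max - min
print("median   :", np.median(x))                    # 50% below, 50% above
print("quartiles:", np.percentile(x, [25, 50, 75]))  # 25%, 50%, 75% quantiles

# For continuous data the mode is usually estimated from a histogram:
counts, edges = np.histogram(x, bins=30)
i = np.argmax(counts)
print("approx. mode:", 0.5 * (edges[i] + edges[i + 1]))
```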
Histogram of Unimodal Data
1000 data points simulated from a Normal distribution, mean 10, variance 1, 30 bins
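A short sketch of how a figure like this (and the varying-bin-count figures on the next slide) can be reproduced; the distribution parameters follow the caption, everything else is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=1.0, size=1000)   # mean 10, variance 1

# Same data, different bin counts: the apparent shape depends strongly on the bins
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 10, 30]):
    ax.hist(data, bins=bins)
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("value")
axes[0].set_ylabel("count")
plt.tight_layout()
plt.show()
```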
Histograms: Unimodal Data
100 data points from a Normal, mean 10, variance 1, with 5, 10, 30 bins
Histogram of Multimodal Data
15000 data points simulated from a mixture of 3 Normal distributions, 300 bins
Histogram of Multimodal Data
15000 data points simulated from a mixture of 3 Normal distributions, 300 bins
Skewed Data
5000 data points simulated from an exponential distribution, 100 bins
Another Skewed Data Set
10000 data points simulated from a mixture of 2 exponentials, 100 bins
Same Skewed Data after taking Logs (base 10)
10000 data points simulated from a mixture of 2 exponentials, 100 bins
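A minimal sketch of this kind of log transformation on simulated skewed data; the mixture parameters below are illustrative stand-ins, not the ones used to generate the slide's figure.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Mixture of 2 exponentials with very different scales (illustrative parameters)
component = rng.random(10000) < 0.5
data = np.where(component,
                rng.exponential(0.5, 10000),
                rng.exponential(20.0, 10000))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.hist(data, bins=100)
ax1.set_title("original (skewed) scale")
ax2.hist(np.log10(data), bins=100)   # data are strictly positive, so log10 is safe
ax2.set_title("log10 scale")
plt.tight_layout()
plt.show()
```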
What will the mean or median tell us about this data?
[Figure: histogram of a data set with values ranging roughly from 9 to 16]
Issues with Histograms
• For small data sets, histograms can be misleading: small changes in the data or in the bucket boundaries can result in very different histograms
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution
• Histograms can be smoothed using a variety of techniques
  – E.g., kernel density estimation, which avoids bins but requires some notion of "scale" (see the sketch below)
• Histograms effectively only work with 1 variable at a time
  – Difficult to extend to 2 dimensions, not possible for >2
  – So histograms tell us nothing about the relationships among variables
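The sketch referenced above: kernel density estimation with SciPy, where the bandwidth argument (bw_method) plays the role of the required notion of "scale". The bimodal sample is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(8, 0.5, 500),
                       rng.normal(11, 1.0, 500)])    # bimodal example

grid = np.linspace(data.min() - 1, data.max() + 1, 400)
for bw in [0.1, 0.3, 1.0]:                           # smaller bandwidth -> wigglier estimate
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(grid, kde(grid), label=f"bw_method={bw}")

plt.hist(data, bins=30, density=True, alpha=0.3, label="histogram")
plt.legend()
plt.show()
```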
US Zipcode Data: Population by Zipcode
[Figure: histograms of population per zipcode, with K = 50 and K = 500 bins over the full range, and K = 50 bins over a restricted range of small populations]
Histogram with Outliers
Pima Indians Diabetes Data,
From UC Irvine Machine Learning Repository
[Figure: histogram of one variable from the data set; y-axis = number of individuals, x-axis = X values]
Histogram with Outliers
Pima Indians Diabetes Data, from the UC Irvine Machine Learning Repository
[Figure: the same histogram with the x-axis labeled as diastolic blood pressure; a spike at blood pressure = 0 is flagged as a suspected error/outlier]
Box Plots: Pima Indians Diabetes Data
Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set
[Figure: box plots of body mass index for healthy individuals vs. diabetic individuals]
Box Plots: Pima Indians Diabetes Data
Two side-by-side box-plots of individuals from the Pima Indians Diabetes Data Set
[Figure: the same box plots of body mass index (healthy vs. diabetic individuals), annotated with the anatomy of a box plot:
– Box = middle 50% of the data, from Q1 to Q3, with Q2 (the median) marked inside
– Upper and lower whiskers extend up to 1.5 x (Q3 - Q1) beyond the box
– All data points outside the whiskers are plotted individually]
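A minimal Matplotlib sketch of side-by-side box plots with whiskers at 1.5 x (Q3 - Q1); the two groups below are synthetic stand-ins, since the real Pima BMI values would have to be loaded from the UCI repository.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
healthy_bmi = rng.normal(30, 6, 500)     # synthetic stand-in for healthy individuals
diabetic_bmi = rng.normal(35, 6, 268)    # synthetic stand-in for diabetic individuals

plt.boxplot([healthy_bmi, diabetic_bmi], whis=1.5)   # whiskers at 1.5 x (Q3 - Q1)
plt.xticks([1, 2], ["Healthy", "Diabetic"])
plt.ylabel("Body Mass Index")
plt.show()
```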
Box Plots: Pima Indians Diabetes Data
[Figure: side-by-side box plots (healthy vs. diabetic) for four variables: diastolic blood pressure, 24-hour serum insulin, plasma glucose concentration, and body mass index]
Exploratory Data Analysis
Analyzing more than 1 variable at a time…
Relationships between Pairs of Variables
• Say we have a variable Y we want to predict and many variables X that we could use to predict Y
• In exploratory data analysis we may be interested in quickly finding out if a particular X variable is potentially useful at predicting Y
• Options?
  – Linear correlation
  – Scatter plot: plot Y values versus X values
Linear Dependence between Pairs of Variables
• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X, Y) \;=\; \frac{1}{n} \sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)$$

• The covariance is a measure of how X and Y vary together
  – It will be large and positive if large values of X are associated with large values of Y, and small X with small Y
• (Linear) correlation = scaled covariance, varies between -1 and +1:

$$\rho(X, Y) \;=\; \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{\left[\,\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^{2} \,\sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^{2}\right]^{1/2}}$$
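These formulas map directly onto a few lines of NumPy; a sketch (note that the slide uses the 1/n convention, whereas np.cov defaults to 1/(n-1)):

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the 1/n convention used on this slide."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Linear (Pearson) correlation: scaled covariance, always in [-1, 1]."""
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)   # linearly related plus noise

print(sample_cov(x, y))
print(correlation(x, y))
print(np.corrcoef(x, y)[0, 1])   # should agree with correlation(x, y)
```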
Correlation coefficient
• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Linear correlation coefficient is defined as:
$$\rho(X, Y) \;=\; \frac{\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)\bigl(y(i) - \bar{y}\bigr)}{\left[\,\sum_{i=1}^{n} \bigl(x(i) - \bar{x}\bigr)^{2} \,\sum_{i=1}^{n} \bigl(y(i) - \bar{y}\bigr)^{2}\right]^{1/2}}$$
Data Set on Housing Prices in Boston
(widely used data set in research on regression models)
1  CRIM     per capita crime rate by town
2  ZN       proportion of residential land zoned for lots over 25,000 ft²
3  INDUS    proportion of non-retail business acres per town
4  NOX      nitrogen oxide concentration (parts per 10 million)
5  RM       average number of rooms per dwelling
6  AGE      proportion of owner-occupied units built prior to 1940
7  DIS      weighted distances to five Boston employment centres
8  RAD      index of accessibility to radial highways
9  TAX      full-value property-tax rate per $10,000
10 PTRATIO  pupil-teacher ratio by town
11 MEDV     median value of owner-occupied homes in $1000's
Matrix of Pairwise Linear Correlations
[Figure: color-coded matrix of pairwise linear correlations (scale from -1 to +1) for the Boston housing variables: crime rate, percentage of large residential lots, industry, nitrous oxide, average # rooms, proportion of old houses, distance to employment centers, highway accessibility, property tax rate, student-teacher ratio, and median house value]
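A sketch of how such a correlation matrix can be computed and displayed with pandas and Matplotlib; the file name boston.csv is a hypothetical placeholder for wherever a copy of the housing data is stored.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("boston.csv")                    # placeholder path for the housing data
corr = df.select_dtypes("number").corr()          # matrix of pairwise linear correlations

plt.imshow(corr.values, vmin=-1, vmax=1, cmap="RdBu_r")
plt.colorbar(label="correlation (-1 to +1)")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.tight_layout()
plt.show()
```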
Dangers of searching for correlations in high-dimensional data
Simulated 50 random Gaussian/normal data vectors, each with 100 variables, giving a 50 x 100 data matrix
Below is a histogram of the (100 choose 2) = 4,950 pairwise correlation coefficients
Even if the data are entirely random (no dependence), there is a very high probability that some variables will appear dependent just by chance
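A sketch of exactly this simulation: generate an independent 50 x 100 Gaussian data matrix, compute all pairwise correlations, and look at how large some of them are purely by chance.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))            # 50 observations of 100 independent variables

corr = np.corrcoef(X, rowvar=False)       # 100 x 100 correlation matrix
iu = np.triu_indices_from(corr, k=1)      # the (100 choose 2) = 4950 distinct pairs
pairwise = corr[iu]

print("largest |correlation| found by chance:", np.abs(pairwise).max())
plt.hist(pairwise, bins=50)
plt.xlabel("pairwise correlation coefficient")
plt.ylabel("count")
plt.show()
```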
Examples of X-Y plots and linear correlation values
[Figures: a series of scatter plots illustrating linear dependence and non-linear dependence, together with their linear correlation values]
Lack of linear correlation does not imply lack of dependence
[Figure: scatter plots of four data sets (DATA SET 1-4), each showing Y values against X values]
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.
Guess the Linear Correlation Values for each Data Set
[Figure: the same four scatter plots (DATA SET 1-4); try to guess the linear correlation value for each]
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.
Actual Correlation Values
[Figure: the four scatter plots (DATA SET 1-4) again; each one has linear correlation = 0.82]
Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.
Summary Statistics for each Data Set
Data Set 1: N = 11, Mean of X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope = 0.5, Correlation = 0.82
Data Set 2: N = 11, Mean of X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope = 0.5, Correlation = 0.82
Data Set 3: N = 11, Mean of X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope = 0.5, Correlation = 0.82
Data Set 4: N = 11, Mean of X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope = 0.5, Correlation = 0.82
Anscombe, Francis (1973), Graphs in Statistical Analysis,
The American Statistician, pp. 195-199.
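A sketch that reproduces these numbers, assuming seaborn is available (its load_dataset call fetches the classic Anscombe quartet); the slope and intercept come from an ordinary least-squares fit.

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")        # columns: dataset, x, y

for name, g in df.groupby("dataset"):
    slope, intercept = np.polyfit(g["x"], g["y"], deg=1)   # least-squares line
    r = np.corrcoef(g["x"], g["y"])[0, 1]
    print(f"Data Set {name}: N={len(g)}, mean x={g['x'].mean():.1f}, "
          f"mean y={g['y'].mean():.2f}, intercept={intercept:.2f}, "
          f"slope={slope:.2f}, correlation={r:.2f}")
```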
Conclusions so far?
• Summary statistics are useful… up to a point
• Linear correlation measures can be misleading
• There really is no substitute for plotting/visualizing the data
Scatter Plots
• Plot the value of one variable against the other
• Simple… but can be very informative; can reveal more than summary statistics
• For example, we can…
  – See if variables are dependent on each other (beyond linear dependence)
  – Detect if outliers are present
  – Color-code to overlay group information (e.g., color points by class label for classification problems); see the sketch below
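The sketch referenced above: a color-coded scatter plot in Matplotlib, with two synthetic groups standing in for class labels.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two synthetic groups standing in for class labels
x0, y0 = rng.normal(0, 1, 200), rng.normal(0, 1, 200)
x1, y1 = rng.normal(3, 1, 200), rng.normal(2, 1, 200)

plt.scatter(x0, y0, s=10, c="tab:blue", label="class 0")
plt.scatter(x1, y1, s=10, c="tab:orange", label="class 1")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.legend()
plt.show()
```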
[Figure: scatter plot from US Zip code data (each point = 1 Zip code); y-axis = median household income (dollars), x-axis = median per-capita income]
Constant Variance versus Changing Variance
[Figures: two scatter plots; in the first, the variation in Y does not depend on X (constant variance); in the second, the variation in Y changes with the value of X, e.g., Y = annual tax paid, X = income]
Problems with Scatter Plots of Large Data
96,000 bank loan applicants: the scatter plot degrades into a black smudge…
It appears that later applications come from older applicants; the reality is a downward slope (more applications, more variance)
Contour Plots (based on local density) can help
(same 96,000 bank loan applications as before)
[Figure: contour plot of the estimated 2d density; it shows that the change in the variance of Y with X seen in the scatter plot is indeed due to horizontal skew in the density, which is unimodal in one direction and skewed in the other]
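A sketch of one way to build such a density-based contour plot, using scipy.stats.gaussian_kde on simulated skewed data; for a data set as large as the bank-loan example, the KDE would typically be fit on a random subsample.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=2.0, size=5000)        # skewed "X" (illustrative)
y = 30 + 0.5 * x + rng.normal(scale=2.0, size=5000)   # related "Y"

kde = gaussian_kde(np.vstack([x, y]))                  # 2d density estimate
xg, yg = np.meshgrid(np.linspace(x.min(), x.max(), 100),
                     np.linspace(y.min(), y.max(), 100))
density = kde(np.vstack([xg.ravel(), yg.ravel()])).reshape(xg.shape)

plt.scatter(x, y, s=1, alpha=0.1, color="gray")        # raw points fade into a smudge
plt.contour(xg, yg, density, levels=8)                 # contours reveal the density shape
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```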
Scatter-Plot Matrices
Pima Indians Diabetes data
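A sketch of a scatter-plot matrix using pandas.plotting.scatter_matrix; the columns below are synthetic stand-ins for a few of the Pima variables.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for a few numeric columns of the Pima data
df = pd.DataFrame({
    "glucose": rng.normal(120, 30, 300),
    "blood_pressure": rng.normal(70, 12, 300),
    "bmi": rng.normal(32, 7, 300),
    "age": rng.integers(21, 80, 300),
})

scatter_matrix(df, figsize=(8, 8), diagonal="hist", s=8)
plt.show()
```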
Another Scatter-Plot Matrix
For interactive visualization, the concept of "linked plots" is generally useful, i.e., clicking on 1 or more points in one window and having these same points highlighted in other windows.
Using Color to Show Group Information in Scatter Plots
Iris classification data set, 3 classes
Figure from www.originlab.com
Another Example with Grouping by Color
Figure from hci.stanford.edu
Outlier Detection
• Definition of an outlier?
  – No precise definition
  – Generally… "a data point that is significantly different from the rest of the data"
  – But how do we define "significantly different"? (many answers to this…)
  – Typically assumed to mean that the point was measured in error, or is not a true measurement in some sense
[Figures: outliers in 1 dimension (a histogram) and an outlier in 2 dimensions (a scatter plot of Y values vs. X values)]
Example 1: The Effect of Outliers on Regression
[Figure: scatter plots of the four data sets (DATA SET 1-4, Y values vs. X values) with fitted regression lines; for DATA SET 3, the least squares fit is shown both with and without the outlier]
Example 2: Least Square Fitting is Sensitive to Outliers
[Figure: least squares line fit to data containing one outlying datum; the squared-error cost for that single point is 16^2, illustrating the heavy penalty for large errors]
Slide courtesy of Alex Ihler
More Robust Cost Functions for Training Regression Models
[Figure: candidate cost functions: mean squared error (MSE), mean absolute error (MAE), and "something else entirely", e.g., the blue line]
Slide courtesy of Alex Ihler
L1 is more Robust to Outliers than L2
[Figure: fitted lines under L2 and L1 loss, for the original data and for data with an added outlier ("L2, original data", "L1, original data", "L2, outlier data", "L1, outlier data"); the L1 fits change far less when the outlier is added]
Slide courtesy of Alex Ihler
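A sketch of this comparison: fit a line under L2 (squared error) and L1 (absolute error) loss, before and after adding a single gross outlier; the L1 fit here is obtained by direct numerical minimization with SciPy, and the data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_line(x, y, loss="l2"):
    """Fit y ~ a*x + b by minimizing the chosen loss numerically."""
    def cost(params):
        a, b = params
        resid = y - (a * x + b)
        return np.sum(resid ** 2) if loss == "l2" else np.sum(np.abs(resid))
    return minimize(cost, x0=[0.0, 0.0], method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = np.linspace(0, 20, 30)
y = 0.5 * x + 3 + rng.normal(scale=0.5, size=x.size)

x_out, y_out = np.append(x, 18.0), np.append(y, -10.0)   # add one gross outlier

print("L2, original data:", fit_line(x, y, "l2"))
print("L1, original data:", fit_line(x, y, "l1"))
print("L2, outlier data :", fit_line(x_out, y_out, "l2"))   # pulled toward the outlier
print("L1, outlier data :", fit_line(x_out, y_out, "l1"))   # changes far less
```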
Detection of Outliers in Multiple Dimensions
[Figure: scatter plot (Y values vs. X values) with a single 2-dimensional outlier shown in blue]
• Detecting "multi-dimensional outliers" is generally difficult
• In the example above, the blue point will not look like an outlier if we plot 1-dimensional histograms of Y or X separately; it only stands out in the 2d plot
• Now consider the same situation in 3 or more dimensions
Some Advice on Outliers
• Use visualization (e.g., in 1d, 2d) to spot obvious outliers
• Use domain knowledge and known constraints
– E.g., Age in years should be between 0 and 120
• Use the model itself to help detect outliers
– E.g., in regression, data points with errors much larger than the others may be outliers
• Use robust techniques that are not overly sensitive to outliers
– E.g., median is more robust than the mean, L1 is more robust than L2, etc
• Automated outlier detection algorithms? …not always useful
– E.g., fit probability density model to N-1 points and determine how likely the Nth point is
– May not work well in high dimensions and/or if there are multiple outliers
• In general: for large data sets you can probably assume that outliers are present and proceed with caution…
Multivariate Visualization
• Multivariate -> multiple variables
• 2 variables: scatter plots, etc.
• 3 variables:
  – 3-dimensional plots
  – Look impressive, but often not that useful
  – Can be cognitively challenging to interpret
  – Alternatives: overlay color-coding (e.g., categorical data) on a 2d scatter plot
• 4 variables:
  – 3d with color or time
  – Can be effective in certain situations, but tricky
• Higher dimensions
  – Generally difficult
  – Scatter plots, icon plots, parallel coordinates: all have weaknesses
  – Alternative: "map" data to lower dimensions, e.g., PCA or multidimensional scaling
  – Main problem: high-dimensional structure may not be apparent in low-dimensional views
Using Icons to Encode Information, e.g., Star Plots
Each star represents a single observation. Star plots are used to examine the relative values for a single data point.
The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
Useful for small data sets with up to 10 or so variables.
Spoke legend: 1 Price, 2 Mileage (MPG), 3 1978 Repair Record (1 = Worst, 5 = Best), 4 1977 Repair Record (1 = Worst, 5 = Best), 5 Headroom, 6 Rear Seat Room, 7 Trunk Space, 8 Weight, 9 Length
Limitations? Small data sets, small dimensions; ordering of variables may affect perception
Another Example of Icon Plots
Figure from statsoft.com
Combining Scatter Plots and Icon Plots
Figure from statsoft.com
Chernoff Faces
• Variable values are associated with facial characteristic parameters, e.g., head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length, and degree of mouth opening
• Limitations?
  – Only up to 10 or so dimensions
  – Overemphasizes certain variables because of our perceptual biases
Parallel Coordinates Method
Epileptic Seizure Data
[Figure: parallel coordinates plot of n cases, with 1 (of the n) cases highlighted]
Interactive "brushing" is useful for seeing distinctions (the highlighted case is a "brushed" one, drawn with a darker line so it stands out from the n-1 other cases)
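A sketch of the parallel coordinates method using pandas.plotting.parallel_coordinates on synthetic two-group data; interactive brushing itself would require an interactive tool rather than a static Matplotlib figure.

```python
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in: two groups of cases measured on 5 variables
a = pd.DataFrame(rng.normal(0, 1, (30, 5)), columns=list("ABCDE")).assign(group="group 1")
b = pd.DataFrame(rng.normal(1, 1, (30, 5)), columns=list("ABCDE")).assign(group="group 2")
df = pd.concat([a, b], ignore_index=True)

parallel_coordinates(df, class_column="group",
                     color=["tab:blue", "tab:orange"], alpha=0.5)
plt.ylabel("value")
plt.show()
```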
More elaborate parallel coordinates example (from E. Wegman, 1999).
12,000 bank customers with 8 variables
Additional “dependent” variable is profit (green for positive, red for negative)
Interactive “Grand Tour” Techniques
• "Grand Tour" idea
  – Cycle continuously through multiple projections of the data
  – Cycles through all possible projections (depending on time constraints)
  – Projections are typically 1, 2, or 3d (often 2d)
  – Can link with scatter plot matrices (see following example)
  – Asimov (1985)
• Example on following 2 slides
  – 7-dimensional physics data, color-coded by group, shown with (a) a standard scatter matrix and (b) 2 static snapshots of the grand tour
Visualization of an email network using 2-dimensional graph drawing or "embedding". Data from 500 researchers at Hewlett-Packard over approximately 1 year. Various structural elements of the network are apparent.
Exploratory Data Analysis
Visualizing Time-Series Data
Time-Series Data: Example 1
Historical data on millions of miles flown by UK airline passengers
…note a number of different systematic effects:
– a steady growth trend
– summer peaks, including summer "double peaks" (favoring early or late summer)
– New Year bumps
Time-Series Data: Example 2
Data from a study of weight measurements over time of children in Scotland
Experimental study: more milk -> better health?
20,000 children: 5k raw milk, 5k pasteurized milk, 10k control (no supplement)
[Figure: mean weight vs. mean age for the 10k control group]
We would expect a smooth weight-growth plot; instead the plot shows an unexpected pattern (steps), not apparent from the raw data table.
Why do the children appear to grow in spurts?
Time-Series Data: Example 3 (Google Trends)
Search Query = whiskey
Time-Series Data: Example 4 (Google Trends)
Search Query = NSA
Spatial Distribution of the Same Data (Google Trends)
Search Query = whiskey
Summary on Exploration/Visualization
• Always useful and worthwhile to visualize data
  – the human visual system is excellent at pattern recognition
  – gives us a general idea of how the data are distributed, e.g., extreme skew
  – detects "obvious outliers" and errors in the data
  – gives a general understanding of low-dimensional properties
• Many different visualization techniques
• Limitations
  – generally only useful up to 3 or 4 dimensions
  – massive data: only so many pixels on a screen, but subsampling is useful