2. Presence of Missing Values

advertisement
2. Presence of Missing Values
Missingness is multivariate
Explore joint distribution of missings,
and their association with variables
Evaluate imputation schemes
2{1
Simple Visualization Strategies
Complete case analysis
Assign some xed value
{ Pro: Easy identication of missings
{ Con: Problematic for high-d views
Assign imputed values
{ Pro: Data and all plots available
{ Con: Missingness information is lost
2{2
Advanced Visualization Strategies
MANET: Missings Are Now Equally Treated
(Unwin, Theus, Hofmann, et al)
MANET incorporates missing values in barcharts,
histograms, boxplots, dotplots, scatterplots, etc.
XGobi
Natural extensions of linked scatterplot paradigm
2{3
Missing Values in XGobi
Prerequisite: Retention of case-value identity
Optional display of missing values
Quick reset of xed or imputed values
Multiple linked windows
{ One window (...) plotting the data
{ One window plotting a multivariate indicator matrix of
missingness: that is, jittered zeroes and ones
2{4
Dening Missing Values in XGobi
Use \." or \na" or \NA" in the data le
Create an filename.missing le
{ Allows the investigation of censored data with missings
2{5
Exploration of Imputation Methods in XGobi
Reads user-supplied les of sets of imputed values
Allows rapid cycling through imputations
2{6
Variation in Work Eciency During Exercise
What is the eect of subject characteristics on work eciency?
136 human subjects
Covariates: sex, age, body weight, %body fat, ...
3 walk and 4 step exercises, AM and PM
... 14 repeated measures for each primary response variable:
mechanical work, eciency, etc.
Missing == failed exercise
(Data courtesy of Ming Sun, MD and George Reed, Vanderbilt
University Medical Center)
2{7
Eciency Exercise: Covariates
The points are brushed according to sex.
2{8
Mechanical Work: Repeated Measures
By default, missing values are shown as zeroes.
2{9
Aside: Parallel Coordinates Plots
Axes are parallel instead of orthogonal
Each case is represented by a line instead of a point
Strength: Many variables in a single view
Weakness: Overplotting! Interaction is essential.
2 { 10
Parallel Coordinates Plot of Covariates
2 { 11
Parallel Coordinates Plots of Covariates
One displays the data for men, and one for women.
2 { 12
Mechanical Work
For this data, something like a time series plot ...
2 { 13
Eciency
Missings are not drawn in the parallel coordinates view.
2 { 14
Eciency Data: Observations
Using parallel coordinates plots
{ Work monotonic in exertion; not so eciency
{ Morning and afternoon similar
{ Most \missings" on hardest step
Using missing values plot
{ Number of missings increased with exertion
{ Correlation between missingness and % body fat
{ Correlation between number of missings and % body fat
2 { 15
0.4
0.3
0.0
0.1
0.2
step4a
0.2
step4p
0.0
0.1
0.2
0.0
0.1
step4p
0.3
0.4
Eciency data: Exploring imputation schemes
0.3
0.0
0.1
0.2
0.3
step4a
Univariate imputation is inadequate because correlation
structure is ignored
Multivariate imputation (courtesy of Chuan-hai Liu) is
considerably better
2 { 16
Importing Imputed Values into XGobi
filename.imp:
Each column contains all imputed values
filename.impnames: one imputation name to a line
(Three rows from a set of data with missings)
467 585 na2 43 na4
463 580 86 67 12
379 na1 na3 88 12
filename.imp (Four sets of imputed values)
583 582 584 580 (imputed values for na1)
86 90 85 84 (imputed values for na2)
86 90 88 87 (imputed values for na3)
12 14 11 13 (imputed values for na4)
filename.dat
2 { 17
Quick Quiz
25
ss temp
25
air temp
20
0
20
-5
0
5
70
80
90
100
20
22
humidity
24
26
28
30
air temp
ss temp
humidity
air temp
5
zon.winds
-5
25
ss temp
30
zon.winds
0
-5
20
zon.winds
5
30
30
Are the salient features in each of these plots potentially due to
coding of missing values?
humidity
20
humidity
air temp
ss temp
22
24
air temp
26
28
30
-5
0
5
mer.winds
ss temp
air temp
2 { 18
Download