pptx

advertisement
Multivariate data
visualization
multivariate data
Graphical
display
Data graphics visually display measured quantities by
means of the combined use of points, lines, a coordinate
system, numbers, symbols, words, shading and color.
“There is no statistical tool that is as powerful as a
well-chosen graph.”
Chambers, Cleveland, Kleiner, and Tukey
“Humans are good at discerning subtle patterns that are
really there, but equally so at imagining them when they are
altogether absent.”
Carl Sagan
multivariate data
Graphical
display
Advantages of graphical methods:
-In comparison with other types of presentation, well-design charts are more
effective in creating interest and in appealing to the attention of the reader.
-Visual relationships as portrayed by charts and graphs are more easily grasped
and more easily remembered.
-The use of charts and graphs saves time since the essential meaning of large
measures of statistical data can be visualized at a glance.
-Charts and graphs provide a comprehensive picture of a problem that makes for
a more complete and better balanced understanding that could be derived from
tabular or textual forms of presentation.
-Charts and graphs can bring out hidden facts and relationships and can
stimulate, as well as aid, analytical thinking and investigation.
multivariate data
Graphical
display
Goals for graphical displays of data:
• To provide an overview
•To tell a story
•To suggest hypotheses
•To criticize a model
multivariate data data
Air pollution
Air pollution in 41 cities in the USA.
R data “USairpollution”
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more
workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year
multivariate data
Scatterplot
The relational graphic - the greatest of all graphical designs.
It links at list two variables encouraging the viewer to assess the
possible causal relationship between the plotted variables.
The standard for representing continuous bivariate data but it
can be enhanced in a variety of ways to accommodate
information about other variables.
Build a basic scatterplot of the two variables “manu” and
“popul”in USairpollution data.
multivariate data
Scatterplot
# read the date Usairpollution:
>USairpollution=read.csv("E:/Multivariate_analysis/Data/USai
rpollution.csv",header=T)
# set up the labels for the two variables:
> mlab="Manufacturing enterprises with 20 or more workers"
> plab="Population size (1970 census) in thousands“
#plot manu vs. popul :
> plot(popul~manu, data=USairpollution,xlab=mlab,ylab=plab)
3500
2500
1500
0 500
Population size (1970 census) in thousands
multivariate data
Scatterplot
0
500
1000 1500 2000 2500 3000
Manufacturing enterprises with 20 or more workers
multivariate data
Scatterplot
We construct a scatter plot that shows the marginal distribution
of manu and popul in two ways:
1. Marginal distribution as rug plots on each axis:
>plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)
>rug(USairpollution$manu,side=1)
> rug(USairpollution$popul,side=2)
3500
2500
1500
0 500
Population size (1970 census) in thousands
multivariate data
Scatterplot
0
500
1000 1500 2000 2500 3000
Manufacturing enterprises with 20 or more workers
Rugs
multivariate data
Scatterplot
2. Marginal distribution of manu as a histogram and that
of popul as boxplot:
# set the plot margins
>par(mar=c(2,2,2,2))
# set the limit of the x axis
> xlim=with(USairpollution,range(manu))*1.1
# set the display of the plots
> layout(matrix(c(1,3,2,0),nrow=2,byrow=TRUE),
widths=c(2,1),heights=c(2,1),respect=TRUE)
multivariate data
Scatterplot
# plot popul vs manu
>plot(popul~manu,data=USairpollution,cex.lab=0.7,
xlab=mlab,ylab=plab,xlim=xlim)
# add labels for the points on the scatterplot
> with(USairpollution, text(manu,popul,cex=0.6,
labels=abbreviate(City)))
# add histogram for manu along the x axis
> with(USairpollution,hist(manu,xlim=xlim))
# add a boxplot for popul along the y axis
> with(USairpollution,boxplot(popul))
multivariateboxplot
data
Bivariate
Based on calculating “robust” measures of location, scale,
and correlation.
Consists of a pair of concentric ellipse:
- Hinge (include 50% of the data )
- Fence (delimitates potentially troublesome outliers)
Resistant regression lines of both y on x and x on y are
shown, with the intersection showing the bivariate location
estimator.
The acute angle between the regression lines will be small
for a large absolute value of correlations and large for a
small one.
multivariate data
Bivariate
boxplot
# select row number for marginal cities in USairpollution
>lab=c(“Houston”,"Chicago","Detroit","Cleveland","Philadel
phia")
>outcity=match(lab, USairpollution[,1])
# subset data for manu and popul variables
x=USairpollution[,c("manu","popul")]
#>draw
the bivariate boxplot
# need packages: “MVA” and HSAUR2” for bvbox
> bvbox(x,main="USairpollution",xlab=mlab,ylab=plab)
# add text for the outcity
> text(x$manu[outcity],x$popul[outcity],labels=lab,cex=0.7,
pos=c(2,2,4,2,2))
Hindge
2000
3000
Chicago
Philadelphia
Regression lines
1000
Detroit
Houston
Cleveland
Fence
0
Population size (1970 census) in thousands
USairpollution
0
1000
2000
3000
Manufacturing enterprises with 20 or more workers
multivariateboxplot
data
Bivariate
Scatter plots should always be consulted when calculating
correlation coefficients because the presence of outliers can
on occasion considerably distort the value of a correlation
coefficient.
The observations identified as outliers may then be excluded
from the calculation of the correlation coefficients.
The bivariate boxplot identifies Chicago, Philadelphia, Detroit,
and Cleveland as outliers in the scatterplot of manu and
popul.
multivariate data
R output
# correlation between manu and popul variables
> with(USairpollution,cor(manu,popul))
[1] 0.9552693 # correlation coefficient
# remove the outliers
>outcity=match(c("Chicago","Detroit","Cleveland","Philadelph
ia"),USairpollution[,1])
# correlation between manu and popul with the outliers
removed
>with(USairpollution,cor(manu[-outcity],popul[-outcity]))
[1] 0.7955549 # correlation coefficient
multivariate
data
The convex hull
of bivariate
data
An alternative approach to using the scatterplot combined
with the bivariate boxplot.
Gives a robust estimation of correlation.
The convex hull of a set of bivariate observations consists of
the vertices of the smallest convex polyhedron in variable
space within which or on which all data points lie.
Removal of the points lying on the convex hull can eliminate
isolated outliers without disturbing the general shape of the
bivariate distribution.
multivariate data
R output
# scatterplot of manu and popul in USairpollution data
>plot(popul~manu,data=USairpollution,cex.lab=0.7,
xlab=mlab,ylab=plab,xlim=xlim)
# extract the hull from manu and popul data
>hull=with(USairpollution,chull(manu,popul))
[1] 9 15 41 6 2 18 16 14 7
# plot the convex hull
>with(USairpollution,polygon(manu[hull],popul[hull],
density=15,angle=30))
# correlation of the manu and popul with the hull removed
> with(USairpollution,cor(manu[-hull],popul[-hull]))
[1] 0.9225267
data
Themultivariate
chi-plot
Suggested by Fisher and Switzer (1985,2001).
The chi-plot is designed to address the problem of
independence. It transforms the measurements (x11 ,..., xn1)
and (x12,..., xn2) into values ( 1 ,…,  n ) and (  1 ,…, n ),
which, plotted in a scatterplot, can be used to detect
deviations form independence.
Under independence the  i values should show a nonsystematic random fluctuation around zero; the  i values
measure the distance of unit i from the “center” of the
bivariate distribution.
data
Themultivariate
chi-plot
# scatter plot for manu and popul
>with(USairpollution,plot(manu,popul, xlab=mlab,ylab=plab,
cex.lab=0.9))
# need package “asbio”
> with(USairpollution,chi.plot(manu,popul))
data
Themultivariate
chi-plot
Departure from independence is indicated in the second pot
by a lack of points in the horizontal band indicated on the plot.
multivariate
data
The bubble and
other
glyph plots
Bubble plot – three variables are displayed: two are used
to form the scatter plot itself and the value of the third
variable are represented by circles with radii proportional
to these values and centered on the appropriate plot in
the scatterplot.
Make a bubble plot for temp, wind, and SO2 in
“USairpollution” data.
The plot seems to suggest that cities with moderate annual
temperature and moderate annual wind speeds tend to
suffer the greatest air pollution, but does not tell us anything
about the rest of variables.
The bubble and other glyph plots
Bubble plot
# set ylim for wind
>ylim=with(USairpollution,range(wind)*c(0.95,1))
# scatterplot of wind and temp
>plot(wind~temp,data=USairpollution,
xlab="Average annual temperature(Fahrenheit)",
ylab="Average annual wind speed (m.p.h.)",
pch=10,ylim=ylim)
# plot SO2 as circles
>with(USairpollution,symbols(temp,wind,circles=SO2,
inches=0.5,add=TRUE))
multivariate data
The bubble and
other glyph plots
Bubble plot
multivariate
data glyph plots
The bubble and
other
Make a star plot for all variables in “USairpollution” as a
seven-sided “star” with the lengths of each side
representing each of the seven variables. The plot does
not communicate much useful information about the data
but it shows which units have similar shapes.
Star plot
# star plot for all variables in USairpollution
>stars(USairpollution,labels=as.character(USairpollution$Ci
ty),cex=0.45)
multivariate data
The bubble and
other glyph plots
Star plot
multivariate data
The scatterplot
matrix
A scatterplot matrix is a square, symmetric grid of bivariate
scatterplots.
The grid has q rows and columns, each one corresponding to
a different variable.
Each of the grid’s cells shows a scatterplot of two variables.
Variable j is plotted against variable i in the ij cell,
and the same variables appear in the cell ji, with the x- and
y- axes of the scatterplots interchanged.
multivariate data
The scatterplot
matrix
# plot the grid
pairs(USairpollution[,-1],cex.labels=NULL,font.labels=0.1,
gap=0.1,pch=".",cex=0.00001)
# plot the grid and the regression lines
>pairs(USairpollution[,-1],
pch=".",cex=0.1,gap=0,font.labels=0.001,
panel=function(x,y,...){
points(x,y,...)
abline(lm(y~x),col="grey")
})
multivariate data
The scatterplot
matrix
65
0 2000
10
40
20
SO2
temp
manu
0
0 3500
12
popul
6
10 60
wind
precip
predays
20
80
0 2000
6
9
40 120
40
45 75
45
multivariate data
The scatterplot
matrix
60
75
0
2000
10
40
20 80
45
45 65
SO2
2500
temp
2500
0
manu
0
popul
50
6 9
wind
40
predays
140
10
precip
20
80
0
2000
6
9
12
40
120
multivariate data
Three-dimensional
plots
# three dimensional plot for hips, chest, waist in Measure data
> library(scatterplot3d)
>with(Measure,scatterplot3d(chest,waist,hips,
pch=(1:2)[gender],type="h“, angle=55))
32
waist
32 34 36 38 40 42
hips
The three dimensional plot of chest,
waist, and hips in “Measure” data
suggests the presence of two
separate groups of point
corresponding to the males
and females in the data.
34
36
38
chest
40
42
44
34
32
30
28
26
24
22
multivariate data
Three-dimensional
plots
40 45 50 55 60 65 70 75 80 6
temp
7
8
13
12
11
10
9
wind
SO2
The plot of temp, wind, and SO2
from the “USairpollution” data
shows that two observations
with high SO2 levels stand out.
100
120
020406080
# three dimensional plot for temp, wind, and SO2
> library(scatterplot3d)
>with(USairpollution,scatterplot3d(temp,wind,SO2, type="h“,
angle=55))
data
Trellismultivariate
graphics
Trellis graphics is an approach to examining high-dimensional
structure in data by means of one-, tow-, and threedimensional graphs.
The problem addressed is how observations of one or more
variables depend on the observations of the other variables.
Multiple conditioning that allows some type of plot to be
displayed for different values of a given variable (or variables).
data
Trellismultivariate
graphics
# scatterplot of SO2 and temp for light and high winds
> library(lattice)
> plot(xyplot(SO2~temp|cut(wind,2),data=USairpollution))
SO2
Scatterplots of SO2 and temp
conditioned on values of
wind divided in two equal
parts.
50 60 70
(5.99,9.35]
(9.35,12.7]
100
80
60
40
20
50 60 70
temp
data
Trellismultivariate
graphics
# three-dimensional plots of temp, wind, and precip
conditioned on four levels of SO2.
>pollution=with(USairpollution,equal.count(SO2,4))
>plot(cloud(precip~temp*wind|pollution,panel.aspect=0.9,
data=USairpollution))
Not very useful for small data sets.
Download