Multivariate data visualization

Multivariate data

Graphical display

Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading, and color.

“There is no statistical tool that is as powerful as a well-chosen graph.”
Chambers, Cleveland, Kleiner, and Tukey

“Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.”
Carl Sagan

Advantages of graphical methods:
- In comparison with other types of presentation, well-designed charts are more effective in creating interest and in appealing to the attention of the reader.
- Visual relationships as portrayed by charts and graphs are more easily grasped and more easily remembered.
- The use of charts and graphs saves time, since the essential meaning of large masses of statistical data can be visualized at a glance.
- Charts and graphs provide a comprehensive picture of a problem that makes for a more complete and better balanced understanding than could be derived from tabular or textual forms of presentation.
- Charts and graphs can bring out hidden facts and relationships and can stimulate, as well as aid, analytical thinking and investigation.

Goals for graphical displays of data:
• To provide an overview
• To tell a story
• To suggest hypotheses
• To criticize a model

Air pollution

Air pollution in 41 cities in the USA.
R data: “USairpollution”

Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year

Scatterplot

The relational graphic: the greatest of all graphical designs. It links at least two variables, encouraging the viewer to assess the possible causal relationship between the plotted variables.
The scatterplot is the standard for representing continuous bivariate data, but it can be enhanced in a variety of ways to accommodate information about other variables.
Build a basic scatterplot of the two variables “manu” and “popul” in the USairpollution data.

# read the data USairpollution:
> USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=TRUE)
# set up the labels for the two variables:
> mlab="Manufacturing enterprises with 20 or more workers"
> plab="Population size (1970 census) in thousands"
# plot popul vs. manu:
> plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)

[Figure: scatterplot of population size (1970 census, in thousands) against manufacturing enterprises with 20 or more workers]

We construct a scatterplot that shows the marginal distributions of manu and popul in two ways:
1. Marginal distributions as rug plots on each axis:
> plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)
> rug(USairpollution$manu,side=1)
> rug(USairpollution$popul,side=2)

[Figure: the same scatterplot with rug plots on both axes]

2.
Marginal distribution of manu as a histogram and that of popul as a boxplot:
# set the plot margins
> par(mar=c(2,2,2,2))
# set the limits of the x axis
> xlim=with(USairpollution,range(manu))*1.1
# set the layout of the plots
> layout(matrix(c(1,3,2,0),nrow=2,byrow=TRUE),
         widths=c(2,1),heights=c(2,1),respect=TRUE)
# plot popul vs. manu
> plot(popul~manu,data=USairpollution,cex.lab=0.7,
       xlab=mlab,ylab=plab,xlim=xlim)
# add labels for the points on the scatterplot
> with(USairpollution,text(manu,popul,cex=0.6,
       labels=abbreviate(City)))
# add a histogram for manu along the x axis
> with(USairpollution,hist(manu,xlim=xlim))
# add a boxplot for popul along the y axis
> with(USairpollution,boxplot(popul))

Bivariate boxplot

Based on calculating “robust” measures of location, scale, and correlation.
Consists of a pair of concentric ellipses:
- Hinge (includes 50% of the data)
- Fence (delineates potentially troublesome outliers)
Resistant regression lines of both y on x and x on y are shown, with their intersection showing the bivariate location estimator.
The acute angle between the regression lines will be small for a large absolute value of the correlation and large for a small one.
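The relationship between the angle of the two regression lines and the correlation can be checked numerically. The following is a minimal sketch under stated assumptions: ordinary least squares (`lm`) stands in for the resistant regression used by the bivariate boxplot, the data are simulated, and `line_angle` is a hypothetical helper written for this illustration only.

```r
# Sketch: acute angle between the regression lines of y on x and x on y.
# Assumptions: lm() (ordinary least squares) replaces resistant regression,
# and the data are simulated rather than taken from USairpollution.
set.seed(1)

# Hypothetical helper: acute angle (in degrees) between the two lines
line_angle <- function(x, y) {
  b_yx <- coef(lm(y ~ x))[2]       # slope of y on x (dy/dx)
  b_xy <- coef(lm(x ~ y))[2]       # slope of x on y (dx/dy)
  slope_xy <- 1 / b_xy             # the x-on-y line expressed as dy/dx
  angle <- abs(atan(slope_xy) - atan(b_yx)) * 180 / pi
  unname(min(angle, 180 - angle))  # keep the acute angle
}

n <- 200
x <- rnorm(n)
y_strong <- x + rnorm(n, sd = 0.2) # strongly correlated with x
y_weak   <- x + rnorm(n, sd = 3)   # weakly correlated with x

# Correlation and angle for each simulated pair
c(cor = cor(x, y_strong), angle = line_angle(x, y_strong))
c(cor = cor(x, y_weak),   angle = line_angle(x, y_weak))
```

With these simulated data the strongly correlated pair should give the smaller angle, which is the pattern the bivariate boxplot makes visible.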
# select the row numbers of the marginal cities in USairpollution
> lab=c("Houston","Chicago","Detroit","Cleveland","Philadelphia")
> outcity=match(lab,USairpollution[,1])
# subset the data for the manu and popul variables
> x=USairpollution[,c("manu","popul")]
# draw the bivariate boxplot
# need packages “MVA” and “HSAUR2” for bvbox
> library(MVA)
> bvbox(x,main="USairpollution",xlab=mlab,ylab=plab)
# add text for the outcity
> text(x$manu[outcity],x$popul[outcity],labels=lab,cex=0.7,
       pos=c(2,2,4,2,2))

[Figure: bivariate boxplot of popul against manu, titled “USairpollution”, with the hinge, fence, and regression lines annotated and Chicago, Philadelphia, Detroit, Houston, and Cleveland labeled]

Scatterplots should always be consulted when calculating correlation coefficients, because the presence of outliers can on occasion considerably distort the value of a correlation coefficient.
The observations identified as outliers may then be excluded from the calculation of the correlation coefficients.
The bivariate boxplot identifies Chicago, Philadelphia, Detroit, and Cleveland as outliers in the scatterplot of manu and popul.

R output

# correlation between the manu and popul variables
> with(USairpollution,cor(manu,popul))
[1] 0.9552693 # correlation coefficient
# remove the outliers
> outcity=match(c("Chicago","Detroit","Cleveland","Philadelphia"),USairpollution[,1])
# correlation between manu and popul with the outliers removed
> with(USairpollution,cor(manu[-outcity],popul[-outcity]))
[1] 0.7955549 # correlation coefficient

The convex hull of bivariate data

An alternative approach to using the scatterplot combined with the bivariate boxplot.
Gives a robust estimation of correlation.
The convex hull of a set of bivariate observations consists of the vertices of the smallest convex polygon in variable space within which or on which all data points lie.
Removal of the points lying on the convex hull can eliminate isolated outliers without disturbing the general shape of the bivariate distribution.

R output

# scatterplot of manu and popul in the USairpollution data
> plot(popul~manu,data=USairpollution,cex.lab=0.7,
       xlab=mlab,ylab=plab,xlim=xlim)
# extract the hull from the manu and popul data
> hull=with(USairpollution,chull(manu,popul))
> hull
[1]  9 15 41  6  2 18 16 14  7
# plot the convex hull
> with(USairpollution,polygon(manu[hull],popul[hull],
       density=15,angle=30))
# correlation of manu and popul with the hull removed
> with(USairpollution,cor(manu[-hull],popul[-hull]))
[1] 0.9225267

The chi-plot

Suggested by Fisher and Switzer (1985, 2001).
The chi-plot is designed to address the problem of independence.
It transforms the measurements (x_11, ..., x_n1) and (x_12, ..., x_n2) into values (χ_1, ..., χ_n) and (λ_1, ..., λ_n), which, plotted in a scatterplot, can be used to detect deviations from independence.
Under independence, the χ_i values should show a nonsystematic random fluctuation around zero; the λ_i values measure the distance of unit i from the “center” of the bivariate distribution.

# scatterplot for manu and popul
> with(USairpollution,plot(manu,popul,
       xlab=mlab,ylab=plab,cex.lab=0.9))
# need package “asbio”
> library(asbio)
> with(USairpollution,chi.plot(manu,popul))

Departure from independence is indicated in the second plot by a lack of points in the horizontal band indicated on the plot.

The bubble and other glyph plots

Bubble plot: three variables are displayed. Two are used to form the scatterplot itself, and the values of the third variable are represented by circles with radii proportional to these values and centered on the appropriate point in the scatterplot.
Make a bubble plot for temp, wind, and SO2 in the “USairpollution” data.
The plot seems to suggest that cities with moderate annual temperatures and moderate annual wind speeds tend to suffer the greatest air pollution, but it does not tell us anything about the remaining variables.

Bubble plot

# set ylim for wind
> ylim=with(USairpollution,range(wind)*c(0.95,1))
# scatterplot of wind and temp
> plot(wind~temp,data=USairpollution,
       xlab="Average annual temperature (Fahrenheit)",
       ylab="Average annual wind speed (m.p.h.)",
       pch=10,ylim=ylim)
# plot SO2 as circles
> with(USairpollution,symbols(temp,wind,circles=SO2,
       inches=0.5,add=TRUE))

Star plot

Make a star plot for all variables in “USairpollution” as a seven-sided “star”, with the length of each side representing one of the seven variables.
The plot does not communicate much useful information about the data, but it shows which units have similar shapes.

# star plot for all variables in USairpollution (the City column is dropped so that only the seven variables are drawn)
> stars(USairpollution[,-1],labels=as.character(USairpollution$City),cex=0.45)

The scatterplot matrix

A scatterplot matrix is a square, symmetric grid of bivariate scatterplots.
The grid has q rows and columns, each one corresponding to a different variable.
Each of the grid’s cells shows a scatterplot of two variables.
Variable j is plotted against variable i in the ij cell, and the same variables appear in the cell ji, with the x- and y-axes of the scatterplots interchanged.

# plot the grid
> pairs(USairpollution[,-1],cex.labels=NULL,font.labels=0.1,
        gap=0.1,pch=".",cex=0.00001)
# plot the grid and the regression lines
> pairs(USairpollution[,-1],
        pch=".",cex=0.1,gap=0,font.labels=0.001,
        panel=function(x,y,...){
          points(x,y,...)
          abline(lm(y~x),col="grey")
        })

[Figure: scatterplot matrix of SO2, temp, manu, popul, wind, precip, and predays]
[Figure: the same scatterplot matrix with a regression line in each panel]

Three-dimensional plots

# three-dimensional plot for hips, chest, and waist in the Measure data
> library(scatterplot3d)
> with(Measure,scatterplot3d(chest,waist,hips,
       pch=(1:2)[gender],type="h",angle=55))

[Figure: three-dimensional plot with axes chest, waist, and hips]

The three-dimensional plot of chest, waist, and hips in the “Measure” data suggests the presence of two separate groups of points corresponding to the males and females in the data.

# three-dimensional plot for temp, wind, and SO2
> library(scatterplot3d)
> with(USairpollution,scatterplot3d(temp,wind,SO2,
       type="h",angle=55))

[Figure: three-dimensional plot with axes temp, wind, and SO2]

The plot of temp, wind, and SO2 from the “USairpollution” data shows that two observations with high SO2 levels stand out.

Trellis graphics

Trellis graphics is an approach to examining high-dimensional structure in data by means of one-, two-, and three-dimensional graphs.
The problem addressed is how observations of one or more variables depend on the observations of the other variables.
It relies on multiple conditioning, which allows some type of plot to be displayed for different values of a given variable (or variables).

# scatterplot of SO2 and temp for light and high winds
> library(lattice)
> plot(xyplot(SO2~temp|cut(wind,2),data=USairpollution))

Scatterplots of SO2 and temp conditioned on wind, with the range of wind divided into two intervals of equal width.
[Figure: two panels, wind in (5.99,9.35] and (9.35,12.7]; x-axis temp, y-axis SO2]

# three-dimensional plots of temp, wind, and precip conditioned on four levels of SO2
> pollution=with(USairpollution,equal.count(SO2,4))
> plot(cloud(precip~temp*wind|pollution,panel.aspect=0.9,
       data=USairpollution))

Not very useful for small data sets.
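To see what conditioning on `cut(wind,2)` does, and why the number of observations per panel matters, the same idea can be sketched in base R. This is an illustration under assumptions, not the Trellis code itself: the built-in `airquality` data set stands in for USairpollution (Ozone in the role of SO2, Temp for temp, Wind for wind), and a plain loop over subsets replaces `xyplot`.

```r
# Sketch: conditioning a scatterplot on a variable cut into two intervals.
# Assumption: the built-in 'airquality' data stand in for USairpollution.
aq <- na.omit(airquality[, c("Ozone", "Temp", "Wind")])

# cut(x, 2) splits the range of x into two intervals of equal width
wind_group <- cut(aq$Wind, 2)
table(wind_group)                  # number of observations in each panel

# One scatterplot per interval, side by side, on common axes
op <- par(mfrow = c(1, nlevels(wind_group)))
for (g in levels(wind_group)) {
  panel_data <- aq[wind_group == g, ]
  plot(Ozone ~ Temp, data = panel_data, main = g,
       xlim = range(aq$Temp), ylim = range(aq$Ozone))
}
par(op)                            # restore the previous graphics settings
```

The output of `table(wind_group)` shows how many points each panel receives. Every additional conditioning level divides the data further, which is why conditioned plots tend to become uninformative for small data sets.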