Outline
• Research Question: What determines height?
• Data Input
• Look at One Variable
• Compare Two Variables
  • Children’s Height and Parents’ Height
  • Children’s Height and Gender
• Graphic Packages: ggplot2

Research Question
What factors are most responsible for height?

Galton’s Family Height Dataset
Predictors X1, X2, X3; outcome Y.
[Figure: Galton’s Notebook on Families & Height]

Data Input
Check the working directory with getwd() and change it with setwd().
    > getwd()
    [1] "C:/Users/johnp_000/Documents"

Dataset Input
    h <- read.csv("GaltonFamilies.csv")
Here h is the object, read.csv() is the function, and "GaltonFamilies.csv" is the filename.
Inspect the result with str(h) and summary(h).
Data types: numbers and factors (categorical).

Variables and Steps
    Variable                     Label    Type         Step
    Child’s Height               Y        Continuous   Histogram
    Dad’s Height, Mom’s Height   X1, X2   Continuous   Scatter
    Gender                       X3       Categorical  Boxplot

Frequency Distribution, Histogram
    hist(h$childHeight)

Density Plot
    plot(density(h$childHeight))
The area under the density curve is 1. Look for the mode and check whether the distribution is bimodal.

Histogram with a Normal Curve
    hist(h$childHeight, freq = FALSE, breaks = 25, ylim = c(0, 0.14))
    curve(dnorm(x, mean = mean(h$childHeight), sd = sd(h$childHeight)),
          col = "red", add = TRUE)

Grammar of Graphics
Seven Components [figure; surviving labels: "formations", Legend, Axes]
ggplot2 is built using the grammar of graphics approach.

Hadley Wickham and ggplot2
Asst. Professor of Statistics at Rice University
Packages: ggplot2, plyr, reshape, rggobi, profr
http://ggplot2.org/

ggplot2
In ggplot2 a plot is made up of layers.

ggplot2
    library(ggplot2)
    h.gg <- ggplot(h, aes(childHeight))
    h.gg + geom_histogram(binwidth = 1) + labs(x = "Height", y = "Frequency")
    h.gg + geom_density()

ggplot2
    h.gg <- ggplot(h, aes(childHeight)) + theme(legend.position = "right")
    h.gg + geom_density() + labs(x = "Height", y = "Density")
    h.gg + geom_density(aes(fill = factor(gender)), size = 2)

Box Plot: Children’s Height vs. Gender
    boxplot(childHeight ~ gender, data = h,
            col = c("pink", "lightblue"),
            main = "Children's Height by Gender",
            xlab = "Gender", ylab = "")

Descriptive Stats: Box Plot
Subset Males
    men <- subset(h, gender == 'male')
Subset Females
    women <- subset(h, gender == 'female')
Children’s Height: Males
    hist(men$childHeight)
Children’s Height: Females
    hist(women$childHeight)

ggplot2
    library(ggplot2)
    h.bb <- ggplot(h, aes(factor(gender), childHeight))
    h.bb + geom_boxplot()
    h.bb + geom_boxplot(aes(fill = factor(gender)))

Correlation
    ?cor
    cor(h$father, h$childHeight)
    [1] 0.2660385

Scatterplot Matrix: pairs()
Correlations Matrix
    library(car)
    scatterplotMatrix(~ father + mother + childHeight, data = h)

ggplot2

Analytics & History: 1st Regression Line
The first “Regression Line”

Appendix

What software do you use for creating charts or data visualizations?
(May, 2013; N=172)
    R                47%
    Excel            45%
    Python           19%
    Tableau          15%
    D3               14%
    BI Tools          8%
    Matlab            5%
    Javascript        4%
    SAS               3%
    SPSS              2%
    Google            2%
    Scientific S/W    2%
    Mathematica       2%
    SQL               2%
    Others           22%
Others: .net, BIRT, cytoscape, flot, gephi, gnuplot, graphite, iDashboards, Incanter, Java, JMP, LogiXML, MDX, Mondrian, octave, openlayers, OpenViz, PhP, Powerpoint, precog, Prezi, processing, Ptotobi, Silverlight, splunk, SSRS, talend, webGL, Wijmo, WPF, Xcelcuis, XLMiner

Visualization and Reporting
Axes: Steep Learning Curve vs. Easy to Use; Standard vs. Interactive Visualizations

BI Software: Tableau
http://public.tableausoftware.com/views/PapelbonPitchFX/PapelbonPitchFX

http://rcharts.io/gallery/
https://plot.ly/r/
http://shiny.rstudio.com/gallery/movie-explorer.html

The next data visual was produced with about 150 lines of R code.

Data Viz Tutorials
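
The slide on the first “Regression Line” shows no code, so here is a minimal sketch of how such a line could be fitted in base R. It is an illustration under assumptions, not the presenter's method. The data frame toy holds made-up heights (they are not values from GaltonFamilies.csv), the column names father, mother and childHeight are assumed to mirror the dataset, and midparent is a plain average introduced only for this example (Galton's own mid-parent measure also rescaled mothers' heights).

```r
# Illustrative only: made-up heights in inches, NOT values from GaltonFamilies.csv.
toy <- data.frame(
  father      = c(65, 67, 68, 69, 70, 71, 72, 74),
  mother      = c(62, 63, 64, 63, 65, 64, 66, 67),
  childHeight = c(64.5, 66, 67, 66.5, 68.5, 68, 70, 71)
)

# Hypothetical helper column: simple average of the two parents' heights.
toy$midparent <- (toy$father + toy$mother) / 2

# Least-squares line of child height on mid-parent height.
fit <- lm(childHeight ~ midparent, data = toy)
print(coef(fit))  # intercept and slope of the fitted line

# Scatterplot with the fitted regression line drawn on top.
plot(toy$midparent, toy$childHeight,
     xlab = "Mid-parent height", ylab = "Child height")
abline(fit, col = "red")
```

With the real data loaded as h, the same call would take data = h, assuming the CSV provides the columns used above.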