Project I

Comments are up on Blackboard or sent by email … so far
• Asking questions is not easy: avoid questions that have a single word as an answer ("yes", "Walt Disney", "Action", …); statistics ≠ data lookup
• Use tools from class to answer: pivot table & graphics
• Give enough details for Pivot tables: row variable(s), column variables (if any), summary statistics, summarized variable

Things to watch out for
• Each movie is included multiple times: this introduces dependencies in your data, e.g. can't average "Days" or "Total Gross". Solution? Multiple Pivot tables back to back!
• Why can't we average percentage change? Solution? Work from total amounts, compute percentages yourself.

Re-work & resubmit
• fix your outline as indicated
• resubmit by Thursday before class

Intro to Graphics in R
Stat 480
Heike Hofmann

Goals
• A gentle introduction to R: learn how to view data and produce graphics
• Practice your question generation skills
• Learn which plots are best for answering which questions
• Revise reading plots
• Explore a large data set with graphics

Crime Data
• FBI publishes data under FOIA at http://www.ucrdatatool.gov/
• For each state, we have numbers of different offenses since 1960
• Variables: State, Year, Population, Index, Arson, Murder, Burglary, Rape, Vehicle Theft, ...

Getting started
• install.packages("ggplot2") # once per computer
• library(ggplot2) # every time you open R
• data(package="ggplot2")
• fbi <- read.csv("http://hofroe.net/stat480/fbi-60-12.csv")
• head(fbi) # Ways to inspect a data set
• str(fbi)
• summary(fbi)
• # Make sure you type things exactly; R is very fussy

What do we have?
• A data.frame = a list of variables of the same length (but may be different types)
• Has row and column names

Extracting bits of data.frame
• x$variable
• x[, "variable"]
• x[rows, columns]
• x[1:5, 2:3]
• x[c(1,5,6), c("State","Year")]
• x$variable[rows]

Statistical summaries
• mean, median
• sd, var, cor
• min, max, range

What can we learn from this data?
• Inspect the data
• Figure out what the variables are from http://www.ucrdatatool.gov/ (http://www.ucrdatatool.gov/offenses.cfm)
• Write down questions that you could answer with this data
• 4 minutes by yourself, then pair up for another 3 minutes, and we'll write ideas on the board

How we will answer these questions
• Explore how one (or more) variables are distributed - barchart or histogram
• Explore how two variables are related - scatterplot, boxplot, tile plot
• Explore how two variables are related, conditioned on other variables - facetting, color & other aesthetics

Scatterplot
• Two continuous variables
• qplot(Burglary, Murder, data=fbi)
• qplot(log(Burglary), log(Murder), data=fbi)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi)

Revision: Interpreting a scatterplot
• Big patterns
  • Form and direction
  • Strength
• Small patterns
  • Deviations from the pattern
  • Outliers

Interpreting Scatterplots
• Form
  • Is the plot linear? Is the plot curved? Is there a distinct pattern in the plot? Are there multiple groups?
• Strength
  • Does the plot follow the form very closely? Or is there a lot of variation?

Interpreting Scatterplots
• Direction
  • Is the pattern increasing? Is the pattern decreasing?
  • Positively: Above (below) average in one variable tends to be associated with above (below) average in another variable.
  • Negatively: Above (below) average in one variable tends to be associated with below (above) average in another variable.

Form: Linear
Strength: Strong, very close to a straight line.
Direction: Two variables are positively associated.
Outliers: None
Form: Roughly linear, two distinct groups (more than 40% and less than 40%).
Strength: Not strong. Data points are scattered.
Direction: Negatively associated.
Outliers: None

Aesthetics
• Can map other variables to size or colour
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, colour=State)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, size=Population)
• other aesthetics: shape

Facetting
• Can facet to display plots for different subsets
• qplot(Year, Population, data=fbi, facets=~State)

Facets vs aesthetics
• Will need to experiment as to which one answers your question/tells the story best
• Remember, just like with pivot tables we want comparisons of interest to be close together

Your turn
• Work through each of the example plots
• Try variations to answer your questions

Finished?
• Continue to polish your questions about the data
• Go to http://had.co.nz/ggplot2 and figure out how to make other plots that you know about

Histograms and barcharts
• Used to display the distribution of a variable
• Continuous variable → histogram
• Categorical variable → bar chart
• For the histogram, you should always vary the binwidth

Examples
qplot(Population, data=fbi, geom="histogram")
qplot(Population, data=fbi, geom="histogram", binwidth=100000)
qplot(Population, data=fbi, geom="histogram", binwidth=1000000)
qplot(Population, data=fbi, geom="histogram", binwidth=5000000)

Aesthetics & facetting
• As for scatterplot, you can map fill to another variable, or use facetting to compare subsets
• Facetting is generally more useful, as it is easier to compare different groups

Your turn
• Explore the distribution of Murders
• What can you see? What might explain that pattern?
• Make sure to experiment with bin width!
• Use facetting to explore the relationship between Murders and States

Zooming
• qplot(Year, Burglary, data=fbi, facets=~State, ylim=c(0,1000000))
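
A note on zooming: the ylim argument used above limits the scale, so observations outside the range are dropped before plotting, whereas coord_cartesian() only changes the visible window. The following is a minimal sketch of that difference, assuming ggplot2 is installed. The toy data frame and its values are made up for illustration and only stand in for the fbi data; the sketch uses ggplot() rather than qplot().

```r
library(ggplot2)

# Hypothetical toy data standing in for the fbi data set (values are made up)
toy <- data.frame(
  State    = rep(c("A", "B"), each = 5),
  Year     = rep(2001:2005, times = 2),
  Burglary = c(100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000)
)

# Zoom by limiting the scale: values outside the limits are set to missing
p_scale <- ggplot(toy, aes(Year, Burglary)) +
  geom_point() +
  facet_wrap(~State) +
  ylim(0, 1000)

# Zoom by limiting the coordinate system: all values are kept,
# only the visible window changes
p_coord <- ggplot(toy, aes(Year, Burglary)) +
  geom_point() +
  facet_wrap(~State) +
  coord_cartesian(ylim = c(0, 1000))

# Count the non-missing y values that remain in each built plot
n_scale <- sum(!is.na(ggplot_build(p_scale)$data[[1]]$y))
n_coord <- sum(!is.na(ggplot_build(p_coord)$data[[1]]$y))
```

Printing p_scale should give a warning about removed rows, while p_coord shows the same window without discarding anything. The distinction matters most when a summary (for example a smoother or a boxplot) is computed from the plotted data, because the scale-based zoom computes it from the remaining observations only.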