Project I
Comments are up on Blackboard or sent by email
… so far
• Asking questions is not easy. Avoid
questions that have a single word as an
answer: “yes”, “Walt Disney”, “Action”, …
statistics ≠ data lookup
• Use tools from class to answer: pivot table
& graphics
• Give enough details for Pivot tables: row
variable(s), column variables (if any),
summary statistics, summarized variable
things to watch out for
• Each movie is included multiple times:
this introduces dependencies in your data, e.g. can’t average “Days” or “Total Gross”
Solution? Multiple Pivot tables back to back!
• Why can’t we average percentage change?
Solution? Work from total amounts, compute
percentages yourself.
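A small worked example of the percentage problem, as a sketch only: the data frame `gross`, the movie labels and the dollar figures are all made up for illustration and are not taken from the project data. Averaging the two percentage changes treats a tiny movie and a huge movie as equally important; working from totals does not.

```r
# Hypothetical weekend grosses for two movies (made-up numbers).
gross <- data.frame(
  movie     = c("A", "B"),
  last_week = c(1000000, 10000),
  this_week = c(500000, 20000)
)

# Percentage change per movie: A falls by half, B doubles.
gross$pct_change <- 100 * (gross$this_week - gross$last_week) / gross$last_week

# Averaging the percentages gives each movie the same weight.
mean_of_pct <- mean(gross$pct_change)

# Working from totals weights each movie by its actual dollars.
total_last <- sum(gross$last_week)
total_this <- sum(gross$this_week)
pct_of_totals <- 100 * (total_this - total_last) / total_last

mean_of_pct     # positive: suggests growth
pct_of_totals   # negative: total gross actually fell
```

The two answers even disagree in sign, which is why the percentages should be computed from the summed amounts.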
Re-work & resubmit
• fix your outline as indicated,
• resubmit by Thursday before class
Intro to Graphics in R
Stat 480
Heike Hofmann
Goals
• A gentle introduction to R: learn
how to view data and produce graphics
• Practice your question generation skills
• Learn which plots are best for answering
which questions
• Revise reading plots
• Explore a large data set with graphics
Crime Data
• FBI publishes data under FOIA at
http://www.ucrdatatool.gov/
• For each state, we have numbers of
different offenses since 1960
• Variables:
State, Year, Population, Index, Arson, Murder,
Burglary, Rape, Vehicle Theft, ...
Getting started
•install.packages("ggplot2") # once per computer
•library(ggplot2) # every time you open R
• data(package="ggplot2")
• fbi <- read.csv("http://hofroe.net/stat480/fbi-60-12.csv")
• head(fbi)
str(fbi)
summary(fbi)
# Ways to inspect a data set
•# Make sure you type things exactly;
# R is very fussy
What do we have?
• A data.frame = a list of variables of the same
length (but may be different types)
• Has row and column names
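To see what "a list of variables of the same length" means in practice, here is a minimal sketch on a tiny data frame. The name `toy` and all its values are made up for illustration; they are not real FBI figures.

```r
# A tiny made-up data frame shaped like the crime data
# (values are illustrative, not real FBI figures).
toy <- data.frame(
  State      = c("Iowa", "Iowa", "Ohio"),
  Year       = c(1960L, 1961L, 1960L),
  Population = c(2.7e6, 2.8e6, 9.7e6),
  stringsAsFactors = FALSE
)

is.list(toy)          # a data frame is a list of columns
sapply(toy, length)   # every column has the same length
sapply(toy, class)    # but the column types differ
rownames(toy)         # default row names
colnames(toy)         # column (variable) names
```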
Extracting bits of data.frame
•x$variable
•x[, "variable"]
•x[rows, columns]
•x[1:5, 2:3]
•x[c(1,5,6), c("State","Year")]
•x$variable[rows]
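The extraction patterns above can be tried on a small stand-in data frame. This is a sketch only: `x` and its values are made up for illustration, with column names chosen to resemble the crime data.

```r
# Made-up rows shaped like the crime data (illustrative values only).
x <- data.frame(
  State  = c("Iowa", "Iowa", "Ohio", "Ohio", "Utah", "Utah"),
  Year   = c(1960, 1961, 1960, 1961, 1960, 1961),
  Murder = c(17, 23, 321, 308, 9, 12)
)

x$Murder                             # one column as a vector
x[, "Murder"]                        # the same column, selected by name
x[1:3, 2:3]                          # rows 1 to 3, columns 2 to 3
x[c(1, 5, 6), c("State", "Year")]    # chosen rows, columns by name
x$Murder[x$State == "Ohio"]          # rows picked by a logical condition
```

The last line shows that `rows` does not have to be a set of positions; a logical vector of the same length as the column works too.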
Statistical summaries
•mean, median
•sd, var, cor
•min, max, range
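A quick sketch of these summary functions on two short vectors. The names `burglary` and `murder` and their values are made up for illustration; with the real data you would pass columns such as `fbi$Burglary` instead.

```r
# Made-up yearly counts (illustrative values only).
burglary <- c(120, 150, 90, 200, 140)
murder   <- c(4, 6, 3, 9, 5)

mean(burglary)          # arithmetic mean
median(burglary)        # middle value
sd(burglary)            # sample standard deviation
var(burglary)           # sample variance, equal to sd(burglary)^2
cor(burglary, murder)   # correlation between the two vectors
min(burglary)
max(burglary)
range(burglary)         # smallest and largest value together
```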
What can we learn from this data?
• Inspect the data
• Figure out what the variables are from
http://www.ucrdatatool.gov/ (http://www.ucrdatatool.gov/offenses.cfm)
• Write down questions that you could answer
with this data
• 4 minutes by yourself, then pair up for another
3 minutes, and we'll write ideas on the board
How we will answer these questions
• Explore how one (or more) variables are
distributed - barchart or histogram
• Explore how two variables are related - scatterplot, boxplot, tile plot
• Explore how two variables are related,
conditioned on other variables - facetting,
color & other aesthetics
Scatterplot
• Two continuous variables
•qplot(Burglary, Murder, data=fbi)
•qplot(log(Burglary), log(Murder),
data=fbi)
•qplot(log(Burglary),
log(Motor.Vehicle.Theft), data=fbi)
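Recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so it can help to know the equivalent ggplot() form. The sketch below builds the same kind of log-log scatterplot; the data frame `toy` and its values are made up so the example runs without downloading the FBI file.

```r
library(ggplot2)

# Made-up counts standing in for the fbi data (illustrative values only).
toy <- data.frame(
  Burglary = c(1200, 5400, 800, 23000, 15000),
  Murder   = c(10, 60, 5, 300, 180)
)

# Same idea as qplot(log(Burglary), log(Murder), data = fbi):
# map the two logged variables to x and y, then draw points.
p <- ggplot(toy, aes(x = log(Burglary), y = log(Murder))) +
  geom_point()
print(p)
```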
Revision:
Interpreting a scatterplot
• Big patterns
• Form and direction
• Strength
• Small patterns
• Deviations from the pattern
• Outliers
Interpreting
Scatterplots
• Form
• Is the plot linear?
Is the plot curved? Is
there a distinct pattern in the plot? Are
there multiple groups?
• Strength
• Does the plot follow the form very
closely? Or is there a lot of variation?
Interpreting
Scatterplots
• Direction
• Is the pattern increasing? Is the plot
decreasing?
• Positively: Above (below) average in one variable tends
to be associated with above (below) average in another
variable.
• Negatively: Above (below) average in one variable
tends to be associated with below (above) average in
another variable.
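The "above average goes with above average" idea can be checked directly. This is a sketch on made-up vectors (`x`, `y_up` and `y_down` are hypothetical names and values): it flags which points are above average in each variable and compares the flags, then looks at the sign of the correlation.

```r
# Made-up paired measurements (illustrative values only).
x      <- c(2, 4, 6, 8, 10)
y_up   <- c(1, 3, 4, 7, 9)    # tends to rise with x
y_down <- c(9, 8, 5, 3, 1)    # tends to fall as x rises

# TRUE where a point is above the average of that variable.
above_x    <- x > mean(x)
above_up   <- y_up > mean(y_up)
above_down <- y_down > mean(y_down)

# Positive association: above-average x pairs with above-average y.
table(above_x, above_up)
# Negative association: above-average x pairs with below-average y.
table(above_x, above_down)

# The sign of the correlation summarises the same direction.
cor(x, y_up)
cor(x, y_down)
```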
Form: Linear
Strength: Strong, very close
to a straight line.
Direction: Two variables
are positively associated.
No outliers.
Form: Roughly linear, two distinct
groups (more than 40% and less than
40%.)
Strength: not strong. Data points are
scattered.
Direction: Negatively Associated.
Outliers: None
Aesthetics
• Can map other variables to size or colour
•qplot(log(Burglary),
log(Motor.Vehicle.Theft), data=fbi,
colour=State)
•qplot(log(Burglary),
log(Motor.Vehicle.Theft), data=fbi,
size=Population)
• other aesthetics: shape
Facetting
• Can facet to display plots for different
subsets
•qplot(Year,
Population,
data=fbi, facets=~State)
Facets vs aesthetics
• Will need to experiment as to which one
answers your question/tells the story best
• Remember, just like with pivot tables we
want comparisons of interest to be close
together
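To compare the two options side by side, here is a sketch that draws the same data once with a colour aesthetic and once with facets, written in the ggplot() form. The data frame `toy` and its population values are made up for illustration.

```r
library(ggplot2)

# Made-up populations for two states (illustrative values only).
toy <- data.frame(
  State      = rep(c("Iowa", "Ohio"), each = 3),
  Year       = rep(c(1960, 1970, 1980), times = 2),
  Population = c(2.7e6, 2.8e6, 2.9e6, 9.7e6, 10.6e6, 10.8e6)
)

# Aesthetic: both states share one panel and are told apart by colour.
p_colour <- ggplot(toy, aes(x = Year, y = Population, colour = State)) +
  geom_point()

# Facets: one panel per state, like facets = ~State in qplot().
p_facet <- ggplot(toy, aes(x = Year, y = Population)) +
  geom_point() +
  facet_wrap(~State)

print(p_colour)
print(p_facet)
```

With colour the two states sit on a common scale in one panel; with facets each state gets its own panel, so which is better depends on the comparison you care about.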
Your turn
• Work through each of the example plots
• Try variations to answer your questions
Finished?
• Continue to polish your questions about
the data
• Go to http://had.co.nz/ggplot2 and figure
out how to make other plots that you
know about
Histograms and barcharts
• Used to display the distribution of a
variable
• Continuous variable → histogram
• Categorical variable → bar chart
• For the histogram, you should always vary
the binwidth
Examples
qplot(Population, data=fbi,
geom="histogram")
qplot(Population, data=fbi,
geom="histogram", binwidth=100000)
qplot(Population, data=fbi,
geom="histogram", binwidth=1000000)
qplot(Population, data=fbi,
geom="histogram", binwidth=5000000)
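Why vary the bin width? The sketch below counts observations per bin for three widths without drawing anything, using base R's hist(). The vector `population` and the helper `bin_counts()` are hypothetical, introduced only for this illustration.

```r
# Made-up population values in millions (illustrative values only).
population <- c(0.6, 0.9, 1.2, 1.8, 2.7, 3.1, 4.4, 5.0, 9.7, 18.2)

# Count observations per bin for a given bin width, without plotting.
# Bins start at 0 and extend just far enough to cover the largest value.
bin_counts <- function(x, binwidth) {
  breaks <- seq(0, ceiling(max(x) / binwidth) * binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

bin_counts(population, 1)    # narrow bins: many bars, several empty
bin_counts(population, 5)    # medium bins: skew is easy to see
bin_counts(population, 20)   # one huge bin: all shape is lost
```

The same data can look spiky, skewed, or flat depending only on the width, so no single default tells the whole story.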
Aesthetics & facetting
• As for scatterplot, you can map fill to
another variable, or use facetting to
compare subsets
• Facetting is generally more useful, as it is
easier to compare different groups
Your turn
• Explore the distribution of Murders
• What can you see? What might explain
that pattern?
• Make sure to experiment with bin width!
• Use facetting to explore the relationship
between Murders and States
Zooming
•qplot(Year,
Burglary, data=fbi,
facets=~State, ylim=c(0,1000000))