Subsetting Data
Stat 480
Heike Hofmann
Outline
• More on qplot
• Subsetting Data
• Indexing (review)
• Logical Operations
Scatterplot
Load the fbi data into R, load the ggplot2 package
• Two continuous variables
• qplot(Burglary, Murder, data=fbi)
• qplot(log(Burglary), log(Murder), data=fbi)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi)
Revision: Interpreting a scatterplot
• Big patterns
• Form and direction
• Strength
• Small patterns
• Deviations from the pattern
• Outliers
Interpreting Scatterplots
• Form
• Is the plot linear? Is the plot curved? Is there a distinct pattern in the plot? Are there multiple groups?
• Strength
• Does the plot follow the form very closely? Or is there a lot of variation?
Interpreting Scatterplots
• Direction
• Is the pattern increasing? Is the plot decreasing?
• Positively: Above (below) average in one variable tends to be associated with above (below) average in another variable.
• Negatively: Above (below) average in one variable tends to be associated with below (above) average in another variable.
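The two definitions can be checked on a small example. This is a minimal sketch with made-up vectors (x, y and z are hypothetical, not columns of the fbi data): flag which observations are above average in each variable, then see how often the flags agree.

```r
# Hypothetical toy values, not taken from the fbi data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 3, 2, 5, 4, 6)   # tends to rise with x
z <- rev(y)                # tends to fall with x

# TRUE where an observation is above the average of its variable
x_above <- x > mean(x)
y_above <- y > mean(y)
z_above <- z > mean(z)

# Share of observations on the same side of the average in both variables
agree_xy <- mean(x_above == y_above)   # high share: positive association
agree_xz <- mean(x_above == z_above)   # low share: negative association
agree_xy
agree_xz
```

A share close to 1 points to a positive association, a share close to 0 to a negative one. Real data will rarely be this clean.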
Form: Linear
Strength: Strong, very close to a straight line.
Direction: Two variables are positively associated.
No outliers.
Form: Roughly linear, two distinct groups (more than 40% and less than 40%).
Strength: Not strong. Data points are scattered.
Direction: Negatively associated.
Outliers: None
Aesthetics
• Can map other variables to size or colour
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, colour=State)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, size=Population)
• other aesthetics: shape
Facetting
• Can facet to display plots for different
subsets
• qplot(Year, Population, data=fbi, facets=~State)
Facets vs aesthetics
• Will need to experiment as to which one
answers your question/tells the story best
• Remember, just like with pivot tables we
want comparisons of interest to be close
together
Univariate plots
qplot(x, data=dataset)
• produces histogram or barchart
• Categorical variable → bar chart, geom="bar"
• Continuous variable → histogram, geom="histogram"
• For the histogram, you should always vary the binwidth
Histograms and Barcharts
• What do we look for?
• Symmetry/Skewness
• Modes, Groups (big pattern: where is the bulk of the data?)
• Gaps & Outliers (deviation from the big pattern: where are the other points?)
Aesthetics & facetting
• Use fill to color bars according to
another variable, or use facetting to
compare subsets
• Facetting (later) is generally more useful, as
it is easier to compare different groups
Boxplots
definition by J.W. Tukey (1960s, EDA 1977)
qplot(x,y, geom="boxplot")
[Figure: annotated boxplot; axis tick labels omitted]
Median
quartiles: 25%, 75%
IQR = inter-quartile range = upper quartile - lower quartile
hinges: data point ≥ 25% - 1.5 * IQR (lower), data point ≤ 75% + 1.5 * IQR (upper)
outliers: data points between hinges and quartile ± 3*IQR
extreme outliers: data points beyond quartile ± 3*IQR
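The boxplot quantities above can be computed by hand. This is a minimal sketch on a made-up vector (x is hypothetical, not from the fbi data). It assumes the default quartile rule of quantile(); plotting functions may use a slightly different convention, so values can differ a little from what a drawn boxplot shows.

```r
# Hypothetical toy data, not from the fbi dataset
x <- c(2, 4, 5, 6, 7, 8, 9, 11, 30)

# Lower and upper quartile (default quantile() rule)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Limits at 1.5 * IQR beyond the quartiles
lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[2] + 1.5 * iqr

# Most extreme data points still inside the limits (the whisker ends)
whisker_ends <- range(x[x >= lower_limit & x <= upper_limit])

# Points outside the 1.5 * IQR limits, and those beyond 3 * IQR
outliers <- x[x < lower_limit | x > upper_limit]
extreme  <- x[x < q[1] - 3 * iqr | x > q[2] + 3 * iqr]

iqr
whisker_ends
outliers
extreme
```

Here the value 30 lies beyond both the 1.5 * IQR and the 3 * IQR limits, so it shows up as an outlier and as an extreme outlier.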
Boxplots
• Pros:
• Symmetry vs Skewness
• Outliers
• Quick Summary
• Comparisons across multiple Treatments (side by side boxplots)
• Cons:
• Boxplots hide multiple modes and gaps in the
data
Your turn
• Explore the distribution of Murders
• What can you see? What might explain
that pattern?
• Make sure to experiment with bin width in
histograms!
• Use facetting to explore the relationship
between Murders and States
Subsets of Data
• Facetting & Zooming are visual ways of
subsetting the data
• Next: use R to subset
Logical vectors
• Very important!
• Usually created with a logical comparison
• <, >, ==, !=, <=, >=
• x %in% c(1, 4, 3, 7)
• subset
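A short sketch of how these comparisons produce logical vectors, using a made-up vector (x is hypothetical):

```r
x <- c(3, 8, 1, 7, 4)   # hypothetical values

big    <- x > 3                  # FALSE TRUE FALSE TRUE TRUE
listed <- x %in% c(1, 4, 3, 7)   # TRUE FALSE TRUE TRUE TRUE

x[big]     # keeps the elements where big is TRUE: 8 7 4
sum(big)   # counts the TRUE values: 3
```

Indexing with a logical vector keeps exactly the elements where the vector is TRUE, which is what makes these comparisons useful for subsetting.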
Logical expressions
• & and | are the logical "and" and "or"
• ! is the logical negation
• use parentheses () when linking
expressions to avoid mis-interpretation
Logical Operators
[Venn diagram: sets A and B]
A & B is the set of elements that are in both A and B
[Venn diagram: sets A and B]
A | B is the set of elements that are in A or in B or in both
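A small sketch of the three operators on a made-up vector (x, a and b are hypothetical), including a case where parentheses change the result:

```r
x <- c(1, 5, 10, 15, 20)   # hypothetical values

a <- x > 4     # FALSE TRUE TRUE TRUE TRUE
b <- x < 12    # TRUE TRUE TRUE FALSE FALSE

a & b   # FALSE TRUE TRUE FALSE FALSE
a | b   # TRUE TRUE TRUE TRUE TRUE
!a      # TRUE FALSE FALSE FALSE FALSE

# & is evaluated before |, so parentheses change the result here
no_parens   <- x < 4 | x > 12 & x > 1      # same as x < 4 | (x > 12 & x > 1)
with_parens <- (x < 4 | x > 12) & x > 1
no_parens     # TRUE FALSE FALSE TRUE TRUE
with_parens   # FALSE FALSE FALSE TRUE TRUE
```

The two results differ for the first element, which is why explicit parentheses are the safer habit.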
Updating subsets
• You can take a subset and update the
original data
• a <- 1:4
• a[2:3] <- 0
• a
• Very useful with logical subsetting
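A sketch of both kinds of update: by position, as on the slide, and by a logical condition. The vector b is a hypothetical example.

```r
# Update by position
a <- 1:4
a[2:3] <- 0
a            # 1 0 0 4

# Update by logical condition: replace all negative values with 0
b <- c(5, -2, 7, -9)   # hypothetical values
b[b < 0] <- 0
b            # 5 0 7 0
```

Only the selected elements change; the rest of the vector stays as it was.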
Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Get logical vector that is TRUE when number
is:
• less than 20
• squared value is at least 100 or less than 10
• equals 1 or 3
• even (look at a %% 2)
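One possible set of answers, as a sketch; other expressions give the same result. The variable names are chosen here for illustration only.

```r
a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)

less_than_20 <- a < 20                       # FALSE only for the value 20
square_rule  <- (a^2 >= 100) | (a^2 < 10)    # squared value at least 100, or below 10
one_or_three <- a %in% c(1, 3)               # equals 1 or 3
is_even      <- a %% 2 == 0                  # remainder 0 when divided by 2

a[is_even]   # 20 8 10
```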
Subsets
subset(dataset, logical expression)
• subset(fbi, Year == 2011)
• subset(fbi, (Year == 2011) & (State == "Kansas"))
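To try these calls without the fbi data, here is a sketch on a small stand-in data frame. The data frame toy and all of its values are made up for illustration; only the column names mirror the fbi examples.

```r
# Hypothetical stand-in for the fbi data; the numbers are invented
toy <- data.frame(
  State  = c("Iowa", "Iowa", "Kansas", "Kansas"),
  Year   = c(2010, 2011, 2010, 2011),
  Murder = c(38, 44, 100, 110),
  stringsAsFactors = FALSE
)

# All rows from one year
y2011 <- subset(toy, Year == 2011)

# Two conditions combined with &
kansas_2011 <- subset(toy, (Year == 2011) & (State == "Kansas"))

# The same selection written with logical indexing
by_index <- toy[toy$Year == 2011 & toy$State == "Kansas", ]

y2011
kansas_2011
```

Inside subset() the column names can be used directly, while logical indexing needs the toy$ prefix.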
Useful Commands
• nrow(dataset) # number of records
• quantile(variable, probs=0.001, na.rm=T) # retrieves 0.1 percentile of variable
• which(logical variable) # retrieves all indices for which the variable is TRUE
• which.max(variable)
  which.min(variable) # retrieve index of highest (lowest) value in variable
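A sketch of these commands on a small made-up data frame (toy and its values are hypothetical), including a missing value to show why na.rm matters:

```r
# Hypothetical toy data with one missing value
toy <- data.frame(v = c(12, 7, NA, 30, 7, 18))

n_rec <- nrow(toy)                                   # 6 records
med   <- quantile(toy$v, probs = 0.5, na.rm = TRUE)  # median of the non-missing values: 12
idx   <- which(toy$v > 10)                           # 1 4 6; the NA comparison is dropped
i_max <- which.max(toy$v)                            # 4
i_min <- which.min(toy$v)                            # 2, the first of the tied minima

toy$v[i_max]   # 30
```

Without na.rm = TRUE, quantile() stops with an error when the variable contains missing values.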
Your Turn
FBI Data
• Get a subset of all crimes in Iowa, i.e.:
...
iowa <- ...
Plot incidences/rates for one type of crime over
time.
• Get a subset of all crimes in 2009, and plot one
aspect of it.
• Get a subset of the data that includes number of
Homicides in the last five years. Find rate of
homicides, extract all states that have a rate above
the 90% quantile across the States, and plot.