The dplyr package stat 480

library(dplyr)
library(Lahman)
Your Turn
• Use library(Lahman) to make the baseball archive active in R.
• This package has a lot of different baseball-related data. Have a look at library(help=Lahman)
• The dataset HallOfFame contains voting data to determine inductions to Baseball’s Hall of Fame.
• Based on the data, how many candidates were considered in 2014?
• How many baseball players were inducted?
• Who are they? (To answer this question in R, look at the data set Master.)
• How often were the successful inductees considered before?
summarise
• What does the summarise function do? Read up on it on its help pages:
• help(summarise)
summarise
# overall batting average
summarise(Batting,
  mba = sum(H, na.rm=T)/sum(AB, na.rm=T)
)

summarise(Batting,
  first = min(yearID),                  # first year of baseball records
  career = max(yearID) - min(yearID),   # duration of record taking in years
  nteams = length(unique(teamID)),      # different number of teams
  nplayers = length(unique(playerID))   # number of baseball players in the dataset
)
dplyr package
• introduces workflow that makes working
with large datasets (relatively) easy
• main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
group_by
• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics will be calculated for every group
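As a small illustration of the group_by/summarise combination, the sketch below uses a made-up data frame (toy, with hypothetical players and numbers that are not taken from Lahman) so the results can be checked by hand. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: two players, hits (H) and at bats (AB) per season
toy <- data.frame(
  playerID = c("a", "a", "b", "b", "b"),
  yearID   = c(2001, 2002, 2001, 2002, 2003),
  H        = c(10, 20, 5, NA, 15),
  AB       = c(40, 60, 20, NA, 60)
)

# Ungrouped: a single row summarising the whole data frame
overall <- summarise(toy, avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE))

# Grouped: the same summary is computed once per playerID
by_player <- summarise(
  group_by(toy, playerID),
  seasons = n_distinct(yearID),
  avg     = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
)

print(overall)
print(by_player)
```

The ungrouped call returns one row, while the grouped call returns one row for each of the two players.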
library(dplyr)
library(Lahman)

> summarise(Batting,
+   mba = sum(H, na.rm=T)/sum(AB, na.rm=T),
+   first = min(yearID),                  # first year of baseball records
+   career = max(yearID) - min(yearID),   # duration of record taking in years
+   nteams = length(unique(teamID)),      # different number of teams
+   nplayers = length(unique(playerID))   # number of baseball players in the dataset
+ )
        mba first career nteams nplayers
1 0.2620073  1871    142    149    18107
>
> summarise(group_by(Batting, playerID),
+   mba = sum(H, na.rm=T)/sum(AB, na.rm=T),
+   first = min(yearID),                  # first year of baseball records
+   career = max(yearID) - min(yearID),   # duration of record taking in years
+   nteams = length(unique(teamID)),      # different number of teams
+   nplayers = length(unique(playerID))   # number of baseball players in the dataset
+ )
Source: local data frame [18,107 x 6]

   playerID       mba first career nteams nplayers
1 aardsda01 0.0000000  2004      9      7        1
2 aaronha01 0.3049984  1954     22      3        1
3 aaronto01 0.2288136  1962      9      2        1
4  aasedo01 0.0000000  1977     13      5        1
5  abadan01 0.0952381  2001      5      3        1
Chaining operator %.%
• x %.% f(y) is equivalent to f(x, y)
• Batting %.% group_by(playerID) is equivalent to group_by(Batting, playerID)
• Read %.% as ‘then’, i.e. “take data, then group it by player’s id, then summarise it to …”
Chained version of example
Batting %.%
group_by(playerID) %.%
summarise(
seasons = max(yearID)-min(yearID)+1,
atbats = sum(AB, na.rm=T),
avg = sum(H, na.rm=T)/sum(AB, na.rm=T)
)
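The %.% operator shown here was deprecated in later dplyr releases in favour of the %>% pipe. A sketch of the same chain written with %>% follows; it runs on a made-up data frame (toy is hypothetical, not part of Lahman) so that it stays self-contained, and it assumes a dplyr version that provides %>%.

```r
library(dplyr)

# Hypothetical toy data standing in for Batting
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(2001, 2003, 2002),
  H        = c(10, 20, 5),
  AB       = c(40, 60, 20)
)

# "take toy, then group it by player, then summarise each group"
career <- toy %>%
  group_by(playerID) %>%
  summarise(
    seasons = max(yearID) - min(yearID) + 1,
    atbats  = sum(AB, na.rm = TRUE),
    avg     = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
  )

print(career)
```

Replacing toy with Batting should give the per-player summary from the slide, provided Lahman is loaded.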
Your Turn
• Use dplyr statements to get for each player
from the HallOfFame data
• the number of times the player was on the
ballots
• whether the final attempt was successful or
not
• the percentage of votes the player got for the
final round
filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions
• filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with the logical ‘AND’ &. You can use other boolean operators explicitly
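To make the AND behaviour concrete, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). It shows that comma-separated conditions behave like a single condition joined with &, and that an OR has to be written out with |.

```r
library(dplyr)

# Hypothetical toy data: home runs (HR) per player and season
toy <- data.frame(
  playerID = c("a", "a", "b", "c"),
  yearID   = c(1999, 2005, 2005, 2010),
  HR       = c(3, 25, 8, 30)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, yearID >= 2000, HR > 20)

# The same subset written with an explicit &
both_explicit <- filter(toy, yearID >= 2000 & HR > 20)

# Other boolean operators, such as OR, are spelled out explicitly
either <- filter(toy, yearID < 2000 | HR > 20)

print(both)
print(either)
```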
Your Turn
• Use dplyr statements to get the number of
team members on a team for each season
(think of unique)
• Has the number of home runs per season
changed over time? Summarize the data with
dplyr routines first, then visualize.
Functions in R
• Have been using functions a lot, now we
want to write them ourselves!
• Idea: avoid repetitive coding (errors will
creep in)
• Instead: extract common core, wrap it in a
function, make it reusable
Basic Structure
• Name
• Input arguments
  • names,
  • default values
• Body
• Output values
A first function
mean <- function(x) {
  return(sum(x)/length(x))
}

mean(1:15)
mean(c(1:15, NA))

mean <- function(x, na.rm=F) {
  if (na.rm) x <- na.omit(x)
  return(sum(x)/length(x))
}

mean(1:15)
mean(c(1:15, NA), na.rm=T)
Function mean
• Name: mean
• Input arguments: x, na.rm=F
  • names,
  • default values
• Body: if (na.rm) x <- na.omit(x)
• Output values: return(sum(x)/length(x))
Function Writing
• Start simple, then extend
• Test out each step of the way
• Don’t try too much at once
• help(browser)
Practice
• Write a function called mba
  input: playerID
  output: life-time batting average for playerID
• What does mba("bondsba01") do?
• Write a function called pstats
  input: playerID
  output: life-time batting average for playerID, number of overall at bats
Checkpoint
• Submit all of your code for the last Your Turn at
http://heike.wufoo.com/forms/check-point/
Testing
• Always test the functions you’ve written
• Even better: let somebody else test them
for you
• Switch seats with your neighbor, test their
functions!
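One lightweight way to test a function is a handful of stopifnot() checks on inputs whose answers are known in advance. The sketch below is only an illustration: my_mean is a hypothetical stand-in that mirrors the mean example above, and the chosen checks are assumptions about what is worth testing.

```r
# Hypothetical function under test, mirroring the mean example above
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)
  sum(x) / length(x)
}

# Each check stops with an error if its condition is not TRUE
stopifnot(my_mean(1:15) == 8)                       # plain input
stopifnot(is.na(my_mean(c(1:15, NA))))              # NA propagates by default
stopifnot(my_mean(c(1:15, NA), na.rm = TRUE) == 8)  # NA removed on request

# Edge case: an empty vector gives 0/0, which is NaN in R
stopifnot(is.nan(my_mean(numeric(0))))
```

If the script runs through without an error, all checks passed. A neighbor testing the function would typically add inputs the author did not think of, such as the empty vector above.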