dplyr & Functions stat 480 Heike Hofmann

Outline
• dplyr functions and package
• Functions
library(dplyr)
data(baseball, package = "plyr")
Your Turn
• Use data(baseball, package="plyr") to make
the baseball dataset active in R.
• Subset the data on your favorite player (you will need
the Lahman ID, e.g.
Sammy Sosa sosasa01, Barry Bonds bondsba01,
Babe Ruth ruthba01).
• Compute your player’s batting average for each
season (batting average = #hits / #at bats).
• Define a new variable experience in the data set as
year - min(year).
• Plot averages by #years of experience.
• Compute an all-time batting average for your player.
summarise
• What does the summarise function do?
• Read up on it on its help pages:
help(summarise)
summarise
# overall batting average
summarise(baseball,
  mba = sum(h)/sum(ab)
)
summarise(baseball,
  # first year of baseball records
  first = min(year),
  # duration of record taking in years
  duration = max(year) - min(year),
  # number of different teams
  nteams = length(unique(team)),
  # number of baseball players in the dataset
  nplayers = length(unique(id))
)
dplyr package
• introduces a workflow that makes working
with large datasets (relatively) easy
• main functionality:
group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
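Of the four verbs listed, mutate adds new variables to a dataset while leaving the number of rows unchanged. A minimal sketch follows; the data frame toy, its player ids and all of its values are made up for illustration and only mimic the column names of baseball, and the variable decade is a hypothetical example.

```r
library(dplyr)

# Toy data: invented values and hypothetical player ids,
# with the same column names as the baseball dataset
toy <- data.frame(
  id   = c("playera01", "playera01", "playerb01", "playerb01"),
  year = c(1990, 1991, 1990, 1991),
  h    = c(100, 120, 50, 0),
  ab   = c(400, 400, 250, 0)
)

# mutate() appends a new column computed from existing ones;
# every original row and column is kept
toy2 <- mutate(toy, decade = 10 * floor(year / 10))
toy2
```

In contrast to summarise, which collapses the data to one row (or one row per group), mutate returns one row per input row.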
group_by
• group_by(data, var1, ...)
is a function that takes a dataset and
introduces a group for each (combination
of) level(s) of the grouping variable(s)
• Power combination:
group_by and summarise.
For a grouped data frame, the summary
statistics are calculated for every group.
library(dplyr)
data(baseball, package="plyr")
summarise(baseball,
seasons = max(year)-min(year)+1,
atbats = sum(ab),
avg = sum(h, na.rm=T)/sum(ab, na.rm=T)
)
  seasons  atbats       avg
1     137 4891061 0.2739821
summarise(group_by(baseball, id),
seasons = max(year)-min(year)+1,
atbats = sum(ab),
avg = sum(h, na.rm=T)/sum(ab, na.rm=T)
)
         id seasons atbats       avg
1 perezne01      12   5127 0.2672128
2 walketo04      12   4554 0.2889767
3 sweenma01      13   1738 0.2600690
4 schmija01      13    591 0.1049069
5 loaizes01      13    265 0.1660377
Chaining operator %>%
• x %>% f(y)
is equivalent to f(x, y)
• baseball %>% group_by(id)
is equivalent to
group_by(baseball, id)
• Read %>% as ‘then’, i.e.
“take the data,
then group it by player’s id,
then summarise it to …”
• %>% replaces %.%, the spelling used in the
earliest dplyr releases
Chained version of example
baseball %>%
  group_by(id) %>%
  summarise(
    seasons = max(year)-min(year)+1,
    atbats = sum(ab),
    avg = sum(h, na.rm=T)/sum(ab, na.rm=T)
  )
Your Turn
• Use dplyr statements to get
(a) the lifetime batting average for each
player (mba)
(b) the lifetime number of at bats for each
player (nab)
• Plot nab versus mba.
filter
• filter(data, expr1, ...)
is a function that takes a dataset and subsets
it according to a set of expressions
• filter() works similarly to subset(),
except that you can give it any number of
filtering conditions, which are joined together
with the logical ‘AND’ &.
• You can use other boolean operators
explicitly.
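The difference between comma-separated conditions and explicit boolean operators can be sketched as follows. The data frame toy, its player ids and its values are invented for illustration and only mimic the column names of baseball.

```r
library(dplyr)

# Toy data: invented values and hypothetical player ids
toy <- data.frame(
  id   = c("playera01", "playera01", "playerb01", "playerb01"),
  year = c(1990, 1991, 1990, 1991),
  h    = c(100, 120, 50, 0),
  ab   = c(400, 400, 250, 0)
)

# Conditions separated by commas are combined with AND
both <- filter(toy, year >= 1991, ab > 0)

# The same subset, with the AND written out
both2 <- filter(toy, year >= 1991 & ab > 0)

# Any other combination, such as OR, has to be written explicitly
either <- filter(toy, year >= 1991 | ab > 300)
```

Here both and both2 contain the same single row, while either keeps every row that satisfies at least one of the two conditions.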
Your Turn
• Use dplyr statements to get the number of
players on each team for each season
(think of unique)
• Has the number of home runs per season
changed over time? Summarize the data with
dplyr routines first, then visualize.
Functions in R
• We have been using functions a lot, now we
want to write them ourselves!
• Idea: avoid repetitive coding (errors will
creep in)
• Instead: extract the common core, wrap it in a
function, make it reusable
Basic Structure
• Name
• Input arguments
  • names,
  • default values
• Body
• Output values
A first function
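# first version (note: this masks R's built-in mean)
mean <- function(x) {
  return(sum(x)/length(x))
}
mean(1:15)
mean(c(1:15, NA))   # NA: a missing value propagates through sum()

# second version: optionally drop missing values first
mean <- function(x, na.rm=FALSE) {
  if (na.rm) x <- na.omit(x)
  return(sum(x)/length(x))
}
mean(1:15)
mean(c(1:15, NA), na.rm=TRUE)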
Function mean
• Name: mean
• Input arguments: x, na.rm=FALSE
  • names,
  • default values
• Body:
  if (na.rm) x <- na.omit(x)
• Output values:
  return(sum(x)/length(x))
Function Writing
• Start simple, then extend
• Test out each step of the way
• Don’t try too much at once
• help(browser)
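As one possible illustration of this step-by-step approach, the sketch below writes a small function, tries it out, and only then extends it. The function range_width is a hypothetical example, not part of the course material.

```r
# Step 1: the simplest version that could work
range_width <- function(x) {
  max(x) - min(x)
}
range_width(c(3, 7, 5))

# Step 2: extend with one new feature, then test again
range_width <- function(x, na.rm = FALSE) {
  # browser()  # uncomment to pause here and inspect x interactively
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}
range_width(c(3, 7, 5))
range_width(c(3, 7, NA, 5), na.rm = TRUE)
```

Both calls in step 2 return 4. Placing browser() inside the body stops execution at that line in an interactive session, so that the arguments can be examined before the result is computed.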
Practice
• Write a function called mba
input: playerID
output: lifetime batting average for playerID
• What does mba("bondsba01") do?
• Write a function called pstats
input: playerID
output: lifetime batting average for playerID,
number of overall at bats
Checkpoint
• Submit all of your code for the last Your Turn at
http://heike.wufoo.com/forms/check-point/
Testing
• Always test the functions you’ve written
• Even better: let somebody else test them
for you
• Switch seats with your neighbor, test their
functions!
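One lightweight way to write such tests down is stopifnot(), which raises an error as soon as one of its conditions is not TRUE. The sketch below checks a copy of the averaging function from the earlier slide; the name my_mean is an assumption, chosen here so that R's built-in mean is not masked.

```r
# Copy of the function from the slides under a hypothetical name
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)
  sum(x) / length(x)
}

# Typical input
stopifnot(my_mean(1:15) == 8)

# A missing value propagates unless na.rm = TRUE
stopifnot(is.na(my_mean(c(1:15, NA))))
stopifnot(my_mean(c(1:15, NA), na.rm = TRUE) == 8)

# Edge case: an empty vector gives 0/0, which is NaN
stopifnot(is.nan(my_mean(numeric(0))))
```

If all checks pass, the script runs silently. Edge cases such as empty input or missing values are the ones a neighbor is most likely to find.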