The dplyr package
Stat 480

    library(dplyr)
    library(Lahman)

Your Turn
• Use library(Lahman) to make the baseball archive active in R.
• This package has a lot of different baseball-related data. Have a look at library(help=Lahman)
• The dataset HallOfFame contains voting data to determine inductions to Baseball’s Hall of Fame.
  • Based on the data, how many candidates were considered in 2014?
  • How many baseball players were inducted?
  • Who are they? (To answer this question in R, look at the data set Master.)
  • How often were the successful inductees considered before?

summarise
• What does the summarise function do? Read up on it on its help pages:
  • help(summarise)

summarise

    # overall batting average
    summarise(Batting,
      mba = sum(H, na.rm=T)/sum(AB, na.rm=T)
    )

    summarise(Batting,
      first = min(yearID),                  # first year of baseball records
      career = max(yearID) - min(yearID),   # duration of record taking in years
      nteams = length(unique(teamID)),      # number of different teams
      nplayers = length(unique(playerID))   # number of baseball players in the dataset
    )

dplyr package
• introduces a workflow that makes working with large datasets (relatively) easy
• main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

group_by
• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise
  For a grouped data frame, the summary statistics will be calculated for every group.

    > library(dplyr)
    > library(Lahman)
    > summarise(Batting,
    +   mba = sum(H, na.rm=T)/sum(AB, na.rm=T),
    +   first = min(yearID),                  # first year of baseball records
    +   career = max(yearID) - min(yearID),   # duration of record taking in years
    +   nteams = length(unique(teamID)),      # number of different teams
    +   nplayers = length(unique(playerID))   # number of baseball players in the dataset
    + )
            mba first career nteams nplayers
    1 0.2620073  1871    142    149    18107
    >
    > summarise(group_by(Batting, playerID),
    +   mba = sum(H, na.rm=T)/sum(AB, na.rm=T),
    +   first = min(yearID),                  # first year of baseball records
    +   career = max(yearID) - min(yearID),   # duration of record taking in years
    +   nteams = length(unique(teamID)),      # number of different teams
    +   nplayers = length(unique(playerID))   # number of baseball players in the dataset
    + )
    Source: local data frame [18,107 x 6]

       playerID       mba first career nteams nplayers
    1 aardsda01 0.0000000  2004      9      7        1
    2 aaronha01 0.3049984  1954     22      3        1
    3 aaronto01 0.2288136  1962      9      2        1
    4  aasedo01 0.0000000  1977     13      5        1
    5  abadan01 0.0952381  2001      5      3        1

Chaining operator %.%
• x %.% f(y) is equivalent to f(x, y)
• Batting %.% group_by(playerID) is equivalent to group_by(Batting, playerID)
• Read %.% as ‘then’, i.e. “take data, then group it by player’s id, then summarise it to …”

Chained version of example

    Batting %.%
      group_by(playerID) %.%
      summarise(
        seasons = max(yearID)-min(yearID)+1,
        atbats = sum(AB, na.rm=T),
        avg = sum(H, na.rm=T)/sum(AB, na.rm=T)
      )

Your Turn
• Use dplyr statements to get for each player from the HallOfFame data
  • the number of times the player was on the ballots
  • whether the final attempt was successful or not
  • the percentage of votes the player got for the final round

filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions
• filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with the logical ‘AND’ &.
  You can also use other Boolean operators (such as | for ‘OR’) explicitly.

Your Turn
• Use dplyr statements to get the number of team members on a team for each season (think of unique)
• Has the number of home runs per season changed over time? Summarize the data with dplyr routines first, then visualize.

Functions in R
• We have been using functions a lot, now we want to write them ourselves!
• Idea: avoid repetitive coding (errors will creep in)
• Instead: extract the common core, wrap it in a function, make it reusable

Basic Structure
• Name
• Input arguments
  • names,
  • default values
• Body
• Output values

A first function

    mean <- function(x) {
      return(sum(x)/length(x))
    }

    mean(1:15)
    mean(c(1:15, NA))

    mean <- function(x, na.rm=F) {
      if (na.rm) x <- na.omit(x)
      return(sum(x)/length(x))
    }

    mean(1:15)
    mean(c(1:15, NA), na.rm=T)

Function mean
• Name: mean
• Input arguments: x, na.rm=F
  • names,
  • default values
• Body: if (na.rm) x <- na.omit(x)
• Output values: return(sum(x)/length(x))

Function Writing
• Start simple, then extend
• Test out each step of the way
• Don’t try too much at once
• help(browser)

Practice
• Write a function called mba
    input: playerID
    output: lifetime batting average for playerID
• What does mba("bondsba01") do?
• Write a function called pstats
    input: playerID
    output: lifetime batting average for playerID,
            number of overall at bats

Checkpoint
• Submit all of your code for the last Your Turn at http://heike.wufoo.com/forms/check-point/

Testing
• Always test the functions you’ve written
• Even better: let somebody else test them for you
• Switch seats with your neighbor, test their functions!
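
One possible way to follow the testing advice is sketched here for the second version of the mean function from the "A first function" slide. The name my_mean is hypothetical and is used only so that base R's own mean() is not masked. The checks rely on stopifnot() from base R, which stops with an error as soon as one of its conditions is not TRUE. This is a minimal sketch, not a full test suite.

```r
# Second version of the slide's mean(), under a hypothetical name
# (my_mean) so that base::mean stays available for comparison.
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)   # drop missing values only when asked to
  sum(x) / length(x)
}

# Known answer: the numbers 1 to 15 sum to 120, so the mean is 8.
stopifnot(my_mean(1:15) == 8)

# A missing value propagates unless na.rm = TRUE is given.
stopifnot(is.na(my_mean(c(1:15, NA))))
stopifnot(my_mean(c(1:15, NA), na.rm = TRUE) == 8)

# Cross-check against base R on non-integer input.
stopifnot(isTRUE(all.equal(my_mean(c(0.5, 1.5)), base::mean(c(0.5, 1.5)))))
```

If every line runs silently, all checks passed. A neighbor testing the function might add further cases of their own, for example a vector that contains only NA values.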