dplyr & Functions
Stat 480
Heike Hofmann

Outline
• dplyr functions and package
• Functions

    library(dplyr)
    data(baseball, package = "plyr")

Your Turn
• Use data(baseball, package = "plyr") to make the baseball dataset active in R.
• Subset the data on your favorite player (you will need the Lahman ID, e.g. Sammy Sosa sosasa01, Barry Bonds bondsba01, Babe Ruth ruthba01).
• Compute your player's batting average for each season (batting average = #hits / #at bats).
• Define a new variable experience in the data set as year - min(year).
• Plot averages by #years of experience.
• Compute an all-time batting average for your player.

summarise
• What does the summarise function do? Read up on it on its help page:
• help(summarise)

    # overall batting average
    summarise(baseball,
      mba = sum(h)/sum(ab)
    )

    summarise(baseball,
      first = min(year),                 # first year of baseball records
      duration = max(year) - min(year),  # duration of record taking in years
      nteams = length(unique(team)),     # number of different teams
      nplayers = length(unique(id))      # number of baseball players in the dataset
    )

dplyr package
• introduces a workflow that makes working with large datasets (relatively) easy
• main functionality: group_by, summarise, mutate, filter
• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

group_by
• group_by(data, var1, ...)
  is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)
• Power combination: group_by and summarise. For a grouped data frame, the summary statistics are calculated for every group.

    library(dplyr)
    data(baseball, package = "plyr")

    summarise(baseball,
      seasons = max(year) - min(year) + 1,
      atbats = sum(ab),
      avg = sum(h, na.rm = TRUE)/sum(ab, na.rm = TRUE)
    )

      seasons  atbats       avg
    1     137 4891061 0.2739821

    summarise(group_by(baseball, id),
      seasons = max(year) - min(year) + 1,
      atbats = sum(ab),
      avg = sum(h, na.rm = TRUE)/sum(ab, na.rm = TRUE)
    )

             id seasons atbats       avg
    1 perezne01      12   5127 0.2672128
    2 walketo04      12   4554 0.2889767
    3 sweenma01      13   1738 0.2600690
    4 schmija01      13    591 0.1049069
    5 loaizes01      13    265 0.1660377

Chaining operator %.%
• x %.% f(y) is equivalent to f(x, y)
• baseball %.% group_by(id) is equivalent to group_by(baseball, id)
• Read %.% as 'then', i.e. "take data, then group it by player's id, then summarise it to …"
• Later dplyr releases replaced %.% with %>%; with a current installation, write %>% wherever these slides use %.%

Chained version of example

    baseball %.%
      group_by(id) %.%
      summarise(
        seasons = max(year) - min(year) + 1,
        atbats = sum(ab),
        avg = sum(h, na.rm = TRUE)/sum(ab, na.rm = TRUE)
      )

Your Turn
• Use dplyr statements to get
  (a) the lifetime batting average for each player (mba)
  (b) the lifetime number of at bats for each player (nab)
• Plot nab versus mba.

filter
• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions
• filter() works similarly to subset(), except that you can give it any number of filtering conditions, which are joined together with the logical 'AND' (&). Other boolean operators can be used explicitly.

Your Turn
• Use dplyr statements to get the number of team members on a team for each season (think of unique)
• Has the number of home runs per season changed over time? Summarize the data with dplyr routines first, then visualize.

Functions in R
• We have been using functions a lot, now we want to write them ourselves!
• Idea: avoid repetitive coding (errors will creep in)
• Instead: extract common core, wrap it in a function, make it reusable

Basic Structure
• Name
• Input arguments
  • names, default values
• Body
• Output values

A first function

    mean <- function(x) {
      return(sum(x)/length(x))
    }
    mean(1:15)
    mean(c(1:15, NA))

    mean <- function(x, na.rm = FALSE) {
      if (na.rm) x <- na.omit(x)
      return(sum(x)/length(x))
    }
    mean(1:15)
    mean(c(1:15, NA), na.rm = TRUE)

Function mean
• Name: mean
• Input arguments: x, na.rm = FALSE
  • names, default values
• Body: if (na.rm) x <- na.omit(x)
• Output values: return(sum(x)/length(x))

Function Writing
• Start simple, then extend
• Test out each step of the way
• Don't try too much at once
• help(browser)

Practice
• Write a function called mba
  input: playerID
  output: lifetime batting average for playerID
• What does mba("bondsba01") do?
• Write a function called pstats
  input: playerID
  output: lifetime batting average for playerID, number of overall at bats

Checkpoint
• Submit all of your code for the last Your Turn at http://heike.wufoo.com/forms/check-point/

Testing
• Always test the functions you've written
• Even better: let somebody else test them for you
• Switch seats with your neighbor, test their functions!
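
Testing: a minimal sketch
One possible way to follow the testing advice is to write down what a function should return and let R check it. The sketch below reuses the mean example from the earlier slide under the hypothetical name my_mean (renamed here only so that it does not mask base::mean) and checks it with stopifnot(), which stays silent when every condition holds and raises an error otherwise. The test inputs are made up for illustration.

```r
# Hypothetical copy of the slide's mean(), renamed so that it does not
# mask base::mean(). NA removal uses is.na() instead of na.omit().
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sum(x) / length(x)
}

# Known answer: 1 + 2 + ... + 15 = 120, and 120 / 15 = 8
stopifnot(my_mean(1:15) == 8)

# Without na.rm, a missing value propagates to the result
stopifnot(is.na(my_mean(c(1:15, NA))))

# With na.rm = TRUE, the missing value is dropped before averaging
stopifnot(my_mean(c(1:15, NA), na.rm = TRUE) == 8)

# Cross-check against the built-in implementation
x <- c(0.1, 0.2, 0.3)
stopifnot(isTRUE(all.equal(my_mean(x), base::mean(x))))
```

The same pattern could be applied to mba and pstats: pick a player whose numbers you have computed by hand and assert that the function reproduces them.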