Three hour tutorial data.table 30 June 2014 useR! - Los Angeles Matt Dowle Overview data.table in a nutshell 20 mins Client server recorded demo 20 mins Main features in more detail 2 hours Q&A throughout 20 mins Every question is a good question! Feel free to interrupt. 2 What is data.table? Think data.frame, inherits from it data.table() and ?data.table Goals: Reduce programming time fewer function calls, less variable name repetition Reduce compute time fast aggregation, update by reference In-memory only, 64bit and 100GB routine Useful in finance but wider use in mind, too e.g. genomics 3 Reducing programming time trades[ filledShares < orderedShares, sum( (orderedShares-filledShares) * orderPrice / fx ), by = "date,region,algo" Aside : could add database backend ] R : SQL : i j by WHERE SELECT GROUP BY 4 Reducing compute time e.g. 10 million rows x 3 columns x,y,v 230MB DF[DF$x=="R" & DF$y==123,] # 8 s DT[.("R",123)] # 0.008s tapply(DF$v,DF$x,sum) # 22 DT[,sum(v),by=x] # s 0.83s See above in timings vignette (copy and paste) 5 Fast and friendly file reading e.g. 50MB .csv, 1 million rows x 6 columns read.csv("test.csv") # 30-60s read.csv("test.csv", colClasses=, nrows=, etc...) # 10s fread("test.csv") # 3s e.g. 20GB .csv, 200 million rows x 16 columns read.csv(” big.csv” , ...) # hours fread("big.csv") # 8m 6 Update by reference using := Add new column ”sectorMCAP” by group : DT[,sectorMCAP:=sum(MCAP),by=Sector] Delete a column (0.00s even on 20GB table) : DT[,colToDelete:=NULL] Be explicit to really copy entire 20GB : DT2 = copy(DT) 7 Why R? 1) R's lazy evaluation enables the syntax : DT[ filledShares < orderedShares ] query optimization before evaluation 2) Pass DT to any package taking DF. It works. is.data.frame(DT) == TRUE 3) CRAN (cross platform release, quality control) 4) Thousands of statistical packages to use with data.table 8 Web page visits 9 Downloads RStudio mirror only 10 data.table support 11 Client/server recorded demo http://www.youtube.com/watch?v=rvT8XThGA8o 12 Main features in more detail ... 13 Essential! Given a 10,000 x 10,000 matrix in any language Sum the rows Sum the columns Is one way faster, and why? 14 setkey(DT, colA, colB) Sorts the table by colA then colB. That's all. Like a telephone number directory: last name then first name X[Y] is just binary search to X's key You DO need a key for joins X[Y] You DO NOT need a key for by= (but many examples online include it) 15 Example DT x y B 7 A 2 B 1 A 5 B 9 16 DT[2:3,] x y B A 7 2 B 1 A 5 B 9 17 DT x y B A 7 2 B 1 A 5 B 9 18 setkey(DT, x) x y A 2 A 5 B 7 B 1 B 9 19 DT[”B”,] x y A 2 A B 5 7 B 1 B 9 20 DT[”B”,mult=”first”] x y A 2 A B 5 7 B 1 B 9 21 DT[”B”,mult=”last”] x y A 2 A 5 B 7 B B 1 9 22 DT[”B”,sum(y)] == 17 x y A 2 A B 5 7 B 1 B 9 23 DT[c(”A”,”B”),sum(y)] == 24 x y A 2 A 5 B 7 B 1 B 9 24 X[Y] A A B B B 2 5 7 1 9 [ ]= B 11 C 12 B 7 B 1 B 9 C NA 11 11 11 12 Outer join by default (in SQL parlance) 25 X[Y, nomatch=0] A A B B B 2 5 7 1 9 [ ]= B 11 C 12 B B B 7 1 9 11 11 11 Inner join 26 X[Y,head(.SD,n),by=.EACHI] x A A B B B y 2 5 7 1 9 Join inherited column [] z n A 1 B 2 = x A B B i.e. select a data driven topN for each i row y 2 7 1 27 ”Cold” by (i.e. without setkey) Consecutive calls unrelated to key are fine and common practice : > DT[, sum(v), by="x,y"] > DT[, sum(v), by="z"] > DT[, sum(v), by=colA%%5] Also known as "ad hoc by" 28 Programatically vary by bys = list ( quote(y%%2), quote(x), quote(y%%3) ) for (this in bys) print(X[,sum(y),by=eval(this)]) 1: 2: 1: 2: 1: 2: 3: this 0 1 this A B this 2 1 0 V1 2 22 V1 7 17 V1 7 8 9 29 DT[i, j, by] Out loud: ”Take DT, subset rows using i, then calculate j grouped by by” Once you grok the above reading, you don't need to memorize any other functions as all operations follow the same intuition as base. 30 June 2012 31 data.table answer NB: It isn't just the speed, but the simplicity. It's easy to write and easy to read. 32 User's reaction ”data.table is awesome! That took about 3 seconds for the whole thing!!!” Davy Kavanagh, 15 Jun 2012 33 but ... Example had by=key(dt) ? Yes, but it didn't need to. If the data is very large (1GB+) and the groups are big too then getting the groups together in memory can speed up a bit (cache efficiency). 34 by= and keyby= Both by and keyby retain row order within groups – important, often relied on Unlike SQL by retains order of the groups (by order of first appearance) - important, often relied on setkeyv(DT[,,by=],by) DT[,,keyby=] # same, shortcut 35 Prevailing joins (roll=TRUE) One reason for setkey's design. Last Observation (the prevailing one) Carried Forward (LOCF), efficiently Roll forwards or backward Roll the last observation forwards, or not Roll the first observation backwards, or not Limit the roll; e.g. 30 days (roll = 30) Join to nearest value (roll = ”nearest”) i.e. ordered joins 36 … continued roll = [-Inf,+Inf] | TRUE | FALSE | ”nearest” rollends = c(FALSE,TRUE) By example ... 37 PRC setkey(PRC, id, date) 1. PRC[.(“SBRY”)] # all 3 rows 2. PRC[.(“SBRY”,20080502),price] # 391.50 3. PRC[.(“SBRY”,20080505),price] # NA 4. PRC[.(“SBRY”,20080505),price,roll=TRUE] # 391.50 5. PRC[.(“SBRY”,20080601),price,roll=TRUE] # 389.00 6. PRC[.(“SBRY”,20080601),price,roll=TRUE,rollends=FALSE] # NA 7. PRC[.(“SBRY”,20080601),price,roll=20] # NA 8. PRC[.(“SBRY”,20080601),price,roll=40] # 389.00 38 Performance All daily prices 1986-2008 for all non-US equities • 183,000,000 rows (id, date, price) • 2.7 GB system.time(PRICES[id==”VOD”]) # vector scan user system elapsed 66.431 15.395 81.831 system.time(PRICES[“VOD”]) user system elapsed 0.003 0.000 0.002 # binary search setkey(PRICES, id, date) needed first (one-off apx 20 secs) 39 roll = ”nearest” x A A A B y 2 9 11 3 value 1.1 1.2 1.3 1.4 setkey(DT, x, y) DT[.(”A”,7), roll=”nearest”] 40 Variable name repetition The 3rd highest voted [R] question (of 43k) How to sort a dataframe by column(s) in R (*) DF[with(DF, order(-z, b)), ] - vs DT[ order(-z, b) ] quarterlyreport[with(lastquarterlyreport,order(z,b)),] Silent incorrect results due to using a similar variable by mistake. Easily done when this appears on a page of code. - vs quarterlyreport[ order(-z, b) ] (*) Click link for more information 41 but ... Yes order() is slow when used in i because that's base R's order(). That's where ”optimization before evaluation” comes in. We now auto convert order() to the internal forder() so you don't have to know. Available in v1.9.3 on GitHub, soon on CRAN 42 split-apply-combine Why ”split” 10GB into many small groups??? Since 2010, data.table : Allocates memory for largest group Reuses that same memory for all groups Allocates result data.table up front Implemented in C eval() of j within each group 43 Recent innovations Instead of the eval(j) from C, dplyr converts to an Rcpp function and calls that from C. Skipping the R eval step. In response, data.table now has GForce: one function call that computes the aggregate across groups. Called once only so no need to speed up many calls! Both approaches limited to simple aggregates: sum, mean, sd, etc. But often that's all that's needed. 44 data.table over-allocates 45 Assigning to a subset 46 continued Easy to write, easy to read 47 Multiple := DT[, `:=`(newCol1=mean(colA), newCol2=sd(colA)), by=sector] Can combine with a subset in i as well `:=` is functional form and standard R e.g. `<-` and `=`(x,2) 48 set* functions set() setattr() setnames() setcolorder() setkey() setkeyv() setDT() setorder() 49 copy() data.table IS copied-on-change by <- and = as usual in R. Those ops aren't changed. No copy by := or set* You have to use those, so it's clear to readers of your code When you need a copy, call copy(DT) Why copy a 20GB data.table, even once. Why copy a whole column, even once. 50 list columns Each cell can be a different type Each cell can be vector Each cell can itself be a data.table Combine list columns with i and by 51 list column example data.table( x = letters[1:3], y = list( 1:10, letters[1:4], data.table(a=1:3,b=4:6) )) x y 1: a 1,2,3,4,5,6, 2: b a,b,c,d 3: c <data.table> 52 All options datatable.verbose FALSE datatable.nomatch NA_integer_ datatable.optimize Inf datatable.print.nrows datatable.print.topn datatable.allow.cartesian datatable.alloccol datatable.integer64 100L 5L FALSE quote(max(100L,ncol(DT)+64L)) ” integer64” 53 All symbols .N .SD .I .BY .GRP 54 .SD stocks[, head(.SD,2), by=sector] stocks[, lapply(.SD, sum), by=sector] stocks[, lapply(.SD, sum), by=sector, .SDcols=c("mcap",paste0(revenueFQ",1:8))] 55 .I if (length(err <- allocation[, if(length(unique(Price))>1) .I, by=stock ]$V1 )) { warning("Fills allocated to different accounts at different prices! Investigate.") print(allocation[err]) } else { cat("Ok All fills allocated to each account at same price\n") } 56 Analogous to SQL DT[ where, select | update, group by ] [ having ] [ order by ] [ i, j, by ] ... [ i, j, by ] i.e. chaining 57 New in v1.9.2 on CRAN 37 new features and 43 bug fixes set() can now add columns just like := .SDcols “de-select” columns by name or position; e.g., DT[,lapply(.SD,mean),by=colA,.SDcols=-c(3,4)] fread() a subset of columns fread() commands; e.g., fread("grep blah file.txt") Speed gains 58 Radix sort for integer R's method=”radix” is not actually a radix sort … it's a counting sort. See ?setkey/Notes. data.table liked and used it, though. A true radix sort caters for range > 100,000 ( Negatives was a one line change to R we suggested and was accepted in R 3.1 ) Adapted to integer from Terdiman and Herf's code for float … 59 Radix sort for numeric R reminder: numeric == floating point numbers Radix Sort Revisited, Pierre Terdiman, 2000 http://codercorner.com/RadixSortRevisited.htm Radix Tricks, Michael Herf, 2001 http://stereopsis.com/radix.html Their C code now in data.table with minor changes; e.g., NA/NaN and 6-pass for double 60 Radix sort for numeric R reminder: numeric == floating point numbers Radix Sort Revisited, Pierre Terdiman, 2000 http://codercorner.com/RadixSortRevisited.htm Radix Tricks, Michael Herf, 2001 http://stereopsis.com/radix.html Their C code now in data.table with minor changes; e.g., NA/NaN and 6-pass for double 61 Faster for those cases 20 million rows x 4 columns, 539MB a & b (numeric), c (integer), d (character) v1.8.10 v1.9.2 setkey(DT, a) 54.9s 5.3s setkey(DT, c) 48.0s 3.9s 102.3s 6.9s setkey(DT, a, b) ”Cold” grouping (no setkey first) : DT[, mean(b), by=c] 47.0s 3.4s https://gist.github.com/arunsrinivasan/451056660118628befff 62 New feature: melt i.e. reshape2 for data.table 20 million rows x 6 columns (a:f) 768MB melt(DF, id=”d”, measure=1:2) 4.1s (*) melt(DT, id=“d”, measure=1:2) 1.7s (*) including Kevin Ushey's C code in reshape2, was 190s melt(DF, …, na.rm=TRUE) 39.5s melt(DT, …, na.rm=TRUE) 2.7s https://gist.github.com/arunsrinivasan/451056660118628befff 63 New feature: dcast i.e. reshape2 for data.table 20 million rows x 6 columns (a:f) 768MB dcast(DF, d~e, ..., fun=sum) 76.7 sec dcast(DT, d~e, …, fun=sum) 7.5 sec reshape2::dcast hasn't been Kevin'd yet https://gist.github.com/arunsrinivasan/451056660118628befff 64 … melt/dcast continued Q: Why not submit a pull request to reshape2 ? A: This C implementation calls data.table internals at C-level (e.g. fastorder, grouping, and joins). It makes sense for this code to be together. 65 Miscellaneous 1 DT[, (myvar):=NULL] Spaces and specials; e.g., by="a, b, c" DT[4:7,newCol:=8][] extra [] to print at prompt auto fills rows 1:3 with NA rbindlist(lapply(fileNames, fread)) rbindlist has fill and use.names 66 Miscellaneous 2 Dates and times Errors & warnings are deliberately very long Not joins X[!Y] Column plonk & non-coercion on assign by-without-by => by=.EACHI Secondary keys / merge R3, singleton logicals, reference counting bit64::integer64 67 Miscellaneous 3 Print method vs typing DF, copy fixed in R-devel How to benchmark mult = ”all” | ”first” | ”last” (may expand) with=FALSE which=TRUE CJ() and SJ() Chained queries: DT[...][...][...] Dynamic and flexible queries (eval text and quote) 68 Miscellaneous 4 fread drop and select (by name or number) fread colClasses can be ranges of columns fread sep2 Vector search vs binary search One column == is ok, but not 2+ due to temporary logicals (e.g. slide 5 earlier) 69 Not (that) much to learn Main manual page: ?data.table Run example(data.table) at the prompt (53 examples) No methods, no functions, just use what you're used to in R 70 Thank you https://github.com/Rdatatable/datatable/ http://stackoverflow.com/questions/tagged/data.table > install.packages(”data.table”) > require(data.table) > ?data.table > ?fread Learn by example : > example(data.table) 71