Why is R slow? How to run R programs faster?
Tomas Kalibera


My Background

● Virtual machines, runtimes for programming languages
  – Real-time Java
  – Automatic memory management
● Evaluating software performance
  – Benchmarks
  – Using statistical methods
● R user
● Currently working on: FastR


FastR

A new, experimental virtual machine for (a subset of) the R language.
Discovering optimizations that can speed up R.

Core team: Jan Vitek, Tomas Kalibera, Petr Maj
Wider team: Floreal Morandat


Community: Dynamic Languages for Scalable Data Analytics

Use one dynamic, high-level language for data analytics tasks, running on
platforms from a tablet to the cloud (R, Matlab, Python, Julia).
Large software companies are interested in R.
NSF-funded workshop at SPLASH 2013: Software Infrastructure for Sustained
Innovation.


Virtual Machines, R & FastR

Source code:

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "tm n\n");
            return 1;
        }
        int n = atoi(argv[1]);
        printf("n = %d\n", n);
        ...
    }

Parsing turns the source code into a parse tree (AST).
[Diagram: parse tree of main — declaration, if, !=, calls, return]

Interpreter: the parse tree is executed directly by the interpreter.

    class IfNode {
        Node condition, trueBranch, falseBranch;

        Result execute() {
            if (condition.execute() == TRUE) {
                trueBranch.execute();
            } else {
                falseBranch.execute();
            }
            return NULL;
        }
    }

GNU R works like this. Easy to develop and maintain.
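GNU R itself exposes this parse-tree view at the language level: an R expression can be captured unevaluated as a tree of calls, inspected, and then handed to the interpreter. A minimal sketch:

```r
# Capture an expression unevaluated -- R represents it as a tree of calls
e <- quote(abs(3 - 5) + 1)

class(e)       # a "call" object, the root of the tree
as.list(e)     # children: the function `+`, the subtree abs(3 - 5), and 1

# The interpreter walks this tree when we ask for evaluation
eval(e)        # 3
```

This is exactly the structure the GNU R interpreter executes directly, node by node.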
Compiler

A compiler translates the parse tree into machine code (compilation,
linking):

    0000000000400580 <main>:
      400580: push   %r12
      400582: cmp    $0x2,%edi
      400585: push   %rbp
      400586: push   %rbx
      400587: je     4005ae <main+0x2e>
      400589: mov    0x200ac8(%rip),%rcx
      400590: mov    $0x5,%edx
      400595: mov    $0x1,%esi
      40059a: mov    $0x400804,%edi
      40059f: callq  400570 <fwrite@plt>
      4005a4: mov    $0x1,%eax
      4005a9: pop    %rbx
      4005aa: pop    %rbp
      4005ab: pop    %r12
      4005ad: retq

Ahead-of-time: C/C++/Fortran. Just-in-time: Java/C#. Fast.


FastR

● Self-optimizing AST interpreter
  – Aims to still be easy to develop and maintain
  – But fast
● The AST (tree) rewrites itself as the program executes
  – Speculative rewrites, recovery
● Runs on a JVM
  – High-performance garbage collector
  – Just-in-time compilation improves speed


Understanding why GNU-R is slow. Speeding up R programs.


Toeplitz Matrix

In AT&T R Benchmarks 2.5 (Simon Urbanek). Initializing a square matrix:

    1 2 3 4 5
    2 1 2 3 4
    3 2 1 2 3
    4 3 2 1 2
    5 4 3 2 1

    a[i,j] = |i - j| + 1


TM using For Loop (as included in AT&T R Benchmarks 2.5)

    tmFor <- function(n) {
        b <- matrix(nrow = n, ncol = n)
        for (j in 1:n) {
            for (k in 1:n) {
                b[k,j] <- abs(j - k) + 1
            }
        }
        b
    }

    N = 500     650 ms
    N = 1000   2610 ms
    N = 1500   5910 ms

This is very slow!


TM in C

    int *b = (int *)malloc(n * n * sizeof(int));
    for (j = 1; j <= n; j++) {
        for (k = 1; k <= n; k++) {
            b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
        }
    }

                In R       In C
    N = 500     650 ms     0.2 ms
    N = 1000   2610 ms     0.9 ms
    N = 1500   5910 ms     2.1 ms

The R slowdown is hundreds of fold.
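The timings above can be reproduced with system.time; a minimal sketch (absolute numbers will of course differ per machine and R version):

```r
tmFor <- function(n) {
    b <- matrix(nrow = n, ncol = n)
    for (j in 1:n) {
        for (k in 1:n) {
            b[k, j] <- abs(j - k) + 1
        }
    }
    b
}

# Elapsed time for a single run; for stable numbers,
# repeat several times and take the median
system.time(tmFor(1000))
```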
Understanding why GNU-R is slow: Toeplitz Matrix


TM: Checking with a profiler

    > Rprof()
    > dummy <- tmFor(5000)
    > Rprof(NULL)
    > summaryRprof()
    $by.self
             self.time self.pct total.time total.pct
    "tmFor"      51.42    86.36      59.54    100.00
    "abs"         2.80     4.70       2.80      4.70
    "-"           2.76     4.64       2.76      4.64
    "+"           2.42     4.06       2.42      4.06
    "matrix"      0.12     0.20       0.12      0.20
    ":"           0.02     0.03       0.02      0.03

    $by.total
             total.time total.pct self.time self.pct
    "tmFor"       59.54    100.00     51.42    86.36
    "abs"          2.80      4.70      2.80     4.70
    "-"            2.76      4.64      2.76     4.64
    "+"            2.42      4.06      2.42     4.06
    "matrix"       0.12      0.20      0.12     0.20
    ":"            0.02      0.03      0.02     0.03


TM: R profiler does not help

The whole body of tmFor is the performance-critical part; the R profiler
cannot point any deeper than that.

    tmFor <- function(n) {
        b <- matrix(nrow = n, ncol = n)
        for (j in 1:n) {                  # performance-
            for (k in 1:n) {              # critical
                b[k,j] <- abs(j - k) + 1  # part
            }
        }
        b
    }


TM: Checking with a system profiler

    env CFLAGS=-g ./configure --with-blas --with-lapack \
        --enable-R-static-lib --disable-BLAS-shlib
    make
    perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
    perf report -g

    # runtm.r
    source("tm.r")
    dummy <- tmFor(5000)

    + 9.91% R [.] Rf_eval
    + 9.53% R [.] Rf_cons
    - 6.67% R [.] Rf_findVarInFrame3
        + 29.17% Rf_findVar
        +  7.84% EnsureLocal
        +  2.21% Rf_eval
      1.08% R [.] real_binary
      0.75% R [.] integer_binary
      0.74% R [.] do_abs
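When a system profiler is not at hand, one crude way to split the loop overhead from the arithmetic is to time stripped-down variants of the loop body. This is a rough sketch, not a substitute for perf; the split it gives is approximate:

```r
n <- 1000
b <- matrix(nrow = n, ncol = n)

# Full loop body: arithmetic plus the matrix replacement call b[k,j] <- ...
system.time(for (j in 1:n) for (k in 1:n) b[k, j] <- abs(j - k) + 1)

# Arithmetic only: drop the assignment to isolate the replacement-call cost
system.time(for (j in 1:n) for (k in 1:n) abs(j - k) + 1)
```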
Variable look-up

R built-in functions can be changed.

    for (j in 1:n) {
        for (k in 1:n) {
            b[k,j] <- abs(j - k) + 1
        }
    }

abs is a built-in function, and abs can be changed at any time:

    > abs <- function(x) { x * x }
    > abs(-10)
    [1] 100
    > for (i in 11:13) { if (i == 12) { abs <- sqrt } ; print(abs(i)) }
    [1] 11
    [1] 3.464102
    [1] 3.605551

Because of this, each use of abs in tmFor has to be looked up through the
environment chain: from the local frame of tmFor (b, j, k, n) through
GlobalEnv (tmFor) to BaseNamespaceEnv, where .Primitive("abs") is found.
[Diagram: environment chain tmFor frame -> GlobalEnv -> BaseNamespaceEnv]

+, -, (, [, {, <-, for, : are all built-in functions, and can be changed
too:

    > `:` <- sum
    > 1:10
    [1] 11
    > `<-` <- function(x, val) { eval.parent(
          assign(deparse(substitute(x)), 100)) }
    > z <- 10
    [1] 100


Variable look-up

Variables can be deleted, so variable look-up is always needed:

    > x <- 10
    > rm(x)
    > x
    Error: object 'x' not found

Even the loop control variable can be deleted:

    > for (i in 1:3) { if (i == 2) { rm(i) } else print(i) }
    [1] 1
    [1] 3
    > for (i in 1:3) { if (i == 2) { rm(i) } ; print(i) }
    [1] 1
    Error in print(i) : object 'i' not found


TM: Checking with a system profiler

    + 9.91% R [.] Rf_eval
    - 9.53% R [.] Rf_cons
        + 29.87% Rf_allocList
        + 24.96% Rf_evalList
        + 14.35% Rf_evalListKeepMissing
        +  6.04% Rf_lcons
        +  5.90% Rf_DispatchOrEval
        +  5.29% Rf_list2
        +  3.85% evalseq
        +  3.26% Rf_defineVar
        +  3.04% Rf_list1
        +  1.18% Rf_eval
        +  0.75% replaceCall
        +  0.52% evalArgs
    + 6.67% R [.] Rf_findVarInFrame3
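Since every use of abs must be looked up through the environment chain, binding it to a local variable before the loop shortens that chain to the local frame. This is a micro-optimization sketch (tmForCached is a hypothetical name); whether it helps measurably depends on the R version:

```r
tmForCached <- function(n) {
    myabs <- abs                 # one look-up, result cached in the local frame
    b <- matrix(nrow = n, ncol = n)
    for (j in 1:n) {
        for (k in 1:n) {
            b[k, j] <- myabs(j - k) + 1   # short look-up: local frame only
        }
    }
    b
}
```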
Linked-list allocation and use

Arguments are passed as linked lists, allocated on every call:

    for (j in 1:n) {
        for (k in 1:n) {
            b[k,j] <- abs(j - k) + 1
        }
    }

The assignment b[k,j] <- Y is converted to a general replacement call of
the form F(X) <- Y, which is then transformed:

    F(X) <- Y    becomes    TMP <- X
                            X <- `F<-`(TMP, value = Y)

    b[k,j] <- Y  becomes    TMP <- b
                            b <- `[<-`(TMP, k, j, value = Y)

The replacement call is expensive: the linked list of arguments
(TMP, k, j, value = Y) is allocated in each iteration.
[Diagram: argument pairlist built for the `[<-` call in every iteration]


Speeding up R programs: Toeplitz Matrix


R Byte-code compiler

    env R_ENABLE_JIT=3 R

    > require(compiler)
    Loading required package: compiler
    > help(cmpfun)

                 AST       Bytecode
    N = 500     650 ms     130 ms
    N = 1000   2610 ms     530 ms
    N = 1500   5910 ms    1150 ms

Always use the byte-code compiler!


TM: Sapply

    tmSapply <- function(n) {
        sapply(1:n, function(j) {
            sapply(1:n, function(k) {
                abs(j - k) + 1
            })
        })
    }

                 For       Sapply
    N = 500     130 ms     320 ms
    N = 1000    530 ms    1300 ms
    N = 1500   1150 ms    2960 ms

Using sapply instead of for sometimes helps. Not here...


TM: Rows Algo

    tmRows <- function(n) {
        b <- matrix(nrow = n, ncol = n)
        b[1,] <- 1:n
        if (n >= 2) {
            for (r in 2:n) {
                b[r,] <- c(r, b[r-1,-n])
            }
        }
        b
    }

                 For       Rows
    N = 500     130 ms     13 ms
    N = 1000    530 ms     59 ms
    N = 1500   1150 ms    169 ms

Much faster. Reduced calls, lookups.
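There are two ways to get the byte-code compiler shown above from within R: compile a single function with cmpfun, or turn on JIT compilation for all functions with enableJIT (the counterpart of the R_ENABLE_JIT environment variable). A sketch:

```r
library(compiler)

tmFor <- function(n) {
    b <- matrix(nrow = n, ncol = n)
    for (j in 1:n)
        for (k in 1:n)
            b[k, j] <- abs(j - k) + 1
    b
}

tmForBC <- cmpfun(tmFor)   # byte-compile just this function
enableJIT(3)               # or: byte-compile functions as they are called
```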
TM: Cols Algo

    tmCols <- function(n) {
        b <- matrix(nrow = n, ncol = n)
        b[,1] <- 1:n
        if (n >= 2) {
            for (col in 2:n) {
                b[,col] <- c(col, b[-n, col-1])
            }
        }
        b
    }


TM: Cols2 Algo

    tmByCols <- function(n) {
        if (n >= 2) {
            sapply(1:n, function(col) {
                if (col < n) {
                    c(col:1, 2:(n-col+1))
                } else {
                    n:1
                }
            })
        } else {
            1
        }
    }

                Rows      Cols2
    N = 500     13 ms      5 ms
    N = 1000    59 ms     39 ms
    N = 1500   169 ms     58 ms

Much faster. Reduced calls, lookups.


TM: Outer Algo

    tmOuter <- function(n) {
        outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
            abs(j - k) + 1
        })
    }

                Cols2     Outer         C
    N = 500      5 ms      2 ms     0.2 ms
    N = 1000    39 ms     27 ms     0.9 ms
    N = 1500    58 ms     47 ms     2.1 ms

Yet faster. Vectorized. Also easy to read.


TM: Summary

                 For      Outer         C    For-FastR
    N = 500     130 ms     2 ms     0.2 ms      13 ms
    N = 1000    530 ms    27 ms     0.9 ms      47 ms
    N = 1500   1150 ms    47 ms     2.1 ms     101 ms


Summary

● Use the byte-code compiler
● Vectorize
● Use built-ins (sum, prod, cumsum, outer)
● Use the simplest data structure possible
  – Matrix instead of data.frame
  – Avoid data.frame indexing
● Save and re-use intermediate results

Please consider donating your code/data in the form of benchmarks.
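When replacing one algorithm by a faster one, it is worth checking on a small input that the variants produce the same matrix before comparing their timings. A sketch, assuming tmFor and tmOuter as defined above:

```r
tmFor <- function(n) {
    b <- matrix(nrow = n, ncol = n)
    for (j in 1:n)
        for (k in 1:n)
            b[k, j] <- abs(j - k) + 1
    b
}

tmOuter <- function(n) {
    outer(X = 1:n, Y = 1:n, FUN = function(j, k) abs(j - k) + 1)
}

# Both should produce the same 5x5 Toeplitz matrix
stopifnot(all.equal(tmFor(5), tmOuter(5)))
```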