Why is R slow? How to run R programs faster?
Tomas Kalibera
My Background
● Virtual machines, runtimes for programming languages
  – Real-time Java
  – Automatic memory management
● Evaluating software performance
  – Benchmarks
  – Using statistical methods
● R user

Currently working on: FastR
A new, experimental virtual machine for (a subset of) the R language, discovering optimizations that can speed up R.
Core team
Jan Vitek
Tomas Kalibera
Petr Maj
Wider team
Floreal Morandat
Community: Dynamic Languages for Scalable Data Analytics

Use one dynamic, high-level language for data analytics tasks running on platforms from a tablet to the cloud.

R, Matlab, Python, Julia
Large software companies interested in R
NSF-funded workshop at SPLASH 2013
Software Infrastructure for Sustained Innovation
Virtual Machines, R & FastR
Source code
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "tm n\n");
        return 1;
    }
    int n = atoi(argv[1]);
    printf("n = %d\n", n);
Parse tree
[Figure: parsing produces a tree — main at the root, with decl, if, != (argc, 2), call, call, and ret nodes]
(AST) Interpreter
[Figure: the same parse tree, executed directly by the interpreter]
class If {
    Node Condition, TrueBranch, FalseBranch;
    Result execute() {
        if (Condition.execute() == TRUE) {
            TrueBranch.execute();
        } else {
            FalseBranch.execute();
        }
        return NULL;
    }
}

GNU R works like this.
Easy to develop and maintain.
Compiler
[Figure: the parse tree is compiled and linked into machine code]
Machine code

0000000000400580 <main>:
  400580: push   %r12
  400582: cmp    $0x2,%edi
  400585: push   %rbp
  400586: push   %rbx
  400587: je     4005ae <main+0x2e>
  400589: mov    0x200ac8(%rip),%rcx
  400590: mov    $0x5,%edx
  400595: mov    $0x1,%esi
  40059a: mov    $0x400804,%edi
  40059f: callq  400570 <fwrite@plt>
  4005a4: mov    $0x1,%eax
  4005a9: pop    %rbx
  4005aa: pop    %rbp
  4005ab: pop    %r12
  4005ad: retq
Ahead of time: C/C++/Fortran
Just-in-time:
Java/C#
Fast.
FastR
● Self-optimizing AST interpreter
  – Aims to still be easy to develop and maintain
  – But fast
● The AST (tree) rewrites as the program executes
  – Speculative rewrites, recovery
● Runs on a JVM
  – High-performance garbage collector
  – Just-in-time compilation improves speed
Understanding why GNU-R is slow
Speeding-up R programs
Toeplitz Matrix
In AT&T R Benchmarks 2.5 (Simon Urbanek)
Initializing a square matrix
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1

a(i,j) = |i − j| + 1
TM using For Loop
(as included in AT&T R Benchmarks 2.5)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
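Run at the prompt, the function indeed builds the Toeplitz matrix; a small sketch for n = 3:

```r
# Toeplitz matrix via nested for loops, as on the slide above.
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

tmFor(3)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    2    1    2
# [3,]    3    2    1
```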
Timings for tmFor:

N = 500   650 ms
N = 1000  2610 ms
N = 1500  5910 ms
This is very slow!
TM in C

int *b = (int *) malloc(n * n * sizeof(int));
int j, k;
for (j = 1; j <= n; j++) {
    for (k = 1; k <= n; k++) {
        b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
    }
}
          In R      In C
N = 500   650 ms    0.2 ms
N = 1000  2610 ms   0.9 ms
N = 1500  5910 ms   2.1 ms

The R version is thousands of times slower.
Toeplitz Matrix
Understanding why GNU-R is slow
TM: Checking with a profiler
> Rprof()
> dummy <- tmFor(5000)
> Rprof(NULL)
> summaryRprof()
$by.self
         self.time self.pct total.time total.pct
"tmFor"      51.42    86.36      59.54    100.00
"abs"         2.80     4.70       2.80      4.70
"-"           2.76     4.64       2.76      4.64
"+"           2.42     4.06       2.42      4.06
"matrix"      0.12     0.20       0.12      0.20
":"           0.02     0.03       0.02      0.03

$by.total
         total.time total.pct self.time self.pct
"tmFor"       59.54    100.00     51.42     86.36
"abs"          2.80      4.70      2.80      4.70
"-"            2.76      4.64      2.76      4.64
"+"            2.42      4.06      2.42      4.06
"matrix"       0.12      0.20      0.12      0.20
":"            0.02      0.03      0.02      0.03
TM: R profiler does not help

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1   # performance-critical part
    }
  }
  b
}
TM: Checking with a system profiler
env CFLAGS=-g ./configure --with-blas --with-lapack \
    --enable-R-static-lib --disable-BLAS-shlib
make

perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
perf report -g

runtm.r:
source("tm.r")
dummy <- tmFor(5000)
+  9.91%  R  R  [.] Rf_eval
+  9.53%  R  R  [.] Rf_cons
-  6.67%  R  R  [.] Rf_findVarInFrame3
   - Rf_findVarInFrame3
      + 29.17% Rf_findVar
      +  7.84% EnsureLocal
      +  2.21% Rf_eval
+  1.08%  R  R  [.] real_binary
+  0.75%  R  R  [.] integer_binary
+  0.74%  R  R  [.] do_abs
TM: Checking with a system profiler

+  9.91%  R  R  [.] Rf_eval
+  9.53%  R  R  [.] Rf_cons
-  6.67%  R  R  [.] Rf_findVarInFrame3      ← variable look-up
   - Rf_findVarInFrame3
      + 29.17% Rf_findVar
      +  7.84% EnsureLocal
      +  2.21% Rf_eval

Variable look-up
R built-in functions can be changed
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}

abs is a built-in function.
abs can be changed at any time:

> abs <- function(x) { x * x }
> abs(-10)
[1] 100

> for(i in 11:13) { if (i==12) { abs <- sqrt } ; print(abs(i)) }
[1] 11
[1] 3.464102
[1] 3.605551
Variable look-up
R built-in functions can be changed

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}

[Figure: the look-up of abs walks the environment chain — the function's frame (n, b, j, k), then GlobalEnv (tmFor), then BaseNamespaceEnv, where abs resolves to .Primitive("abs")]
Variable look-up
R built-in functions can be changed

for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}

abs is a built-in function; + - ( [ { <- for : are all built-in functions, too:

> `:` <- sum
> 1:10
[1] 11

> `<-` <- function(x,val) { eval.parent( assign(deparse(substitute(x)), 100)) }
> z <- 10
[1] 100
Variable look-up
Variables can be deleted, so variable look-up is needed

for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}

> x <- 10
> rm(x)
> x
Error: object 'x' not found

The loop control variable can be deleted:

> for(i in 1:3) { if (i==2) { rm(i) } else print(i) }
[1] 1
[1] 3
> for(i in 1:3) { if (i==2) { rm(i) } ; print(i) }
[1] 1
Error in print(i) : object 'i' not found
TM: Checking with a system profiler

+  9.91%  R  R  [.] Rf_eval
-  9.53%  R  R  [.] Rf_cons                 ← linked-list allocation and use
   - Rf_cons
      + 29.87% Rf_allocList
      + 24.96% Rf_evalList
      + 14.35% Rf_evalListKeepMissing
      +  6.04% Rf_lcons
      +  5.90% Rf_DispatchOrEval
      +  5.29% Rf_list2
      +  3.85% evalseq
      +  3.26% Rf_defineVar
      +  3.04% Rf_list1
      +  1.18% Rf_eval
      +  0.75% replaceCall
      +  0.52% evalArgs
+  6.67%  R  R  [.] Rf_findVarInFrame3
Linked-list allocation and use
Arguments are passed as linked lists

for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}

The assignment is converted to a general replacement call of the form F(X) <- Y. The replacement call is then transformed:

F(X) <- Y
  TMP <- X
  X <- "F<-"( TMP, value = Y )

b[k,j] <- Y
  TMP <- b
  b <- "[<-"( TMP, k, j, value = Y )
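The equivalence can be checked at the prompt; a minimal sketch calling the replacement function `[<-` explicitly:

```r
# b[k, j] <- Y is sugar for a call to the replacement function `[<-`,
# whose result is re-bound to b.
b1 <- matrix(0, nrow = 2, ncol = 2)
b2 <- b1

b1[2, 1] <- 7                       # the usual replacement syntax
b2 <- `[<-`(b2, 2, 1, value = 7)    # what the evaluator actually calls

identical(b1, b2)   # TRUE
```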
Linked-list allocation and use
The replacement call is expensive:

b[k,j] <- Y
  TMP <- b
  b <- "[<-"( TMP, k, j, value = Y )

[Figure: the argument linked list ("[<-", TMP, k, j, value = Y) is allocated anew in each iteration]
Toeplitz Matrix
Speeding-up R programs
R Byte-code compiler

env R_ENABLE_JIT=3 R

> require(compiler)
Loading required package: compiler
> help(cmpfun)

AST → Bytecode

          For (AST)   For (byte-code)
N = 500   650 ms      130 ms
N = 1000  2610 ms     530 ms
N = 1500  5910 ms     1150 ms

Always use the byte-code compiler!
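Besides the R_ENABLE_JIT environment variable, a function can be compiled explicitly with cmpfun from the compiler package; a sketch (timings are machine-dependent):

```r
library(compiler)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

tmForC <- cmpfun(tmFor)   # byte-code compiled version

# Same result, less AST-interpreter overhead:
stopifnot(identical(tmFor(100), tmForC(100)))
system.time(tmFor(500))
system.time(tmForC(500))
```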
TM: Sapply

tmSapply <- function(n) {
  sapply(1:n, function(j) {
    sapply(1:n, function(k) {
      abs(j - k) + 1
    })
  })
}
          For       Sapply
N = 500   130 ms    320 ms
N = 1000  530 ms    1300 ms
N = 1500  1150 ms   2960 ms
Using sapply instead of a for loop sometimes helps, but not here.
TM: Rows Algo

tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1,] <- 1:n
  if (n >= 2) {
    for(r in 2:n) {
      b[r,] <- c(r, b[r-1,-n])
    }
  }
  b
}

1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
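A quick sanity check (a sketch) that the row-shifting trick really produces a(i,j) = |i − j| + 1:

```r
tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1, ] <- 1:n
  if (n >= 2) {
    for (r in 2:n) {
      # each new row is r followed by the previous row with its last element dropped
      b[r, ] <- c(r, b[r - 1, -n])
    }
  }
  b
}

ref <- outer(1:5, 1:5, function(i, j) abs(i - j) + 1)
all(tmRows(5) == ref)   # TRUE
```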
          For       Rows
N = 500   130 ms    13 ms
N = 1000  530 ms    59 ms
N = 1500  1150 ms   169 ms

Much faster: far fewer calls and variable look-ups.
TM: Cols Algo

tmCols <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[,1] <- 1:n
  if (n >= 2) {
    for(col in 2:n) {
      b[,col] <- c(col, b[-n, col-1])
    }
  }
  b
}

1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
TM: Cols2 Algo

tmByCols <- function(n) {
  if (n >= 2) {
    sapply(1:n, function(col) {
      if (col < n) {
        c( col:1, 2:(n-col+1) )
      } else {
        n:1
      }
    })
  } else {
    1
  }
}

1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
          Rows     Cols2
N = 500   13 ms    5 ms
N = 1000  59 ms    39 ms
N = 1500  169 ms   58 ms

Much faster: reduced calls and look-ups.
TM: Outer Algo

tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
    abs(j - k) + 1
  })
}

1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
          Cols2    Outer    C
N = 500   5 ms     2 ms     0.2 ms
N = 1000  39 ms    27 ms    0.9 ms
N = 1500  58 ms    47 ms    2.1 ms

Yet faster: vectorized. Also easy to read.
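The outer version is easy to verify directly; a sketch:

```r
tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j, k) {
    abs(j - k) + 1
  })
}

m <- tmOuter(4)
all(diag(m) == 1)   # TRUE: 1s on the diagonal
all(m == t(m))      # TRUE: symmetric
m[1, 4]             # 4, i.e. |1 - 4| + 1
```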
TM: Summary

          For       Outer    C        For-FastR
N = 500   130 ms    2 ms     0.2 ms   13 ms
N = 1000  530 ms    27 ms    0.9 ms   47 ms
N = 1500  1150 ms   47 ms    2.1 ms   101 ms
Summary
● Use the byte-code compiler
● Vectorize
● Use built-ins (sum, prod, cumsum, outer)
● Use the simplest data structure possible
  – Matrix instead of data.frame
  – Avoid data.frame indexing
● Save and re-use intermediate results
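As a minimal illustration of the "vectorize / use built-ins" advice above (timings are machine-dependent):

```r
n <- 1e6
x <- runif(n)

# Interpreted scalar loop: one R-level iteration per element.
s1 <- 0
for (v in x) s1 <- s1 + v

# Built-in reduction: the whole sum runs in C.
s2 <- sum(x)

all.equal(s1, s2)   # TRUE (up to floating-point rounding)
```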
Please consider donating your code/data in the form of benchmarks.