Using large datasets in R with FF Micha

advertisement
Using large
datasets in R
with FF
MichaƂ Piotrowski
EPCC, The University of Edinburgh
mpiotrow@epcc.ed.ac.uk
FF - Acknowledgments
ff: memory-efficient storage of large data on
disk and fast access functions
Authors: Oehlschlägel, Adler, Nenadic, Zucchini
http://ff.r-forge.r-project.org/
January 2010
2
Talk Overview
•
•
•
•
January 2010
Motivation
How does FF work?
FF functionality
FF objects and snowfall
3
Motivation
• Memory limitations in R:
– Big objects, larger than available RAM.
– Too many datasets > RAM
– Too many copies of the same data
– Sharing data between parallel R processes on a multi-core
machine (snowfall)
January 2010
4
How does FF work?
Comparison of ff and native R objects in memory
* based on the poster by ff author D. Adler
January 2010
5
FF demo
Creating ff objects:
Operating on ff objects:
January 2010
6
FF demo
The ffapply functions support convenient batched processing
of ff objects such that each single batch or chunk will not
exhaust RAM
Finalising options:
close releases resources and closes file
delete releases resources and deletes file
deleteIfOpen releases resources and deletes file only if file was open
January 2010
7
R.ff
Package 'R.ff' is an experiment to provide as much as possible
of R's basic functionality to work with ff:
! != %% %*% %/% & | * + -/ < <= == > >= ^ abs acos acosh asin asinh
atan atanh bessel Ibessel Jbessel Kbessel Ybeta ceiling choose
colMeans colSums cos cosh crossprod cummax cummin cumprod cumsum
dbeta dbinom dcauchy dchisq dexpdf dgamma dgeom dhyper digamma dlnorm
dlogis dnbinom dnorm dpois dsignrank dt dunif dweibull dwilcox exp
expm1 factorial fivenum floor gamma gammaCody IQR is.na is.nanjitter
lbeta lchoose lfactorial lgammalog log10 log1p log2 logbmad order
pbeta pbinom pcauchy pchisq pexp pf pgamma pgeom phyper plnorm plogis
pnbinom pnorm ppois psigamma psignrank pt punif pweibull pwilcox
qbeta qbinom qcauchy qchisq qexp qf qgamma qgeom qhyper qlnorm qlogis
qnbinom qnorm qpois qsignrank qt quantile qunif qweibull qwilcoxrange
range rbeta rbinom rcauchy rchisq rexprf rgamma rgeom rhyper rlnorm
rlogis rnbinom rnorm round rowMeans rowSums rpois rsignrank rt runif
rweibullr wilcox sample sd sign signif sin sinh sort sqrt summary t
tabulate tan tanh trigamma truncvar
January 2010
8
Snow & Snowfall
• Snow (http://cran.rproject.org/web/packages/snow/index.html)
– Provides a high-level interface for using a workstation cluster or
multicore CPU for parallel computations in R.
– Master / Slave model of communication
• Snowfall (http://cran.r-project.org/web/packages/snowfall/index.html)
– Usability wrapper around snow.
– Easier development of parallel R programs.
January 2010
9
FF objects and snowfall
January 2010
10
FF objects and snowfall
January 2010
11
Conclusions
• Implementing fast memory mapped access to flat files.
• ff objects can be created, stored, used and removed,
almost like standard R RAM objects
• Allows to process datasets that do not fit into CPU physical
memory
• ff objects are perfect to read the same data from many R
processes
• FF objects are currently supported in the SPRINT package.
January 2010
12
Download