Using large datasets in R with FF MichaĆ Piotrowski EPCC, The University of Edinburgh mpiotrow@epcc.ed.ac.uk FF - Acknowledgments ff: memory-efficient storage of large data on disk and fast access functions Authors: Oehlschlägel, Adler, Nenadic, Zucchini http://ff.r-forge.r-project.org/ January 2010 2 Talk Overview • • • • January 2010 Motivation How does FF work? FF functionality FF objects and snowfall 3 Motivation • Memory limitations in R: – Big objects, larger than available RAM. – Too many datasets > RAM – Too many copies of the same data – Sharing data between parallel R processes on a multi-core machine (snowfall) January 2010 4 How does FF work? Comparison of ff and native R objects in memory * based on the poster by ff author D. Adler January 2010 5 FF demo Creating ff objects: Operating on ff objects: January 2010 6 FF demo The ffapply functions support convenient batched processing of ff objects such that each single batch or chunk will not exhaust RAM Finalising options: close releases resources and closes file delete releases resources and deletes file deleteIfOpen releases resources and deletes file only if file was open January 2010 7 R.ff Package 'R.ff' is an experiment to provide as much as possible of R's basic functionality to work with ff: ! != %% %*% %/% & | * + -/ < <= == > >= ^ abs acos acosh asin asinh atan atanh bessel Ibessel Jbessel Kbessel Ybeta ceiling choose colMeans colSums cos cosh crossprod cummax cummin cumprod cumsum dbeta dbinom dcauchy dchisq dexpdf dgamma dgeom dhyper digamma dlnorm dlogis dnbinom dnorm dpois dsignrank dt dunif dweibull dwilcox exp expm1 factorial fivenum floor gamma gammaCody IQR is.na is.nanjitter lbeta lchoose lfactorial lgammalog log10 log1p log2 logbmad order pbeta pbinom pcauchy pchisq pexp pf pgamma pgeom phyper plnorm plogis pnbinom pnorm ppois psigamma psignrank pt punif pweibull pwilcox qbeta qbinom qcauchy qchisq qexp qf qgamma qgeom qhyper qlnorm qlogis qnbinom qnorm qpois qsignrank qt quantile qunif qweibull qwilcoxrange range rbeta rbinom rcauchy rchisq rexprf rgamma rgeom rhyper rlnorm rlogis rnbinom rnorm round rowMeans rowSums rpois rsignrank rt runif rweibullr wilcox sample sd sign signif sin sinh sort sqrt summary t tabulate tan tanh trigamma truncvar January 2010 8 Snow & Snowfall • Snow (http://cran.rproject.org/web/packages/snow/index.html) – Provides a high-level interface for using a workstation cluster or multicore CPU for parallel computations in R. – Master / Slave model of communication • Snowfall (http://cran.r-project.org/web/packages/snowfall/index.html) – Usability wrapper around snow. – Easier development of parallel R programs. January 2010 9 FF objects and snowfall January 2010 10 FF objects and snowfall January 2010 11 Conclusions • Implementing fast memory mapped access to flat files. • ff objects can be created, stored, used and removed, almost like standard R RAM objects • Allows to process datasets that do not fit into CPU physical memory • ff objects are perfect to read the same data from many R processes • FF objects are currently supported in the SPRINT package. January 2010 12