RRE (Revolution R Enterprise) vs. R at PC Cluster
Edward Cheng
2.18.2014
2015/4/9

PC Cluster

Environment
• Node01~node36, stathpc: RHEL 5 + RRE 6.1 (R-2.14.2)
• Node51~node60, himemhpc: RHEL 6 + RRE 7.0 (R-3.0.2)

History
Origin of R
  1993, Professors Ross Ihaka and Robert Gentleman, University of Auckland, New Zealand
Revolution Analytics (www.revolutionanalytics.com)
  2008: investment from Intel Capital and other venture capital firms
  Board members include Professor Robert Gentleman (R founder) and advisor Norman H. Nie (former SPSS CEO)
Revolution R Enterprise (the enterprise edition of R)

R
• R is the world's most widely used statistics programming language.
• Free and open source software

R usage

R package growth

Why Revolution R

Performance
                                 R-2.14.2    RRE 6.1    R-3.0.1     RRE 7.0
Matrix Multiply (10000*10000)    751 sec     35 sec     568 sec     20 sec
SVD (10000*10000)                5746 sec    374 sec    4549 sec    256 sec

Big Data is coming

Definition
• "Big Data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Bytes

Big Data
• In 2011, the world's digital data amounted to about 1.8 ZB (1 ZB = 10^21 bytes, roughly 2^70 bytes). A research report by IDC (International Data Corporation) forecasts that the total will reach about 35.2 ZB by 2020, 44 times the 2009 level of about 0.8 ZB.

Big Data
The Big Data tsunami is arriving
• 2006: a cumulative 850 TB of web page data stored
• 2009: about 220 million photos uploaded per week, requiring 25 TB of storage
• 2011: about 48 hours (48 GB) of video uploaded per minute (about 70 TB per day)

eBay
The world's largest online marketplace
• We have over 50 petabytes of data
• We have over 400 million items for sale
• We process more than 250 million user queries per day
• We have over 112 million active users
• We sold over US$75 billion in merchandise in 2012

Big Problems
• Capacity: data too big to fit into memory
• Speed: computation may be too slow to be useful

Distributed computing

RevoScaleR
• RevoScaleR Package
RevoScaleR analysis functions such as rxCube, rxLinMod, rxCovCor, rxLogit, and rxGlm will provide significant speed
improvements over the alternatives. These algorithms are all optimized for handling big data.

Multi-threaded Processing

.xdf data format
• XDF is a binary file format with an interface that optimizes row and column processing and analysis.
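
The "Capacity" problem above (data too big to fit into memory) is commonly handled by processing rows in blocks and keeping only small running summaries. The sketch below illustrates that general idea in base R for a linear regression: it accumulates the cross-products X'X and X'y one chunk at a time and solves the normal equations at the end. It does not use RevoScaleR or XDF, and the slides do not say how rxLinMod works internally, so treat it as an illustration under stated assumptions only. The function name `chunked_lm`, the parameter `chunk_rows`, the column names, and the simulated CSV file are all hypothetical.

```r
# Sketch: least-squares fit computed one chunk of rows at a time.
# Base R only. This is NOT RevoScaleR code; all names are hypothetical.

set.seed(1)

# Simulated example data written to a temporary CSV file (assumption:
# stands in for a file too large to load at once).
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.1)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(y = y, x1 = x1, x2 = x2), csv_path, row.names = FALSE)

chunked_lm <- function(path, chunk_rows = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once and strip the quotes written by write.csv.
  col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

  xtx <- NULL  # running sum of X'X (3 x 3)
  xty <- NULL  # running sum of X'y (3 x 1)

  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break  # end of file

    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    X <- cbind(1, chunk$x1, chunk$x2)  # design matrix with intercept column

    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, chunk$y)
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, chunk$y)
    }
  }

  # Solve the normal equations (X'X) beta = X'y.
  beta <- solve(xtx, xty)
  setNames(as.numeric(beta), c("(Intercept)", "x1", "x2"))
}

beta_chunked <- chunked_lm(csv_path, chunk_rows = 100)

# Reference fit that loads the whole file into memory.
beta_full <- coef(lm(y ~ x1 + x2, data = read.csv(csv_path)))

print(rbind(chunked = beta_chunked, full = beta_full))
```

Only one chunk of rows plus two small matrices is held in memory at any time, so the memory footprint depends on `chunk_rows` and the number of predictors rather than on the number of rows. Solving the normal equations directly is numerically less robust than the QR decomposition used by `lm`, which is acceptable for a sketch but worth keeping in mind for ill-conditioned data.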