Big Data, Bigger Data & Big R Data
Birmingham R Users Meeting
23rd April 2013
Andy Pryke
Andy@The-Data-Mine.co.uk / @AndyPryke
My Bias…
www.the-data-mine.co.uk
I work in commercial data mining,
data analysis and data visualisation
Background in computing and
artificial intelligence
Use R to write programs which
analyse data
What is Big Data?
Depends who you ask.
Answers are often “too big to ….”
…load into memory
…store on a hard drive
…fit in a standard database
Plus
“Fast changing”
Not just relational
My “Big Data” Definition
“Data collections big
enough to require you to
change the way you
store and process them.”
- Andy Pryke
Data Size Limits in R
Standard R packages use a single thread,
with data held in memory (RAM)
See help("Memory-limits"):
• Vectors limited to 2 billion (2^31 - 1) items
• Memory limit of ~128 TB
Servers with 1 TB+ memory are available
• Also, Amazon EC2 servers up to 244 GB
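A rough sketch of what these limits mean in practice, using base R only. The variable names are illustrative, and the arithmetic assumes 8 bytes per double and a standard (non-long) vector:

```r
# Largest length of a standard (non-long) vector: 2^31 - 1
max_items <- .Machine$integer.max
print(max_items)

# A double takes 8 bytes, so a full-length numeric vector
# would need roughly this many GiB of RAM
gib_needed <- (8 * max_items) / 2^30
print(round(gib_needed, 1))

# Measure how much memory a modest vector actually uses
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

So the item limit is usually reached well after the RAM of a typical desktop has run out.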
Overview
• Problems using R with Big Data
• Processing data on disk
• Hadoop for parallel computation and Big
Data storage / access
• “In Database” analysis
• What next for Birmingham R User Group?
Background: R matrix class
“matrix”
- Built in (package base).
- Stored in RAM
- “Dense” (takes up memory
to store zero values)
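A small illustration of the "dense" point, using base R only. The matrix size is arbitrary:

```r
# A 1000 x 1000 matrix of zeros: every cell is still stored
# as an 8-byte double, even though it carries no information
m <- matrix(0, nrow = 1000, ncol = 1000)

# Roughly 1e6 cells * 8 bytes, plus a small header
dense_bytes <- as.numeric(object.size(m))
print(dense_bytes)
```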
Can be replaced by…..
Sparse / Disk Based Matrices
• Matrix – Package Matrix. Sparse. In RAM
• big.matrix – Package bigmemory /
bigmemoryExtras & VAM. On disk. VAM
allows access from parallel R sessions
• Analysis – Packages irlba, bigalgebra,
biganalytics (R-Forge list) etc.
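A minimal sketch of the in-RAM sparse option, assuming the Matrix package is installed (it is a recommended package in standard R installs). The sizes and the random positions are made up for illustration:

```r
library(Matrix)

set.seed(1)
n   <- 1000   # matrix is n x n
nnz <- 2000   # number of (row, column) positions to fill

# Sparse: only the non-zero entries and their positions are stored
i <- sample(n, nnz, replace = TRUE)
j <- sample(n, nnz, replace = TRUE)
sparse <- sparseMatrix(i = i, j = j, x = rep(1, nnz), dims = c(n, n))

# Dense copy of the same data: all n * n cells are stored
dense <- as.matrix(sparse)

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
print(c(dense = dense_bytes, sparse = sparse_bytes))

# Matrix-vector products agree between the two representations
v <- rep(1, n)
same_result <- isTRUE(all.equal(drop(dense %*% v),
                                drop(as.matrix(sparse %*% v))))
print(same_result)
```

The disk-based big.matrix route is not shown here; this only covers the sparse, in-memory case.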
More details?
“Large-Scale Linear Algebra with R”, Bryan
W. Lewis, Boston R Users Meetup
Commercial Versions of R
Revolution Analytics have specialised
versions of R for parallel execution & big data
I believe many if not most components are
also available under Free Open Source
licences, including the RHadoop set of
packages
Plenty more info here
Background: Hadoop
• Parallel data processing environment
based on Google’s “MapReduce” model
• “Map” – divide up the data and send it to
multiple nodes for processing.
• “Reduce” – Combine the results
Plus:
• Hadoop Distributed File System (HDFS)
• HBase – Distributed database like
Google’s BigTable
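The map / reduce idea can be sketched on a single machine in base R, with no Hadoop involved. The function and variable names below are illustrative only, and the explicit "shuffle" step stands in for the grouping that a MapReduce framework performs between the two phases:

```r
text_lines <- c("in the beginning", "the word", "in the end")

# "Map": treat each line as a chunk and emit a (word, 1) pair
# for every word in it
map_fn <- function(line) {
  words <- unlist(strsplit(line, split = " ", fixed = TRUE))
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(text_lines, map_fn))

# "Shuffle": group the emitted values by key
grouped <- split(mapped$value, mapped$key)

# "Reduce": combine the values for each key into one result
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

On a cluster, the lapply() over chunks is what gets spread across nodes.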
RHadoop – Revolution Analytics
Package: rmr2, rhbase, rhdfs
• Example code using RMR (R Map-Reduce)
• R and Hadoop – Step by Step Tutorials
• Install and Demo RHadoop (Google for
more of these online)
• Data Hacking with RHadoop
RHadoop: Example Functions and Output
library(rmr2)

wc.map <- function(., lines) {
  ## Split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1)  ## each word occurs once
}
## Output of wc.map:
## In, 1
## the, 1
## beginning, 1
## ...

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}
## Output of wc.reduce:
## the, 2345
## word, 987
## beginning, 123
## ...

wordcount <- function(input, output = NULL) {
  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)
}
Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue
- Easy way to distribute CPU-intensive work
- Uses Amazon’s Elastic MapReduce service,
which costs money.
- Not designed for big data, but easy and fun.
Example follows…
# First, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()
# Take the mean of each set of numbers, ignoring the NA
> outputLocal <- lapply(myList, mean, na.rm=TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29
## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE
# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
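The same "drop-in replacement for lapply()" pattern can be tried locally, without an Amazon account. The sketch below does not use segue: it swaps in mclapply() from the parallel package that ships with R. The test list is a hypothetical stand-in for getMyTestList(), built to match the description above:

```r
library(parallel)

set.seed(42)
# Stand-in for getMyTestList(): 10 elements, each holding
# 999 random numbers plus one NA (an assumption, not segue's code)
myList <- replicate(10, c(rnorm(999), NA), simplify = FALSE)

# Forking is unavailable on Windows, so use a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

outputLocal    <- lapply(myList, mean, na.rm = TRUE)
outputParallel <- mclapply(myList, mean, na.rm = TRUE, mc.cores = cores)

# Check that sequential and parallel results match
results_match <- isTRUE(all.equal(outputParallel, outputLocal))
print(results_match)
```

As with emrlapply(), only the function name and one extra argument change. The work is spread across local cores rather than a remote cluster.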
Oracle R Connector for Hadoop
• Integrates with Oracle Db, “Oracle Big Data
Appliance” (sounds expensive!) & HDFS
• Map-Reduce is very similar to the rmr example
• Documentation lists examples for Linear
Regression, k-means, working with graphs
amongst others
• Introduction to Oracle R Connector for Hadoop.
• Oracle also offer some in-database algorithms
for R via Oracle R Enterprise (overview)
Teradata Integration
Package: teradataR
• Teradata offer in-database analytics, accessible
through R
• These include k-means clustering, descriptive
statistics and the ability to create and call
in-database user-defined functions
What Next?
I propose an informal “big data” Special Interest
Group, where we collaborate to explore big data
options within R, producing example code etc.
“R” you interested?