Big Data Analytics with R and Hadoop
Chapter 5: Learning Data Analytics with R and Hadoop
Data Mining Laboratory, 2015.04.23, 김지연 (Kim Ji-yeon)

Content
• Understanding the data analytics project life cycle
• Understanding data analytics problems
  – Exploring web pages categorization
  – Computing the frequency of stock market change
  – Predicting the sale price of blue book for bulldozers

Understanding the data analytics project life cycle
1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data
5. Visualizing data

Understanding the data analytics project life cycle
1. Identifying the problem
• Business analytics trends are shifting toward performing data analytics over web datasets to grow the business.
• Because these datasets keep growing, a data analytics application must be scalable to keep collecting insights from them.
• Example: to find out how to grow the business, identify the important pages of the website by categorizing them by popularity. Based on these popular pages, their types, their traffic sources, and their content, we can decide on a roadmap for improving the business by improving web traffic and content.

Understanding the data analytics project life cycle
2. Designing data requirement
• Data analytics needs datasets from the domains related to the problem.
• Example: for social media analytics (the problem specification), use Facebook or Twitter as the data source. To identify user characteristics, we need user profile information, likes, and posts as data attributes.

Understanding the data analytics project life cycle
3. Preprocessing data
• Data cleansing
• Data aggregation
• Data augmentation
• Data sorting
• Data formatting
• With Big Data, the datasets need to be formatted and uploaded to HDFS, where they are processed on the nodes of a Hadoop cluster by Mappers and Reducers.
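The preprocessing steps above can be sketched in base R on a tiny data frame. This is only an illustration: the `visits` data, its column names, and the output file name are hypothetical and do not come from the slides, and the augmentation step is left out.

```r
# Hypothetical raw page-visit records; all names and values are made up.
visits <- data.frame(
  page   = c("/home", "/home", "/blog", NA, "/blog", "/about"),
  date   = c("2015-04-01", "2015-04-02", "2015-04-01",
             "2015-04-01", "2015-04-02", "2015-04-02"),
  visits = c(120, 80, 30, 5, NA, 12),
  stringsAsFactors = FALSE
)

# Cleansing: drop rows with a missing key or a missing measure.
clean <- visits[complete.cases(visits), ]

# Aggregation: total visits per page.
totals <- aggregate(visits ~ page, data = clean, FUN = sum)

# Sorting: most visited pages first.
totals <- totals[order(-totals$visits), ]

# Formatting: write a header-less CSV, a shape that a line-oriented
# Mapper can split on commas.
out_file <- file.path(tempdir(), "page_visits.csv")
write.table(totals, out_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

A file prepared this way would then be uploaded to HDFS before any Mapper or Reducer runs over it.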
Understanding the data analytics project life cycle
4. Performing analytics over data
• Machine learning and custom algorithmic concepts:
  – Regression
  – Classification
  – Clustering
  – Model-based recommendation
• With Big Data, the same algorithms can be run on Hadoop clusters by translating their data analytics logic into MapReduce jobs.

Understanding the data analytics project life cycle
5. Visualizing data
• ggplot2
• rCharts

Understanding data analytics problems - Exploring web pages categorization
1. Identifying the problem
• To identify the category of each web page of a website based on the visit count of the pages
• To identify the importance of the web pages of a website, so that the content, design, or visits of the less popular pages can be improved or increased

Understanding data analytics problems - Exploring web pages categorization
2. Designing data requirement
• Use the Google Analytics dataset
  – date: the date on which the web page was visited
  – source: the referral to the web page
  – pageTitle: the title of the web page
  – pagePath: the URL of the web page
• The code for the extraction process from Google Analytics [code listing not reproduced]

Understanding data analytics problems - Exploring web pages categorization
3.
Preprocessing data

Understanding data analytics problems - Exploring web pages categorization
4. Performing analytics over data
• Initialize by setting the Hadoop variables and loading the RHadoop library
• Upload the datasets to HDFS
• MapReduce 1 [code listing not reproduced]
• MapReduce 2 [code listing not reproduced]

Understanding data analytics problems - Exploring web pages categorization
5. Visualizing data
• The web page categorization output uses three categories.
• With more information, such as sources, the web pages can be represented as the nodes of a graph, colored by popularity, with directed edges where users follow links.

Understanding data analytics problems - Computing the frequency of stock market change
1. Identifying the problem
• Calculate the frequency of past changes for one particular stock market symbol, in the manner of a Fourier transformation.
• From this, the investor can gain more insight into changes over different time periods.
• Goal: to calculate the frequencies of percentage change

Understanding data analytics problems - Computing the frequency of stock market change
2. Designing data requirement
• Use Yahoo!
Finance as the input dataset, with these request parameters:
  – From month
  – From day
  – From year
  – To month
  – To day
  – To year
  – Symbol

Understanding data analytics problems - Computing the frequency of stock market change
3. Preprocessing data
• To perform the analytics over the extracted dataset, download it and save it as a CSV file:

    stock_BP <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=BP")
    write.csv(stock_BP, "table.csv", row.names=FALSE)

• Upload table.csv to HDFS:

    bin/hadoop dfs -put /usr/jyk/table.csv /input/

Understanding data analytics problems - Computing the frequency of stock market change
4. Performing analytics over data
• Mapper: stock_mapper.R

    #!/usr/bin/env Rscript
    # Emits "<change>\t1" for every data row read from stdin.
    options(warn=-1)
    input <- file("stdin", "r")
    while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
      fields <- unlist(strsplit(currentLine, ","))
      # Column 2 is the opening price, column 5 the closing price.
      open.price <- as.double(fields[2])
      close.price <- as.double(fields[5])
      # Absolute daily change; 100 * change / open.price would give a percentage.
      change <- close.price - open.price
      # The header row parses to NA and is skipped.
      if (!is.na(change)) write(paste(change, 1, sep="\t"), stdout())
    }
    close(input)

• Reducer: stock_reducer.R

    #!/usr/bin/env Rscript
    # Sums the counts of consecutive lines that share a key.
    # Relies on Hadoop delivering the Mapper output sorted by key.
    current.key <- NA
    current.val <- 0.0
    conn <- file("stdin", "r")
    while (length(next.line <- readLines(conn, n=1)) > 0) {
      split.line <- strsplit(next.line, "\t")
      key <- split.line[[1]][1]
      val <- as.numeric(split.line[[1]][2])
      if (is.na(current.key)) {
        # First line: start the first group.
        current.key <- key
        current.val <- val
      } else if (current.key == key) {
        # Same key: keep accumulating.
        current.val <- current.val + val
      } else {
        # New key: emit the finished group and start the next one.
        write(paste(current.key, current.val, sep="\t"), stdout())
        current.key <- key
        current.val <- val
      }
    }
    # Emit the last group.
    write(paste(current.key, current.val, sep="\t"), stdout())
    close(conn)
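The Mapper and Reducer above can be reasoned about without a cluster by replaying the three phases on a small in-memory table. The sketch below is an assumption-laden illustration, not the job itself: the `quotes` values are made up, and `aggregate` stands in for the line-by-line Reducer loop.

```r
# Hypothetical daily quotes; in the slides these rows come from table.csv.
quotes <- data.frame(
  Open  = c(10.0, 10.5, 11.0, 10.0, 12.0),
  Close = c(10.5, 10.5, 11.5, 10.5, 12.0)
)

# Map phase: one (change, 1) pair per trading day.
pairs <- data.frame(key = quotes$Close - quotes$Open, value = 1)

# Shuffle phase: Hadoop sorts the pairs by key before the Reducer reads
# them, so equal keys arrive next to each other.
pairs <- pairs[order(pairs$key), ]

# Reduce phase: sum the values that share a key.
freq <- aggregate(value ~ key, data = pairs, FUN = sum)
print(freq)
```

Each row of `freq` corresponds to one output line of the Reducer: a change value and the number of days on which it occurred. Hadoop Streaming compares keys as text rather than as numbers, so the row order on a cluster can differ from the numeric order here, but the groups are the same.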
Performing analytics over data (continued)
• MapReduce: run the job with Hadoop Streaming

    /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming-2.5.0-mr1-cdh5.3.3.jar \
      -input /input/table.csv \
      -output outputs \
      -file /home/jyk/Documents/stock_mapper.R \
      -mapper stock_mapper.R \
      -file /home/jyk/Documents/stock_reducer.R \
      -reducer stock_reducer.R

Understanding data analytics problems - Computing the frequency of stock market change
5. Visualizing data

    library(ggplot2)
    # V1 is the price change, V2 is how often it occurred.
    myStockData <- read.delim("stock_output.txt", header=FALSE, sep="", dec=".")
    ggplot(myStockData, aes(x=V1, y=V2)) + geom_smooth() + geom_point()

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
1. Identifying the problem
• To show how a large dataset can be resampled and fitted with the random forest model using R and Hadoop
• To predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
2. Designing data requirement
• Use the Kaggle competition dataset: http://www.kaggle.com/c/bluebook-for-bulldozers
• File name: description
  – Train: the training set, which contains data through the end of 2011.
  – Valid: the validation set, which contains data from January 1, 2012 to April 30, 2012.
  – Data dictionary: the metadata of the training dataset variables.
  – Machine_Appendix: the correct year of manufacture for a given machine, along with the make, model, and product class details.
  – Test: the test dataset.
  – random_forest_benchmark_test: the benchmark solution provided by the host.

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
3. Preprocessing data
• Load the Train.csv and Machine_Appendix.csv datasets
• Add a few features and merge the two datasets

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
4. Performing analytics over data
• Random sampling
  – N data points in our initial training set
  – A set of M different models for an ensemble classifier
  – Each of the M models is fitted with K data points
• Poisson sampling
  – KM < N: we are not using the full amount of data available to us
  – KM = N: we can exactly partition our dataset to produce totally independent samples
  – KM > N: we must resample some of our data with replacement

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
4. Performing analytics over data
• Poisson sampling
  – Generates independent samples from the N training input points
  – Three parameters: N, M, and K, where K is fixed
  – Fixing T = K/N instead removes the need to know the value of N in advance
  – T = K/N is the average fraction of the input data in each model: 10%, so T = frac.per.model = 0.1
  – Number of models: M = num.models = 50
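The Poisson sampling parameters above can be tried out in base R. This is a minimal sketch, not the Mapper shown in the presentation: `N` and the helper `sample.indices` are hypothetical, while `frac.per.model` and `num.models` take the values given on the slide.

```r
set.seed(42)               # fixed seed so the sketch is repeatable

N <- 10000                 # hypothetical number of training rows
frac.per.model <- 0.1      # T = K / N, as on the slide
num.models <- 50           # M, as on the slide

# For one model, draw a Poisson(T) count for every row and repeat each
# row index that many times; a count of 2 or more resamples the row.
sample.indices <- function(n.rows, frac) {
  draws <- rpois(n = n.rows, lambda = frac)
  rep(seq_len(n.rows), times = draws)
}

# One independent sample of row indices per model.
samples <- lapply(seq_len(num.models),
                  function(m) sample.indices(N, frac.per.model))

# Each sample holds about K = T * N rows on average.
sizes <- vapply(samples, length, integer(1))
print(mean(sizes))
```

Because every row is drawn independently with rate T, a Mapper can sample its own block of rows without knowing how many rows the other blocks hold, which is why fixing T rather than K is convenient on a cluster.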
Performing analytics over data (continued)
• Fitting random forests [figure panels: underfitting, normal fitting, overfitting]
• Mapper [code listing not reproduced]
• Reducer [code listing not reproduced]
• MapReduce [code listing not reproduced]
• Each of the 50 samples produced a random forest with 10 trees, so the final random forest is a collection of 500 trees, fitted in a distributed fashion over a Hadoop cluster.

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers
5. Visualizing data

Thank you