Learning Data Analytics with R and Hadoop

Big data analytics with R and Hadoop
Chapter 5 Learning Data Analytics with R and Hadoop
Data Mining Laboratory
2015.04.23
김지연
Content
• Understanding the data analytics project life cycle
• Understanding data analytics problems
– Exploring web pages categorization
– Computing the frequency of stock market change
– Predicting the sale price of blue book for bulldozers
Understanding the data analytics project life cycle
1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data
5. Visualizing data
1. Identifying the problem
• Business analytics trends change as companies analyze web datasets to grow their business
• A data analytics application needs to be scalable to collect insights from these datasets
• If we want to know how to increase the business:
– identify the important pages of our website by categorizing them
– analyze these popular pages by type, traffic source, and content
– use the results to decide the roadmap for improving the business by improving web traffic and content
2. Designing data requirement
• To perform data analytics, the problem needs datasets from related domains
• Example: social media analytics (problem specification)
– use Facebook or Twitter as the data source
– to identify user characteristics, we need user profile information, likes, and posts as data attributes
3. Preprocessing data
• data cleansing
• data aggregation
• data augmentation
• data sorting
• data formatting
• Big Data
– the datasets need to be formatted and uploaded to HDFS
– they are then used by the nodes running Mappers and Reducers in the Hadoop cluster
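The preprocessing steps above can be sketched in base R on a small in-memory table. This is only an illustration: the data, the column names (date, page, visits), and the output file are hypothetical, and data augmentation is left out because it depends on the external source being joined in.

```r
# Toy records standing in for raw data; all names and values are hypothetical.
raw <- data.frame(
  date   = c("2015-04-02", "2015-04-01", "2015-04-01", NA),
  page   = c("/home", "/home", "/about", "/home"),
  visits = c(3, 5, 4, 2),
  stringsAsFactors = FALSE
)

# Data cleansing: drop rows with missing values.
clean <- raw[complete.cases(raw), ]

# Data aggregation: total visits per page.
agg <- aggregate(visits ~ page, data = clean, FUN = sum)

# Data sorting: most visited pages first.
agg <- agg[order(-agg$visits), ]

# Data formatting: write a headerless CSV that a line-oriented job can read.
out_file <- tempfile(fileext = ".csv")
write.table(agg, out_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

A file written this way has one record per line and no header, which is the shape a streaming Mapper typically expects once the file is uploaded to HDFS.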
4. Performing analytics over data
• various machine learning and custom algorithmic concepts
– regression
– classification
– clustering
– model-based recommendation
• Big Data
– the same algorithms can be run on Hadoop clusters by translating their data analytics logic into MapReduce jobs
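To make the idea of translating analytics logic into map and reduce steps concrete, here is a minimal in-memory imitation in base R. It does not use Hadoop; local_mapreduce, map_fn, and reduce_fn are hypothetical names introduced for this sketch, and the example computes a per-label mean.

```r
# In-memory imitation of map -> shuffle -> reduce (no Hadoop involved).
local_mapreduce <- function(records, map_fn, reduce_fn) {
  # Map: each record yields one list(key = ..., value = ...) pair.
  pairs <- lapply(records, map_fn)
  keys <- vapply(pairs, function(p) as.character(p$key), character(1))
  values <- lapply(pairs, function(p) p$value)
  # Shuffle: group the values by key.
  grouped <- split(values, keys)
  # Reduce: collapse each group of values into one result per key.
  mapply(function(k, v) reduce_fn(k, unlist(v)), names(grouped), grouped)
}

# Example: mean of a numeric value per label, written as map and reduce steps.
records <- list(
  list(label = "a", x = 1),
  list(label = "b", x = 10),
  list(label = "a", x = 3)
)
result <- local_mapreduce(
  records,
  map_fn = function(r) list(key = r$label, value = r$x),
  reduce_fn = function(key, values) mean(values)
)
```

The analytics logic lives entirely in map_fn and reduce_fn, which is the part that would carry over to a real MapReduce job.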
5. Visualizing data
• ggplot2
• rCharts
Understanding data analytics problems
- Exploring web pages categorization
1. Identifying the problem
• To identify the category of each web page of a website based on its visit count
• To identify the importance of the web pages, so that the content, design, or visits of the less popular pages can be improved or increased
2. Designing data requirement
• Use a Google Analytics dataset
– date: This is the date of the day when the web page was visited
– source: This is the referral to the web page
– pageTitle: This is the title of the web page
– pagePath: This is the URL of the web page
• the code for the extraction process from Google Analytics
3. Preprocessing data
4. Performing analytics over data
• Initialize by setting Hadoop variable & loading the RHadoop library
• Upload the datasets to HDFS
• MapReduce 1
• MapReduce 2
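The code for the two jobs is not reproduced in this text. As a rough sketch of what a two-stage categorization could compute, the base R example below counts visits per page and then assigns one of three categories relative to the most visited page. The rows are made up, and the 0.3 and 0.7 cut-offs are hypothetical choices for illustration, not values from the slides.

```r
# Toy visit log using two of the extracted fields; the rows are made up.
visits <- data.frame(
  pagePath = c("/home", "/home", "/home", "/home", "/blog", "/blog", "/about"),
  source   = c("google", "direct", "google", "twitter", "google", "direct", "direct"),
  stringsAsFactors = FALSE
)

# Stage 1 (what a first job could compute): visit count per page.
page_counts <- table(visits$pagePath)

# Stage 2 (what a second job could compute): compare each page with the
# most visited page and assign one of three categories.
# The 0.3 and 0.7 cut-offs are hypothetical.
share <- as.numeric(page_counts) / max(page_counts)
category <- cut(share, breaks = c(0, 0.3, 0.7, 1),
                labels = c("low", "medium", "high"))

result <- data.frame(
  pagePath = names(page_counts),
  visits   = as.integer(page_counts),
  category = as.character(category),
  stringsAsFactors = FALSE
)
```

The second stage needs the maximum count across all pages, which is one reason such an analysis is naturally split into two passes over the data.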
5. Visualizing data
• The output is the web page categorization using three categories
• If we have more information, such as sources, we can represent the web pages as nodes of a graph, colored by popularity, with directed edges where users follow links
- Computing the frequency of stock market change
1. Identifying the problem
• Calculate the frequency of past changes for one particular stock market symbol, comparable to a Fourier transformation
• The investor can get more insights on changes for different time periods
• Goal: to calculate the frequencies of percentage change
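As a small worked example of "frequencies of percentage change", the base R sketch below computes the open-to-close change in percent for a few made-up trading days and tabulates how often each value occurs. The prices are invented, and rounding to whole percent is an assumption made here so that similar days share a bucket.

```r
# Toy daily prices; the values are made up for illustration.
prices <- data.frame(
  Open  = c(100, 50, 200, 80, 40),
  Close = c(101, 49, 202, 80, 39.6)
)

# Percentage change from open to close, rounded to whole percent
# (the rounding is an assumption of this sketch).
pct_change <- round((prices$Close - prices$Open) / prices$Open * 100)

# Frequency of each percentage change.
freq <- table(pct_change)
```

Here a change of +1% occurs twice, while -2%, -1%, and 0% occur once each.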
2. Designing data requirement
• Use Yahoo! Finance as the input dataset, with these parameters:
– From month
– From day
– From year
– To month
– To day
– To year
– Symbol
3. Preprocessing data
• Download the extracted dataset and save it as a CSV file for analysis
stock_BP <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=BP")
write.csv(stock_BP,"table.csv", row.names=FALSE)
• Upload table.csv to HDFS
bin/hadoop dfs -put /usr/jyk/table.csv /input/
4. Performing analytics over data
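• Mapper: stock_mapper.R
#! /usr/bin/env Rscript
# Read CSV lines from stdin and emit "change<TAB>1" for each trading day
options(warn=-1)
input <- file("stdin", "r")
while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
  fields <- unlist(strsplit(currentLine, ","))
  open <- as.double(fields[2])    # column 2: opening price
  close <- as.double(fields[5])   # column 5: closing price
  change <- (close - open)
  # Skip the CSV header and malformed rows, which parse to NA
  if (is.na(change)) next
  write(paste(change, 1, sep="\t"), stdout())
}
close(input)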
• Reducer: stock_reducer.R
#! /usr/bin/env Rscript
# Sum the counts for each key; input arrives sorted by key
current.key <- NA
current.val <- 0.0
conn <- file("stdin", "r")
while (length(next.line <- readLines(conn, n=1)) > 0) {
  split.line <- strsplit(next.line, "\t")
  key <- split.line[[1]][1]
  val <- as.numeric(split.line[[1]][2])
  if (is.na(current.key)) {
    # First record: start the first group
    current.key <- key
    current.val <- val
  }
  else {
    if (current.key == key) {
      current.val <- current.val + val
    }
    else {
      # Key changed: emit the finished group and start a new one
      write(paste(current.key, current.val, sep="\t"), stdout())
      current.key <- key
      current.val <- val
    }
  }
}
# Emit the last group
write(paste(current.key, current.val, sep="\t"), stdout())
close(conn)
• MapReduce
/opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop jar \
/opt/cloudera/parcels/CDH/jars/hadoop-streaming-2.5.0-mr1-cdh5.3.3.jar \
-input input/table.csv \
-output outputs \
-file /home/jyk/Documents/stock_mapper.R \
-mapper /home/jyk/Documents/stock_mapper.R \
-file /home/jyk/Documents/stock_reducer.R \
-reducer /home/jyk/Documents/stock_reducer.R
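Before submitting the streaming job, the mapper and reducer logic can be checked on a few lines in plain R. The sketch below imitates the pipeline in memory (map, sort by key, reduce); the CSV lines are made up, the column order Date,Open,High,Low,Close is an assumption, and map_line is a hypothetical helper that skips rows that do not parse.

```r
# In-memory imitation of: mapper -> sort by key -> reducer.
# The CSV lines are made up; the column order is an assumption.
lines <- c(
  "Date,Open,High,Low,Close",
  "2015-04-01,10,12,9,11",
  "2015-04-02,20,21,19,21",
  "2015-04-03,30,30,28,29"
)

# Mapper logic: emit "change<TAB>1" for every row that parses.
map_line <- function(line) {
  fields <- unlist(strsplit(line, ","))
  change <- suppressWarnings(as.double(fields[5]) - as.double(fields[2]))
  if (is.na(change)) return(NULL)  # header or malformed row
  paste(change, 1, sep = "\t")
}
mapped <- unlist(lapply(lines, map_line))

# Shuffle: the reducer receives the mapper output sorted by key.
shuffled <- sort(mapped)

# Reducer logic: sum the counts per key.
parts <- strsplit(shuffled, "\t")
keys <- vapply(parts, `[`, character(1), 1)
vals <- as.numeric(vapply(parts, `[`, character(1), 2))
counts <- tapply(vals, keys, sum)
```

For these rows a change of 1 occurs twice and a change of -1 once, which is the kind of key and count pair the job writes to its output directory.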
5. Visualizing data
library(ggplot2)
# Read the MapReduce output: V1 = change, V2 = frequency
myStockData <- read.delim("stock_output.txt", header=FALSE, sep="", dec=".")
ggplot(myStockData, aes(x=V1, y=V2)) + geom_smooth() + geom_point()
- Predicting the sale price of blue book for bulldozers
1. Identifying the problem
• To show how large datasets can be resampled and fitted with the random forest model using R and Hadoop
• To predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration
2. Designing data requirement
• Use the Kaggle competition dataset
– http://www.kaggle.com/c/bluebook-for-bulldozers
File name: Description
Train: This is a training set that contains data for 2011.
Valid: This is a validation set that contains data from January 1, 2012 to April 30, 2012.
Data dictionary: This is the metadata of the training dataset variables.
Machine_Appendix: This contains the correct year of manufacturing for a given machine along with the make, model, and product class details.
Test: This is the test dataset.
random_forest_benchmark_test: This is the benchmark solution provided by the host.
3. Preprocessing data
• Loading Train.csv dataset & Machine_Appendix.csv
• Add a few features & merge
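The code for this step is not reproduced in this text. A minimal sketch of adding derived features and merging two tables in base R might look like the following; the toy rows and the column names (MachineID, saledate, SalePrice, MfgYear) and the derived features (saleYear, ageAtSale) are assumptions for illustration.

```r
# Toy stand-ins for Train.csv and Machine_Appendix.csv.
# Column names and values are assumptions made for this sketch.
transactions <- data.frame(
  MachineID = c(1, 2, 1),
  saledate  = as.Date(c("2010-05-01", "2011-03-15", "2011-09-30")),
  SalePrice = c(30000, 45000, 27000)
)
machines <- data.frame(
  MachineID = c(1, 2),
  MfgYear   = c(2000, 2005)
)

# Add a feature derived from the sale date.
transactions$saleYear <- as.integer(format(transactions$saledate, "%Y"))

# Merge machine details into each transaction.
training <- merge(transactions, machines, by = "MachineID")

# Add a feature that needs columns from both tables.
training$ageAtSale <- training$saleYear - training$MfgYear
```

Merging on the machine identifier keeps one row per sale while attaching the machine-level attributes to it.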
4. Performing analytics over data
• Random sampling
– N data points in our initial training set
– a set of M different models for an ensemble classifier
– each of the M models is fitted with K data points
• Poisson sampling
– KM < N: we are not using the full amount of data available to us
– KM = N: we can exactly partition our dataset to produce totally independent samples
– KM > N: we must resample some of our data with replacement
• Poisson sampling (continued)
– generates independent samples from the N training input points
– three parameters: N, M, and K, where K is fixed
– use T = K/N to eliminate the need to know the value of N in advance
– K/N is the average fraction of input data in each model, here 10%: T = frac.per.model = 0.1
– number of models: M = num.models = 50
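One way to picture Poisson sampling is the base R sketch below. For each model, every input row gets a Poisson draw with rate T, and the row is repeated that many times in the model's sample. The toy input and the helper name poisson_sample are hypothetical; frac.per.model and num.models use the values given above.

```r
set.seed(1)  # only to make the illustration repeatable

frac.per.model <- 0.1   # T = K/N
num.models <- 50        # M

# Toy input; in practice this would be one chunk of the training data.
input <- data.frame(id = 1:1000, x = rnorm(1000))

# For each model, draw a Poisson count per row: 0 leaves the row out,
# 1 uses it once, 2 or more repeats it in that sample.
poisson_sample <- function(data, rate) {
  draws <- rpois(nrow(data), lambda = rate)
  data[rep(seq_len(nrow(data)), times = draws), , drop = FALSE]
}

samples <- lapply(seq_len(num.models),
                  function(i) poisson_sample(input, frac.per.model))
sample_sizes <- vapply(samples, nrow, integer(1))
```

Because each row's draw depends only on T, a chunk of rows can be sampled without knowing the total N, and each sample ends up with about T times N rows on average (about 100 here).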
• Fitting random forests
(Figure: under fitting, normal fitting, and over fitting)
• Mapper
• Reducer
• MapReduce
• Each of the 50 samples produced a random forest with 10 trees, so the
final random forest is a collection of 500 trees, fitted in a distributed
fashion over a Hadoop cluster.
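The fit-then-combine pattern can be illustrated without Hadoop or a random forest package. In the base R sketch below, lm() is a stand-in for the small random forest fitted on each sample, and the ensemble prediction is the average over all fitted models. The toy data and the helper names fit_one and predict_ensemble are hypothetical, and this is not the code used in the slides.

```r
set.seed(42)  # only to make the illustration repeatable

# Toy regression data: y depends linearly on x plus noise.
n <- 2000
train <- data.frame(x = runif(n, 0, 10))
train$y <- 3 * train$x + rnorm(n)

num.models <- 50
frac.per.model <- 0.1

# "Map" step: fit one small model per Poisson subsample.
# lm() stands in for a 10-tree random forest.
fit_one <- function(i) {
  draws <- rpois(n, lambda = frac.per.model)
  lm(y ~ x, data = train[rep(seq_len(n), times = draws), ])
}
models <- lapply(seq_len(num.models), fit_one)

# "Reduce" step: the ensemble prediction is the average over all models.
predict_ensemble <- function(models, newdata) {
  preds <- vapply(models, function(m) predict(m, newdata = newdata),
                  numeric(nrow(newdata)))
  if (is.matrix(preds)) rowMeans(preds) else mean(preds)
}

new_points <- data.frame(x = c(2, 5))
ensemble_pred <- predict_ensemble(models, new_points)
```

With the toy data, the averaged predictions land close to 6 and 15, the values implied by the underlying line y = 3x.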
5. Visualizing data
Thank you