Big Data Analytics with R and Hadoop
Chapter 5: Learning Data Analytics with R and Hadoop
Data Mining Lab, 2015.04.23, 김지연

Contents
• Understanding the data analytics project life cycle
• Understanding data analytics problems
  – Exploring web pages categorization
  – Computing the frequency of stock market change
  – Predicting the sale price of blue book for bulldozers

Understanding the data analytics project life cycle
1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data
5. Visualizing data

1. Identifying the problem
• Business analytics trends change as companies perform data analytics over web datasets to grow their business
• A data analytics application needs to be scalable so that it can keep collecting insights from these growing datasets
• Example: to increase business through a website, identify its important pages by categorizing them by popularity, type, traffic source, and content; the result helps decide the roadmap for improving web traffic and content

2. Designing data requirement
• Performing data analytics requires datasets from domains related to the problem
• Example: for social media analytics (the problem specification), use Facebook or Twitter as the data source; to identify user characteristics, we need user profile information, likes, and posts as data attributes

3. Preprocessing data
• data cleansing
• data aggregation
• data augmentation
• data sorting
• data formatting
• With Big Data, the datasets need to be formatted and uploaded to HDFS, where they are processed by Mappers and Reducers across the nodes of a Hadoop cluster
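
The preprocessing steps above can be sketched on a small data frame. This is only an illustration in base R: the data, the column names (date, page, visits), and the output file are hypothetical and do not come from the slides.

```r
# Hypothetical raw page-visit records; column names are illustrative only.
raw <- data.frame(
  date   = c("2015-04-01", "2015-04-01", "2015-04-02", NA, "2015-04-02"),
  page   = c("/home", "/home", "/about", "/home", "/about"),
  visits = c(10, 5, NA, 3, 7),
  stringsAsFactors = FALSE
)

# Data cleansing: drop rows with missing values.
clean <- raw[complete.cases(raw), ]

# Data aggregation: total visits per page.
agg <- aggregate(visits ~ page, data = clean, FUN = sum)

# Data augmentation: add a derived column (share of all visits).
agg$share <- agg$visits / sum(agg$visits)

# Data sorting: most visited pages first.
agg <- agg[order(-agg$visits), ]

# Data formatting: write a headerless CSV, a shape that is easy to
# split line by line in a streaming mapper (an assumed target format).
out_file <- file.path(tempdir(), "pages.csv")
write.table(agg, out_file, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
```

For a real Big Data job, the file written in the last step is what would then be uploaded to HDFS.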

4. Performing analytics over data
• Various machine learning and custom algorithmic concepts
  Regression
  Classification
  Clustering
  Model-based recommendation
• With Big Data, the same algorithms can be run on Hadoop clusters by translating their analytics logic into MapReduce jobs
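
To make the idea of translating analytics logic into MapReduce more concrete, here is a minimal local stand-in written in base R. It does not use Hadoop at all; run_local, map_fn, and reduce_fn are hypothetical helper names, and the per-group mean is just a simple example of an analytic split into a map step and a reduce step.

```r
# Map: turn each input record into a (key, value) pair.
map_fn <- function(record) list(key = record$group, value = record$x)

# Reduce: combine all values that share a key (here: their mean).
reduce_fn <- function(key, values) {
  data.frame(key = key, mean = mean(values), stringsAsFactors = FALSE)
}

# Local stand-in for a MapReduce run: map, shuffle (group by key), reduce.
run_local <- function(records, map_fn, reduce_fn) {
  pairs  <- lapply(records, map_fn)
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))
  groups <- split(values, keys)   # shuffle: collect the values for each key
  do.call(rbind, Map(reduce_fn, names(groups), groups))
}

records <- list(
  list(group = "a", x = 1), list(group = "b", x = 10),
  list(group = "a", x = 3), list(group = "b", x = 20)
)
result <- run_local(records, map_fn, reduce_fn)
```

On a cluster, the map and reduce functions keep the same shape, while the framework takes over the shuffle and the distribution across nodes.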

5. Visualizing data
• ggplot2
• rCharts

Understanding data analytics problems - Exploring web pages categorization

1. Identifying the problem
• To identify the category of a web page of a website based on the visit count of the pages
• To identify the importance of the web pages of a website, so that the content, design, or visits of the less popular pages can be improved or increased

2. Designing data requirement
• Use a Google Analytics dataset with the following attributes
  date: This is the date of the day when the web page was visited
  source: This is the referral to the web page
  pageTitle: This is the title of the web page
  pagePath: This is the URL of the web page
• The code for the extraction process from Google Analytics

3. Preprocessing data

4. Performing analytics over data
• Initialize by setting the Hadoop variables and loading the RHadoop libraries
• Upload the datasets to HDFS
• MapReduce 1
• MapReduce 2
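
The two MapReduce jobs themselves are not reproduced on the slides. As a rough local sketch of what a two-pass categorization by visit count could look like, the base R code below first totals the visits per page and then assigns one of three categories. The sample data and the equal-width thresholds between the minimum and maximum totals are assumptions made for illustration, not the rule used in the original jobs.

```r
# Hypothetical extract with one row per visit (column names follow the slide).
visits <- data.frame(
  pagePath = c("/home", "/home", "/home", "/blog", "/blog", "/contact"),
  source   = c("google", "direct", "google", "twitter", "google", "direct"),
  stringsAsFactors = FALSE
)

# Pass 1 (the role of the first job): total visits per page.
totals <- as.data.frame(table(pagePath = visits$pagePath),
                        responseName = "visits", stringsAsFactors = FALSE)

# Pass 2 (the role of the second job): label each page by where its total
# falls between the minimum and maximum. Equal-width thirds are an assumption.
breaks <- seq(min(totals$visits), max(totals$visits), length.out = 4)
totals$category <- cut(totals$visits, breaks = breaks,
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)
```

This sketch needs at least two distinct visit totals, otherwise the breaks collapse to a single value and cut() fails.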

5. Visualizing data
• the web page categorization output using the three categories
• if we have more information, such as sources, we can represent the web pages as nodes of a graph, colored by popularity with directed edges when users follow the links

Understanding data analytics problems - Computing the frequency of stock market change

1. Identifying the problem
• Calculate the frequency of past changes for one particular stock market symbol, in a manner similar to a Fourier transformation
• From these frequencies, the investor can get more insight into changes over different time periods
• Goal: to calculate the frequencies of percentage change

2. Designing data requirement
• Use Yahoo! Finance as the input dataset
  From month
  From day
  From year
  To month
  To day
  To year
  Symbol

3. Preprocessing data
• Load the extracted dataset into R and write it out as table.csv for the analytics
stock_BP <- read.csv("")
write.csv(stock_BP,"table.csv", row.names=FALSE)

• Upload table.csv to HDFS
bin/hadoop dfs -put /usr/jyk/table.csv /input/

4. Performing analytics over data
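• Mapper: stock_mapper.R
#!/usr/bin/env Rscript
# Emit one "<change>\t1" pair per input row, where change = close - open.
options(warn=-1)  # silence coercion warnings, e.g. from the CSV header row
input<-file("stdin","r")
while(length(currentLine<-readLines(input,n=1,warn=FALSE))>0){
    fields<-unlist(strsplit(currentLine,","))
    open<-as.double(fields[2])   # column 2: Open
    close<-as.double(fields[5])  # column 5: Close
    change<-(close-open)
    if(is.na(change)) next       # skip the header row and malformed rows
    write(paste(change,1,sep="\t"),stdout())
}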
close(input)

• Reducer: stock_reducer.R
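#!/usr/bin/env Rscript
# Sum the counts per key; relies on the input lines arriving sorted by key.
current.key<-NA
current.val<-0.0
conn<-file("stdin","r")
while(length(next.line<-readLines(conn,n=1))>0){
    split.line<-strsplit(next.line,"\t")
    key<-split.line[[1]][1]
    val<-as.numeric(split.line[[1]][2])
    if(is.na(current.key)){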
        current.key<-key
        current.val<-val
    }
    else{
        if(current.key==key){
            current.val<-current.val+val
        }
        else{
            write(paste(current.key,current.val,sep="\t"),stdout())
            current.key<-key
            current.val<-val
        }
    }
}
write(paste(current.key,current.val,sep="\t"),stdout())
close(conn)

• MapReduce
/opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming-2.5.0-mr1-cdh5.3.3.jar \
-input input/table.csv \
-output outputs \
-file /home/jyk/Documents/stock_mapper.R \
-mapper /home/jyk/Documents/stock_mapper.R \
-file /home/jyk/Documents/stock_reducer.R \
-reducer /home/jyk/Documents/stock_reducer.R
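
Before submitting the job, the mapper and reducer logic can be checked locally on a few rows. The base R sketch below mirrors the two scripts without Hadoop; the sample rows are made up and assume the Date,Open,High,Low,Close,Volume,Adj Close column order that the mapper's field indices imply.

```r
# Hypothetical rows in the assumed column order:
# Date,Open,High,Low,Close,Volume,Adj Close
lines <- c(
  "2015-04-20,10.0,10.5,9.8,10.5,1000,10.5",
  "2015-04-21,10.5,11.0,10.4,11.0,1200,11.0",
  "2015-04-22,11.0,11.2,10.4,10.5,900,10.5",
  "2015-04-23,10.5,11.1,10.5,11.0,1100,11.0"
)

# Map step: one change value (close - open) per input line.
map_change <- function(line) {
  fields <- unlist(strsplit(line, ","))
  as.double(fields[5]) - as.double(fields[2])
}
changes <- vapply(lines, map_change, numeric(1), USE.NAMES = FALSE)

# Hadoop groups the mapper output by key (as text) before the reducer runs;
# table() does the equivalent grouping and counting locally.
freq <- table(as.character(changes))
```

Here freq holds, for each distinct change value, how many days showed that change, which is the same key and count pair the reducer writes out.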

5. Visualizing data
library(ggplot2)
# V1: change value (reducer key), V2: its frequency (reducer value)
myStockData <- read.delim("stock_output.txt", header=FALSE, sep="", dec=".")
ggplot(myStockData, aes(x=V1, y=V2)) + geom_smooth() + geom_point()

Understanding data analytics problems - Predicting the sale price of blue book for bulldozers

1. Identifying the problem
• To show how a large dataset can be resampled and fitted with a random forest model using R and Hadoop
• To predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration

2. Designing data requirement
• Use the dataset from the Kaggle competition "Blue Book for Bulldozers"

File name | Description | format (size)
Train | This is a training set that contains data for 2011.
Valid | This is a validation set that contains data from January 1, 2012 to April 30, 2012.
Data dictionary | This is the metadata of the training dataset variables.
Machine_Appendix | This contains the correct year of manufacturing for a given machine along with the make, model, and product class details.
Test | This is the test dataset.
random_forest_benchmark_test | This is the benchmark solution provided by the host.

3. Preprocessing data
• Load the Train.csv and Machine_Appendix.csv datasets
• Add a few features and merge the two datasets

4. Performing analytics over data
• Random sampling
  N data points in our initial training set
  A set of M different models for an ensemble classifier
  Each of the M models will be fitted with K data points
  KM < N: we are not using the full amount of data available to us
  KM = N: we can exactly partition our dataset to produce totally independent samples
  KM > N: we must resample some of our data with replacement

• Poisson sampling
  Generates independent samples from the N training input points
  Three parameters: N, M, and K, where K is fixed
  T = K/N is the average fraction of the input data in each model; specifying T removes the need to know N in advance
  T = frac.per.model = 0.1 (10% of the data per model)
  Number of models M = num.models = 50
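
The sampling step can be sketched in base R as follows. The sketch assumes that each input row is included in each model's sample a Poisson-distributed number of times with mean T, so every row can be handled independently; the input size n is a hypothetical value used only to generate the example, and frac.per.model and num.models take the values from the slide.

```r
set.seed(42)                 # fixed seed so the sketch is reproducible

frac.per.model <- 0.1        # T = K/N, expected fraction of the data per model
num.models     <- 50         # M

n <- 1000                    # hypothetical input size; the draw per row does not depend on it

# counts[i, m] = how many times input row i enters the sample for model m.
counts <- matrix(rpois(n * num.models, lambda = frac.per.model),
                 nrow = n, ncol = num.models)

# Row indices for model 1, repeated according to their Poisson counts.
sample_1 <- rep(seq_len(n), times = counts[, 1])

# Sample sizes vary from model to model around K = T * N = 100.
sizes <- colSums(counts)
```

Because the count for a row depends only on T, a mapper can emit each row to the models it was drawn for without knowing the total number of rows.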

• Fitting random forests
  Underfitting
  Normal fitting
  Overfitting

• Mapper
• Reducer
• MapReduce
• Each of the 50 samples produced a random forest with 10 trees, so the final random forest is a collection of 500 trees, fitted in a distributed fashion over a Hadoop cluster.

5. Visualizing data

Thank you