A Scalable Random Forest Approach for Electrical Load Calculation for Households

1Sushmita Patil, 2Poonam Ghuli
1Student, Department of Computer Science and Engineering, R.V.C.E., Bangalore, Karnataka
2Professor, Department of Computer Science and Engineering, R.V.C.E., Bangalore, Karnataka
1susp24@gmail.com, 2poonamghuli@rvce.edu.in
Abstract- The size of the databases used in today's enterprises has been growing at an exponential rate. Simultaneously, the need to process and analyze these large volumes of data (big data) for business decision making has also increased. In this scenario, distributed computing is becoming a powerful technique for solving computational problems efficiently and effectively. The electrical energy requirements of households vary frequently. In this paper, the total electrical consumption of houses is calculated and further classified, based on the values available in the dataset, in an effective manner. The distributed computing platform Google App Engine (GAE) is used. This paper implements a computational paradigm named MapReduce together with a scalable random forest algorithm. MapReduce allows massive scalability across multiple nodes by processing large data sets with parallel, distributed algorithms. GAE's cloud platform helps solve the problems of scalability and efficiency. The data set used in this paper is based on recordings originating from smart plugs deployed in private households. Each plug is equipped with a range of sensors that measure different power-consumption-related values. These values are used to calculate the consumption of a given house and further classify it as a high or low electrical-energy-consuming house. It has been observed that the total consumption calculation is almost 100% correct and that processing efficiency increases with the number of nodes (shards).
Keywords - Big data, MapReduce, scalable random forest, Google App Engine (GAE)
I INTRODUCTION
With the development of information technology, there has been massive growth in the size of the data being generated. This huge volume of data poses a great challenge for data processing and classification. Hadoop provides an open-source implementation of Google's distributed file system and the MapReduce framework for scalable distributed or cloud computing. The MapReduce programming model provides an efficient solution for processing huge datasets in a highly parallel environment, and it is one of the most popular parallel models for data processing on cloud computing platforms. Adapting traditional machine learning algorithms to the MapReduce programming framework is therefore necessary when dealing with massive datasets. The Random Forest algorithm is commonly applied to data classification and is applicable even to massive data. In this paper, we use a scalable random forest approach [2] to process the electrical dataset.
A. Big Data
The term big data is used for huge data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within an acceptable time. Big data sizes can range from a few dozen terabytes to many petabytes in a single data set [3]. Typical examples of big data include web log information, sensor network data, extra-terrestrial data, data from social networks, weather science, military surveillance, medical records, and large-scale finance and e-commerce. One defining feature of big data is the difficulty of working with it using traditional relational databases; it instead requires massively parallel processing software running on tens, hundreds, or even thousands of systems. The various challenges involved in big data management include scalability, unstructured data, real-time analytics, and fault tolerance.

Fig 1 - MapReduce Framework
B. MapReduce Programming Framework

MapReduce is a programming model for processing huge data sets [1]. The user specifies a map function that processes a key/value pair to generate a set of intermediate values, and a reduce function that merges all intermediate values to produce the result.

Map phase: The master node takes the input dataset, partitions it into smaller subsets, and distributes them to worker nodes. A worker node may repeat this procedure, leading to a multi-level tree structure. Each worker node processes its subset and passes the answer back to its master node. Map takes a pair of data with a type in one data domain and returns a list of pairs in another domain:

Map (k1, v1) → list (k2, v2)

Reduce phase: The master node then collects the intermediate answers for all the subsets and combines them to form the output, the solution to the problem it was originally trying to solve. The Reduce function is applied in parallel to each group, returning a collection of values in the same domain:

Reduce (k2, list (v2)) → list (v3)

C. Google App Engine

GAE hosts and runs applications on Google's large-scale server infrastructure. It is an example of a platform-as-a-service (PaaS) cloud computing solution; web application developers use it to create their applications with a set of Google tools. It has three main components: scalable services, a data store, and a runtime environment.

GAE's front-end service handles HTTP requests and routes them to the appropriate application servers. The application servers start, initialize, and reuse application instances for incoming requests. During peak traffic hours, GAE automatically allocates additional resources to start new instances. The number of new instances for an application and the distribution of requests depend on data size and resource-usage patterns. The GAE infrastructure thus offers load balancing, auto-scaling, and fault tolerance; that is, it automatically adjusts the number of application instances (virtual machines) according to the request rate [4, 5, 6].

Each application instance runs in a sandbox, a runtime environment abstracted from the underlying operating system. This prevents applications from performing unwanted or malicious operations and enables GAE to optimize CPU and memory utilization for multiple applications on the same system. Although sandboxing has advantages, it imposes several restrictions on the programmer:
 Applications do not have access to the underlying hardware and have only limited access to network resources.
 Applications built in Java can use only a subset of the standard library functions.
 Applications cannot use threads.
 A request has a maximum of 30 seconds to respond to the client.
During the map process, the worker nodes produce intermediate results in the form of key-value pairs. Once the map process is completed, the reduce process begins: the master node creates a certain number of worker nodes to perform the reduce operations. Intermediate data from the map process is processed by the reduce workers based on the keys, and intermediate data with the same key is handled by the same reduce worker. An important advantage of MapReduce is that it allows the map and reduce operations to be distributed, so they can be processed on different processors in parallel. This parallelization can substantially decrease the processing time required for huge data sets.
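The map and reduce phases described above can be sketched as a small, in-memory simulation. This is a toy illustration only; GAE's actual MapReduce library supplies its own mapper, reducer, and shard-management machinery, and the word-count example here is the classic textbook case rather than anything from the paper's dataset:

```python
from collections import defaultdict

def map_fn(record):
    # Emit an intermediate (key, value) pair for each token in a record.
    # Here: key = word, value = 1 (classic word count).
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge all intermediate values that share the same key.
    return (key, sum(values))

def map_reduce(records):
    # Map phase: every record is processed independently, so this loop
    # could run on many worker nodes in parallel.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: each key group is also independent, so the reducers
    # can run in parallel as well.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

result = map_reduce(["big data big compute", "big cluster"])
print(result)  # {'big': 3, 'data': 1, 'compute': 1, 'cluster': 1}
```

Because neither the mappers nor the reducers share state, each loop body can be shipped to a different shard unchanged, which is exactly the property the paper exploits.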
II RELATED WORK
Random Forest is one of the most popular data classification algorithms in machine learning. The scalable random forest [2] is an improved Random Forest algorithm based on the MapReduce model. The new algorithm performs data classification on massive datasets in computer clusters or cloud computing environments by distributing the processing and optimization of data subsets across multiple computing nodes. Experimental results show that the scalable random forest algorithm suffers some accuracy degradation but achieves higher performance compared with the traditional Random Forest algorithm. In this paper, a scalable random forest approach is used and implemented in the Google App Engine (GAE) environment. GAE provides a very high degree of fault tolerance: if a node fails, its task is put in the task queue and rerun, so the final accuracy of the result is not affected.
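The core idea of the scalable random forest, training independent models on data shards and combining them by vote, can be illustrated with a minimal sketch. This does not reproduce the algorithm of [2]; it substitutes a single-feature decision stump for a full decision tree and a round-robin partition for the paper's K-subset split, purely to show the shard-then-vote structure:

```python
from collections import Counter

def train_stump(subset):
    # Stand-in for training one decision tree on a data subset:
    # try each feature value as a threshold and keep the one that
    # classifies the most points in the subset correctly.
    best_t, best_correct = None, -1
    for t, _ in subset:
        correct = sum((x > t) == label for x, label in subset)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def train_forest(data, k):
    # Analogue of the map phase: split the data into K shards; each
    # shard could train its model on a separate worker node.
    shards = [data[i::k] for i in range(k)]
    return [train_stump(s) for s in shards]

def predict(forest, x):
    # Analogue of the reduce phase: majority vote over shard models.
    votes = Counter(x > t for t in forest)
    return votes.most_common(1)[0][0]

# Toy data: (feature value, label), label True for "high" consumption.
data = [(v, v > 50) for v in range(0, 100, 5)]
forest = train_forest(data, k=4)
print(predict(forest, 80), predict(forest, 10))  # True False
```

Because each shard's model is trained without seeing the other shards, adding nodes speeds up training roughly linearly, at the cost of the mild accuracy degradation reported in [2].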
III SYSTEM ARCHITECTURE
Fig 2 shows the overall architecture of the system. The user accesses the system through the master node, and tasks are assigned to the worker nodes. Depending on the size of the data, the required number of worker nodes, or shards, is selected. The data is distributed among the shards and processed in parallel. Finally, the intermediate data is consolidated, returned to the master node, and the results are obtained. The cloud system is used to obtain the required number of nodes for parallel processing of the data.

MapReduce is widely used to support parallel computing on large data sets in distributed computing environments. The map process is initiated by a master node. In the first part of the map phase, the input is split into many small sub-problems and distributed to worker nodes, which process the sub-problems independently and return their intermediate results to the master.

Fig 2 System Architecture
IV METHODOLOGY
The overall procedure involves three stages:
 Stage 1 - The K value (the number of nodes or shards) is negotiated in order to make good use of the computing resources of the computer cluster or cloud.
 Stage 2 - The SMRF approach, based on the MapReduce programming model, divides the original dataset into many subsets based on the K value.
 Stage 3 - The total consumption of a house is calculated based on its house id, and the data is classified.

Fig 3 Methodology of the system

V EXPERIMENTS AND RESULTS
The research aimed at performing electrical load calculation in a distributed environment. The dataset, in CSV format, is given as input, and MapReduce operations are carried out on it to calculate the total consumption of a house and further classify it by type of consumption. The experimental results show that this approach has good fault tolerance and scalability. The total consumption result obtained is almost 100% accurate, as Google App Engine provides full fault tolerance: if any node fails during processing, the failed task is added back to the task queue and reconsidered for calculation, unlike non-cloud platforms, where a node failure affects the accuracy of the total result. Since the data is stored and processed on the cloud, the system is highly scalable and will work on data of any size. Depending on the size of the data, the number of nodes required for processing can be increased or decreased by the user. Also, the processing time reduces as the number of nodes increases.

Fig 4 Time vs No of shards

Fig 4 shows how the time taken decreases as the number of shards, or nodes, increases. With the addition of 3 shards, the time taken reduces by approximately 2 seconds. The system works more efficiently as the size of the data increases.
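The per-house aggregation and high/low classification can be illustrated with a small, self-contained sketch. The CSV column names (`house_id`, `plug_id`, `kwh`) and the 5.0 kWh threshold are illustrative assumptions, not taken from the smart-plug dataset, whose actual schema may differ:

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout; the real smart-plug dataset columns may differ.
raw = """house_id,plug_id,kwh
1,a,2.5
1,b,4.0
2,a,0.5
2,b,1.0
"""

def total_per_house(rows):
    # Map-like step: emit (house_id, kwh) per reading;
    # reduce-like step: sum the readings for each house id.
    totals = defaultdict(float)
    for row in rows:
        totals[row["house_id"]] += float(row["kwh"])
    return dict(totals)

def classify(total, threshold=5.0):
    # The threshold is an illustrative cut-off, not from the paper.
    return "high" if total >= threshold else "low"

totals = total_per_house(csv.DictReader(io.StringIO(raw)))
for house, total in sorted(totals.items()):
    print(house, total, classify(total))
# prints:
# 1 6.5 high
# 2 1.5 low
```

In the actual system, the summation step runs across shards via MapReduce, with the house id as the intermediate key, so each reduce worker totals exactly one group of houses.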
VI CONCLUSION AND FUTURE WORK
This project aims at calculating the load of houses, which can be used to estimate the energy requirements of households and to classify each house into a consumption category (low or high). The scalable random forest approach with MapReduce is used, which is an efficient approach for calculations involving large data. The Google App Engine platform employed provides scalability and fault tolerance to a large extent, which reduces the overhead on the user.
The calculated load can be used to make electrical forecasts. Such forecasts can be used in demand-side management to proactively influence load and adapt it to the supply situation, e.g., the current production of renewable energy sources. The system can further be extended to calculate the power consumption of individual devices in houses, which can be used to check the performance of those devices.
References
[1] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", presented at the NIRMA University International Conference on Engineering, NUiCONE-2012, pp 2-3, 06-08 December 2012.
[2] Jiawei Han, Yanheng Liu, Xin Sun, "A Scalable Random Forest Algorithm Based on MapReduce", presented at the IEEE Summer Power Meeting, pp 1-4, 2013 IEEE.
[3] Jyoti Nandimath, Ankur Patil, Ekata Banerjee, Pratima Kakade, "Big Data Analysis Using Apache Hadoop", presented at IEEE IRI 2013, August 14-16, 2013, San Francisco, California, USA.
[4] Radu Prodan, Michael Sperk, and Simon Ostermann, "Evaluating High-Performance Computing on Google App Engine", pp 1-2, 2012 IEEE.
[5] Maciej Malawski, Maciej Kuźniar, Piotr Wójcik, and Marian Bubak, "How to Use Google App Engine for Free Computing", pp 5-6, 2013 IEEE.
[6] Lei Hu, Peng Yue, Hongxiu Zhou, "Geoprocessing in Google Cloud Computing: case studies".
[7] J. D. Y. Correa and J. A. B. Ricaurte, "Web Application Development Technologies Using Google Web Toolkit and Google App Engine-Java", IEEE 2014.