A Scalable Random Forest Approach for Electrical Load Calculation for Households
1 Sushmita Patil, 2 Poonam Ghuli
1 Student, Department of Computer Science and Engineering, R.V.C.E., Bangalore, Karnataka
2 Professor, Department of Computer Science and Engineering, R.V.C.E., Bangalore, Karnataka
1 susp24@gmail.com, 2 poonamghuli@rvce.edu.in

Abstract- The size of the databases used in today's enterprises has been growing at an exponential rate. Simultaneously, the need to process and analyze large volumes of data (big data) for business decision making has also increased. In this scenario, distributed computing has become a very powerful technique for solving computational problems efficiently and effectively. The electrical energy requirements of households vary very frequently. In this paper, the total electrical consumption of houses is calculated and further classified, based on the values available in the dataset, in a very effective manner. The distributed computing platform Google App Engine (GAE) is used. This paper implements the computational paradigm named MapReduce along with the scalable random forest algorithm. MapReduce allows for massive scalability across multiple nodes by processing large data sets with parallel, distributed algorithms. GAE's cloud platform helps solve the problems of scalability and efficiency. The data set used in this paper is based on recordings originating from smart plugs deployed in private households. Each smart plug is equipped with a range of sensors which measure different power-consumption-related values. These values are used to calculate the consumption of a required house, which is then further classified as a high or low electrical energy consuming house. It has been observed that the total-consumption calculation is almost 100% correct, and the processing efficiency increases with the increase in the number of nodes (shards).
Keywords - Big data, MapReduce, scalable random forest, Google App Engine (GAE)

I INTRODUCTION
With the development of information technology, there has been massive growth in the size of the data being generated. This huge volume of data poses a great challenge for data processing and classification. Hadoop provides an open-source implementation of Google's distributed file system and the MapReduce framework for scalable distributed computing or cloud computing. The MapReduce programming model provides an efficient solution for processing huge datasets in an extremely parallel environment, and it is one of the most popular parallel models for data processing on cloud computing platforms. Adapting traditional machine learning algorithms to the MapReduce programming framework is therefore necessary when dealing with massive datasets. The Random Forest algorithm is commonly applied to data classification and is applicable even to massive data. In this paper, we use the scalable random forest approach [2] to process an electrical dataset.

A. Big Data
The term big data is used for huge data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within an acceptable time. Big data sizes can range from a few dozen terabytes to many petabytes in a single data set [3]. Typical examples of big data include web log information, sensor network data, extra-terrestrial data, data from social networks, weather science, military surveillance, medical records, and large-scale finance and e-commerce. One current feature of big data is the difficulty of working with it using traditional relational databases; it instead requires massively parallel processing software running on tens, hundreds, or even thousands of systems. The various challenges involved in big data management include scalability, unstructured data, real-time analytics, fault tolerance, etc.

Fig 1 - MapReduce framework
B. MapReduce Programming Framework
MapReduce is a programming model for processing huge data sets [1]. The user specifies a map function that processes a key/value pair to generate a set of intermediate values, and a reduce function that merges all intermediate values to produce the result.

Map phase: The master node takes the input dataset, partitions it into smaller subsets, and distributes them to worker nodes. A worker node may follow the same procedure again, leading to a multi-level tree structure. The worker node processes its subset and passes the answer back to its master node. Map takes a pair of data with a type in one data domain and returns a list of pairs in another domain:

Map (k1, v1) → list (k2, v2)

Reduce phase: The master node then collects the intermediate answers for all the subsets and combines them in some way to form the output – the solution to the problem it was initially trying to solve. The Reduce function is applied in parallel to each group and returns a collection of values in the same domain:

Reduce (k2, list (v2)) → list (v3)

An important advantage of MapReduce is that it allows map and reduce operations to be distributed so that they can be processed on different processors in parallel. This parallelization can substantially decrease the processing time required for huge data sets.

C. Google App Engine
GAE hosts and runs applications on Google's large-scale server infrastructure. It is an example of a platform-as-a-service (PaaS) cloud computing solution; web application developers use it to create their applications with a set of Google tools. It has the following main components: scalable services, a data store, and a runtime environment. GAE's front-end service handles HTTP requests and maps them to the appropriate application servers. The work of the application servers is to start, initialize, and reuse application instances for incoming requests. During high-traffic hours, GAE automatically allocates additional resources to start new instances. The number of new instances for an application and the distribution of requests depend on the data size and the resource-use pattern. Thus, the GAE infrastructure offers load balancing, auto-scaling, and fault-tolerance capabilities; that is, it automatically adjusts the number of application instances (virtual machines) according to the request rate [4,5,6].

Each application instance runs in a sandbox (a runtime environment abstracted from the underlying operating system). This prevents applications from performing unwanted malicious operations and enables GAE to optimize CPU and memory utilization for multiple applications on the same system. Although sandboxing has advantages, it imposes various restrictions on the programmer: applications do not have access to the underlying hardware and have only limited access to network resources; applications built in Java can use only a subset of the standard library functions; applications cannot use threads; and a request has a maximum of 30 seconds to respond to the client.

II RELATED WORK
Random Forest is one of the popular data classification algorithms in machine learning. Scalable random forest [2] is an improved Random Forest algorithm based on the MapReduce model. The new algorithm performs data classification on massive datasets in a computer cluster or cloud computing environment: through distribution, the scalable random forest algorithm processes and optimizes subsets of the data across multiple computing nodes. The experimental results show that the scalable random forest algorithm has some accuracy degradation but higher performance compared to the traditional Random Forest algorithm. In this paper, the scalable random forest approach is implemented in the Google App Engine (GAE) environment. GAE provides a very high degree of fault tolerance: if a node fails, the task of that node is put in the task queue and rerun, so that the final accuracy of the result is not affected.

III SYSTEM ARCHITECTURE
Fig 2 shows the overall architecture of the system. The user accesses the Hadoop system through the master node, and the task is assigned to the worker nodes. Depending on the size of the data, the required number of worker nodes, or shards, is selected. The data is distributed among the shards and processed in parallel. Finally, the intermediate data is consolidated and returned to the master node, and the results are obtained. The cloud system provides the required number of nodes for parallel processing of the data.

MapReduce is widely used to support parallel computing on large data sets in distributed computing environments. The map process is initiated by a master node. In the first part of the map, the input is split into many small sub-problems and distributed to worker nodes. Worker nodes process the sub-problems independently and produce intermediate results in the form of key-value pairs. Once the map process is completed, the reduce process begins. The master node creates a certain number of worker nodes to perform the reduce operations. Intermediate data from the map process is processed by the reduce workers based on the keys; the same reduce worker handles all intermediate data with the same key.

Fig 2 System Architecture

The experimental results show that this approach has good fault tolerance and scalability. The total-consumption result obtained is almost 100% accurate, as Google App Engine provides full fault tolerance: if any of the nodes fail during processing, the failed task is added back to the task queue and reconsidered for calculation, unlike non-cloud platforms, where a node failure affects the accuracy of the total result.
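The map and reduce signatures described in Section B can be illustrated with a minimal, single-machine sketch in plain Python (not GAE's actual MapReduce library). The record layout of (houseid, reading) pairs is an assumption about the smart-plug data for illustration purposes:

```python
from collections import defaultdict

# Map(k1, v1) -> list(k2, v2): one smart-plug record becomes a
# (houseid, consumption) pair.
def map_record(record):
    houseid, value = record  # assumed record layout
    return [(houseid, value)]

# Shuffle: group intermediate values by key, as the framework would
# before handing each group to a reduce worker.
def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Reduce(k2, list(v2)) -> list(v3): sum all readings for one house.
def reduce_house(houseid, values):
    return [(houseid, sum(values))]

records = [(1, 2.5), (2, 1.0), (1, 4.0), (2, 0.5)]
intermediate = [p for r in records for p in map_record(r)]
totals = dict(p for k, v in shuffle(intermediate).items()
              for p in reduce_house(k, v))
print(totals)  # {1: 6.5, 2: 1.5}
```

In the real framework the map calls and the per-key reduce calls run on different worker nodes in parallel; the sketch only makes the data flow of the two phases concrete.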
Since the data is stored and processed on the cloud, the system is highly scalable and will work on data of any size. Depending on the size of the data, the number of nodes required for processing can be increased or decreased by the user, and as the number of nodes increases the processing time reduces.

IV METHODOLOGY
The overall procedure involves three stages:
Stage 1 - The K value (the number of nodes or shards) is negotiated in order to make good use of the computing resources of the computer cluster or cloud.
Stage 2 - The SMRF approach based on the MapReduce programming model divides the original dataset into many subsets based on the K value.
Stage 3 - The total consumption of a house is calculated based on its house id, and the classification of the data is done.

Fig 3 Methodology of the system

V EXPERIMENTS AND RESULTS
The research aimed at performing electrical load calculation in a distributed environment. The dataset, in CSV format, is given as input, and MapReduce operations are carried out on it to calculate the total consumption for a house and further classify it by its type of consumption.

Fig 4 Time vs. number of shards

Fig 4 shows how the time taken decreases as the number of shards or nodes increases. With the addition of 3 shards or nodes, the time taken reduces by approximately 2 seconds. The system works more efficiently as the size of the data is increased.

VI CONCLUSION AND FUTURE WORK
This project performs load calculation for houses, which can be used to estimate the energy requirements of households and further classify each house into a consumption category (low or high). The scalable random forest approach with MapReduce is used, which is one of the efficient approaches for calculations involving large data. The Google App Engine platform employed provides scalability and fault tolerance to a large extent, which reduces the overhead on the user. The calculated load can be used to make electrical forecasts.
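The three-stage procedure of Section IV can be sketched locally as follows, using threads as a stand-in for GAE shards. The K value, the (houseid, reading) record layout, and the 5.0 threshold separating low from high consumers are illustrative assumptions, not values from the paper:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

K = 3  # Stage 1: number of shards (nodes) chosen for the job

# Assumed smart-plug records: (houseid, reading)
records = [(1, 2.0), (2, 1.0), (1, 3.5), (3, 6.0), (2, 0.5), (3, 1.0)]

# Stage 2: divide the original dataset into K subsets.
shards = [records[i::K] for i in range(K)]

# Each shard computes partial per-house sums, run in parallel.
def partial_sums(shard):
    sums = defaultdict(float)
    for houseid, reading in shard:
        sums[houseid] += reading
    return sums

with ThreadPoolExecutor(max_workers=K) as pool:
    partials = list(pool.map(partial_sums, shards))

# Stage 3: consolidate the shard results and classify each house.
totals = defaultdict(float)
for part in partials:
    for houseid, s in part.items():
        totals[houseid] += s

THRESHOLD = 5.0  # assumed cut-off between low and high consumers
labels = {h: ("high" if t > THRESHOLD else "low")
          for h, t in totals.items()}
print(dict(totals))
print(labels)  # {1: 'high', 2: 'low', 3: 'high'}
```

On GAE the shards would be separate application instances fed from the task queue rather than threads, but the split-aggregate-classify structure is the same.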
Such forecasts can be used in demand-side management to proactively influence load and adapt it to the supply situation, e.g., the current production of renewable energy sources. The system can further be extended to calculate the power consumption of individual devices in houses, which can be used to check the performance of the devices.

References
[1] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", presented at the NIRMA University International Conference on Engineering (NUiCONE-2012), pp. 2-3, 06-08 December 2012.
[2] Jiawei Han, Yanheng Liu, Xin Sun, "A Scalable Random Forest Algorithm Based on MapReduce", presented at the IEEE Summer Power Meeting, pp. 1-4, 2013 IEEE.
[3] Jyoti Nandimath, Ankur Patil, Ekata Banerjee, Pratima Kakade, "Big Data Analysis Using Apache Hadoop", presented at IEEE IRI 2013, August 14-16, 2013, San Francisco, California, USA.
[4] Radu Prodan, Michael Sperk, and Simon Ostermann, "Evaluating High-Performance Computing on Google App Engine", pp. 1-2, 2012 IEEE.
[5] Maciej Malawski, Maciej Kuźniar, Piotr Wójcik, and Marian Bubak, "How to Use Google App Engine for Free Computing", pp. 5-6, 2013 IEEE.
[6] Lei Hu, Peng Yue, Hongxiu Zhou, "Geoprocessing in Google Cloud Computing: Case Studies".
[7] J. D. Y. Correa and J. A. B. Ricaurte, "Web Application Development Technologies Using Google Web Toolkit and Google App Engine Java", IEEE 2014.