International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 3, Issue 3, March 2014 ISSN 2319 - 4847

Big Data Analytics In Forecasting Lakes Levels

Prashant Shrivastava1, S. Pandiaraj2 and Dr. J. Jagadeesan3

1 Student M. Tech. (CSE), Department of Computer Science and Engineering, SRM University, Chennai, India
2 Assistant Professor, Department of Computer Science and Engineering, SRM University, Chennai, India
3 Head of Department, Department of Computer Science and Engineering, SRM University, Chennai, India

Abstract
Big data analytics can be used to develop forecasting models for predicting the water levels of lakes. Autoregressive Integrated Moving Average (ARIMA) modeling is used in this study because it works well on time series values. The historical data, which is large in size, is stored within a big database. Various technologies are used to observe patterns in the historical data and to predict future levels by applying data-driven analytics and data mining concepts.

Keywords: Big data, ARIMA, Hadoop, Analytics

1. INTRODUCTION
In modern times, when data is available in abundance and data overload is frequent, the ability to extract unique insights can help organizations improve their decision making. This directly improves their ability to identify opportunities, minimize possible risks, and control costs. Big data analytics is not only about managing large or diverse data; it is also about questioning the available data, deriving new hypotheses, and making data-driven decisions. It helps to uncover useful information, hidden patterns, and previously unknown correlations. Such information can help in predicting the future and can bring business benefits, such as more effective planning and better implementation of corrective steps to avoid chaos.
Big data analytics can be performed with the software tools commonly used in advanced analytics disciplines such as data mining and predictive analytics. A new category of big data technologies, including NoSQL databases, Hadoop and MapReduce, has emerged and is used in many big data analytics projects. These core technologies represent a revolution in open source software frameworks that can process large data sets across clustered systems.

Water supply agencies in various cities have no clear way of forecasting how much water will be available in their lakes [1] in the coming days, so they cannot plan ahead to avoid water scarcity. With proper mechanisms and up-to-date technologies, forecasting the availability of water becomes easier. The system can also be made robust and scalable enough to accommodate data from more lakes in different cities and to provide forecasts without increasing the government's investment cost for each city.

2. METHOD USED
Forecasting of the water levels in the lakes [3] that supply a city's population is mostly done manually. This has the limitation that past water levels cannot be taken into consideration when forecasting future level patterns, although doing so would help in better planning and could largely avoid water scarcity in cities.

Forecasting approaches can be classified as follows –
1. Qualitative Approach – In this approach no mathematical model is used, because the available data is not considered to contribute to the future values (long-term forecasting).
2. Quantitative Approach – In this approach historical data are available. It is based on analysis of the historical time series [2] of a particular variable and of other related time series. It also examines the cause-and-effect relationships between one variable and other relevant variables.
3. Time Series Approach – In this approach a single variable keeps changing with time, and its future values are related in some form to its past values.

To solve this problem, this study proposes a technical solution based on the time series approach. It can be used at low cost and can be scaled up by including data from more and more lakes without affecting performance. The data is first collected from the lakes in the ingest data operation and is then moved into the big data storage, where it is stored and shared for forecasting. The big data storage can be used for visualization at various levels, for example by researchers, business analysts, data scientists and operators. Finally, the water demand forecast is produced, which decision makers in the water supply departments can use to decide strategy.

Figure 9: Data Stages in Proposed System

The data in the proposed solution moves through the following stages –
1. Initial Data Collection Stage – In this stage the historical data is collected from the lakes and put into temporary storage before being transferred to the next stage.
2. Store & Share Big Data Stage – In this stage the data coming from the first stage is stored in the big database. This is the main storage of the system and is used by the other stages.
3. Visualization Stage – In this stage various users, such as data scientists, researchers, business teams and other users, can dig into the available big data and perform analytics based on their needs.
4. Forecast Demand Stage – In this stage the forecasted data is sent to the output systems, where it is presented clearly for future periods.
5. Decides Strategy Stage – In this stage the concerned authorities use the output of the forecast demand stage to make the strategy decisions needed to avoid water scarcity in the cities.

The architecture of the proposed system is shown below –

Figure 10: Architecture Diagram

The cluster can be created at a centralized location and shared across various lakes for storing historical information. The collected data forms the input to the Autoregressive Integrated Moving Average (ARIMA) modeling system used for forecasting through Hadoop [4] [5]. ARIMA combines an autoregressive (AR) process and a moving average (MA) process; the integrated (I) part indicates that the modeled time series has been differenced to transform it into a stationary time series. A non-seasonal ARIMA model, ARIMA(p, d, q), is based on three parameters: p is the number of autoregressive terms, d is the number of non-seasonal differences, and q is the number of lagged forecast errors.

A seasonal ARIMA model (SARIMA) describes a series whose pattern repeats after a fixed number of periods. For example, monthly data has 12 observations per year, quarterly data has 4 observations per year, and daily data has 5 or 7 observations per week. The seasonal ARIMA model, ARIMA(p, d, q)(P, D, Q), is based on six parameters: the seasonal order (P, D, Q) combined with the non-seasonal order (p, d, q).
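The ARIMA(p, d, q) and seasonal ARIMA(p, d, q)(P, D, Q) definitions above can be written compactly with the backshift operator. The block below is a standard textbook formulation, not a formula taken from this paper. The symbols B (backshift operator), c (constant), s (seasonal period), the coefficients \phi, \theta, \Phi, \Theta and the error term \varepsilon_t are notation assumed here for illustration.

```latex
% Backshift operator: B y_t = y_{t-1}
% Non-seasonal ARIMA(p, d, q)
\[
  \phi(B)\,(1 - B)^{d}\, y_t = c + \theta(B)\,\varepsilon_t ,
\]
\[
  \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p}, \qquad
  \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}
\]

% Seasonal ARIMA(p, d, q)(P, D, Q) with seasonal period s
\[
  \Phi(B^{s})\,\phi(B)\,(1 - B)^{d}\,(1 - B^{s})^{D}\, y_t
    = c + \Theta(B^{s})\,\theta(B)\,\varepsilon_t
\]
```

For example, with d = 1 the factor (1 - B)^d y_t reduces to y_t - y_{t-1}, the change in level from one observation to the next, and for monthly data the seasonal period would be s = 12.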
There are four stages in ARIMA modeling of a time series –
(1) Identification of Model [There are three types of models – (a) AR – Autoregressive (b) MA – Moving Average (c) ARMA – Includes both AR and MA together]
(2) Estimation of Model
(3) Diagnostic Checking
(4) Forecasting

We used the R programming language, whose stats package includes the arima() function for ARIMA modeling of time series. Besides the non-seasonal order (p, d, q), the function also accepts seasonal factors. The "forecast" package for R can automatically select an ARIMA model for a given time series with its auto.arima() function. The package can also simulate seasonal and non-seasonal ARIMA models with its simulate.Arima() function.

R provides additional packages for integration with Hadoop. rhbase is the R package that provides connectivity to HBase through the Thrift server. It provides APIs to read, write, and modify tables in HBase. Thrift is a communication framework used for cross-language remote procedure calls.

Figure 11: Interaction between R and HBASE

ARIMA models provide a well-established technique for forecasting the magnitude of any time series variable [6]. The strength of the approach is that it suits time series with various patterns of change. ARIMA models also provide a useful baseline against which the performance of other forecasting models, such as neural networks and kernel regression, can be compared.

Figure 12: ARIMA Predictions vs. Actuals

The actual lake levels and the levels predicted by ARIMA are observed to be very close to each other, which supports the view that the approach is appropriate for predicting lake levels. Because it relies on technologies such as Hadoop, the software environment established for the project can be easily adapted to any real scenario.
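As a rough illustration of the four stages listed above (identification, estimation, diagnostic checking, forecasting) and of the predicted-versus-actual comparison, the following Python sketch may help. It is not the R code used in the study, which relied on arima() from the stats package. It is a simplified stand-in that fixes the model order at ARIMA(1, 1, 0) instead of identifying it, estimates the coefficients by ordinary least squares, and runs on invented sample levels. All function names and data values are hypothetical.

```python
"""Simplified ARIMA(1, 1, 0) sketch in plain Python (standard library only).

Hypothetical illustration: function names and sample data are invented here
and are not taken from the paper, which used R's arima() function.
"""


def difference(series):
    """Integration step with d = 1: first differences of the series."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def fit_ar1(diffs):
    """Estimation: least-squares fit of w_t = c + phi * w_(t-1) + e_t.

    Needs at least two differences, i.e. three observations of the series.
    """
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx if sxx else 0.0  # constant differences: no AR signal
    c = mean_y - phi * mean_x
    return c, phi


def residuals(diffs, c, phi):
    """Diagnostic checking: one-step-ahead errors on the differenced series."""
    return [y - (c + phi * x) for x, y in zip(diffs[:-1], diffs[1:])]


def forecast(series, steps):
    """Forecasting: predict differences recursively, then undo differencing."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level = level + last_diff
        predictions.append(level)
    return predictions


def percent_deviation(actual, predicted):
    """Absolute deviation of each prediction as a percentage of the actual
    value (assumes no actual value is zero)."""
    return [abs(p - a) / abs(a) * 100.0 for a, p in zip(actual, predicted)]


if __name__ == "__main__":
    # Invented monthly lake levels (metres); the last three are held out.
    levels = [41.2, 41.5, 41.9, 42.1, 42.0, 41.8, 41.9, 42.3, 42.6, 42.7]
    train, held_out = levels[:-3], levels[-3:]
    predicted = forecast(train, steps=len(held_out))
    deviations = percent_deviation(held_out, predicted)
    for a, p, dev in zip(held_out, predicted, deviations):
        print(f"actual={a:.2f} predicted={p:.2f} deviation={dev:.2f}%")
```

Holding out the most recent observations and comparing them with the forecasts mirrors the predictions-versus-actuals comparison described above. A full ARIMA fit would also select p, d and q and estimate moving average terms, which this sketch leaves out.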
The software solution developed for this article uses Ubuntu 12.04 LTS as the Linux operating system. It can run on any virtual machine, such as an Oracle virtual machine or VMware. A single-node cluster is created with Hadoop to simulate the clustered environment for the implementation [7] [8]. HBase is used for storing the historical data for the lakes [9]. The querying and analysis needed for prediction are done in the R programming language, which is used to forecast the water levels of the lakes. The predictions are displayed as graphs using the Java GUI for R (JGR, pronounced "Jaguar").

3. CONCLUSION
In this paper a portable and scalable system for forecasting the water levels of lakes is discussed. Only open source technologies are used in the implementation, which reduces its cost while still building the solution on current technologies. The forecasted values and the actual values are seen to be very close to each other, with a deviation of less than five percent. The solution can be enhanced to make the system simpler by exploring alternative technologies used for the same purpose. In addition, other time series forecasting algorithms and models can be explored to achieve greater accuracy and efficiency.

The study can be adapted to any forecasting requirement in any area that needs future values predicted from historical values in a time series. The system can be modified and applied in areas such as weather forecasting and financial markets.
References
[1] L. Liu, Z.X. Xu, N.S. Reynard, C.W. Hu and R.G. Jones, Hydrological analysis for water level projections in Taihu Lake, China, Journal of Flood Risk Management (March 2013)
[2] H. Aksoy, N. E. Unal, E. Eris and M. I. Yuce, Stochastic modeling of Lake Van water level time series with jumps and multiple trends, Hydrology and Earth System Sciences (An Interactive Open Access Journal of the European Geosciences Union) (February 2013)
[3] C.C. Nwobi-Okoye and A.C. Igboanugo, Predicting Water Levels at Kainji Dam Using Artificial Neural Networks, Nigerian Journal of Technology (NIJOTECH) (March 2013)
[4] Leixiao Li, Zhiqiang Ma, Limin Liu and Yuhong Fan, Hadoop-based ARIMA Algorithm and its Application in Weather Forecast, International Journal of Database Theory and Application (2013)
[5] New York City (NYC) government web site, New York City's Operations Support Tool (OST) White Paper, http://www.nyc.gov/html/dep/pdf/reports/ost_white_paper.pdf
[6] Hadoop-based ARIMA Algorithm and its Application in Weather Forecast, International Journal from Science & Engineering Research Support Society (SERSC), http://www.sersc.org/journals/IJDTA/vol6_no5/11.pdf
[7] Running Hadoop on Ubuntu Linux (Single-Node Cluster), http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[8] Running Hadoop on Ubuntu Linux (Multi-Node Cluster), http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[9] Tom White, Hadoop: The Definitive Guide, O'Reilly (Fourth Indian Reprint: Jun 2013)

AUTHORS
Prashant Shrivastava received his B.E. degree in Computer Science & Engineering from SGSITS Indore in 1999. He is currently pursuing his M. Tech. degree in Computer Science & Engineering at SRM University, Ramapuram, Chennai. He has been working as a Senior Project Manager at L&T Infotech since 2010. His research interests are in the areas of Cloud Computing, Big Data Analytics, Open source technologies, Java Enterprise Software Solutions, ETL and data warehousing solutions, and Data mining. He has nearly 15 years of industry experience.

S. Pandiaraj received his B.E. degree in Computer Science & Engineering from Madras University and completed his M.E. in Computer Science & Engineering at Sathyabama University, Chennai, in 2010. He has been working as an Assistant Professor in the Department of Computer Science and Engineering, Ramapuram Campus, SRM University, Ramapuram, Chennai since June 2010. His research interests are in the areas of wireless networks and computer architecture. He has more than 10 years of teaching experience.

Dr. J. Jagadeesan received his B.E. degree in Computer Engineering from Madurai Kamraj University in 1991 and completed his M. Tech. in Computer Science & Engineering at Dr. MGR University in 2006. He received his Ph. D. degree in Computer Science & Engineering from Anna University, Chennai in 2013-14. He has around 20 years of teaching experience. His research interests are in the fields of Software Engineering, DBMS, Computer Architecture and Software testing. He was a Member of the Board of Studies for M.Sc. (Computer Technology), M.Sc. (Information Technology) and M.Sc. (Multimedia Technology) at Periyar University (1999-2001). He is a member of the Indian Science Congress Association and of a CSI Chapter. He is a Certified Internal Auditor for ISO 9001:2000. He received the Best Teacher Award, SNK Fomra Institute of Technology (2004-2005).