International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 3, Issue 3, March 2014
ISSN 2319 - 4847
Big Data Analytics In Forecasting Lakes Levels
Prashant Shrivastava¹, S. Pandiaraj² and Dr. J. Jagadeesan³
¹ Student M. Tech. (CSE), Department of Computer Science and Engineering, SRM University, Chennai, India
² Assistant Professor, Department of Computer Science and Engineering, SRM University, Chennai, India
³ Head of Department, Department of Computer Science and Engineering, SRM University, Chennai, India
Abstract
Big data analytics can be used to develop forecasting models that predict the water levels of lakes. Autoregressive
Integrated Moving Average (ARIMA) modeling is used in this study because it works well on time series values. The
historical data, which is large in size, is stored in a big database. Various technologies are used to observe patterns in
the historical data and to predict future levels by applying data-driven analytics and data mining concepts.
Keywords: Big data, ARIMA, Hadoop, Analytics
1. INTRODUCTION
In modern times, when data is available in abundance and organizations frequently face data overload, the ability to
extract unique insights can help organizations improve their decision making, which directly improves their ability to
identify opportunities, minimize possible risks, and control costs. Big data analytics is not only about managing large or
diverse data; it is about questioning the available data, deriving new hypotheses, and making data-driven decisions. It
helps to uncover useful information, hidden patterns, and previously unknown correlations. Such information can
provide an advantage in predicting the future and result in business benefits, such as more effective planning and better
implementation of corrective steps to avoid chaos. Big data analytics can be done using various software tools
commonly used in advanced analytics disciplines such as data mining and predictive analytics. A new category of big
data technologies has emerged and is being used in many big data analytics projects, such as NoSQL databases,
Hadoop, and MapReduce. These core technologies are built around open source software frameworks that can process
large data sets across clustered systems.
Water supply agencies in various cities have no clear way of forecasting how the availability of water in their lakes [1]
will vary in the coming days, so they cannot plan ahead to avoid water scarcity. With proper mechanisms and the latest
technologies, it becomes easier to forecast the availability of water. The system can also be built to be robust and
scalable, so that it can accommodate data from more lakes in different cities and provide forecasts easily without
increasing the government's investment cost for each city.
2. METHOD USED
Forecasting and prediction of water levels in the lakes [3] that supply a city's population is mostly done manually. This
has the limitation that it cannot take past levels of water availability into account when forecasting future level patterns,
which would help in better planning and could avoid water scarcity in cities to a greater extent.
Forecasting approaches can be classified as follows –
1. Qualitative Approach – No mathematical model is used, because the available data is not considered to
contribute to the future values (long-term forecasting).
2. Quantitative Approach – Historical data is available. The approach is based on analysis of the historical time
series [2] of a particular variable and of other related time series. It also examines the cause-and-effect
relationships between one variable and other relevant variables.
3. Time Series Approach – A single variable changes with time, and its future values are assumed to be related in
some form to its past values.
To solve this problem, this study proposes a technical solution based on the time series approach. It can be used at low
cost and can be scaled up by including data from more and more lakes without affecting performance. The data is first
collected from the lakes in the ingest data operation and is then moved into the big data storage, where it can be stored
and shared for forecasting. The big data storage can be used for visualization by various users, such as researchers,
business analysts, data scientists, and operators. Finally, the water demand forecast is produced, which decision makers
in the water supply departments can use to decide strategy.
Figure 9: Data Stages in Proposed System
The data moves through the following stages in the proposed solution –
1. Initial Data Collection Stage – The historical data is collected from the lakes and put into temporary storage
before being transferred to the next stage.
2. Store & Share Big Data Stage – The data coming from the first stage is stored in the big database. This is the
main storage of the system and is used by the later stages.
3. Visualization Stage – Various users, such as data scientists, researchers, and business teams, can dig into the
available big data and perform analytics based on their needs.
4. Forecast Demand Stage – The forecasted data is sent to the output systems, where a clear representation of the
future data is produced.
5. Decides Strategy Stage – The concerned authorities can make the necessary strategy decisions based on the
output of the forecast demand stage to avoid any scarcity of water in the cities.
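The five stages can be sketched as a chain of small functions. This is only an illustrative Python sketch, not the authors' implementation: every name in it (Reading, ingest, store, visualize, forecast, decide) is hypothetical, an in-memory dictionary stands in for the big data store, and the forecast stage is a placeholder average rather than an ARIMA model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    lake: str
    day: int
    level_m: float


def ingest(raw_rows):
    """Stage 1: parse raw (lake, day, level) rows into temporary storage."""
    return [Reading(lake, int(day), float(level)) for lake, day, level in raw_rows]


def store(readings, database):
    """Stage 2: append readings to the shared store, keyed by lake name."""
    for reading in readings:
        database.setdefault(reading.lake, []).append(reading)
    return database


def visualize(database, lake):
    """Stage 3: expose the level series of one lake, ordered by day."""
    return [r.level_m for r in sorted(database[lake], key=lambda r: r.day)]


def forecast(series, window=3):
    """Stage 4: placeholder forecast (mean of the last `window` values)."""
    return mean(series[-window:])


def decide(predicted_level, threshold_m):
    """Stage 5: flag possible scarcity when the forecast falls below a threshold."""
    return "plan for scarcity" if predicted_level < threshold_m else "normal supply"


# Made-up readings for a hypothetical lake, used only to exercise the stages.
rows = [("Lake A", 1, 10.0), ("Lake A", 2, 9.5), ("Lake A", 3, 9.0)]
database = store(ingest(rows), {})
predicted = forecast(visualize(database, "Lake A"))
print(predicted, decide(predicted, threshold_m=9.8))
```

Keeping each stage behind its own function mirrors the separation in the list above: data from more lakes only adds keys to the store, and the placeholder forecast can be swapped for a real model without touching the other stages.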
The architecture of the proposed system is shown below –
Figure 10: Architecture Diagram
The cluster can be created at a centralized location and shared across various lakes for storing historical information. The
collected data forms the input to the Auto-Regressive Integrated Moving Average (ARIMA) modeling system used for
forecasting through Hadoop [4] [5]. ARIMA combines an Auto Regressive (AR) process and a Moving Average (MA)
process; the Integrated (I) part indicates that the modeled time series has been differenced to transform it into a
stationary time series.
A non-seasonal ARIMA model “ARIMA (p, d, q)” is based on three parameters – ‘p’ is the number of autoregressive
terms, ‘d’ is the number of non-seasonal differences, and ‘q’ is the number of lagged forecast errors.
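For reference, an ARIMA (p, d, q) model is commonly written with the backshift operator B, where B y_t = y_{t-1}. The symbols below (φ_i for the autoregressive coefficients, θ_j for the moving average coefficients, c for a constant, and ε_t for the forecast error at time t) follow the usual textbook notation and are an assumption here, since the paper does not define them:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

The factor (1 - B)^d applies the d differences, the left-hand sum carries the p autoregressive terms, and the right-hand sum carries the q lagged forecast errors.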
A seasonal ARIMA model (SARIMA) captures a pattern that repeats itself after a fixed number of periods. For example,
monthly data has 12 observations per year, quarterly data has 4 observations per year, and daily data has 5 or 7 observations
per week. The seasonal ARIMA model “ARIMA (p, d, q) (P, D, Q)” is based on six parameters: a seasonal order
(P, D, Q) combined with the non-seasonal order (p, d, q).
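The roles of the differencing orders d and D can be illustrated with a short Python sketch. The series below is made up (a linear trend plus a pattern that repeats every 4 periods) and is not taken from the paper's data; the function name difference is hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up levels: linear trend plus a pattern repeating every 4 periods.
pattern = [0, 2, 1, -1]
levels = [10 + t + pattern[t % 4] for t in range(12)]

# Seasonal differencing (D = 1, period 4) removes the repeating pattern.
seasonal = difference(levels, lag=4)

# Non-seasonal differencing (d = 1) removes what remains of the trend.
stationary = difference(seasonal, lag=1)

print(seasonal)
print(stationary)
```

On this toy series the seasonal difference is constant and the further first difference is zero everywhere, which is the sense in which differencing transforms a trending, seasonal series into a stationary one.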
There are four stages in ARIMA modeling of a time series – (1) Identification of Model [there are three types of models –
(a) AR – Autoregressive (b) MA – Moving Average (c) ARMA – includes both AR and MA together] (2) Estimation of
Model (3) Diagnostic Checking (4) Forecasting.
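The four stages can be walked through with a minimal Python sketch. It is written in plain Python rather than the R used in this study, and it rests on stated assumptions: the model order is fixed in advance at ARIMA (1, 1, 0) instead of being identified from the data, the coefficients are estimated by ordinary least squares on the differenced series (a statistics package may use a different estimator), and all function names are hypothetical.

```python
def difference(series):
    """First-order differencing (d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def autocorrelation(series, lag):
    """Sample autocorrelation at `lag`, a common aid for identification."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def fit_ar1(series):
    """Least-squares estimate of c and phi in x[t] = c + phi * x[t-1] + e[t]."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    return my - phi * mx, phi


def forecast_levels(levels, steps):
    """Run the four stages for an ARIMA(1, 1, 0) model assumed in advance."""
    # (1) Identification: d = 1, p = 1, q = 0 are assumed, not selected.
    diffs = difference(levels)
    # (2) Estimation of the constant and the autoregressive coefficient.
    c, phi = fit_ar1(diffs)
    # (3) Diagnostic checking works on the one-step residuals.
    residuals = [diffs[t] - (c + phi * diffs[t - 1]) for t in range(1, len(diffs))]
    # (4) Forecasting: predict the next differences and add them back up.
    forecasts, level, diff = [], levels[-1], diffs[-1]
    for _ in range(steps):
        diff = c + phi * diff
        level += diff
        forecasts.append(level)
    return forecasts, residuals
```

In practice the identification stage would inspect autocorrelation values such as autocorrelation(diffs, 1) before choosing the order, and the diagnostic stage would check that the residuals show no remaining structure.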
We used the R programming language, whose stats package includes an arima() function for ARIMA modeling of time
series. Besides the non-seasonal order (p, d, q), the function also accepts seasonal factors.
The "forecast" package for R can automatically select an ARIMA model for a given time series
with its auto.arima() function. The package can also simulate seasonal and non-seasonal ARIMA models with its
simulate.Arima() function. R provides additional packages to integrate it with Hadoop. rhbase is the R
package that provides connectivity to HBase through the Thrift server. It provides APIs to read, write, and
modify tables in HBase. Thrift is a communication framework used for cross-language remote procedure calls.
Figure 11: Interaction between R and HBASE
The ARIMA model provides an effective technique for forecasting the magnitude of a time series variable [6]. The
strength of this model lies in the fact that the approach is suitable for time series with various patterns of change. ARIMA
models also provide a useful baseline against which the performance of other forecasting models, such as neural
networks and kernel regression, can be compared.
Figure 12: ARIMA Predictions vs. Actuals
The actual lake levels and the levels predicted by ARIMA are observed to be very close to each other, which forms a
good basis for considering the approach appropriate for predicting lake levels.
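Closeness between predicted and actual levels can be quantified as a percentage deviation. The Python sketch below shows one way to compute it; the function names are hypothetical and the numbers are made up for illustration, not taken from Figure 12.

```python
def percent_deviation(actual, predicted):
    """Absolute deviation of each prediction, as a percentage of the actual value."""
    return [abs(p - a) * 100 / abs(a) for a, p in zip(actual, predicted)]


def mean_percent_deviation(actual, predicted):
    """Average of the per-observation percentage deviations."""
    deviations = percent_deviation(actual, predicted)
    return sum(deviations) / len(deviations)


# Made-up values for illustration only.
actual_levels = [100.0, 200.0, 150.0]
predicted_levels = [102.0, 190.0, 153.0]

print(percent_deviation(actual_levels, predicted_levels))
print(mean_percent_deviation(actual_levels, predicted_levels))
```

Reporting both the per-observation values and their mean makes it possible to check a claim such as "every deviation is below a given percentage" separately from the average error.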
The software environment established for this project using technologies such as Hadoop can be easily adapted to an
actual scenario. The software solution developed for this article uses Ubuntu 12.04 LTS as the Linux operating system.
It can be run on a virtual machine such as Oracle Virtual Machine or VMware. A single node cluster is created using
Hadoop [7] [8] to simulate the clustered environment. HBase is used for storing the historical data for the lakes [9]. The
query and analysis needed for prediction are performed in the R programming language, which is used for forecasting
the water levels of the lakes. The predictions are displayed as graphs using the Java GUI for R (Jaguar).
3. CONCLUSION
In this paper, a portable and scalable system for forecasting the water levels of lakes is discussed. Only open source
technologies are used in the implementation, which reduces the cost of implementation while still providing a solution
built on current technologies. The forecasted values and actual values are seen to be very close to each other, with a
deviation of less than five percent. The solution can be further enhanced to make the system simpler by exploring
alternative technologies used for the same purpose. In addition, other time series based forecasting algorithms and
models can be explored to achieve better accuracy and efficiency.
The study can be adapted to other forecasting requirements in any area that requires prediction of future values based on
historical values in a time series. The system can be modified and applied in areas such as weather forecasting and
financial markets.
References
[1] L. Liu, Z.X. Xu, N.S. Reynard, C.W. Hu and R.G. Jones, "Hydrological analysis for water level projections in Taihu
Lake, China", Journal of Flood Risk Management, March 2013
[2] H. Aksoy, N. E. Unal, E. Eris and M. I. Yuce, "Stochastic modeling of Lake Van water level time series with jumps
and multiple trends", Hydrology and Earth System Sciences (An Interactive Open Access Journal of the European
Geosciences Union), February 2013
[3] C.C. Nwobi-Okoye and A.C. Igboanugo, "Predicting Water Levels at Kainji Dam Using Artificial Neural Networks",
Nigerian Journal of Technology (NIJOTECH), March 2013
[4] Leixiao Li, Zhiqiang Ma, Limin Liu and Yuhong Fan, "Hadoop-based ARIMA Algorithm and its Application in
Weather Forecast", International Journal of Database Theory and Application, 2013
[5] New York City (NYC) government web site, "New York City's Operations Support Tool (OST) White Paper",
http://www.nyc.gov/html/dep/pdf/reports/ost_white_paper.pdf
[6] Science & Engineering Research Support Society (SERSC), "Hadoop-based ARIMA Algorithm and its Application
in Weather Forecast", http://www.sersc.org/journals/IJDTA/vol6_no5/11.pdf
[7] "Running Hadoop on Ubuntu Linux (Single-Node Cluster)",
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[8] "Running Hadoop on Ubuntu Linux (Multi-Node Cluster)",
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[9] Tom White, "Hadoop: The Definitive Guide", O'Reilly (Fourth Indian Reprint: Jun 2013)
AUTHORS
Prashant Shrivastava received his B.E. degree in Computer Science & Engineering from SGSITS Indore in
1999. He is currently pursuing his M. Tech. degree in Computer Science & Engineering at SRM
University, Ramapuram, Chennai. He has been working as a Senior Project Manager at L&T Infotech since 2010. His
research interests are in the areas of Cloud Computing, Big Data Analytics, open source technologies, Java
Enterprise Software Solutions, ETL and data warehousing solutions, and data mining. He has nearly 15 years of industry
experience.
S. Pandiaraj received his B.E. degree in Computer Science & Engineering from Madras University and
completed his M.E. in Computer Science & Engineering at Sathyabama University, Chennai, in 2010. He has been
working as an Assistant Professor in the Department of Computer Science and Engineering, Ramapuram Campus,
SRM University, Ramapuram, Chennai since June 2010. His research interests are in the areas of wireless networks
and computer architecture. He has more than 10 years of teaching experience.
Dr. J. Jagadeesan received his B.E. degree in Computer Engineering from Madurai Kamaraj University in
1991 and completed his M. Tech. in Computer Science & Engineering at Dr. MGR University in 2006. He
received his Ph. D. degree in Computer Science & Engineering from Anna University, Chennai in 2013-14. He
has around 20 years of teaching experience. His research interests are in the fields of Software Engineering,
DBMS, Computer Architecture and Software Testing. He was a Member of the Board of Studies for M.Sc. (Computer
Technology), M.Sc. (Information Technology) and M.Sc. (Multimedia Technology) at Periyar University (1999-2001).
He is a Member of the Indian Science Congress Association and of the CSI Chapter. He is a Certified Internal Auditor
for ISO 9001:2000. He received the Best Teacher Award from SNK Fomra Institute of Technology (2004-2005).