International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 8 - Jun 2014
Predictive Analysis on Big Data for a Febrile Viral Syndrome
Thara D K#1, Veena A*2
#1 Assistant Professor, Dept. of ISE, CIT Gubbi(T), Tumkur (D), VTU, Belgaum, Karnataka, India.
*2 Assistant Professor, Dept. of CSE, CIT Gubbi(T), Tumkur (D), VTU, Belgaum, Karnataka, India.
Abstract--- Big data is everywhere. The spread of computers in everyday life has increased, and keeps increasing, the available digital data in both volume and diversity. The volume of data generated in an organization grows every day, so an efficient and scalable storage system is required to manage this growth. The first part of the article discusses MapReduce, a software framework that processes vast amounts of data in parallel on large clusters. Predictive analytics is one of the best ways to use data to improve decision making; hence the remaining part of the article demonstrates that sparse, fine-grained data is the basis for predictive analytics. Predictive analytics for a Febrile Viral Syndrome is performed with Naive Bayes, one of the best modelling techniques. The results based on Naive Bayes are as conservative as one would expect theoretically and empirically.
Keywords: Big Data; Map Reduce; Predictive analytics
I. INTRODUCTION
Big data refers to a very large amount of digital data. Large amounts of data are gathered and stored from various fields such as banking transactions, social networking sites, websites, transactions in shopping malls, and so on. Data on the order of petabytes is called big data.
Big data denotes data volumes that cannot be handled in a conventional computing environment. The amount of data collected every two days is estimated to be five exabytes, which is assumed to be equal to the amount of data collected up to the year 2003. From the year 2007 it became very difficult to store all the data that was produced. The production of this huge amount of data gave rise to new challenges, including handling data streams that arrive at different rates of change and operating on different data streams concurrently.
Work on big data is attracting heavy interest nowadays. Hadoop is a free, open-source tool that uses MapReduce for distributed processing. It has attracted interest from all over the globe for handling big data in real time [1] and has become a reliable system for handling big data.

Big data processing has several stages, consisting of data acquisition and recording, information extraction and cleaning, data representation, aggregation and integration, data modelling and analysis, and query processing. The challenges in big data handling include heterogeneity and incompleteness, scale, timeliness, privacy, and human collaboration [2]. Big data is measured in terabytes and beyond, as shown in Fig 1.
Fig 1: Representation of Data
Big data typically refers to the following categories of data:

• Traditional enterprise data – includes customer data from Customer Relationship Management ("CRM") systems, transactional Enterprise Resource Planning ("ERP") data, web store transactions, and general ledger data.

• Machine-generated/sensor data – includes Call Detail Records ("CDR"), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.

• Social data – includes customer comment streams, micro-blogging sites like Twitter, and social media platforms like Facebook.
A. Significant Characteristics of Big Data

Volume: The enormous amount of data produced by numerous sources such as Facebook, Twitter, Yahoo Movies, mobile data over the internet, etc. It is mandatory to have a scalable architecture in order to store the massive quantity of data generated from these many data sources.

Fig 2: Significant characteristics of Big Data

Variety: Data collected over numerous public websites, data collected while examining patients' information, data collected from blogs, etc., all arrive in different formats. The data can be structured data, unstructured data, or unstructured multidimensional data.

Velocity: The pace of use of social websites is rising extremely quickly day by day. Facebook's user base keeps growing and has now reached 250 million users per day. For this reason it is necessary to make frequent and rapid decisions about the data. Big data is so dynamic that once decisions are made about one set of data, those decisions influence the next set of data that is gathered and analysed. This adds one more dimension to velocity.

Value: The financial significance of different data is different. Every piece of mined data contains its own hidden valuable information, so it is necessary to recognize and extract that valuable information, which in turn can be used for analysis.

B. Challenges of Big Data

The foremost challenges faced in the study of big data are scalability and complexity.

Scalability: As the data expands, additional nodes should be added to the cluster without disturbing the coordination.

Other challenges faced include:
• confidentiality and security issues
• heterogeneity
• correctness
• preserving cloud services for big data
• data ownership

Data is growing at a very high pace, and conventional RDBMS and SQL are not well suited to the analysis of big data. Conventional RDBMS is not appropriate for big data analysis for two reasons:
1) The system cannot be extended as a cluster with a larger number of nodes because of its ACID properties; doing so may also result in considerable network overhead and difficulty in maintaining consistency.
2) It is not well matched with unstructured or semi-structured data.
A good big data technique should have two traits:
1) It should be possible to extract and store an enormous amount of data in a short time. Even though the price of storage devices is falling, the rate of data extraction is not increasing at the same pace. Hadoop is one resourceful distributed file system that can be used for extracting and storing data in a short time.
2) It should be possible to process a bulky quantity of data in a short time. Since the processor generates an immense amount of heat, it is not feasible to keep increasing microprocessor speed; hence the only solution is concurrent data processing. MapReduce is one such procedure that can be used to achieve concurrent data processing [7].
C. Open Source Software for Big Data

Big data analysis is mostly associated with a number of open-source software tools. It primarily deals with Hadoop and many other such tools:
• Apache Hadoop
• Apache Pig
• Cascading
• Scribe
• Apache HBase
• Apache Cassandra
• Apache S4
• Storm
• R
• Elasticsearch
• MongoDB
• CouchDB
II. MAP REDUCE

MapReduce is a procedure used for generating and processing bulky amounts of data, from the order of terabytes onwards. The model was proposed by Google, and its results have been welcomed in several sectors such as business, education, and farming. The stages involved in this procedure are shown in the figure below.
Fig 3: Map Reduce Model [18]

The two imperative phases of MapReduce are the Map phase and the Reduce phase. First the distributed file system (DFS) splits the data, and the splits are given as input to the Map phase. The Map phase maps the data and passes it to the Shuffle phase; the shuffled data is then condensed in the Reduce phase.

MapReduce is a software procedure used for applications that process bulky amounts of data in parallel on big clusters without failures.

Fig 4: Example for Map Reduce

MapReduce is judged to be one of the prevailing paradigms for big data, which involves datasets at the scale of terabytes and higher. The qualities of an ideal MapReduce procedure are:
1. Accomplish a high degree of load balancing in compound clusters.
2. Use little space, CPU, and I/O time.
III. SPARSE, FINE-GRAINED (BEHAVIOR) DATA

The data derived by observing the behaviour of individuals is known as sparse, fine-grained data. For example, data on individuals' visits to enormous numbers of specific web pages are used in predictive analytics for targeting online display advertisements. Data on individual geographic locations are used for targeting mobile advertisements [3]. Data on the individual merchants with which one transacts are used to target banking advertisements [4]. The major characteristic of such a dataset is that, for any given instance, most attributes have a value of zero or "not present".
Predictive modelling is not a new idea, and it does not follow automatically that larger data will yield superior predictive performance. Some scientists believe that sampling or reducing the number of attributes is favourable [5], while others hold a strong belief that larger data leads to lower estimation variance and better predictive performance [4]. The baseline is that the answer does not depend on the data type or on how the data are distributed across the attributes; hence sparse, fine-grained data can be created. By collecting large data over many behaviours or individuals, one can expect better predictions using predictive analytics.
The dataset used here, shown in the table above, is used to predict whether a person is affected by Febrile Viral Syndrome based on attributes such as fever, RT-PCR, RT-LAMP, polyarthritis, retro-orbital pain, shoulder pain, rash, back pain, joint pain, etc. The datasets are quite sparse, and the sparseness increases as the dataset size increases. Active elements are those that are present, i.e. non-zero, and actually have a value in the data; for massive, sparse data they can be a tiny fraction of the elements in the matrix. The sparseness is the percentage of entries that are not active (i.e., are zero) in the feature-vector matrix of each dataset.
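As a small illustration of the sparseness measure just described (the percentage of entries in the feature-vector matrix that are zero), the sketch below computes it for a toy binary symptom matrix; the values are invented for illustration and are not the paper's dataset.

# Sketch: compute the sparseness (percent of zero entries) of a binary feature-vector matrix.
# Rows are hypothetical patients, columns are symptoms/tests; all values are invented.
rows = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
]
total = sum(len(r) for r in rows)
active = sum(sum(r) for r in rows)             # non-zero ("active") elements
sparseness = 100.0 * (total - active) / total  # percent of inactive (zero) entries
print(f"active elements: {active} of {total}, sparseness: {sparseness:.1f}%")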
IV. PREDICTIVE TECHNIQUE USING NAIVE BAYES CLASSIFIER

Prediction is similar to classification: we try to estimate the value of a numerical variable, the value range the variable is expected to have, or the category of some unlabelled example.

The Naive Bayes classifier is a more refined routine than the naive rule. The principal idea is to incorporate the information given in a set of predictors into the naive rule to attain more precise classifications. In other words, the probability of a record belonging to a certain class is now assessed not only from the prevalence of that class but also from the additional information given on that record in terms of its X values.
The naive Bayes routine is extremely useful when very bulky datasets are available. For illustration, web-search companies like Google use naive Bayes classifiers to correct misspellings that users type in. When you type a phrase that contains a misspelled word into Google, it proposes a spelling correction for the phrase. The corrections are based not only on the frequencies of similarly spelled words typed by millions of other users, but also on the other words in the phrase.

The naive Bayes method is based on conditional probabilities, and in particular on a fundamental theorem in probability theory called Bayes' Theorem. This theorem gives the probability of a prior event, given that a certain subsequent event has occurred.
In this algorithm every variable in the given dataset is assumed to be independent of all the remaining variables; no variable has an effect on the other variables. This is one of the distinctive properties of the Naive Bayes algorithm, and it performs extremely well on bulky sparse datasets. The key reason for this is the fact that, within a class, the covariances of the variables are zero. This property of Bayes' rule gives the following equation.
P(C = c \mid X_1 = x_1, \dots, X_n = x_n) = \frac{P(C = c)\, P(X_1 = x_1, \dots, X_n = x_n \mid C = c)}{P(X_1 = x_1, \dots, X_n = x_n)} \propto P(C = c) \prod_{j=1}^{n} P(X_j = x_j \mid C = c)
The algorithm computes terms only for those variables whose value is non-zero. In a sparse dataset all zeros are ignored and the computation is carried out only for the non-zero values, which helps to reduce the computation time.
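A minimal sketch of this idea is given below, assuming binary (Bernoulli) features with Laplace smoothing; it is an illustrative implementation, not the authors' code. For each class the log-likelihood of an all-zeros instance is precomputed once, so prediction only visits the active (non-zero) features and adjusts that base score.

import math
from collections import defaultdict

class SparseBernoulliNB:
    # Naive Bayes over binary features that touches only the non-zero (active) entries.

    def fit(self, X_active, y, n_features):
        # X_active: list of sets of active feature indices, one set per instance.
        classes = set(y)
        self.log_prior = {}
        self.log_p = {}   # log P(x_j = 1 | c)
        self.log_q = {}   # log P(x_j = 0 | c)
        self.base = {}    # sum_j log P(x_j = 0 | c): log-likelihood of an all-zeros instance
        for c in classes:
            idx = [i for i, label in enumerate(y) if label == c]
            self.log_prior[c] = math.log(len(idx) / len(y))
            counts = defaultdict(int)
            for i in idx:
                for j in X_active[i]:
                    counts[j] += 1
            p = [(counts[j] + 1) / (len(idx) + 2) for j in range(n_features)]  # Laplace smoothing
            self.log_p[c] = [math.log(v) for v in p]
            self.log_q[c] = [math.log(1 - v) for v in p]
            self.base[c] = sum(self.log_q[c])
        return self

    def predict(self, active):
        # Only the active features are visited; zeros are covered by the precomputed base term.
        scores = {c: self.log_prior[c] + self.base[c]
                     + sum(self.log_p[c][j] - self.log_q[c][j] for j in active)
                  for c in self.log_prior}
        return max(scores, key=scores.get)

Training data is supplied as sets of active feature indices, for example SparseBernoulliNB().fit([{0, 3}, {2}], ["AFFECTED", "NOT AFFECTED"], n_features=9).predict({0, 3}).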
V. PREDICTING THE TARGET VARIABLE FOR A MULTI-FEATURED DATASET

Sparse data is understood in the form of a matrix, where every row of the matrix is an instance and each column is a feature of that instance. If the i-th instance of the matrix is denoted Xi and the j-th feature is denoted Xj, then the j-th feature of the i-th instance can be denoted Xij. A sparse dataset matrix contains many zeros and few ones. Each element with the value one is treated as an active element and each element with the value zero as an inactive element.

For massive sparse data, the computational time can be reduced by considering only the active elements of the matrix [6].
The average number of active elements per instance is many times smaller than the total number of features in the matrix; hence it is advisable to use the sparse, fine-grained data technique.
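As a sketch of this storage scheme, the snippet below uses SciPy's compressed sparse row (CSR) format, an assumed library choice since the paper does not name one; only the active elements are stored and visited.

# Sketch: store only the active (non-zero) elements of the instance-by-feature matrix.
# Requires SciPy; the toy values are hypothetical.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0]])
X = csr_matrix(dense)                     # compressed sparse row format
print("stored (active) elements:", X.nnz, "of", dense.size)

# Visit only the active elements X_ij == 1 of each instance i.
for i in range(X.shape[0]):
    start, end = X.indptr[i], X.indptr[i + 1]
    active_features = X.indices[start:end]
    print(f"instance {i}: active features {list(active_features)}")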
If all of the attributes are binary, each of them follows a Bernoulli distribution.
In general, a multivariate n-tuple may mix distributions: one variable may be binomial, another Poisson. Here, however, each variable in the multivariate tuple follows a Bernoulli distribution: X1, X2, X3, ... are all Bernoulli. We have to give the probability of success for each feature (treating one outcome as success and the other as failure), say for example 0.7, 0.3, 0.25, 0.35, 0.4; these values specify the Bernoulli distributions.

The variables are all binary features, each taking the value YES or NO, and the problem is binary classification.

The Bernoulli features are treated as independent and identically distributed. All the features are independent: if one feature is YES it does not mean that the next feature is also YES, since each is chosen independently. The probabilities are all of the same form; hence the distribution is identical.
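Written out, this assumption corresponds to a multivariate Bernoulli likelihood; the factorized form below follows from the independence assumption (it is a standard expression, not an equation given in the paper), with p_j denoting the per-feature success probabilities such as 0.7, 0.3, 0.25 mentioned above:

P(X_1 = x_1, \dots, X_n = x_n) = \prod_{j=1}^{n} p_j^{\,x_j} (1 - p_j)^{1 - x_j}, \qquad x_j \in \{0, 1\}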
Each row represents an instance and each column represents a feature. From the behaviour of the person (fever, RT-PCR, RT-LAMP, polyarthritis, retro-orbital pain, shoulder pain, rash, back pain, joint pain), the value of the target variable is predicted as either AFFECTED or NOT AFFECTED. The target variable is also binomial.
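As a usage illustration of this setup (binary symptom features and a binary target), the sketch below uses scikit-learn's BernoulliNB classifier; the training rows and labels are invented toy data, not the paper's dataset.

# Sketch: predict AFFECTED / NOT AFFECTED from binary symptom features with Bernoulli Naive Bayes.
# Requires scikit-learn; the rows and labels below are invented toy data.
from sklearn.naive_bayes import BernoulliNB

# Columns: fever, RT-PCR, RT-LAMP, polyarthritis, retro-orbital pain,
#          shoulder pain, rash, back pain, joint pain
X = [[1, 1, 0, 1, 1, 0, 1, 0, 1],
     [0, 0, 0, 0, 0, 0, 0, 1, 0],
     [1, 0, 1, 1, 0, 1, 0, 0, 1],
     [0, 0, 0, 0, 0, 0, 1, 0, 0]]
y = ["AFFECTED", "NOT AFFECTED", "AFFECTED", "NOT AFFECTED"]

model = BernoulliNB()                 # multivariate Bernoulli naive Bayes
model.fit(X, y)
new_patient = [[1, 0, 0, 1, 1, 0, 0, 0, 1]]
print(model.predict(new_patient))     # e.g. ['AFFECTED']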
VI. CONCLUSION

We have discussed big data and the MapReduce programming procedure that is used for generating and processing huge data. In this paper we have focused on a particular sort of data as a reason for continued attention to scaling up. Predictive modeling is based on one or more data instances for which we want to predict the value of a target variable. Our main contribution is the use of sparse, fine-grained data to predict the target variable.
VII. FUTURE ENHANCEMENT

As a future enhancement, a multivariate naive Bayes can be used that mines massive, sparse data extremely efficiently and predicts values for a multi-valued target variable. Predictive modeling with large transactional data can be made substantially more accurate by increasing the data size to a massive scale.
REFERENCES
[1] Bogdan Ghit, Alexandru Iosup and Dick Epema (2005). Towards an Optimized Big Data Processing System. USA: IEEE. p. 1-4.
[2] Liran Einav and Jonathan Levin (2013). The Data Revolution and Economic Analysis. USA: Prepared for the NBER Innovation Policy and the Economy Conference. p. 1-29.
[3] Provost F. Geo-social targeting for privacy-friendly mobile advertising: Position paper. Working paper CeDER-11-06A. Stern School of Business, New York University, 2011.
[4] Martens D, Provost F. Pseudo-social network targeting from consumer transaction data. Working paper CeDER-11-05. Stern School of Business, New York University, 2011.
[5] Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997; 97:273-324.
[6] Enric Junque de Fortuny, David Martens and Foster Provost (2013). Predictive modeling with big data.
[7] Surajit Das, Dinesh Gopalani (2014). Big Data Analysis Issues and Evolution of Hadoop. IJPRET, Volume 2(8).
[10] Ajzen I. The theory of planned behavior. Theor Cogn Self Regul 1991; 50:179-211.