International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 8 - Jun 2014

Predictive Analysis on Big Data for a Febrile Viral Syndrome

Thara D K#1, Veena A*2
#1 Assistant Professor, Dept. of ISE, CIT Gubbi(T), Tumkur (D), VTU, Belgaum, Karnataka, India
*2 Assistant Professor, Dept. of CSE, CIT Gubbi(T), Tumkur (D), VTU, Belgaum, Karnataka, India

Abstract--- Big Data Everywhere! The spread of computers into everyday life has already increased, and keeps increasing, the available digital data in both volume and diversity. The volume of data generated in an organization grows every day, so an efficient and scalable storage system is required to manage that growth. The first part of this article discusses MapReduce, a software framework that processes vast amounts of data in parallel on large clusters. Predictive analytics is one of the best ways to use data to improve decision making, so the remainder of the article demonstrates that sparse, fine-grained data is a sound basis for predictive analytics. Predictive analytics for a febrile viral syndrome is performed with one of the best modelling techniques, Naive Bayes, and the results are as conservative as one would expect both theoretically and empirically.

Keywords: Big Data; Map Reduce; Predictive Analytics

I. INTRODUCTION

Big data refers to a very large amount of digital data. Large amounts of data are gathered and stored from many fields: banking transactions, social networking sites, websites, transactions in shopping arcades, and so on. Data on the order of petabytes is called big data; it is data too large to handle in a conventional environment. The amount of data collected every two days is estimated at five exabytes, which is assumed to equal all the data collected up to 2003. From 2007 onwards it became very difficult to store all the data being produced, and this flood of data gave rise to new challenges: handling data streams that arrive at different rates, and operating on several data streams concurrently. Work on big data is attracting heavy interest nowadays. Hadoop, an open-source tool that uses MapReduce for distributed processing, has attracted interest from all over the globe for handling big data in real time [1] and has become a dependable system for doing so.

Big data processing involves several stages: data acquisition and recording, information extraction and cleaning, data representation, aggregation and integration, data modelling, analysis and query processing, and interpretation. The challenges in big data handling include heterogeneity and incompleteness, scale, timeliness, privacy and human collaboration [2]. Big data is measured in terabytes, as shown in Fig. 1.

Fig 1: Representation of Data

Big data typically refers to the following categories of data:

Traditional enterprise data – customer data from Customer Relationship Management ("CRM") systems, transactional Enterprise Resource Planning ("ERP") data, web store transactions, and general ledger data.

Machine-generated/sensor data – Call Detail Records ("CDR"), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.

Social data – customer comment streams, micro-blogging sites such as Twitter, and social media platforms such as Facebook.
A. Significant Characteristics of Big Data

Volume: Enormous amounts of data are produced by numerous sources such as Facebook, Twitter, Yahoo Movies, mobile data over the Internet, and so on. A scalable architecture is therefore mandatory in order to store the massive quantity of data generated from these many data sources.

Fig 2: Significant characteristics of Big Data

Variety: Data collected from public websites, data collected while examining patient records, data collected from blogs and so on all arrive in different formats. The data can be structured data, semi-structured data or unstructured multidimensional data.

Velocity: The use of social websites is rising at an extremely high tempo day by day. The number of Facebook users keeps growing and has now reached 250 million users per day, so frequent and rapid decisions must be made about the data. Big data is also highly dynamic: once decisions are made about one set of data, those decisions influence the next set of data that is gathered and analysed, which adds one more dimension to velocity.

Value: The financial significance of different data differs. Every piece of data mined contains its own hidden, valuable information, so it is mandatory to recognise and extract that information for later analysis.

B. Challenges of Big Data

The foremost challenges faced in big data study are scalability and complexity. Data is growing at a very high pace, and conventional RDBMS and SQL are not well suited to the analysis of big data. Other challenges include attaining confidentiality and safety, heterogeneity, correctness, preserving cloud services for big data, and data possession.

Scalability: As the data expands, additional nodes should be added to the cluster without disturbing the coordination. A conventional RDBMS is not appropriate for big data analysis for two reasons:
1) The system cannot be extended into a cluster with a larger number of nodes because of its ACID properties; doing so may also result in considerable network overhead and difficulty in sustaining consistency.
2) It is not well matched with unstructured or semi-structured data.

A good big data technique should have two traits:
1) It should be possible to extract and store an enormous amount of data in a short period. Even though the price of storage devices is falling, the rate at which data can be read and written is not rising at the same tempo. Hadoop provides a resourceful distributed file system that can be used for storing and retrieving data in a short interval.
2) It should be possible to process a bulky quantity of data in a short interval. Since a processor generates an immense amount of heat, it is not feasible to keep increasing microprocessor clock speed, so the only practical resolution is concurrent data processing. MapReduce is one such procedure that can be used to attain concurrent data processing [7]; a simplified single-machine sketch of this idea follows below.
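To make trait 2 concrete, the following is a minimal, single-machine sketch of concurrent data processing in Python. It is an illustration added here, not code from the original paper and not Hadoop itself; the record list and the count_fever_cases function are hypothetical, chosen only to show how splitting data into chunks and processing them in parallel works. MapReduce applies the same idea at cluster scale.

```python
# Minimal illustration of concurrent data processing (the idea behind trait 2).
# This is a single-machine stand-in for what Hadoop/MapReduce does on a cluster.
from multiprocessing import Pool

def count_fever_cases(chunk):
    """Process one chunk of records: count how many report 'fever'."""
    return sum(1 for record in chunk if "fever" in record)

if __name__ == "__main__":
    # Hypothetical in-memory dataset; in practice this would be read from a
    # distributed file system such as HDFS.
    records = ["fever,rash", "joint pain", "fever,back pain", "shoulder pain"] * 250_000

    # Split the data into chunks and process the chunks concurrently on 4 workers.
    n_workers = 4
    chunk_size = len(records) // n_workers
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    with Pool(processes=n_workers) as pool:
        partial_counts = pool.map(count_fever_cases, chunks)

    print("fever cases:", sum(partial_counts))
```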
C. Open Source Software for Big Data

Big data analysis is closely associated with a number of open-source software tools. It primarily deals with Hadoop and with additional software such as:
Apache Hadoop
Apache Pig
Cascading
Scribe
Apache HBase
Apache Cassandra
Apache S4
Storm
R
Elastic Search
MongoDB
CouchDB

II. MAP REDUCE

MapReduce is a procedure for generating and processing bulky amounts of data, of the order of terabytes and beyond. The model was proposed by Google, and it has produced good results in several sectors such as business, education and farming. The stages involved in the procedure are shown in the figure below; a small sketch simulating these stages is given at the end of this section.

Fig 3: Map Reduce Model [18]

The two important stages of MapReduce are the Map phase and the Reduce phase. The first stage, the distributed file system (DFS), splits the data and passes it as input to the next stage, the Map phase. The Map phase maps the data and hands it over for shuffling to the Shuffle phase; the shuffled data is then condensed in the Reduce phase. MapReduce is a software framework used for applications that process bulky amounts of data in parallel on big clusters without failures.

Fig 4: Example for Map Reduce

MapReduce is judged one of the prevailing paradigms for big data, which consists of datasets at the level of terabytes and higher. The qualities of an ideal MapReduce procedure are:
1. It accomplishes a high degree of load balancing in compound clusters.
2. It keeps space, CPU and I/O time low.
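Here is the small sketch referred to above: a self-contained simulation, in plain Python, of the Map, Shuffle and Reduce phases for a word count. It is an illustration added to this article, not the Hadoop API; in a real Hadoop job the framework itself performs the shuffle and spreads the map and reduce tasks across the nodes of the cluster.

```python
# Simulation of the MapReduce phases for a word count, on a single machine.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all intermediate values by key (the word)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: condense each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    splits = ["big data big clusters", "map reduce processes big data"]
    # Each split would normally be mapped on a different node in parallel.
    mapped = [pair for split in splits for pair in map_phase(split)]
    print(reduce_phase(shuffle_phase(mapped)))
    # {'big': 3, 'data': 2, 'clusters': 1, 'map': 1, 'reduce': 1, 'processes': 1}
```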
III. SPARSE, FINE-GRAINED (BEHAVIOUR) DATA

The data derived by observing the behaviour of individuals is known as sparse, fine-grained data. For example, data on individuals' visits to enormous numbers of specific web pages is used in predictive analytics for targeting online display advertisements, data on individual geographic locations is used for targeting mobile advertisements [3], and data on the individual merchants with which a person transacts is used to target banking advertisements [4]. The major characteristic of such a dataset is that, for any given instance, most attributes have a value of zero or "not present".

Predictive modelling is not a new idea, and it does not follow that larger data automatically yields superior predictive performance. Some researchers believe that sampling, or reducing the number of attributes, is preferable [5], while others hold the strong belief that large data leads to lower estimation variance and better predictive performance [4]. The bottom line is that the answer depends on the type of data and on how the data is distributed across the attributes; when large data is collected over many behaviours of many individuals, producing sparse, fine-grained data, one can expect better predictions from predictive analytics.

In the dataset used here, shown in the table above, we predict whether a person is affected by a febrile viral syndrome based on attributes such as fever, RT-PCR, RT-LAMP, polyarthritis, retro-orbital pain, shoulder pain, rash, back pain, joint pain and so on. The dataset is quite sparse, and the sparseness increases as the dataset grows. The active elements, those that are present or non-zero (that actually have a value in the data), can be a tiny fraction of the elements in the matrix for massive, sparse data; the sparseness is the percentage of entries in the feature-vector matrix that are not active (i.e., that are zero).

IV. PREDICTIVE TECHNIQUE USING NAIVE BAYES CLASSIFIER

Prediction is similar to classification: we try to predict the value of a numerical variable, the value range that variable is expected to take, or the class of some unlabelled example. The Naive Bayes classifier is a more refined method than the naive rule. The principal idea is to incorporate the information given in a set of predictors into the naive rule to attain more precise classifications. In other words, the probability of a record belonging to a certain class is now assessed not only from the prevalence of that class but also from the additional information given about that record in terms of its X values. The Naive Bayes method is extremely useful when very large datasets are available. For example, web-search companies such as Google use Naive Bayes classifiers to correct misspellings that users type in: when you type a phrase that contains a misspelled word into Google, it proposes a spelling correction for the phrase. The corrections are based not only on the frequencies of similarly spelled words typed by millions of other users, but also on the other words in the phrase.

The Naive Bayes method is based on conditional probabilities, and in particular on a fundamental theorem in probability theory, Bayes' theorem. This theorem gives the probability of a prior event, given that a certain subsequent event has occurred.

In this algorithm every variable in the given dataset is assumed to be independent of all the remaining variables; no variable has an effect on the others. This is one of the distinctive properties of the Naive Bayes algorithm, and it performs extremely well on bulky sparse datasets; the key reason is that, within each class, the covariances of the variables are zero. This property, combined with Bayes' rule, gives the following equation:

P(C = c \mid x_1, \dots, x_n) = \frac{P(C = c)\, P(x_1, \dots, x_n \mid C = c)}{P(x_1, \dots, x_n)} \propto P(C = c) \prod_{j=1}^{n} P(x_j \mid C = c)

The algorithm evaluates only those variables whose value is non-zero: in a sparse dataset all zeros are skipped and the computation is carried out only for the non-zero values, which reduces the computation time. A small sketch of such a classifier on sparse symptom data is given below.
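The following is a minimal, self-contained sketch, added here for illustration, of a multivariate Bernoulli Naive Bayes classifier over sparse binary symptom data; it is not code from the original paper, and the toy instances and AFFECTED / NOT AFFECTED labels are invented. Each instance is stored only as the set of indices of its active elements, so, as argued above, the per-instance work at prediction time touches only the non-zero entries; the contribution of the all-zeros baseline is precomputed once per class.

```python
# Minimal multivariate Bernoulli Naive Bayes on sparse binary data.
# Each instance is the set of its active (value 1) feature indices, so the
# per-instance work is proportional to the number of active elements only.
import math
from collections import defaultdict

FEATURES = ["fever", "RT-PCR", "RT-LAMP", "polyarthritis", "retro-orbital pain",
            "shoulder pain", "rash", "back pain", "joint pain"]

def train(instances, labels, n_features, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli parameters (Laplace-smoothed)."""
    class_count = defaultdict(int)
    feature_count = defaultdict(lambda: [0] * n_features)
    for active, label in zip(instances, labels):
        class_count[label] += 1
        for j in active:
            feature_count[label][j] += 1
    model = {}
    total = len(labels)
    for c, n_c in class_count.items():
        prior = math.log(n_c / total)
        p = [(feature_count[c][j] + alpha) / (n_c + 2 * alpha) for j in range(n_features)]
        # Precompute the "all features absent" baseline once per class, so that at
        # prediction time only the active features need an adjustment term.
        base = sum(math.log(1 - pj) for pj in p)
        adjust = [math.log(pj) - math.log(1 - pj) for pj in p]
        model[c] = (prior, base, adjust)
    return model

def predict(model, active):
    """Return the class with the highest log-posterior for one sparse instance."""
    scores = {c: prior + base + sum(adjust[j] for j in active)
              for c, (prior, base, adjust) in model.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Hypothetical toy data: indices into FEATURES, with AFFECTED / NOT AFFECTED labels.
    X = [{0, 1, 4, 8}, {0, 6, 8}, {5}, {7}, {0, 2, 3, 8}, set()]
    y = ["AFFECTED", "AFFECTED", "NOT AFFECTED", "NOT AFFECTED", "AFFECTED", "NOT AFFECTED"]
    nb = train(X, y, n_features=len(FEATURES))
    print(predict(nb, {0, 8}))   # fever + joint pain -> AFFECTED
```

Laplace smoothing (the alpha parameter) keeps a symptom that never occurs within a class from forcing a zero probability, and working in log space avoids numerical underflow when the number of features is large.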
V. PREDICTING THE TARGET VARIABLE FOR A MULTI-FEATURED DATASET

Sparse data is represented as a matrix in which every row is an instance and every column is a feature of that instance. If the i-th instance of the matrix is denoted Xi and the j-th feature is denoted Xj, then the j-th feature of the i-th instance is denoted Xij. A sparse dataset matrix contains many more zeros than ones; each element with the value one is treated as an active element and each element with the value zero as an inactive element. For massive, sparse data the computational time can be reduced by considering only the active elements of the matrix [6], and since the average number of active elements per instance is many times smaller than the total number of features in the matrix, it is advisable to use the sparse, fine-grained data technique.

If all the attributes are binary, each of them follows a Bernoulli distribution. In a general multivariate setting the instance is an n-tuple in which each variable might follow a different distribution (one binomial, another Poisson, and so on), but here every variable X1, X2, X3, ... follows a Bernoulli distribution. Each feature is specified by its probability of success (treating one outcome as success and the other as failure), for example 0.7, 0.3, 0.25, 0.35, 0.4; these probabilities describe the Bernoulli distributions. The variables are all binary features, each taking the value YES or NO, and the problem is binary classification. The features are treated as independent: if one feature is YES, that does not imply that the next feature is also YES, since each is chosen independently. Within a single feature, every observation is drawn with the same success probability, so the observations of that feature are identically distributed.

Each row of the matrix represents an instance and each column represents a feature. By observing a person's recorded symptoms (fever, RT-PCR, RT-LAMP, polyarthritis, retro-orbital pain, shoulder pain, rash, back pain, joint pain), the value of the target variable is predicted as AFFECTED or NOT AFFECTED; the target variable is therefore also binary.

VI. CONCLUSION

We have discussed big data and the MapReduce programming procedure used for generating and processing huge data. In this paper we have focused on a particular sort of data as a reason for continued attention to scaling up. Predictive modelling is based on one or more data instances for which we want to predict the value of a target variable; our main contribution is to use sparse, fine-grained data to predict the target variable.

VII. FUTURE ENHANCEMENT

In future work we can use a multivariate Naive Bayes that mines massive, sparse data extremely efficiently and predicts the values of a multi-valued target variable. Predictive modelling with large, transactional data can be made substantially more accurate by increasing the data size to a massive scale.

REFERENCES

[1] Bogdan Ghiț, Alexandru Iosup and Dick Epema, "Towards an Optimized Big Data Processing System," IEEE, USA, 2005, pp. 1-4.
[2] Liran Einav and Jonathan Levin, "The Data Revolution and Economic Analysis," prepared for the NBER Innovation Policy and the Economy Conference, USA, 2013, pp. 1-29.
[3] F. Provost, "Geo-social targeting for privacy-friendly mobile advertising: Position paper," Working paper CeDER-11-06A, Stern School of Business, New York University, 2011.
[4] D. Martens and F. Provost, "Pseudo-social network targeting from consumer transaction data," Working paper CeDER-11-05, Stern School of Business, New York University, 2011.
[5] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, pp. 273-324, 1997.
[6] Enric Junque de Fortuny, David Martens and Foster Provost, "Predictive modeling with big data," 2013.
[7] Surajit Das and Dinesh Gopalani, "Big Data Analysis Issues and Evolution of Hadoop," IJPRET, vol. 2, no. 8, 2014.
[10] I. Ajzen, "The theory of planned behavior," Theor Cogn Self Regul, vol. 50, pp. 179-211, 1991.