A Predictive Approach to Error Log Analytics

A SUMMER INTERNSHIP REPORT

Submitted by RAKSHIT DWIVEDI
SYSTEM ID: 2017012524
ROLL No: 170251142

Under the Supervision of Ms. Swati Bansal

In partial fulfilment of the summer internship for the award of the degree of MASTER OF BUSINESS ADMINISTRATION

SCHOOL OF BUSINESS STUDIES, SHARDA UNIVERSITY
Greater Noida, UP
AUGUST, 2018

Sharda University, Plot No. 32-34, Knowledge Park III, Greater Noida, Uttar Pradesh 201306

Certificate of Approval

The following Summer Project Report, titled "A Predictive Approach to Error Log Analytics", is hereby approved as a certified study in business analytics carried out and presented in a manner satisfactory to warrant its acceptance as a prerequisite for the award of the Post-Graduate Diploma in Management for which it has been submitted. It is understood that by this approval the undersigned do not necessarily endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve the Summer Project Report only for the purpose for which it is submitted.

Summer Project Report Examination Committee for evaluation of the summer project report (Name, Signature):
1. Faculty Examiner
2. PG Summer Project Coordinator

PREFACE

An industrial internship is a programme conducted to acquire practical knowledge. It is believed that practical working experience will be an added advantage in our future careers and may help us achieve our aims and ambitions. It provides a chance to acquire knowledge of global business and prepares one for an executive role. It exposes practical phenomena, including risk, and enables one to weigh probable alternative decisions. The knowledge gained is based on learning and experience.

It is a matter of great pleasure that I have completed my internship programme at Madrid Software Training Solutions. The programme was conducted from June 02, 2018 to July 31, 2018 as part of the summer internship for my MBA at Sharda University, Greater Noida. This report has been prepared in fulfilment of the academic curriculum required under the programme. While preparing this report I gathered practical working experience, and my effort will have been worthwhile if any person or organisation benefits from this report.

ACKNOWLEDGMENT

This internship report is the accumulation of many people's endeavours. It would never have been possible without the consistent support and assistance of the people whom I approached during the various stages of writing it. First of all, I would like to sincerely express my gratitude and thanks to my faculty supervisor, Ms. Swati Bansal, for her continuous assistance and guidance in completing this report. Her help, guidance and constructive comments were invaluable. I am grateful to my industry mentor, Mr. Sachin Arora (Team Leader, KPMG), for his support and supervision, and thankful for the open-minded attitude he showed towards me during the preparation of this report. I am also grateful to each and every employee of Madrid Software Training Solutions, with a special mention of Mr. Amit Kataria, for their cordial acceptance. They were very helpful in showing me the work process and provided relevant information for my report whenever I approached them. Finally, my heartfelt gratitude goes to Sharda University School of Business Studies and the instructors with whom I did my courses and who have given me a valuable education.
RAKSHIT DWIVEDI
SYSTEM ID: 2017012524

DECLARATION

I, Rakshit Dwivedi, hereby declare that the work titled "A Predictive Approach to Error Log Analytics" is genuine work done by me under the faculty guide Ms. Swati Bansal and has not been published or submitted elsewhere for the requirement of a degree programme. Any literature, data or work done by others and cited within this project has been given due acknowledgement and is listed in the reference section.

RAKSHIT DWIVEDI (System ID: 2017012524)          Ms. Swati Bansal (Faculty Guide)

TABLE OF CONTENTS

List of Abbreviations
Abstract
Chapter 1: Introduction
  1.1 Introduction to Big Data
  1.2 Data Analytics
  1.3 Understanding Logs and Error Log Analysis using Big Data
    1.3.1 Role of Combiner in Map Reduce and Error Log Analysis
    1.3.2 Purpose of Log
  1.4 Recommender Systems
    1.4.1 Taxonomy for Recommender Systems
Chapter 2: Literature Review
  2.1 Big Data
    2.1.1 Why Big Data
    2.1.2 Characteristics of Big Data Platform
    2.1.3 Big Data Challenges
    2.1.4 Map Reduce Technique
    2.1.5 Architecture of Map Reduce
    2.1.6 Dealing with Failure
    2.1.7 Benefits of Map Reduce
    2.1.8 Pitfalls and Challenges in Map Reduce
  2.2 Basic Logging and Descriptive Analytics
  2.3 Predictive Analytics and Recommender System
Chapter 3: Problem Statement and Methodology
  3.1 Problem Statement
    3.1.1 Existing Systems in Theory
  3.2 Motivation
  3.3 Methodology
Chapter 4: Proposed Framework
  4.1 A Combiner Approach to Effective Error Log Analysis Using Big Data
    4.1.1 Role of Combiner in Map Reduce and Error Log Analysis
    4.1.2 Purpose of Log
  4.2 Effective Error Log Analysis Using Correlation
    4.2.1 Terminology Used
    4.2.2 Benefits of Using a Known Error Database (KEDB)
    4.2.3 The KEDB Implementation
    4.2.4 Importance of R in Data Analytics
  4.3 A Predictive Model for Error Log Analytics
    4.3.1 Information Collection Phase
    4.3.2 Explicit Feedback
    4.3.3 Implicit Feedback
    4.3.4 Hybrid Feedback
Chapter 5: Result Analysis
  5.1 Effective Combiner Approach to Error Log Analytics
    5.1.1 Input
    5.1.2 Output
  5.2 Effective Log Analysis using Correlation
    5.2.1 Descriptive Representation of Correlation between Parameters of Dataset
  5.3 A Predictive Model for Error Log Analytics
Chapter 6: Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
References

LIST OF ABBREVIATIONS

KEDB  Known Error Database
TDM   Term Document Matrix
WWW   World Wide Web
HDFS  Hadoop Distributed File System
SAP   Systems, Applications and Products
DOCS  Documents
LDTM  Large Document Term Matrix

Abstract

Recommender systems are software tools that tackle the problem of information overload by helping users find the items that are most relevant to them within an often unmanageable set of choices. To create these personalized recommendations for a user, the algorithmic task of a recommender system is usually to quantify the user's interest in each item by predicting a relevance score, e.g., from the user's current situation or personal preferences in the past. Predictive analytics is a kind of business analytics that enables predictions to be made about the probability of a particular event happening in the future, based on data from the past.
The concept of predictive analytics is widely applied in the departments of the most successful organizations, where it supports their decision-making process and helps achieve their goals of customer satisfaction and the proper delivery and monitoring of existing systems. These days, recommender systems are used in various domains to recommend items such as products on e-commerce sites, movies and music on media portals, or people in social networks. To judge the user's preference list, recommender systems proposed in past research often utilized explicit feedback, i.e., deliberately given ratings or like/dislike statements for items. In practice, however, in many of today's application domains of recommender systems this kind of information does not exist. Therefore, recommender systems have to rely on implicit feedback that is derived from the users' behaviour and interactions with the system. This information can be extracted from navigation or transaction logs. Using implicit feedback leads to new challenges and open questions regarding, for example, the huge amount of signals to process, the ambiguity of the feedback, and the inevitable noise in the data. The system we use for obtaining feedback for the recommendation system is called hybrid feedback, which is a combination of both implicit and explicit feedback techniques. This thesis explores some of these challenges and questions that have not been covered in previous research. It focuses on building a recommendation system for error log analytics. The thesis is divided into two parts. The first part deals with the importance of big data, map reduce and the types of logs available; descriptive analytics of the log dataset of company A is carried out. The second part focuses on building a recommendation system and the different techniques involved. In this thesis, we use the similarity of two vectors to build one.

Chapter 1: Introduction

This chapter introduces the basic concepts of big data, its architecture and the concept of map reduce. It also goes on to explain the role of the combiner in the map reduce approach. For analysis purposes, the dataset of company A is used and referred to. It also describes the types of logs available to us and the concept of error log analytics. Furthermore, it explains the recommendation system.

1.1 Introduction to Big Data

Big data has revolutionized commerce in the 21st century and changed the perspective of people towards data in general. The term "big" can only be given a relative definition, because what we describe as "big" today can become small in the times to come, but Big Data can always be defined as data which cannot be handled with the available resources and orthodox technology methods. With the continuous increase in data come varied challenges in the form of the different formats, representations and speeds at which data is generated. The orthodox mechanisms of processing data could handle only structured data in the form of tables, but with the advent of Big Data technology, unstructured, semi-structured and structured data can all be processed and handled with ease.
A definition of Big Data which everyone can agree upon is long overdue, and hence we can say that Big Data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information [1]. The forms of data that come from varied sources such as documents, e-mails, text logs and social media can be:

Structured data, also known as data coming from relational databases.
Semi-structured data, such as XML data.
Unstructured data, which comes from Word documents, PDF files, text files and media logs.

Big Data can be viewed as an opportunity, because such a huge volume of data, upon analysis, can open windows which are yet to be explored. The data is not measured in terms of gigabytes or terabytes but in terms of petabytes or exabytes. Applications involving Big Data can be transactional, such as Facebook, Twitter, YouTube and Photobox, or analytic. The data can also be incomplete or carry timestamps. The concept of big data can be easily described using five V's: Volume, Velocity, Variety, Veracity and Value.

1. Volume: It alludes to the humungous data generated every second: e-mails, Twitter messages, photos, video clips, sensor data and so on, produced and shared every second. We are not talking in terms of terabytes; the measurement ranges to zettabytes or brontobytes. This increase in data makes data sets too large to store and analyze using traditional database methodologies. Big data helps in analyzing this huge data by breaking it up, storing it at different locations and combining the pieces as and when needed.

2. Velocity can be defined as the speed of production and consumption of data. Examples include social media videos going viral in seconds, the speed at which online payments are processed and the speed at which we trade shares. Big data technology helps us analyze the data while it is being generated, without ever putting it into databases.

3. Variety is the data we use now. Previously we focused on structured data that is stored in the form of tables or in relational databases; examples include financial data such as sales by product or region. At present, 80% of the world's data is unstructured, and therefore cannot easily be put into tables, such as photos, video clips or social media updates. With big data technology we can now operate on different data types (structured and unstructured), including messages, social media conversations, photos, sensor data, video or voice recordings, and bring them together with more traditional, structured data.

4. Veracity is discussed in terms of the trustworthiness of the data. With many variations of big data, quality is compromised and data accuracy is lost, as in the case of Twitter posts with hash tags, abbreviations, typing errors and colloquial speech, which affect the reliability and accuracy of the content. With the advent of big data and analytics, we can now work with these types of data.

5. Value: We have access to big data, but unless we can turn it into value it is useless. It can easily be argued that Value is the most important V of Big Data.

Large volumes of data can be processed using the MapReduce technique. MapReduce is a processing technique for distributed computing based on Java. It usually divides the data into pieces which are processed in a parallel manner. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for giving tasks to the slaves, monitoring tasks and re-executing the failed tasks. The slaves complete the tasks as instructed by the master. Applications usually specify the input/output locations and supply map and reduce functions via implementations of suitable interfaces and/or abstract classes. These, and other job parameters, constitute the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and the configuration to the JobTracker, which then takes over the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client [3].

The concept of MapReduce works in two phases:

Mapper phase: In this phase the dataset is converted into key-value pairs.
Reduce phase: In this phase several outputs from the map task are combined to form a reduced set of tuples.

Hadoop is the most popular implementation of MapReduce because of its ease of availability, as it is an entirely open source platform for handling Big Data. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows the distributed processing of huge data sets across multiple computers using a simple programming model. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop has taken inspiration from Google's MapReduce and the Google File System (GFS).

The Hadoop Distributed File System (HDFS) is a distributed file system that provides fault tolerance and is designed to run on commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. Hadoop provides a distributed file system (HDFS) that is able to store data across thousands of servers, and a means of running jobs (Map/Reduce jobs) across those server machines, running the work near the data. HDFS has a master/slave architecture. Large data is automatically divided into blocks which are managed by different nodes in the Hadoop cluster.

Steps:
1. Every file is split into blocks, and these blocks are then processed by user-defined code into {key, value} pairs in the map phase.
2. The map functions are executed on distributed machines to generate output {key, value} pairs, which are then written to their respective local disks.
3. Each reduce function uses the HTTP GET method to pull the {key, value} pairs corresponding to its allocated key space.
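To make the map and reduce phases concrete, the following is a minimal, illustrative sketch (not code taken from this report) of a Hadoop job's map and reduce classes that count how many log lines carry each severity keyword (ERROR, WARNING, INFO and so on), the kind of summary used for error log analysis later in this report. The class names, the keyword list and the crude line-matching logic are assumptions made only for this example; Section 1.3.1 below discusses how a combiner fits between these two classes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (severity, 1) for every log line that mentions a known severity keyword.
public class SeverityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String[] SEVERITIES =
            {"FATAL", "ERROR", "WARNING", "INFO", "TRACE", "DEBUG"};
    private static final IntWritable ONE = new IntWritable(1);
    private final Text severity = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().toUpperCase();
        for (String s : SEVERITIES) {
            if (line.contains(s)) {            // crude keyword match; a real parser would be stricter
                severity.set(s);
                context.write(severity, ONE);  // key-value pair leaving the map phase
                break;
            }
        }
    }
}

// Reducer: sums the partial counts received for each severity key.
class SeverityReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // e.g. (ERROR, 5321)
    }
}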
1.2 Data Analytics

Table 1.1 Data Analytics and its Types

Descriptive, Predictive and Prescriptive Analytics Explained

Here is a guide to understanding, and then selecting, the right kind of analytics. With the humongous amount of data available to businesses these days regarding their supply chains, logs and servers in general, companies are turning to analytics solutions to extract meaning from these gigantic volumes of data and improve their decision-making processes. Predictive analytics gives companies the capability to analyze historical data so as to forecast what might happen in the future and optimize their efforts accordingly. Accomplishing this the right way and moving your systems to a data-driven approach is a great achievement.

Huge returns on investment can be enjoyed, as evidenced by companies that have lowered their operating costs, optimized their supply chains, increased revenues, or improved their customer service and product mix by using past data to analyze what is going to happen in the future. Comparing and understanding all these analytical options can be a baffling task. Fortunately, these analytical options can be categorized at a high level into three distinct types. No one type of analytical approach is better than another; as a matter of fact, they co-exist with and enhance each other. For a business to have a universal view of the market, and of how the company competes for its place and efficiency within that market, a robust analytic environment is required.

Descriptive analytics makes use of data aggregation and data mining to provide insight into the past and answer the question "What has happened?", that is, to find the problem lying in the system.

Predictive analytics uses statistical models and forecasting techniques to understand the future and answer the question "What could happen?", that is, what could happen in the future.

Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answer the question "What should we do?", that is, how to solve the problem.

Descriptive Analytics: Insight into the past

Descriptive analytics, or statistics, is exactly what the name suggests: it "describes", or summarizes, the raw data and converts it into something that is interpretable by humans. It is the analytics of the past and deals with the data already available to the company. The past refers to any point in time at which an event has occurred, whether one minute ago or one year ago. Descriptive analytics is insightful because it grants us an opportunity to learn from past behaviours and to understand how they might influence future outcomes. The majority of the statistical techniques we use fall into this department of analytics. Think about basic arithmetic such as sums, averages and percentage changes. Usually, the elemental data is an aggregate or count of a filtered column of data to which the basic math is applied. For all pragmatic purposes, there is an infinite number of these statistics. Descriptive statistics are useful to show things like total stock, average dollars spent per customer, stock left in inventory and year-over-year change in sales. Common examples of descriptive analytics are finance, inventory and customer reports that provide historical insights regarding the company's production, financials, operations and sales. Descriptive analytics is used when you need to fathom, at an aggregate level, how and at what level things are running in the company, and when you need a summary of how different processes are carried out in the company.

Predictive Analytics: Understanding the future

Predictive analytics gets its name from its root: the ability to "predict" occurrences in the future. This analytics is about understanding and predicting the future. Predictive analytics provides companies with actionable insights based on data. It provides the probability or likelihood of a future outcome. It is important to remember that no statistical algorithm can "predict" the future with 100% certainty. Organizations use these statistics to anticipate what might happen in the future, because the foundation of predictive analytics is based on probabilities.
These statistics collect the data from the system, and missing values are estimated based on the previously collected data. They combine historical data found in HR, POS, ERP and CRM systems to locate patterns in the data, and apply algorithms and statistical models to capture relationships between varied data sets that cannot easily be seen. Organizations use predictive analytics and statistics whenever they want to see what the future holds. Predictive analytics can be used throughout the organization, from identifying purchasing patterns to identifying trends in sales activities, and from forecasting customer behaviour to forecasting how an error that has occurred can be resolved. It also helps forecast demand for inputs from inventory, the supply chain and operations.

One of the most common applications most people are familiar with is the use of predictive analytics to produce a list of recommendations while shopping online, wherein, depending on the items you have looked at, a list of similar items is displayed beneath. These recommendations are used by sales and customer services to determine the probability of customers making an online purchase. Typical business uses include predicting what items customers will purchase together, forecasting inventory levels based upon a myriad of variables, and understanding how sales might close at the end of the year. Predictive analytics is used whenever you need to know something about the future, or to fill in the gaps in your information.

Prescriptive Analytics: Advice on possible outcomes

The comparatively new field of prescriptive analytics permits users to "prescribe" a number of different possible alternatives to a prediction and guides them towards a plausible solution. In a nutshell, this analytics is about providing advice. Prescriptive analytics is an attempt to quantify the impact of future decisions in order to advise on all plausible outcomes before the actual decisions are made. At its best, prescriptive analytics predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions. This analytics goes beyond what descriptive and predictive analytics suggest by recommending one or more plausible outcomes or courses of action. Typically it predicts multiple future outcomes and allows organizations to choose from a number of possible outcomes based upon their actions. Prescriptive analytics uses a combination of tools and techniques such as algorithms, machine learning, computational modeling procedures and business rules. These techniques are applied against input from many different data sets, including real-time data feeds, big data, and historical and transactional data. Prescriptive analytics is relatively complex to administer, and most organizations are not yet using it in their daily course of business planning. If and when implemented correctly, it can have a significant impact on how businesses analyze decisions and on the organization's bottom line. Big organizations are successfully using prescriptive analytics to optimize production, scheduling and inventory in the supply chain, making sure that deliverables reach customers on time and thereby increasing overall customer satisfaction. Use prescriptive analytics anytime you need to provide users with advice on what action to take.
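As a concrete, illustrative contrast between descriptive and predictive analytics (not code from this report), the sketch below first summarizes a week of daily error counts (descriptive) and then fits the straight line y = mx + c by least squares to forecast the next day's count (predictive), the same linear form referred to in Section 1.3. The data values are invented for the example.

public class ErrorTrend {
    public static void main(String[] args) {
        // Hypothetical daily error counts for days 1..7.
        double[] y = {120, 135, 128, 150, 162, 158, 171};

        // Descriptive analytics: summarize what has already happened.
        double sum = 0;
        for (double v : y) sum += v;
        double mean = sum / y.length;
        System.out.println("Average errors per day: " + mean);

        // Predictive analytics: fit y = m*x + c by least squares and forecast day 8.
        int n = y.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double x = i + 1;                  // day index as the independent variable
            sx += x; sy += y[i];
            sxy += x * y[i]; sxx += x * x;
        }
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
        double c = (sy - m * sx) / n;                           // intercept
        double forecast = m * 8 + c;                            // predicted count for day 8
        System.out.printf("Fitted line: y = %.2f*x + %.2f; forecast for day 8: %.1f%n",
                m, c, forecast);
    }
}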
1.3 Understanding Logs and Error Log Analysis using Big Data

Big data and error log analytics together make a very intriguing topic for research. In their paper, Souza and Katkar (2014) reviewed the types of logs available and then utilized the corresponding log information for the important business analytics functions of predictive analysis and classification. The categories of severity are Error, Warning and Info, which give the severity counts present in the error log file. The straight-line equation y = mx + c is used to predict the future severity value; the independent variable x consists of the influencing parameters for prediction, while y is the predicted value. In this paper, three more categories are discussed: Fatal, Trace and Debug. Katkar and Kasliwal (2014) explained the types of logs and their impact on systems, focusing particularly on web server logs, and gave various techniques to analyze them. According to them, data mining is used for finding expected patterns in that large set of log data using web mining. When used together, predictive analytics and data mining can make future predictions about web access more efficient. Bruckman (2006) explored the various types of log analysis, namely quantitative and qualitative analysis, to understand the relationship between them. Qualitative log analysis is generally done manually, whereas quantitative log analysis is done manually or is automated. Joshila et al. (2011) discussed the different types of logs available, namely the error log, access log, common log format (CLF), combined log format, multiple access logs and the status codes sent by servers, and combined the information from logs with web mining. In this paper, the error log is explored and analyzed.

1.3.1 Role of Combiner in Map Reduce and Error Log Analysis

A Combiner, also known as a semi-reducer, operates by taking the inputs from the Map class and passing the output key-value pairs on to the Reducer class. The main purpose of a Combiner is to summarize the map output records that share the same key. The result (a key-value collection) from the combiner is sent over the network to the actual Reducer task as input, thereby reducing the load on the Reducer. The Combiner class is used between the Map class and the Reduce class to minimize the volume of data transferred between Map and Reduce, since the output of the map task is usually large and the volume of data transferred to the reduce task is high.

Here is a brief summary of how a MapReduce Combiner works:
- A combiner does not have a predefined interface of its own; it must implement the Reducer interface's reduce() method.
- A combiner operates on each map output key.
- The combiner must have the same output key-value types as the Reducer class.
- A combiner can produce summarized statistics from a large dataset because it replaces the original map output.
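The points above can be made concrete with a small driver. The sketch below is an illustration, not code from this report: it wires a mapper and reducer of the kind shown in Section 1.1 into a Hadoop job and reuses the reducer class as the combiner, which is valid here because summing partial counts per severity key is associative and commutative. The class names SeverityMapper and SeverityReducer and the input/output paths are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeverityCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "error log severity count");
        job.setJarByClass(SeverityCountDriver.class);

        job.setMapperClass(SeverityMapper.class);
        job.setCombinerClass(SeverityReducer.class); // local aggregation before the shuffle
        job.setReducerClass(SeverityReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = HDFS input directory of raw logs, args[1] = output directory (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the combiner is a single extra line; everything else is a standard Hadoop job definition, and removing the setCombinerClass call leaves the final counts unchanged while increasing the volume of intermediate data shuffled to the reducer.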
1.3.2 Purpose of log

Ubiquitous to the study of online activities is the possibility of collecting log file data. It is plausible for the computer to trace every command typed by users, in some cases every stroke of the key. In cases where users interact only online, we can access a comprehensive record of their entire history of interactions. The completeness of the record and the ease of collecting it are unrivalled. However, log file data is more often collected than analyzed. The structure and type of log varies across applications. The types of log files generally maintained include:

1. Error logs: keep records of the types of errors and their time of occurrence, and help in the resolution of errors through back-tracing.
2. Web server logs: store the history of activities on the internet; new techniques such as clickstream mining use web server log data.
3. Console logs: the well-being of system applications is assessed through system or console logs.

1.4 Recommender Systems

The roots of recommender systems lie in the special needs of work in diverse fields: cognitive science [19], information retrieval [20] and economics [21]. Recommender systems emerged as an independent research area in the mid-90s, and their important role in enhancing data accessibility attracted the attention of both the academic and the industrial worlds. Recommender systems are a convenient way to broaden the scope of search algorithms, since they help users discover items they might not have found by themselves. A recommendation basically offers the user a list of items which match his preferences according to the things bought previously. There exist varied approaches to accumulating data about users: constant monitoring of their interaction, quizzing them about some actions, or asking them to fill in feedback forms that include personal information. The user's interaction with the system provides two types of information:

Implicit information: collected from the user's interaction and behaviour, for example by keeping track of the items the user has interacted with and item-related information such as the number of times viewed or item reproductions, or user-related viewing information such as group membership.

Explicit information: provided by the users every time they give an opinion about items, rating or liking some item; generally, all the information elaborated by the user consciously. It is based on the ratings given by the user.

The recommender system accumulates and analyzes both kinds of information to generate the user profile, consisting of items viewed, ratings and feedback forms. The profile stores information not only about the user's likes, but also about the user itself: current location, current personal needs, sex, age, professional position, and so on. The way it is used by the recommendation system varies a lot among the different systems. The information stored within is also a determinant factor in the recommender algorithm design.

1.4.1 Taxonomy for recommender systems

The categories into which a recommender system is divided describe the diverse models of abstraction for the user profile: how it is generated, how it is later maintained, and how it evolves as the system runs.

User profile representation: An accurate profile is important, since the success of the recommendation depends on how the system represents the user's interests. Listed next are some models applied in current recommender systems:

- History-based: Some systems keep a list of purchases, the navigation history or the content of e-mail boxes as a user profile. Additionally, it is also common to keep the relevant feedback of the user associated with each item in the history. The Amazon web site is a clear example.

- Vector-space: In the vector space model, items are represented by a vector of features, usually words or concepts, which are represented numerically as frequencies, relevance percentages or probabilities.

- Demographic: Demographic filtering systems create a user profile through stereotypes. Therefore, the user profile representation is a list of demographic features which represent the kind of user.
- User-item ratings matrix: Some collaborative filtering systems maintain a user-item ratings matrix as part of the user profile. The user-item ratings matrix contains historical user ratings of items. Most of these systems do not use a profile learning technique. Systems like Jamendo include this technique to represent the user profile.

- Classifier-based models: Systems using a classifier approach as a user profile learning technique elaborate a methodology to continuously monitor input data in order to classify the information. This is used in the case of decision trees, Bayesian networks and neural networks.

- Weighted n-grams: Items are represented as a net of words with weights scoring each link. The system is based on the assumption that words tend to occur one after another a significantly high number of times; it extracts fixed-length consecutive series of n characters and organizes them with weighted links representing the co-occurrence of different words. The structure therefore achieves a context representation of the words.

Initial profile generation:
- Empty: the profile is built as the users interact with the system.
- Manual: the users are asked to register their interests beforehand.
- Stereotyping: collecting user-related information like city, country, lifestyle, age or sex.
- Training set: providing the users with some items among which they should select one.

- Profile learning technique: the way the profile changes over time.
- Not needed: Some systems do not need a profile learning technique, either because they load the user-related information from a database or because it is dynamically generated.
- Clustering: the process of grouping information objects according to some common features inherent in their information context. User profiles are often clustered into groups according to some rule, to assess which users share common interests. Recommenders like Last.fm or iRate perform this technique [12].
- Classifiers: Classifiers are general computational models for assigning a category to an input. To build a recommender system using a classifier means using information about the item and the user profile as input, and having the output category represent how strongly to recommend an item to the user. Classifiers may be implemented using many different machine learning strategies, including neural networks, decision trees, association rules and Bayesian networks [1].
- Information retrieval techniques: When the information source has no clear structure, pre-processing steps are needed to extract relevant information which allows estimation of any information container's importance. This process comprises two main steps: feature selection and information indexing.
- Relevance feedback: The two most common ways to obtain relevance feedback are to use information given explicitly or to use information observed implicitly from the user's interaction. Moreover, some systems propose implicit-explicit hybrid approaches.
- No feedback: Some systems do not update the user profile automatically and, therefore, do not need relevance feedback; for example, all the systems which update the user profile manually.
- Explicit feedback: In several systems, users are required to explicitly evaluate items. These evaluations indicate how relevant or interesting an item is to the user, or how relevant or interesting the user thinks an item is to other users. Some systems invite users to submit information such as track playlists.
iRate uses this approach to provide its recommender with finer information about users' preferences.
- Implicit feedback: Implicit feedback means that the system automatically infers the user's preferences passively by monitoring the user's actions. Most implicit methods obtain relevance feedback by analyzing the links followed by the user, by storing a history of purchases or by parsing the navigation history.

Table 1.2 Types of Recommender Systems
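Under the vector-space representation described in Section 1.4.1, user profiles and items are feature vectors, and the abstract of this report mentions using the similarity of two vectors to build the recommender. A common choice for that similarity is the cosine of the angle between the vectors. The sketch below is illustrative only; the vectors hold invented term frequencies, and nothing here is a fixed part of the report's method.

public class VectorSimilarity {
    // Cosine similarity of two equally sized feature vectors: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;   // empty profile: treat as no similarity
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical term frequencies for two error descriptions or user profiles.
        double[] profileA = {3, 0, 1, 2, 0};
        double[] profileB = {2, 1, 0, 2, 0};
        System.out.printf("Cosine similarity = %.3f%n", cosine(profileA, profileB));
    }
}

A value close to 1 means the two vectors point in nearly the same direction (very similar profiles), while a value close to 0 means they share little or nothing.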
Chapter 2: Literature Review

This chapter presents the literature review of the proposed system. Here we also briefly explain the contributions of different researchers and their background work. It reviews the work done in the fields of big data, log analytics and recommendation systems.

2.1 Big Data

Big Data is a relatively new term that came from the need of big companies such as Yahoo, Google and Facebook to analyze big amounts of unstructured data, but this need can also be identified in a number of other big enterprises as well as in the research and development field. Data becomes Big Data when it outgrows the current ability to process it and cope with it efficiently. Such datasets have a size beyond the ability of typical database software tools to capture, store and manage. Big Data are those "data sets which continue to grow so much that it becomes difficult to manage them using existing database management concepts and tools; the difficulty can be related to data acquisition, storage, search, sharing, analytics and visualization" (Singh, S. and Singh, N., 2012). Oracle added a new characteristic for this kind of data, low value density, meaning that sometimes there is a very big volume of data to process before the valuable needed information is found (Garlasu, D., 2013). The following properties are associated with Big Data (Aminu, L.M., 2014):

1. Variety - Data is entirely dissimilar, consisting of raw, structured, semi-structured and even unstructured data.
2. Volume - The "big" in big data itself refers to the volume. At present, data volumes are in petabytes and are expected to rise to zettabytes in the near future.
3. Velocity - The notion which deals with the speed of the data coming from different sources.
4. Variability - It considers the inconsistencies of the data flow.
5. Complexity - To prevent data from getting out of control, it is necessary to link, match, cleanse and transform data coming from a variety of sources across systems.
6. Value - Users should be able to run queries against the stored data, abstract vital results from the filtered data obtained, and order it according to the magnitudes they need.

2.1.1 Why Big Data

Social networking websites generate new data every second, and handling such data is one of the major challenges companies are facing. Data stored in data warehouses causes disruption because it is in a raw format; proper analysis and processing have to be done in order to produce usable information out of it. Big Data can help to gain perspective and make better decisions. It presents an opportunity to create unprecedented business advantage and better service delivery. The concept of Big Data is going to change the way we do things today (Singh, S. and Singh, N., 2012). Big Data is the energy source of the present world. It refashions the future of global economics. The Big Data revolution changes the way of thinking in business. It affects decision making from the bottom up and from the top down. It speeds up discoveries and small predictions in daily activities (Sase, Y. S. and Yadav, P. A., 2014).

2.1.2 Characteristics of a Big Data Platform

The following basic features should be present in a Big Data offering (Singh, S. and Singh, N., 2012):
1. Comprehensive
2. Enterprise ready
3. Integrated
4. Open source based
5. Low latency reads and updates
6. Robust and fault tolerant
7. Scalable
8. Extensible
9. Allows ad hoc queries
10. Minimal maintenance

2.1.3 Big Data Challenges

The main challenges of Big Data are (Singh, S. and Singh, N., 2012):
1. Variety
2. Volume
3. Analytical workload complexity
4. Agility

Many organizations are straining to deal with increasing volumes of data. In order to solve this problem, organizations need to reduce the amount of data being stored and exploit new storage techniques which can further improve performance and storage utilization.

2.1.4 Map Reduce Technique

Today's very challenging problem is to analyze Big Data. For the effective handling of such massive data and applications, the map reduce framework has come widely into focus. Over the last few years, Map Reduce has emerged as the most popular paradigm for parallel, batch-style analysis of large amounts of data. It is a programming model initiated by Google's team for processing huge datasets in distributed systems. It is inspired by functional programming, which allows expressing distributed computations on massive amounts of data. It is designed for large-scale data processing, as it can run on clusters of commodity hardware. Map reduce is used in areas where the volume of data to analyze grows rapidly.

2.1.5 Architecture of Map Reduce (Maitrey, S. and Jha, C.K., 2015)

MapReduce is a technique that processes large multi-structured data files across massive data sets. It breaks the processing into small units of work, which can be executed in parallel across several nodes; as a result, very high performance is achieved. Programs written in this functional style are automatically parallelized and can be executed on a large cluster of commodity machines. The series of steps in its working are:

Step 1: The input file is read and then split into multiple pieces.
Step 2: These splits are then processed by multiple map programs running in parallel.
Step 3: The Map Reduce system takes the output from each map program and merges (shuffles/sorts) the results for input to the reduce program.

Technically, all inputs to Map tasks and all outputs from Reduce tasks are in key-value pair form. Usually the keys of input elements are not relevant, so in such conditions we simply ignore them. A plan for execution in MapReduce is determined entirely at runtime. The MapReduce scheduler utilizes speculative and redundant execution: tasks on straggling nodes are redundantly executed on other idle nodes that have finished their assigned tasks. Map and Reduce tasks are executed with no communication between other tasks. Thus, there is no contention arising from synchronization and no communication cost between tasks during an MR job execution. The figures below show: a) the simplified use of MapReduce, and b) MapReduce with combiners and partitioners.

Figure 2.1 Simplified use of MapReduce
Figure 2.2 MapReduce with partitioners and combiners
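Figure 2.2 also refers to partitioners. A partitioner decides which reduce task receives each intermediate key; Hadoop's default HashPartitioner simply hashes the key, but a custom one can be supplied. The sketch below is purely illustrative, not taken from the cited papers or from this report: it routes FATAL and ERROR keys to reducer 0 and hashes the remaining severities across the other reducers.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes high-severity keys to partition 0; all other keys are spread over the remaining reducers.
public class SeverityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String severity = key.toString();
        if (numPartitions == 1) {
            return 0;
        }
        if (severity.equals("FATAL") || severity.equals("ERROR")) {
            return 0;
        }
        // Hash the remaining keys into partitions 1 .. numPartitions-1.
        return 1 + (Math.abs(severity.hashCode()) % (numPartitions - 1));
    }
}
// Enabled in the driver with: job.setPartitionerClass(SeverityPartitioner.class);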
MapReduce runs in a cluster of nodes: one node acts as the master node and the other nodes act as workers. Worker nodes are responsible for running map and reduce tasks; the master is responsible for assigning tasks to idle workers. Each map worker reads the content of its associated split, extracts key/value pairs and passes them to the user-defined Map function. The output of the Map function is buffered in memory and partitioned into a set of partitions equal to the number of reducers. The master notifies the reduce workers to read the data from the local disks of the map workers. The results or output of the reduce function are appended to output files. Users may use these files as input to another MapReduce call, or use them for another distributed application (Elsayed, A. et al, 2014).

Map Reduce over Traditional DBMS (Maitrey, S. and Jha, C.K., 2015)

Traditional DBMSs have adopted strategies which are not appropriate for solving extremely large-scale data processing tasks. There was a need for special-purpose data processing tools that could be adapted for solving such problems. While MapReduce is referred to as a new way of processing Big Data, it has also been criticized as a "major step backwards" in parallel data processing in comparison with the DBMS. MapReduce increases the fault tolerance of long-running analysis through numerous checkpoints of completed tasks and data replication; however, the frequent I/O required for fault tolerance reduces efficiency. A parallel DBMS aims at productivity rather than fault tolerance. A DBMS actively exploits pipelining of intermediate results between query operators; however, this carries the potential danger that a large number of operations need to be redone when a failure happens. Also, a DBMS generates a query plan tree for execution, whereas a plan for execution in MapReduce is determined entirely at runtime. MapReduce is a simple and efficient tool for query processing. The increasing interest in and popularity of MapReduce has led some relational DBMS vendors to support MapReduce functions inside the DBMS; the Teradata Aster Database is an example of a product that supports MapReduce.

2.1.6 Dealing with Failure

MapReduce is designed to run on hundreds or thousands of commodity machines, and therefore it must tolerate machine failure. The failure may occur in the master node or in the worker nodes. In case of master failure, all MapReduce tasks are aborted and have to be redone after a new master node is assigned. On the other hand, to track worker failure, the master monitors all workers periodically, checking worker status. If a worker does not respond to the master's ping within a certain amount of time, the master marks the worker as failed. In case of failure of a map task worker, any map tasks either in progress or completed by that worker are reset back to their initial idle state and will be assigned to another worker. In case of failure of a reduce task worker, any task in progress on the failed worker is assigned to an idle worker. The output of completed reduce tasks is stored in the global file system, so completed reduce tasks do not need to be re-executed. On the other hand, the output of map tasks is stored on local disks, so completed map tasks must be re-executed in case of failure (Elsayed, A. et al, 2014).

2.1.7 Benefits of MapReduce

The following are the advantages of MapReduce (Elsayed Abdelrahman, 2014 and Kyong Ha Lee, 2011):

1. Simple and easy to use - The MapReduce model is simple but expressive. With MapReduce, a programmer defines his job with only Map and Reduce functions, without having to specify the physical distribution of his job across the nodes.
2. Flexible - MapReduce does not have any dependency on a data model or schema. With MapReduce, a programmer can deal with sporadic or unstructured data more easily than with a DBMS.
3. Independent of the storage - MapReduce is basically independent of the underlying storage layer. Thus, MapReduce can work with different storage layers.
4. Fault tolerance - MapReduce is highly fault tolerant. It is reported that it can continue to work in spite of an average of 1.2 failures per analysis job at Google.
5. High scalability - MapReduce has been designed in such a way that it can scale up to large clusters of machines. It supports runtime scheduling, which enables the dynamic acquisition of resources during job execution, hence offering elastic scalability.
6. Supports data locality.
7. Reduces network communication cost.
8. Ability to handle data for heterogeneous systems - Since MapReduce is storage independent, it can analyze data stored in different storage systems.

2.1.8 Pitfalls and Challenges in MapReduce

The following are the pitfalls of the MapReduce framework compared to a DBMS (Lee, K.H., et al, 2011):

1. No high-level language support like SQL in a DBMS, and no query optimization technique until 2011.
2. MapReduce is schema free and index free. An MR job can work right after its input is loaded into its storage.
3. A single fixed dataflow, which does not support algorithms that require multiple inputs. MapReduce is primitively designed to read a single input and generate a single output.
4. Low efficiency - With fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. In addition, Map and Reduce are blocking operations: a transition to the next stage cannot be made until all the tasks of the current stage have concluded. Also, MapReduce has a latency problem that comes from its inherent batch-processing nature; all of the inputs for an MR job must be prepared in advance for processing.
5. Very young compared to the 40 years of the DBMS.

The two major challenges are (Maitrey, S. and Jha, C.K., 2015):

1. Due to frequent checkpoints and runtime scheduling with speculative execution, MapReduce shows low efficiency. Thus, how to increase productivity while guaranteeing the same level of scalability and fault tolerance is a major challenge. The efficiency problem is expected to be overcome in two ways: improving MapReduce itself or leveraging new hardware.
2. The second challenge is how to efficiently manage resources in clusters which can be as large as 4,000 nodes in a multi-user environment, and how to achieve high utilization of MR clusters.

APACHE HADOOP

MapReduce, which has been popularized by Google, utilizes the Google File System (GFS) as an underlying storage layer to read input and store output. GFS is a chunk-based distributed file system that supports fault tolerance by data partitioning and replication. We proceed with our explanation using Hadoop, since Google's MapReduce code is not available to the public due to its proprietary use. Hadoop is an open source Java implementation of MapReduce. Other implementations, such as DISCO written in Erlang, are also available but are not as popular. Hadoop consists of two layers: a data storage layer called the Hadoop Distributed File System (HDFS) and a data processing layer called the Hadoop MapReduce framework. HDFS is a block-structured file system managed by a single master node, like Google's GFS. Large data is automatically split into blocks which are managed by different nodes in the Hadoop cluster.
Figure 2.3 shows the architecture of Apache Hadoop.

Figure 2.3 Hadoop Master-Slave Architecture

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes.

A Hadoop cluster comprises a single master node and multiple slave or "worker" nodes. The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data or at least are in the same rack. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and shuffle operations) from a JobTracker. The master node acts as the NameNode and the JobTracker, whereas a slave or worker node acts as both a DataNode and a TaskTracker. In a larger cluster, HDFS is handled through a dedicated NameNode server that hosts the file system index, and a Secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file system corruption and reducing loss of data. All the blocks of a file are of the same size except the last block. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all the decisions regarding the replication of blocks. It periodically receives a heartbeat and a block report from each DataNode in the cluster. Receipt of the heartbeat implies that the DataNode is functioning properly; a block report contains a list of the blocks on a DataNode.

The input data is taken by the NameNode and divided into splits for the map phase. Map runs for each split. The TaskTracker then retrieves key/value pairs from the data chunk and assigns them to mappers. The output key-value pairs of the map function are sorted, stored locally and passed to the reducers as input data through HTTP. Once the reduce function has finished, the final result is handed over to HDFS through the network. The HDFS client is the third major category in the system architecture. HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. Client nodes have Hadoop installed with all the cluster settings, but are neither a master nor a slave. Instead, the role of the client machine is to load data into the cluster, submit MapReduce jobs describing how the data should be processed, and then retrieve or view the results of the job when it is finished.
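As an illustration of the HDFS client role described above, the sketch below writes a small log file into HDFS and reads it back through the org.apache.hadoop.fs API. The NameNode address, paths and sample log line are placeholders chosen for the example, not values from this report.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml on a client node.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path logPath = new Path("/logs/app/error.log");

        // Write: HDFS files are write-once, so create() opens a brand-new file.
        try (FSDataOutputStream out = fs.create(logPath, true)) {
            out.writeBytes("2018-07-01 10:15:32 ERROR PaymentService timeout\n");
        }

        // Read the file back, line by line.
        try (FSDataInputStream in = fs.open(logPath);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}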
Jyoti Nandimath et al. (2013), in the paper Big Data with Apache Hadoop, concluded that Hadoop applications perform operations on Big Data in optimal time and produce output with minimum utilization of resources. Tapan P. Gondaliya and Dr. Hiven D. Joshi also conclude in their paper that Apache Hadoop is the best solution to the Big Data problem; they also provide a brief introduction to the components built over Hadoop, such as Apache Hive, Apache Pig, Apache Mahout, Apache HBase, Apache Sqoop and Apache Flume. Aditya B. Patel (2012) presents a paper in which various experiments using Apache Hadoop are carried out. He concluded that the results obtained from the various experiments indicate that it is favourable to use Apache Hadoop for Big Data analysis, and that their future work will focus on the evaluation and modelling of Hadoop data-intensive applications on cloud platforms like Amazon EC2.

2.2 Basic Logging and Descriptive Analytics

This section promotes the need for effective logging techniques for interactive data analysis systems, for purposes such as: describing the exploration process, implementing intelligent user interfaces so as to create recommendation systems using predictive analysis, evaluating analysis tools and interfaces, and gaining insight into the analysis ecosystem as a whole. These topics are not only of great interest to researchers who long to understand them in depth, but also immensely valuable to industry patrons, who can use this information to design products in a manner that makes them better suited to the needs of users.

Implementing intelligent user interfaces: Clippy, the Microsoft Office Assistant, is one example of an intelligent user interface. Such interfaces assist the user by taking on some of the complexity of working with the tool at hand, often by automated means. Other examples described in different papers are adaptive or adaptable interfaces (Andrea Bunt et al, 2007), predictive interfaces (Swapna Reddy et al, 2009) and mixed-initiative interfaces (Eric Horvitz, 1999), as well as automated user assistants (Pattie Maes, 1994). Automated interfaces usually rely heavily on statistical models of user behaviour and thus require accurate accounting of user actions at a level that corresponds to the variables being modeled. The predictive systems created rely heavily on previously accumulated data, as mentioned in the section above, for example a KEDB used to predict and model a user's behavioural pattern. As another example, Wrangler has a mixed-initiative interface that gives suggestions to assist users in cleaning their data, based on the frequencies of user actions, as explained by Qui Guo et al (2010) in their paper. Wrangler was originally built on a transformation language with a small number of operators. While identifying this list of transforms and pairing them with the interface gestures mentioned, the authors were able to capitalize on their extensive hands-on experience, and prior work on languages for data cleaning helped them in creating such a system. However, for data exploration purposes, rather than data cleaning, it is not clearly established which set of transforms and visualizations should be supported and used in order to get the desired results. Related previous work on the topic has relied largely on the intuition and experience of the author with particular situations and patterns encountered before to determine what actions to support in which situations (Robert St. Amant et al, 1998). However, these situations could be better determined by having detailed activity records from data exploration and visualization tools with direct manipulation interfaces, logged at an appropriate level of granularity (David Gotz et al, 2009), rather than relying purely on the instinct and gut feeling of one particular individual who might hold expertise in the area. This problem can be rectified by evaluating analysis tools and interfaces with various data sets to check their level of prediction.
More specifically, researchers and industry patrons re-evaluate interfaces on various parameters to understand user behaviour, performance, thoughts and experience, by contrasting and comparing design alternatives, computing usability metrics, and certifying conformance with standards, as mentioned in the paper (David M. Hilbert et al, 2000). To accomplish the targets set by using events logged from current UI systems, researchers have invented a wide variety of techniques, ranging from synchronizing data gathered from the different sources available to them and then transforming, comparing, summarizing and visualizing event streams, to abstracting low-level log events into high-level modeled events which help in predictive analysis. A substitute for these automated techniques is to perform a task in a carefully controlled laboratory environment, or to focus on long-term studies of specific tools in an isolated environment (Youn Ah Kang et al, 2011). In practice, these studies involve watching videos of study subjects performing a task, questioning the subjects about their experience, and evaluating how well they performed the task in the environment created for them. While such research is immensely valuable, some disadvantages of these techniques are that they do not scale well, they generate results that are not amenable to comparison or combination with data from other studies, and the process of recording the data is too open to subjective interpretation. The results are based on previous data, so they do not always align with the theoretical studies cited. High-quality, automatically logged interaction data would circumvent each of these problems, although at the expense of missing the big picture that these techniques provide.

Understanding the analysis ecosystem: In addition to improving individual tools and interfaces, developers and researchers want to understand the entire data analysis pipeline. In practice, users leverage multiple tools to explore and visualize their data, depending on their needs. For example, a data scientist might use Hadoop and R for statistical work.

Summary of Basic Logging Techniques

Here we restate the basic types of information that should be logged.

Event: The smallest unit of information that is stored in a log is most commonly referred to as an event, even if it is not generated by an event-driven program (although graphical user interfaces and other interactive programs usually are event-driven). An event in a log is a piece of information that is recorded any time the work or application the user is interested in is run on the system. Work of interest may consist of functions called, queries run, GUI trigger handlers, threads executed, and so on. The kind of information which is logged for each event, and the format it appears in, varies across different perspectives and applications; it may include information such as function parameters, execution durations, caller, source code location, timestamp and severity of error. Such events are customarily logged for debugging and performance monitoring purposes. Below we discuss specifically what types of events and associated information should be logged for user modeling.

1. User ID: Ideally, each event should be relatable to information about the user responsible for triggering it, in the sense of identifying which interaction with which part of the application caused the event.
For some events, the user responsible may be the system itself, for example in the case of garbage collection. In general, attributing causality is not trivial, but for the events of interest for user modeling it should be straightforward.

2. Timestamps: Events should always be associated with a timestamp that describes the date, time, and time zone information. Timestamps are vital for understanding the order and rate of events, but they are not always reliable and accurate indicators of when an event truly occurred. This is often not a problem when dealing with logs from a single machine, but it can be extremely challenging to deal with in a distributed setting.

3. Version and configuration: It is crucial to provide some information that ties each recorded event to metadata about the version and configuration of the interface that generated that event. This is paramount because exactly what information is logged, and the format it is logged in, tends to change across versions and configurations. Without this information, it can become unnecessarily difficult to parse the logged data, and ambiguities may be introduced. Ideally, even changes to minor details of the interface would be versioned, to facilitate A/B testing.

4. Open time and close time: Events should be accompanied by opening and closing times so as to help predict the nature and severity of the damage that occurred while the event was live.

2.3 Predictive Analytics and Recommender Systems

Analytics is related to "the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and add value" (Davenport and Kim, 2013, p. 3). According to the differences in the analytical methods used, as well as the objective they are being used for, analytics can be divided into the following categories: descriptive, prescriptive and predictive.

Descriptive analytics can be referred to as reporting, and it describes a certain phenomenon of interest. It incorporates the actions of gathering, organizing, tabulating and depicting data, and even though it is useful for decision makers in the context of an organization, it does not provide details about why a certain event occurred, nor is it able to say what could happen in the future (Davenport and Kim, 2013; Delen and Demirkan, 2013; Song et al., 2013).

Prescriptive analytics, on the other hand, is related to making suggestions about a certain set of actions and includes methods of experimental design and optimization (Davenport and Kim, 2013; Sharda et al., 2013; Song et al., 2013). Experimental design demonstrates the reasons why a phenomenon occurs by making experiments in which independent variables are manipulated and extraneous variables are controlled, so that conclusions can be drawn which result in actions the decision maker should practice. Optimization as a technique suggests balancing the level of a certain variable relative to other variables, thus identifying its ideal level, which serves as a recommendation for the decision maker (for example, identifying the ideal price of a product to be sold, the ideal level of supplies to be kept in inventory, or the right quantity of a particular order to be made) (Davenport and Kim, 2013).
Finally, predictive analytics is about determining the events which would materialize in the future with a certain likelihood (Brydon and Gemino, 2008; Boyer et al., 2012; Davenport and Kim, 2013; Delen and Demirkan, 2013; Schmueli and Koppius, 2010; Sharda et al., 2013; Siegel, 2013). Predictive analytics "go beyond merely describing the characteristics of the data and the relationships among the variables (factors that can assume a range of different values); they use data from the past to predict the future" (Davenport and Kim, 2013, p. 3). In order to make it clearer what predictive analytics refers to, what events are being predicted and what results are being achieved, Davenport and Harris (2007) place predictive modeling and analytics within the domain of BI&A based on two dimensions: the degree of intelligence and the competitive advantage it gives to the organizations that are using it (Figure 2.4).

Figure 2.4 Competitive Advantage v/s Degree of Intelligence

Predictive analytics refers to the "building and assessment of a model aimed at making empirical predictions" in the context of quantitative empirical modelling (Shmueli and Koppius, 2010, p. 555). That incorporates empirical predictive models (statistical models such as data mining algorithms, for instance) which predict future scenarios, and evaluation methods assessing the predictive power of a model. What predictive analytics does is pinpoint relationships between the variables and then, based on those relationships, predict the likelihood of a certain event occurring. Despite the predictive purpose the relationships between data are used for, explicit cause-effect relationships are not expected or assumed to be present in the data (Davenport and Kim, 2013). Empirical modelling for explanation refers to statistical models that are used for "testing causal hypotheses that specify how and why certain phenomena occur" (Shmueli and Koppius, 2010, p. 554). That also includes explanatory statistical models for testing hypotheses (like regression models, common in IS research and social sciences in general) and methods for evaluating the explanatory power of the model (various statistical tests for the strength of relationships). Shmueli and Koppius (2010) point out the existence of a large debate about the difference between explaining and predicting, and their research explains the differences between these two terms in five different respects: analysis goal, variables of interest, model building optimized function, model building constraints and model evaluation (Table 2.1).

Table 2.1 Explanatory Statistical Modelling and Predictive Analytics according to Shmueli and Koppius (2010)

Chapter 3: Problem Statement and Methodology

This chapter focuses on the existing systems present to handle big data and predictive analytics. It also explains the motivation behind developing a recommendation system for error log analytics. The methodology used for creating such a system is also elaborated in this chapter.

3.1 Problem Statement: Need for Predictive Error Log Analytics Using Big Data

3.1.1 Existing systems in theory: Agile systems are now the norm, so how do we support their constantly changing demands? System logs are needed to identify potential security issues and network failures.
If the work is in a highly standardized environment such as financial analysis, legal advisory, or government offices or websites, log data is the need of the hour for regular audits and compliance reports. In the e-commerce business, user logs give useful insights into the data that help provide a better user experience and better conversion. There are generally two common log types:

Event logs or error logs – These provide an extensive view of how your system and the components associated with it are performing at any point in time, in normal as well as high-pressure situations. Whether the servers are running fine, or whether there are network failures and abnormalities in your network, all these types of errors are maintained in error logs.

User logs – These logs focus on building an intimate understanding of online user behavior, such as what users explored on the website, which links were used the most, or which products were added to the favorites list, keeping track of the buyer's profile in your logs for analytics and prediction purposes. Analyzing raw user logs allows a more controlled approach, high accuracy, and transparency in introspecting user activities beyond the statistics provided by standard web analytics services like Google Analytics or Omniture.

With huge amounts of data extending to terabytes, even petabytes, it is next to impossible for existing log analysis software to promptly and precisely apprehend patterns, point towards trends and give predictions. In the absence of an efficient and programmed process to give insights into this humongous data, organizations would face the consequences of dumping valuable data into an unrefined "data lake," and eventually lose the profits and competitive advantages that these data insights are capable of providing. We developed a unique approach to search and a data analytics approach for making the best use of log data.

Existing System to Navigate and Analyze Logs with Big Data and Search

Figure 3.1 Existing Big Data Architecture for Log Analytics

Abundant big data applications, robust in nature, for log analytics have helped numerous organizations avoid the loss of highly valued data and avoided the data being dumped into the "data lake." These applications are backed by Hadoop's processing power, machine learning algorithms, the predictive analytics capabilities of R, and advanced search capabilities. A big data enabled log analytics platform:

Accumulates the data from different sources and stores raw, unprocessed and unstructured log files from multiple business systems (often hundreds of GB daily).

Loads the data through buffers for cleaning and processing.

Sends it into a log analytics stack for query parsing, search indexing, and trend visualization.

Enables developers to perform robust and prompt analysis of user trends, clustering, market trends, and to improve error handling techniques.

3.2 Motivation

System logs, especially error logs, provide a peek into the state of a running system. Instrumentation occasionally generates short messages that are collected in a system-specific log. The content and format of the logs can vary widely from one system to another and even among components within the same system. For example, a USB driver might generate messages indicating that it had trouble communicating with the device, while a web server might face problems in fulfilling the client request and loading the requested page.
The content of the logs is diverse in nature, and so are the uses. The log from a printer or USB drive might be used for troubleshooting, while a server log is more commonly used to study traffic patterns to maximize the revenue from advertising. Undoubtedly, a single log can be used for various purposes: information about the traffic along different network paths, called flows, might help a user improve network performance or detect a malicious intrusion; or call-detail records can help monitor the caller and receiver details in case of a crime investigation, and upon further analysis can reveal call volumes and drop rates within entire cities. This paper provides an overview of some of the most common applications of log analysis, describes some of the logs that might be analyzed and the methods of analyzing them, and elucidates some of the lingering challenges. Log analysis is a rich field of research with a high impact on the running ability of the system built. We intend to provide a clear understanding of why log analysis is both vital and difficult.

1. Debugging

Many logs are intended to facilitate debugging. As Brian Kernighan wrote in Unix for Beginners in 1979, "The most effective debugging tool is still careful thought, coupled with judiciously placed print statements." Although programs today are orders of magnitude larger and more complex than they were 30 years ago, many people still use the old logging technique of printing to the console or local disk and use some combination of manual inspection and regular expressions to locate specific messages or patterns.

The simplest and most common use for a debug log is to grep for a specific message. If it is believed that an application crashed due to abnormalities in the network behavior, then the person in charge should try to locate a "connection dropped" message in the server logs. In most cases, it is problematic to figure out what kind of error to look for in the logs, as there is no well-defined mapping between log messages and observed symptoms. For example, when a service suddenly becomes slow, the person operating it is unlikely to see an obvious error message saying, "ERROR: The service latency increased by 10% because bug X, on line Y, was triggered." Instead, users often perform a search for severity keywords such as "error" or "failure." Such severity levels are often used in a haphazard manner, because a developer rarely has complete knowledge of how the code will ultimately be used and in what scenario. Moreover, red-herring messages, such as a "no error detected" message, may contaminate the result set with non-consequential events. Consider the following message from the BlueGene/L supercomputer:

YY-MM-DD-HH:MM:SS NULL RAS BGLMASTER FAILURE ciodb exited normally with exit code 0

The severity word FAILURE is not helpful, as this message may be generated during non-failure scenarios such as system maintenance. When a developer codes the print statement of a log message, it is bound to the context of the program source code. The content of the message, however, often excludes this context. Without knowledge of the code surrounding the print statement or of what led the program onto that execution path, some of the semantics of the message may be lost; that is, in the absence of context, log messages can be difficult to understand.

2. Performance

Log analysis, if done in a correct manner, is able to enhance or debug system performance.
Getting insights into a system's performance is commonly associated with understanding how the resources in that system are utilized. Some logs are the same as those used in the case of debugging, such as logging lock operations to debug a bottleneck. Other logs are used to track the use of individual resources, producing a time series of resource usage. Resource-usage statistics often come in the form of cumulative use per time period (e.g., b bits transmitted in the last minute). Bandwidth may also be used as a criterion to characterize network or disk performance, page swaps can be used to represent memory effectiveness, and CPU utilization to characterize load-balancing quality.

As seen in the case of debugging logs, performance logs must also be interpreted correctly. Two types of context are especially useful in performance analysis: the environment of the system in which the application is running and the workload of the system. Performance problems are usually caused by interactions between components, and to reveal that such interactions have taken place, you have to combine information from heterogeneous logs generated by multiple sources. Combining this information can be challenging. In addition to heterogeneous log formats, components in distributed systems may disagree on the exact time, making the precise ordering of events across multiple components next to impossible to reconstruct. Also, an event that is harmless to one component (e.g., a log flushing to disk) might cause serious problems for another (e.g., because of I/O resource contention). As the component causing the problem is unlikely to log the event, it may be hard to capture this root cause. These are just a few of the difficulties that emerge.

3.3 Methodology

The complete task of developing a predictive model has been divided into 3 phases or papers.

Paper 1: The aim of paper 1 was to justify the need for Hadoop and big data, illustrating the advantages of using MapReduce over normal methods of data handling. It gave a combiner approach to error log analysis using big data. This approach to handling error logs specifically has been recommended as it saves execution time as compared to the normal MapReduce approach.

Paper 2: Paper 2 focused on the descriptive analysis of the log dataset of company A at hand. It finds correlations between various parameters of the dataset to give a deeper understanding and to lead towards a statistical approach of linear regression for interpreting relationships amongst the various parameters at hand.

Paper 3: Paper 3 gives a recommendation system based on past log descriptions and how they were handled, focusing on the concept of developing term document matrices and finding the cosine similarity between the description of a new incoming log and the past data present with us (a small illustrative sketch of this idea is given just after the chapter introduction below).

Figure 3.2 Structure for creation of Recommendation System

Chapter 4: Proposed Framework

This chapter describes the entire workflow required for building the recommendation system, breaking the entire task into 3 papers connected to each other so that the ultimate aim of getting the desired system is achieved.
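To make the idea behind Paper 3 concrete before the detailed workflow, here is a minimal R sketch of the term document matrix and cosine similarity step. It assumes the tm package is available; the group names and sample log descriptions used here are illustrative stand-ins, not the actual company A data. Cosine similarity between two term-frequency vectors is their dot product divided by the product of their Euclidean norms, so a value closer to 1 means the new description shares more vocabulary with that group.

# Minimal sketch (illustrative data, not the actual company A logs): build a
# term document matrix over the pooled descriptions of each group plus the
# new log, then score the new log against each group with cosine similarity.
library(tm)

groups <- list(
  web       = c("web server timeout while loading page", "http error on web portal"),
  ecommerce = c("payment gateway error during checkout", "cart service timeout"),
  custom    = c("sql server backup job failed", "custom batch job error in TDP backup")
)
new_log <- "Error in sql server application service TDP backup resolve fast"

# One document per group (all of its descriptions pooled) plus the new log.
docs   <- c(sapply(groups, paste, collapse = " "), new = new_log)
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

m <- as.matrix(TermDocumentMatrix(corpus))   # rows = terms, columns = documents
colnames(m) <- names(docs)

# Cosine similarity: dot product over the product of the vector norms.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sims <- sapply(names(groups), function(g) cosine(m[, g], m[, "new"]))
print(sims)                     # similarity of the new log with each group
print(names(which.max(sims)))   # predicted group for the new error log

In the actual pipeline the same comparison is run against the pooled historical descriptions of each track (several thousand records per group, as reported in Chapter 5), and the winning group determines the team and the expected resolution time assigned to the new error log.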
The entire work flow was divided into 3 parts consisting of 3 papers: Figure 4.1 Flow of Work Done 4.1 A Combiner Approach to Effective Error Log Analysis Using Big Data 4.1.1 Role of Combiner in Map Reduce And Error Log Analysis Role of Combiner in Map Reduce A Combiner, also known as a semi-reducer, operates by taking the inputs from the Map class and henceforth passing the output key-value pairs to the Reducer class. The main purpose of a Combiner is to summarize the map output records with the same key. The result (key-value collection) from the combiner will be sent over the network to the actual Reducer task as input, thereby reducing the load at the Reducer. The Combiner class is used in between the Map class and the Reduce class to minimize the volume of data transfer between Map and Reduce. Usually, the output of the map task is large and the data transferred to the reducer task is high. Here is a brief summary on how MapReduce Combiner works: • A combiner does not have a default interface and it must implement the Reducer interface’s reduce() method. Page 43 • A combiner operates on each and every map output key. It is a must for the combiner to have the same output key-value types as the Reducer class. • A combiner can produce summarized statistics from a large dataset because it replaces the original Map output. 4.1.2 Purpose of log Ubiquitous to the study of online activities is the possibility of collecting log file data. It is plausible for the computer to trace every command typed by users—in some cases, every stroke of the key. In cases where users interact only online, we can access a comprehensive record of all of their history of interactions. The completeness of the record and ease of collecting it are unrivalled. However, log file data is more often collected than analyzed. The structure and type of log varies with different applications. Types of log files generally maintained include: 1. Error logs: Keep the records of types of errors and time of occurrence. Helps in the resolution of errors due to back tracing. 2. Web server logs: History of activities on the internet stored. New techniques such as clickstream mining use web server log data 3.Console logs: Wellbeing of system applications assessed through system or console logs. Main focus of this paper is on Error Log Analytics. It manages the regeneration of data from semi -structured to a uniform structured format, in order to provide base for analytics .Business Intelligence (BI) functions such as Predictive Analytics are used to predict and forecast the future status of the application on the basis of the current scenario. Proactive measures can be taken rather than responsive measures in order to ensure efficiency in maintenance of the applications and the devices. There are 2 types of Log files : 1. Access Log Page 44 2. Error Log. This paper explores the Analytics of Error logs. Error Log records all the details such as Timestamp, Severity, Application name, Error message ID, Error message details. Error Log is a file that is created during data processing to hold data that is known to contain errors and warnings. It is usually printed after completion of processing so that the errors can be redressed. Error logs are always found in a heterogeneous format which looks something like this. Error logs contain the parameters such as: -Timestamp (When the error was generated). - Severity (Mentions if the message is a warning, error, emergency, notice or debug). - Name of application that generated the error log. 
- Error message ID.
- Error log message description.

INPUT: Error log dataset

Approach for MapReduce:
1. First, a partitioner is created for the dataset which divides the errors into 5 categories, namely: INFO, FATAL, DEBUG, TRACE and WARN.
2. The mapper and reducer functions are then run on different amounts of data to analyze the time spent by the CPU on the task.
3. To reduce the time spent, an additional COMBINER function is added before the reducer, thereby reducing the load on the reducer function and decreasing the CPU time taken to perform the same task.

4.2 Effective Error Log Analysis Using Correlation

Log analysis is the process of reconditioning raw log data into useful information for suggesting solutions to existing problems. The market for log analysis software is huge and growing as more business insights are obtained from logs. Stakeholders in this industry need accurate, quantitative data about the log analysis process to identify inefficiencies, streamline workflows, predict tasks, design high-level analysis languages, and spot outstanding challenges. For these purposes, it is imperative to understand log analysis in terms of discrete tasks and data transformations that can be measured, evaluated, correlated, and predicted, rather than through qualitative portrayals and experience alone. One problem is that logged system events are not an excellent representation of human log analysis activity. Logging code is typically not designed to capture human behavior at the most efficacious level of granularity. Even if it were, recorded events may not reflect internal mental activities. The goal of this paper is to find correlations and use descriptive analysis on the log dataset of company XYZ.

The log dataset analyzed in this paper contains the following parameters:
1. Track: The project under which the error happened.
2. Incident ID: The ID of the incident.
3. Priority: Priority of the error, labeled as High, Medium, Low.
4. Time of Incident Assigned to App Team
5. Major Application Affected: Major application of the company affected by the error.
6. Status: Status of the ticket, whether it is closed or open.
7. Primary Assignment: Domains such as web e-commerce, document central, SAP document etc.
8. Close Time
9. Restoration Duration (h:mm) Calendar days: Time taken to resolve the error in calendar days.
10. BWA Restoration Duration (h:mm): Time taken in hours.
11. To be included in Restoration SLA
12. Restoration SLA Met
13. ELS Filter
14. Closed By
15. Basic Description
16. Description
17. Manual Calculation (D:H:MM:SS)
18. Manual Calculation (Hours)
19. KEDB Check: KEDB refers to Known Error Database.
20. KEDB Compliance (Y/N)

4.2.1 Terminology Used

Known Error Database: KEDB

The Known Error Database is a storehouse of information that portrays all of the conditions in your system application that might result in an incident for your customers and users. As users report incidents, the support engineers pursue the traditional steps stated in the Incident Management process, namely Logging, Categorization and Prioritization. Soon after that, they are on the hunt to find a correct and viable solution for the user. This is where the KEDB steps in. The engineer should interact with the KEDB in much the same manner as with any search engine or knowledge database. The engineers search using the "Known Error" field and retrieve information to view the Workaround field. The KEDB terminology consists of a Known Error and a Workaround field.

1.
The Known Error The Known Error is a characterization of the problem in the user’s words. In case of an error, the users contact the service desk for help. While describing the problem they have a limited view of the entire scope of the root cause. The user should use screenshots of error messages, as well as the text of the message to aid searching the kind of error they have encountered. They should also include accurate descriptions of the conditions that they have experienced. The known error is basically an error that has been recorded along with its solution if it is found for future references. These are the types of things we should be describing in the Known Error field. A good case of a Known Error would be: When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form. The error message reads ―JavaScript exception at line 123‖ The Known Error should be written in terms reflecting the customer’s experience of the Problem. 2. The Workaround The Workaround is a series of chronological steps that the service desk personal could take in order to either restore service to the user or provide temporary relief. The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Service desk should take to help the user, has multiple benefits – some more obvious than others. Page 48 4.2.2 Benefits of Using a Known Error Database (KEDB) 1. Less Restoration time: In a scenario where the user has lost access to a service due to an anomaly that is already known and has a place in the KEDB. The best possible service that a user could hope for is an instant restoration of service or a temporary resolution. Having a good known error database which makes the problem easy to find also means that the workaround should be faster to locate. All of the time required to properly analyses and understand the root cause of the user’s issue is removed by allowing the service desk engineer a quick access to the workaround, thereby arriving at a solution quickly with less effort. 2. Recurring Workaround: With a known error stored in the KEDB, recurring problems whose solutions were recorded are solved in a manner such that each customer having the same problem is given a solution with same veracity in terms of speed and accuracy. KEDB helps avoid the case of one error different solutions, same types of error are solved in a similar manner, and thereby it is kind of like providing a guideline for helping similar errors. 3. Smart Work: In the absence of a KEDB engineers are often seen spending time and energy trying to find a resolution for the recurring issues. This would be likely in distributed teams working from different offices, but it is also a more common occurrence in a single team. KEDB helps save time, energy, money and resources. 4. Evade skill divide – A team constitutes of engineers at different levels of skill. It is an impossible scenario to employ a team that are all experts in every functional area, so it is natural to have many junior members at a lower skill level. A system for apprehending the workaround for complicated problems allows any engineer to quickly resolve issues that are affecting users. Teams are often cross-functional. We might foresee a scenario wherein there is a centralized application support function in a headoffice with users in remote offices supported by their home IT teams. 
A KEDB gives all IT engineers a single platform to search for issues bothering the customers. Page 49 5. Avoid conflicting or controversial workarounds: Establish certain parameters and guidelines to control the workarounds that engineers suggest to users. There have been many moments in the past methods that engineers suggest to customers are discussed and asked how they fixed issues internally revealing the complex methods used. For example: disabling the antivirus to avoid unexpected behavior, upgrading whole software suites to fix a minor issue. All the managers can relate to this. Workarounds can help eliminate dangerous workarounds. 6. Avoid Futile Ownership transfer of Incidents – A flaccid point in the Incident Management process is the continuous transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else’s queue of work and is left unhandled even if it was a high priority in some other person’s queue of work. Often with not enough detailed context or background information, enabling the service desk to resolve issues themselves prevents transfer of ownership for issues that are already known. 7. Get acumen of the severity of the problem at hand : Well documented Known Errors make it a lot convenient to link new incidents to existing previously documented problems. Firstly this avoids a situation of duplicate logging of problems by different engineers. Second it gives better insights about how severe the problem encountered is. Consider two Problems in your system: A condition that affects a network router and causes it to crash once every 5 months and a transactional database that is running faulty and adding 4 seconds to timesheet entry .It is expected that the first problem would be given a high priority and the second a lower one. It stands to reason that a network outage on a core router would be more lethal to the system than a slowly running timesheet system But which would cause more Incidents over time? You might be associating 5 new Incidents per month against the timesheet problem whereas the switch only causes issues irregularly. Being able to quickly link incidents to existing documented problems allows you to judge the relative impact of each one. Page 50 4.2.3 The KEDB implementation In Technical terms when we talk about the KEDB we generally refer to the Incident Management Database not a completely separate storehouse of data. Minimum one suitable implementation of KEDB in that manner should be implemented. There is a one-to-one relation between Known Error and Problem so it is logically correct that the standard data representation of a problem with its number, assignment data, work notes etc., should also hold the data that is required for the KEDB. It is not incorrect to implement this in a different way that is storing the Problems and Known Errors in separate locations, but it should preferably be kept all together to ease analysis of both the known errors and problems. 4.2.4 Importance of R in Data Analytics R is the only programming language that allows statisticians to perform the most complicated and intricate analyses without getting into too much of details. With so many benefits for data science, R has gradually mounted heights among professionals of big data. According to a 2014 survey, R is one of the most powerful and popular programming languages used by data scientists today. Features of R that makes it popular are: 1. 
The Fact That R Is an Open Source Programming Language R is free for everyone to use because it is an open source programming language. Programming codes of R can be used across all platforms like Linux, Windows, and Mac. There are no limits with respect to subscription costs or license management, which makes it easily available to data geeks. Also, you can have free access to the R programming libraries. Nevertheless, there are some commercial libraries meant for enterprises dealing with data in terabytes. Hadoop is a good example. 2. The Ultimate Statistical Analysis Kit R is a programming language having all standard data analysis tools to access data in varied formats, for several data manipulation operations – merges, transformations Page 51 and aggregations. It includes tools for conventional and modern statistical models including Regression, ANOVA, GLM and Tree, in its object oriented framework, which makes is easier to extract as well as merge the needed information rather than copying it. 3. Benefits of Charting R has some great tools to aid data visualization to create graphs, bar charts, multi panel lattice charts, scatter plots and new custom designed graphics. Unparalleled charting and graphics offered by R language is highly influenced by data visualization experts. Graphics based on R programming can be seen in blogs like The New York Times, The Economist, and Flowing Data. 4. R Language Offers Consistent Online Support R language is the most sophisticated statistics software because of its quick and consistent online support. The language has a loyal user base because statisticians, scientists and engineers, even without proper computer programming knowledge, can easily use it. 5. The Most Powerful Ecosystem R has the strongest ecosystem, a package with several functionalities built in for modern statisticians. ―dplyr‖ and ―ggplot2‖ are some examples for data manipulation and plotting, which relieves data scientists from graphic and charting capabilities to be included in applications. R programming language can do almost everything, for business and otherwise. It is used by leading social networks like Twitter and data scientists find it an indispensible tool. Error log analytics using R packages majorly dplyr,plyr,ggplot2 has been done which has resulted in numerous graphs which tell us about the correlation and relationship between type of errors and various columns as mentioned above in the log dataset description. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Page 52 4.3 A predictive model for Error Log Analytics Figure 4.2 Dataset of Logs of Comapany A Phases of recommendation process 4.3.1 Information collection phase This collects relevant information of users to generate a user profile or model for the prediction tasks including user’s attribute, behaviors or content of the resources the user accesses. A recommendation agent cannot function accurately until the user profile/model has been well constructed. The system needs to know as much as possible from the user in order to provide reasonable recommendation right from the onset. 
Recommender systems rely on different types of input, such as the most convenient high quality explicit feedback, which includes explicit input by users regarding their interest in an item, or implicit feedback obtained by inferring user preferences indirectly through observing user behavior [31]. Hybrid feedback can also be obtained through the combination of both explicit and implicit feedback. In an e-learning platform, a user profile is a collection of personal information associated with a specific user. This information includes cognitive skills, intellectual abilities, learning styles, interests, preferences and interaction with the system. The user profile is normally used to retrieve the needed information to build up a model of the user. Thus, a user profile describes a simple user model. The success of any recommendation system depends largely on its ability to represent the user's current interests based on previous data. Definite models are imperative for obtaining useful and accurate recommendations from any prediction technique.

4.3.2 Explicit feedback

The system normally prompts the user through the system interface to provide ratings for items in order to construct and improve his model. The certainty of the recommendation depends entirely on the ratings provided by the user. The one shortcoming of this method is that it requires user involvement at every stage, and users are not always willing to supply enough information. Despite the fact that explicit feedback requires more effort from the user, it is still viewed as providing accurate and reliable data, since it does not involve extracting preferences from actions, and it brings transparency into the recommendation process, which results in a slightly higher perceived recommendation quality and instills more faith in the recommendations listed by the system.

4.3.3 Implicit feedback

The system automatically infers the user's preferences by monitoring the various actions of users, such as purchase history, navigation history, time spent on some web pages, links followed by the user, content of e-mail and button clicks, among others. Implicit feedback reduces the burden on users by inferring their preferences from their behavior with the system. Though the method does not require effort from the user, it is less accurate. Also, it has been argued that implicit preference data might in actuality be more objective, as there is no bias arising from users responding in a socially desirable way [32] and there are no self-image issues or any need to maintain an image for others [33].

4.3.4 Hybrid feedback

A hybrid system combines the strengths of both implicit and explicit feedback in order to minimize their weaknesses and obtain the best performing system. This can be achieved by using implicit data as a check on explicit ratings, or by allowing the user to give explicit feedback only when he chooses to express interest in giving it.

Figure 4.3 Recommendation Phases

4.3.5 Steps for building a recommendation system:
1. Understand the data set based on the correlation and relevance of the columns in the data set.
2. Divide the data into 3 groups based on the track: web, e-commerce and custom.
3. Create a corpus of the data present.
4. Clean the corpus, removing stop words, punctuation, numbers and special characters.
5. Make the TDM (term document matrix) of the corpus of the dataset.
6. Now, make the TDM for any new error log coming into the system.
7.
Find the cosine similarity of the new TDM with the TDMs of the three groups to determine which error group it belongs to.
8. According to the group allotted, the team needed to solve the error and the time for the error to be resolved are determined.

The following figure shows the algorithm for developing the proposed framework:

Figure 4.4 Algorithm for finding TDM

Chapter 5: Result Analysis

This chapter presents the experimental work carried out in this thesis. Log files of company A are used as the data to analyze for patterns and correlations and to propose a framework for predictive analytics of error logs in the dissertation work. To start, we have done the analysis of the log file dataset of company A. Section 5.1 of this chapter gives the details of the analysis of the dataset with the MapReduce approach and the MapReduce with combiner approach, along with the results. After the analysis of the dataset using MapReduce, we move on to find correlations between different parameters of the log dataset to get a better descriptive analysis of the dataset. Section 5.2 focuses on the correlation results. Section 5.3 describes all the experimental work done for the proposed recommendation system.

5.1 Effective Combiner Approach to Error Log Analytics

This paper explores the analytics of error logs. An error log records details such as Timestamp, Severity, Application name, Error message ID and Error message details. An error log is a file that is created during data processing to hold data that is known to contain errors and warnings. It is usually printed after completion of processing so that the errors can be redressed. Error logs are always found in a heterogeneous format. Error logs contain parameters such as:
- Timestamp (when the error was generated).
- Severity (mentions whether the message is a warning, error, emergency, notice or debug).
- Name of the application that generated the error log.
- Error message ID.
- Error log message description.

5.1.1 Input: Error log dataset of company A

Approach for MapReduce:
1. First, a partitioner is created for the dataset which divides the errors into 5 categories, namely: INFO, FATAL, DEBUG, TRACE and WARN.
2. The mapper and reducer functions are then run on different amounts of data to analyze the time spent by the CPU on the task.
3. To reduce the time spent, an additional COMBINER function is added before the reducer, thereby reducing the load on the reducer function and decreasing the CPU time taken to perform the same task.

5.1.2 Output: Time taken by the reducer to process all the logs with and without the combiner.

Table 5.1 Output

S. No. | Amount of Data | Number of Records | Time taken for MapReduce without Combiner (in ms) | Time taken for MapReduce with Combiner (in ms)
1      | 50 MB          | 3,26,93           | 7.9                                                | 6.1
2      | 6 GB           | 6,00,000          | 15000                                              | 13492

The above output proves the point that enormous amounts of data extending to several gigabytes can be analyzed more easily using the partitioner and combiner approach as compared to the regular MapReduce approach. Hence, using the combiner approach is more efficient in dealing with data sets with respect to the time taken for their analysis. It is time effective and minimizes the load on the reducer, and the segregation of errors into 5 categories also contributes towards the efficiency.
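The gain shown in Table 5.1 comes from the combiner pre-aggregating the map output locally before it reaches the reducer. The following is a minimal R sketch of that data flow, kept in R for consistency with the rest of the analysis in this report; it only simulates the map, combine and reduce steps on severity-tagged log lines (the sample lines are made up), whereas a real Hadoop combiner would be written in Java against the Reducer interface, as described in Section 4.1.1.

# Simulated map-combine-reduce flow over error log lines. Assumes each line
# starts with one of the five severity levels used by the partitioner.
split1 <- c("WARN disk nearly full", "INFO service started", "WARN high latency")
split2 <- c("FATAL service crashed", "INFO heartbeat ok", "WARN retrying request")

# Map: emit a (severity, 1) pair for every log line in a split.
map_fn <- function(lines) {
  data.frame(key = sub(" .*", "", lines), value = 1L, stringsAsFactors = FALSE)
}

# Combiner: locally sum the counts per severity within a single split, so at
# most five summary rows per split travel onward instead of one row per line.
combine_fn <- function(pairs) aggregate(value ~ key, data = pairs, FUN = sum)

# Reduce: merge the already-summarized combiner outputs into global counts.
reduce_fn <- function(pairs) aggregate(value ~ key, data = pairs, FUN = sum)

combined <- rbind(combine_fn(map_fn(split1)), combine_fn(map_fn(split2)))
print(reduce_fn(combined))   # global count of log lines per severity category

Without the combiner, the reducer would receive one pair per raw log line; with it, each split forwards at most one row per severity category, which is the reduction in reducer load that the lower timings in Table 5.1 reflect.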
Page 59 5.2 Effective log analysis using Correlation : Error log analytics using R packages majorly dplyr,plyr,ggplot2 has been done which has resulted in numerous graphs which tell us about the correlation and relationship between type of errors and various columns as mentioned above in the log dataset description. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. 5.2.1 Descriptive Representation of Correlation between Parameters of Dataset The first graph represents that ―Low‖ priority errors are mostly resolved using KEDB. So, It becomes very important for a company to build a KEDB so as to enhance error solving capability of the support team. Previously known type of errors can be easily solved using a KEDB check. Errors with priority low have to be resolved in 7 days, errors with priority medium 48 hours and errors with priority high have to be resolved in 7 hours calendar time. Figure 5.1 Descriptive Representation of Correlation between Parameters of Dataset 5.2.2 The second graph shows the high correlation between number of logs opened and searched in KEDB. The relation is direct because of the fact more logs equals Page 60 more KEDB checks. This correlation leads to the decrease in time required to close the ticket and resolve the error which further results in high customer satisfaction. Figure 5.2 Correlation between number of logs opened and searched 5.2.3. The third graph displays the categories of logs searched in KEDB that is the errors which have been previously seen and resolved. The main type is Alert and KEDB. Alert constitutes of errors such as Server timeout, space issues etc. This gives the company a heads up regarding what type of issues occur frequently and can be solved immediately by keeping a record of it. Page 61 Figure 5.3 Statistical report of number of logs by category 5.2.4 The fourth graph specifies the time series outliers in the period of November wherein the outliers reach a peak on November 6. The omi integration refers to a project wherein the ticket generation is not manual but automatic and can be generated as fast as the errors coming in thereby reducing the load on the team and giving them time to resolve the fast coming errors. The time series outliers help the company in observing the server time outs and the time at which it occurs the most so this help in resolving the issue at hand faster. Figure 5.4 Count of KEDB checks by date of incident Page 62 5.2.5 The graph represents the calendar days taken to restore from the error and time taken by the team to resolve it during the two outliers as well. Figure 5.5Average Restoration Duration 5.2.6 The pie chart shows that priority ―Low‖ has the maximum number of logs available. Figure 5.6 Number of Logs Available Page 63 5.2.7 The figure shows that two team members have high lookup for KEDB in web track and have managed to resolve most of the logs referring to the KEDB which again emphasizes the need of KEDB in a company. Figure 5.7 Need of KEDB Page 64 5.2.8.The logs for primary assignment has outliers in the period before November 6 which were resolved efficiently. Figure 5.8 Logs for Primary Assignment that were Resolved Efficiently 5.2.9 Average of Restoration Duration The graph represents the average restoration duration from the time the error occurred to the time it was resolved. 
Figure 5.9 Average of Restoration Duration

5.2.10 Count of KEDB Check
The graph shows the maximum number of KEDB checks for a particular error.

Figure 5.10 Count of KEDB Check

5.2.11 Ticket Opened By

Figure 5.11 Count of Tickets Opened

5.2.12 The graph shows the correlation between "opened by" and "low" priority, which indicates that errors with low priority were opened and resolved the most by the team, as compared to errors of medium and high priority.

Figure 5.12 Correlation between logs opened and priority

5.2.13 The next graph shows that KEDB and ELS appear the most against "opened by", which means that these errors were first searched for in the KEDB, and ELS, that is early life support, was given to them for resolution.

Figure 5.13 Count of logs opened in KEDB and ELS

5.2.14 Count of Logs by Date of Incident

Figure 5.14 Count of Logs by Date of Incident

5.2.15 The above graphs help us find the correlation between logs and various parameters, such as the person solving them, the priority of errors occurring the most, and the calendar days required to solve them, which in turn allows us to build a predictive modeling framework for error log analytics.

5.3 A Predictive Model for Error Log Analytics

Figure 5.15 Dataset of Company A

Steps for building a recommendation system:
1. Understand the data set based on the correlation and relevance of the columns in the data set.
2. Divide the data into 3 groups based on the track: web, e-commerce and custom.
3. Create a corpus of the data present.
4. Clean the corpus, removing stop words, punctuation, numbers and special characters.
5. Make the TDM (term document matrix) of the corpus of the dataset.
6. Now, make the TDM for any new error log coming into the system.
7. Find the cosine similarity of the new TDM with the TDMs of the three groups to determine which error group it belongs to.
8. According to the group allotted, the team needed to solve the error and the time for the error to be resolved are determined.

1. Group 1 TDM and length: Group 1 is custom and the number of records is 8865. The screenshot below specifies the TDM created for Group 1 and also the number of records present at that time in the TDM.

Figure 5.16 Screenshots of Custom

Screenshot for the TDM of group 1:

Figure 5.17 Screenshots of TDM of Group 1

2. Group 2 length and TDM: The screenshot below shows the TDM of group 2 and the number of records present in the TDM at that time. Group 2 is SAP and the number of records = 2445.

Figure 5.18 Screenshots of SAP

3. Group 3: Web. Length: 12645.

Example of predicting the group for a new log coming in:
1. txt <- strsplit("Error in sql server application service TDP backup resolve fast", split=" ")[[1]]
   data <- data.frame(text=txt, stringsAsFactors=FALSE)
2. The TDM for this error is built from the data frame in point 1, as shown in Figure 5.19:

Figure 5.19 Screenshots of Error

3. Cosine similarity with group 1: The screenshot below shows the cosine similarity between the TDMs of the groups and the TDM of the new log error entry.

Figure 5.20 Cosine Similarity

4. Now, analyzing the data, we know the similarity of the new log with each group.

Result:
1. Similarity comparison of the new log error with group 1, group 2 and group 3

Table 5.2 Result

Similarity with Group 1 | Similarity with Group 2 | Similarity with Group 3
0.231                   | 0.132                   | 0.100

2.
Average time for resolution of error:
Group 1: Custom: 348.82 hours
Group 2: SAP: 350.97 hours
Group 3: Web: 353.105 hours

Chapter 6: Conclusion and Future Work

6.1 Conclusion

The above system can be very beneficial to companies in an environment where 1 TB of logs is generated and accumulated in the system every day. Log analytics gives insight into how the products and applications created are handled, and the efficiency of handling them is very important. In professional terms, these error logs are converted into tickets, and any ticket can land with any person of a group who has no previous knowledge of solving that particular type of ticket, thereby increasing the ticket solving time and losing customer satisfaction.

The advantages of the proposed recommendation system are:
1. It creates a flow path for the log error to land in the correct group.
2. Grouping helps create a more sophisticated and efficient approach to error handling.
3. Team-specific error handling results in efficiency.
4. The customer can know exactly when the situation will be resolved and can prepare backup services for the time period allotted to solve the error.
5. The approach is highly beneficial for huge amounts of data which cannot be handled by regular approaches.

6.2 Future Work

The efficiency of the system can be improved using SVMs (Support Vector Machines). Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:
1. They are highly effective in high dimensional spaces.
2. They remain adequate in cases where the number of dimensions is greater than the number of samples.
3. They make use of a subset of training points in the decision function (called support vectors), so they are also memory efficient.
4. Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:
1. In cases where the number of features is much greater than the number of samples, the method is likely to give unclear results.
2. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

The model prepared by a support vector machine, as described above, depends on a subset of the training data, because the cost function for preparing the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers linear kernels, while NuSVR implements a slightly different formulation than SVR and LinearSVR.

References:
1. McAfee, A. and Brynjolfsson, E., "Big data: the management revolution", Harvard Business Review 90, 2012, pp. 60-68.
2. Goodhope, K. et al, "Building LinkedIn's Real-time Activity Data Pipeline", Data Engineering, Vol. 35, No. 2, 2012, pp. 33-45.
3. Bhardwaj, et al, "Big data analysis: Issues and challenges", International Conference on Electrical Electronics Signals Communication and Optimization (EESCO), 2015.
4. Souza, L.
and Girish UR, ―Error Log Analytics using Big Data and MapReduce‖ IJCSIT,Vol.6 ,No.3, 2015, pp.2364-2367 5. Bhandarkar, M, ―MapReduce programming with apache Hadoop‖, Parallel & Distributed Processing(IPDPS), IEEE International Symposium, April 2010 6. Narkhede,S. and Baraskar, T. ―HMR Log Analyzer: Web Application Logs over Hadoop Map Reduce‖, International Journal of UbiComp(IJU), Vol.4, No.3, July 2013, pp.41-47 7. Katkar, G.S. and Kasliwal, A.D ―Use of Log Data for Predictive Analytics through Data Mining‖ Current Trends in Technology and Science, Vol.3, No.3 AprilMay 2014, pp. 217-222 8. Peng,W.et al, ―Mining Logs Files for Data-Driven System Management ―, ACM SIGKDD Exploration Newsletter- Natural Language Processing and Text Mining, Vol. 7, Issue1, June 2005, pp. 44-51. 9. Grace, L.K.J et al, ―Analysis of web logs and web user in web mining‖, International Journal of Network Security & Its Applications IJNSA, Vol.3, No.1, January 2011, pp. 99-110. 10. Bruckman, A. ―Chapter 58: Analysis of Log File Data to Understand User Behavior and Learning in an Online Community‖, Georgia Institute of Technology, pp. 1449-1465. 11. ALSPAUGH, S., et al, Better logging to improve interactive data analysis tools In KDD Workshop on Interactive Data Exploration and Analytics ,2014. Page 77 12. BARRETT, R., ET AL. Field studies of computer system administrators: Analysis of system management tools and practices, ACM Conference on Computer Supported Cooperative Work (CSCW), 2004. 13. BITINCKA, L., ET AL. Optimizing data analysis with a semi-structured time series database, OSDI Workshop on Managing Systems via Log Analysis and Machine Learning Techniques,(SLAML) ,2010. 14. CHEN, Y., ET AL. Design implications for enterprise storage systems via multi- dimension trace analysis, ACM Symposium on Operating Systems Principles (SOSP) ,2011. 15. CHEN, Y., ET AL. Interactive analytical processing in big data systems: A cross- industry study of map reduce workloads., International Conference on Very Large Databases (VLDB) ,2012. 16. CHIARINI, M. Provenance for system troubleshooting. In USENIX Conference on System Administration ,LISA, 2011. 17. COUCH, A. L, Standard deviations of the average system administrator. USENIX Conference on System Administration (LISA), 2008. 18. GOTZ, D., ET AL. Characterizing users’ visual analytic activity for insight provenance, IEEE Information Visualization Conference (InfoVis) ,2009 19. LOU, J.-G., ET AL. Mining dependency in distributed systems through unstructured logs analysis., SIGOPS Operating System Review ,2010. 20. LOU, J.-G., FU, Q., WANG, Y., AND LI, J. Mining dependency in distributed systems through unstructured logs analysis. SIGOPS Operating System Review (2010). 21. MAKANJU, A. A., ET AL. Clustering event logs using iterative partitioning ,ACM International Conference on Knowledge Discovery and Data Mining (KDD) (2009). Page 78 22. NORIAKI, K., ET AL. Semantic log analysis based on a user query behavior model. In IEEE International Conference on Data Mining (ICDM) (2003). 23. OLINER, A., ET AL. Advances and challenges in log analysis, ACM Queue,2011. 24. OLINER, A., AND STEARLEY, J. What supercomputers say: A study of five system logs. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007. 25. OLINER, A. J., ET AL. Using correlated surprise to infer shared influence. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN),2010. 26. OLSTON, C., ET AL. Pig latin: A not-so-foreign language for data processing. 
In ACM International Conference on Management of Data (SIGMOD),2008. 27. OTHERS, S. K. Wrangler: Interactive visual specification of data transformation scripts, ACM Conference on Human Factors in Computing Systems CHI, 2011. personalization applications. Data Mining and Knowledge Discovery, 5(1/2):33-58, 2001a. 28. Adomavicius, G. and A. Tuzhilin. Multidimensional recommender systems: a data warehousing approach. In Proc. of the 2nd Intl. Workshop on Electronic Commerce (WELCOM’01). Lecture Notes in Computer Science, vol. 2232, Springer, 2001. 29. Adomavicius, G., R. Sankaranarayanan, S. Sen, and A. Tuzhilin. Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach. ACM Transactions on Information Systems, 23(1), January 2005. 30. Aggarwal, C. C., J. L. Wolf, K-L. Wu, and P. S. Yu. Horting hatches an egg: A new graph theoretic approach to collaborative filtering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 1999. Page 79 31. Ansari, A., S. Essegaier, and R. Kohli. Internet recommendations systems. Journal of Marketing Research, pages 363-375, August 2000. 32. Armstrong, J. S. Principles of Forecasting – A Handbook for Researchers and Practitioners, Kluwer Academic Publishers, 2001. 33. Baeza-Yates, R., B. Ribeiro-Neto. Modern Information Retrieval. Addison- Wesley, 1999. 34. Balabanovic, M. and Y. Shoham. Fab: Content-based, collaborative recommendation.Communications of the ACM, 40(3):66-72, 1997. 35. Basu, C., H. Hirsh, and W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Recommender Systems. Papers from 1998 Workshop. Technical Report WS-98-08. AAAI Press, 1998. 36. Belkin, N. and B. Croft. Information filtering and information retrieval. Communications of the ACM, 35(12):29-37, 1992. 37. Billsus, D. and M. Pazzani. Learning collaborative information filters. In International Conference on Machine Learning, Morgan Kaufmann Publishers, 1998. 38. Billsus, D. and M. Pazzani. A Personal News Agent That Talks, Learns and Explains. In Proceedings of the Third Annual Conference on Autonomous Agents, 1999. 39. Billsus, D. and M. Pazzani. User modeling for adaptive news access. User Modeling and User-Adapted Interaction, 10(2-3):147-180, 2000. 40. Billsus, D., C. A. Brunk, C. Evans, B. Gladish, and M. Pazzani. Adaptive interfaces for ubiquitous web access. Communications of the ACM, 45(5):34-38, 2002. 41. Breese, J. S., D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998. Page 80 42. Buhmann, M. D. Approximation and interpolation with radial functions. In Multivariate Approximation and Applications. Eds. N. Dyn, D. Leviatan, D. Levin, and A. Pinkus.Cambridge University Press, 2001. 43. Burke, R. Knowledge-based recommender systems. In A. Kent (ed.), Encyclopedia of Library and Information Systems. Vol.69,No.32 . Marcel Dekker, 2000. 44. Mandal D. Pratiksha, ―Study of Elastic Hadoop On Private Cloud.‖ International Journal of Scientific and Research Publications, Vol.6, No.1, January 2016 321 ISSN 2250-3153. 45. Maitrey S. & Jha C.K., Handling Big Data efficiently by using Map Reduce Technique,2015 IEEE International Conference on Computational Intelligence & Communication Technology. 46. Elsayed A. 
et al, MapReduce: State of the art and research directions, IJCEE, vol6, No.1, February 2014. 47. Parmar, Hiren, and Tushar Champaneria. "Comparative Study of Open Nebula, Eucalyptus, Open Stack and Cloud Stack." International Journal of Advanced Research in Computer Science and Software Engineering 4.2 (2014). 48. Parmar, Hiren, and Tushar Champaneria. "Comparative Study of Open Nebula, Eucalyptus, Open Stack and Cloud Stack." International Journal of Advanced Research in Computer Science and Software Engineering 4.2 (2014). 49. Manikandan SG, Ravi S. Big Data Analysis Using Apache Hadoop. InIT Convergence and Security (ICITCS), 2014 International Conference on 2014 Oct 28 (pp. 1-4). IEEE. 50. Gohil P, Garg D, Panchal B. A performance analysis of MapReduce applications on big data in cloud based Hadoop. InInformation Communication and Embedded Systems (ICICES), 2014 International Conference on 2014 Feb 27 (pp. 1-6). IEEE. Page 81 51. Nandimath, J. et al, "Big data analysis using Apache Hadoop." In Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pp. 700- 703. IEEE, 2013. 52. Jacob, Jobby P., and Anirban Basu. "Performance Analysis of Hadoop Map Reduce on Eucalyptus Private Cloud." International Journal of Computer Applications 79.17 (2013). 53. Iordache, Anca, et al. "Resilin: Elastic MapReduce over multiple clouds." Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013 54. Mittal, Ruchi, and Ruhi Bagga. "Performance Analysis of Multi-Node Hadoop Clusters using Amazon EC2 Instances." International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2013). 55. Conejero, Javier, et al. "Scaling archived social media data analysis using a hadoop cloud." IEEE 6th International Conference on Cloud Computing (CLOUD). IEEE, 2013 56. Daneshyar, Samira, and Majid Razmjoo. "Large-scale data processing using Mapreduce in cloud computing Environment." International Journal on Web Service Computing 3.4 (2012): 1. 57. Dittrich J, Quiané-Ruiz JA. Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment. 2012 Aug 1;5(12):2014-5. 58. Lee KH, et al Parallel data processing with MapReduce: a survey. AcM sIGMoD Record. 2012 Jan 11;40(4):11-20. 59. Singh, S. & Singh, N., Big Data Analytics, 2012 International Conference on Communication, Information & Computing Technology, Oct 19-20, Mumbai, India 60. Tang B, Moca M, Chevalier S, He H, Fedak G. Towards mapreduce for desktop grid computing. InP2P, Parallel, Grid, Cloud and Internet Page | 82 Computing (3PGCIC), 2010 International Conference on 2010 Nov 4 (pp. 193-200). IEEE. Page | 83