A Predictive Approach to Error Log Analytics
A SUMMER INTERNSHIP REPORT
Submitted by
RAKSHIT DWIVEDI
SYSTEM ID: 2017012524
ROLL No: 170251142
Under the Supervision of
Ms. Swati Bansal
In partial fulfilment of the summer internship requirement for the award of the degree of
MASTER OF BUSINESS ADMINISTRATION
SCHOOL OF BUSINESS STUDIES
SHARDA UNIVERSITY
Greater Noida, UP
AUGUST, 2018
Sharda University
Plot No. 32-34, Knowledge Park III, Greater Noida, Uttar Pradesh 201306
Certificate of Approval
The Summer Project Report titled "A Predictive Approach to Error Log Analytics" is hereby approved as a certified study in business analytics, carried out and presented in a manner satisfactory to warrant its acceptance as a prerequisite for the award of the Post-Graduate Diploma in Management for which it has been submitted. It is understood that by this approval the undersigned do not necessarily endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve the Summer Project Report only for the purpose for which it is submitted.
Summer Project Report Examination Committee for evaluation of the summer project report:

Name                                                     Signature
1. Faculty Examiner
2. PG Summer Project Coordinator
PREFACE
Industrial internship is a program conducted to acquire practical knowledge. It is believed that practical working experience will be an added advantage in our future lives and may help us achieve our aims and ambitions. It provides a chance to acquire knowledge of global business and prepares one for an executive role. It exposes practical phenomena, including risk, and enables one to weigh probable alternative decisions. The knowledge gained is based on learning and experience. It is a matter of great pleasure that I have completed my internship program at Madrid Software Training Solutions. The program was conducted from June 02, 2018 to July 31, 2018 as part of the summer internship program for my MBA at Sharda University, Greater Noida.
This report has been prepared in fulfilment of the academic curriculum as required under the program. While preparing this report, I gathered practical working experience, and my tireless effort will have been successful if any person or organisation benefits from this report.
ACKNOWLEDGMENT
This internship report is the accumulation of many people's endeavours. It would never have been possible without the consistent support and assistance of the people whom I approached during the various stages of writing it.
First of all, I would like to sincerely express my gratitude and thanks to my faculty supervisor, Ms. Swati Bansal, for her continuous assistance and guidance in completing this report. Her help, guidance and constructive comments were very helpful throughout.
I am grateful to my industry mentor, Mr. Sachin Arora (Team Leader, KPMG), for his support and supervision. I am thankful for the support and open-minded behaviour he has shown towards me during the preparation of my report.
I am also grateful to each and every employee of Madrid Software Training Solutions, with special mention of Mr. Amit Kataria, for their cordial acceptance. They have been very helpful in showing me the work process and provided relevant information for my report whenever I approached them.
Finally, my heartfelt gratitude goes to Sharda University School of Business Studies and the associated instructors with whom I did courses and who have given me valuable education.
RAKSHIT DWIVEDI
SYSTEM ID: 2017012524
DECLARATION
I, Rakshit Dwivedi, hereby declare that the work titled "A Predictive Approach to Error Log Analytics" is genuine work done by me under the guidance of my faculty guide, Ms. Swati Bansal, and has not been published or submitted elsewhere for the requirement of a degree programme. Any literature, data or work done by others and cited within this project has been given due acknowledgement and listed in the references section.

RAKSHIT DWIVEDI (System ID: 2017012524)

Ms. Swati Bansal (Faculty Guide)
TABLE OF CONTENTS

List of Abbreviations
Abstract

Chapter 1: Introduction
1.1 Introduction to Big Data
1.2 Data Analytics
1.3 Understanding Logs and Error Log Analysis using Big Data
1.3.1 Role of Combiner in Map Reduce and Error Log Analysis
1.3.2 Purpose of Log
1.4 Recommender Systems
1.4.1 Taxonomy for Recommender Systems

Chapter 2: Literature Review
2.1 Big Data
2.1.1 Why Big Data
2.1.2 Characteristics of Big Data Platform
2.1.3 Big Data Challenges
2.1.4 Map Reduce Technique
2.1.5 Architecture of Map Reduce
2.1.6 Dealing with Failure
2.1.7 Benefits of Map Reduce
2.1.8 Pitfalls and Challenges in Map Reduce
2.2 Basic Logging and Descriptive Analytics
2.3 Predictive Analytics and Recommender Systems

Chapter 3: Problem Statement and Methodology
3.1 Problem Statement
3.1.1 Existing Systems in Theory
3.2 Motivation
3.3 Methodology

Chapter 4: Proposed Framework
4.1 A Combiner Approach to Effective Error Log Analysis Using Big Data
4.1.1 Role of Combiner in Map Reduce and Error Log Analysis
4.1.2 Purpose of Log
4.2 Effective Error Log Analysis Using Correlation
4.2.1 Terminology Used
4.2.2 Benefits of Using a Known Error Database (KEDB)
4.2.3 The KEDB Implementation
4.2.4 Importance of R in Data Analytics
4.3 A Predictive Model for Error Log Analytics
4.3.1 Information Collection Phase
4.3.2 Explicit Feedback
4.3.3 Implicit Feedback
4.3.4 Hybrid Feedback

Chapter 5: Result Analysis
5.1 Effective Combiner Approach to Error Log Analytics
5.1.1 Input
5.1.2 Output
5.2 Effective Log Analysis using Correlation
5.2.1 Descriptive Representation of Correlation between Parameters of Dataset
5.3 A Predictive Model for Error Log Analytics

Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

References
LIST OF ABBREVIATIONS

KEDB: Known Error Database
TDM: Term Document Matrix
WWW: World Wide Web
HDFS: Hadoop Distributed File System
SAP: Systems, Applications and Products
DOCS: Documents
LDTM: Large Document Term Matrix
Abstract
Recommender systems are software tools to tackle the problem of information overload
by helping users to find items that are most relevant for them within an often
unmanageable set of choices. To create these personalized recommendations for a user,
the algorithmic task of a recommender system is usually to quantify the user’s interest
in each item by predicting a relevance score, e.g., from the user’s current situation or
personal preferences in the past.
Predictive analytics is a kind of business analytics that enables predictions to be made about the probability of a particular event happening in the future, based on data from the past. The concept of predictive analytics is widely adopted in the departments of the most successful organizations, where it supports their decision-making process and helps achieve their goals of customer satisfaction and proper delivery and monitoring of existing systems. These days, recommender systems are used in various domains to recommend items such as products on e-commerce sites, movies and music on media portals, or people in social networks.
To judge the user's preferences, recommender systems proposed in past research often utilized explicit feedback, i.e., deliberately given ratings or like/dislike statements for items. In practice, however, this kind of information does not exist in many of today's application domains of recommender systems. Therefore, recommender systems have to rely on implicit feedback that is derived from the users' behaviour and interactions with the system. This information can be extracted from navigation or transaction logs. Using implicit feedback leads to new challenges and open questions regarding, for example, the huge amount of signals to process, the ambiguity of the feedback, and the inevitable noise in the data. The system we use for obtaining feedback for the recommendation system is called hybrid feedback, which is a combination of both implicit and explicit feedback techniques.
This report explores some of these challenges and questions that have not been covered in previous research. It focuses on building a recommendation system for error log analytics. The report is divided into two parts. The first part deals with the importance of big data, map reduce and the types of logs available, and descriptive analytics of the log dataset of company A is carried out. The second part focuses on building a recommendation system and the different techniques involved. In this work, we use the similarity of two vectors to build one.
Chapter -1 Introduction
This chapter introduces the basic concepts of Big Data, its architecture and the concept of map reduce. It also goes on to explain the role of the combiner in the map reduce approach. For analysis purposes, the dataset of company A is used and referred to. It also describes the types of logs available to us and the concept of error log analytics. Furthermore, it explains the recommendation system.
1.1 Introduction to Big Data
Big data has revolutionized commerce in the 21st century and changed people's perspective towards data in general. The term "Big" can only be given a relative definition, because what we describe as "big" today may become small in times to come, but Big Data can always be defined as data which cannot be handled with the available resources and orthodox technology methods. With the continuous increase in data come varied challenges in the form of different formats, representations and speeds at which data is generated. The orthodox mechanisms of processing data could handle only structured data in the form of tables, but with the advent of Big Data technology, unstructured, semi-structured and structured data can all be processed and handled with ease. A definition of Big Data which everyone can agree upon is long overdue; hence we can say that Big Data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information [1]. The forms of data that come from varied sources such as documents, e-mails, text logs and social media can be:
Structured data, which comes from relational databases.
Semi-structured data, such as XML data.
Unstructured data, which comes from Word documents, PDF files, text files and media logs.
Big Data can be viewed as an opportunity, because such a huge volume of data, upon analysis, can open windows which are yet to be explored. The data is not measured in terms of Gigabytes or Terabytes but in terms of Petabytes or Exabytes. Applications involving Big Data can be transactional, such as Facebook, Twitter, YouTube and Photobox, or analytic. The data can also be incomplete or carry timestamps.
The concept of big data can be easily described using five V’s: Volume, Velocity, Variety,
Veracity and Value.
1. Volume: This refers to the humongous data generated every second: e-mails, Twitter messages, photos, video clips, sensor data and so on, produced and shared every second. We are not talking in terms of Terabytes; the measurement ranges to Zettabytes or Brontobytes. This increase in data makes data sets too large to store and analyze using traditional database methodologies. Big data helps in analyzing this huge volume by breaking and storing the data at different locations and combining it as and when needed.
2. Velocity can be defined as the speed of production and consumption of data. Examples include social media videos going viral in seconds, the speed at which online payments are processed and the speed at which shares are traded. Big data technology helps us to analyze the data while it is being generated, without ever putting it into databases.
3. Variety refers to the different types of data we now use. Previously we focused on structured data stored in the form of tables or in relational databases; examples include financial data such as sales by product or region. At present, 80% of the world's data is unstructured and therefore cannot easily be put into tables, for example photos, video clips or social media updates. With big data technology we can now operate on different data types (structured and unstructured), including messages, social media conversations, photos, sensor data, and video or voice recordings, and bring them together with more traditional, structured data.
4. Veracity refers to the trustworthiness of the data. With many variations of big data, quality is compromised and accuracy is lost, as in the case of Twitter posts with hashtags, abbreviations, typing errors and colloquial speech, as well as doubts over the reliability and accuracy of content; but with the advent of big data and analytics we can now work with these types of data.
5. Value: We have access to big data, but unless we can turn it into value it is useless. It can easily be established that Value is the most important V of Big Data.
Large volumes of data can be processed using the Map Reduce technique. Map Reduce is a processing technique for distributed computing based on Java. It usually divides the data into pieces which are processed in a parallel manner. Map Reduce consists of one slave TaskTracker and one master JobTracker per cluster node. The master is responsible for assigning tasks to the slaves, monitoring tasks and re-executing failed tasks; the slave completes the tasks as instructed by the master. Applications usually specify the input/output locations and supply map and reduce functions via implementations of suitable interfaces and/or abstract classes. These, and other job parameters, constitute the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then takes over the responsibility of distributing the software/configuration to the slaves, scheduling and monitoring the tasks, and providing status and diagnostic information to the job client [3]. The concept of map reduce works in two phases:
Mapper Phase: in this phase the dataset is converted into key-value pairs.
Reduce Phase: in this phase the several outputs from the map tasks are combined to form a reduced set of tuples.
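To make these two phases concrete, the following is a minimal, single-machine sketch of the map and reduce logic for counting severity levels in error log lines. The sample log lines, the assumed log format and the function names are illustrative assumptions only; a real Hadoop job would implement the same logic as Mapper and Reducer classes running across the cluster.

from collections import defaultdict

log_lines = [                      # hypothetical sample input split
    "2018-06-02 10:01:13 ERROR disk quota exceeded",
    "2018-06-02 10:01:14 WARNING high memory usage",
    "2018-06-02 10:01:15 ERROR connection refused",
    "2018-06-02 10:01:16 INFO job started",
]

def map_phase(line):
    """Mapper: convert each record into a (key, value) pair."""
    severity = line.split()[2]     # third token holds the severity level
    return (severity, 1)

def reduce_phase(pairs):
    """Reducer: combine all values that share the same key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

mapped = [map_phase(line) for line in log_lines]
print(reduce_phase(mapped))        # {'ERROR': 2, 'WARNING': 1, 'INFO': 1}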
Hadoop is the most popular implementation of Map Reduce because of ease of availability
as it is an entirely open source platform for handling Big Data.
The Apache Hadoop project develops open-source software for reliable, scalable,
distributed computing. The Apache Hadoop software library is a framework that allows for
the distributed processing of huge data sets across multiple computers using a simple
programming model. It enables applications to work with thousands of independent computers and petabytes of data. Hadoop has taken inspiration from Google's MapReduce and the Google File System (GFS).
HDFS (Hadoop Distributed File System): HDFS is a distributed file system that provides fault tolerance and is designed to run on commodity hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. Hadoop provides this distributed file system, which is able to store data across thousands of servers, together with a means of running jobs (Map/Reduce jobs) across those server machines, running the work near the data. HDFS has a master/slave architecture. Large data is automatically divided into blocks which are managed by different nodes in the Hadoop cluster.
Steps:
1. Every file is split into blocks, and these blocks are then processed by user-defined code into {key, value} pairs during the map phase.
2. The map functions are executed on distributed machines to generate output {key, value} pairs, which are then written to their respective local disks.
3. Each reduce function uses the HTTP GET method to pull the {key, value} pairs corresponding to its allocated key space.
1.2 Data Analytics
Table 1.1 Data Analytics and its Types
Descriptive, Predictive, and Prescriptive Analytics Explained
Here is a guide to understanding and then selecting the right descriptive, predictive, and prescriptive analytics.
With the humongous amount of data available to businesses these days regarding their supply chains, logs and servers in general, companies are turning to analytics solutions to extract meaning from these gigantic volumes of data and help improve the decision-making process.
Predictive analytics gives companies the capability to analyze historical data so as to forecast what might happen in the future. Doing this the right way and moving your systems to a data-driven approach is a great achievement. Large returns on investment can be enjoyed, as evidenced by companies that have lowered operating costs, optimized their supply chains, increased revenues, or improved their customer service and product mix by using past data to anticipate what is going to happen in the future.
Comparing and understanding all these analytical options can be a baffling task. Fortunately, however, these analytical options can be categorized at a high level into three distinct types. No one type of analytical approach is better than another; as a matter of fact, they co-exist and enhance each other. For a business to have a universal view of the market, and of how the company competes for its spot and efficiency within that market, a robust analytics environment is required.
- Descriptive Analytics, which makes use of data aggregation and data mining to provide insight into the past and answer "What has happened?", that is, to find the problem lying in the system.
- Predictive Analytics, which uses statistical models and forecasting techniques to understand the future and answer "What could happen?", that is, what could happen in the future.
- Prescriptive Analytics, which uses optimization and simulation algorithms to advise on possible outcomes and answer "What should we do?", that is, how to solve the problem.
Descriptive Analytics: Insight into the past
Descriptive analytics, or statistics, does exactly what the name suggests: it "describes", or summarizes, the raw data and converts it into something that is interpretable by humans. It is analytics of the past, dealing with the data available to the company. The past refers to any point of time at which an event has occurred, whether it is one minute ago or one year ago. Descriptive analytics is insightful because it grants us an opportunity to learn from past behaviors and understand how they might influence future outcomes.
The majority of the statistical techniques we use fall into this category of analytics. Think of basic arithmetic like sums, averages and percent changes. Usually, the underlying data is an aggregate or count of a filtered column of data to which the basic math is applied. For all pragmatic purposes, there is an infinite number of these statistics. Descriptive statistics are useful to show things like total stock, average dollars spent per customer, stock left in inventory and year-over-year change in sales. Common examples of descriptive analytics are reports that provide historical insights regarding the company's production, financials, inventory, operations, sales and customers.
Descriptive analytics is used when you need to understand, at an aggregate level, how things are run in the company, and when you require a summary of how different aspects of the company's processes are carried out.
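As a small illustration of this kind of aggregate summary, the sketch below computes simple counts and a month-over-month percent change on a hypothetical error-log table. The records and field names are made up for illustration only.

records = [
    {"month": "June", "severity": "ERROR"},
    {"month": "June", "severity": "WARNING"},
    {"month": "June", "severity": "ERROR"},
    {"month": "July", "severity": "ERROR"},
]

# Basic descriptive statistics: filtered counts and a percent change.
june_errors = sum(1 for r in records if r["month"] == "June" and r["severity"] == "ERROR")
july_errors = sum(1 for r in records if r["month"] == "July" and r["severity"] == "ERROR")
pct_change = (july_errors - june_errors) / june_errors * 100

print(f"June errors: {june_errors}, July errors: {july_errors}, change: {pct_change:.0f}%")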
Predictive Analytics: Understanding the future
Predictive analytics gets its name from its root, the ability to "predict" occurrences in the future. This branch of analytics is about understanding and predicting the future. Predictive analytics provides companies with actionable insights based on data.
Predictive analytics provides the probability or likelihood of a future outcome. It is important to remember that no statistical algorithm can "predict" the future with 100% certainty, because the foundation of predictive analytics is probabilities. Organizations use these statistics to anticipate what might happen in the future. These techniques collect data from the system, and missing values are estimated based on previously collected data. They combine historical data found in HR, POS, ERP and CRM systems to locate patterns in the data, and apply algorithms and statistical models to capture relationships between varied data sets that cannot easily be seen. Organizations use predictive analytics and statistics whenever they want to see what the future holds.
Predictive analytics can be used throughout the organization, from identifying purchasing patterns to identifying trends in sales activities, and from forecasting customer behavior to forecasting how an error that has occurred can be resolved. It also helps forecast demand for inputs from inventory, the supply chain and operations.
One of the most common applications most people are familiar with is the use of predictive analytics to produce a list of recommendations while shopping online, wherein, depending on the items you look at, a list of similar items is displayed beneath. These recommendations are used by sales and customer service teams to determine the probability of customers making an online purchase. Typical business uses include predicting which items customers will purchase together, forecasting inventory levels based upon a myriad of variables, and understanding how sales might close at the end of the year.
Predictive analytics is used whenever you need to know something about the future, or to fill in the gaps in the information.
Prescriptive Analytics: Advice on possible outcomes
A comparatively new field, prescriptive analytics permits users to "prescribe" a number of different possible alternatives to a prediction and guides them towards a plausible solution. In a nutshell, this branch of analytics is about providing advice.
Prescriptive analytics is an attempt to quantify the impact of future decisions in order to advise on all plausible outcomes before the actual decisions are made. At its best, prescriptive analytics predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions. These analytics go beyond what descriptive and predictive analytics suggest by recommending one or more plausible outcomes or courses of action. Typically they predict multiple future outcomes and allow organizations to choose from a number of possible outcomes based upon their actions. Prescriptive analytics uses a combination of tools and techniques such as algorithms, machine learning, computational modeling procedures and business rules. These techniques are applied against input from many different data sets, including real-time data feeds, big data, and historical and transactional data.
Prescriptive analytics is relatively complex to administer, and most organizations are not yet using it in their daily course of business planning. If and when implemented correctly, it can have a significant impact on how businesses analyze decisions and on the organization's bottom line. Big organizations are successfully using prescriptive analytics to optimize production, scheduling and inventory in the supply chain to make sure that deliverables reach customers on time, thereby improving the overall customer experience.
Use Prescriptive Analytics anytime you need to provide users with advice on what action
to take.
1.3 Understanding Logs and Error Log Analysis using Big Data
Big data and error log analytics together make a very intriguing topic for research.
In their paper, Souza and Katkar (2014) reviewed the types of logs available and then utilized the corresponding log information for a very important business analytics function: predictive analysis and classification. The various categories of severity are Error, Warning and Info, whose counts are present in the error log file. The straight-line equation y = mx + c is used to predict the future severity value; the independent variable x consists of the influencing parameters for prediction, while y is the predicted value. In this paper, three more categories have been discussed: Fatal, Trace and Debug.
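As a hedged illustration of this straight-line approach, the sketch below fits y = mx + c by least squares to a hypothetical series of weekly error counts and extrapolates one week ahead. The data, and the choice of week number as the single influencing parameter, are assumptions for illustration and not the values used in the cited study.

weeks  = [1, 2, 3, 4, 5]           # x: influencing parameter (week number)
errors = [12, 15, 14, 18, 21]      # y: observed ERROR severity counts

# Ordinary least-squares estimates of slope m and intercept c.
n      = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(errors) / n
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, errors)) / \
    sum((x - mean_x) ** 2 for x in weeks)
c = mean_y - m * mean_x

next_week = 6
print(f"y = {m:.2f}x + {c:.2f}; predicted count for week {next_week}: {m * next_week + c:.1f}")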
Katkar and Kasliwal (2014) explained the types of logs and their impact on systems
focusing particularly on web server logs and gave various techniques to analyze them.
According to them Data Mining is used for finding expected patterns from that large set of
log data using Web Mining. When used together, predictive analytics and data mining can
make the future prediction more efficient with respect to web access.
Bruckman (2006) explored the various types of log analysis, namely quantitative and qualitative analysis, to understand the relationship between them. Qualitative log analysis is generally done manually, whereas quantitative log analysis can be done manually or be automated.
Joshila et al. (2011) discussed the different types of logs available, namely the error log, access log, common log format (CLF), combined log format, multiple access logs and the status codes sent by servers, and combined the information from logs with web mining. In this paper, the error log has been explored and analyzed.
1.3.1 Role of Combiner in Map Reduce And Error Log Analysis
Role of Combiner in Map Reduce
A Combiner, also known as a semi-reducer, operates by taking the inputs from the Map class and then passing the output key-value pairs to the Reducer class. The main purpose of a Combiner is to summarize the map output records that share the same key. The result (a key-value collection) from the combiner is sent over the network to the actual Reducer task as input, thereby reducing the load on the Reducer. The Combiner class is used between the Map class and the Reduce class to minimize the volume of data transferred between Map and Reduce, since the output of the map task is usually large and the amount of data transferred to the reduce task is high.
Here is a brief summary of how the MapReduce combiner works (a sketch follows the list):
- A combiner does not have a predefined interface of its own; it must implement the Reducer interface's reduce() method.
- A combiner operates on each map output key. It must have the same output key-value types as the Reducer class.
- A combiner can produce summary statistics from a large dataset because it replaces the original map output.
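The following sketch illustrates the combiner idea on hypothetical map outputs: each map task locally aggregates its own output before anything is shuffled, so far fewer pairs cross the network to the reducer. It is a simulation of the concept, not Hadoop's actual Combiner class, and the splits and severities shown are assumptions.

from collections import Counter

split_1 = ["ERROR", "ERROR", "WARNING", "ERROR", "INFO"]
split_2 = ["ERROR", "INFO", "INFO", "WARNING", "ERROR"]

def combine(severities):
    """Combiner: a local reduce over one map task's output."""
    return Counter(severities)          # e.g. {'ERROR': 3, 'WARNING': 1, 'INFO': 1}

# Without a combiner, 10 raw pairs would be shuffled; with it, only 6 aggregates.
partials = [combine(split_1), combine(split_2)]
print(sum(partials, Counter()))         # reducer output: total count per severity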
1.3.2 Purpose of log
Ubiquitous to the study of online activities is the possibility of collecting log file data. It is
plausible for the computer to trace every command typed by users—in some cases, every
stroke of the key. In cases where users interact only online, we can access a comprehensive
record of all of their history of interactions. The completeness of the record and ease of
collecting it are unrivalled. However, log file data is more often collected than analyzed.
The structure and type of log varies with different applications.
Types of log files generally maintained include:
1. Error logs: keep records of the types of errors and their time of occurrence, and help in the resolution of errors through back tracing (a small parsing sketch follows this list).
2. Web server logs: store the history of activities on the internet. New techniques such as clickstream mining use web server log data.
3. Console logs: the wellbeing of system applications is assessed through system or console logs.
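As a small illustration of how error log records can be turned into structured data for analysis, the sketch below parses hypothetical log lines into timestamp, severity and message fields. The log format and the regular expression are assumptions; real server logs vary and the pattern would need adjusting.

import re

LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<severity>ERROR|WARNING|INFO|FATAL|TRACE|DEBUG) "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_line("2018-07-31 23:59:01 ERROR failed to connect to database")
print(record["severity"], "-", record["message"])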
1.4 Recommender Systems
The roots of recommender systems lie in the special needs of work in diverse fields: cognitive science [19], information retrieval [20] and economics [21]. Recommender systems emerged as an independent research area in the mid-1990s, and their important role in enhancing data accessibility attracted the attention of both the academic and industrial worlds.
Recommender systems are a convenient way to broaden the scope of search algorithms, since they help users discover items they might not have found by themselves. A recommendation basically offers the user a list of items which match his preferences according to the things bought previously. There exist varied approaches to accumulating data about users: constantly monitoring their interaction, quizzing them about some actions, or asking them to fill in feedback forms that include personal information. The user's interaction with the system provides two types of information:
Implicit information: collected from the user's interaction and behaviour, for example by keeping track of the items the user has interacted with and item-related information such as the number of times an item has been viewed or reproduced, or user-related viewing information such as group membership.
Explicit information: provided by the users every time they give an opinion about items, rating or liking some item; generally, all the information deliberately supplied by the user. It is based on the ratings given by the user.
The recommender system accumulates and analyzes both kinds of information to generate the user profile, consisting of the items viewed, the ratings and the feedback forms. The profile stores information not only about the user's likes, but also about the user themselves: current location, current personal needs, sex, age, professional position and so on. The way this profile is used by the recommendation system varies a lot among different systems, and the information stored within it is also a determinant factor in the design of the recommender algorithm.
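As a hedged sketch of how implicit and explicit information might be blended into a single preference score per item, the snippet below normalises hypothetical view counts and ratings and combines them with an assumed 70/30 weighting. Item names, data and weights are illustrative assumptions, not values taken from this work.

implicit_views = {"item_a": 12, "item_b": 3, "item_c": 7}   # times viewed (implicit)
explicit_rating = {"item_a": 4.0, "item_c": 2.0}            # 1-5 stars, sparse (explicit)

max_views = max(implicit_views.values())

def hybrid_score(item):
    implicit = implicit_views.get(item, 0) / max_views      # normalise to [0, 1]
    explicit = explicit_rating.get(item, 3.0) / 5.0          # missing rating -> neutral
    return 0.7 * explicit + 0.3 * implicit                   # assumed weighting

for item in sorted(implicit_views, key=hybrid_score, reverse=True):
    print(item, round(hybrid_score(item), 2))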
1.4.1 Taxonomy for recommender systems
The categories into which recommender systems are divided describe the diverse models of abstraction for the user profile: how it is generated, how it is later maintained, and how it evolves as the system runs.
User profile representation: building an accurate profile is an important task, since the success of the recommendation depends on how the system represents the user's interests. Some models applied in current recommender systems are listed next:
- History-based
Some systems keep a list of purchases, the navigation history or the contents of e-mail boxes as a user profile. Additionally, it is also common to keep the relevant feedback of the user associated with each item in the history. The Amazon website is a clear example.
- Vector-space
In the vector space model, items are represented with a vector of features, usually words or
concepts which are represented numerically as frequencies, relevance percentage or
probability.
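A minimal sketch of the vector-space idea is shown below: two items (here, two hypothetical error messages) are represented as term-frequency vectors and compared with cosine similarity, the kind of two-vector similarity this work relies on. The texts are made up for illustration only.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    terms = set(vec_a) | set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in terms)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("connection refused by database server",
                        "database server refused the connection"))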
- Demographic
Demographic filtering systems create a user profile through stereotypes. Therefore, the user
profile representation is a list of demographic features which represent the kind of user.
- User-Item Ratings Matrix
Some collaborative filtering systems maintain a user-item ratings matrix as a part of the user
profile. The user-item ratings matrix contains historical user ratings on items. Most of these systems do not use a profile learning technique. Systems like Jamendo include this technique to represent the user profile.
- Classifier-based Models
Systems using a classifier approach as a user profile learning technique elaborate a methodology to continuously monitor input data in order to classify the information. This is the case with decision trees, Bayesian networks and neural networks.
- Weighted n-grams
Items are represented as a net of words, with weights scoring each link. The system is based on the assumption that certain words tend to occur one after another a significantly high number of times; it extracts fixed-length consecutive series of n characters and organizes them with weighted links representing the co-occurrence of different words. The structure therefore achieves a contextual representation of the words.
Initial profile generation:
- Empty: the profile is built as the users interact with the system.
- Manual: the users are asked to register their interest beforehand.
- Stereotyping: Collecting user-related information like city, country, lifestyle, age or
sex.
- Training set: providing the users with some items among which they should select
one.
- Profile learning technique: the way the profile changes over time.
- Not needed: some systems do not need a profile learning technique, either because they load the user-related information from a database or because it is dynamically generated.
- Clustering: the process of grouping information objects according to common features inherent in their information context. User profiles are often clustered into groups according to some rule to assess which users share common interests. Recommenders like Last.fm or iRate apply this technique [12].
- Classifiers: Classifiers are general computational models for assigning a category to
an input. To build a recommender system using a classifier means using information
about the item and the user profile as input, and having the output category represent
how strongly to recommend an item to the user. Classifiers may be implemented using
many different machine learning strategies including neural networks, decision trees,
association rules and Bayesian networks [1].
- Information Retrieval Techniques: When the information source has no clear
structure, pre-processing steps are needed to extract relevant information which allows
estimation of any information container’s importance. This process comprises two
main steps: feature selection and information indexing.
- Relevance feedback: the two most common ways to obtain relevance feedback are to use information given explicitly or to use information observed implicitly from the user's interaction. Moreover, some systems propose implicit-explicit hybrid approaches.
- No feedback: Some systems do not update the user profile automatically and,
therefore, they do not need relevance feedback. For example, all the systems which
update the user profile manually.
- Explicit feedback: In several systems, users are required to explicitly evaluate items.
These evaluations indicate how relevant or interesting an item is to the user, or how
relevant or interesting the user thinks an item is to other users. Some systems invite users to submit information such as track playlists. iRate uses this approach to provide its recommender with finer information about users' preferences.
- Implicit feedback: Implicit feedback means that the system automatically infers the user's preferences passively by monitoring the user's actions. Most implicit methods obtain relevance feedback by analyzing the links followed by the user, by storing a history of purchases or by parsing the navigation history.
Table 1.2 Types of Recommender Systems
Chapter 2: Literature Review
This chapter presents the literature review for the proposed system. Here we also briefly explain the contributions of different researchers and their background work. It reviews the work done in the fields of big data, log analytics and recommendation systems.
2.1 Big Data
Big Data is a relatively new term that came from the need of big companies such as Yahoo, Google and Facebook to analyze big amounts of unstructured data, but this need can also be identified in a number of other big enterprises as well as in the research and development field. Data becomes Big Data when it basically outgrows the current ability to process it and cope with it efficiently. Such datasets have a size beyond the ability of typical database software tools to capture, store and manage.
Big Data refers to "data sets which continue to grow so much that it becomes difficult to manage them using existing database management concepts and tools. The difficulty can be related to data acquisition, storage, search, sharing, analytics and visualization" (Singh, S. and Singh, N., 2012). Oracle added a new characteristic for this kind of data, namely low value density, meaning that sometimes there is a very big volume of data to process before finding the valuable needed information (Garlasu, D., 2013).
The following properties are associated with Big Data (Aminu, L.M., 2014):
1. Variety: data is entirely dissimilar, consisting of raw, structured, semi-structured and even unstructured data.
2. Volume: the "big" word in big data itself defines the volume. At present, the data is in petabytes and is expected to rise to zettabytes in the near future.
3. Velocity: the notion which deals with the speed of the data coming from different sources.
4. Variability: it considers the inconsistencies of the data flow.
5. Complexity: to prevent data from getting out of control, it is a responsibility to link, match, cleanse and transform data across systems coming from a variety of sources.
6. Value: users should be able to run certain queries against the saved data, extract vital results from the filtered data obtained, and also order it according to the magnitude they need.
2.1.1 Why Big Data
Social networking websites generate new data every second, and handling such data is one of the major challenges companies are facing. Data which is stored in data warehouses is causing disruption because it is in a raw format; proper analysis and processing must be done in order to produce usable information out of it.
Big Data can help us gain perspective and make better decisions. It presents an opportunity to create unprecedented business advantage and better service delivery. The concept of Big Data is going to change the way we do things today (Singh, S. and Singh, N., 2012). Big Data is the energy source of the present world. It is refashioning the future of global economics. The Big Data revolution changes the way of thinking in business. It affects decision making from the bottom up and the top down. It speeds up discoveries and small predictions in daily activities (Sase, Y.S. and Yadav, P.A., 2014).
2.1.2 Characteristics of Big Data Platform
The following basic features should be present in a Big Data offering (Singh, S. and Singh, N., 2012):
1. Comprehensive
2. Enterprise Ready
3. Integrated
4. Open Source Based
5. Low latency reads and updates
6. Robust and Fault Tolerant
7. Scalable
8. Extensible
9. Allow adhoc queries
10. Minimal maintenance
2.1.3 Big Data Challenges
The main challenges of Big Data are (Singh,S. and Singh,N., 2012):
1. Variety
2. Volume
3. Analytical workload complexity
4. Agility
Many organizations are straining to deal with the increasing volumes of data. In order to
solve this problem, the organizations need to reduce the amount of data being stored and
exploit new storage techniques which can further improve performance and storage
utilization.
2.1.4 Map Reduce Technique
A very challenging problem today is to analyze Big Data. For the effective handling of such massive data and applications, the map reduce framework has come into wide use. Over the last few years, Map Reduce has emerged as the most popular paradigm for parallel, batch-style processing and analysis of large amounts of data. It is a programming model initiated by Google's team for processing huge datasets in distributed systems.
It is inspired by functional programming, which allows expressing distributed computations on massive amounts of data. It is designed for large-scale data processing, as it runs on clusters of commodity hardware. Map reduce is used in areas where the volume of data to analyze grows rapidly.
2.1.5 Architecture Of Map Reduce (Maitrey, S. and Jha,C.K., 2015)
MapReduce is a technique that processes large multi-structured data files across massive data sets. It breaks the processing into small units of work, and these units can be executed in parallel across several nodes; as a result, very high performance is achieved. Programs written in this functional style are automatically parallelized and can be executed on a large cluster of commodity machines. The series of steps in its working is:
Step 1: The input file is read and then gets split into multiple pieces.
Step 2: These splits are then processed by multiple map programs running in parallel.
Step 3: The Map Reduce system takes the output from each map program and merges (shuffles/sorts) the results for input to the reduce program.
Technically, all inputs to Map tasks and outputs from Reduce tasks are in key-value pair form. Usually the keys of input elements are not relevant, and in such conditions they can be ignored. A plan for execution in MapReduce is determined entirely at runtime. The MapReduce scheduler utilizes speculative and redundant execution: tasks on straggling nodes are redundantly executed on other idle nodes that have finished their assigned tasks. Map and Reduce tasks are executed with no communication between other tasks. Thus, there is no contention arising from synchronization and no communication cost between tasks during an MR job execution.
The figures below show: a) Simplified use of MapReduce
b) MapReduce with combiners and partitioners
Figure 2.1 Simplified use of MapReduce
Figure 2.2 MapReduce with partitioners and combiners
MapReduce runs on a cluster of nodes: one node acts as the master node and the other nodes act as workers. Worker nodes are responsible for running map and reduce tasks; the master is responsible for assigning tasks to idle workers. Each map worker reads the content of its associated split, extracts key/value pairs and passes them to the user-defined Map function. The output of the Map function is buffered in memory and partitioned into a set of partitions equal to the number of reducers. The master notifies the reduce workers to read the data from the local disks of the Map workers. The output of the reduce function is appended to output files. Users may use these files as input to another MapReduce call, or use them for another distributed application (Elsayed, A. et al., 2014).
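The partitioning step mentioned above can be sketched very simply: with a number of partitions equal to the number of reducers, a stable hash of the key decides which reducer receives each pair. The hash function, reducer count and keys below are illustrative assumptions and do not reproduce Hadoop's actual partitioner.

import zlib

NUM_REDUCERS = 3                      # assumed number of reduce tasks

def partition(key):
    # A stable hash so the same key always reaches the same reducer.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

for key in ["ERROR", "WARNING", "INFO", "FATAL"]:
    print(key, "-> reducer", partition(key))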
MapReduce versus a Traditional DBMS (Maitrey, S. and Jha, C.K., 2015)
Traditional DBMSs have adopted strategies which are not appropriate for solving extremely large-scale data processing tasks. There was a need for special-purpose data processing tools that could be adapted for solving such problems. While MapReduce is referred to as a new way of processing Big Data, it is also criticized as a "major step backwards" in parallel data processing in comparison with a DBMS. MapReduce increases the fault tolerance of long-running analysis through numerous checkpoints of completed tasks and data replication; however, the frequent I/O required for fault tolerance reduces efficiency. A parallel DBMS aims at productivity rather than fault tolerance. A DBMS actively exploits pipelining of intermediate results between query operators; however, this creates the potential danger that a large number of operations need to be redone when a failure happens.
Also, a DBMS generates a query plan tree before execution, whereas a plan for execution in MapReduce is determined entirely at runtime.
MapReduce is a simple and efficient tool for query processing in a DBMS. The increasing interest in and popularity of MapReduce has led some relational DBMS vendors to support MapReduce functions inside the DBMS. The Teradata Aster Database is an example of a product that supports MapReduce.
2.1.6 Dealing With Failure
MapReduce is designed to deal with hundreds or thousands of commodity machines; therefore, it must tolerate machine failure. The failure may occur in the master node or in worker nodes. In case of master failure, all MapReduce tasks will be aborted and have to be redone after a new master node is assigned. On the other hand, to track worker failure, the master monitors all workers, periodically checking worker status. If a worker does not respond to the master's ping within a certain amount of time, the master marks the worker as failed. In case of failure of a map task worker, any map task either in progress or completed by that worker is reset back to its initial idle state and will be assigned to another worker. In case of failure of a reduce task worker, any task in progress on the failed worker is assigned to an idle worker. The output of completed reduce tasks is stored in the global file system, so completed reduce tasks do not need to be re-executed. On the other hand, the output of map tasks is stored on local disks, so completed map tasks must be re-executed in case of failure (Elsayed, A. et al., 2014).
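The heartbeat-based failure detection described above can be sketched as follows: the master records when each worker was last heard from and marks as failed any worker whose heartbeat is older than a timeout. The worker names and the timeout value are assumptions for illustration only.

import time

HEARTBEAT_TIMEOUT = 30.0                       # seconds (illustrative value)
last_heartbeat = {"worker-1": time.time(),     # just reported in
                  "worker-2": time.time() - 95.0}  # silent for 95 seconds

def failed_workers(now=None):
    """Return the workers whose last heartbeat is older than the timeout."""
    now = now if now is not None else time.time()
    return [w for w, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]

print(failed_workers())                        # ['worker-2']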
2.1.7 Benefits of MapReduce
The following are the advantages of MapReduce (Elsayed Abdelrahman, 2014 and Kyong Ha Lee, 2011):
1. Simple and easy to use: the MapReduce model is simple but expressive. With MapReduce, a programmer defines his job with only Map and Reduce functions, without having to specify the physical distribution of the job across the nodes.
2. Flexible: MapReduce does not have any dependency on a data model or schema. With MapReduce a programmer can deal with sporadic or unstructured data more easily than with a DBMS.
3. Independent of the storage: MapReduce is basically independent of the storage layers. Thus, MapReduce can work with different storage layers.
4. Fault tolerance: MapReduce is highly fault tolerant. It is reported that it can continue to work in spite of an average of 1.2 failures per analysis job at Google.
5. High scalability: MapReduce has been designed in such a way that it can scale up to large clusters of machines. It supports runtime scheduling, which enables dynamic assimilation of resources during job execution, hence offering elastic scalability.
6. Supports data locality.
7. Reduces network communication cost.
8. Ability to handle data from heterogeneous systems: since MapReduce is storage independent, it can analyze data stored in different storage systems.
2.1.8 Pitfalls and Challenges in MapReduce
The following are the pitfalls of the MapReduce framework compared to a DBMS (Lee, K.H. et al., 2011):
1. No high-level language support like SQL in a DBMS, nor any query optimization technique (as of 2011).
2. MapReduce is schema-free and index-free. An MR job can work right after its input is loaded into its storage.
3. A single fixed dataflow, which does not support algorithms that require multiple inputs; MapReduce was primarily designed to read a single input and generate a single output.
4. Low efficiency: with fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. In addition, map and reduce are blocking operations: a transition to the next stage cannot be made until all the tasks of the current stage are concluded. Also, MapReduce has a latency problem that comes from its inherent batch processing nature; all of the inputs for an MR job should be prepared in advance for processing.
5. Very young compared to more than 40 years of DBMS development.
The two major challenges are (Maitrey, S. and Jha, C.K., 2015):
1. Due to frequent checkpoints and runtime scheduling with speculative execution, MapReduce shows low efficiency. Thus, how to increase productivity while guaranteeing the same level of scalability and fault tolerance is a major challenge. The efficiency problem is expected to be overcome in two ways: improving MapReduce itself or leveraging new hardware.
2. The second challenge is how to efficiently manage resources in clusters which can be as large as 4,000 nodes in a multi-user environment, and how to achieve high utilization of MR clusters.
APACHE HADOOP
MapReduce, which has been popularized by Google, utilizes the Google File System (GFS) as an underlying storage layer to read input and store output. GFS is a chunk-based distributed file system that supports fault tolerance by data partitioning and replication. We proceed with our explanation using Hadoop, since Google's MapReduce code is not available to the public owing to its proprietary use.
Hadoop is an open-source Java implementation of MapReduce. Other implementations, such as Disco written in Erlang, are also available but are not as popular. Hadoop consists of two layers: a data storage layer called the Hadoop Distributed File System (HDFS) and a data processing layer called the Hadoop MapReduce framework. HDFS is a block-structured file system managed by a single master node, like Google's GFS. Large data is automatically split into blocks which are managed by different nodes in the Hadoop cluster.
Figure 2.3 shows the architecture of Apache Hadoop.
Figure 2.3 Hadoop Master Slave Architecture
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes.
A Hadoop cluster comprises a single master node and multiple slave or "worker" nodes. The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data or at least are in the same rack. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and shuffle operations) from a JobTracker.
The master node acts as the NameNode and the JobTracker, whereas a slave or worker node acts as both a DataNode and a TaskTracker. In a larger cluster, HDFS is handled through a dedicated NameNode server that hosts the file system index, and a Secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file system corruption and reducing loss of data.
All the blocks of a file are of the same size except the last block. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all the decisions regarding replication of blocks. It periodically receives a heartbeat and a block report from each DataNode in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly; a block report contains a list of the blocks on a DataNode.
The input data is taken by the NameNode and divided into splits for the map phase. A map runs for each split. The TaskTracker then retrieves key/value pairs from the data chunk and assigns them to mappers. The output key-value pairs of the map function are sorted, stored locally and passed to the reducers as input data through HTTP. Once the reduce function has finished, the final result is handed over to HDFS through the network.
The HDFS client is the third major category in the system architecture. HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. Client nodes have Hadoop installed with all the cluster settings, but are neither a master nor a slave. Instead, the role of the client machine is to load data into the cluster, submit map reduce jobs describing how the data should be processed, and then retrieve or view the results of the job when it is finished.
Jyoti Nandimath et al. (2013), in their paper Big Data with Apache Hadoop, concluded that Hadoop applications perform operations on Big Data in optimal time and produce output with minimum utilization of resources.
Tapan P. Gondaliya and Dr. Hiven D. Joshi also conclude in their paper that Apache Hadoop is the best solution to the Big Data problem. They also provide a brief introduction to the components built over Hadoop, such as Apache Hive, Apache Pig, Apache Mahout, Apache HBase, Apache Sqoop and Apache Flume.
Aditya B. Patel (2012) presented a paper in which various experiments using Apache Hadoop were carried out. He concluded that the results obtained from the various experiments indicate that it is favourable to use Apache Hadoop for Big Data analysis, and that future work will focus on the evaluation and modeling of Hadoop data-intensive applications on cloud platforms like Amazon EC2.
2.2 Basic Logging and Descriptive Analytics
This section promotes the need for effective logging techniques in interactive data analysis systems, for purposes such as describing the exploration process, implementing intelligent user interfaces so as to create recommendation systems using predictive analysis, evaluating analysis tools and interfaces, and gaining insight into the analysis ecosystem as a whole. These topics are not only of great interest to researchers who want to understand them in depth, but also immensely valuable to industry practitioners, who can use this information to design products that are better suited to the needs of users.
Implementing intelligent user interfaces: Clippy, the Microsoft Office Assistant, is one example of an intelligent user interface. Such interfaces assist the user by offloading some of the complexity of working with the tool at hand, often by automated means. Other examples stated in different papers include adaptive or adaptable interfaces (Andrea Bunt et al., 2007), predictive interfaces (Swapna Reddy et al., 2009) and mixed-initiative interfaces (Eric Horvitz, 1999), as well as automated user assistants (Pattie Maes, 1994). Automated interfaces usually rely heavily on statistical models of user behavior and thus require accurate accounting of user actions at a level that corresponds to the variables being modeled. The predictive systems created rely heavily on previously accumulated data, as mentioned in the section above, for example a KEDB, to predict and model the user's behavioural patterns.
As another example, Wrangler has a mixed-initiative interface that gives suggestions to assist users in cleaning their data, based on the frequencies of user actions, as explained by Qui Guo et al. (2010) in their paper. Wrangler was originally established on a transformation language with a small number of operators. In identifying this list of transforms and pairing them with the interface gestures mentioned, the authors were able to capitalize on their extensive hands-on experience, and prior work on languages for data cleaning helped them in creating such a system. However, for data exploration purposes rather than data cleaning, it is not clearly established what set of transforms and visualizations should be supported and used in order to get the desired results. Related work on the topic has relied vastly on the intuition and experience of the author with particular situations and patterns encountered before to determine what actions to support in what situations (Robert St. Amant et al., 1998). However, these situations could be better determined by having detailed activity records from data exploration and visualization tools with direct manipulation interfaces, logged at an appropriate level of granularity (David Gotz et al., 2009), rather than just relying on the instinct and gut feeling of one particular individual who might hold expertise in the area. This problem can be rectified by evaluating analysis tools and interfaces with various data sets to check the quality of their predictions. More specifically, researchers and industry practitioners evaluate interfaces on various parameters to understand user behavior, performance, thoughts, and experience, by contrasting and comparing design alternatives, computing usability metrics, and certifying conformance with standards, as mentioned in the paper by David M. Hilbert et al. (2000). To accomplish these targets using events logged from current UI systems,
researchers have invented a wide variety of techniques, ranging from synchronizing data gathered from the different sources available to them, to transforming, comparing, summarizing, and visualizing event streams that abstract low-level log events into high-level modelled events which help in predictive analysis. A substitute for these automated techniques is to perform a task in a carefully controlled laboratory environment, or to focus on long-term studies of specific tools in an isolated environment (Youn Ah Kang et al., 2011). These studies typically involve watching videos of study subjects performing a task, questioning the subjects about their experience, and evaluating how well they performed the task in the environment created for them.
While this research is immensely valuable, some disadvantages of these techniques are that they do not scale well, they generate results that are not amenable to comparison or combination with data from other studies, and the process of recording the data is too open to subjective interpretation. The results are based on previous data and so do not gel with the theoretical studies stated. High-quality, automatically logged interaction data would circumvent each of these problems, although at the expense of missing the big picture that these techniques provide.
Understanding the analysis ecosystem: In addition to improving upon individual tools and
interfaces, developers and researchers want to understand the entire data analysis pipeline.
In practice, users leverage multiple tools to explore and visualize their data depending on
their needs. For example, a data scientist might use Hadoop and R for statistical work.
Summary of Basic Logging Techniques
Here we restate the basic types of information that should be logged.
Event: The smallest unit of information that is stored in a log is more commonly referred
to as event, even if the generation is not from an event driven program. However, graphical
user interfaces and other interactive programs are usually event-driven. An event in a log is
a piece of information that is recorded any time the work or
application the user is interested in is run on the system. Work of interest may consist of
functions called, queries run, GUI trigger handlers, threads executed, and so on. The kind
of information logged for each event and the format it appears in vary across applications
and perspectives; it may include information such as function parameters, execution
durations, caller, source code location, timestamp and error severity. Such events are
customarily logged for debugging and performance monitoring purposes. Later we discuss
specifically what types of events and associated information should be logged for user
modeling.
1. User ID: In an ideal situation, each event should be relatable to information about
the user responsible for triggering it, that is, which user's interaction with which part of
the application caused the event. For some events, the user responsible may be the system
itself, for example in the case of garbage collection. In general, arbitrating causality is
not trivial, but for the events of interest for user modeling it should be straightforward.
2. Timestamps: Events should always be associated with a timestamp that describes
the date, time, and time zone information. Timestamps are vital for understanding the
order and rate of events but are not always a reliable or accurate reflection of when an
event truly occurred. This is often not a problem when dealing with logs from a single
machine but can be extremely challenging to deal with in a distributed setting.
3. Version and configuration: It is crucial to provide some information that ties each
event recorded to metadata about the version and configuration of the interface that
generated that event. This is paramount because exactly what information is logged and
the format it is logged in tends to change across versions and configurations. Without
this information, it can become unnecessarily difficult to parse the logged data, and
ambiguities may be introduced. Ideally, even changes to minor details of the interface
would be versioned, to facilitate A/B testing.
4. Open time and close time: Events should be accompanied by opening and closing times
so as to help assess the nature and severity of the damage that occurred while the event
was live. A minimal sketch of an event record carrying the fields above is given below.
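As a concrete illustration of the information discussed above, the following minimal R sketch builds one hypothetical structured log event; the field names (user_id, opened_at, closed_at, app_version, severity) are illustrative assumptions rather than the format of any particular system.

# A sketch of a single structured log event carrying the recommended fields:
# user ID, timestamps with time zone, version metadata, and open/close times.
event <- data.frame(
  event_id    = "EVT-000123",                                          # hypothetical identifier
  user_id     = "user_42",                                             # who (or what) triggered the event
  opened_at   = as.POSIXct("2018-07-01 10:15:03", tz = "Asia/Kolkata"),
  closed_at   = as.POSIXct("2018-07-01 10:47:20", tz = "Asia/Kolkata"),
  app_version = "2.3.1",                                               # interface version that produced the log
  severity    = "WARN",
  message     = "Connection dropped while saving form",
  stringsAsFactors = FALSE
)
# Duration between open and close time, useful when judging the severity of the damage.
event$duration_mins <- as.numeric(difftime(event$closed_at, event$opened_at, units = "mins"))
print(event)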
2.3 Predictive Analytics and Recommender Systems
Analytics is related to "the extensive use of data, statistical and quantitative analysis,
explanatory and predictive models, and fact-based management to drive decisions and add
value" (Davenport and Kim, 2013, p. 3). According to the differences in the analytical
methods used as well as the objective they are being used for, analytics can be divided into
the following categories: descriptive, prescriptive and predictive. The descriptive analytics
can be referred to as reporting and it describes a certain phenomenon of interest. It
incorporates the actions of gathering, organizing, tabulating and depicting data, and even
though it is useful for decision makers in the context of an organization, it does not provide
details about why a certain event occurred, nor is it able to say what could happen in the future
(Davenport and Kim, 2013; Delen and Demirkan, 2013; Song et al., 2013). The prescriptive
analytics on the other hand, is related to making suggestion about a certain set of actions
and includes methods of experimental design and optimization (Davenport and Kim, 2013;
Sharda et al., 2013; Song et al., 2013). The experimental design demonstrates the reasons
why a phenomenon occurs by making experiments where independent variables are
manipulated, extraneous variables are controlled and therefore conclusions are being made
which result in actions that the decision maker should take. Optimization as a technique
suggests balancing the level of a certain variable relative to other variables, thus
identifying its ideal level – a recommendation for the decision maker (for example,
identifying the ideal price of a product to be sold, the ideal level of supplies to be kept in
inventory or the right quantity of a particular order to be made) (Davenport and Kim, 2013).
Finally, predictive analytics is about determining the events which would materialize in the
future with a certain likelihood (Brydon and Gemino, 2008; Boyer et al., 2012; Davenport
and Kim, 2013; Delen and Demirkan, 2013; Schmueli and Koppius, 2010; Sharda et al.,
2013; Siegel, 2013). Predictive analytics "go beyond merely describing the characteristics
of the data and the relationships among the variables (factors that can assume a range of
different values); they use data from the past to predict the future" (Davenport and Kim,
2013, p. 3). To make this clearer for the reader, the following subsections describe what
predictive analytics refers to, what events are being predicted and what results are being
achieved.
Davenport and Harris (2007) place predictive modeling and analytics within the domain of
BI&A based on two dimensions: the degree of intelligence and the competitive advantage it
gives to the organizations that use it (Figure 2.4).
Figure 2.4 Competitive Advantage v/s Degree of Intelligence
Predictive analytics refers to the "building and assessment of a model aimed at making
empirical predictions" in the context of quantitative empirical modelling (Shmueli and
Koppius, 2010, p. 555). That incorporates empirical predictive models (statistical models
like data mining algorithms, for instance) which predict future scenarios, and evaluation
methods assessing the predictive power of a model. What predictive analytics does is
pinpoint relationships between the variables and then, based on those relationships, predict
the likelihood of a certain event occurring. Despite the predictive purpose the relationships
between data are used for, explicit cause-effect relationships are not expected or assumed
to be present in the data (Davenport and Kim, 2013).
Empirical modelling for explanation refers to statistical models that are used for "testing
causal hypotheses that specify how and why certain phenomena occur" (Shmueli and Koppius,
2010, p. 554). That also includes explanatory statistical models for testing hypotheses (like
regression models, common in IS research and the social sciences in general) and methods for
evaluating the explanatory power of the model (various statistical tests for the strength of
relationships). Shmueli and Koppius (2010) point out the existence of a long-running debate
about the difference between explaining and predicting; their work distinguishes the two
terms along five dimensions: analysis goal, variables of interest, model building
optimization function, model building constraints and model evaluation, as summarized in
Table 2.1.
Table 2.1 Explanatory Statistical Modelling and Predictive Analytics according to
Shmueli and Koppius (2010)
Chapter -3 Problem Statement and Methodology
This chapter focuses on the existing systems present to handle big data and predictive
analytics. It also explains the motivation behind developing a recommendation system for
error log analytics. The methodology used for creating such a system is also elaborated
in this chapter.
3.1 Problem Statement :
Need for Predictive Error Log Analytics Using Big Data
3.1.1 Existing systems in theory:
Agile systems must support constantly changing demands, and system logs are needed to
identify potential security issues and network failures. In highly regulated environments
such as financial analysis, legal advisory, or government offices and websites, log data is
essential for regular audits and compliance reports. In the e-commerce business, user logs
give useful insights into the data that help provide a better user experience and improve
conversion.
There are generally two common log types:
1. Event logs or error logs – These provide an extensive view of how your system and its
associated components are performing at any point in time, in normal as well as high-pressure
situations. Whether the servers are running fine, or whether there are network failures and
abnormalities in your network, all of these error types are maintained in error logs.
2. User logs – These logs focus on building an intimate understanding of online user
behavior, such as what users explored on the website, which links were used the most, or
which products were added to the favourites list, keeping track of the buyer's profile for
analytics and prediction purposes. Analyzing raw user logs allows a more controlled approach,
higher accuracy, and more transparency in introspecting user activities than the statistics
provided by standard web analytics services like Google Analytics or Omniture.
With huge amounts of data extending to terabytes, even petabytes, it is next to impossible
for existing log analysis software to promptly and precisely apprehend patterns, point
towards trends, and give predictions. In the absence of an efficient, automated process to
extract insights from this humongous data, organizations end up dumping valuable data in an
unrefined "data lake", and eventually lose both the profits that these data insights could
provide and the competitive advantage they could confer.
We developed a unique search-based data analytics approach for making the best use of log
data.
Existing System to Navigate and Analyze Logs with Big Data and Search
Figure 3.1 Existing Big Data Architecture for Log Analytics
Numerous robust big data applications for log analytics have helped organizations avoid the
loss of high-value data and prevented it from being dumped into the "data lake." These
applications are backed by Hadoop's processing power, machine learning algorithms, the
predictive analytics capabilities of R, and advanced search capabilities.
A big data enabled log analytics platform (a minimal parsing sketch is given after this list):
- Accumulates data from different sources and stores raw, unprocessed and unstructured log
files from multiple business systems (often hundreds of GB daily).
- Loads the data through buffers for cleaning and processing.
- Sends it into a log analytics stack for query parsing, search indexing, and trend
visualization.
- Enables developers to perform robust and prompt analysis of user trends, clustering,
market trends, and improved error handling techniques.
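To make the "loads the data for cleaning and processing" step concrete, the following minimal R sketch parses raw, semi-structured log lines into a structured data frame. The line format assumed here ("<timestamp> <SEVERITY> <application>: <message>") and the sample lines are illustrative assumptions, not the format of any specific system.

# Parse raw log lines into a structured table of timestamp, severity,
# application and message fields.
raw_lines <- c(
  "2018-07-01 10:15:03 ERROR webportal: Connection dropped while saving form",
  "2018-07-01 10:15:04 INFO  webportal: Retrying connection",
  "2018-07-01 10:16:10 FATAL sapgw: Backend not reachable"
)
# In practice the lines would come from files, e.g. raw_lines <- readLines("app.log")

pattern <- "^(\\S+ \\S+) +(\\S+) +(\\S+): (.*)$"
m <- regmatches(raw_lines, regexec(pattern, raw_lines))
logs <- do.call(rbind, lapply(m, function(x)
  data.frame(timestamp = x[2], severity = x[3], application = x[4],
             message = x[5], stringsAsFactors = FALSE)))
print(logs)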
3.2 Motivation
System logs especially error logs provide a peek into the state of a running system.
Instrumentation occasionally generates short messages that are collected in a system-specific
log. The content and format of the logs can vary widely from one system to another and even
among components within the same system. For example, a USB driver might generate messages
indicating that it had trouble communicating with the device, while a web server might report
problems in fulfilling a client request and loading the requested page.
The content of logs is diverse in nature, and so are their uses. The log from a printer or
USB drive might be used for troubleshooting, while a server log is more commonly used to
study traffic patterns to maximize advertising revenue. Undoubtedly, a single log can be used
for various purposes: information about the traffic along different network paths, called
flows, might help a user improve network performance or detect a malicious intrusion;
call-detail records can help monitor caller and receiver details in a crime investigation,
and upon further analysis can reveal call volume and drop rates within entire cities.
This paper provides an overview of some of the most common applications of log analysis,
describes some of the logs that might be analyzed and the methods of analyzing them, and
elucidates some of the lingering challenges. Log analysis is a rich field of research with a
high impact on the operation of the systems built. We intend to provide a clear understanding
of why log analysis is both vital and difficult.
1. Debugging
Many logs are intended to facilitate debugging. As Brian Kernighan wrote in Unix for
Beginners in 1979, "The most effective debugging tool is still careful thought, coupled with
judiciously placed print statements." Although today's programs are orders of magnitude
larger and more complex than those of roughly 30 years ago, many people still use the old
logging technique of printing to the console or local disk and use some combination of manual
inspection and regular expressions to locate specific messages or patterns.
The simplest and most common use for a debug log is to grep for a specific message. If it is
believed that an application crashed due to abnormal network behavior, the person in charge
might try to locate a "connection dropped" message in the server logs. In most cases, however,
it is problematic to figure out what kind of error to look for in the logs, as there is no
well-defined mapping between log messages and observed symptoms. For example, when a service
suddenly becomes slow, the person operating it is unlikely to see an obvious error message
saying, "ERROR: The service latency increased by 10% because bug X, on line Y, was triggered."
Instead, users often search for severity-related keywords such as "error" or "failure." Such
severity levels are often used in a haphazard manner because a developer rarely has complete
knowledge of how, and in what scenario, the code will ultimately be used.
Moreover, red-herring messages such as no error detected may contaminate the result set
with non-consequential events. Consider the following message from the BlueGene/L
supercomputer:
YY-MM-DD-HH:MM:SS NULL RAS BGLMASTER FAILURE ciodb exited
normally with exit code 0
The severity of the word FAILURE is not helpful, as this message may be generated during
non-failure scenarios such as system maintenance.
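As a minimal illustration of the keyword-search approach just described, the R sketch below filters a log for severity keywords while discarding a known red-herring pattern; the sample lines, file name and patterns are illustrative assumptions.

# Grep-style search of a log: keep lines mentioning "error" or "failure",
# but drop the red-herring "exited normally" messages discussed above.
log_lines <- c(
  "2018-07-01 10:15:03 ERROR webportal: connection dropped",
  "2018-07-01 10:15:04 INFO  webportal: retry scheduled",
  "YY-MM-DD-HH:MM:SS NULL RAS BGLMASTER FAILURE ciodb exited normally with exit code 0"
)
# In practice: log_lines <- readLines("server.log")

hits        <- grepl("error|failure", log_lines, ignore.case = TRUE)
red_herring <- grepl("exited normally", log_lines, ignore.case = TRUE)
log_lines[hits & !red_herring]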
When a developer codes the print statement of a log message, it is bounded to the context
of the program source code. The content of the message, however, often excludes this
context. Without knowledge of the code surrounding the print statement or what led the
program onto that execution path, some of the semantics of the message may be lost—that
is, in the absence of context, log messages can be difficult to understand.
2. Performance
Log analysis, if done correctly, can enhance or help debug system performance. Gaining
insight into a system's performance is commonly associated with understanding how the
resources in that system are utilized. Some logs are the same as those used for debugging,
such as logs of lock operations used to debug a bottleneck. Other logs track the use of
individual resources, producing a time series of resource usage. Resource-usage statistics
often come in the form of cumulative use per time period (e.g., b bits transmitted in the
last minute). Bandwidth may also be used to characterize network or disk performance, page
swaps to represent memory effectiveness, or CPU utilization to characterize load-balancing
quality.
As seen in the case of debugging logs, performance logs must also be interpreted correctly.
Two types of contexts are especially useful in performance analysis: the environment of the
system in which the application is running and the workload of the system.
Performance problems are usually caused by interactions between components, and to reveal
that such interactions have taken place, you have to combine information from heterogeneous
logs generated by multiple sources. This can be challenging. In addition to heterogeneous log
formats, components in distributed systems may disagree on the exact time, making the precise
ordering of events across multiple components nearly impossible to reconstruct. Also, an
event that is harmless to one component (e.g., a log flushing to disk) might cause serious
problems for another (e.g., because of I/O resource contention). As the component causing the
problem is unlikely to log the event, it may be hard to capture this root cause. These are
just a few of the difficulties that emerge.
3.3 Methodology
The complete task of developing a predictive model has been divided into 3 phases or
papers.
Paper 1: The aim of paper 1 was to justify the need for Hadoop and big data by
illustrating the advantages of using MapReduce over normal methods of data handling.
It proposed a combiner approach to error log analysis using big data. This approach to
handling error logs has been recommended because it saves execution time compared to the
normal MapReduce approach.
Paper 2: Paper 2 focused on the descriptive analysis of the log dataset of company A at
hand. It finds correlations between various parameters of the dataset to give a deeper
understanding and to lead towards a statistical approach of linear regression for
interpreting relationships amongst the parameters at hand.
Paper 3: Paper 3 gives a recommendation system based on past descriptions of logs and how
they were handled, focusing on the concept of developing term document matrices and finding
the cosine similarity between the description of a new incoming log and the past data present
with us.
Figure 3.2 Structure for creation of Recommendation System
Chapter 4: Proposed Framework
This chapter describes the entire workflow required for building the recommendation system,
breaking the task into three interconnected papers so that the ultimate aim of building the
desired system is achieved.
The workflow was divided into three parts, one for each paper:
Figure 4.1 Flow of Work Done
4.1 A Combiner Approach to Effective Error Log Analysis Using Big Data
4.1.1 Role of Combiner in Map Reduce And Error Log Analysis
Role of Combiner in Map Reduce
A Combiner, also known as a semi-reducer, operates by taking the inputs from the Map class
and then passing the output key-value pairs to the Reducer class. The main purpose of a
Combiner is to summarize the map output records that share the same key. The result
(key-value collection) from the combiner is sent over the network to the actual Reducer task
as input, thereby reducing the load on the Reducer. The Combiner class is used between the
Map class and the Reduce class to minimize the volume of data transferred between Map and
Reduce, because the output of the map task is usually large and the data transferred to the
reducer task is high.
Here is a brief summary of how a MapReduce Combiner works (a conceptual sketch of the local
aggregation it performs is given after this list):
• A combiner does not have a default interface of its own; it must implement the Reducer
interface's reduce() method.
• A combiner operates on each map output key. It must have the same output key-value types
as the Reducer class.
• A combiner can produce summarized statistics from a large dataset because it replaces the
original Map output.
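The following R sketch is not Hadoop code; it only simulates the idea behind a combiner described above: each map split emits (severity, 1) pairs, a combiner pre-aggregates counts locally within its split using the same logic as the reducer, and the reducer then merges the already-summarized partial counts, so far fewer records cross the network between Map and Reduce. The sample data and the split into two mappers are illustrative assumptions.

# Conceptual simulation (plain R, not Hadoop) of map -> combine -> reduce.
severities <- c("INFO", "WARN", "DEBUG", "INFO", "FATAL", "WARN", "INFO", "TRACE")

# Map phase: pretend the records are spread across two mappers,
# each emitting (severity, 1) key-value pairs.
splits <- split(severities, rep(1:2, length.out = length(severities)))
map_output <- lapply(splits, function(s) data.frame(key = s, value = 1L))

# Combiner: summarize each mapper's output locally, before anything is "shuffled".
combined <- lapply(map_output, function(df) aggregate(value ~ key, df, sum))

# Reducer: merge the partial counts coming from all combiners.
reduced <- aggregate(value ~ key, do.call(rbind, combined), sum)
print(reduced)

# The combiner shrinks what must be shuffled from one row per record
# to at most one row per severity per split.
cat("rows shuffled without combiner:", nrow(do.call(rbind, map_output)), "\n")
cat("rows shuffled with combiner:   ", nrow(do.call(rbind, combined)), "\n")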
4.1.2 Purpose of log
Ubiquitous to the study of online activities is the possibility of collecting log file data. It is
plausible for the computer to trace every command typed by users—in some cases, every
stroke of the key. In cases where users interact only online, we can access a comprehensive
record of all of their history of interactions. The completeness of the record and ease of
collecting it are unrivalled. However, log file data is more often collected than analyzed.
The structure and type of log varies with different applications. Types of log files generally
maintained include:
1. Error logs: Keep the records of types of errors and time of occurrence. Helps in the
resolution of errors due to back tracing.
2. Web server logs: Store the history of activities on the internet. New techniques such as
clickstream mining use web server log data.
3. Console logs: The wellbeing of system applications is assessed through system or console
logs.
The main focus of this paper is on error log analytics. It manages the regeneration of data
from a semi-structured to a uniform structured format, in order to provide a base for
analytics. Business Intelligence (BI) functions such as predictive analytics are used to
predict and forecast the future status of the application on the basis of the current
scenario. Proactive measures can then be taken rather than reactive ones, ensuring efficient
maintenance of the applications and devices.
There are two types of log files:
1. Access Log
2. Error Log
This paper explores the analytics of error logs. An error log records details such as
timestamp, severity, application name, error message ID and error message description. It is
a file created during data processing to hold data that is known to contain errors and
warnings, and it is usually printed after processing completes so that the errors can be
redressed. Error logs are typically found in a heterogeneous, semi-structured format and
contain parameters such as:
- Timestamp (when the error was generated)
- Severity (whether the message is a warning, error, emergency, notice or debug)
- Name of the application that generated the error log
- Error message ID
- Error log message description
Input:
Error log dataset
Approach for MapReduce (a minimal sketch of the partitioning step follows this list):
1. First, a partitioner is created for the dataset which divides the errors into five
categories, namely INFO, FATAL, DEBUG, TRACE and WARN.
2. The mapper and reducer functions are then run on different amounts of data to analyze the
time spent by the CPU on the task.
3. To reduce the time spent, an additional combiner function is added before the reducer,
thereby reducing the load on the reducer function and decreasing the CPU time needed to
perform the same task.
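A minimal R sketch of step 1 is shown below: each log record is assigned to one of the five severity categories. The sample lines and the assumption that the severity token appears verbatim in each line are illustrative; the actual partitioner in the MapReduce job works on key-value pairs rather than plain strings.

# Partition log records into the five severity categories.
categories <- c("INFO", "FATAL", "DEBUG", "TRACE", "WARN")

sample_logs <- c(
  "2018-07-01 10:15:03 INFO  webportal started",
  "2018-07-01 10:16:10 FATAL sapgw backend not reachable",
  "2018-07-01 10:17:45 WARN  webportal slow response",
  "2018-07-01 10:18:02 TRACE scheduler heartbeat"
)

# For each line, find which severity token it contains (assumes exactly one).
partition_of <- function(line) {
  hit <- categories[vapply(categories, function(cat) grepl(cat, line, fixed = TRUE), logical(1))]
  if (length(hit) == 0) "UNKNOWN" else hit[1]
}

partitions <- split(sample_logs, vapply(sample_logs, partition_of, character(1)))
str(partitions)                 # records grouped by severity category
sapply(partitions, length)      # record count per category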
4.2 Effective Error Log Analysis Using Correlation
Log analysis is the process of reconditioning raw log data into useful information for
suggesting solutions to existing problems. The market for log analysis software is huge and
growing as more business insights are obtained from logs. Stakeholders in this industry need
accurate, quantitative data about the log analysis process to identify inefficiencies,
streamline workflows, predict tasks, design higher-level analysis languages, and spot
outstanding challenges. For these purposes, it is imperative to understand log analysis in
terms of discrete tasks and data transformations that can be measured, evaluated, correlated,
and predicted, rather than through qualitative portrayals and experience alone. One problem
is that logged system events are not an excellent representation of human log analysis
activity. Logging code is typically not designed to capture human behavior at the most
effective level of granularity. Even if it were, recorded events may not reflect internal
mental activities.
The goal of this paper is to find correlation and use descriptive analysis on the log dataset
of company XYZ.
Log Dataset analyzed in this paper contains the following parameters:
1. Track : The project under which the error happened.
2. Incident ID : The ID of incident.
3. Priority : Priority of the error is labeled as High, Medium, Low.
4. Time of Incident Assigned to App Team
5. Major Application Affected : Major application of the company affected by the
error.
6. Status : Status of the ticket whether it is closed or open.
7. Primary Assignment: Namely domains like web ecommerce, document central,
SAP document etc.
8. Close Time
9. Restoration Duration (h:mm) Calendar days: Time taken to resolve the error in calendar
days.
10. BWA Restoration Duration (h:mm): Time taken in hours.
11. To be included in Restoration SLA
12. Restoration SLA Met
13. ELS Filter
14. Closed By
15. Basic Description
16. Description
17. Manual Calculation (D:H:MM:SS)
18. Manual Calculation (Hours)
19. KEDB Check : KEDB refers to Known Error Database.
20. KEDB Compliance (Y/N)
4.2.1 Terminology Used:
Known Error Database : KEDB
The Known Error Database is a storehouse of information that portrays all of the conditions
in your system application that might result in an incident for your customers and users.
As users raise incident issues, the support engineers pursue the traditional steps stated in
the Incident Management process, namely logging, categorization and prioritization. Soon
after that, they are on the hunt for a correct and viable solution for the user. This is
where the KEDB steps in. The engineer should interact with the KEDB in much the same manner
as with any search engine or knowledge database: the engineer searches using the "Known
Error" field and retrieves information from the Workaround field. The KEDB terminology thus
consists of a Known Error and a Workaround field.
1. The Known Error
The Known Error is a characterization of the problem in the user’s words. In case of an
error, the users contact the service desk for help. While describing the problem they have a
limited view of the entire scope of the root cause. The user should use screenshots of error
messages, as well as the text of the message to aid searching the kind of error they have
encountered. They should also include accurate descriptions of the conditions that they have
experienced. The known error is basically an error that has been recorded along with its
solution if it is found for future references. These are the types of things we should be
describing in the Known Error field. A good case of a Known Error would be:
When accessing the Timesheet application using Internet Explorer 6 users experience an
error message when submitting the form.
The error message reads "JavaScript exception at line 123".
The Known Error should be written in terms reflecting the customer’s experience of the
Problem.
2. The Workaround
The Workaround is a series of chronological steps that the service desk personnel could take
in order to either restore service to the user or provide temporary relief.
The Known Error is a search key. A Workaround is what the engineer is hoping to find – a
search result. Having a detailed Workaround, a set of technical actions the Service desk
should take to help the user, has multiple benefits – some more obvious than others.
4.2.2 Benefits of Using a Known Error Database (KEDB)
1. Less restoration time: Consider a scenario where the user has lost access to a service
due to an anomaly that is already known and has a place in the KEDB. The best possible
service that a user could hope for is an instant restoration of service or a temporary
resolution. Having a good Known Error Database that makes the problem easy to find also
means that the workaround is faster to locate. All of the time otherwise required to properly
analyse and understand the root cause of the user's issue is removed by giving the service
desk engineer quick access to the workaround, thereby arriving at a solution quickly and with
less effort.
2. Recurring workaround: With a known error stored in the KEDB, recurring problems whose
solutions were recorded are solved in a manner such that each customer having the same
problem is given a solution with the same consistency in terms of speed and accuracy. The
KEDB helps avoid the case of one error having several different solutions; the same types of
error are solved in a similar manner, thereby providing a guideline for handling similar
errors.
3. Smart Work: In the absence of a KEDB engineers are often seen spending time and
energy trying to find a resolution for the recurring issues. This would be likely in
distributed teams working from different offices, but it is also a more common
occurrence in a single team. KEDB helps save time, energy, money and resources.
4. Evade the skill divide – A team consists of engineers at different levels of skill. It is
impossible to employ a team in which everyone is an expert in every functional area, so it is
natural to have many junior members at a lower skill level. A system for capturing the
workaround for complicated problems allows any engineer to quickly resolve issues that are
affecting users. Teams are often cross-functional: we might foresee a scenario wherein there
is a centralized application support function in a head office, with users in remote offices
supported by their home IT teams. A KEDB gives all IT engineers a single platform to search
for issues bothering the customers.
5. Avoid conflicting or controversial workarounds: The KEDB lets you establish parameters and
guidelines to control the workarounds that engineers suggest to users. There have been many
moments in the past where the methods engineers suggested to customers, once discussed,
revealed the complex and risky fixes used internally: for example, disabling the antivirus to
avoid unexpected behavior, or upgrading whole software suites to fix a minor issue. All
managers can relate to this. Documented, approved workarounds help eliminate dangerous ad hoc
workarounds.
6. Avoid futile ownership transfer of incidents – A weak point in the Incident Management
process is the continuous transfer of ownership between teams. This is the point where a
customer issue goes to the bottom of someone else's queue of work and is left unhandled, even
if it was a high priority in the original owner's queue. Such transfers often happen without
enough detailed context or background information; enabling the service desk to resolve
already-known issues themselves prevents this transfer of ownership.
7. Gain insight into the severity of the problem at hand: Well-documented Known Errors make
it much more convenient to link new incidents to existing, previously documented problems.
Firstly, this avoids duplicate logging of problems by different engineers. Secondly, it gives
better insight into how severe the encountered problem is. Consider two problems in your
system: a condition that affects a network router and causes it to crash once every 5 months,
and a transactional database that is running slowly and adding 4 seconds to timesheet entry.
It is expected that the first problem would be given a high priority and the second a lower
one; it stands to reason that an outage on a core router would be more damaging to the system
than a slowly running timesheet system. But which would cause more incidents over time? You
might be associating 5 new incidents per month with the timesheet problem, whereas the router
only causes issues irregularly. Being able to quickly link incidents to existing documented
problems allows you to judge the relative impact of each one.
4.2.3 The KEDB implementation
In technical terms, when we talk about the KEDB we generally refer to a part of the Incident
Management database rather than a completely separate store of data; at minimum, one suitable
implementation of the KEDB in that manner should be in place.
There is a one-to-one relation between Known Error and Problem so it is logically correct
that the standard data representation of a problem with its number, assignment data, work
notes etc., should also hold the data that is required for the KEDB.
It is not incorrect to implement this in a different way that is storing the Problems and
Known Errors in separate locations, but it should preferably be kept all together to ease
analysis of both the known errors and problems.
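To make the search-key/search-result idea concrete, the following minimal R sketch models a KEDB as a small table and looks up a workaround by keyword; the field names and rows are illustrative assumptions, not the company's actual KEDB schema.

# A toy KEDB: the Known Error text is the search key, the Workaround is the result.
kedb <- data.frame(
  known_error = c(
    "JavaScript exception at line 123 when submitting the timesheet form in Internet Explorer 6",
    "Server timeout when accessing document central during peak hours"
  ),
  workaround = c(
    "Ask the user to submit the timesheet with a supported browser and link the incident to the problem record.",
    "Restart the document central application pool and ask the user to retry."
  ),
  stringsAsFactors = FALSE
)

# An engineer searches the Known Error field with keywords from the incident description.
kedb_search <- function(keywords, db = kedb) {
  db[grepl(keywords, db$known_error, ignore.case = TRUE), , drop = FALSE]
}

kedb_search("timesheet")   # returns the matching known error together with its workaround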
4.2.4 Importance of R in Data Analytics
R is a programming language that allows statisticians to perform complicated and intricate
analyses without getting into too much low-level detail. With so many benefits for data
science, R has steadily gained ground among big data professionals. According to a 2014
survey, R is one of the most powerful and popular programming languages used by data
scientists today. Features that make R popular include:
1. The Fact That R Is an Open Source Programming Language
R is free for everyone to use because it is an open source programming language.
Programming codes of R can be used across all platforms like Linux, Windows, and Mac.
There are no limits with respect to subscription costs or license management, which makes
it easily available to data geeks. Also, you can have free access to the R programming
libraries. Nevertheless, there are some commercial libraries meant for enterprises dealing
with data in terabytes. Hadoop is a good example.
2. The Ultimate Statistical Analysis Kit
R is a programming language having all standard data analysis tools to access data in varied
formats, for several data manipulation operations – merges, transformations
and aggregations. It includes tools for conventional and modern statistical models, including
regression, ANOVA, GLM and tree models, in an object-oriented framework, which makes it
easier to extract and merge the needed information rather than copying it.
3. Benefits of Charting
R has some great tools to aid data visualization, creating graphs, bar charts, multi-panel
lattice charts, scatter plots and custom-designed graphics. The unparalleled charting and
graphics offered by the R language have been heavily shaped by data visualization experts.
Graphics based on R programming can be seen in outlets like The New York Times, The
Economist, and Flowing Data.
4. R Language Offers Consistent Online Support
The R language is among the most sophisticated statistical software environments, helped by
its quick and consistent online support. The language has a loyal user base because
statisticians, scientists and engineers, even without formal computer programming knowledge,
can easily use it.
5. The Most Powerful Ecosystem
R has a very strong ecosystem of packages with many functionalities built in for modern
statisticians. "dplyr" and "ggplot2" are examples for data manipulation and plotting, which
relieve data scientists from having to build graphics and charting capabilities into their
applications themselves. The R programming language can do almost everything, for business
and otherwise. It is used by leading social networks like Twitter, and data scientists find
it an indispensable tool.
Error log analytics has been performed using R packages, mainly dplyr, plyr and ggplot2,
resulting in numerous graphs that describe the correlation and relationships between the type
of errors and the various columns of the log dataset described above. Correlation is any of a
broad class of statistical relationships involving dependence, though in common usage it most
often refers to the extent to which two variables have a linear relationship with each other.
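The following minimal R sketch illustrates the kind of descriptive analysis performed with dplyr and ggplot2. The toy data frame only mimics a few columns of the real log dataset (priority, KEDB check, restoration duration in hours); the values are made up for illustration.

library(dplyr)
library(ggplot2)

logs <- data.frame(
  priority          = c("Low", "Low", "Medium", "High", "Low", "Medium"),
  kedb_check        = c("Y", "Y", "N", "N", "Y", "Y"),
  restoration_hours = c(12, 30, 40, 6, 20, 35),
  stringsAsFactors  = FALSE
)

# How often is the KEDB consulted per priority, and how quickly are errors restored?
summary_tbl <- logs %>%
  group_by(priority, kedb_check) %>%
  summarise(count = n(), avg_restoration_hours = mean(restoration_hours))
print(summary_tbl)

# Bar chart comparable to the graphs in Chapter 5: number of logs by priority,
# split by whether the KEDB was consulted.
ggplot(logs, aes(x = priority, fill = kedb_check)) +
  geom_bar(position = "dodge") +
  labs(x = "Priority", y = "Number of logs", fill = "KEDB check")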
4.3 A predictive model for Error Log Analytics
Figure 4.2 Dataset of Logs of Company A
Phases of recommendation process
4.3.1 Information collection phase
This collects relevant information of users to generate a user profile or model for the
prediction tasks including user’s attribute, behaviors or content of the resources the user
accesses. A recommendation agent cannot function accurately until the user profile/model
has been well constructed. The system needs to know as much as possible from the user in
order to provide reasonable recommendation right from the onset. Recommender systems
rely on different types of input such as the most convenient high quality explicit feedback,
which includes explicit input by users regarding their interest in item or implicit feedback
by inferring user preferences indirectly through observing user behavior [31]. Hybrid
feedback can also be obtained through the combination of both explicit and implicit
feedback. In E- learning platform, a user profile is a collection of personal information
associated with a specific user. This information includes cognitive skills, intellectual
abilities,
Page 53
learning styles, interest, preferences and interaction with the system. The user profile is
normally used to retrieve the needed information to build up a model of the user. Thus, a
user profile describes a simple user model. The success of any recommendation system
depends largely on its ability to represent user’s current interests based on the previous data.
Definite models are imperative for obtaining useful and accurate recommendations from
any prediction techniques.
4.3.2 Explicit feedback
The system normally prompts the user through the system interface to provide ratings for
items in order to construct and improve his or her model. The accuracy of the recommendations
depends entirely on the ratings provided by the user. The shortcoming of this method is that
it requires user involvement at every stage, and users are not always willing to supply
enough information. Although explicit feedback requires more effort from the user, it is
still viewed as providing accurate and reliable data, since it does not involve extracting
preferences from actions; it also brings transparency into the recommendation process, which
results in a slightly higher perceived recommendation quality and instills more faith in the
recommendations listed by the system.
4.3.3 Implicit feedback
The system automatically infers the user’s preferences by monitoring the varied actions of
users such as purchase history, navigation history, and time spent on some web pages, links
followed by the user, the content of e-mails, and button clicks, among others. Implicit
feedback reduces the burden on users by inferring their preferences from their behavior
within the system. Though the method does not require effort from the user, it is less
accurate. It has also been argued that implicit preference data might in actuality be more
objective, as there is no bias arising from users responding in a socially desirable way [32]
and there are no self-image issues or any need for maintaining an image for others [33].
4.3.4 Hybrid feedback
A hybrid system combines the strengths of both implicit and explicit feedback in order to
minimize their weaknesses and obtain the best performing system. This can be achieved by
using implicit data as a check on explicit ratings, or by allowing the user to give explicit
feedback only when he or she chooses to express an interest in doing so.
Figure 4.3 Recommendation Phases
4.3.5 Steps for building a recommendation system:
1. Understand the data set based on the correlation and relevance of its columns.
2. Divide the data into 3 groups based on the track: web, e-commerce and custom.
3. Create a corpus of the data present.
4. Clean the corpus, removing stop words, punctuation, numbers and special characters.
5. Make the TDM (term document matrix) of the corpus of the dataset.
6. Make the TDM for any new error log coming into the system.
7. Find the cosine similarity of the new TDM with the TDMs of the three groups to determine
which error group it belongs to.
8. According to the group allotted, the team needed to solve the error and the time for the
error to be solved are determined (a minimal R sketch of steps 3-7 is given after this list).
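The following base-R sketch illustrates steps 3-7 under simplifying assumptions: each track group is represented by a plain term-frequency vector (a tiny stand-in for a full term document matrix), and the example group descriptions are invented for illustration rather than taken from the real company data.

# Tokenize and clean text: lowercase, strip punctuation/numbers, drop stop words.
clean_tokens <- function(text) {
  text <- gsub("[^a-z ]", " ", tolower(text))
  tokens <- unlist(strsplit(text, "\\s+"))
  stop_words <- c("the", "in", "of", "a", "an", "and", "to", "is")
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}
term_freq <- function(texts) table(clean_tokens(paste(texts, collapse = " ")))

# Illustrative past log descriptions per track group.
groups <- list(
  custom = c("custom report batch job failed", "custom interface timeout error"),
  sap    = c("sap document posting error", "sap backup service not responding"),
  web    = c("web ecommerce checkout page error", "web portal login failure")
)
group_tf <- lapply(groups, term_freq)

# Cosine similarity between two term-frequency vectors.
cosine_sim <- function(tf1, tf2) {
  terms <- union(names(tf1), names(tf2))
  v1 <- as.numeric(tf1[terms]); v1[is.na(v1)] <- 0
  v2 <- as.numeric(tf2[terms]); v2[is.na(v2)] <- 0
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

# Classify a new error log by its most similar group.
new_log <- "Error in sql server application service TDP backup resolve fast"
sims <- sapply(group_tf, cosine_sim, tf2 = term_freq(new_log))
print(sims)
names(which.max(sims))   # predicted track group for the new error log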
The following figure shows the algorithm for developing the proposed framework :
Figure 4.4 Algorithm for finding TDM
Chapter 5: Result Analysis
This Chapter presents the experimental work carried out in this thesis. Log files of company
A are used as the data to analyze for patterns and correlations and propose a framework for
predictive analytics of error log in the dissertation work. To start we have done the analysis
of log file dataset of Company A. Section 5.1 of this chapter gives the details of the analysis
of dataset with map reduce approach and map reduce with combiner approach with the
results. After the analysis of data set using map reduce, we move on to find correlation
between different parameters of the log dataset to get a better descriptive analysis of the
data set. Section 5.2 focuses on correlation results. The section 5.3 describes all the
experimental work done for the proposed recommendation system.
5.1 Effective Combiner Approach to Error Log Analytics
As described in Section 4.1.2, the error log records details such as timestamp, severity,
application name, error message ID and error message description, and is typically found in
a heterogeneous, semi-structured format.
5.1.1 Input:
Error log dataset of company A.
The MapReduce approach of Section 4.1 is applied: a partitioner first divides the errors into
the five categories INFO, FATAL, DEBUG, TRACE and WARN; the mapper and reducer functions are
then run on different amounts of data to measure the CPU time spent on the task; finally, an
additional combiner function is added before the reducer, reducing the load on the reducer
and decreasing the CPU time needed to perform the same task.
5.1.2 Output:
Time taken by reducer to process all the logs with combiner and without combiner.
Table 5.1 Output

| S.No | Amount of Data | Number of records | Time taken for MapReduce without Combiner (in ms) | Time taken for MapReduce with Combiner (in ms) |
|------|----------------|-------------------|-----------------------------------------------------|--------------------------------------------------|
| 1    | 50 MB          | 3,26,93           | 7.9                                                 | 6.1                                              |
| 2    | 6 GB           | 6,00,000          | 15000                                               | 13492                                            |
The above output shows that enormous amounts of data, extending to several gigabytes, can be
analyzed more efficiently using the partitioner and combiner approach than with the regular
MapReduce approach. The combiner approach is therefore more time-effective: it minimizes the
load on the reducer, and the segregation of errors into five categories also contributes to
the efficiency.
5.2 Effective Log Analysis Using Correlation:
As described in Section 4.2.4, error log analytics was carried out using the R packages
dplyr, plyr and ggplot2, producing numerous graphs that describe the correlation and
relationships between the type of errors and the various columns of the log dataset.
5.2.1 Descriptive Representation of Correlation between Parameters of Dataset
The first graph shows that "Low" priority errors are mostly resolved using the KEDB. It
therefore becomes very important for a company to build a KEDB so as to enhance the
error-solving capability of the support team, since previously known types of errors can be
easily solved using a KEDB check. Errors with priority low have to be resolved within 7
calendar days, errors with priority medium within 48 hours, and errors with priority high
within 7 hours of calendar time.
Figure 5.1 Descriptive Representation of Correlation between Parameters of Dataset
5.2.2 The second graph shows the high correlation between the number of logs opened and the
number searched in the KEDB. The relation is direct, because more logs mean more KEDB checks.
This correlation leads to a decrease in the time required to close the ticket and resolve the
error, which further results in high customer satisfaction.
Figure 5.2 Correlation between number of logs opened and searched
5.2.3 The third graph displays the categories of logs searched in the KEDB, that is, the
errors which have been previously seen and resolved. The main categories are Alert and KEDB.
Alert consists of errors such as server timeouts, space issues, etc. This gives the company a
heads-up regarding what types of issues occur frequently and can be solved immediately by
keeping a record of them.
Figure 5.3 Statistical report of number of logs by category
5.2.4 The fourth graph shows the time series outliers in the period of November, with a peak
on November 6. The omi integration refers to a project wherein ticket generation is not
manual but automatic, and tickets can be generated as fast as the errors come in, thereby
reducing the load on the team and giving them time to resolve the rapidly arriving errors.
The time series outliers help the company observe server timeouts and the times at which they
occur most often, which helps in resolving the issue at hand faster.
Figure 5.4 Count of KEDB checks by date of incident
5.2.5 The graph represents the calendar days taken to restore from the error and the time
taken by the team to resolve it, including during the two outliers.
Figure 5.5 Average Restoration Duration
5.2.6 The pie chart shows that priority "Low" has the maximum number of logs available.
Figure 5.6 Number of Logs Available
5.2.7 The figure shows that two team members have a high number of KEDB lookups in the web
track and have managed to resolve most of the logs by referring to the KEDB, which again
emphasizes the need for a KEDB in a company.
Figure 5.7 Need of KEDB
5.2.8 The logs for primary assignment have outliers in the period before November 6, which
were resolved efficiently.
Figure 5.8 Logs for Primary Assignment that were Resolved Efficiently
5.2.9 Average of Restoration Duration: The graph represents the average restoration duration
from the time the error occurred to the time it was resolved.
Figure 5.9 Average of Restoration Duration
5.2.10 Count of KEDB Check: The graph shows the maximum number of KEDB checks for a
particular error.
Figure 5.10 Count of KEDB Check
5.2.11 Ticket Opened By
Figure 5.11 Count of Tickets Opened
5.2.12 The graph shows the correlation between "opened by" and "Low" priority, which
indicates that errors with low priority were opened and resolved the most by the team, as
compared to errors of medium and high priority.
Figure 5.12 Correlation between logs opened and priority
5.2.13 The next graph shows that KEDB and ELS had the highest counts of "opened by", which
means that these errors were first searched for in the KEDB and were then given ELS, that is,
early life support, for resolution.
Figure 5.13 Count of logs opened in KEDB and ELS
5.2.14 Count of Logs by Date of Incident
Figure 5.14 Count of Logs by Date of Incident
5.2.15 The above graphs help us find the correlation between logs and various parameters,
such as the person solving them, the priority of errors occurring most often, and the
calendar days required to solve them, which in turn allows us to build a predictive modelling
framework for error log analytics.
5.3 A Predictive model for Error Log Analytics
Figure 5.15 Dataset of Company A
Following the steps outlined in Section 4.3.5 (understanding the dataset, dividing it into
three track groups, building and cleaning a corpus, constructing term document matrices, and
comparing a new log against each group via cosine similarity), the recommendation system was
built as follows.
1. Group 1 TDM and length: Group 1 is Custom and the number of records is 8865. The
screenshot below shows the TDM created for Group 1 and the number of records present in the
TDM at that time.
Figure 5.16 Screenshots of Custom
Screenshot of the TDM of Group 1:
Figure 5.17 Screenshots of TDM of Group 1
2. Group 2 length and TDM: The screenshot below shows the TDM of Group 2 and the number of
records present in the TDM at that time. Group 2 is SAP and the number of records is 2445.
Figure 5.18 Screenshots of SAP
3. Group 3 is Web, with 12645 records.
Example of predicting the group for a new log coming in:
1. Tokenize the new error description into a data frame:
txt <- strsplit("Error in sql server application service TDP backup resolve fast", split = " ")[[1]]
data <- data.frame(text = txt, stringsAsFactors = FALSE)
2. The TDM for this error is built from the tokens in point 1, as shown in Figure 5.19:
Figure 5.19 Screenshots of Error
3. Cosine similarity with the groups: The screenshot below shows the cosine similarity
between the TDMs of the groups and the TDM of the new error log entry.
Figure 5.20 Cosine Similarity
4. Analyzing these values gives the similarity of the new log to each group.
Result:
1. Similarity comparison of the new error log with Group 1, Group 2 and Group 3:

Table 5.2 Result

| Similarity with Group 1 | Similarity with Group 2 | Similarity with Group 3 |
|-------------------------|-------------------------|-------------------------|
| 0.231                   | 0.132                   | 0.100                   |

2. Average time for resolution of an error (a short sketch combining these figures follows):
Group 1: Custom: 348.82 hours
Group 2: SAP: 350.97 hours
Group 3: Web: 353.105 hours
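A minimal R sketch of how these reported similarity scores and historical average resolution times translate into the final recommendation for the new log:

# Route the new error log to the most similar group and report the
# expected resolution time from the historical averages above.
similarity           <- c(Custom = 0.231, SAP = 0.132, Web = 0.100)
avg_resolution_hours <- c(Custom = 348.82, SAP = 350.97, Web = 353.105)

best_group <- names(which.max(similarity))
cat("Route new error log to group:", best_group, "\n")
cat("Expected resolution time (hours):", avg_resolution_hours[best_group], "\n")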
Chapter 6: Conclusion and Future work
6.1 Conclusion
The above system can be very beneficial to companies in an environment where around 1 TB of
logs is generated and accumulated in the system every day. Log analytics gives hindsight into
how the products and applications created are handled, and the efficiency of handling them is
very important. In professional terms these error logs are converted into tickets, and any
ticket can land with any person of a group who has no previous knowledge of solving that
particular type of ticket, thereby increasing the ticket-solving time and losing customer
satisfaction.
The advantages of Recommendation system proposed are:
1. It creates a flow path for the log error to land in the correct group.
2. Grouping helps create a more sophisticated and efficient approach to error
handling.
3. Team specific error handling results in efficiency.
4. The customer can know exactly when the situation will be solved and can prepare
for backup services for the allotted time period to solve the error.
5. The approach is highly beneficial for huge amounts of data which cannot be
handled by regular approaches.
6.2 Future Work
The efficiency of the system can be improved using SVMs (Support Vector Machines). Support
vector machines are a set of supervised learning methods used for classification, regression
and outlier detection.
The advantages of support vector machines are:
1. They are highly effective in high-dimensional spaces.
2. They remain effective in cases where the number of dimensions is greater than the number
of samples.
3. They use a subset of training points in the decision function (called support vectors),
so they are also memory efficient.
4. Versatile: different kernel functions can be specified for the decision function. Common
kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
1. If the number of features is much greater than the number of samples, the method is prone
to over-fitting, so the choice of kernel function and regularization term becomes crucial.
2. SVMs do not directly provide probability estimates; these are calculated using an
expensive five-fold cross-validation.
The model prepared by support vector machine as described above is dependent on a subset
of the training data, because of which the cost function for preparing the model does not
worry about training points that lie beyond the margin. Analogously, the model produced
by Support Vector Regression depends only on a subset of the training data, because the
cost function for building the model ignores any training data close to the model prediction.
There are three different implementations of Support Vector Regression: SVR, NuSVR and
LinearSVR. LinearSVR provides a faster implementation than SVR but only considers
linear kernels, while NuSVR implements a slightly different formulation than SVR and
LinearSVR.
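As a sketch of how this future work could be realized, the R snippet below trains a small SVM classifier that assigns an error description to a track group, which could replace the cosine-similarity rule of the current system. The e1071 package and the toy features (counts of SAP-related and web-related terms per description) are assumptions chosen for illustration, not part of the implemented system.

library(e1071)

# Toy training data: simple term-count features per past error description.
train <- data.frame(
  n_sap_terms = c(0, 0, 3, 2, 0, 1),
  n_web_terms = c(0, 1, 0, 0, 3, 2),
  group       = factor(c("custom", "custom", "sap", "sap", "web", "web"))
)

# Fit a linear-kernel SVM classifier for the track group.
model <- svm(group ~ ., data = train, kernel = "linear")

# Predict the group for a new error log's features.
new_log_features <- data.frame(n_sap_terms = 2, n_web_terms = 0)
predict(model, new_log_features)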
References:
1.
McAfee,A. and Brynjolfsson, E. ―Big data: the management revolution‖, Harvard
Business Review 90, 2012 , pp.60-68
2.
Goodhope, K. et al, ―Building LinkedIn's Real-time Activity Data Pipeline‖, Data
Engineering, Vol.35, No.2,2012 pp. 33-45.
3.
Bhardwaj, et al, ―Big data analysis: Issues and challenges‖, International
Conference on Electrical Electronics Signals Communication and Optimization
(EESCO), 2015.
4.
Souza,L. and Girish UR, ―Error Log Analytics using Big Data and MapReduce‖
IJCSIT,Vol.6 ,No.3, 2015, pp.2364-2367
5.
Bhandarkar, M, ―MapReduce programming with apache Hadoop‖, Parallel &
Distributed Processing(IPDPS), IEEE International Symposium, April 2010
6.
Narkhede,S. and Baraskar, T. ―HMR Log Analyzer: Web Application Logs over
Hadoop Map Reduce‖, International Journal of UbiComp(IJU), Vol.4, No.3, July 2013,
pp.41-47
7.
Katkar, G.S. and Kasliwal, A.D ―Use of Log Data for Predictive Analytics
through Data Mining‖ Current Trends in Technology and Science, Vol.3, No.3 AprilMay 2014, pp. 217-222
8.
Peng,W.et al, ―Mining Logs Files for Data-Driven System Management ―, ACM
SIGKDD Exploration Newsletter- Natural Language Processing and Text Mining, Vol.
7, Issue1, June 2005, pp. 44-51.
9.
Grace, L.K.J et al, ―Analysis of web logs and web user in web mining‖,
International Journal of Network Security & Its Applications IJNSA, Vol.3, No.1,
January 2011, pp. 99-110.
10. Bruckman, A. ―Chapter 58: Analysis of Log File Data to Understand User
Behavior and Learning in an Online Community‖, Georgia Institute of Technology, pp.
1449-1465.
11.
ALSPAUGH, S., et al, Better logging to improve interactive data analysis tools In
KDD Workshop on Interactive Data Exploration and Analytics ,2014.
Page 77
12.
BARRETT, R., ET AL. Field studies of computer system administrators: Analysis
of system management tools and practices, ACM Conference on Computer Supported
Cooperative Work (CSCW), 2004.
13.
BITINCKA, L., ET AL. Optimizing data analysis with a semi-structured time
series database, OSDI Workshop on Managing Systems via Log Analysis and Machine
Learning Techniques,(SLAML) ,2010.
14.
CHEN, Y., ET AL. Design implications for enterprise storage systems via multi-
dimension trace analysis, ACM Symposium on Operating Systems Principles (SOSP)
,2011.
15.
CHEN, Y., ET AL. Interactive analytical processing in big data systems: A cross-
industry study of map reduce workloads., International Conference on Very Large
Databases (VLDB) ,2012.
16.
CHIARINI, M. Provenance for system troubleshooting. In USENIX Conference
on System Administration ,LISA, 2011.
17.
COUCH, A. L, Standard deviations of the average system administrator. USENIX
Conference on System Administration (LISA), 2008.
18.
GOTZ, D., ET AL. Characterizing users’ visual analytic activity for insight
provenance, IEEE Information Visualization Conference (InfoVis) ,2009
19.
LOU, J.-G., ET AL. Mining dependency in distributed systems through
unstructured logs analysis., SIGOPS Operating System Review ,2010.
20.
LOU, J.-G., FU, Q., WANG, Y., AND LI, J. Mining dependency in distributed
systems through unstructured logs analysis. SIGOPS Operating System Review (2010).
21.
MAKANJU, A. A., ET AL. Clustering event logs using iterative partitioning
,ACM International Conference on Knowledge Discovery and Data Mining (KDD) (2009).
Page 78
22.
NORIAKI, K., ET AL. Semantic log analysis based on a user query behavior
model. In IEEE International Conference on Data Mining (ICDM) (2003).
23.
OLINER, A., ET AL. Advances and challenges in log analysis, ACM Queue,2011.
24.
OLINER, A., AND STEARLEY, J. What supercomputers say: A study of five
system logs. In IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN), 2007.
25.
OLINER, A. J., ET AL. Using correlated surprise to infer shared influence. In
IEEE/IFIP International Conference on Dependable Systems and Networks
(DSN),2010.
26.
OLSTON, C., ET AL. Pig latin: A not-so-foreign language for data processing. In
ACM International Conference on Management of Data (SIGMOD),2008.
27.
OTHERS, S. K. Wrangler: Interactive visual specification of data transformation
scripts, ACM Conference on Human Factors in Computing Systems CHI, 2011.
28. Adomavicius, G. and Tuzhilin, A., "Multidimensional recommender systems: a data warehousing approach", Proceedings of the 2nd International Workshop on Electronic Commerce (WELCOM'01), Lecture Notes in Computer Science, Vol. 2232, Springer, 2001.
29. Adomavicius, G., Sankaranarayanan, R., Sen, S., and Tuzhilin, A., "Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach", ACM Transactions on Information Systems, 23(1), January 2005.
30. Aggarwal, C. C., Wolf, J. L., Wu, K.-L., and Yu, P. S., "Horting hatches an egg: A new graph theoretic approach to collaborative filtering", Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 1999.
31. Ansari, A., Essegaier, S., and Kohli, R., "Internet recommendation systems", Journal of Marketing Research, pp. 363-375, August 2000.
32. Armstrong, J. S., Principles of Forecasting: A Handbook for Researchers and Practitioners, Kluwer Academic Publishers, 2001.
33. Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.
34. Balabanovic, M. and Shoham, Y., "Fab: Content-based, collaborative recommendation", Communications of the ACM, 40(3):66-72, 1997.
35. Basu, C., Hirsh, H., and Cohen, W., "Recommendation as classification: Using social and content-based information in recommendation", Recommender Systems: Papers from the 1998 Workshop, Technical Report WS-98-08, AAAI Press, 1998.
36. Belkin, N. and Croft, B., "Information filtering and information retrieval", Communications of the ACM, 35(12):29-37, 1992.
37. Billsus, D. and Pazzani, M., "Learning collaborative information filters", International Conference on Machine Learning, Morgan Kaufmann Publishers, 1998.
38. Billsus, D. and Pazzani, M., "A Personal News Agent That Talks, Learns and Explains", Proceedings of the Third Annual Conference on Autonomous Agents, 1999.
39. Billsus, D. and Pazzani, M., "User modeling for adaptive news access", User Modeling and User-Adapted Interaction, 10(2-3):147-180, 2000.
40. Billsus, D., Brunk, C. A., Evans, C., Gladish, B., and Pazzani, M., "Adaptive interfaces for ubiquitous web access", Communications of the ACM, 45(5):34-38, 2002.
41. Breese, J. S., Heckerman, D., and Kadie, C., "Empirical analysis of predictive algorithms for collaborative filtering", Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998.
42. Buhmann, M. D., "Approximation and interpolation with radial functions", in Multivariate Approximation and Applications, eds. N. Dyn, D. Leviatan, D. Levin, and A. Pinkus, Cambridge University Press, 2001.
43. Burke, R., "Knowledge-based recommender systems", in A. Kent (ed.), Encyclopedia of Library and Information Systems, Vol. 69, No. 32, Marcel Dekker, 2000.
44. Mandal, D. Pratiksha, "Study of Elastic Hadoop on Private Cloud", International Journal of Scientific and Research Publications, Vol. 6, No. 1, January 2016, p. 321, ISSN 2250-3153.
45. Maitrey, S. and Jha, C. K., "Handling Big Data efficiently by using MapReduce Technique", 2015 IEEE International Conference on Computational Intelligence & Communication Technology.
46. Elsayed, A. et al., "MapReduce: State of the art and research directions", IJCEE, Vol. 6, No. 1, February 2014.
47. Parmar, Hiren, and Tushar Champaneria, "Comparative Study of Open Nebula, Eucalyptus, Open Stack and Cloud Stack", International Journal of Advanced Research in Computer Science and Software Engineering, 4.2 (2014).
48. Parmar, Hiren, and Tushar Champaneria, "Comparative Study of Open Nebula, Eucalyptus, Open Stack and Cloud Stack", International Journal of Advanced Research in Computer Science and Software Engineering, 4.2 (2014).
49. Manikandan, S. G. and Ravi, S., "Big Data Analysis Using Apache Hadoop", in IT Convergence and Security (ICITCS), 2014 International Conference on, 28 Oct 2014, pp. 1-4. IEEE.
50. Gohil, P., Garg, D., and Panchal, B., "A performance analysis of MapReduce applications on big data in cloud based Hadoop", in Information Communication and Embedded Systems (ICICES), 2014 International Conference on, 27 Feb 2014, pp. 1-6. IEEE.
51. Nandimath, J. et al., "Big data analysis using Apache Hadoop", in Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pp. 700-703. IEEE, 2013.
52. Jacob, Jobby P., and Anirban Basu, "Performance Analysis of Hadoop Map Reduce on Eucalyptus Private Cloud", International Journal of Computer Applications, 79.17 (2013).
53. Iordache, Anca, et al., "Resilin: Elastic MapReduce over multiple clouds", Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. IEEE, 2013.
54. Mittal, Ruchi, and Ruhi Bagga, "Performance Analysis of Multi-Node Hadoop Clusters using Amazon EC2 Instances", International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Index Copernicus Value (2013).
55. Conejero, Javier, et al., "Scaling archived social media data analysis using a Hadoop cloud", IEEE 6th International Conference on Cloud Computing (CLOUD). IEEE, 2013.
56. Daneshyar, Samira, and Majid Razmjoo, "Large-scale data processing using MapReduce in cloud computing environment", International Journal on Web Service Computing, 3.4 (2012): 1.
57. Dittrich, J. and Quiané-Ruiz, J.-A., "Efficient big data processing in Hadoop MapReduce", Proceedings of the VLDB Endowment, 5(12):2014-2015, August 2012.
58. Lee, K. H. et al., "Parallel data processing with MapReduce: a survey", ACM SIGMOD Record, 40(4):11-20, January 2012.
59. Singh, S. and Singh, N., "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology, Oct 19-20, Mumbai, India.
60. Tang, B., Moca, M., Chevalier, S., He, H., and Fedak, G., "Towards MapReduce for desktop grid computing", in P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2010 International Conference on, 4 Nov 2010, pp. 193-200. IEEE.