Research on Web Data Mining
Mrs. Sunita S. Sane, Veermata Jijabai Technological Institute, Mumbai. Email: sssane@vjti.org.in
Mrs. Archana A. Shirke, Veermata Jijabai Technological Institute, Mumbai. Email: archanashirke25@gmail.com
ABSTRACT
Web data mining is the mining of Web data. Web Mining aims to discover useful information or knowledge from Web hyperlink structure, page content and usage data. Although Web Mining uses many Data Mining techniques, it is not purely an application of traditional Data Mining due to the heterogeneity and semi-structured nature of Web data.
Keywords
Web Data Mining, Web Mining, Data Mining, Web Content Mining, Web Usage Mining, Web Structure Mining
1. INTRODUCTION
The heterogeneity and the lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and management of Web-based information difficult. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, as well as to extend database and Data Mining techniques to provide a higher level of organization for semi-structured data available on the Web.
1.1 INTRODUCTION TO DATA MINING
Data Mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. [13]. The patterns must be valid, potentially useful and understandable. Data Mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, Data Mining is the process of finding correlations or patterns among dozens of fields in large relational databases. There are many Data Mining tasks; some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule Mining and sequential pattern Mining [15].
Classification: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clustering: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Association rule Mining: Data can be mined to identify associations. The beer-diaper example is an example of associative Mining.
Sequential pattern Mining: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data Mining consists of five major elements [8]:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
1.2 INTRODUCTION TO WEB MINING
The Web Mining research is a converging research area from several research communities, such as the database, Information Retrieval and Artificial Intelligence research communities [1]. It has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities [2].
The huge amount of information available on the World Wide Web has led to the mining of the Web. Web Mining can thus be defined as the use of Data Mining techniques to automatically discover and extract information from Web documents and services [2]. Web Mining is a cross area of Data Mining, Information Retrieval, Information Extraction and Artificial Intelligence. The Web is huge, diverse and dynamic, and thus raises scalability and multimedia issues [1]. Information users can encounter the following problems when interacting with the Web: finding relevant information, creating new knowledge out of information available on the Web, personalization of information, and learning about consumers or individual users. Web Mining techniques can be used to solve these information overload problems directly or indirectly. However, techniques from other research areas, such as databases (DB), Information Retrieval (IR), Natural Language Processing (NLP) and the Web document community, can also be used [2].
Web Mining can be decomposed into the following subtasks [2]:
1. Resource finding: the task of retrieving intended Web documents.
2. Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.
Resource finding is the process of retrieving data, either online or offline, from the text sources available on the Web, such as electronic newsletters, electronic newswires, newsgroups, the text contents of HTML documents obtained by removing HTML tags, and also the manual selection of Web resources. The information selection and pre-processing step is any kind of transformation process applied to the original data retrieved in the IR process. These transformations can be either a kind of pre-processing such as removing stop words and stemming, or pre-processing aimed at obtaining the desired representation, such as finding phrases in the training corpus or transforming the representation to relational or first-order logic form. Machine Learning or Data Mining techniques are typically used for generalization. Humans play an important role in the information and knowledge discovery process on the Web since the Web is an interactive medium. Thus query-triggered knowledge discovery is as important as the more automatic data-triggered knowledge discovery.
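To make the selection and pre-processing step concrete, here is a minimal Python sketch of the transformations mentioned above: tokenization, stop-word removal and stemming. The tiny stop-word list and the crude suffix-stripping stemmer are illustrative placeholders, not the components of any particular system.

    import re

    # Illustrative stop-word list; real systems use much larger ones.
    STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

    def crude_stem(token):
        """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(text):
        """Tokenize, drop stop words, and stem; returns a list of index terms."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

    print(preprocess("Mining the contents of HTML documents after removing HTML tags"))
    # -> ['min', 'content', 'html', 'document', 'after', 'remov', 'html', 'tag']

A real pipeline would use a proper stemmer (e.g., Porter's algorithm) and a curated stop-word list, but the shape of the transformation is the same.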
Thus Web Mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from Web data. It is an extension of the standard process of knowledge discovery in databases (KDD) [2]. Web Mining is often associated with IR or IE. However, Web Mining, or information discovery on the Web, is not the same as IR or IE. IR is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as possible. IR has the primary goal of indexing text and searching for useful documents in a collection, and nowadays research in IR includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. One task that can be considered an instance of Web Mining is Web document classification or categorization, which could be used for indexing. Viewed in this respect, Web Mining is part of the (Web) IR process [2].
IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. IE aims to extract relevant facts from the documents, while IR aims to select relevant documents. IE is interested in the structure or representation of a document, while IR views the text in a document just as a bag of unordered words. Some IE systems use Machine Learning or Data Mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically. Within this view, Web Mining is part of the (Web) IE process [2].
The Web Mining process is similar to the Data Mining process. The difference usually lies in the data collection. In traditional Data Mining, the data is often already collected and stored in a data warehouse. For Web Mining, data collection can be a substantial task, especially for Web Structure and Content Mining, which involves crawling a large number of target Web pages. The classification of retrieval and mining tasks for different types of data is given below [1].
Purpose                                            Any data          Textual data             Web-related data
Retrieving known data or information
efficiently and effectively                        Data Retrieval    Information Retrieval    Web Document Retrieval
Finding previously unknown patterns
or knowledge                                       Data Mining       Text Mining              Web Mining

Figure 1: Classification of retrieval and mining tasks
2. WEB MINING CATEGORIES
Web Mining tasks can be categorized into three types [2]:
1. Web Content Mining (WCM) - Web content
Mining refers to the discovery of useful information
from Web contents, including text, image, audio,
video, etc. Research in Web content Mining
encompasses resource discovery from the Web,
document categorization and clustering, and
information extraction from Web pages.
2. Web Structure Mining (WSM) - Web structure
Mining studies the Web’s hyperlink structure. It
usually involves analysis of the in-links and out-links
of a Web page, and it has been used for search engine
result ranking.
3. Web Usage Mining (WUM) - Web usage Mining
focuses on analyzing search logs or other activity logs
to find interesting patterns. One of the main
applications of Web usage Mining is to learn user
profiles.
2.1 Web Content Mining
Web Content Mining is related to but different from Data Mining and Text Mining. It is related to Data Mining because many Data Mining techniques can be applied in Web Content Mining. It is related to Text Mining because much of the Web content is text. However, it is also quite different from Data Mining because Web data are mainly semi-structured and/or unstructured, while Data Mining deals primarily with structured data. Web Content Mining is also different from Text Mining because of the semi-structured nature of the Web, while Text Mining focuses on unstructured texts. Web Content Mining thus requires creative applications of Data Mining and/or Text Mining techniques as well as its own unique approaches. In the past few years there has been a rapid expansion of activities in the Web Content Mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such Mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. This section examines some important Web Content Mining problems and discusses existing techniques for solving them [1].
2.2 Web Structure Mining
Web Structure Mining is the process of using graph theory to analyse the node and connection structure of a Web site [1]. According to the type of Web structural data, Web Structure Mining can be divided into two kinds. The first kind is extracting patterns from hyperlinks on the Web, where a hyperlink is a structural component that connects a Web page to a different location. The other kind is mining the document structure: using the tree-like structure of HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags to analyse and describe the structure within a Web page.
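As a rough illustration of this second kind of Web Structure Mining, the following sketch uses Python's standard html.parser module to recover the tree-like tag structure of a page. It assumes well-nested markup and ignores void elements; it is a sketch of the idea, not production parsing code.

    from html.parser import HTMLParser

    class TagTreeBuilder(HTMLParser):
        """Builds a nested (tag, children) tree from HTML start/end tags."""
        def __init__(self):
            super().__init__()
            self.root = ("document", [])
            self.stack = [self.root]           # current path from root to open tag

        def handle_starttag(self, tag, attrs):
            node = (tag, [])
            self.stack[-1][1].append(node)     # attach to the enclosing element
            self.stack.append(node)

        def handle_endtag(self, tag):
            if len(self.stack) > 1:            # naive: assumes well-nested markup
                self.stack.pop()

    builder = TagTreeBuilder()
    builder.feed("<html><body><h1>Title</h1><p>A <a href='x'>link</a></p></body></html>")
    print(builder.root)
    # ('document', [('html', [('body', [('h1', []), ('p', [('a', [])])])])])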
2.3 Web Usage Mining
Web Usage Mining is the application of Data Mining techniques to analyse and discover interesting patterns in users' usage data on the Web. The usage data records a user's behaviour as the user browses or makes transactions on a Web site. Web Usage Mining thus involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is usually generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts. Analyzing such data can help these organizations to determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things.
Analysis of server access logs and user registration
data can also provide valuable information on how to
structure a Web site in order to create a more effective
presence for the organization. Using intranet
technologies in organizations, such analysis can shed
light on more effective management of workgroup
communication and organizational infrastructure.
Finally, for organizations that sell advertising on the
World Wide Web, analyzing user access patterns helps
in targeting ads to specific groups of users.
WUM can be decomposed into the following subtasks:
2.3.1 Data Pre-processing for Mining
It is necessary to perform data preparation to convert the raw data for further processing. It has the following subsections:
• Content Preprocessing: Content preprocessing is the process of converting text, image, scripts and other files into the forms that can be used by the usage Mining.
• Structure Preprocessing: The structure of a Web site is formed by the hyperlinks between page views. The structure preprocessing can be treated similarly to the content preprocessing. However, each server session may have to construct a different site structure than others.
• Usage Preprocessing: The inputs of the preprocessing phase may include the Web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications; a sessionization sketch follows this list.
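The sessionization sketch below illustrates usage preprocessing under simplifying assumptions: the log is already reduced to (user, timestamp, URL) records rather than raw Common Log Format lines, and a 30-minute inactivity timeout, a common heuristic, splits each user's requests into sessions.

    from collections import defaultdict

    TIMEOUT = 30 * 60  # 30-minute inactivity timeout, in seconds

    # Hypothetical simplified log records: (user_id, unix_time, url).
    log = [
        ("u1", 1000, "/home"), ("u1", 1100, "/products"),
        ("u1", 5000, "/home"),            # gap > timeout: starts a new session
        ("u2", 1050, "/home"), ("u2", 1200, "/cart"),
    ]

    def sessionize(records):
        """Group each user's requests into sessions split on inactivity gaps."""
        by_user = defaultdict(list)
        for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
            by_user[user].append((ts, url))
        sessions = []
        for user, hits in by_user.items():
            current = [hits[0]]
            for prev, cur in zip(hits, hits[1:]):
                if cur[0] - prev[0] > TIMEOUT:
                    sessions.append((user, [u for _, u in current]))
                    current = []
                current.append(cur)
            sessions.append((user, [u for _, u in current]))
        return sessions

    for user, pages in sessionize(log):
        print(user, pages)
    # u1 ['/home', '/products']
    # u1 ['/home']
    # u2 ['/home', '/cart']

Sorting by user and timestamp first means each user's requests are scanned once; the timeout value is the main tuning knob.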
2.3.2 Pattern Discovery
This is the key component of Web Mining. Pattern discovery draws on algorithms and techniques from several research areas, such as Data Mining, Machine Learning, Statistics, and Pattern Recognition. It has the following subsections:
• Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file. By analyzing the statistical information contained in the periodic Web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
• Association Rules: In the Web domain, the pages which are most often referenced together can be put in one single server session by applying association rule generation. Association rule mining techniques can be used to discover unordered correlations between items found in a database of transactions; a sketch follows this list.
• Clustering: Clustering analysis is a technique to group together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies.
• Classification: Classification is the technique of mapping a data item into one of several predefined classes. The classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbour classifiers, Support Vector Machines, etc.
• Sequential Pattern: This technique intends to find inter-session patterns, such that a set of items follows the presence of another in a time-ordered set of sessions or episodes. Sequential patterns also include some other types of temporal analysis such as trend analysis, change point detection, or similarity analysis.
• Dependency Modeling: The goal of this technique is to establish a model that is able to represent significant dependencies among the various variables in the Web domain. The modeling technique provides a theoretical framework for analyzing the behavior of users, and is potentially useful for predicting future Web resource consumption.
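To illustrate the association-rule idea on usage data, here is a minimal sketch that counts page pairs co-occurring in sessions and keeps those above a support threshold. A real system would use a full Apriori or FP-growth implementation; the sessions and threshold here are invented for the example.

    from itertools import combinations
    from collections import Counter

    # Hypothetical server sessions: each is the set of pages one visitor viewed.
    sessions = [
        {"/home", "/products", "/cart"},
        {"/home", "/products"},
        {"/home", "/about"},
        {"/products", "/cart"},
    ]

    MIN_SUPPORT = 0.5  # a pair must appear in at least half the sessions

    pair_counts = Counter()
    for s in sessions:
        for pair in combinations(sorted(s), 2):
            pair_counts[pair] += 1

    for pair, count in pair_counts.items():
        support = count / len(sessions)
        if support >= MIN_SUPPORT:
            print(pair, f"support={support:.2f}")
    # ('/cart', '/products') support=0.50
    # ('/home', '/products') support=0.50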
2.3.3 Pattern Analysis
Pattern analysis is the final stage of the whole Web Usage Mining process. The goal of this stage is to eliminate irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. The output of Web Mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. There are two common approaches to pattern analysis. One is to use a knowledge query mechanism such as SQL (see the sketch below), while the other is to construct a multi-dimensional data cube before performing OLAP operations. Both methods assume that the output of the previous phase has been structured.
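As a sketch of the SQL-style query mechanism, the following stores discovered rules in an in-memory SQLite table and filters them by support and confidence. The table layout, the example rules and the thresholds are all invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rules (lhs TEXT, rhs TEXT, support REAL, confidence REAL)")
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?, ?, ?)",
        [
            ("/home", "/products", 0.50, 0.67),   # invented example rules
            ("/cart", "/checkout", 0.10, 0.90),
            ("/home", "/about", 0.25, 0.33),
        ],
    )

    # Keep only rules strong enough to be interesting to an analyst.
    for row in conn.execute(
        "SELECT lhs, rhs, confidence FROM rules "
        "WHERE support >= 0.2 AND confidence >= 0.5 ORDER BY confidence DESC"
    ):
        print(row)
    # ('/home', '/products', 0.67)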
3. BASIC MODELS
In its full generality, a model must build machine representations of world knowledge and therefore involves a natural-language grammar for text, hypertext, and semi-structured data, which will be useful for our learning applications. We discuss some such models in this section [8].
3.1 Models for structured text
1. Boolean Model: The simplest statistical model is the Boolean model. It uses the notion of exactly matching documents to the user query. Both the query and the retrieval are based on Boolean algebra.
2. Vector Space Model: A document in the vector space model is represented as a weight vector, in which each component weight is computed based on some variation of the TF or TF-IDF scheme. Documents are tokenized using simple syntactic rules (such as white-space delimiters in English) and tokens are stemmed to canonical form (e.g., 'reading' to 'read'; 'is', 'was', 'are' to 'be'). Each canonical token represents an axis in a Euclidean space (see the sketch after this list).
3. Statistical Language Model: This model is based on probability and has foundations in statistical theory. It first estimates a language model for each document and then ranks documents by the likelihood of the query given the language model.
4. Probabilistic Model: This model is used for document generation, with the disclaimer that such models have no bearing on grammar and semantic coherence.
In spite of minor variations, all these models regard documents as multisets of terms, without paying attention to the ordering between terms. Therefore they are collectively called bag-of-words models.
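A minimal sketch of the vector space model follows: each document becomes a TF-IDF weight vector and is compared to the query by cosine similarity. The three toy documents are invented, and real systems would add stemming, smoothing and much larger vocabularies.

    import math
    from collections import Counter

    docs = [
        "web mining discovers knowledge from web data",
        "data mining finds patterns in databases",
        "hyperlink structure helps rank web pages",
    ]
    tokenized = [d.split() for d in docs]

    def idf(term):
        """Inverse document frequency with the usual log weighting."""
        df = sum(term in doc for doc in tokenized)
        return math.log(len(tokenized) / df) if df else 0.0

    def tfidf_vector(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf(t) for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    query = tfidf_vector("web mining".split())
    for text, tokens in zip(docs, tokenized):
        print(f"{cosine(query, tfidf_vector(tokens)):.3f}", text)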
3.2 Models for semi-structured data
Semi-structured data is a point of convergence for the Web and database communities: the former deals with documents, the latter with data. The form of that data is evolving from rigidly structured relational tables with numbers and strings towards the natural representation of complex real-world objects like books, papers, movies, jet engine components, and chip designs, without sending the application writer into contortions.
Object Exchange Model (OEM): In OEM, data is in the form of atomic or compound objects: atomic objects may be integers or strings; compound objects refer to other objects through labeled edges. HTML is a special case of such 'intra-document' structure.
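The sketch below models OEM-style data in Python: atomic objects are plain integers or strings, and compound objects are dictionaries whose keys act as labeled edges to other objects. The book and author objects are invented examples.

    # Compound objects are dicts (labeled edges); atomic objects are ints/strings.
    author = {"name": "A. Author", "affiliation": "VJTI"}
    book = {
        "title": "Mining the Web",
        "year": 2003,
        "written_by": author,   # labeled edge to another compound object
    }

    def atoms(obj, path="root"):
        """Walk an OEM-style graph and yield (path, atomic value) pairs."""
        if isinstance(obj, dict):
            for label, child in obj.items():
                yield from atoms(child, f"{path}.{label}")
        else:
            yield path, obj

    for path, value in atoms(book):
        print(path, "=", value)
    # root.title = Mining the Web
    # root.year = 2003
    # root.written_by.name = A. Author
    # root.written_by.affiliation = VJTI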
Such irregular structures naturally encourage Data Mining techniques from the domain of 'standard' structured warehouses to be applied, adapted, and extended to discover useful patterns from semi-structured sources as well.
4. LINK ANALYSIS
In recent years, Web link structure has been
widely used to infer important information about Web
pages. Web structure mining has been largely
influenced by research in social network analysis and
citation analysis [8]. Citations (linkages) among Web
pages are usually indicators of high relevance or good
quality. We use the term in-links to indicate the
hyperlinks pointing to a page and the term out-links to
indicate the hyperlinks found in a page. Usually, the
larger the number of in-links, the more useful a page is
considered to be. The rationale is that a page
referenced by many people is likely to be more
important than a page that is seldom referenced. As in
citation analysis, an often cited article is presumed to
be better than one that is never cited. In addition, it is
reasonable to give a link from an authoritative source
(such as Yahoo!) a higher weight than a link from an
unimportant personal home page. By analyzing the
pages containing a URL, we can also obtain the
anchor text that describes it. Anchor text shows how
other Web page authors annotate a page and can be
useful in predicting the content of the target page.
Several algorithms have been developed to address
this issue.
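PageRank is probably the best-known of these algorithms; below is a minimal power-iteration sketch on an invented four-page link graph. It assumes every page has at least one out-link and uses a fixed iteration count, omitting the dangling-node handling and convergence tests that real implementations need.

    DAMPING = 0.85  # standard damping factor

    # Invented link graph: page -> pages it links to (no dangling pages here).
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(50):  # fixed iteration count instead of a convergence test
        new = {page: (1 - DAMPING) / len(links) for page in links}
        for page, outs in links.items():
            share = DAMPING * ranks[page] / len(outs)
            for target in outs:
                new[target] += share
        ranks = new

    for page, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(page, round(r, 3))
    # C gets the highest score: it has the most in-links.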
5. THE SEMANTIC WEB
The Semantic Web is a term coined by Berners-Lee [17] for the vision of making the information on the Web machine-processable. The basic idea is to enrich Web pages with machine-processable knowledge that is represented in the form of ontologies [19]. Ontologies define certain types of objects and the relations between them. As ontologies are readily accessible (like other Web documents), a computer program can use them to draw inferences about the information provided on Web pages. One of the research challenges in this area is to annotate the information that is currently available on the Web with semantic tags. Typically, techniques from text classification, hypertext classification and information extraction are used for that purpose. A landmark application in this area was the WebKB project at Carnegie Mellon University (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam & Slattery, 2000). Its goal was to assign Web pages or parts of Web pages to entities in an ontology. A simple test ontology modeled knowledge about computer science departments: there are entities like students (graduate and undergraduate), faculty members (professors, researchers, lecturers, post-docs, ...), courses, projects, etc., and relations between these entities, such as "courses are taught by one lecturer and attended by several students" or "every graduate student is advised by a professor". Many applications could be imagined for such an ontology. For example, it could enhance the capabilities of search engines by enabling them to answer queries like "Who teaches course X at university Y?" or "How many students are in department Z?", or serve as a backbone for Web catalogues. A description of the first prototype system can be found in Craven et al. (2000). Semantic Web Mining has emerged as a research field that focuses on the interactions of Web Mining and the Semantic Web.
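As a toy illustration of how a program might use such an ontology, the sketch below encodes a few WebKB-style facts as subject-predicate-object triples and answers queries of the kind mentioned above. The entities and relations are invented.

    # Invented subject-predicate-object triples in the spirit of the WebKB ontology.
    triples = [
        ("course_x", "taught_by", "prof_smith"),
        ("course_x", "attended_by", "student_a"),
        ("course_x", "attended_by", "student_b"),
        ("student_a", "advised_by", "prof_smith"),
    ]

    def query(subject=None, predicate=None, obj=None):
        """Return triples matching the given pattern; None acts as a wildcard."""
        return [
            (s, p, o) for s, p, o in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)
        ]

    # "Who teaches course X?"
    print(query(subject="course_x", predicate="taught_by"))
    # [('course_x', 'taught_by', 'prof_smith')]

    # "How many students attend course X?"
    print(len(query(subject="course_x", predicate="attended_by")))  # 2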
6. WEB CRAWLING
Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches [13]. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses. A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
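A minimal breadth-first crawler sketch follows, showing the seeds and crawl frontier described above. It uses Python's standard urllib with a crude regular-expression link extractor, caps the number of pages, and omits the politeness, robots.txt and error handling a real crawler must have; the seed URL is a placeholder.

    import re
    from collections import deque
    from urllib.request import urlopen
    from urllib.parse import urljoin

    def crawl(seeds, max_pages=10):
        """Breadth-first crawl: visit frontier URLs, harvest links, repeat."""
        frontier = deque(seeds)   # the crawl frontier
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # a real crawler would log and retry
            visited.add(url)
            # Crude href extractor; real crawlers use a proper HTML parser.
            for link in re.findall(r'href="([^"#]+)"', html):
                frontier.append(urljoin(url, link))
        return visited

    # Placeholder seed; a real crawl would also honour robots.txt and rate limits.
    print(crawl(["https://example.com/"]))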
There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation,
which combine to produce a wide variety of possible crawlable URLs. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.
As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.
The behavior of a Web crawler is the outcome of a
combination of policies:
• A selection policy that states which pages to
download.
• A re-visit policy that states when to check
for changes to the pages.
• A politeness policy that states how to avoid
overloading Websites.
• A parallelization policy that states how to
coordinate distributed Web crawlers.
The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers. The main problem in focused crawling is that, in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page.
7. WEB DATA MINING AND THE AGENT PARADIGM
Web Mining is often viewed from, or implemented within, an agent paradigm. Thus, Web Mining has a close relationship with software agents or intelligent agents. Indeed, some of these agents perform Data Mining tasks to achieve their goals. According to Green [4], there are three sub-categories of software agents: User Interface Agents, Distributed Agents, and Mobile Agents. User Interface agents that can be classified into the Web Mining agent category are information retrieval agents, information filtering agents and personal assistant agents. Distributed agent technology is concerned with problem solving by a group of agents, and the relevant agents in this category are distributed agents for knowledge discovery or Data Mining. Delgado [5] classifies user interface agents by the underlying information filtering technology into content-based filters, event-based filters and hybrid filters. In event-based filtering, the system tracks and follows the events that are inferred from the surfing habits of people on the Web. Some examples of such events are saving a URL into a bookmark folder, mouse clicks and scrolls, link traversal behavior, etc.
8. CONCLUSIONS
Web Data Mining is a relatively new field, and researchers have only begun to venture into it, especially with Text Mining techniques. The key component of Web Mining is the Mining process itself. A lot of work still remains to be done in adapting known Mining techniques as well as developing new ones.
9. REFERENCES
[1] Raymond Kosala, Hendrik Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, ACM SIGKDD, July 2000.
[2] Wang Bin, Liu Zhijing, "Web Mining Research". In Proceedings of the 5th IEEE International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'03), 2003.
[3] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and pattern discovery on the World Wide Web". In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[4] S. Green, L. Hurst, B. Nangle, P. Cunningham, F. Somers, and R. Evans, "Software agents: A review", Technical Report TCD-CS-1997-06, Trinity College, University of Dublin, 1997.
[5] J. A. Delgado, "Agent-Based Information Filtering and Recommender Systems on the Internet", PhD thesis, Dept. of Intelligence Computer Science, Nagoya Institute of Technology, March 2000.
[6] Web site: http://ww.celi.it
[7] Soumen Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann, 2003.
[8] Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data", Springer, 2007.
[9] Hsinchun Chen and Michael Chau, "Web Mining: Machine Learning for Web Applications", Annual Review of Information Science and Technology, University of Arizona.
[10] Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(1), 1-11.
[11] Chakrabarti, S., Dom, B., & Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 307-318.
[12] Johannes Fürnkranz, "Web Mining" chapter, TU Darmstadt, Knowledge Engineering Group.
[13] Web site: www.wikipedia.com
[14] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, 2003.
[15] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, second edition, 2006.
[16] Tom Mitchell, "Machine Learning", McGraw-Hill, 1997.
[17] Berners-Lee, Hendler & Lassila, "The Semantic Web", 2001.
[18] Search engine: http://www.google.com
[19] Dieter Fensel, "Ontology versioning on the semantic web", 2001.