References - B. Berardino

Berardino: Web Mining
1
Bailey Berardino
February 19, 2012
ILS 655: Digital Libraries
DL Technology Exploration
Web Mining
“The web is perhaps the single largest data source in the world” (Liu, 2005, p. 2). The
World Wide Web is full of data, or raw information, available for use. Web mining seeks to
extract that information and gain useful knowledge from it. There are many different definitions
of web mining. Cooley and Mobasher (1997) defined it as “the discovery and analysis of useful
information from the World Wide Web” (p. 1). Kosala and Blockeel state that web mining “refers
to the overall process of discovering potentially useful and previously unknown information or
knowledge from web data” (2000, p. 3).
So why mine data from the web? As stated above, the Web holds an unprecedented amount
of data. According to Liu (2005), the size of the Web leads to both opportunities and challenges.
Information on the Web is easily accessed, diverse, and covers almost any topic. It comes in all
forms and types, and much of the Web is linked together. Much of the Web is also redundant.
There are different levels to the Web (Surface and Deep), and it is dynamic, always changing.
Lastly, it is a “virtual society,” as much about interactions as anything else (Liu, 2005). Web
mining seeks to extract the useful information, or data, and analyze it.
Web Mining and Data Mining
There is some confusion over the difference between web mining and data mining. Web
mining is different from both data mining and text mining. Data mining, also known as
Knowledge Discovery in Databases (KDD), is the discovery of useful patterns or knowledge from
data sources (Liu, 2007). There are similarities because many data mining techniques are used in
web mining, and much of the web is text, so it is related to text mining as well. However, it does
differ because web data is mainly semi-structured or unstructured, while data mining mostly
deals with structured data. According to Liu, “Web mining aims to discover useful information
or knowledge from the Web hyperlink structure, page content, and usage data. Although Web
mining uses many data mining techniques, as mentioned above, it is not purely an application of
traditional data mining due to the heterogeneity and semi-structured or unstructured nature of
the Web data” (2007, p. 6-7).
Both data mining and web mining usually consist of a three-step process.
1. Pre-processing: Cleaning up the raw data to make it easier to mine for knowledge.
2. Mining process: Feeding the data to data mining algorithms to produce patterns or
knowledge. The quality of an algorithm can be measured both by how effective it is and
how efficient it is.
3. Post-processing/analysis: Identifying useful patterns (if any).
Although the three steps are seen in both types of mining, they can be considerably different in
application.
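The three steps above can be sketched in a few lines of Python. This is only a minimal
illustration, not a method taken from the sources cited here: the function names (preprocess,
mine, postprocess), the sample page requests, and the min_count threshold are all assumptions
made for the example, and simple counting stands in for a real mining algorithm.

```python
from collections import Counter


def preprocess(raw_records):
    """Step 1: clean the raw data (trim whitespace, lowercase, drop empty rows)."""
    cleaned = (record.strip().lower() for record in raw_records)
    return [record for record in cleaned if record]


def mine(records):
    """Step 2: a deliberately simple 'mining algorithm' that counts each item."""
    return Counter(records)


def postprocess(counts, min_count=2):
    """Step 3: keep only patterns seen at least min_count times (assumed threshold)."""
    return {item: n for item, n in counts.items() if n >= min_count}


# Hypothetical raw page requests, including noise that pre-processing removes.
raw = ["  /Home ", "/products", "", "/home", "/PRODUCTS", "/about", "/home"]
patterns = postprocess(mine(preprocess(raw)))
print(patterns)
```

In a real project each step would be far more involved, but the division of labor stays the
same: cleaning first, then the algorithm, then a filter that decides which results are useful.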
There are three types of web mining: web usage mining, web structure mining and web
content mining. Web usage mining focuses on the discovery of user access patterns from
client-server transactions. Web structure mining gains useful knowledge from the structure of
websites and hyperlinks. Lastly, web content mining extracts useful information from a web
page’s contents. The following chart from Kosala and Blockeel (2000, p. 5) looks at the
differences between the three.
[Chart from Kosala and Blockeel (2000, p. 5) comparing the three types of web mining; image not available.]
Web Content Mining
Web content mining can be defined as the process of extracting useful information from
the contents of web documents. According to Srivastava (n.d.), “Content data corresponds to the
collection of facts a Web page was designed to convey to the users. It may consist of text,
images, audio, video, or structured records such as lists and tables” (slide 32). This field also
involves using technologies such as Information Retrieval (IR) and Natural Language Processing
(NLP). Those interested in content mining include researchers, people involved in e-commerce,
government officials and many more.
Web content mining has many applications and implications. It can help categorize web
documents and find web pages across different servers that are similar. In the world of
e-commerce it could help companies better understand the needs of their customers. It could lead
to targeted marketing and more personalized service. The government uses web content mining to
track and classify possible criminal and terrorist threats. One of the larger implications of web
content mining is the invasion of privacy that can come with it, which is discussed later.
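One common way to find similar web pages is to compare the words they contain. The sketch
below is an illustration only and is not drawn from the sources above: it assumes the page text
has already been extracted, uses plain word counts with cosine similarity, and the page names
and texts are invented for the example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Turn page text into a bag of lowercase words with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = unrelated, 1 = same mix)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, pages):
    """Rank every other page by its similarity to the named page, best first."""
    target = term_counts(pages[name])
    scores = [
        (other, cosine_similarity(target, term_counts(text)))
        for other, text in pages.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical pages hosted on different servers.
pages = {
    "shop-a": "cheap running shoes and trail running gear",
    "shop-b": "running shoes for trail and road",
    "news": "election results and local weather",
}
for other, score in most_similar("shop-a", pages):
    print(other, round(score, 2))
```

Real systems usually weight rare words more heavily and ignore very common ones, but the
comparison of word vectors shown here is the core of the idea.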
Web Structure Mining
Web structure mining is the process of discovering structure information from the web. It can
be performed at the document level (intra-page) or at the hyperlink level (inter-page), which is
called hyperlink analysis. “The structure of a typical web graph consists of web pages as nodes,
and hyperlinks as edges connecting between two relevant pages.” One of the key ideas is ranking
a web page according to the rank of the web pages pointing at it (Srivastava, n.d.).
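The idea of ranking a page by the rank of the pages that point to it is the basis of algorithms
such as PageRank. The following is a minimal sketch of that style of calculation, not the exact
method described in the sources: the four-page link graph is invented, and the damping factor of
0.85 and the fixed 50 iterations are conventional but assumed values.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively share each page's rank among the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web graph: page -> pages it links to.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ranks highest because three pages point to it, and page A ranks above B and D because
its single inbound link comes from the highly ranked page C.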
There are also many applications for web structure mining. It can be useful in gathering
information on the quality of a web page based on its authority and ranking. It helps in studying
interesting web structures and graph patterns such as co-citation or social choice. Other
applications include web page classification, finding related pages and detection of duplicate
pages (Srivastava, n.d.).
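Co-citation, mentioned above, means that two pages are linked to by the same third page; pages
that are often co-cited are likely to be related. A small sketch of counting co-citations
follows. The link data and the function name co_citation_counts are hypothetical and serve only
to illustrate the pattern.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(links):
    """Count, for each pair of pages, how many pages link to both of them."""
    counts = Counter()
    for targets in links.values():
        # Sorting gives each pair one canonical order, e.g. ("A", "B").
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing pages and the pages they link to.
links = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["B", "C"],
}
for pair, count in co_citation_counts(links).most_common():
    print(pair, count)
```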
Web Usage Mining
Srivastava defines a web site as a collection of inter-related files on one or more web servers.
Web usage mining is “the discovery of meaningful patterns from data generated by client-server
transactions on one or more web locations” (slide 64). The data that is studied can come from a
number of sources. Servers can automatically generate data such as access logs, agent logs, and
client-side cookies. Other sources are user profiles or metadata such as page attributes, content
attributes and usage data.
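To make the idea of mining a server access log concrete, the sketch below reads a few log lines
and groups each visitor's requests into sessions. It is an illustration under stated
assumptions: the lines follow the widely used Common Log Format, visitors are identified by IP
address alone, a gap of more than 30 minutes starts a new session, and the sample entries are
invented.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (host, timestamp, path) for one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["host"], when, match["path"]


def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group each host's requests into sessions split by long idle gaps."""
    sessions = defaultdict(list)  # host -> list of sessions (each a list of paths)
    last_seen = {}
    for host, when, path in sorted(entries, key=lambda entry: entry[1]):
        if host not in last_seen or when - last_seen[host] > timeout:
            sessions[host].append([])
        sessions[host][-1].append(path)
        last_seen[host] = when
    return dict(sessions)


# Hypothetical access log entries.
sample_log = [
    '10.0.0.1 - - [19/Feb/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [19/Feb/2012:10:05:00 +0000] "GET /products.html HTTP/1.1" 200 2310',
    '10.0.0.2 - - [19/Feb/2012:10:06:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [19/Feb/2012:11:30:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    'this line is malformed and will be skipped',
]
entries = [entry for entry in map(parse_line, sample_log) if entry is not None]
sessions = sessionize(entries)
print(sessions)
```

Parsing and sessionizing of this kind belongs to the pre-processing step; the resulting
sessions are what a usage mining algorithm would then analyze.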
Many of the implications of web usage mining are in the field of e-commerce. It allows a
company to determine the lifetime value of a client. It also helps in designing cross-marketing
strategies and evaluating promotional campaigns. It allows the company to target ads and
coupons at user groups based on their access patterns. It can also predict user behavior patterns
and present more dynamic information to users based on their interests and profiles. Other
implications include a more effective and efficient web presence. It allows the creator to
determine the best way to structure the site and to pre-fetch files likely to be accessed.
Intra-organization applications (within an organization) include enhancing work group
management and communication. It can also evaluate the effectiveness of the intranet and
identify possible structural needs.
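As an illustration of predicting user behavior in order to pre-fetch files, the sketch below
counts which page most often follows each page in past sessions and uses that as its
prediction. This simple first-order model is an assumption chosen for clarity, not a technique
attributed to the sources, and the session data is invented.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Return the page most often visited right after `page`, or None if unknown."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical click paths from earlier visitors.
past_sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
model = build_transitions(past_sessions)
print(predict_next(model, "/home"))      # candidate page to pre-fetch after /home
print(predict_next(model, "/products"))  # candidate page to pre-fetch after /products
```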
Techniques
There are a number of different techniques for web mining, but the most popular are:
1) Classification/Supervised Learning: This is the most frequent. Classes or categories are
pre-defined. Documents are assigned to one or more existing categories.
2) Clustering/Unsupervised Learning: There are no pre-defined categories; algorithms put
data into groups or clusters.
3) Association rules: Finds sets of data items that frequently occur together.
4) Sequential patterns: Finds sets of data items that frequently occur together in some
sequence.
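Association rules, the third technique in the list above, can be illustrated with a short
example. The sketch below finds pairs of pages that are often visited in the same session and
measures how strongly one page implies the other. The visit data, the 0.5 minimum support, and
the function names are assumptions for illustration; practical tools use more scalable
algorithms and handle item sets larger than pairs.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=0.5):
    """Return item pairs whose support (share of transactions) meets min_support."""
    total = len(transactions)
    pair_counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the share that also contain consequent."""
    with_antecedent = [items for items in transactions if antecedent in items]
    if not with_antecedent:
        return 0.0
    return sum(consequent in items for items in with_antecedent) / len(with_antecedent)


# Hypothetical sets of pages visited together in one session.
visits = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
]
print(frequent_pairs(visits))
print(confidence(visits, "cart", "products"))  # rule: cart -> products
```

Here every session that includes the cart page also includes the products page, so the rule
"cart implies products" has full confidence even though the pair appears in only half of the
sessions.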
Future
The applications and implications of web mining are just beginning to emerge. Much like the
internet itself, it is a very dynamic field. Some look to web mining as an apparatus for behavior
experiments, as well as a way to study the evolution of the web and its patterns. Mining e-mails
for information is another avenue for targeted marketing and for social networking.
Personalization of the web is a big topic in the field. Mulvenna, Anand, and Buchner (2000) state
that “web mining provides the tools to analyze web log data in a user-centric manner” (p. 124).
They discuss adaptive internet sites that automatically improve their organization and
presentation as they learn from their visitors’ patterns. Web personalization can be defined as
“an action that adapts the information or services provided by a web site to the needs of a
particular user or set of users, taking advantage of the knowledge gained from the user’s
navigational behaviors and individual interests, in combination with the content and structure of
the website” (Eirinaki and Vazirgiannis, 2003, p. 1-2). There are many ways that web mining can
lead to value-added services for users.
Zaiane (2001) also looked at how web mining could improve web-based distance
learning. Much of the technology that is used to gather information for e-commerce can be
applied to the educational setting. At the time, little had been done to study learners’
patterns, and it seems this is an area that needs further research. However, web mining
techniques could be applied to better evaluate the learning process, and perhaps personalize it as
well.
Web mining also leads to very serious privacy concerns. Most people do not know how
much information is extracted about them and their internet usage. Web mining
can also lead to a number of threats, including identity theft, defamation, industrial espionage,
ransom (electronic), vandalism and market manipulation. Some of the steps that need to be taken
to address these concerns include more awareness and education, more regulations and laws,
and better technology for security and auditing (Srivastava, n.d.).
Web mining has the potential to be a great asset for both consumers and corporations. It
can also help with new fields of research and knowledge. It can help create new knowledge out
of old knowledge. But as with all new technology, there are serious ethical considerations in the
usage of these techniques that should be further discussed and evaluated.
References
Cooley, R., & Mobasher, B. (1997). Web mining: Information and pattern discovery in the world
wide web. Retrieved from http://maya.cs.depaul.edu/classes/ect584/papers/cms-tai.pdf
Eirinaki, M., & Vazirgiannis, M. (2003). Web mining for web personalization. ACM
Transactions on Internet Technology, 3(1), 1-27.
http://www.inf.ufsc.br/~lumar/datamining/p1-eirinaki.pdf
Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. Retrieved from
http://www.umiacs.umd.edu/~joseph/classes/enee752/Fall09/survey-2000.pdf
Liu, B. (2005). Web content mining [Tutorial]. The 14th International World Wide Web
Conference. Retrieved from http://www.cs.uic.edu/~liub/Web-Content-Mining-2.pdf
Liu, B. (2007). Web data mining: Exploring hyperlinks, contents, and usage data. New York:
Springer Berlin Heidelberg. Retrieved from http://www.mendeley.com/research/datamining-web/
Mulvenna, M., Anand, S., & Buchner, A. (2000). Personalization on the net using web mining.
Communications of the ACM, 43(8), 122-125.
http://0search.ebscohost.com.www.consuls.org/login.aspx?direct=true&db=lxh&AN=ISTA3501938&site=ehost-live
Srivastava, J. (n.d.). Web mining: Accomplishments and future directions [PowerPoint].
Retrieved from http://ignatius.atw.hu/mining.pdf
Zaiane, O. (2001). Web usage mining for a better web-based learning environment. Retrieved
from
http://scholar.googleusercontent.com/scholar?q=cache:iJA2jZHK8sEJ:scholar.google.com/+web+mining&hl=en&as_sdt=0,7