Lecture 11

CSE5230/DMS/2002/11
Data Mining - CSE5230
Web Mining
CSE5230 - Data Mining, 2002
Lecture 11.1
Lecture Outline
• How big is the web?
• What is “web data”?
• A taxonomy of web mining tasks
• Example: targeted advertising
• Example: personalization
• References
How big is the web?
• It is not easy to determine the size of the web
  - In 1999, one estimate was that there were approximately 350 million web pages, growing at about 1 million pages per day
  - In 2001, Google announced that they were indexing around 3 billion web documents
• No matter which of these is more accurate – it’s very big!
• We can view the web as the world’s biggest database
  - The word “database” is used loosely here, because the web has no real formal structure or database schema
    » This makes the application of data mining to the web potentially very useful, but also difficult
What is “web data”?
• Web data can be classified as follows [Dun2002]:
  - The actual content of web pages (text, images, multimedia)
  - Intrapage structure – the HTML or XML mark-up specifying the organization of the page content
  - Interpage structure – the links into and out of web pages
  - Usage data describing how the users of a web site access pages – navigation patterns
  - User profiles – these can include demographic data obtained from a registration process, or perhaps IP addresses. They can also include information found in cookies
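The first three kinds of web data can be told apart on a single page. The following is a minimal sketch, assuming Python's standard `html.parser` module; the class name `PageDataExtractor` and the sample page are hypothetical and made up for illustration. Usage data and user profiles come from server logs and registration data, so they are not covered here.

```python
from html.parser import HTMLParser


class PageDataExtractor(HTMLParser):
    """Splits one HTML page into content, intrapage structure and links."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # content: the visible text
        self.tags = []          # intrapage structure: opening tags, in order
        self.links = []         # interpage structure: outgoing hyperlinks

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_chunks.append(text)


# Hypothetical page, made up for illustration.
SAMPLE_PAGE = """
<html><body>
<h1>Rugby news</h1>
<p>Match report. <a href="http://example.com/scores">Scores</a></p>
</body></html>
"""

extractor = PageDataExtractor()
extractor.feed(SAMPLE_PAGE)
extractor.close()
print("content:", extractor.text_chunks)
print("intrapage structure:", extractor.tags)
print("interpage structure:", extractor.links)
```

Only outgoing links are visible from one page; links into a page can only be found by processing the other pages that point to it.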
A taxonomy of web mining tasks (1)
• Web Mining
  - Web Content Mining
    » Web Page Content Mining
    » Search Result Mining
  - Web Structure Mining
  - Web Usage Mining
    » General Access Pattern Tracking
    » Customized Usage Tracking
• From [Dun2002], following [Zai1999].
A taxonomy of web mining tasks (2)

• Web content mining
  - Examines the contents of web pages (text, graphics)
  - Examines the results of web searches
    » Mining systems built on top of existing search engines
  - Similar to traditional information retrieval (text categorization, text filtering, etc.)
    » Often goes further than simple keyword search – e.g. may cluster similar pages
• Web structure mining
  - Looks at page structure
    » e.g. text in <H1> tags may be more important
  - Looks at links between pages
    » e.g. pages with many incoming links may be more useful
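The idea that pages with many incoming links may be more useful can be sketched as a simple count over a link graph. This is only an illustration under stated assumptions: the graph `LINKS` and the function `count_incoming_links` are hypothetical, and a raw in-link count is a much cruder measure than the link analysis used by real search engines.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home.html": ["news.html", "scores.html"],
    "news.html": ["scores.html"],
    "about.html": ["home.html", "scores.html"],
    "scores.html": [],
}


def count_incoming_links(link_graph):
    """Return a Counter mapping each page to its number of incoming links."""
    incoming = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):      # count each source page only once
            if target != source:         # ignore links from a page to itself
                incoming[target] += 1
    return incoming


in_links = count_incoming_links(LINKS)
for page, n in in_links.most_common():
    print(page, n)
```

Counting each source page once keeps a single page from inflating a target's score by repeating the same link.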
A taxonomy of web mining tasks (3)
• Web usage mining
  - Looks at log files of web access
  - General access tracking looks at history of pages visited
  - Customized usage tracking may be focused on particular kinds of usage, or particular users
  - Involves mining of sequential patterns
    » Can use association rule discovery, or hidden Markov models (HMMs)
    » These patterns can be clustered to reveal users with similar access behaviour
  - Can be used to
    » improve web site design
    » customize presentation via collaborative filtering
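As a rough illustration of mining sequential patterns from access logs, the sketch below groups a log into per-user page sequences and keeps the page-to-page transitions that occur in at least `min_support` sequences. It is a deliberately simplified stand-in, not association rule discovery or an HMM; the log format, the data, and the function names are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified access log: (user_id, timestamp_seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/news"), ("u1", 70, "/scores"),
    ("u2", 10, "/home"), ("u2", 45, "/news"),
    ("u3", 20, "/home"), ("u3", 50, "/news"), ("u3", 90, "/scores"),
]


def build_sessions(log):
    """Group log entries per user and order each user's pages by time."""
    per_user = defaultdict(list)
    for user, timestamp, page in log:
        per_user[user].append((timestamp, page))
    return {user: [page for _, page in sorted(visits)]
            for user, visits in per_user.items()}


def frequent_transitions(sessions, min_support):
    """Count consecutive page pairs per session; keep the frequent ones."""
    counts = Counter()
    for pages in sessions.values():
        pairs = set(zip(pages, pages[1:]))   # each pair once per session
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}


sessions = build_sessions(LOG)
patterns = frequent_transitions(sessions, min_support=2)
for (src, dst), support in patterns.items():
    print(src, "->", dst, "seen in", support, "sessions")
```

A real log would also need session splitting (for example, by a timeout between requests) and user identification, both of which are skipped here.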
Example: targeted advertising (1)

• In marketing, targeting is any technique used to direct marketing or advertising effort to the portion of the population thought to be most valuable to the business, e.g. those
  - Likely to purchase
  - Likely to spend a lot
• The business wants to avoid spending money on sending advertising to people who will not respond to it
• In the web context, this can mean displaying an ad for a web site on a different web site
• Can use web usage information to work out what kind of people use a site: target demographics
  - Sell advertising to companies wanting to target that demographic
Example: targeted advertising (2)
• For example, the Rugby Heaven web site (http://rugbyheaven.smh.com.au/) is today hosting advertising for:
  - MLC life insurance
  - Fintrack Financial Services
  - Business Review Weekly (BRW)
• These advertisers appear to think that this site is likely to be popular with older people who have money!
• The URL for the BRW ad is:
  http://campaigns.f2.com.au/event.ng/Type=click&FlightID=10928&AdID=24947&TargetID=2389&Segments=2,13,23,31,35,77,81,88,93,94,153,855,976,993,1145,1301,1989,2320,2389,2394,2396,2477,2534,2576,2581,2689&Targets=535,2389,40,60,1834&Values=25,31,43,48,50,60,72,81,91,100,110,135,150,157,233,239,366,422,605,791,804,805,806,1203,1278,1403,1432,1476,1485,1499&RawValues=&Redirect=http:%2F%2Fwww.brw.com.au%2Fsubscription%2Fsubscribe.asp
• It is clear that some sophisticated targeting is going on
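To see how much targeting information the BRW ad URL carries, it can be split into its `key=value` fields. The sketch below assumes the URL's wrapped lines join without spaces (so that `Se` + `gments` reads `Segments`, and so on) and that everything after `event.ng/` is a list of `&`-separated fields. What the numeric codes mean is not documented in the lecture; the helper `parse_ad_url` is hypothetical and only separates the fields.

```python
from urllib.parse import unquote

# The BRW ad URL from the slide, with line wraps rejoined (assumed joins).
AD_URL = (
    "http://campaigns.f2.com.au/event.ng/Type=click&FlightID=10928"
    "&AdID=24947&TargetID=2389"
    "&Segments=2,13,23,31,35,77,81,88,93,94,153,855,976,993,1145,1301,"
    "1989,2320,2389,2394,2396,2477,2534,2576,2581,2689"
    "&Targets=535,2389,40,60,1834"
    "&Values=25,31,43,48,50,60,72,81,91,100,110,135,150,157,233,239,366,"
    "422,605,791,804,805,806,1203,1278,1403,1432,1476,1485,1499"
    "&RawValues="
    "&Redirect=http:%2F%2Fwww.brw.com.au%2Fsubscription%2Fsubscribe.asp"
)


def parse_ad_url(url):
    """Split the 'key=value&key=value' part after 'event.ng/' into a dict."""
    _, _, params = url.partition("event.ng/")
    fields = {}
    for item in params.split("&"):
        key, _, value = item.partition("=")
        fields[key] = unquote(value)     # decode %2F etc. in the values
    return fields


fields = parse_ad_url(AD_URL)
print("segments:", len(fields["Segments"].split(",")))
print("targets:", len(fields["Targets"].split(",")))
print("values:", len(fields["Values"].split(",")))
print("redirect:", fields["Redirect"])
```

The long lists of codes in a single click-through link suggest that the visitor has been placed in many segments at once.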
Example: personalization (1)
• Personalization spans the areas of web content mining and web usage mining
• Personalization aims to modify document contents or access patterns to better match the preferences of a particular user
• Personalization can involve
  - Dynamically creating and serving web pages that are unique to an individual user
  - Determining which pages to retrieve or link to on a user-by-user basis
Example: personalization (2)
• Unlike targeting, personalization can be done on the target web page itself (unlike a targeted advertisement, which is for another site)
  - Simple example: including the name of the user in the page content
• Personalization techniques include
  - Use of cookies
  - Use of user databases
  - Use of web usage patterns to identify similar users (for use in collaborative filtering)
• Often requires a user to log in – this part is not data mining
Example: personalization (3)
• A classic example of personalization is recommending to a user
  - A product very similar to something they have bought before (if the web site is selling something)
  - Content that is similar to something they have used before
• Personalization techniques can be based on clustering, classification or even prediction
  - With classification, the desires of a user are determined based on the class to which he/she is assigned. Classes may be predetermined by experts.
  - With clustering, clusters of users with similar navigation or purchasing behaviour are found, and the user’s desires are determined on this basis
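The classification approach can be sketched as assigning a user to the most similar of a few expert-defined classes. This is a minimal illustration under stated assumptions: the class profiles, the page categories, and the choice of Jaccard similarity as the matching rule are all hypothetical, not a method given in the lecture.

```python
# Hypothetical expert-defined classes: typical page categories per class.
CLASS_PROFILES = {
    "sports_fan": {"rugby", "cricket", "scores"},
    "investor": {"markets", "insurance", "business"},
}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_user(visited_categories, profiles):
    """Assign the user to the class whose profile is most similar."""
    return max(profiles,
               key=lambda name: jaccard(visited_categories, profiles[name]))


# A hypothetical user who mostly reads sport but also some business pages.
user_class = classify_user({"rugby", "scores", "business"}, CLASS_PROFILES)
print(user_class)
```

Once the class is known, the site can serve whatever content has been prepared for that class. With clustering, the profiles would be discovered from the usage data instead of being written by experts.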
Example: personalization (4)
• Amazon.com makes use of personalization, as we will see in an on-line example
• They make use of the user’s past behaviour
• They also use collaborative filtering – they recommend products bought by users who have similar profiles to the current user
  - Could use clustering, or information filtering techniques
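Recommending products bought by users with similar profiles can be sketched as a small user-based collaborative filter. This is not Amazon's actual system, which the lecture does not describe; the purchase data, the function names, and the use of Jaccard similarity between purchase sets are assumptions made for illustration.

```python
# Hypothetical purchase histories: user -> set of products bought.
PURCHASES = {
    "alice": {"book_a", "book_b", "cd_x"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"cd_x", "cd_y"},
    "dave": {"dvd_q"},
}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, purchases, top_n=3):
    """Score unseen products by the similarity of the users who bought them."""
    own = purchases[user]
    scores = {}
    for other, items in purchases.items():
        if other == user:
            continue
        sim = jaccard(own, items)
        if sim == 0.0:                   # nothing in common: ignore this user
            continue
        for product in items - own:      # only products the user lacks
            scores[product] = scores.get(product, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]


recommendations = recommend("alice", PURCHASES)
print(recommendations)
```

Users with no purchases in common contribute nothing, so a product bought only by dissimilar users is never recommended.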
References
• [Dun2002] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, Upper Saddle River, NJ, USA, 2002, pp. 195-220.
• [Zai1999] Osmar R. Zaïane, Resource and Knowledge Discovery from the Internet and Multimedia Repositories, PhD Thesis, Simon Fraser University, Canada, March 1999.