Chapter 7 Web Content Mining Xxxxxx

Introduction
• Web-content mining techniques are used to
discover useful information from content on the
web
–textual
–audio
–video
–still images
–metadata
–hyperlinks
Introduction
• Some of the web content is generated
dynamically using queries to database
management systems
• Other web content may be hidden from
general users
Introduction
• Problems with the web data
– Distributed data
– Large volume
– Unstructured data
– Redundant data
– Quality of data
– High percentage of volatile data
– Varied data
Introduction
• Two approaches of web-content mining:
–agent-based
» software agents perform the content mining
–database oriented
» view the Web data as belonging to a database
Web Crawler
• A computer program that navigates the hypertext
structure of the web
– Crawlers are used to ease the formation of indexes used by
search engines
– The page(s) that the crawler begins with are called the seed
URLs.
• Every link from the first page is recorded and saved in a queue
• Visits a number of pages, builds an index, and then
replaces the current index
– Known as a periodic crawler because it is activated periodically
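The queue-based traversal described above can be sketched as follows. This is a minimal illustration, not a production crawler: the in-memory TOY_WEB dictionary and the crawl and fetch_links names are assumptions made for the example, standing in for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    Every link found on a visited page is saved in a queue, and each URL
    is visited at most once.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://example.com/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A periodic crawler would rerun crawl on a schedule and swap the freshly built index in for the old one.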
Web Crawler
• Another type is a Focused Crawler
– Generally recommended for use due to large size
of the Web
– Visits pages related to topics of interest
• If a page is not pertinent, the entire set of possible
pages below it is pruned
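The pruning rule can be sketched as below. The relevance test here is a plain keyword match, which is an assumption made for illustration; a real focused crawler would typically use a trained classifier. PAGES, TOPIC_TERMS, is_relevant, and focused_crawl are hypothetical names.

```python
from collections import deque

# Hypothetical pages: name -> (page text, outgoing links).
PAGES = {
    "seed": ("data mining and web mining overview", ["mining", "sports"]),
    "mining": ("web content mining with crawlers", ["crawler"]),
    "crawler": ("focused crawler for web mining", []),
    "sports": ("football scores and results", ["league"]),
    "league": ("web mining of league tables", []),
}
TOPIC_TERMS = {"mining", "crawler"}

def is_relevant(text, topic_terms):
    """Toy relevance test: the page shares at least one word with the topic."""
    return bool(topic_terms & set(text.lower().split()))

def focused_crawl(seed, pages, topic_terms):
    """Visit pages breadth-first, but do not follow links from irrelevant pages."""
    queue, seen, kept = deque([seed]), {seed}, []
    while queue:
        name = queue.popleft()
        text, links = pages[name]
        if not is_relevant(text, topic_terms):
            continue  # prune: everything reachable only through this page is skipped
        kept.append(name)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept

kept_pages = focused_crawl("seed", PAGES, TOPIC_TERMS)
print(kept_pages)
```

Note that "league" is never visited even though its text is on topic, because the only path to it runs through the pruned "sports" page.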
Multiple Layered Database
• Every layer of the database is more
generalized than the layer below it
• Unlike the lowest level, the upper levels are
structured and can be mined by an SQL-like
query language
Multiple Layered Database
• Provides an abstracted view of a fraction of
the web
• A Virtual Web View (VWV) can be constructed on top of it
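As a rough sketch of the layering idea, the example below stores structured page descriptions in one table and derives a more generalized layer from it with an SQL query. The table names, columns, and rows are all assumptions made for illustration; the slides do not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical structured layer: one row of extracted metadata per web page.
conn.execute(
    "CREATE TABLE layer1_docs (url TEXT, title TEXT, topic TEXT, size_kb INTEGER)"
)
conn.executemany(
    "INSERT INTO layer1_docs VALUES (?, ?, ?, ?)",
    [
        ("http://example.com/a", "Periodic crawlers", "crawling", 12),
        ("http://example.com/b", "Focused crawlers", "crawling", 20),
        ("http://example.com/c", "Collaborative filtering", "personalization", 8),
    ],
)

# Hypothetical higher layer: a generalization (per-topic summary) of the layer below.
conn.execute(
    "CREATE TABLE layer2_topics AS "
    "SELECT topic, COUNT(*) AS n_docs, SUM(size_kb) AS total_kb "
    "FROM layer1_docs GROUP BY topic"
)

rows = conn.execute(
    "SELECT topic, n_docs, total_kb FROM layer2_topics ORDER BY topic"
).fetchall()
print(rows)
```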
Search Engine
• Basic components to a search engine:
–The spider
gathers new or updated information on
Internet websites
–The index
used to store information about several
websites
–The search software
performs searching through the huge index in
an effort to generate an ordered list of useful
search results
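The index and search software components can be illustrated with a small inverted index. This is a sketch under simplifying assumptions: documents are split on whitespace, and results are ordered by the number of matching query terms. build_index and search are hypothetical names, and DOCS stands in for pages gathered by the spider.

```python
from collections import Counter, defaultdict

# Hypothetical documents, as if gathered by the spider.
DOCS = {
    "d1": "web content mining",
    "d2": "web crawler builds index",
    "d3": "image retrieval",
}

def build_index(docs):
    """The index: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """The search software: order documents by how many query terms they match."""
    hits = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]

index = build_index(DOCS)
results = search(index, "web mining")
print(results)
```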
Types of Queries
• Boolean Queries:
– Boolean logic queries connect words in the search using
operators such as AND or OR
• Natural Language Queries:
– In natural language queries the user frames the query as a
question or a statement
• Thesaurus Queries:
– In a thesaurus query the user selects a term from a
set of terms predetermined by the retrieval
system
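Boolean queries map naturally onto set operations over posting lists. The sketch below assumes a tiny hypothetical index, POSTINGS, that maps each term to the set of document ids containing it; the function names are illustrative.

```python
# Hypothetical posting lists: term -> ids of documents containing the term.
POSTINGS = {
    "web": {1, 2, 3},
    "mining": {1, 3},
    "crawler": {2},
}

def boolean_and(postings, *terms):
    """Documents that contain every one of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(postings, *terms):
    """Documents that contain at least one of the terms."""
    return set().union(*(postings.get(term, set()) for term in terms))

and_result = boolean_and(POSTINGS, "web", "mining")
or_result = boolean_or(POSTINGS, "mining", "crawler")
print(and_result, or_result)
```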
Types of Queries
• Fuzzy Queries:
– Fuzzy queries reflect no specificity
• Term Searches:
– The most common type of query on the Web is when a
user provides a few words or phrases for the search
• Probabilistic Queries:
– Probabilistic queries refer to the way in which the IR
system retrieves documents according to relevancy
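Ranking documents by relevancy for a term search can be sketched with TF-IDF weighting. This is a common stand-in chosen for illustration, not a true probabilistic retrieval model and not something the slides prescribe; DOCS and rank_by_tfidf are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical document collection.
DOCS = {
    "d1": "web mining web content",
    "d2": "web crawler",
    "d3": "image retrieval",
}

def rank_by_tfidf(docs, query):
    """Score each document as the sum of tf * idf over the query terms."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in query.lower().split()
            if doc_freq[term] > 0
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = rank_by_tfidf(DOCS, "web mining")
print(ranked)
```

Rare terms such as "mining" contribute more to the score than terms that appear in most documents.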
The Robot Exclusion
• Why would the developers prefer to exclude
robots from parts of their websites?
• The robot exclusion protocol
– to indicate restricted parts of the Website to
robots that visit our site
– for giving spiders (“robots”) limited access to a
website
The Robot Exclusion
• Website administrators and content providers can
limit robot activity through two mechanisms:
– The Robots Exclusion Protocol is used by Website
administrators to specify which parts of the site
should not be visited by a robot, by providing a file
called robots.txt on their site.
– The Robots META Tag is a special HTML META tag
that can be used in any Web page to indicate
whether that page should be indexed, or parsed for
links.
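Both mechanisms can be checked from a crawler using Python's standard library, as sketched below. The robots.txt content, the "MyCrawler" user agent, the example.com URLs, and the RobotsMetaParser class are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all robots are kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
allowed_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")

class RobotsMetaParser(HTMLParser):
    """Collect the directives of a <meta name="robots"> tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}

# Hypothetical page that asks not to be indexed and not to have its links followed.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
meta = RobotsMetaParser()
meta.feed(PAGE)
print(allowed_public, allowed_private, sorted(meta.directives))
```

Both mechanisms are advisory: they work only if the visiting robot chooses to honor them.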
Personalization of Web Content
• Used to modify the contents of a web page as
per the needs of a user
– Essentially, this involves building web pages
exclusively for each user
Types of Web Page Personalization
• Collaborative filtering:
– Achieves personalization by suggesting Web pages that
have earlier been given high ratings from similar users
• Manual techniques:
– Perform personalization via the use of rules that are
used to classify individuals based on profiles or
demographics
• Content-based filtering:
– Retrieves pages based on the similarity between them
and user profiles
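Collaborative filtering can be sketched as follows: find the user whose ratings are most similar to the target user's, then suggest pages that user rated highly. Cosine similarity and the rating threshold of 4 are choices made for this illustration; RATINGS, cosine, and recommend are hypothetical names.

```python
import math

# Hypothetical page ratings on a 1 to 5 scale: user -> {page: rating}.
RATINGS = {
    "alice": {"p1": 5, "p2": 4},
    "bob": {"p1": 5, "p2": 4, "p3": 5},
    "carol": {"p1": 1, "p4": 5},
}

def cosine(a, b):
    """Cosine similarity of two rating dicts, treating unrated pages as 0."""
    dot = sum(a[page] * b[page] for page in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(ratings, user, min_rating=4):
    """Suggest pages rated highly by the most similar user and unseen by `user`."""
    others = [other for other in ratings if other != user]
    nearest = max(others, key=lambda other: cosine(ratings[user], ratings[other]))
    return sorted(
        page
        for page, rating in ratings[nearest].items()
        if rating >= min_rating and page not in ratings[user]
    )

suggestions = recommend(RATINGS, "alice")
print(suggestions)
```

Content-based filtering would instead compare each page's own features with the user's profile, using the same kind of similarity measure.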
Multimedia Information Retrieval
• Perspective of images and videos
• A content-based retrieval system for images is the Query by Image
Content (QBIC) system, whose features include:
– A three-dimensional color feature vector, where distance measure
is simple Euclidean distance.
– k-dimensional color histograms, where the bins of the histogram
can be chosen by a partition-based clustering algorithm.
– A three-dimensional texture vector consisting of features that
measure scale, directionality, and contrast. Distance is computed
as a weighted Euclidean distance measure, where the default
weights are inverse variances of the individual features.
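The weighted Euclidean distance with inverse-variance weights can be sketched as below. The texture vectors are made-up values in the order (scale, directionality, contrast), and the function names are assumptions; the sketch also assumes every feature has non-zero variance.

```python
import math
from statistics import pvariance

# Hypothetical texture vectors: (scale, directionality, contrast) per image.
TEXTURES = [
    (1.0, 10.0, 0.2),
    (2.0, 30.0, 0.4),
    (3.0, 20.0, 0.9),
]

def inverse_variance_weights(vectors):
    """One weight per feature: 1 / variance of that feature over the collection."""
    return [1.0 / pvariance(feature_values) for feature_values in zip(*vectors)]

def weighted_euclidean(x, y, weights):
    """sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

weights = inverse_variance_weights(TEXTURES)
distance = weighted_euclidean(TEXTURES[0], TEXTURES[1], weights)
print(weights, distance)
```

Dividing by the variance keeps a feature with a large numeric range, such as directionality here, from dominating the distance.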
Multimedia Information Retrieval
• The query can be expressed directly in terms
of the feature representation itself
– For instance, Find images that are 40% blue in
color and contain a texture with specific
coarseness property
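A query stated directly in feature terms can be sketched as a filter over precomputed features. Everything here is an assumption made for illustration: images are lists of RGB pixels, a pixel counts as blue when its blue channel is the largest, "40% blue" is read as within a small tolerance of 0.40, and coarseness is a precomputed number between 0 and 1.

```python
# Hypothetical image collection: RGB pixels plus a precomputed coarseness feature.
IMAGES = {
    "sky": {"pixels": [(10, 20, 200)] * 6 + [(200, 200, 200)] * 4, "coarseness": 0.2},
    "sea": {"pixels": [(0, 30, 180)] * 4 + [(220, 210, 150)] * 6, "coarseness": 0.7},
    "forest": {"pixels": [(20, 150, 30)] * 10, "coarseness": 0.8},
}

def blue_fraction(pixels):
    """Fraction of pixels whose blue channel is strictly the largest."""
    blue = sum(1 for r, g, b in pixels if b > r and b > g)
    return blue / len(pixels)

def feature_query(images, target_blue, tolerance, coarseness_range):
    """Names of images that are about `target_blue` blue with coarseness in range."""
    low, high = coarseness_range
    return sorted(
        name
        for name, image in images.items()
        if abs(blue_fraction(image["pixels"]) - target_blue) <= tolerance
        and low <= image["coarseness"] <= high
    )

matches = feature_query(IMAGES, target_blue=0.40, tolerance=0.05, coarseness_range=(0.5, 1.0))
print(matches)
```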
Multimedia Information Retrieval
• MIR System
www.hermitagemuseum.org/html_En/index.html
• A QBIC Layout Search demo that gives a step-by-step
demonstration of the search described in
the text can be found at:
www.hermitagemuseum.org/fcgibin/db2www/qbicLayout.mac/qbic?selLang=English.
Multimedia Information Retrieval
• As multimedia becomes a more widely used data format,
it is vital to deal with the issues of:
– metadata standards
– classification
– query matching
– presentation
– evaluation
• Addressing these issues helps to guarantee the development
and deployment of efficient and effective multimedia
information retrieval systems