Representation of Web Data in a
Web Warehouse
Ragini A.S.
&
Shipra Dutta
November 20th, 2001
Overview
• Need for a web warehouse
• WHOM: Data Model for WHOWEDA
• Concept of Node & Link
• Model for representing metadata, structure & content of web documents & hyperlinks
• NMT & LMT
• Modeling of structural & textual content of web documents
• NDT & LDT
• Advantages
• Disadvantages
• Conclusion & Future Work
Need of a Web Warehouse
• Rapid growth of the WWW, which is a distributed
global information resource.
• Applications must be able to harness and analyze
web data.
• Growing number of mobile users.
• Necessity to exploit historical web data.
• Traditional information retrieval techniques and
search engines are not satisfactory.
Data Model of WHOWEDA
WHOM (Warehouse Object Model)
• Consists of two components:
(1) a set of web objects
(2) a set of web operators
• Centered on the notion of a web table, which is a set
of web tuples
• A web tuple is a set of directed graphs, each
consisting of a set of nodes and links, that satisfies a
web schema
• A set of operators such as global web coupling, web
join and web select is used to manipulate the
web data
Metadata associated with HTML &
XML documents (web documents)
• HTML or XML documents may have metadata such as:
  • URL
  • Format, size (in bytes), date of last modification
  • Information about the author
• A hyperlink in a web document may have metadata such as:
  • Source URL
  • Target URL
  • Type of hyperlink (interior, local or global)
Node Meta Data Attribute
• Represented using the data type node
metadata-attribute
• A meta-attribute may be either atomic or complex
• Example of a complex attribute: URL, which can be
decomposed into server, port, protocol, path and filename
• Example of an atomic attribute: size, as it cannot
be further decomposed
Set of node meta data attributes
Figure 1
Node Meta data Tree(NMT)
• Represents an instance of a node meta-attribute
• Internal vertices of the tree are meta-attribute
names
• Leaf vertices of the tree are values of the
metadata attributes
Example
Example of NMT
Consider
URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm
Last Modification Date: Thursday, 15th July, 1999, 04:50:53
Size: 10761K
Attribute URL has the following attribute/value pairs:
(Protocol, "http"), (Server, "www.ninds.nih.gov"), (Path, "patients/Disorder/Alexander")
and (Filename, "Alexander.htm")
Attribute Server has the following sub-attribute/value pairs:
(Name, "www.ninds.nih"), (Domain name, "gov")
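The decomposition above can be sketched as a nested dictionary, where keys play the role of internal vertices (meta-attribute names) and string values play the role of leaf vertices. This is only an illustrative sketch: the function names `build_nmt` and `leaves` are hypothetical, and splitting the host at its last dot into Name and Domain name is an assumption that simply mirrors the example.

```python
from urllib.parse import urlsplit


def build_nmt(url, last_modified, size):
    """Build a node metadata tree (NMT) as nested dicts.

    Dict keys are meta-attribute names (internal vertices);
    string values are attribute values (leaf vertices).
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: the last dot-separated label is the "Domain name",
    # mirroring the (Name, Domain name) split in the slide's example.
    name, _, domain = host.rpartition(".")
    directory, _, filename = parts.path.lstrip("/").rpartition("/")
    return {
        "URL": {
            "Protocol": parts.scheme,
            "Server": {"Name": name, "Domain name": domain},
            "Path": directory,
            "Filename": filename,
        },
        "Last modification date": last_modified,
        "Size": size,
    }


def leaves(tree, prefix=()):
    """Yield (path, value) for every leaf vertex of the tree."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaves(value, prefix + (key,))
        else:
            yield prefix + (key,), value


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July, 1999, 04:50:53",
    "10761K",
)
for path, value in leaves(nmt):
    print(" / ".join(path), "=", value)
```

Atomic attributes such as Size appear directly as leaves, while complex attributes such as URL and Server become subtrees.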
Node Meta Data Tree
Figure 2
Link Meta Data Attribute
• Represented using the data type link
metadata-attribute
• Each attribute can be either atomic or complex
• Example of a complex attribute: source URL or target URL,
which can be decomposed into server, port, protocol, path and filename
• Example of an atomic attribute: link type (local, global or interior)
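As a rough illustration of the atomic link-type attribute, the sketch below derives it from the source and target URLs. The slides do not define the three types, so the rule used here is an assumption: "interior" when the target lies in the same document, "local" when it is a different document on the same server, and "global" otherwise. The function names `link_type` and `link_metadata` are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url, target_url):
    """Classify a hyperlink as interior, local or global (assumed rule)."""
    src_doc, _ = urldefrag(source_url)  # drop any #fragment
    tgt_doc, _ = urldefrag(target_url)
    if src_doc == tgt_doc:
        return "interior"  # target is inside the same document
    if urlsplit(src_doc).netloc.lower() == urlsplit(tgt_doc).netloc.lower():
        return "local"  # different document, same server
    return "global"  # different server


def link_metadata(source_url, target_url):
    """Collect the link metadata attributes named on the slide."""
    return {
        "Source URL": source_url,
        "Target URL": target_url,
        "Link type": link_type(source_url, target_url),
    }


# Hypothetical URLs, used only to exercise the three cases.
print(link_metadata("http://a.example/x.html", "http://a.example/x.html#top"))
print(link_metadata("http://a.example/x.html", "http://a.example/y.html"))
print(link_metadata("http://a.example/x.html", "http://b.example/"))
```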
Set of link meta data attributes
Figure 3
Link metadata tree (LMT)
• Represents an instance of a link metadata attribute
• Corresponds to the link metadata attribute/value
pairs of hyperlinks
• Internal vertices of the tree are meta-attribute
names of hyperlinks
• Leaf vertices of the tree are values of the metadata
attributes
Figure 4
Example of LMT
Figure 5
Issues for modeling Structure & Content
• Web data embedded within an HTML or XML
document should be written in compliance with
the HTML and XML specifications, respectively
• Modeling tags and tagless data
• Modeling hierarchical structure
• Attribute/value pairs associated with tags
• Order of text
• Location information of a portion of tagless data
Components of Node structural attribute
• Name
• Attribute_list
• Content
• Identifier
• Location_attribute
Node Data Tree(NDT)
• Represents the structure and content of a web
page
• Node structural objects, which are instances
of node structural attributes, satisfy some
dependency constraints; together they can be
visualized as a rooted, directed tree, which is
the NDT.
Figure 6
Features Of NDT
• Rooted, directed tree
• Loss of structural information
• No loss of content data
• Exclusion of anchor tags
Components of NDT
• Name
• Attribute_list
• Identifier
• Content
• Location_attribute
Definition of Dependency Constraints
NDT for HTML Documents
Classification of HTML tags
• Non-noisy tags: HTML tags that are represented
in the node data tree.
• Noisy tags: HTML tags that are ignored while
mapping an HTML document to an NDT. The three types of
noisy tags that our model considers are:
1. Tags used for formatting purposes
2. Tags used to represent a hyperlink
3. Tags that specify executable content
Representation of non-noisy tags in NDT
Classified into three types:
• Type 1 tags
• Type 2 tags
• Type 3 tags
Noisy and Non-noisy tag attributes
• Noisy attributes: attributes that are ignored
while generating an NDT from an HTML document.
The three types of noisy attributes are:
  • Attributes used for formatting purposes
  • Attributes used to represent the behavior of the web document
  • Attributes that specify executable content
• Non-noisy attributes: attributes considered
important in the context of modeling an HTML
document; they are represented in the NDT
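The mapping from an HTML document to an NDT, with noisy tags and noisy attributes dropped, could be sketched roughly as follows using Python's standard `html.parser`. The concrete sets of noisy tags and attributes are hypothetical, since the slides name the categories but not their members. The sketch also assumes that text inside a formatting or anchor tag is kept as content of the enclosing non-noisy node (structure is lost, content is not), that executable content is discarded, and it omits Location_attribute for brevity.

```python
from html.parser import HTMLParser

# Hypothetical noisy sets, for illustration only.
FORMATTING_TAGS = {"b", "i", "u", "font", "center"}
HYPERLINK_TAGS = {"a"}
EXECUTABLE_TAGS = {"script"}
NOISY_TAGS = FORMATTING_TAGS | HYPERLINK_TAGS | EXECUTABLE_TAGS
NOISY_ATTRS = {"align", "bgcolor", "color", "onclick", "onload"}
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # never have end tags


class NDTBuilder(HTMLParser):
    """Builds a node data tree of nested dicts from HTML."""

    def __init__(self):
        super().__init__()
        self.next_id = 0
        self.root = self.new_node("#document", [])
        self.open_nodes = [self.root]  # path from the root to the current node
        self.in_executable = False

    def new_node(self, name, attrs):
        node = {
            "name": name,
            # Keep only non-noisy attributes.
            "attribute_list": {k: v for k, v in attrs if k not in NOISY_ATTRS},
            "content": [],
            "identifier": self.next_id,  # assigned in document order
            "children": [],
        }
        self.next_id += 1
        return node

    def handle_starttag(self, tag, attrs):
        if tag in EXECUTABLE_TAGS:
            self.in_executable = True
        if tag in NOISY_TAGS:
            return  # noisy tags create no vertex
        node = self.new_node(tag, attrs)
        self.open_nodes[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.open_nodes.append(node)

    def handle_endtag(self, tag):
        if tag in EXECUTABLE_TAGS:
            self.in_executable = False
        if tag in NOISY_TAGS:
            return
        # Close the nearest open node with this name; this also closes
        # any nodes whose end tags were omitted in between.
        for i in range(len(self.open_nodes) - 1, 0, -1):
            if self.open_nodes[i]["name"] == tag:
                del self.open_nodes[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_executable:
            self.open_nodes[-1]["content"].append(text)


def html_to_ndt(html):
    builder = NDTBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


SAMPLE = (
    '<html><body bgcolor="white">'
    '<p id="p1">Hello <b>world</b> <a href="x.html">link</a></p>'
    "<script>alert(1)</script>"
    "</body></html>"
)
ndt = html_to_ndt(SAMPLE)
paragraph = ndt["children"][0]["children"][0]["children"][0]
print(paragraph["name"], paragraph["attribute_list"], paragraph["content"])
```

In the sample, the `bgcolor` attribute, the `<b>` and `<a>` tags, and the script body leave no trace in the tree, while the text they enclosed (other than the script) remains attached to the `p` node.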
Representation Of Content and
Structure in XML Document.
• XML documents are mapped into a
node data tree.
• Node structural objects, which are instances
of node structural attributes, satisfy some
dependency constraints; together they can be
visualized as a rooted, directed tree, which is
the NDT.
Issues related to generating NDT
from XML Documents.
• XML documents do not have a fixed set
of tags and attributes, unlike HTML.
• The tags and attributes are defined by the user.
• XML does not have the problem of elements
with no end tags or elements whose tags may be
omitted.
• Thus there is no need to address the issues related to Type
2 and Type 3 tags while generating NDTs from
XML documents.
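Because every XML element is explicitly closed, the mapping to an NDT is a direct walk over the element tree. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `xml_to_ndt` and the sample document are hypothetical, and Location_attribute is again left out.

```python
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_ndt(element, ids=None):
    """Map an XML element and its descendants to nested NDT dicts."""
    if ids is None:
        ids = count(1)  # identifiers assigned in document (pre-)order
    # Text directly inside this element: its leading text plus the
    # tail text that follows each of its children.
    pieces = [element.text] + [child.tail for child in element]
    node = {
        "name": element.tag,
        "attribute_list": dict(element.attrib),
        "identifier": next(ids),
        "content": [p.strip() for p in pieces if p and p.strip()],
        "children": [],
    }
    for child in element:
        node["children"].append(xml_to_ndt(child, ids))
    return node


# Hypothetical user-defined tags; XML has no fixed tag set.
SAMPLE = '<drug id="d1"><name>Aspirin</name><use>Pain relief</use></drug>'
ndt = xml_to_ndt(ET.fromstring(SAMPLE))
print(ndt["name"], [child["name"] for child in ndt["children"]])
```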
Example of NDT generated from
XML Document.
Representing Structure and
Contents of hyperlinks.
• A hyperlink is an explicit relationship between two or
more data objects or portions of data objects.
• A hyperlink is defined by the data type Link type.
• A Link type consists of three components: a set of
metadata attributes, a set of link structural attributes and a
reference identifier.
• Link structural attributes express the structure
and content of hyperlinks; the reference identifier
specifies the location of hyperlinks in web
documents.
Issues for modeling hyperlinks.
• The <a> tag and its href or name attributes are
used to specify hyperlinks in HTML documents.
XML links are specified by an attribute
named xml:link. Possible values are simple and
extended, as well as locator, group and document.
• When authors add a hyperlink to a document D,
they include a description of D in
addition to its URL; both are important.
• The location of hyperlinks is important, as we may
need to impose constraints in a query to follow
only those links located in a particular
portion of a web page.
Link Structural attributes.
Similar to a node structural attribute, a link structural
attribute consists of three components:
• Name, corresponding to the start tag of the
HTML or XML link.
• Attribute_list, a finite, possibly empty
set of attributes associated with the tag.
• Content, the text between the start and end tags.
Reference Identifier
It is a unique identifier that references an identifier in node
structural attribute. For example consider the web page in
Node Data tree for XML
Document.
Link data Tree
• It can be represented by a set of instances of link
structural attributes.
• HTML or XML documents are mapped into
instances of link structural attributes called link
structural objects. These objects and the
dependency constraints can be visualized as a
rooted, directed tree called a link data tree.
• The internal vertices represent tagged elements
containing tag names and a list of attribute/value
pairs; the leaf vertices represent the label of the
link.
Link Data Tree for HTML
Documents
• The <a> tag marks a block of an HTML
document as a hypertext link.
• <a> can take several attributes, such as href or
name, which specify the destination of the
hypertext link or indicate that the marked
text can be the target of a hypertext link.
Example of LDT for HTML
Document
• Consider a code snippet:
<a href="http://www.rxlist.com/cgi/generic/index.html">all
RxList monographs(Nearly 300 of them)</a>
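For the snippet above, a link structural object with the three components Name, Attribute_list and Content might be extracted roughly as follows. This is a minimal sketch built on Python's standard `html.parser`; the class name `LinkExtractor` is hypothetical, the href is written with quotes, and the reference identifier is omitted because it depends on the node identifiers of the enclosing page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects one link structural object per <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.current = None  # the <a> element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current = {
                "name": "a",                    # internal vertex: tag name
                "attribute_list": dict(attrs),  # attribute/value pairs
                "content": "",                  # leaf vertex: link label
            }

    def handle_data(self, data):
        if self.current is not None:
            self.current["content"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append(self.current)
            self.current = None


SNIPPET = (
    '<a href="http://www.rxlist.com/cgi/generic/index.html">'
    "all RxList monographs(Nearly 300 of them)</a>"
)
extractor = LinkExtractor()
extractor.feed(SNIPPET)
extractor.close()
link = extractor.links[0]
print(link["name"], link["attribute_list"], repr(link["content"]))
```

The resulting tree is linear: one internal vertex for the `a` tag with its `href` pair, and one leaf holding the label.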
The Link data tree of the web
page
Illustration of LDT of hyperlinks
which contain image.
Link Data Tree for XML
Documents
• There are no fixed link tags to express links in
XML data, so an element is an XML link if either it
has an xml:link attribute or the element and all of its
attributes and content adhere to the syntactic
requirements.
• Two types of links are considered: simple and
extended links.
• A simple link, when mapped to an LDT, is always a
linear tree.
Example of Extended XML Link
Noisy Tags and Attributes
• XML tags are user-defined and are not used
for formatting purposes. Thus, there are no
noisy tags to be ignored while generating
LDTs.
• The attributes that specify link behavior,
such as show and actuate, are however
ignored.
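A simple XML link can be mapped to a linear LDT while dropping the behavior attributes, as in the sketch below. It covers only the first detection rule (an explicit xml:link attribute with the value simple). The element name `citation`, the target `target.xml` and the function name `simple_link_to_ldt` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"
NOISY_LINK_ATTRS = {"show", "actuate"}  # link-behavior attributes


def simple_link_to_ldt(element):
    """Return a linear LDT for a simple link, or None if it is not one."""
    if element.get(XML_NS + "link") != "simple":
        return None
    attrs = {}
    for key, value in element.attrib.items():
        readable = key.replace(XML_NS, "xml:")
        if readable not in NOISY_LINK_ATTRS:
            attrs[readable] = value
    return {
        "name": element.tag,                     # internal vertex
        "attribute_list": attrs,                 # non-noisy attributes only
        "label": (element.text or "").strip(),   # leaf vertex
    }


SAMPLE = (
    '<citation xml:link="simple" href="target.xml" '
    'show="new" actuate="user">Related document</citation>'
)
ldt = simple_link_to_ldt(ET.fromstring(SAMPLE))
print(ldt)
```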
Advantages
• It provides location independent
information to the mobile users.
• It is used in building web data repository
that supports historical web data.
• It is an effective and efficient information
retrieval technique.
Disadvantages
• Information retrieval is complex.
• It does not handle the executable contents of
the web documents.
• Attributes used to represent behavior of web
document are not considered in this model.
Conclusions
• This model logically separates the
hyperlinks from the web documents.
• It aids in the representation of the metadata,
contents and structure of HTML and XML
data as a tree-like structure.