whom

Representation of Web Data in a Web Warehouse
Ragini A.S. & Shipra Dutta
November 20th, 2001

Overview
• Need for a web warehouse
• WHOM: Data Model for WHOWEDA
• Concept of Node & Link
• Model for representing metadata, structure & content of web documents & hyperlinks
• NMT & LMT
• Modeling of structural & textual content of web documents
• NDT & LDT
• Advantages
• Disadvantages
• Conclusion & Future Work

Need for a Web Warehouse
• Rapid growth of the WWW, which is a distributed global information resource.
• Applications must be able to harness and analyze web data.
• Growing number of mobile users.
• Necessity to exploit historical web data.
• Traditional information retrieval techniques & search engines are not satisfactory.

Data Model of WHOWEDA
WHOM (WareHouse Object Model)
• Consists of two components: (1) a set of web objects, (2) a set of web operators
• Centered on the notion of a web table, which is a set of web tuples
• A web tuple is a set of directed graphs, each consisting of a set of nodes and links, and satisfies a web schema
• A set of operators such as global web coupling, web join, web select etc. is used to manipulate the web data

Metadata associated with HTML & XML documents (web documents)
• HTML or XML documents may have metadata such as
  • URL
  • Format, size (in bytes), date of last modification
  • Information about the author
• A hyperlink in a web document may have metadata such as
  • Source URL
  • Target URL
  • Type of hyperlink (interior, local or global)

Node Meta Data Attribute
• Represented using a data type node metadata-attribute
• A meta-attribute may be either atomic or complex
• E.g. of a complex attribute: URL, which can be decomposed into server, port, protocol, path and filename
• E.g. of an atomic attribute: size, as it cannot be further decomposed

Set of node meta data attributes
Figure 1

Node Meta Data Tree (NMT)
• Represents an instance of a node meta-attribute
• Internal vertices of the tree are meta-attribute names
• Leaf vertices of the tree are values of the meta data attributes

Example of NMT
Consider
URL: http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm
Last Modification Date: Thursday, 15th July, 1999, 04:50:53
Size: 10761K
Attribute URL has the following attribute/value pairs: (Protocol, "http"), (Server, "www.ninds.nih.gov"), (Path, "patients/Disorder/Alexander") and (Filename, "Alexander.htm")
Attribute Server has the following sub-attribute/value pairs: (Name, "www.ninds.nih"), (Domain name, "gov")

Node Meta Data Tree
Figure 2

Link Meta Data Attribute
• Represented using a data type link metadata-attribute
• Each attribute can be either atomic or complex
• E.g. of a complex attribute: source URL or target URL, which can be decomposed into server, port, protocol, path and file name
• E.g. of an atomic attribute: link type (local, global or interior)

Set of link meta data attributes
Figure 3

Link Meta Data Tree (LMT)
• Represents an instance of a link metadata-attribute
• Corresponds to the link meta data attribute/value pairs of hyperlinks
• Internal vertices of the tree are meta-attribute names of hyperlinks
• Leaf vertices of the tree are values of the meta data attributes

Figure 4

Example of LMT
Figure 5

Issues for modeling Structure & Content
• Web data embedded within an HTML or XML document should be written in compliance with the HTML & XML specifications respectively
• Modeling tags & tagless data
• Modeling hierarchical structure
• Attribute/value pairs associated with tags
• Order of text
• Location information of a portion of tagless data

Components of Node structural attribute
• Name
• Attribute_list
• Content
• Identifier
• Location_attribute

Node Data Tree (NDT)
• Represents the structure & content of a web page
• Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT

Figure 6

Features of NDT
• Rooted, directed tree
• Loss of structural information
• No loss of content data
• Exclusion of anchor tags

Components of NDT
• Name
• Attribute_list
• Identifier
• Content
• Location_attribute

Definition of Dependency Constraints

NDT for HTML Documents

Classification of HTML tags
• Non-noisy tags: HTML tags which are considered in the node data tree.
• Noisy tags: HTML tags which are ignored while mapping an HTML document to an NDT. The three types of noisy tags that our model considers are
  1. Tags used for formatting purposes
  2. Tags used to represent a hyperlink
  3. Tags that specify executable content

Representation of non-noisy tags in NDT
Classified into three types
• Type 1 tags
• Type 2 tags
• Type 3 tags

Noisy and Non-noisy tag attributes
• Noisy attributes: attributes which are ignored while generating an NDT from an HTML document.
  Three types of noisy attributes are
  • Attributes used for formatting purposes
  • Attributes used to represent the behavior of the web document
  • Attributes which specify executable content
• Non-noisy attributes: attributes considered important in the context of modeling an HTML document; they are represented in the NDT

Representation of Content and Structure in XML Documents
• XML documents are mapped into a Node Data Tree.
• Node structural objects, which are instances of node structural attributes, satisfy some dependency constraints; together they can be visualized as a rooted, directed tree, which is an NDT.

Issues related to generating NDT from XML Documents
• XML documents do not have a fixed set of tags and attributes like HTML.
• The tags and attributes are defined by the user.
• XML does not encounter the problem of elements with no end tags and elements whose tags may be omitted.
• Thus there is no need to address the issues related to type 2 and type 3 tags while generating NDTs from XML documents.

Example of NDT generated from XML Document

Representing Structure and Contents of Hyperlinks
• A hyperlink is an explicit relationship between two or more data objects or portions of data objects.
• A hyperlink is defined by the data type Link type.
• A Link type consists of three components: a set of meta data attributes, a set of link structural attributes and a reference identifier.
• Link structural attributes are used to express the structure and content of hyperlinks, and the reference identifier is used to specify the location of hyperlinks in web documents.

Issues for modeling hyperlinks
• The <a> tag and the attributes href or name are used to specify hyperlinks in HTML documents. XML links are specified by the attribute named xml:link. Possible values are simple and extended, as well as locator, group and document.
• When authors add a hyperlink to a document D, they include a description of the document in addition to its URL; both are important.
• The location of hyperlinks is important, as we may need to impose constraints in a query to follow only those links which are located in a particular portion of a web page.

Link Structural Attributes
Similar to node structural attributes, a link structural attribute consists of three components:
• Name, corresponding to the start-tag of an HTML or XML link.
• Attribute_list, a finite, possibly empty set of attributes associated with the tag.
• Content, the text between the start and end tags.

Reference Identifier
A unique identifier that references an identifier in a node structural attribute. For example, consider the web page in Node Data Tree for XML Document.

Link Data Tree
• It can be represented by a set of instances of link structural attributes.
• HTML or XML documents are mapped into instances of link structural attributes called link structural objects; these objects and the dependency constraints can be visualized as a rooted, directed tree called a link data tree.
• The internal vertices represent tagged elements containing tag names and a list of attribute/value pairs; the leaf vertices represent the label of the link.

Link Data Tree for HTML Documents
• The <a> tag marks a block of an HTML document as a hypertext link.
• <a> can take several attributes, such as href or name, which specify the destination of the hypertext link or indicate that the marked text can be the target of a hypertext link.

Example of LDT for HTML Document
• Consider a code snippet:
<a href="http://www.rxlist.com/cgi/generic/index.html">all RxList monographs(Nearly 300 of them)</a>

The link data tree of the web page

Illustration of LDT of hyperlinks which contain an image

Link Data Tree for XML Documents
• There are no fixed link tags to express links in XML data, so an element is an XML link if either it has an xml:link attribute or the element and all of its attributes and content adhere to the syntactic requirements.
• Two types of links are to be considered: simple and extended links.
• A simple link, when mapped to an LDT, is always a linear tree.

Example of Extended XML Link

Noisy Tags and Attributes
• XML tags are user defined and are not used for formatting purposes. Thus, there are no noisy tags to be ignored while generating LDTs.
• The attributes which specify link behavior, such as show and actuate, are however ignored.

Advantages
• It provides location-independent information to mobile users.
• It is used in building a web data repository that supports historical web data.
• It is an effective and efficient information retrieval technique.

Disadvantages
• Information retrieval is complex.
• It does not handle the executable contents of web documents.
• Attributes used to represent the behavior of a web document are not considered in this model.

Conclusions
• This model logically separates the hyperlinks from the web documents.
• It aids in the representation of metadata, contents and structure of HTML and XML data as a tree-like structure.
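
Sketch: building a node metadata tree
The NMT example from the slides (the Alexander.htm page) can be illustrated with a short Python sketch. This is not the WHOWEDA implementation; it only shows one way to obtain a tree whose internal vertices are meta-attribute names and whose leaves are values. The function names build_nmt and leaves, the use of nested dictionaries, and the rule that the domain name is the last dot-separated label of the host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit
import posixpath


def build_nmt(url, last_modified, size):
    """Return a nested dict: keys are meta-attribute names (internal
    vertices), non-dict values are attribute values (leaf vertices)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: the domain name is the last dot-separated label,
    # e.g. "www.ninds.nih.gov" -> Name "www.ninds.nih", Domain name "gov".
    name, _, domain = host.rpartition(".")
    directory, filename = posixpath.split(parts.path)
    url_attr = {
        "Protocol": parts.scheme,
        "Server": {"Name": name, "Domain name": domain},
        "Path": directory.strip("/"),
        "Filename": filename,
    }
    # Port is part of the complex URL attribute but is often absent.
    if parts.port is not None:
        url_attr["Port"] = str(parts.port)
    return {
        "URL": url_attr,                       # complex attribute
        "Last modification date": last_modified,  # atomic attribute
        "Size": size,                          # atomic attribute
    }


def leaves(tree, prefix=()):
    """Yield (path of attribute names, value) for every leaf vertex."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaves(value, prefix + (key,))
        else:
            yield prefix + (key,), value


nmt = build_nmt(
    "http://www.ninds.nih.gov/patients/Disorder/Alexander/Alexander.htm",
    "Thursday, 15th July, 1999, 04:50:53",
    "10761K",
)
for path, value in leaves(nmt):
    print(" / ".join(path), "=", value)
```

Atomic attributes such as size stay as plain values, while the complex attribute URL becomes a subtree, which mirrors the atomic/complex distinction made for node meta data attributes.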

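Sketch: extracting link structural objects from HTML
The three components of a link structural attribute (Name, Attribute_list, Content) can be illustrated on the RxList snippet from the LDT example. The sketch below uses Python's standard html.parser module; the class name LinkExtractor, the helper extract_links and the dictionary layout are hypothetical, and nested tags inside a link (for example an image) are deliberately not modeled here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect one link structural object per <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished link structural objects
        self._current = None   # the <a> element currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {
                "name": "a",                    # Name: the start-tag
                "attribute_list": dict(attrs),  # Attribute_list: may be empty
                "content": "",                  # Content: text up to </a>
            }

    def handle_data(self, data):
        # Only text between <a> and </a> belongs to the link label.
        if self._current is not None:
            self._current["content"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def extract_links(html):
    """Return the link structural objects found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


snippet = (
    '<a href="http://www.rxlist.com/cgi/generic/index.html">'
    "all RxList monographs(Nearly 300 of them)</a>"
)
for link in extract_links(snippet):
    print(link)
```

In terms of the link data tree, the name and attribute list would sit in an internal vertex and the content would become the leaf vertex carrying the label of the link.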