Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents?

Hassan Alam and Fuad Rahman
Human Computer Interaction Group
BCL Technologies Inc.
Santa Clara, CA 95050
www.bcltechnologies.com
fuad@bcltechnologies.com

Overview of the Talk
- Content Flow in Web Pages
- Structural Flow vs. Logical Flow
- Language Independence
- Independence for Semantics
- Content Flow from Purely Geometrical Information
- Conclusion and Future Work

Related Work
- Handcrafting
- Transcoding
- Adaptive Re-authoring

Handcrafting typically involves a set of content experts crafting web pages by hand for device-specific output.

Transcoding replaces HTML tags with suitable device-specific tags, such as those of HDML, WML and others.

Research on web page re-authoring can either use natural language processing (NLP) explicitly or rely on non-NLP techniques.

The HTML Table-based Structure

How is the Table Structure Exploited?
- Most HTML sources use tables as the principal organizational method.
- We assume that a geometric parser will give us the exact position of each table and sub-table.
- Content is in the columns. Rows are only used to arrange content.
- We assume that content flow is language independent, or is it?

How is the Table Structure Exploited?
- Calculate xPreference list
- Calculate yPreference list
- Perform proximity analysis: know thy neighbors!
- Calculate Inclusion Criterion
- Quantify each table: calculate area
- Calculate table hierarchy based on Inclusion Criterion and proximity analysis
Continued …

How is the Table Structure Exploited?
- Calculate TOC
- Calculate Level of TOC
- Calculate Merging Criterion
  - Same Inclusion Criterion
  - Lowest first
  - Sharing identical sides
  - Not if a border exists

The HTML Table-based Structure
Map of Table Layout

What is the Advantage of this Analysis?
- Relative importance of content can be assessed, resulting in better re-authoring.
- It becomes possible to capture the contextual relationships among the components of a document, such as what is a sidebar, what is an advertisement, what is a top bar, etc.
- If needed, natural language techniques can additionally be used to correlate tables by semantics or other criteria.

Current Work
- XML is being used successfully in many applications to mark up important information according to application-specific vocabularies.
- Two W3C Recommendations, XSLT (the Extensible Stylesheet Language Transformations) and XPath (the XML Path Language), meet the need to address and transform such marked-up information.
- This is an exploratory paper offering a specific pathway to the future of web page re-authoring, provided accurate layout information is available.
- It is probably better to use XSLT, which itself uses XPath, to specify how an XSLT processor is to create the desired output from a given marked-up input.

Future Work
- Exact location of each block, in rectangular coordinates, equivalent to its rendition in a standard browser.
- Size of each block of content.
- Type of content, e.g. text, graphics, etc.
- Weight of content, in terms of size and placement within a page.
- Continuity information, derived from physical association in terms of geometrical collocation.
- Classification of content into a set of pre-defined classes, e.g. main story, sidebars, links and so on.
- Linkage information from the XML representation, indicating the layers of information that can be hidden at a given level of summary. This can represent the content at many levels, but more than two or three levels are unsuitable for easy navigation.

Conclusions
- This talk offers a specific pathway to the future of web page re-authoring, provided accurate layout information is available.
- It in no way represents a state-of-the-art discussion of the possible uses of layout information. Rather, it focuses on one small part within an array of possibilities.
- It will be interesting to discuss other possibilities in this space during the DLIA workshop.
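
Illustration: table hierarchy from geometry

The table-analysis slides list the steps (area, Inclusion Criterion, table hierarchy) without giving formulas, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that a geometric parser has already produced one bounding box per table, that the Inclusion Criterion is plain geometric containment, and that content flows column first (left to right, then top to bottom). The names Box, build_hierarchy and reading_order are hypothetical, and the xPreference and yPreference lists, proximity analysis and merging are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical bounding box of one rendered table, in page coordinates."""
    name: str
    x: float
    y: float
    width: float
    height: float
    children: list = field(default_factory=list)

    @property
    def area(self) -> float:
        # "Quantify each table: calculate area"
        return self.width * self.height

    def contains(self, other: "Box") -> bool:
        # Assumed inclusion criterion: `other` lies entirely inside `self`
        # and is strictly smaller, so identical boxes never contain each other.
        return (
            self.area > other.area
            and self.x <= other.x
            and self.y <= other.y
            and other.x + other.width <= self.x + self.width
            and other.y + other.height <= self.y + self.height
        )


def build_hierarchy(boxes):
    """Attach each box to the smallest box containing it; return the roots."""
    for box in boxes:
        box.children.clear()
    roots = []
    for box in boxes:
        enclosing = [b for b in boxes if b.contains(box)]
        if enclosing:
            min(enclosing, key=lambda b: b.area).children.append(box)
        else:
            roots.append(box)
    return roots


def reading_order(boxes):
    """Assumed column-first flow: sort by x, then by y, descending into children."""
    ordered = []
    for box in sorted(boxes, key=lambda b: (b.x, b.y)):
        ordered.append(box.name)
        ordered.extend(reading_order(box.children))
    return ordered


# Made-up layout: a page with a top bar, a sidebar and a main area holding a story.
layout = [
    Box("story", 210, 110, 580, 300),
    Box("main", 200, 100, 600, 500),
    Box("topbar", 0, 0, 800, 100),
    Box("sidebar", 0, 100, 200, 500),
    Box("page", 0, 0, 800, 600),
]
roots = build_hierarchy(layout)
print(reading_order(roots))  # ['page', 'topbar', 'sidebar', 'main', 'story']
```

Choosing the smallest enclosing box as the parent keeps the hierarchy a tree even when tables are nested several levels deep. A fuller treatment would replace the simple sort with the preference lists and merging rules named in the slides.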