AIA 8803 Project Proposal

Deep Web Crawling and Mining for Building Advanced Search Applications

Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu
{hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu
College of Computing, Georgia Tech

1. Introduction

The fast-growing World Wide Web contains a large amount of semi-structured HTML information about real-world objects. Many kinds of real-world objects are embedded in dynamic Web pages generated by online backend databases. Such dynamic content is called the hidden Web or deep Web: World Wide Web content that is not part of the surface Web indexed by search engines (Bergman, 2001). This presents a great opportunity for the database research community to extract related deep Web information about an object and integrate it into a single information unit. Typical Web objects include products, people, and conferences/papers. Deep Web objects of the same type generally follow a similar structure or schema. Accordingly, once these objects are extracted and integrated, a large warehouse can be constructed to support further knowledge discovery over the structured data. Furthermore, we believe that constructing such a large-scale database through deep Web mining will allow us to build advanced search applications that help improve next-generation Web search.

However, little work to date has explored algorithms or methods for crawling and mining deep Web sources. This leads us to the idea of building a Web database to store the data records that can be crawled and extracted from the deep Web. With this database, we can provide comparison search to users. In this work, we aim to implement a complete solution for building advanced search applications based on extracted object-level Web data. We describe its essential components: a deep Web crawler, deep Web object mining and database construction, and advanced search applications built on the deep Web data.

2. Proposed Approach

To illustrate the system design, we first present examples of what a hidden Web page comprises. Figure 1 shows two hidden Web pages that each contain a Barbie product object. For each product, we can label a set of properties or attributes (e.g., product name, price, picture, description) that identify the object corresponding to a real-world Barbie product. Objects of the same type generally follow a similar structure or data schema, as the two Barbie products in Figure 1 show. Since a large amount of such schematic data exists on the Web, a scalable system can be constructed to extract, store, and apply it. With the constructed database, we can develop advanced Web search applications. For example, we can build a search engine that combines different features or attributes of an identical Web object from different websites to answer a query.

[Figure 1. Web page elements labeled into database tables: each product page maps properties (product name, price, shipping, picture, description) to their values.]

[Figure 2. System architecture: a deep Web crawler over Web sources 1..n, deep Web mining and data extraction producing object records, data fusion and quality measurement, a large-scale data warehouse, and an advanced Web search application on top.]

We believe these technologies will benefit the development and growth of next-generation Web search engines. Figure 2 depicts the system architecture, which consists of several essential components: a Web crawler, Web object extraction, attribute labeling, and data warehouse construction.

[Figure 3. (a) The three-layer architecture of a Web application: frontend interface, business logic, and a backend database holding attribute-value records. (b) An example of the presentation layer.]

2.1. Hidden Web crawler

Before describing an algorithm to extract Web object data sources, we introduce some background on Web crawlers. A Web crawler is a program or automated script that browses the World Wide Web in a methodical, automated manner (Kobayashi and Takeda, 2000). Many sites, in particular search engines, use crawling to keep their data up to date. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also automate maintenance tasks on a website, such as checking links or validating HTML code. In general, a crawler starts with a list of URLs to visit, called the seeds; as it visits these URLs, it identifies all hyperlinks in each page and adds them to the list of URLs to visit.

Web objects (e.g., products, people, communities) are usually generated dynamically by the backend databases of online Web applications, which typically follow the three-layer architecture shown in Figure 3(a). Such dynamic content is called the hidden Web or deep Web (Bergman, 2001), referring to content that is not part of the surface Web indexed by search engines. The deep Web is estimated to be several orders of magnitude larger than the surface Web (Bergman, 2001). To discover content on the Web, search engines use crawlers that follow hyperlinks. This technique is ideal for discovering resources on the surface Web but is often ineffective at finding hidden Web resources: such crawlers do not attempt to enumerate the dynamic pages produced by database queries, because the number of possible queries is unbounded. It has been noted that providing links to query results can overcome this, but doing so could unintentionally inflate the popularity (e.g., PageRank) of a deep Web site. In the following, we propose a plausible seed-based crawler for hidden Web content.

Iterative seed query for crawling. Unlike traditional Web crawling, where a hyperlink graph is traversed to reach every hyperlinked page, crawling dynamic Web objects is difficult because hidden Web content lacks a hyperlink graph to traverse. The crawler must instead interact with the hidden database through its HTML frontend to extract as much of the stored content as possible. We tentatively propose the following seeding solution, illustrated in Figure 3(b); a minimal sketch follows the list.

1. Initiate a set of seed queries and submit each to an HTML frontend (e.g., submit the keyword Car to Yahoo shopping, as shown in Figure 3(b)).
2. Extract new seeds from the returned results (e.g., Lincoln, Deluxe, Universal, TracRac, Truck, and SUV in Figure 3(b)).
3. Submit the new seeds to the HTML frontend to generate more results, and identify the results that have never appeared before (this requires an encoding scheme that assigns a unique identifier key to each search result item).
4. Return to Steps 2 and 3, repeating crawling and extraction on the newly identified results until an iteration or time limit is reached.
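To make the loop concrete, here is a minimal Python sketch of the iterative seed-query crawler, under stated assumptions: the frontend URL, the result parsing, and the seed-extraction rule are hypothetical placeholders that a real deployment would implement per site. Result items are deduplicated by hashing their normalized text, standing in for the unique-identifier encoding of Step 3.

```python
import hashlib
import time
import urllib.parse
import urllib.request

# Hypothetical frontend; a real crawler targets a site-specific search form.
SEARCH_URL = "http://example.com/search?q="

def fetch_results(query):
    """Submit one seed query to the HTML frontend and return the raw page."""
    url = SEARCH_URL + urllib.parse.quote(query)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_items(html):
    """Placeholder: split the result page into result-item strings.
    A real implementation would use an HTML parser and site templates."""
    return [chunk.strip() for chunk in html.split("\n") if chunk.strip()]

def extract_seeds(item):
    """Placeholder seed mining: reuse capitalized tokens as new queries."""
    return [tok for tok in item.split() if tok.istitle()]

def item_key(item):
    """Assign a unique identifier to a result item (Step 3) by hashing it."""
    return hashlib.sha1(item.lower().encode("utf-8")).hexdigest()

def crawl(seed_queries, max_queries=100):
    """Iterative seed-query crawl: query, harvest new items, mine new seeds."""
    pending, issued = list(seed_queries), set()
    seen_items, records = set(), []
    while pending and len(issued) < max_queries:
        query = pending.pop(0)
        if query in issued:
            continue
        issued.add(query)
        for item in parse_items(fetch_results(query)):
            key = item_key(item)
            if key in seen_items:                # already-crawled record
                continue
            seen_items.add(key)
            records.append(item)
            pending.extend(extract_seeds(item))  # Step 2: new seeds
        time.sleep(1)                            # be polite to the frontend
    return records
```

Seed extraction is the part that most determines coverage; richer strategies would mine attribute values from extracted records rather than capitalized tokens.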
2.2. Deep Web mining

After a large number of Web object data sources have been crawled, we set out to extract the Web objects from these source documents. An object can be extracted at two levels, as follows.

2.2.1. Web record identification and labeling

A Web object with a set of attributes is usually generated dynamically by a backend database, as shown in Figure 4. The data objects in the same page are related: they share a common template, and elements at the same position in different records have similar features and semantics. Based on detecting such common patterns shared by objects in a Web page, several data mining studies have extended existing information extraction techniques to automatically extract object information from Web pages (Liu et al., 2003). Using the vision-based page segmentation (VIPS) algorithm (Cai et al., 2003), which exploits page layout features such as font, color, and size to construct a vision tree for a Web page, we can obtain a better representation of a page than the commonly used tag tree. Given the intrinsic schematic characteristics shown in Figure 4, we believe it is practical to extract each attribute and its value for each object element. Data mining and machine learning techniques can then be applied to label the elements of an object in a Web page effectively and automatically. Conditional Random Fields (CRFs) (Lafferty et al., 2001) are among the most effective approaches, exploiting sequence characteristics to automatically label the properties of a Web object. As a conditional model, a CRF can efficiently incorporate any useful feature for Web data extraction, including long-distance dependencies. A toy labeling sketch follows Figure 4.

[Figure 4. An example of Web object extraction and attribute labeling.]
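To show the shape of this labeling step, the sketch below treats the leaf blocks of one record's vision tree as a sequence and labels each block with an attribute name. It is a minimal sketch under assumptions: it uses the third-party sklearn-crfsuite package, and the block fields (text, font_size, bold), the feature set, and the tiny training set are hypothetical stand-ins for VIPS-derived features and manually labeled records.

```python
import re
import sklearn_crfsuite  # third-party: pip install sklearn-crfsuite

def block_features(blocks, i):
    """Feature dict for the i-th visual block of a record. The layout fields
    are assumed to come from a VIPS-style vision tree."""
    b = blocks[i]
    feats = {
        "has_currency": bool(re.search(r"[$]\s*\d", b["text"])),
        "font_size": str(b["font_size"]),
        "bold": b["bold"],
        "position": i,
        "num_tokens": len(b["text"].split()),
    }
    if i > 0:  # sequence context: carry a feature of the previous block
        feats["prev_font_size"] = str(blocks[i - 1]["font_size"])
    return feats

def record_to_features(blocks):
    return [block_features(blocks, i) for i in range(len(blocks))]

# Toy training data: two records, each a block sequence with gold labels.
train_records = [
    [{"text": "Barbie Dream House", "font_size": 14, "bold": True},
     {"text": "$129.99", "font_size": 12, "bold": False},
     {"text": "Three-story dollhouse with elevator", "font_size": 10, "bold": False}],
    [{"text": "Barbie Convertible", "font_size": 14, "bold": True},
     {"text": "$24.50", "font_size": 12, "bold": False},
     {"text": "Pink two-seat toy car", "font_size": 10, "bold": False}],
]
train_labels = [["NAME", "PRICE", "DESC"], ["NAME", "PRICE", "DESC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([record_to_features(r) for r in train_records], train_labels)

test = [{"text": "Barbie Camper Van", "font_size": 14, "bold": True},
        {"text": "$59.00", "font_size": 12, "bold": False},
        {"text": "Camper with fold-out pool", "font_size": 10, "bold": False}]
print(crf.predict([record_to_features(test)])[0])  # e.g. ['NAME', 'PRICE', 'DESC']
```

In practice the feature set would include colors and positions from the vision tree and HTML tag context, and training would use many labeled records per vertical.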
2.2.2. Data warehouse construction

Once these Web objects are extracted and integrated, we can build a large database to store the object-level Web data for further data management. In our plan, data warehouse technology will be applied to store the large-scale Web object data (Elmasri and Navathe, 2003). As Figure 1 demonstrates, multiple copies of information about an identical object usually exist across different sites, and these copies may be inconsistent with one another because Web sites differ in focus and quality. Each extracted instance of a Web object must therefore be mapped to a real-world object and stored in the Web data warehouse. To do so, we need techniques to integrate information about the same object and to disambiguate different objects. Data fusion will be used to combine data from multiple sources, yielding inferences that are more accurate than those achievable from any single source; the sketch below illustrates the idea.
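As a minimal illustration of the fusion step, the function below merges several extracted instances of the same product into one warehouse record by majority vote per attribute, keeping an agreement ratio as a crude quality signal. The record layout and attribute names are illustrative assumptions, not a fixed schema, and real fusion would also weight sources by quality.

```python
from collections import Counter, defaultdict

def fuse_records(instances):
    """Merge extracted instances of one real-world object into a single record.
    Each instance is a dict of attribute -> value from one source site."""
    values = defaultdict(list)
    for inst in instances:
        for attr, val in inst.items():
            if val is not None:
                values[attr].append(val)
    fused = {}
    for attr, vals in values.items():
        winner, votes = Counter(vals).most_common(1)[0]  # majority vote
        fused[attr] = winner
        fused[attr + "_support"] = votes / len(vals)     # agreement ratio
    fused["_num_sources"] = len(instances)
    return fused

# Three sites describe the same Barbie product with minor inconsistencies.
copies = [
    {"name": "Barbie Dream House", "price": "$129.99", "shipping": "free"},
    {"name": "Barbie Dream House", "price": "$129.99", "shipping": "$5.99"},
    {"name": "Barbie Dreamhouse",  "price": "$129.99", "shipping": "free"},
]
print(fuse_records(copies))  # majority values win; support ratios flag conflicts
```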
Furthermore, we believe the page-quality ranking metrics adopted in Web search, such as PageRank (Page et al., 1998) and HITS (Kleinberg, 1998), provide important hints for evaluating the importance of an extracted object through its hosting Web page.

3. Building an Advanced Web Search Application

In current search engines, the primary function is essentially relevance ranking at the document level, as in PageRank (Page et al., 1998). Through Web object extraction and data mining, we can extract information from different Web sites and pages to build structured databases of Web objects. With the constructed database, we aim to develop paradigms that improve on state-of-the-art Web search. For example, we can build an object-level search engine (Nie et al., 2005) that combines fine-grained features or attributes of an identical Web object from different Web sites to answer a user query. We can also construct a comparison search engine that compares attributes (e.g., price, performance) of Web objects across different sites or sources. In the following, we introduce such an advanced comparison Web search.

Comparing objects is a common search activity people conduct on the Web (Sun et al., 2006). Given the rich attribute metadata extracted for each Web object, it is possible to build vertical comparison search applications based on the characteristics of Web objects in different categories. Many comparison search applications already exist on the Internet. For example, NexTag, Shopzilla, and Froogle are popular sites that let people compare price, performance, and other attributes among different products sold on the Web, or for an identical product across different online shops. On these sites, most people use shopping search and comparison tools to research products that they end up buying from their local merchants. In such comparison search applications, the most challenging issue is the precise extraction of Web object attributes from different sources or sites, which our solution addresses by storing the extracted Web object data in a data warehouse; the sketch below indicates how a comparison query would sit on that warehouse.
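The following is a minimal sketch of a comparison query over the warehouse, using an in-memory SQLite table as a stand-in; the table layout, object identifiers, and prices are illustrative assumptions. A comparison search gathers the fused records for one object and lists each source's offer.

```python
import sqlite3

# Stand-in for the object warehouse: one row per (object, source) record.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product
                (object_id TEXT, source TEXT, price REAL, shipping TEXT)""")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [("barbie-dream-house", "shopA", 129.99, "free"),
     ("barbie-dream-house", "shopB", 124.50, "$5.99"),
     ("barbie-dream-house", "shopC", 135.00, "free")])

def compare_prices(object_id):
    """Comparison search: every source's offer for one object, cheapest first."""
    rows = conn.execute(
        "SELECT source, price, shipping FROM product "
        "WHERE object_id = ? ORDER BY price", (object_id,))
    return rows.fetchall()

for source, price, shipping in compare_prices("barbie-dream-house"):
    print(f"{source}: ${price:.2f} (shipping: {shipping})")
```

Ranking by fused quality scores rather than raw price would follow the same pattern.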
Although we plan to build a commercial product search, it is natural and interesting to extend the approach to other Web objects, e.g., social network search. There are many social network websites, such as Facebook and MySpace. By crawling the friend list on each person's page on those sites, we can reach as many members as possible and access all the public information in their profiles. By analyzing the extracted characteristics and metadata, we can learn interesting knowledge and patterns about social networks, e.g., how friends cluster by spatial location, age, career, race, gender, or hobby. We can also investigate how people are related to each other and how many degrees of separation lie between any two people. In addition, we could perform friend search based on hobbies, majors, and favorites, as well as friend comparison based on those characteristics.

4. Conclusions

A great deal of structured hidden Web information about real-world objects is generated by online Web databases, yet this large amount of valuable, seemingly unstructured data has not been fully used to improve search performance. We described a practical and complete solution that includes a Web crawler, Web object extraction, and data warehouse construction. We also presented several advanced Web search applications that exploit the constructed data warehouse. We believe this area will attract more and more attention from the research community in the future.

5. References

[1] Bergman, M. K. (2001). The deep Web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1).
[2] Cai, D., Yu, S., Wen, J. R., and Ma, W. Y. (2003). VIPS: A vision-based page segmentation algorithm. Microsoft Technical Report MSR-TR-2003-79.
[3] Elmasri, R. and Navathe, S. (2003). Fundamentals of Database Systems. Addison-Wesley.
[4] Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms.
[5] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. ICML 2001.
[6] Liu, B., Grossman, R., and Zhai, Y. (2003). Mining data records in Web pages. Proc. KDD 2003.
[7] Nie, Z., Zhang, Y., Wen, J. R., and Ma, W. Y. (2005). Object-level ranking: Bringing order to Web objects. Proc. WWW 2005.
[8] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical Report, Stanford University.
[9] Sun, J., Wang, X., Shen, D., Zeng, H. J., and Chen, Z. (2006). CWS: A comparative Web search system. Proc. WWW 2006.