Deep Web Crawling and Mining for Building Advanced Search Application

AIA 8803 Project Proposal
Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu
{hua, houdan, yuliu, xinsun, yyu}
College of Computing, Georgia Tech
1. Introduction
The fast-growing World Wide Web contains a large amount of semi-structured HTML information
about real-world objects. Many kinds of real-world objects are embedded in dynamic Web pages
generated by online backend databases. Such dynamic content is called the hidden Web or deep
Web: World Wide Web content that is not part of the surface Web indexed by search engines
(Bergman, 2001). This presents a great opportunity for the database research community to extract
the related deep Web information about each object and integrate it into a single information unit.
Typical Web objects include products, people, and conferences/papers. Generally, deep Web
objects of the same type follow a similar structure or schema. Accordingly, once these deep Web
objects are extracted and integrated, a large warehouse can be constructed to perform further
knowledge discovery tasks on the structured data. Furthermore, we believe the construction of
such a large-scale database based on deep Web mining allows us to build advanced search
applications that help improve next-generation Web search performance.
However, to date there has been little work dedicated to algorithms or methods for crawling and
mining deep Web sources. This leads us to the idea of building a Web database to store the data
records that can be crawled and extracted from the deep Web. With this database, we can provide
comparison search to users. In our work, we aim to implement a complete solution for building
advanced search applications on top of the extracted object-level Web data. We describe several
essential components: a deep Web crawler, deep Web object mining and database construction,
and an advanced search application based on the deep Web data.
2. Proposed Approach
To demonstrate the system design, we first present several examples that illustrate what a hidden
Web page comprises. Figure 1 shows two hidden Web pages that each contain a Barbie product
object. For each product, we can label a set of properties or attributes (e.g. product name, price,
picture, description) that identify the object corresponding to a real-world Barbie product.
Generally, objects of the same type follow a similar structure or data schema, as the two Barbie
products in Figure 1 show. Since there is a large amount of such schematic data on the Web, a
scalable system can be constructed to extract, store and apply these data. With the constructed
database, we can develop advanced web search applications. For example, we can build a search
engine that combines different features or attributes of an identical Web object from different
websites to respond to a query.
Figure 1. Web page elements labeling into database tables.
[Figure 2 depicts a layered pipeline: Web sources 1..n feed the deep web crawler; deep Web
mining and data extraction produce object records; data fusion and quality measurement load the
records into a large-scale data warehouse, which powers the advanced web search application.]
Figure 2. System architecture.
We believe these technologies will benefit the development and growth of next-generation web
search engines. Figure 2 depicts the system architecture, which consists of several essential
components: the web crawler, web object extraction, attribute labeling, and data warehouse
construction.
Figure 3. (a) three-layer architecture; (b) an example of presentation layer.
2.1. Hidden web crawler
Before describing an algorithm to extract Web object data sources, we first introduce some
background on Web crawlers. A web crawler is a program or automated script that browses the
World Wide Web in a methodical, automated manner (Kobayashi and Takeda, 2000). Many sites,
in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers
are mainly used to create a copy of all visited pages for later processing by a search engine, which
indexes the downloaded pages to provide fast searches. Crawlers can also be used for automating
maintenance tasks on a website, such as checking links or validating HTML code. In general, a
crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it
identifies all the hyperlinks in each page and adds them to the list of URLs to visit.
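The seed-and-frontier process described above can be sketched as follows. This is a minimal
illustration, not a production crawler: `fetch_links` is a hypothetical stand-in for downloading a
page and extracting its hyperlinks, and the toy link graph replaces the real Web.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit each seed URL, then every newly
    discovered hyperlink, skipping pages that were already visited."""
    frontier = deque(seeds)   # the list of URLs still to visit
    visited = set()
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch_links(url)   # download the page, return its hyperlinks
        pages.append(url)
        for link in links:
            if link not in visited:
                frontier.append(link)
    return pages

# A toy link graph standing in for the Web.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
print(crawl(["a"], lambda u: graph.get(u, [])))  # -> ['a', 'b', 'c', 'd']
```

The `visited` set is what keeps the crawl from looping on cycles such as `d -> a` above.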
Web objects (e.g. products, people, communities) are usually dynamically generated by the
backend databases of online Web applications, which typically follow a three-layer architecture as
shown in Figure 3(a). Such dynamic content is called the hidden Web or deep Web (Bergman,
2001): World Wide Web content that is not part of the surface Web indexed by search engines. It
is estimated that the deep Web is several orders of magnitude larger than the surface Web
(Bergman, 2001). To discover content on the Web, search engines use web crawlers that follow
hyperlinks. This technique works well for discovering resources on the surface Web but is often
ineffective at finding hidden Web resources. For example, these crawlers do not attempt to
retrieve dynamic pages that are the result of database queries, because the number of possible
queries is effectively infinite. It has been noted that this can be overcome by providing links to
query results, but doing so could unintentionally inflate the popularity (e.g., PageRank) of a deep
Web site. In the following, we propose a seed-based web crawler for crawling hidden Web
content.
Iterative seed query for crawling. Unlike traditional web crawling, where a hyperlink graph is
traversed to reach every hyperlinked web page, crawling dynamic Web objects is difficult because
hidden Web content lacks a hyperlink graph to traverse. In this case, the crawler must instead
interact with the hidden database through its HTML frontend in order to extract as much of its
stored content as possible. We tentatively propose a seeding solution to this problem, as shown in
Figure 3(b):
1. Initiate a set of seed queries, and submit each to an HTML frontend (e.g. we submit the
keyword Car to Yahoo shopping, as shown in Figure 3(b)).
2. Extract new seeds from the returned results (e.g. Lincoln, Deluxe, Universal, TracRac, Truck
and SUV, shown in Figure 3(b)).
3. Submit the new seeds to the HTML frontend to generate more results, and identify the results
that have never appeared before (this requires an encoding algorithm that assigns a unique
identifier key to each search result item).
4. Repeat Steps 2 and 3, crawling and extracting from the newly identified search results, until an
iteration limit or time limit is reached.
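The steps above can be sketched as follows. The `search` frontend, the `catalog` it queries, and
the word-splitting seed extraction are all hypothetical simplifications; the record-hashing function
stands in for the unique-identifier encoding mentioned in Step 3.

```python
import hashlib

def record_key(record):
    """Stand-in for the encoding step: derive a stable identifier
    from a result record's fields, so repeats can be detected."""
    return hashlib.sha1(repr(sorted(record.items())).encode()).hexdigest()

def extract_seeds(record):
    """Naive new-seed extraction: the individual words of the title."""
    return record["title"].split()

def seed_crawl(initial_seeds, search, max_rounds=3):
    """Iterative seed-query crawl of a hidden database behind `search`."""
    seen = {}                        # identifier key -> crawled record
    seeds = list(initial_seeds)
    for _ in range(max_rounds):      # Step 4: iteration limit
        new_seeds = []
        for kw in seeds:
            for rec in search(kw):               # Steps 1/3: query the frontend
                k = record_key(rec)
                if k not in seen:                # Step 3: keep only unseen results
                    seen[k] = rec
                    new_seeds.extend(extract_seeds(rec))   # Step 2
        if not new_seeds:
            break
        seeds = new_seeds
    return list(seen.values())

# A toy hidden database: only keyword queries expose its records.
catalog = [{"title": "Lincoln Car"}, {"title": "Lincoln Truck"},
           {"title": "Deluxe Truck"}]
def search(kw):
    return [r for r in catalog if kw.lower() in r["title"].lower()]

print(seed_crawl(["Car"], search))
```

Starting from the single seed Car, the crawl reaches all three records: Car returns Lincoln Car,
whose words seed a query for Lincoln, which exposes Lincoln Truck, which in turn seeds Truck
and exposes Deluxe Truck.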
2.2. Deep web mining
After a large number of Web object data sources have been crawled, we set out to extract the Web
objects from these source documents. An object can be extracted at two different levels, as
follows.
2.2.1. Web record identification and labeling. A Web object with a set of attributes is usually
dynamically generated by backend databases, as shown in Figure 4. The data objects in the same
page are related: they share a common template, and the elements at the same position in different
records have similar features and semantics. Based on the detection of such common patterns
shared by objects in a Web page, several data mining studies have extended existing information
extraction techniques to automatically extract object information from Web pages. By using the
vision-based page segmentation (VIPS) technology (Cai et al., 2003), which makes use of page
layout features such as font, color, and size to construct a vision-tree for a Web page, we can
obtain a better representation of a page than the commonly used tag-tree. Given the intrinsic
schematic characteristics (as shown in Figure 4), we believe it is practical to extract each attribute
and its value for each object element. We can apply data mining and machine learning techniques
to effectively and automatically label an object in a Web page. Conditional Random Fields (CRFs)
(Lafferty et al., 2001) are among the most effective approaches, taking advantage of sequence
characteristics to automatically label the properties of a Web object. As a conditional model, a
CRF can efficiently incorporate any useful feature for Web data extraction, including
long-distance dependencies.
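The sequence intuition behind this labeling step can be illustrated with a minimal Viterbi decoder
over hand-set scores. This is a simplified stand-in for CRF inference, not the CRF training
procedure of Lafferty et al. (2001); the labels, record elements, and scoring rules below are
hypothetical examples.

```python
def viterbi(elements, labels, score, trans):
    """Find the highest-scoring label sequence for a record's elements.
    score(elem, label) gives an emission score; trans[(a, b)] gives the
    score of label b following label a (missing pairs score 0)."""
    # best[i][lab] = (best total score ending in lab, backpointer label)
    best = [{lab: (score(elements[0], lab), None) for lab in labels}]
    for elem in elements[1:]:
        prev, cur = best[-1], {}
        for lab in labels:
            p, s = max(((pl, prev[pl][0] + trans.get((pl, lab), 0.0))
                        for pl in labels), key=lambda t: t[1])
            cur[lab] = (s + score(elem, lab), p)
        best.append(cur)
    # Backtrack from the best final label.
    lab = max(best[-1], key=lambda l: best[-1][l][0])
    path = [lab]
    for layer in reversed(best[1:]):
        lab = layer[lab][1]
        path.append(lab)
    return list(reversed(path))

labels = ["name", "price", "desc"]
def score(elem, lab):
    """Toy emission features: dollar signs suggest a price,
    title-cased text suggests a product name."""
    if lab == "price":
        return 2.0 if elem.startswith("$") else -1.0
    if lab == "name":
        return 1.0 if elem.istitle() else 0.0
    return 0.5
trans = {("name", "price"): 1.0, ("price", "desc"): 1.0}

print(viterbi(["Barbie Doll", "$19.99", "A classic toy"], labels, score, trans))
# -> ['name', 'price', 'desc']
```

The transition scores reward the name-price-description ordering that records on a product page
tend to follow, which is exactly the long-distance sequence information a CRF would learn from
training data rather than have hand-set.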
Figure 4. An example of Web object extraction and attribute labeling.
2.2.2. Data warehouse construction
Once these Web objects are extracted and integrated, we can build a large database to store the
object-level Web data for further data management. In our plan, data warehouse technology will
be applied to store large-scale Web object data (Elmasri and Navathe, 2003). As demonstrated by
Figure 1, multiple copies of information about an identical object usually exist across different
sites, and such copies may be inconsistent owing to differences in Web site focus and quality.
Each extracted instance of a Web object needs to be mapped to a real-world object and stored in
the Web data warehouse. To do so, we need techniques to integrate information about the same
object and to disambiguate different objects. Data fusion technology will be used to combine data
from multiple sources and support inferences that are more reliable than those drawn from a single
source. Furthermore, we believe that the web page quality ranking metrics adopted in Web search,
such as PageRank (Page et al., 1998) or HITS (Kleinberg, 1998), provide important hints for
evaluating the importance of an extracted object through its hosting web page.
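A minimal sketch of this fusion step follows, assuming records have already been mapped to a
shared `object_id`. The record layout and the per-source quality scores are hypothetical; in the
full system the quality score could come from a PageRank-style metric on the hosting page, and
disambiguation of `object_id` is a separate, harder problem.

```python
from collections import defaultdict

def fuse(records, source_quality):
    """Merge extracted records that refer to the same real-world object.
    Each record: {"object_id": ..., "source": ..., "attrs": {...}}.
    For conflicting attribute values, keep the value whose source has
    the highest quality score."""
    fused = defaultdict(dict)   # object_id -> attr -> (quality, value)
    for rec in records:
        q = source_quality[rec["source"]]
        for attr, val in rec["attrs"].items():
            cur = fused[rec["object_id"]].get(attr)
            if cur is None or q > cur[0]:
                fused[rec["object_id"]][attr] = (q, val)
    # Strip the quality bookkeeping before returning.
    return {oid: {a: v for a, (_, v) in attrs.items()}
            for oid, attrs in fused.items()}

quality = {"siteA": 0.9, "siteB": 0.4}
records = [
    {"object_id": "barbie-1", "source": "siteA",
     "attrs": {"name": "Barbie Deluxe", "price": "$19.99"}},
    {"object_id": "barbie-1", "source": "siteB",
     "attrs": {"price": "$21.00", "color": "pink"}},
]
print(fuse(records, quality))
```

Note how the fused record takes the price from the higher-quality siteA while still gaining the
color attribute that only siteB supplies: fusion both resolves conflicts and completes the object.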
3. Building Advanced Web Search Application
In current search engines, the primary function is essentially relevance ranking at the document
level, as in PageRank (Page et al., 1998). Through Web object extraction and data mining
technologies, we can extract information from different Web sites and pages to build structured
databases of Web objects. With the constructed database, we aim to develop advanced paradigms
that improve on state-of-the-art Web search. For example, we can build an object-level search
engine (Nie et al., 2005) that combines fine-grained features or attributes of an identical Web
object from different Web sites to respond to a user query. We can also construct a comparison
web search engine that compares attributes (e.g. price, performance) of Web objects across
different sites or sources. In the following, we introduce an advanced comparison Web search.
Making comparisons between objects is a common search activity people conduct on the Web
(Sun et al., 2006). Given the rich attribute metadata extracted for each Web object, it is possible
to build vertical comparison search applications based on the characteristics of Web objects in
different categories. There are many comparison Web search applications on the Internet. For
example, NexTag, Shopzilla, and Froogle are popular web sites that help people compare price,
performance or other attributes across different products sold on the Web, or for an identical
product across different online shops or web sites. On these sites, most people use shopping
search and comparison tools to research products that they end up buying from their local
merchants. In such comparison search applications, the most challenging issue is the precise
extraction of Web object attributes from different sources or sites, which our solution addresses
by using a data warehouse to store the extracted Web object data.
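Once per-source attribute values sit in the warehouse, a comparison query reduces to a simple
lookup and sort. The sketch below assumes a flat list of per-source records; the warehouse
contents and field names are hypothetical examples.

```python
def compare(warehouse, object_id, attr):
    """Comparison search over per-source records of one object:
    list each source's value of `attr`, smallest (e.g. cheapest) first."""
    rows = [(rec["source"], rec["attrs"][attr])
            for rec in warehouse
            if rec["object_id"] == object_id and attr in rec["attrs"]]
    return sorted(rows, key=lambda r: r[1])

# Hypothetical warehouse contents: one record per (object, source) pair.
warehouse = [
    {"object_id": "barbie-1", "source": "shopA", "attrs": {"price": 21.00}},
    {"object_id": "barbie-1", "source": "shopB", "attrs": {"price": 19.99}},
    {"object_id": "truck-2",  "source": "shopA", "attrs": {"price": 35.00}},
]
print(compare(warehouse, "barbie-1", "price"))
# -> [('shopB', 19.99), ('shopA', 21.0)]
```

The hard work is upstream: the comparison is only as good as the extraction and object
disambiguation that produced consistent `object_id` and attribute values.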
Although we plan to build commercial product search first, the approach extends naturally to
other Web objects, e.g. social network search. There are many social network websites, such as
Facebook and MySpace. By crawling the friend list on each person's page on those sites, we can
reach many members and access all the public information in their profiles. By analyzing the
extracted characteristics and metadata, we can learn interesting knowledge and patterns about
social networks, e.g. how friends cluster in terms of spatial location, age, career, race, gender,
hobby, etc. We can also investigate how people are related to each other, and how many layers lie
between any two people. In addition, we could perform friend search based on hobbies, majors,
favorites, etc., as well as friend comparison based on those characteristics.
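For instance, the "how many layers between any two people" question is a shortest-path query
over the crawled friend graph. A minimal sketch, assuming the crawl has produced an adjacency
list of (hypothetical) member names:

```python
from collections import deque

def degrees_between(friends, a, b):
    """Number of friendship hops between two people, via breadth-first
    search over the crawled friend graph; None if not connected."""
    frontier, dist = deque([a]), {a: 0}
    while frontier:
        p = frontier.popleft()
        if p == b:
            return dist[p]
        for q in friends.get(p, []):
            if q not in dist:
                dist[q] = dist[p] + 1
                frontier.append(q)
    return None

friends = {"ann": ["bob"], "bob": ["ann", "cat"],
           "cat": ["bob", "dan"], "dan": ["cat"]}
print(degrees_between(friends, "ann", "dan"))  # -> 3
```

The same traversal that powers the crawl of friend lists thus directly answers one of the analysis
questions above.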
4. Conclusions
There is a large amount of structured hidden Web information about real-world objects generated
by online Web databases. However, this valuable data, which appears unstructured, has not been
fully used to improve search performance. We described a practical and complete solution that
includes a web crawler, web object extraction and data warehouse construction. We also
presented several advanced Web search applications based on the constructed data warehouse.
We believe this area will attract more and more attention from the research community in the
future.
5. References
[1] Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. The Journal of Electronic
Publishing 7(1).
[2] Cai, D., Yu, S., Wen, J.R. and Ma, W.Y. (2003). VIPS: A Vision-based Page Segmentation
Algorithm. Microsoft Technical Report MSR-TR-2003-79.
[3] Elmasri, R. and Navathe, S. (2003). Fundamentals of Database Systems/Oracle 9i
Programming. Addison-Wesley.
[4] Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proc. 9th Annual
ACM-SIAM Symposium on Discrete Algorithms.
[5] Lafferty, J., McCallum, A. and Pereira, F. (2001). Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. Proc. ICML 2001.
[6] Liu, B., Grossman, R. and Zhai, Y. (2003). Mining data records in web pages. Proc. KDD
2003.
[7] Nie, Z., Zhang, Y., Wen, J.R. and Ma, W.Y. (2005). Object-level ranking: Bringing order to
Web objects. Proc. WWW 2005.
[8] Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). The PageRank Citation Ranking:
Bringing Order to the Web. Technical Report, Stanford University.
[9] Sun, J., Wang, X., Shen, D., Zeng, H.J. and Chen, Z. (2006). CWS: A comparative web
search system. Proc. WWW 2006.