automatic extraction of data from deep web page

advertisement
AUTOMATIC EXTRACTION OF DATA FROM DEEP
WEB PAGE
Nripendra Narayan Das
Ela Kumar
Associate Professor
RIET, Faridabad
nripendraddas@gmail.com,
Dean and Associate Professor
Gautam Buddha University, Greater Noida
ela_kumar@gbu.ac.in
ABSTRACT
There is large volume of information available to be
mined from the World Wide Web. The information on the
Web is contained in the form of structured and
unstructured objects, which is known as data records.
Such data records are important because essential
information are available in these host pages, e.g., lists of
products and there detail information. It is necessary to
extract such data records to provide relevance information
to user as per their requirements. Some of the approaches
used to solve this problem are manual approach,
supervised learning, and automatic techniques. The
manual method is not suitable for large number of pages.
It is a challenging work to retrieve appropriate and useful
information from Web pages. Currently, many web
retrieval systems called web wrappers, web crawler have
been designed . In this paper, some existing techniques are
examined, then our current work on web information
extraction is presented. It is the fact that most of the
search engines are not in a position to extract the data
from deep web. In this paper an algorithm has been
designed and practically implemented which will not only
extract the data in efficient manner but will also display as
a single page output. Experimental evaluation on a large
number of real input web url address collections indicates
that our algorithm correctly extracts data in most cases
Keywords : World wide web, wrapper, web crawler
1. INTRODUCTION
Before giving introduction of our paper we must know
what is Deep Web. The Deep Web (or Invisible web) is
the set of information resources on the World Wide Web
not reported by normal search engines.
The World Wide Web contains different types of relevant
information in the form of free text. The unpredictable
growth and popularity of the world-wide web and
availability of huge amount of information on the Internet
makes extraction of data from web pages difficult.
However, due to the heterogeneous and unstructured web
page design, it is very difficult to retrieve information
sources. Some of the Web mining applications, such as
comparison shopping robots, require expensive
maintenance to deal with different data formats. For
translation of input pages into structured data
automatically, many algorithms have been designed in the
area of information extraction (IE).
The information source can be classified into three main
types free text, structured text and semi-structured text.
Originally, the extraction system focuses on free text
extraction. Natural Language Processing (NLP)
techniques are developed to extract this type of
unrestricted, unregulated information, which employs the
syntactic and semantic characteristics of the language to
generate the extraction rules. The structured information
usually comes from databases, which provide rigid or well
defined formats of information, therefore, it is easy to
extract through some query language such as Structured
Query Language (SQL). The other type is the
semistructured information, which falls between free text
and structured information. A good example of semistructured information are Web pages.
According to the statistical results by Miniwatts Marking
Group [19], the growth of web users during this decade is
over 200% and there are more than 2 billion Internet users
from over 278 countries and world regions. At the same
time, public information and virtual places are increasing
accordingly, which almost covers any kind of information
needs. Thus this attracts much attention on how to extract
the useful information from the Web. Currently, the
targeted web documents can easily be obtained by
inputting some keywords with a web search engine. But
the drawback is that the system may not necessarily
provide relevant data rich pages and it is not easy for the
computer to automatically extract or fully understand the
information contained. The reason is due to the fact that
web pages are designed for human browsing, rather than
machine interpretation. Most of the pages are in Hypertext
Markup Language (HTML) format, which is a semistructured language, and the data are not given in a
particular format and change frequently [1].
There are several challenges in extracting information
from a semi-structured web page e.g.
(i) Lack of a schema.
(ii) Ill formatting.
(iii) High update frequency and semantic heterogeneity of
the information.
In order to overcome these challenges, our system design
transforms the page into a format called Extensible
Hypertext Mark-up Language (XHTML) [20]. Then, we
make use of the DOM tree hierarchy of a web page and
regular expressions are extracted out using the Extensible
Style sheet Language (XSL) [21, 22] technique, with a
human training process. The relevant information is
extracted and transformed into another structured
format— Extensible Mark-up Language (XML) [23].
The remainder of the paper is organized as follows: some
related works are illustrated in section 2, which involve a
brief overview of the current web information extraction
systems; then detail techniques in our approach are
addressed in section 3; experimental results are explained
in section 4; finally, the conclusion and future work are
mentioned in the last section.
In this paper, we will focus mainly on extracting text
information from web pages. After browsing more than
100 web pages, it was observed that almost all of the web
pages are displayed in particular format using a particular
language i.e. HTML. It was also observed that to display
the output page in better format, Table and list HTML
tags are used by most of the programmer.
The algorithm below describes the approach :
2 RELATED WORK
3.1 System design and architecture
In last 10-13 years, many extraction systems have been
developed. In the very beginning, a wrapper is constructed
to manually extract a particular format of information.
However, the wrappers were not very much successful , it
should be redesigned and reprogrammed accordingly to
different types of information. In addition, it is
complicated and knowledge intensive to construct the
extraction rules used in a wrapper for a specific domain.
Therefore only experts may have knowledge to do that.
Due to the extensive work in manually constructing a
wrapper, many wrapper generation techniques have been
developed. Those techniques could be classified into
several classes, including language development based,
HTML tree processing based, natural language processing
based, wrapper induction based, modelling based and
ontology based [2].
Researchers have developed several approaches for
mining data records from Web pages. David Embley,
Yuan Jiang, and Yiu-Kai Ng use a set of heuristics and a
manually –constructed domain ontology[3]. David
Buttler, Ling Liu, and Calton Pu extend this approach in
Omini (Object Mining and Extraction System,
http://disl.cc.gatech.edu/Omini), which uses additional
heuristics based on HTML tags but no domain
knowledge[4]. Chia-Hui Chang and Shao-Chen Lui
proposed IEPAD (Information Extraction Based on
Pattern Discovery), an automatic method that uses
sequence alignment to find patterns representing a set of
data records[5]. Because this method’s sequence matching
is limited, it also performs poorly. Kristina Lerman, Craig
Knoblock, and Steven Minton use clustering and grammar
induction of regular languages but report unsatisfactory
results [6]. Valter Crescenzi, Giansalvatore Mecca, and
Paolo Merialdo compare two pages to find patterns for
data extraction [7]. However, their method needs to start
with a set of similar pages. By working on individual
pages, our technique avoids this limitation. Other related
research is mainly in wrapper induction [8]. A wrapper is
a program that extracts data from a Web site and puts it in
a database. However, rather than mining data records,
wrapper induction only extracts certain pieces of
information on the basis of user labelled items.
3. PROPOSED ALGORITHM TO
EXTRACT
DATA
FROM WEB
PAGES:
To cater the challenge, in this paper we proposed an
algorithm to extract the structure of lists by using the
regularities both in the format of the pages and the data
contained in them.
• Extract all data from lists
– Compute the page pattern and identify the list on each
page
– Compute a set of features (separators and content) for
each data extract
• Identify columns and Classification of data
• Identify rows
Web Page
Transform the web page into
equivalent HTML code
Extract the relevant
information and transfer it
into temporary file
Display the extracted data of
web pages
Figure 1. System architecture
3.2 Algorithm of web data extraction
input:
k = key word(s)
output:
Extracted data from web pages
1. begin
2. while(not end of data records)
3. If(keyword found)
4. Open the web page internally
5. Parse the web page in equivalent HTML code
6. Search for data and extract from web page
7. Store the data in temporary file
8. Move to next record
9. Else
10. Move to next records
11. Display all the relevant extracted data
Figure 2: Pseudo code of the data extraction algorithm
When algorithm runs it starts by tokenizing Web pages,
that is, splitting the text of the Web pages into individual
words or tokens. Pages are analyze to find common
structure. Many types of Web sources, especially those
that return lists, generate the page from a template and fill
it with results of a database query. It is expected all data
are available in the same format and same type, car price
for instance(as shown in figure-1. In addition to content,
layout features, such as separators, may be useful hints for
arranging extracts by columns. However, we cannot rely
solely on separators — the table may have missing
columns, separators before the first row and after the last
one may be different from those separating rows within
the list, etc..
The final step of the analysis is to partition the list into
rows. Ideally, it should be easy to identify rows from the
class assignment, because a row corresponds to a tuple of
data, which is repeated for every row. An algorithm will
be used this task. Each list, or rather the sequence of
column labels for the extracts in the list, can be thought of
as a string generated by a regular language. The language
captures the repeated structure in the sequences that
corresponds to rows. We use this information to partition
the list into tuples. The end result of the application of the
suite of algorithms is a complete assignment of data in the
list to rows and columns.
Figure 2. Extraction of old car detail data from Car Wale
web site
3.3 Equivalent HTML Code of Figure-1
Due to constraints of page limit equivalent HTML code of
1st car only i.e. Mercedes-Benz S-Class has been shown
below.
<div>
<table id="premium-lisings" width="100%"
cellpadding="0" cellspacing="0" border="0"
class="parent_table">
<tr name="D498695" class="premium-d-yellow" >
<td>
<a name="carName" target="_blank" title="MercedesBenz S-Class [2003-2013] 320 CDI [2007-2011]"
href="/used/cars-in-newdelhi/mercedesbenzsclass20032013-D498695/">
<b>Mercedes-Benz S-Class [2003-2013] 320 CDI [20072011]</b></a>
in Ashok Vihar, New Delhi
<div class="rightfloat">
<div class='leftfloat font10 light-grey-text' style="margintop: 4px;margin-right: 5px;">Premium</div>
</div>
</td>
</tr>
<tr name="D498695" class="premium-d-yellow">
<td class="grey-bottom-border">
<table width="100%" cellpadding="0" cellspacing="0"
border="0" class="inner_table">
<tr class="row_strong">
<td rowspan="3" width="100" class="thumb no-border" >
<img dataoriginal='http://img2.aeplcdn.com/tc/04/D8/210038/2007Mercedes-Benz-S-Class-320-CDI-2007-2011-86213380x60.JPG' class='pointer lazy'
onclick="showPhotos('D498695');"
src='http://img.carwale.com/grey.gif' title='MercedesBenz S-Class [2003-2013] 320 CDI [2007-2011]'
alt='Loading..' />
<div>
<a class="text-grey2 pointer"
onclick="showPhotos('D498695');">7 Photos</a>
</div>
<div>
<a class="text-grey2 pointer" href="/used/cars-innewdelhi/mercedesbenz-sclass20032013D498695/#divVideos" target="_blank" > </a>
</div>
<div>
<img border="0" dataoriginal='http://img2.aeplcdn.com/AutoBiz/Certifications/
cargiant.png' src='http://img.carwale.com/grey.gif'
class='lazy' title='Certified Car' style='margin-top:10px;'
/>
</div>
</td>
<td width="125" name="price"><span class="marginleft5"><span class="cw-sprite rupee-medium"></span>
25,00,000</span></td>
<td width="125"><span name="kms">68,000</span>
Kms</td>
<td width="100" name="year">Oct-2007</td>
<td >Diesel</td>
</tr>
<tr>
<td >WHITE</td>
<td>Automatic</td>
<td name="seller">Dealer</td>
<td >Second Owner </td>
</tr>
<tr class="no-border">
<td align="left">
<a target="_blank" title="View complete details of the
car" href="/used/cars-in-newdelhi/mercedesbenzsclass20032013-D498695/">More details »</a>
</td>
<td align="center">
<a profileid="D498695" calculatedemi=""
onclick="javascript:_gaq.push(['_trackEvent',
'UsedCarSearch', 'Button', '']);" class="button-light-grey
contact-seller">Contact Seller</a>
</td>
<td colspan="2" align="right">
<div class="chkcomp pointer hide" profileid="D498695">
<span class="font11 rightfloat" style="paddingtop:2px;">Add to compare</span>
<div class="filter checked_grey rightfloat" ></div>
</div>
<span class="font11"><i>Last Updated: 26-Feb2014</i></span>
</td>
</tr>
</table>
<input type="hidden" name="model_name"
value="Mercedes-Benz S-Class [2003-2013]" />
</td>
</tr>
Observation :
It can be seen from above HTML code (shown in border)
that most of the web pages have been designed in a
particular format, where data are put inside the Table tag
format. e.g. <TABLE>, <TR>, <TD>.
frequently encountered data types, such as phone numbers
and zip codes, dash is part of data and not a separator.
Likewise, comma (,) is sometimes a separator (e.g.,
“House No - 123, Gali no - 2, South Extension”) and
sometimes part of data (e.g., in “1,000,000”), though in
our concept it has been chosen to treat it as a separator.
4 RESULTS
Above mentioned algorithm were applied to extract data
from some Web information sources containing a wide
variety of data types. Three or four pages were randomly
selected from each source. The extraction algorithm was
applied to each set of pages and manually checked
whether the data from the list was extracted correctly or
not. The table in Fig. 2 summarizes the results. In some of
the web pages data were not extracted as we want to
retrieve the data for a particular domain. In this paper
used card domain has been discussed and algorithm has
been applied to retrieve the data for used car only.
4.1 Prototype Search form for the
practical implementation
3.4 Finding the page template
During the extraction step, the text of each Web page is
split into individual words, or more accurately tokens, and
each token is assigned one or more syntactic types [10],
based on the characters appearing in it. Thus, a token can
be an HTML token, an alphanumeric, or a punctuation
token. If it’s an alphanumeric token, it may also belong to
one of two categories: numeric or alphabetic, which is
further divided into capitalized or lowercased types, and
so on. The syntactic token hierarchy is described in [10].
Many Web sources use templates to automatically
generate pages and fill them with results of a database
query.
The algorithm works in the following way:
(a) When query is submitted in search box, it
matches in the database where only web
addresses are stored of different domain with
key word.
(b) Internally open the web page and searches for
key word in the generated HTML code.
(c) If keyword matches with the keyword available
in database it opens the web page internally and
searches for result.
(d) Display the result in a particular format.
(e) The process is applied for all the web address
and finally result are displayed in single web
page.
Once section of the page is identified that contains the
relevant data, all data are extracted from it. If the HTML
table is carefully formatted, this step would amount to
extracting all visible text. However, in addition to HTML
tags, punctuation characters are often used to separate data
fields.
Many programmer use different format as separator.
Sometimes a dash (-) is a good separator, but for many
Figure 3 Search form
4.2 Screen shot result of extracted data
www.bookmyseats.in
www.cardekho.com
www.flipkart.com
www.justeat.in
www.carwale.com
www.snapdeal.com
ticket booking web
page)
No(on line Movie
ticket booking web
page)
Yes
No(on line purchasing
web page)
No(On line food order
in hotel)
Yes
No(on line purchasing
web page)
Table-1 Sample web pages for retrieval of data
5. CONCLUSIONS
In this paper, a novel and effective technique was
proposed to retrieve data records in a Web page. Our
algorithm is based on two important observations about
data records on the Web and a string matching algorithm.
It is automatic and thus requires no manual effort. In
addition, our algorithm is able to discover non-contiguous
data records, which cannot be handled by existing
techniques. Experimental results show that our new
method is extremely accurate. In our future work, it has
been planned to find data records that are not formed by
HTML table related tags.
Figure 4 Extracted Result
The web pages from different domain which were taken
for practical implementation has been shown below :
Name of web sites
Data retrieved
or not
www.gaadi.com
www.cartrade.com
www.quikr.com
www.indiabookstore.com
Yes
Yes
Yes
No (on line book
purchasing web page)
No(on line Movie
ticket booking web
page)
Yes
Yes
Yes
No(on line Movie
ticket booking web
page)
Yes
Yes
Yes
Yes
Yes
Yes
No(on line book
purchasing web page)
No(on line Movie
ticket booking web
page)
Yes
No(on line Movie
in.bookmyshow.com
www.zigwheels.com
www.cars.com
www.oncars.in
www.pvrcinemas.com
www.autotrader.com
www.olx.in
autos.yahoo.com
www.rightcar.com
www.gaadi.com
www.carazoo.com
www.amazon.in
www.ticketplease.com
www.autotrader.com
www.bookmyevent.com
6. REFERENCES
[1] Eikvil, L.: Information Extraction from World Wide
Web – A Survey, Technical Report 945, Norweigan
Computing Center, Oslo, Norway (July 1999)
[2] Laender, A.H.F., Ribeiro-Neto, B.A., da Silva,
A.S., Teixeira, J.S.: A brief survey of Web data
extraction tools. ACM Sigmod Record 31(2),
84–93 (2002).
[3]
D. Embley, Y. Jiang, and Y. Ng, “RecordBoundary Discovery in Web Documents,” Proc.
ACM Int’l Conf. Management of Data
(SIGMOD 99), ACM Press, 1999, pp. 467–478.
[4] D. Buttler, L. Liu, and C. Pu, “A Fully
Automated Extraction System for the World
Wide Web,” Proc. 21t Int’l Conf. Distributed
Computing Systems (ICDCS 01), IEEE CS Press,
2001, pp. 361–370.
[5]
C.-H. Chang and S.-L. Lui, “IEPAD:
Information Extraction Based on Pattern
Discovery,” Proc. 10th Int’l Conf. World Wide
Web (WWW 01), ACM Press, 2001, pp. 681–
688.
[6] K. Lerman, C. Knoblock, and S. Minton,
“Automatic Data Extraction from Lists and
Tables in Web Sources,” Proc. IJCAI 2001
Workshop Adaptive Text Extraction and Mining,
papers/
[20] http://www.w3.org/TR/xhtml1/ XHTML, W3C
Recommendation
[7] V. Crescenzi, G. Mecca, and P. Merialdo,
“RoadRunner: Towards Automatic Data
Extraction from Large Web Sites,” Proc. 27th
Int’l Conf. Very Large Data Bases (VLDB 01),
Morgan Kaufmann, 2001, pp. 109–118.
[21] http://www.zvon.org/xxl/XSLTutorial/Output/
contents.html XSLT Tutorial
2001; www.isi.edu/
lerman01-atem.pdf.
info-agents/
[8] C. Knoblock and A. Levy, eds., Proc. 1998
Workshop AI and Information Integration,
AAAI Press, 1998; www.isi.edu/ariadne/ aiii98wkshp/proceedings.html.
[9]
Baeza-Yates, R. “Algorithms for string
matching: A survey.” ACM SIGIR Forum, 23(34):34--58, 1989
[10] Kristina Lerman and Steven Minton. Learning
the common structure of data. In Proceedings
of the 15=7th National Conference on Arti£cial
Intelligence (AAAI- 2000) and of the 10th
Conference on Innovative Applications of
Arti£cial Intelligence (IAAI-98), Menlo Park,
July 26–30 2000. AAAI Press.
[11] M. A. Hearst. UIs for faceted navigation recent
advances and remaining open problems. In
Proceedings of HCIR, 2008.
[12] A. Jain and M. Pennacchiotti. Open entity extraction
from web search query logs. In Proceedings of
ICCL, 2010.
[13] H. S. Koppula, K. P. Leela, A. Agarwal, K. P.
Chitrapura, S. Garg, and A. Sasturkar. Learning url
patterns for webpage de-duplication. In Proceedings
of WSDM, 2010.
[14] J. Madhavan, S. R. Jeffery, S. Cohen, X. luna Dong,
D. Ko, C. Yu, and A. Halevy. Web-scale data
integration: You can only afford to pay as you go. In
Proceedings of CIDR, 2007.
[15] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A.
Rasmussen, and A. Halevy. Google’s deep web
crawl. In Proceedings of VLDB, 2008
[16] Yeye He, Dong Xin, Venkatesh Ganti, Sriram
Rajaraman, Nirav ShahCrawling Deep web entity
pages . In proceeding of WSDM’ 2013.
[17] Yiyao Lu, Hai He, Hongkun Zhao, Weiyi
Meng,Annotating Search Results from Web
Databases, IEEE transactions on knowledge and
data engineering, Vol. 25, No. 3, March 2013.
[18] Muhammad Faheem, Intelligent Crawling of Web
Applications for Web Archiving, published in
WWW 2012 companion ACM
[19] http://www.internetworldstats.com/Miniwatts
Marking Group
[22] http://www.w3.org/TR/xslt.html XSL
Transformations, W3C Recommendation
[23] http://www.w3.org/TR/xmlschema-0/ XML Schema
Primer, W3C Working Draft
[24] http://tidy.sourceforge.net/ HTML Tidy Library
Project
[25] http://www.saxonica.com/ Saxon Processor
Download