AUTOMATIC EXTRACTION OF DATA FROM DEEP WEB PAGE

Nripendra Narayan Das, Associate Professor, RIET, Faridabad, nripendraddas@gmail.com
Ela Kumar, Dean and Associate Professor, Gautam Buddha University, Greater Noida, ela_kumar@gbu.ac.in

ABSTRACT
A large volume of information is available to be mined from the World Wide Web. Information on the Web is contained in structured and unstructured objects known as data records. Such data records are important because essential information is published in their host pages, e.g., lists of products and their detailed descriptions. It is necessary to extract these data records to provide users with information relevant to their requirements. Approaches to this problem include the manual approach, supervised learning, and automatic techniques. The manual method is not suitable for a large number of pages, and retrieving appropriate, useful information from Web pages is challenging. Many web retrieval systems, called web wrappers or web crawlers, have been designed. In this paper, some existing techniques are examined and our current work on web information extraction is presented. Most search engines are not in a position to extract data from the deep web. Here, an algorithm has been designed and practically implemented that not only extracts the data efficiently but also displays it as a single-page output. Experimental evaluation on a large collection of real web URL addresses indicates that our algorithm correctly extracts data in most cases.

Keywords: World Wide Web, wrapper, web crawler

1. INTRODUCTION
Before introducing our work, we must define the Deep Web. The Deep Web (or invisible web) is the set of information resources on the World Wide Web not reported by normal search engines. The World Wide Web contains many types of relevant information in the form of free text.
The rapid growth and popularity of the World Wide Web and the huge amount of information available on the Internet make extraction of data from web pages difficult. Because web page design is heterogeneous and unstructured, it is hard to retrieve information from these sources. Some Web mining applications, such as comparison-shopping robots, require expensive maintenance to deal with different data formats. Many algorithms have been designed in the area of information extraction (IE) to translate input pages into structured data automatically. Information sources can be classified into three main types: free text, structured text, and semi-structured text. Originally, extraction systems focused on free text. Natural Language Processing (NLP) techniques were developed to extract this type of unrestricted, unregulated information; they employ the syntactic and semantic characteristics of the language to generate extraction rules. Structured information usually comes from databases, which provide rigid, well-defined formats, so it is easy to extract through a query language such as Structured Query Language (SQL). The third type is semi-structured information, which falls between free text and structured information; Web pages are a good example. According to statistics from the Miniwatts Marketing Group [19], the growth in web users during this decade is over 200%, and there are more than 2 billion Internet users from over 278 countries and world regions. Public information and virtual places are increasing accordingly, covering almost any kind of information need. This attracts much attention to the question of how useful information can be extracted from the Web. Currently, targeted web documents can easily be obtained by entering keywords into a web search engine.
However, the system may not necessarily return relevant, data-rich pages, and it is not easy for a computer to automatically extract or fully understand the information they contain. The reason is that web pages are designed for human browsing rather than machine interpretation. Most pages are in Hypertext Markup Language (HTML), a semi-structured format in which the data are not given a fixed structure and change frequently [1]. There are several challenges in extracting information from a semi-structured web page: (i) lack of a schema; (ii) ill formatting; (iii) high update frequency and semantic heterogeneity of the information. To overcome these challenges, our system first transforms the page into Extensible Hypertext Markup Language (XHTML) [20]. Then, using the DOM tree hierarchy of the web page, regular expressions are extracted with the Extensible Stylesheet Language (XSL) [21, 22] technique, aided by a human training process. The relevant information is extracted and transformed into another structured format, Extensible Markup Language (XML) [23]. The remainder of the paper is organized as follows: Section 2 illustrates related work, with a brief overview of current web information extraction systems; Section 3 details the techniques in our approach; Section 4 presents experimental results; the last section gives the conclusion and future work. In this paper, we focus mainly on extracting text information from web pages. After browsing more than 100 web pages, we observed that almost all of them are written in one particular language, HTML, and that to format the output page most programmers use the table and list HTML tags.
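As a rough illustration of this pipeline (not the authors' exact XSL implementation), the following Python sketch walks the tag stream of a small page fragment and emits the visible cell text as XML; the tag choice and field layout are assumptions made for the example. In the actual system the page would first be normalized to XHTML (e.g., with HTML Tidy [24]) and transformed with XSLT:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TableTextExtractor(HTMLParser):
    """Collects the text of every <td> cell, approximating the
    DOM-walk stage of the extraction pipeline described above."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")   # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

def to_xml(cells):
    """Emit the extracted cells as a simple XML record."""
    root = ET.Element("record")
    for i, text in enumerate(cells):
        field = ET.SubElement(root, "field", id=str(i))
        field.text = text
    return ET.tostring(root, encoding="unicode")

# Hypothetical page fragment for illustration only
page = "<table><tr><td>25,00,000</td><td>68,000 Kms</td></tr></table>"
p = TableTextExtractor()
p.feed(page)
print(to_xml(p.cells))
```

The XML output can then be queried or restyled independently of the original page layout, which is the point of the HTML-to-XML transformation.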
2. RELATED WORK
Over the last 10-13 years, many extraction systems have been developed. In the very beginning, wrappers were constructed to manually extract a particular format of information. However, such wrappers were not very successful: they had to be redesigned and reprogrammed for each different type of information. In addition, constructing the extraction rules used in a wrapper for a specific domain is complicated and knowledge-intensive, so only experts can do it. Because of the extensive work involved in manually constructing a wrapper, many wrapper generation techniques have been developed. These techniques can be classified into several classes, including language-development based, HTML-tree-processing based, natural-language-processing based, wrapper-induction based, modelling based, and ontology based [2]. Researchers have developed several approaches for mining data records from Web pages. David Embley, Yuan Jiang, and Yiu-Kai Ng use a set of heuristics and a manually constructed domain ontology [3]. David Buttler, Ling Liu, and Calton Pu extend this approach in Omini (Object Mining and Extraction System, http://disl.cc.gatech.edu/Omini), which uses additional heuristics based on HTML tags but no domain knowledge [4]. Chia-Hui Chang and Shao-Chen Lui proposed IEPAD (Information Extraction Based on Pattern Discovery), an automatic method that uses sequence alignment to find patterns representing a set of data records [5]. Because this method's sequence matching is limited, it performs poorly. Kristina Lerman, Craig Knoblock, and Steven Minton use clustering and grammar induction of regular languages but report unsatisfactory results [6]. Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo compare two pages to find patterns for data extraction [7]. However, their method needs to start with a set of similar pages.
By working on individual pages, our technique avoids this limitation. Other related research is mainly in wrapper induction [8]. A wrapper is a program that extracts data from a Web site and puts it in a database. However, rather than mining data records, wrapper induction extracts only certain pieces of information on the basis of user-labelled items.

3. PROPOSED ALGORITHM TO EXTRACT DATA FROM WEB PAGES
To meet this challenge, we propose an algorithm that extracts the structure of lists by exploiting regularities both in the format of the pages and in the data they contain:
• Extract all data from lists
  - Compute the page pattern and identify the list on each page
  - Compute a set of features (separators and content) for each data extract
• Identify columns and classify the data
• Identify rows

3.1 System design and architecture
The system takes a web page, transforms it into equivalent HTML code, extracts the relevant information into a temporary file, and finally displays the extracted data of the web pages.
Figure 1. System architecture

3.2 Algorithm of web data extraction
Input: k = keyword(s)
Output: extracted data from web pages
1. begin
2. while (not end of data records)
3.   if (keyword found)
4.     open the web page internally
5.     parse the web page into equivalent HTML code
6.     search for data and extract it from the web page
7.     store the data in a temporary file
8.     move to the next record
9.   else
10.    move to the next record
11. display all the relevant extracted data
Figure 2: Pseudo code of the data extraction algorithm

When the algorithm runs, it starts by tokenizing the Web pages, that is, splitting their text into individual words or tokens. The pages are then analyzed to find common structure. Many types of Web sources, especially those that return lists, generate each page from a template and fill it with the results of a database query. All data are therefore expected to appear in the same format and with the same type; car prices, for instance (as in the CarWale listing of Figure 2).
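The loop in the pseudocode above can be sketched in Python as follows. This is a minimal, hedged sketch: the page set, keyword, and the table-cell regular expression are illustrative assumptions, and a real run would fetch each URL with urllib rather than operate on pre-fetched HTML:

```python
import re

def extract_matching_pages(pages, keyword):
    """Sketch of the Figure 2 loop: for each candidate record, open the
    page (here pre-fetched), keep it only if the keyword occurs, and
    collect the visible text of its table cells into a temporary store.
    `pages` maps URL -> raw HTML."""
    store = []  # plays the role of the temporary file in step 7
    for url, html in pages.items():
        if keyword.lower() not in html.lower():
            continue  # "else: move to the next record"
        # Crude stand-in for full parsing: grab <td>...</td> contents
        cells = re.findall(r"<td[^>]*>(.*?)</td>", html, re.S)
        store.append((url, [re.sub(r"<[^>]+>", "", c).strip() for c in cells]))
    return store  # "display all the relevant extracted data"

# Hypothetical pre-fetched pages for illustration only
pages = {
    "http://example.com/a": "<table><tr><td>Maruti Swift</td><td>3,50,000</td></tr></table>",
    "http://example.com/b": "<p>No car listings here</p>",
}
print(extract_matching_pages(pages, "swift"))
```

Only the first page survives the keyword test, so the store ends up holding a single (URL, cell list) pair, mirroring how non-matching records are skipped in step 10 of the pseudocode.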
In addition to content, layout features such as separators may be useful hints for arranging extracts into columns. However, we cannot rely solely on separators: the table may have missing columns, and the separators before the first row and after the last one may differ from those separating rows within the list. The final step of the analysis is to partition the list into rows. Ideally, rows should be easy to identify from the class assignment, because a row corresponds to a tuple of data that repeats for every row. An algorithm is used for this task: each list, or rather the sequence of column labels for the extracts in the list, can be thought of as a string generated by a regular language. The language captures the repeated structure in the sequences that corresponds to rows, and we use this information to partition the list into tuples. The end result of applying the suite of algorithms is a complete assignment of the data in the list to rows and columns.
Figure 2. Extraction of old car detail data from the CarWale web site

3.3 Equivalent HTML code of the page in Figure 2
Due to page-limit constraints, only the equivalent HTML code of the first car, the Mercedes-Benz S-Class, is shown below.
<div>
  <table id="premium-lisings" width="100%" cellpadding="0" cellspacing="0" border="0" class="parent_table">
    <tr name="D498695" class="premium-d-yellow">
      <td>
        <a name="carName" target="_blank" title="Mercedes-Benz S-Class [2003-2013] 320 CDI [2007-2011]" href="/used/cars-in-newdelhi/mercedesbenzsclass20032013-D498695/">
          <b>Mercedes-Benz S-Class [2003-2013] 320 CDI [2007-2011]</b></a> in Ashok Vihar, New Delhi
        <div class="rightfloat">
          <div class='leftfloat font10 light-grey-text' style="margin-top: 4px; margin-right: 5px;">Premium</div>
        </div>
      </td>
    </tr>
    <tr name="D498695" class="premium-d-yellow">
      <td class="grey-bottom-border">
        <table width="100%" cellpadding="0" cellspacing="0" border="0" class="inner_table">
          <tr class="row_strong">
            <td rowspan="3" width="100" class="thumb no-border">
              <img data-original='http://img2.aeplcdn.com/tc/04/D8/210038/2007-Mercedes-Benz-S-Class-320-CDI-2007-2011-86213380x60.JPG' class='pointer lazy' onclick="showPhotos('D498695');" src='http://img.carwale.com/grey.gif' title='Mercedes-Benz S-Class [2003-2013] 320 CDI [2007-2011]' alt='Loading..'
              />
              <div>
                <a class="text-grey2 pointer" onclick="showPhotos('D498695');">7 Photos</a>
              </div>
              <div>
                <a class="text-grey2 pointer" href="/used/cars-in-newdelhi/mercedesbenz-sclass20032013D498695/#divVideos" target="_blank"></a>
              </div>
              <div>
                <img border="0" data-original='http://img2.aeplcdn.com/AutoBiz/Certifications/cargiant.png' src='http://img.carwale.com/grey.gif' class='lazy' title='Certified Car' style='margin-top:10px;' />
              </div>
            </td>
            <td width="125" name="price"><span class="marginleft5"><span class="cw-sprite rupee-medium"></span> 25,00,000</span></td>
            <td width="125"><span name="kms">68,000</span> Kms</td>
            <td width="100" name="year">Oct-2007</td>
            <td>Diesel</td>
          </tr>
          <tr>
            <td>WHITE</td>
            <td>Automatic</td>
            <td name="seller">Dealer</td>
            <td>Second Owner</td>
          </tr>
          <tr class="no-border">
            <td align="left">
              <a target="_blank" title="View complete details of the car" href="/used/cars-in-newdelhi/mercedesbenzsclass20032013-D498695/">More details &raquo;</a>
            </td>
            <td align="center">
              <a profileid="D498695" calculatedemi="" onclick="javascript:_gaq.push(['_trackEvent', 'UsedCarSearch', 'Button', '']);" class="button-light-grey contact-seller">Contact Seller</a>
            </td>
            <td colspan="2" align="right">
              <div class="chkcomp pointer hide" profileid="D498695">
                <span class="font11 rightfloat" style="padding-top:2px;">Add to compare</span>
                <div class="filter checked_grey rightfloat"></div>
              </div>
              <span class="font11"><i>Last Updated: 26-Feb-2014</i></span>
            </td>
          </tr>
        </table>
        <input type="hidden" name="model_name" value="Mercedes-Benz S-Class [2003-2013]" />
      </td>
    </tr>

Observation: It can be seen from the HTML code above that most web pages are designed in a particular format, where the data are placed inside table tags, e.g. <TABLE>, <TR>, <TD>. Note also that a dash is sometimes a good separator, but for many frequently encountered data types, such as phone numbers and ZIP codes, the dash is part of the data and not a separator.
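The table-tag observation can be made concrete: the listing above marks its data cells with name attributes (name="price", name="kms", name="year"), so a small parser can pull those fields out directly. This Python sketch, a simplified stand-in for our full extractor built on the standard html.parser module, operates on a trimmed-down fragment of the markup:

```python
from html.parser import HTMLParser

class NamedFieldExtractor(HTMLParser):
    """Pulls out the cells that the listing marks with a name
    attribute, as in the CarWale HTML shown above."""
    WANTED = ("price", "kms", "year", "seller")

    def __init__(self):
        super().__init__()
        self.current = None   # name of the field being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("td", "span") and attrs.get("name") in self.WANTED:
            self.current = attrs["name"]
            self.fields[self.current] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "span"):
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] += data.strip()

# Trimmed-down fragment of the listing markup, for illustration
snippet = ('<tr><td name="price"><span>25,00,000</span></td>'
           '<td><span name="kms">68,000</span> Kms</td>'
           '<td name="year">Oct-2007</td></tr>')
p = NamedFieldExtractor()
p.feed(snippet)
print(p.fields)
```

Text outside a named element (the trailing " Kms" label, for example) is ignored, which is exactly the behavior wanted when the units are layout decoration rather than data.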
Likewise, a comma (,) is sometimes a separator (e.g., in "House No - 123, Gali No - 2, South Extension") and sometimes part of the data (e.g., in "1,000,000"); in our approach it is treated as a separator.

3.4 Finding the page template
During the extraction step, the text of each Web page is split into individual words, or more accurately tokens, and each token is assigned one or more syntactic types [10] based on the characters appearing in it. Thus, a token can be an HTML token, an alphanumeric token, or a punctuation token. An alphanumeric token may further belong to one of two categories, numeric or alphabetic, the latter divided into capitalized or lowercased types, and so on. The syntactic token hierarchy is described in [10]. Many Web sources use templates to automatically generate pages and fill them with the results of a database query. The algorithm works in the following way:
(a) When a query is submitted in the search box, it is matched against a database in which only web addresses of different domains are stored along with keywords.
(b) Each web page is opened internally and the keyword is searched for in the generated HTML code.
(c) If the keyword matches a keyword available in the database, the page is opened internally and searched for results.
(d) The results are displayed in a particular format.
(e) The process is applied to all the web addresses, and finally the results are displayed in a single web page.
Once the section of the page that contains the relevant data has been identified, all data are extracted from it. If the HTML table is carefully formatted, this step amounts to extracting all visible text. However, in addition to HTML tags, punctuation characters are often used to separate data fields, and different programmers use different characters as separators, which brings back the dash and comma ambiguities discussed above.

4. RESULTS
The algorithm above was applied to extract data from several Web information sources containing a wide variety of data types. Three or four pages were randomly selected from each source, the extraction algorithm was applied to each set of pages, and we manually checked whether the data from the list were extracted correctly. Table 1 summarizes the results. From some web pages no data were extracted, because we retrieve data for a particular domain only: in this paper the used-car domain is discussed, and the algorithm was applied to retrieve data for used cars only.

4.1 Prototype search form for the practical implementation
Figure 3. Search form

4.2 Screenshot result of the extracted data
Figure 4. Extracted result
The web pages from different domains taken for the practical implementation are shown below:

Name of web site           Data retrieved or not
www.gaadi.com              Yes
www.cartrade.com           Yes
www.quikr.com              Yes
www.indiabookstore.com     No (online book purchasing web page)
in.bookmyshow.com          No (online movie ticket booking web page)
www.zigwheels.com          Yes
www.cars.com               Yes
www.oncars.in              Yes
www.pvrcinemas.com         No (online movie ticket booking web page)
www.autotrader.com         Yes
www.olx.in                 Yes
autos.yahoo.com            Yes
www.rightcar.com           Yes
www.gaadi.com              Yes
www.carazoo.com            Yes
www.amazon.in              No (online book purchasing web page)
www.ticketplease.com       No (online movie ticket booking web page)
www.autotrader.com         Yes
www.bookmyevent.com        No (online movie ticket booking web page)
www.bookmyseats.in         No (online movie ticket booking web page)
www.cardekho.com           Yes
www.flipkart.com           No (online purchasing web page)
www.justeat.in             No (online food ordering web page)
www.carwale.com            Yes
www.snapdeal.com           No (online purchasing web page)
Table 1. Sample web pages for retrieval of data

5. CONCLUSIONS
In this paper, a novel and effective technique was proposed to retrieve data records from a Web page. Our algorithm is based on two important observations about data records on the Web and a string matching algorithm. It is automatic and thus requires no manual effort. In addition, our algorithm is able to discover non-contiguous data records, which cannot be handled by existing techniques. Experimental results show that the new method is highly accurate. In future work, we plan to find data records that are not formed by HTML table-related tags.

6. REFERENCES
[1] L. Eikvil, "Information Extraction from World Wide Web: A Survey," Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[2] A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira, "A Brief Survey of Web Data Extraction Tools," ACM SIGMOD Record 31(2), pp. 84-93, 2002.
[3] D. Embley, Y. Jiang, and Y. Ng, "Record-Boundary Discovery in Web Documents," Proc. ACM Int'l Conf. Management of Data (SIGMOD 99), ACM Press, 1999, pp. 467-478.
[4] D. Buttler, L. Liu, and C. Pu, "A Fully Automated Extraction System for the World Wide Web," Proc. 21st Int'l Conf. Distributed Computing Systems (ICDCS 01), IEEE CS Press, 2001, pp. 361-370.
[5] C.-H. Chang and S.-C. Lui, "IEPAD: Information Extraction Based on Pattern Discovery," Proc. 10th Int'l Conf. World Wide Web (WWW 01), ACM Press, 2001, pp. 681-688.
[6] K. Lerman, C. Knoblock, and S. Minton, "Automatic Data Extraction from Lists and Tables in Web Sources," Proc. IJCAI 2001 Workshop on Adaptive Text Extraction and Mining, 2001; www.isi.edu/info-agents/papers/lerman01-atem.pdf.
[7] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB 01), Morgan Kaufmann, 2001, pp. 109-118.
[8] C. Knoblock and A. Levy, eds., Proc. 1998 Workshop on AI and Information Integration, AAAI Press, 1998; www.isi.edu/ariadne/aiii98wkshp/proceedings.html.
[9] R. Baeza-Yates, "Algorithms for String Matching: A Survey," ACM SIGIR Forum 23(3-4), pp. 34-58, 1989.
[10] K. Lerman and S. Minton, "Learning the Common Structure of Data," Proc. 17th National Conference on Artificial Intelligence (AAAI-2000) and Conference on Innovative Applications of Artificial Intelligence (IAAI-2000), AAAI Press, Menlo Park, July 26-30, 2000.
[11] M. A. Hearst, "UIs for Faceted Navigation: Recent Advances and Remaining Open Problems," Proc. HCIR, 2008.
[12] A. Jain and M. Pennacchiotti, "Open Entity Extraction from Web Search Query Logs," Proc. COLING, 2010.
[13] H. S. Koppula, K. P. Leela, A. Agarwal, K. P. Chitrapura, S. Garg, and A. Sasturkar, "Learning URL Patterns for Webpage De-duplication," Proc. WSDM, 2010.
[14] J. Madhavan, S. R. Jeffery, S. Cohen, X. Luna Dong, D. Ko, C. Yu, and A. Halevy, "Web-Scale Data Integration: You Can Only Afford to Pay As You Go," Proc. CIDR, 2007.
[15] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy, "Google's Deep Web Crawl," Proc. VLDB, 2008.
[16] Y. He, D. Xin, V. Ganti, S. Rajaraman, and N. Shah, "Crawling Deep Web Entity Pages," Proc. WSDM, 2013.
[17] Y. Lu, H. He, H. Zhao, and W. Meng, "Annotating Search Results from Web Databases," IEEE Transactions on Knowledge and Data Engineering 25(3), March 2013.
[18] M. Faheem, "Intelligent Crawling of Web Applications for Web Archiving," WWW 2012 Companion, ACM, 2012.
[19] Miniwatts Marketing Group, Internet World Stats, http://www.internetworldstats.com/
[20] XHTML 1.0, W3C Recommendation, http://www.w3.org/TR/xhtml1/
[21] XSLT Tutorial, http://www.zvon.org/xxl/XSLTutorial/Output/contents.html
[22] XSL Transformations, W3C Recommendation, http://www.w3.org/TR/xslt.html
[23] XML Schema Primer, W3C Working Draft, http://www.w3.org/TR/xmlschema-0/
[24] HTML Tidy Library Project, http://tidy.sourceforge.net/
[25] Saxon Processor, http://www.saxonica.com/