BULDING BIBLIOGRAPHY OF SCIENTIFIC PAPERS USING WEB CRAWLER Tin Huynh, Tien Do, Cuong Nguyen Univercity of Information Techonology – Vietnam National University, HCM City 6 Quarter, Linh Trung Ward, Thu Duc District, Ho Chi Minh City Email: tinhn@uit.edu.vn, tiendo.vn@gmail.com, cuongnp.uit.se@gmail.com Abstract In this paper we develop a system to build bibliography of scientific papers by combining the use of Web Crawler to index the article directly on the digital library and the existing database indexes. This method would ensure that data collected are accurate, updated and sufficient . Building a tool to indexes scientific papers is a very nesecessary and useful task for building searching documents to meet the needs of users. Our system will implement extract metadata of science paper on digital libraries ACM1, CiteSeer2, IEEEXplore3 in combined with the use of data available on DBLP4 to build bibliography of scientific papers. Keywords Digital Library, DBLP, Web Crawler. I. INTRODUCTION Nowadays, most of scientific articles are published on digital library and other shared forums. Among them, digital library is the one storing the huge numbers of articles, papers and they have been increasing every single year. When users need to find a scientific article, they could find on many digital libraries, as well as different available database indexes, because the updated data of the different organizations or the award exchange data between libraries of different institutions is limited. In addition, updating new data of existing database index is still limited. Thus, it is necessary to build a system so as to collect information about articles from different digital libraries in combined with using the available database index to build inventory index articles to make sure that all data are accurate, complete and updated. According to the survey [1] [7], there were some applications for collecting, storing index data of scientific papers as the system was introduced in [2][4][8][9][12]; however, how to ensure the data collected complete and updated is still a major challenge. In this paper, we develop a system to collect bibliography of scientific paper by using Web Crawler to extract URL to articles content driecly on digital libraries such as ACM, IEEEXplore, CiteSeer based on key words which are author’s name from DBLP data or subject in computer science 1 http://portal.acm.org/ 2 http://citeseerx.ist.psu.edu/ 3 http://ieeexplore.ieee.org/ 4 http://dblp.uni-trier.de/ (Wikipedia5). Afterward from these results, we would use the combined analysis together with set of rules defined to extract metadata of each article. From gathered information, system will combine the available data in DBLP – an bibliography contained 1.5 million articles (as of January 2011) to build up the bibliography of scientific articles. Thus, using Web Crawler in the digital library system would ensure that all articles have latest updated; at the same time, combination with the available data in DBLP will provide fully accurate data for the system. In section 2, we briefly survey current related resreach and system on data collection. In section 3 we will present our approach, that is the system architecture for metadata extraction from URL to artiscles contents and combined use of database indexes available . Assessment and some experimental results our system introduced in section 4. The final section will be the conclusion and some discussion on our system and future works. II. RELATED WORK According to the survey [1] [3] [7], there are several sources from which data can build a bibliography of articles including: building bibliography from TOCs (tables of contents of proceedings of conferences, journals) [8]; building bibliography of articles by extraction metadata from the article content (use postscript or PDF files of articles published online) [2][11][12]; collect information articles on the Web (on information sharing web sites or on author's Web)[4][9][10]. In [8], DBLP (Digital Bibliography & Library Project) provides bibliographic information on major computer science journals and proceedings. DBLP data collected through the analysis TOCs of proceedings of conferences, journals – TOCs files are collected by organizations sending to authors as well as our own collection. DBLP is open data, exported to other formats such as CDF, XML and MySQL; developers can download these files from the homepage of DBLP. Based on data of DBLP, there are some search tools such as those introduced in [5][6]. 5 http://en.wikipedia.org/wiki/Computer_science Number of articles do not exist in DBLP (%) Number of articles published before 2010 do not exist in DBLP Number of articles published at 2010 do not exist in DBLP III. OUT APROACH A. The system architecture. Digital Library ACM, IEEE, CiteSeer, (%) (%) URL Results URL Query Database 28,33 86,26 10,71 Data mining 43,67 77,45 14,51 Key Word Collection Module List URL Table 1: The experimental result for completeness of data DBLP To evaluate the update and accuracy of DBLP, we were searching articles on digital libraries with a topic keyword in the field of computer science, and then check the existence of received information in the DBLP data. In table 1, the average results of 100 first papers on the three digital libraries ACM, CiteSeer, IEEEXplore after searching for two keywords “Database” and “Data mining”. With these results, we found that data of DBLP do not guarantee complete and accurate data because it is difficult to collect all TOCs files from conferences, journals. Our system would use data from DBLP to make use of the available data; In addition, our systems offer a methodology for collecting additional data and DBLP by crawling to get bibliography of scientific papers directly from digital library. In [2][12], authors introduce ACI (Autonomous Citation Indexing), an index articles system for use in CiteSeer digital libraries. ACI use a search engine to collect, download the articles on the Internet with the PostScript or PDF file format, ACI use the algorithms and analysis to extract metadata on these files build up a database of the CiteSeer digital library. Currently, for digital libraries such as ACM, IEEEXplore, Spinger, and etc, downloading a paper is limited, so ACI can not collect metadata of these articles, whereas the bibliography of the article has been provided on digital libraries. On the other hand, algorithms to extract metadata from content of articles may be highly accurate and still is a topic being studied in the field of data mining. In [4][9][10], introduced some systems collect information articles on the Web . These systems use algorithms and rules to extract articles metadata on the Web, according to the author, the accuracy of the data collected are not high. Alternatively, this system has not made good use of bibliography of articles and papers which had been marked by digital libraries as well as existing. Metadata Extraction Module Metadata Metadata Import Module Metadata articles Key Word Check Duplicate DBLP BIBLOGRAPHY OF SCIENTIFIC PAPERS Fig 1: The System architecture In the Fig.1, we introduce the architecture of our system; input is the keyword chosen from a list of subjects in computer science or authors names taken from the DBLP. Basing entered keywords, the system will Crawl the URL to articles on digital libraries, the results from the digital library are list URL to articles relevant to keyword search, metadata extraction module uses the analysis, the combined use of predefined rules to identify and extraction metadata of articles from URL. Check duplicate module will check the existence of articles in DBLP, then the system can store the results obtained to the database. Import module has function connection and update data of DBLP. B. Articles crawl on digital libraries. System uses keywords to collect the path of articles (URL) which contains articles having corresponding content on the digital library. To get results from the digital library, system combines keyword with defined partten to create a URL query For example, for keyword “Database” in the digital library Citeseer. URL query: http://citeseerx.ist.psu.edu/search?q=database&feed=atom&s ort=rel Fig 2: The list of URL return from CiteSeer digital library with keyword Database. The system analyzes the HTML content returned from the digital library, to collect a list of links that contain the article in data of digital library. C. Metadata Extraction From the list of links to that the articles content, system will analyze the HTML content of each article by using the analysis HTMLParser6, SAXParser7 and pre-defined rules to extract information including a summary (abstract) and BibTeX file containing index information of the article. After get the BibTeX file, system would use BibTex parser in JabRef8 to get metadata of articles. D. Combined use of database indexes available DBLP For the data collected is complete, the system uses of available index data for DBLP also. When a paper is collected on, the system will use the title and year of publication to determine if the article has already been in the database yet. After the process of comparison, the system displays the statistical parameters for user reference. E. System Building biblography of scientific papers using Web Crawler We develop a system which builds bibllography of scientific papers using Web Crawler. Our system manages database by MySQL and it is built using Java that could run on any platform such as Windows, Linux. The system can display visual data to help users edit the information of the article or add or remove data directly and interact well with users. Fig 3. Detail of an article of the ACM Digital Library. Bibtex uses a style-independent text-based file format for lists of bibliography items, such as articles, books, and theses. In digital libraries, metadata of paper export to a BibTeX file, so metadata extraction from this will accurately. By using the rules defined before, the BibTeX file would get from HTML content of the article. The following are examples of rules to get Bibtex file from ACM Digital Library. ACMStartGetBibtex: (rule to get id of the article) http://portal.acm.org/exportformats.cfm?id= ACMEndGetBibtex : (rule to get file Bibtex) &expformat=bibtex ACMStartGetAbstract (rult to get abstract of article) http://portal.acm.org/tab_abstract.cfm?id= ACMEndGetAbstract: &usebody=tabbody To evaluate the data collection system collects, we are done by collecting the articles based on keyword topics in the field of computer science. Here are statistical results are averaged over 100 articles collected on digital libraries: Here's an example of a BibTeX file are get from the ACM Digital Library Digital library Fig 5: Screen System IV. EXPERIMENTAL EVALUATOIN The number of new article added (%) The number of new articles has published before 2010 (%) The number of new articles has published at 2010 (%) ACM 48 66,67 53.6 CiteSeer 33 72,41 65 IEEEXplore 54 12,96 74.5 Table 2. The Experimental result for data collection. Fig 4. BibTeX file structure get from the ACM Digital Library 6 Number of new articles added by the number of articles exist in the digital library but does not exist in the DBLP data. According the experimental result above, system has updated the information from the paper's new digital library, add missing data in the DBLP, help build up a database to index the article accurate, complete and updated. http://htmlparser.sourceforge.net/ 7http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXP arser.html 8 http://jabref.sourceforge.net/ V. CONCLUSION AND FUTURE WORK In this paper, we have introduced a system to build and update the data index scientific articles by the Web Crawler on digital libraries. Thus comparing to other studies [8][4][9], we have improved the accuracy and performance when crawling on digital libraries. Differentiating with solutions that are introduced in [2], our system utilizes data available from DBLP. In addition, the automatic update function using predefined time system ensures a new update of articles published in digital libraries. Next step in the near future, we will complete system with functions: Additional sources of data collection, including another digital libraries and from the author's personal page. Topics classification for scientific articles have been collected based on the metadata of the article. Building search engine of scientific papers based on data collected system REFERENCES [1] Badawia M. Albassuny. Automatic metadata generation applications: a survey study. International Journal of Metadata, Semantics and Ontologies . Volume 3, Number .pp 260 – 282. 4 / 2008. [2] C.L. Giles, K. Bollacker, S. Lawrence. CiteSeer: An Automatic Citation Indexing System.Digital Libraries 98. Third ACM Conf. Digital Libraries, ACM Press. pp. 89-98. New York. 1998. [3] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled F. Shaalan. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, Oct. 2006. [4] G. Pant, K. Tsioutsiouliklis, J. Johnson, C.L. Giles. Panorama. Extending Digital Libraries with Topical Crawlers. Proc. ACM/IEEE Joint Conference on Digital Libraries (JCDL). Tuscon, AZ, USA. 2004. [5] Holger Bast, Ingmar Weber. The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration. CIDR 2007: 3rd Biennial Conference on Innovative Data Systems Research. pp. 88-95. Asilomar, CA, USA. 2007. [6] J. Diederich, W.-T. Balke. FacetedDBLP - Navigational Access for Digital Libraries. Bulletin of IEEE Technical Committee on Digital Libraries, Volume 4 Issue 1. ISSN 1937-7266. Spring 2008 [7] Jane Greenberg, Kristina Spurgin , Abe Crystal. Functionalities for automatic metadata generation applications: a survey of metadata experts’ opinions . Int. J. Metadata, Semantics and Ontologies, Vol. 1, No. 1. 2006. [8] Michael Ley. The DBLP Computer Science Bibliography: Evolution, Research Issues. Perspective. Lecture Notes in Computer Science. Volume 2476/2002, 481-486. 2002. [9] Rong Shi. Automatic metadata discovery from noncooperative digital libraries. In Proc. of IADIS international Conf. on e-Society. . Lisbon, Portugal. 2003. [10] Roth, D.L. The emergence of competitors to the Science Citation Index and the Web of Science. Current Science, Vol. 8. Pp. 1531 – 1536. 2005. [11] Nicola Zeni, Nadzeya Kiyavitskaya, Luisa Mich, John Mylopoulos, James Cordy. A Lightweight Approach to Semantic Annotation of Research Papers. Natural Language Processing and Information Systems. pp. 61-72. .Paris, France. 2007. [12] Steve Lawrence, C. Lee Giles, Kurt Bollacker. Digital Libraries and Autonomous Citation Indexing. Computer, vol. 32, no. 6, pp. 67-71. June 1999.