English_version1

advertisement
BULDING BIBLIOGRAPHY OF SCIENTIFIC
PAPERS USING WEB CRAWLER
Tin Huynh, Tien Do, Cuong Nguyen
Univercity of Information Techonology – Vietnam National University, HCM City
6 Quarter, Linh Trung Ward, Thu Duc District, Ho Chi Minh City
Email: tinhn@uit.edu.vn, tiendo.vn@gmail.com, cuongnp.uit.se@gmail.com
Abstract
In this paper we develop a system to build bibliography of
scientific papers by combining the use of Web Crawler to
index the article directly on the digital library and the
existing database indexes. This method would ensure that
data collected are accurate, updated and sufficient .
Building a tool to indexes scientific papers is a very
nesecessary and useful task for building searching
documents to meet the needs of users. Our system will
implement extract metadata of science paper on digital
libraries ACM1, CiteSeer2, IEEEXplore3 in combined with
the use of data available on DBLP4 to build bibliography
of scientific papers.
Keywords Digital Library, DBLP, Web Crawler.
I.
INTRODUCTION
Nowadays, most of scientific articles are published on digital
library and other shared forums. Among them, digital library
is the one storing the huge numbers of articles, papers and
they have been increasing every single year. When users need
to find a scientific article, they could find on many digital
libraries, as well as different available database indexes,
because the updated data of the different organizations or the
award exchange data between libraries of different institutions
is limited. In addition, updating new data of existing database
index is still limited. Thus, it is necessary to build a system so
as to collect information about articles from different digital
libraries in combined with using the available database index
to build inventory index articles to make sure that all data are
accurate, complete and updated.
According to the survey [1] [7], there were some applications
for collecting, storing index data of scientific papers as the
system was introduced in [2][4][8][9][12]; however, how to
ensure the data collected complete and updated is still a major
challenge.
In this paper, we develop a system to collect bibliography of
scientific paper by using Web Crawler to extract URL to
articles content driecly on digital libraries such as ACM,
IEEEXplore, CiteSeer based on key words which are author’s
name from DBLP data or subject in computer science
1 http://portal.acm.org/
2 http://citeseerx.ist.psu.edu/
3 http://ieeexplore.ieee.org/
4 http://dblp.uni-trier.de/
(Wikipedia5). Afterward from these results, we would use the
combined analysis together with set of rules defined to extract
metadata of each article. From gathered information, system
will combine the available data in DBLP – an bibliography
contained 1.5 million articles (as of January 2011) to build up
the bibliography of scientific articles. Thus, using Web
Crawler in the digital library system would ensure that all
articles have latest updated; at the same time, combination
with the available data in DBLP will provide fully accurate
data for the system.
In section 2, we briefly survey current related resreach and
system on data collection. In section 3 we will present our
approach, that is the system architecture for metadata
extraction from URL to artiscles contents and combined use of
database indexes available . Assessment and some
experimental results our system introduced in section 4. The
final section will be the conclusion and some discussion on
our system and future works.
II.
RELATED WORK
According to the survey [1] [3] [7], there are several sources
from which data can build a bibliography of articles including:
building bibliography from TOCs (tables of contents of
proceedings of conferences, journals) [8]; building
bibliography of articles by extraction metadata from the article
content (use postscript or PDF files of articles published
online) [2][11][12]; collect information articles on the Web
(on information sharing web sites or on author's
Web)[4][9][10].
In [8], DBLP (Digital Bibliography & Library Project)
provides bibliographic information on major computer science
journals and proceedings. DBLP data collected through the
analysis TOCs of proceedings of conferences, journals –
TOCs files are collected by organizations sending to authors
as well as our own collection. DBLP is open data, exported to
other formats such as CDF, XML and MySQL; developers can
download these files from the homepage of DBLP. Based on
data of DBLP, there are some search tools such as those
introduced in [5][6].
5
http://en.wikipedia.org/wiki/Computer_science
Number of
articles do
not exist in
DBLP
(%)
Number of
articles
published
before 2010
do not exist
in DBLP
Number of
articles
published at
2010 do not exist
in DBLP
III.
OUT APROACH
A. The system architecture.
Digital Library
ACM,
IEEE,
CiteSeer,
(%)
(%)
URL Results
URL Query
Database
28,33
86,26
10,71
Data
mining
43,67
77,45
14,51
Key Word
Collection
Module
List URL
Table 1: The experimental result for completeness of data DBLP
To evaluate the update and accuracy of DBLP, we were
searching articles on digital libraries with a topic keyword in
the field of computer science, and then check the existence of
received information in the DBLP data. In table 1, the average
results of 100 first papers on the three digital libraries ACM,
CiteSeer, IEEEXplore after searching for two keywords
“Database” and “Data mining”.
With these results, we found that data of DBLP do not
guarantee complete and accurate data because it is difficult to
collect all TOCs files from conferences, journals.
Our system would use data from DBLP to make use of the
available data; In addition, our systems offer a methodology
for collecting additional data and DBLP by crawling to get
bibliography of scientific papers directly from digital library.
In [2][12], authors introduce ACI (Autonomous Citation
Indexing), an index articles system for use in CiteSeer digital
libraries. ACI use a search engine to collect, download the
articles on the Internet with the PostScript or PDF file format,
ACI use the algorithms and analysis to extract metadata on
these files build up a database of the CiteSeer digital library.
Currently, for digital libraries such as ACM, IEEEXplore,
Spinger, and etc, downloading a paper is limited, so ACI can
not collect metadata of these articles, whereas the bibliography
of the article has been provided on digital libraries. On the
other hand, algorithms to extract metadata from content of
articles may be highly accurate and still is a topic being
studied in the field of data mining.
In [4][9][10], introduced some systems collect information
articles on the Web . These systems use algorithms and rules
to extract articles metadata on the Web, according to the
author, the accuracy of the data collected are not high.
Alternatively, this system has not made good use of
bibliography of articles and papers which had been marked by
digital libraries as well as existing.
Metadata
Extraction
Module
Metadata
Metadata
Import Module
Metadata articles
Key Word
Check
Duplicate
DBLP
BIBLOGRAPHY
OF SCIENTIFIC
PAPERS
Fig 1: The System architecture
In the Fig.1, we introduce the architecture of our system; input
is the keyword chosen from a list of subjects in computer
science or authors names taken from the DBLP. Basing
entered keywords, the system will Crawl the URL to articles
on digital libraries, the results from the digital library are list
URL to articles relevant to keyword search, metadata
extraction module uses the analysis, the combined use of predefined rules to identify and extraction metadata of articles
from URL. Check duplicate module will check the existence of
articles in DBLP, then the system can store the results
obtained to the database. Import module has function
connection and update data of DBLP.
B. Articles crawl on digital libraries.
System uses keywords to collect the path of articles (URL)
which contains articles having corresponding content on the
digital library.
To get results from the digital library, system combines
keyword with defined partten to create a URL query
For example, for keyword “Database” in the digital library
Citeseer.
URL query:
http://citeseerx.ist.psu.edu/search?q=database&feed=atom&s
ort=rel
Fig 2: The list of URL return from CiteSeer digital library with keyword
Database.
The system analyzes the HTML content returned from the
digital library, to collect a list of links that contain the article
in data of digital library.
C. Metadata Extraction
From the list of links to that the articles content, system will
analyze the HTML content of each article by using the
analysis HTMLParser6, SAXParser7 and pre-defined rules to
extract information including a summary (abstract) and
BibTeX file containing index information of the article.
After get the BibTeX file, system would use BibTex parser in
JabRef8 to get metadata of articles.
D. Combined use of database indexes available DBLP
For the data collected is complete, the system uses of available
index data for DBLP also. When a paper is collected on, the
system will use the title and year of publication to determine if
the article has already been in the database yet. After the
process of comparison, the system displays the statistical
parameters for user reference.
E. System Building biblography of scientific papers using
Web Crawler
We develop a system which builds bibllography of scientific
papers using Web Crawler. Our system manages database by
MySQL and it is built using Java that could run on any
platform such as Windows, Linux. The system can display
visual data to help users edit the information of the article or
add or remove data directly and interact well with users.
Fig 3. Detail of an article of the ACM Digital Library.
Bibtex uses a style-independent text-based file format for lists
of bibliography items, such as articles, books, and theses. In
digital libraries, metadata of paper export to a BibTeX file, so
metadata extraction from this will accurately. By using the
rules defined before, the BibTeX file would get from HTML
content of the article. The following are examples of rules to
get Bibtex file from ACM Digital Library.
ACMStartGetBibtex: (rule to get id of the article)
http://portal.acm.org/exportformats.cfm?id=
ACMEndGetBibtex : (rule to get file Bibtex)
&expformat=bibtex
ACMStartGetAbstract (rult to get abstract of article)
http://portal.acm.org/tab_abstract.cfm?id=
ACMEndGetAbstract: &usebody=tabbody
To evaluate the data collection system collects, we are done by
collecting the articles based on keyword topics in the field of
computer science. Here are statistical results are averaged over
100 articles collected on digital libraries:
Here's an example of a BibTeX file are get from the ACM
Digital Library
Digital
library
Fig 5: Screen System
IV.
EXPERIMENTAL EVALUATOIN
The number
of
new
article added
(%)
The number
of new articles
has published
before 2010
(%)
The number
of
new
articles
has
published at
2010
(%)
ACM
48
66,67
53.6
CiteSeer
33
72,41
65
IEEEXplore
54
12,96
74.5
Table 2. The Experimental result for data collection.
Fig 4. BibTeX file structure get from the ACM Digital Library
6
Number of new articles added by the number of articles exist
in the digital library but does not exist in the DBLP data.
According the experimental result above, system has updated
the information from the paper's new digital library, add
missing data in the DBLP, help build up a database to index
the article accurate, complete and updated.
http://htmlparser.sourceforge.net/
7http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXP
arser.html
8
http://jabref.sourceforge.net/
V.
CONCLUSION AND FUTURE WORK
In this paper, we have introduced a system to build and update
the data index scientific articles by the Web Crawler on digital
libraries. Thus comparing to other studies [8][4][9], we have
improved the accuracy and performance when crawling on
digital libraries. Differentiating with solutions that are
introduced in [2], our system utilizes data available from
DBLP.
In addition, the automatic update function using predefined
time system ensures a new update of articles published in
digital libraries.
Next step in the near future, we will complete system with
functions:
 Additional sources of data collection, including
another digital libraries and from the author's personal
page.
 Topics classification for scientific articles have been
collected based on the metadata of the article.
 Building search engine of scientific papers based on
data collected system
REFERENCES
[1] Badawia M. Albassuny. Automatic metadata generation applications: a
survey study. International Journal of Metadata, Semantics and Ontologies .
Volume 3, Number .pp 260 – 282. 4 / 2008.
[2] C.L. Giles, K. Bollacker, S. Lawrence. CiteSeer: An Automatic Citation
Indexing System.Digital Libraries 98. Third ACM Conf. Digital Libraries,
ACM Press. pp. 89-98. New York. 1998.
[3] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled F.
Shaalan. A Survey of Web Information Extraction Systems. IEEE Transactions
on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, Oct.
2006.
[4] G. Pant, K. Tsioutsiouliklis, J. Johnson, C.L. Giles. Panorama.
Extending Digital Libraries with Topical Crawlers. Proc. ACM/IEEE Joint
Conference on Digital Libraries (JCDL). Tuscon, AZ, USA. 2004.
[5] Holger Bast, Ingmar Weber. The CompleteSearch Engine: Interactive,
Efficient, and Towards IR&DB Integration. CIDR 2007: 3rd Biennial
Conference on Innovative Data Systems Research. pp. 88-95. Asilomar, CA,
USA. 2007.
[6] J. Diederich, W.-T. Balke. FacetedDBLP - Navigational Access for Digital
Libraries. Bulletin of IEEE Technical Committee on Digital Libraries,
Volume 4 Issue 1. ISSN 1937-7266. Spring 2008
[7] Jane Greenberg, Kristina Spurgin , Abe Crystal. Functionalities for
automatic metadata generation applications: a survey of metadata experts’
opinions . Int. J. Metadata, Semantics and Ontologies, Vol. 1, No. 1. 2006.
[8] Michael Ley. The DBLP Computer Science Bibliography: Evolution,
Research Issues. Perspective. Lecture Notes in Computer Science. Volume
2476/2002, 481-486. 2002.
[9] Rong Shi. Automatic metadata discovery from noncooperative digital
libraries. In Proc. of IADIS international Conf. on e-Society. . Lisbon,
Portugal. 2003.
[10] Roth, D.L. The emergence of competitors to the Science Citation Index
and the Web of Science. Current Science, Vol. 8. Pp. 1531 – 1536. 2005.
[11] Nicola Zeni, Nadzeya Kiyavitskaya, Luisa Mich, John Mylopoulos,
James Cordy. A Lightweight Approach to Semantic Annotation of Research
Papers. Natural Language Processing and Information Systems. pp. 61-72.
.Paris, France. 2007.
[12] Steve Lawrence, C. Lee Giles, Kurt Bollacker. Digital Libraries and
Autonomous Citation Indexing. Computer, vol. 32, no. 6, pp. 67-71. June
1999.
Download