Abstract - Bio-Information System Laboratory

advertisement
CiNet: GUI based Literature analysis tool using cited information.
BioNetwork (Bis732)
Professor: Dohoen Lee
20063399 Sejun Lee
Abstract
We have developed CiNet, a graphic user interface based tool that extracts trend of
the research related to literature which we are interested in, allowing for graphical
visualization, textual navigation, and topological analysis.
Introduction
The internet has basically changed the way we access information; it is now a routine
process to use search engines, such as Google, and to follow hyperlinks rather than
reach for a reference book (Schatz, 1997; Henzinger and Lawrence, 2004). Hyperlinking
of related content in the internet makes retrieval of information associative and
extremely effective. However, these progressive changes in the way we approach
information are not followed in the tools available for scientists. Tracing hyperlink is
also somewhat annoying task when we are searching cited paper. It needs more
convenient way to access the literature.
We need also whole picture of connected literature exploration in the internet. The
World Wide Web, however, is so achieve an information resource, because it grows
naturally in view of a subsequent retrieval of information (Barabasi and Albert,
1999;Huberman and Adamic, 1999). Scientific publications contain also references to
other publications; however, in the heterogeneous world of publishing houses and
policies, it is difficult to make this reference network available for navigation; not all
publications are electronically and freely available and a common standard for linking is
not in sight. We also have a problem to see whole picture of literature network.
To tackle this problem, we proposed CiNet which is literature navigator tool using
citation information and textmining approach (Fig 1)
Figure 1 System overview
Method
1. Construction CiNet with GUI Interface.
The system architecture of CiNet is shown figure 2. This system has two parts. The
first part is indicated blue rectangle in fig 2, gathering information part. The other part
is indicated red rectangle in fig 2, extracting information using textmining part.
Figure 2 System architecture
Gathering information from Google
CiNet collected literature information which is citing inputted literature by
entered literature name. Then, this program extracted cited literature list from
Google Scholar. It retrieved this information in real-time by provided literature
name.
Gathering abstracts from PubMed
We extracted abstract from PubMed using previously founding Google citation
information.
2. Extracting keyword using texmining approach
All abstracts, which found previous step, were clustered using published year
information. All abstracts in each cluster were then tokenized and then tagged with
Brill's rule-based part-of-speech tagger (Brill 1992, 1994). The tagging process
was done without training and the results of the tagging are used as-is.
We produce a frequency list for each cluster. Normally, this would be a word
frequency list. Then we calculated modified scoring method (Paul RAYSON and
Roger GARSIDE)
Abstarct 1
Abstract 2
Total
Freq of word
a
b
a+b
Freq of other words
c-a
d-b
c+d-a-b
TOTAL
C
d
c+d
Figure 3 Example when two abstracts are in the cluster
Note that the value ‘c’ corresponds to the number of words in abstract one, and
‘d’ corresponds to the number of words in abstract two(N values). The values ‘a’
and ‘b’ are called the observed values (O). We need to calculate the Expected
values (E) according to the following formula:
Ei 
N i  Oi
i
N
i
i
For example, figure 3, in our case N1 = c, and N2 = d. So, for this word, E1 =
c*(a+b)/(c+d) and E2 = d*(a+b)/(c+d), and I = log(cited number). The calculation
for the expected values takes account of the size of the two abstracts, so we do
not need to normalize the figures before applying the formula. We can then
calculate the log-likehood value according to this formula
LL  Ni *  (Oi * log(Oi/Ei) * I)
i
The word frequency list will be then sorted by the resulting LL values. This gives
the effect of placing the largest LL value at the top of the list representing the
word which has the most significant relative frequency difference among the
abstracts. Then we could find keyword in each cluster..
Result
Figure 4 CiNet
Figure 4 shows CiNet interface. CiNet has two parts to show information for user. The
left part represents literature name as tree structure. Linked literature means they are
connected from citation. The right part is the abstract of selected literature. This
directory structure is very familiar with user. Therefore, this program is also easy to
deal with. Furthermore, representing abstracts with literature name give more
convenience to user. This literature is also sorted from cited number, means how many
articles cited in. Therefore, we can easily find that upper one is more important than
below at the same level in the tree. Word frequency test is under developing now..
Conclusion
In this paper we present CiNet, a GUI tool that can be used to extract and visualize the
information
of
literature
from
publications
indexed
by
Google
and
PubMed.
Distinguishing features of CiNet include its ability to generate graph and core abstract
information using cited information based on a single query. Its version is still 0.4, but it
will be updated.
References
Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science,
286, 509-512
Brill, E. (1992). A Simple Rule-based Part of Speech Tagger. In Proceedings of the
Third Conference on Applied-Natural Language Processing ANLP-1992, 152-155.
Brill, E. (1994). Some Advances in Transformation-Based Part of Speech Tagging. In
Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 722727.
Henzinger, M. and Lawrence, S. (2004) Extracting knowledge from the World Wide Web.
Proc Natl Acad Sci USA, 101, 5186-5191
Huberman, B.A. and Adamic, L.A.(1999) Internet: Growth dynamics of the World Wide
Web. Nature, 401,131
Paul RAYSON, Roger GARSIDE, Comparing Corpora using Frequency Profiling
Schatz, B.R. (1997) Information retrieval in digital libraries: bringing search to the net.
Science, 275, 327-334
Download