CiNet: GUI based Literature analysis tool using cited information. BioNetwork (Bis732) Professor: Dohoen Lee 20063399 Sejun Lee Abstract We have developed CiNet, a graphic user interface based tool that extracts trend of the research related to literature which we are interested in, allowing for graphical visualization, textual navigation, and topological analysis. Introduction The internet has basically changed the way we access information; it is now a routine process to use search engines, such as Google, and to follow hyperlinks rather than reach for a reference book (Schatz, 1997; Henzinger and Lawrence, 2004). Hyperlinking of related content in the internet makes retrieval of information associative and extremely effective. However, these progressive changes in the way we approach information are not followed in the tools available for scientists. Tracing hyperlink is also somewhat annoying task when we are searching cited paper. It needs more convenient way to access the literature. We need also whole picture of connected literature exploration in the internet. The World Wide Web, however, is so achieve an information resource, because it grows naturally in view of a subsequent retrieval of information (Barabasi and Albert, 1999;Huberman and Adamic, 1999). Scientific publications contain also references to other publications; however, in the heterogeneous world of publishing houses and policies, it is difficult to make this reference network available for navigation; not all publications are electronically and freely available and a common standard for linking is not in sight. We also have a problem to see whole picture of literature network. To tackle this problem, we proposed CiNet which is literature navigator tool using citation information and textmining approach (Fig 1) Figure 1 System overview Method 1. Construction CiNet with GUI Interface. The system architecture of CiNet is shown figure 2. This system has two parts. The first part is indicated blue rectangle in fig 2, gathering information part. The other part is indicated red rectangle in fig 2, extracting information using textmining part. Figure 2 System architecture Gathering information from Google CiNet collected literature information which is citing inputted literature by entered literature name. Then, this program extracted cited literature list from Google Scholar. It retrieved this information in real-time by provided literature name. Gathering abstracts from PubMed We extracted abstract from PubMed using previously founding Google citation information. 2. Extracting keyword using texmining approach All abstracts, which found previous step, were clustered using published year information. All abstracts in each cluster were then tokenized and then tagged with Brill's rule-based part-of-speech tagger (Brill 1992, 1994). The tagging process was done without training and the results of the tagging are used as-is. We produce a frequency list for each cluster. Normally, this would be a word frequency list. Then we calculated modified scoring method (Paul RAYSON and Roger GARSIDE) Abstarct 1 Abstract 2 Total Freq of word a b a+b Freq of other words c-a d-b c+d-a-b TOTAL C d c+d Figure 3 Example when two abstracts are in the cluster Note that the value ‘c’ corresponds to the number of words in abstract one, and ‘d’ corresponds to the number of words in abstract two(N values). The values ‘a’ and ‘b’ are called the observed values (O). We need to calculate the Expected values (E) according to the following formula: Ei N i Oi i N i i For example, figure 3, in our case N1 = c, and N2 = d. So, for this word, E1 = c*(a+b)/(c+d) and E2 = d*(a+b)/(c+d), and I = log(cited number). The calculation for the expected values takes account of the size of the two abstracts, so we do not need to normalize the figures before applying the formula. We can then calculate the log-likehood value according to this formula LL Ni * (Oi * log(Oi/Ei) * I) i The word frequency list will be then sorted by the resulting LL values. This gives the effect of placing the largest LL value at the top of the list representing the word which has the most significant relative frequency difference among the abstracts. Then we could find keyword in each cluster.. Result Figure 4 CiNet Figure 4 shows CiNet interface. CiNet has two parts to show information for user. The left part represents literature name as tree structure. Linked literature means they are connected from citation. The right part is the abstract of selected literature. This directory structure is very familiar with user. Therefore, this program is also easy to deal with. Furthermore, representing abstracts with literature name give more convenience to user. This literature is also sorted from cited number, means how many articles cited in. Therefore, we can easily find that upper one is more important than below at the same level in the tree. Word frequency test is under developing now.. Conclusion In this paper we present CiNet, a GUI tool that can be used to extract and visualize the information of literature from publications indexed by Google and PubMed. Distinguishing features of CiNet include its ability to generate graph and core abstract information using cited information based on a single query. Its version is still 0.4, but it will be updated. References Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science, 286, 509-512 Brill, E. (1992). A Simple Rule-based Part of Speech Tagger. In Proceedings of the Third Conference on Applied-Natural Language Processing ANLP-1992, 152-155. Brill, E. (1994). Some Advances in Transformation-Based Part of Speech Tagging. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 722727. Henzinger, M. and Lawrence, S. (2004) Extracting knowledge from the World Wide Web. Proc Natl Acad Sci USA, 101, 5186-5191 Huberman, B.A. and Adamic, L.A.(1999) Internet: Growth dynamics of the World Wide Web. Nature, 401,131 Paul RAYSON, Roger GARSIDE, Comparing Corpora using Frequency Profiling Schatz, B.R. (1997) Information retrieval in digital libraries: bringing search to the net. Science, 275, 327-334