The SIJ Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 1, No. 2, May-June 2013
ISSN: 2321 – 2381

Text Mining Interpreting Knowledge Discovery from Biomed Articles

K. Prabavathy* & Dr. P. Sumathi**

*Research Scholar in Computer Science, Manonmanium Sundaranar University, Tirunelveli Town, Tamil Nadu, INDIA. E-Mail: praba_bud@yahoo.co.in
**Doctoral Research Supervisor & Assistant Professor, Post Graduate & Research Department of Computer Science, Government Arts College, Coimbatore, Tamil Nadu, INDIA. E-Mail: sumathirajes@hotmail.com

Abstract: The biggest challenge for text and data mining in biomedical informatics is to influence the discovery process, enabling scientists to generate novel hypotheses that address the most crucial questions about the knowledge contained in biomedical articles and documents. However, a flexible and general approach for integrating heterogeneous data and knowledge sources for discovery remains elusive and depends heavily on the specific underlying scientific question. Our work takes up this target: interpreting knowledge discovery from biomedical articles. It is built on keyword search, Information Retrieval and Information Extraction to capture the knowledge held in articles from databases such as PubMed and Oxford Journals. The true impact of text and data mining is realized only if it goes beyond methods for extraction and indexing to enable understanding of the entities presented in articles and documents, such as Protein-Protein Interactions and relationships between genes and human diseases. The process modelled here acts as an underpinning for forming such a network.

Keywords: Gene Disease Relationship, Gene Network, Information Retrieval, PPI, Text Mining

Abbreviations: Information Extraction (IE), Information Indexing (II), Information Retrieval (IR), Natural Language Processing (NLP), Protein Protein Interaction (PPI), Text Classification (TC), Text Clustering (TCl)

I. INTRODUCTION
A computer, like a human, needs certain specialized knowledge in order to understand text. The scientific field dedicated to equipping computers with the right knowledge for this task is called Natural Language Processing (NLP). Biomedical text mining is the subfield that deals with text from biology, medicine, and chemistry [Salton, 1989; Aronson, 1996]. A central challenge of text mining in biomedical terminology is the dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) are created frequently and constantly [Kuramochi & Karypis, 2004]. Existing biomedical resources and ontologies also need constant updating [Tiffin et al., 2005]. These are some of the reasons for developing a new application that normalizes and finds structural information in articles from open access databases. Our work yields well-summarized information and newly discovered evidence.

II. A REVIEW OF RELATED WORK
There are a number of web-based text mining applications which can be used for discovering knowledge from articles. The pitfalls of existing applications stem from the size of the widely used databases, which has a negative impact on the relevance of users' query results; simple free-text queries also return many false positives. Additionally, when reading a document of interest, users can query for related documents. Query expansion or reformulation is used to improve retrieval of documents relevant to a free-text query or related to a document of interest. Although these applications are useful for exploring such information in the literature, few of them provide real-time responses: users often have to wait several minutes before they receive results. Some systems provide reasonably quick responses by limiting the number of documents analyzed to a very small number, but this limitation significantly reduces coverage [Lin et al., 2004].
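The query expansion idea mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the synonym table `SYNONYMS`, the helper names `expand_query` and `matches`, and the sample documents are all hypothetical, and only single-word synonyms are handled.

```python
import re

# Hypothetical synonym table; a real system would draw on a curated thesaurus.
SYNONYMS = {
    "tumor": ["tumour", "neoplasm"],
    "p53": ["tp53"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return one OR-group per query term: the term plus its known synonyms."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return [[term] + synonyms.get(term, []) for term in terms]


def matches(document, expanded_query):
    """True when every OR-group has at least one variant among the document tokens."""
    tokens = set(re.findall(r"[a-z0-9]+", document.lower()))
    return all(any(variant in tokens for variant in group) for group in expanded_query)


expanded = expand_query("p53 tumor")
docs = [
    "TP53 mutations are frequent in this neoplasm.",
    "The tumour suppressor p53 regulates apoptosis.",
    "Insulin signalling in muscle tissue.",
]
hits = [doc for doc in docs if matches(doc, expanded)]
```

Without expansion, the literal query "p53 tumor" would match neither of the first two documents; with the assumed synonym table both are retrieved, while the unrelated third document is still rejected.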
To complement existing applications, we develop a search mechanism that lays the groundwork for recovering the knowledge and hidden structured data in abstracts, articles, discussions, etc.

III. METHODS OF KNOWLEDGE DISCOVERY AND PATHWAYS
Curators struggle to process the scientific literature to discover facts and events crucial for gaining insights in the biosciences; this motivated the use of text mining to structure the huge number of articles [Kors et al., 2005; Maier et al., 2005].

© 2013 | Published by The Standard International Journals (The SIJ)

Rapid growth of literature data poses challenges for efficient extraction of information and for effective ways of querying it. Crucial applications in the biological field, such as named entity recognition of biological entities, gene normalization, protein-protein interaction, functional analysis of genes, and extraction of gene-disease associations, are the targets of our work [Manning & Schutze, 1999; Mao & Chu, 2002]. Five main tasks and supporting tasks are arranged, and their results show advances in the state of the art in the fine-grained biomedical domain. The key technologies and tasks grounding the structured information used in our work are:
Information Retrieval (IR)
Information Extraction (IE)
Information Indexing (II)
Text Classification (TC)
Text Clustering (TCl)

3.1. Information Retrieval
Information Retrieval is the process of recovering, from a collection of documents such as an open access database, those documents that satisfy a given information need. The information need is posed in the form of a flexible user query.

3.2. Information Extraction (IE)
IE refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources. It focuses on the collection, organization and application of information to answer questions.
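As a rough illustration of turning unstructured text into structured records, the following Python sketch extracts candidate gene-disease pairs by sentence-level co-occurrence against small dictionaries. Co-occurrence is a technique chosen here for the example and is not stated by the paper; the `GENES` and `DISEASES` dictionaries, the function names, and the sample abstract are all hypothetical.

```python
import re
from itertools import product

# Hypothetical dictionaries; real ones would come from curated gene and disease lexicons.
GENES = {"brca1", "tp53"}
DISEASES = {"breast cancer", "leukemia"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, dictionary):
    """Return dictionary terms that occur as whole words or phrases in the sentence."""
    lowered = sentence.lower()
    return sorted(t for t in dictionary if re.search(r"\b" + re.escape(t) + r"\b", lowered))


def extract_relations(text):
    """Emit a (gene, disease, sentence) record for every co-occurrence within one sentence."""
    records = []
    for sentence in split_sentences(text):
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        records.extend((g, d, sentence) for g, d in product(genes, diseases))
    return records


abstract = (
    "Mutations in BRCA1 increase the risk of breast cancer. "
    "TP53 is a tumour suppressor. "
    "Loss of TP53 has been reported in leukemia."
)
relations = extract_relations(abstract)
pairs = [(g, d) for g, d, _ in relations]
```

Each record keeps the supporting sentence alongside the entity pair, so a curator can check the evidence. Co-occurrence only signals a candidate relationship; it does not establish the type or direction of the association.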
The challenges faced are accuracy, running time, dynamically changing sources, data integration and extraction errors. Information extraction demonstrates that extraction methods successfully generalize in various aspects.

3.3. Information Indexing
Efficient indexing is required to reduce the vocabulary of terms and to simplify query formulation. Building an indexed document collection includes tokenization, stemming, and stop-word removal.

Figure 1 – Technologies of BioText Mining (labels: IR, IE, II, TC, TCl, Classify)

3.4. Text Classification
A common problem in information science is the assignment of an electronic document to one or more categories based on its contents. In supervised document classification, correctly labelled examples are provided and a classification model is learnt using techniques such as the naive Bayes classifier, latent semantic indexing, support vector machines, artificial neural networks, kNN, decision trees, and concept mining.

3.5. Text Clustering
Text clustering finds which documents have many words in common and places the documents with the most words in common into the same groups [Strehl et al., 2000]. It relies on the similarity of documents rather than the similarity of sequences, expression profiles or structures, and clusters documents into topics according to the keywords of the user query. A clustering program tries to find the groups in the data. Text clustering programs often first choose documents that seem representative of the middle of each cluster, and then compare all documents to these initial representatives. Similarity is based on how many words the documents have in common and how strongly those words are weighted. The topical terms of a cluster are chosen from words that represent the centre of the cluster. The best clustering is the one in which the average difference between the documents and their cluster centres is smallest [Varelas et al., 2005].

IV. FUNCTIONALITY OF SYSTEM
Each of the approaches has its own strengths and weaknesses, especially with regard to the sensitivity and specificity of the method. Here, a simple and finer-grained approach uses the technologies of biotext mining to extract key information such as Protein-Protein Interaction (PPI) and gene-disease relationships, and to structure the gene network [Hoffmann & Valencia, 2004; Sehgal & Srinivasan, 2006; Liu et al., 2006]. We use non-overlapping training and background sets, and test sets are processed using a leave-one-out validation procedure. The system acts as an integrated, tool-based environment for mining information from a given body of biomedical literature, together with a database for storing the mined information.

In the first phase, biomedical articles and documents are retrieved from open access databases such as Oxford Journals and PubMed. Since there are 22 million documents, effective Information Retrieval methods are needed. Here, rule-based induction methods are used to retrieve articles from the biomedical literature according to the user's requirements or given keywords.

The second phase is Information Extraction. Various query expansion or reformulation strategies have been proposed in the biomedical field: a user's free-text query defining the need for some information can be enriched with common synonyms or morphological variants, either from existing sources or automatically through keyword analysis.

In the third phase, Information Indexing, the analysed documents are indexed. The indexed articles and documents are then trained with a set of documents representing a topic of interest, drawn from generated thesauruses, by means of a well-known classification algorithm.

In the final phase, Text Clustering, every pair of documents is compared, and the pairs of documents that are most similar to each other are clustered together.
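The pairwise comparison in the final phase can be sketched in Python as follows. The paper does not name a similarity measure, so cosine similarity over term-frequency vectors is assumed here; the stop-word list, the function names, and the sample documents are illustrative only, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Small illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "with", "to", "by"}


def term_counts(text):
    """Tokenize, lowercase, and drop stop words; return term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar_pair(documents):
    """Compare every pair of documents; return (i, j, similarity) for the closest pair."""
    vectors = [term_counts(d) for d in documents]
    best = max(
        combinations(range(len(vectors)), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )
    return best[0], best[1], cosine(vectors[best[0]], vectors[best[1]])


docs = [
    "BRCA1 mutation and breast cancer risk",
    "Protein interaction network of yeast kinases",
    "Breast cancer risk with BRCA1 and BRCA2 mutation",
]
i, j, score = most_similar_pair(docs)
```

Here the first and third documents share five of their terms and are selected as the closest pair. Repeating this step, merging the closest pair each time, is the basis of agglomerative clustering, although comparing all pairs grows quadratically with the number of documents.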
Biological entities, including longer entities, identified in the clustered articles are linked and marked to entries in a biological database called the temp warehouse. The complexity arising from synonyms and acronyms, ambiguity, typographical variants, and symbols or IDs of entities is overcome by various dictionary-based techniques [Schuemie et al., 2007]. Ambiguity occurs between protein names, their protein family names, and genes. The diversity of word features, similarity with existing database entries, and the presence of trigger words are considered in assessing the accuracy of information extraction.

Figure 2 – Functionality of a System (labels: Keyword/Query, Articles from Db, Information Retrieval, Information Extraction, Identification, Information Indexing, Analysis, Filtering, Text Classification, Text Clustering, Grouping, Temp Warehouse, Extracted Output)

V. DISCUSSION AND CONCLUSION
The goal of our work is to provide an adapted system that gathers relevant and concise information from text, abstracts and articles. The retrieval of documents related to a single document can be significantly improved by the extracted output. The extracted output preserves the general design and goals of the previous event, but adds a new focus on variability to address a limitation of existing approaches. It is intended for biologists interested in adding text mining tools to their bioinformatics toolbox. It serves as a unique forum to discuss novel approaches to text and data mining methods that respond to specific scientific questions, enabling predictions that integrate a variety of data sources and can potentially impact scientific discovery.

REFERENCES
[1] G. Salton (1989), “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”, Addison-Wesley, Reading, MA.
[2] A.R. Aronson (1996), “The Effect of Textual Variation on Concept Based Information Retrieval”, Proceedings of AMIA Annu Fall Symp, Pp. 373.
[3] C. Manning & H. Schutze (1999), “Foundations of Statistical Natural Language Processing”, The MIT Press, Cambridge, MA.
[4] A. Strehl, J. Ghosh & R.J. Mooney (2000), “Impact of Similarity Measures on Webpage Clustering”, AAAI Workshop on AI for Web Search, Pp. 58–64.
[5] W. Mao & W.W. Chu (2002), “Free Text Medical Document Retrieval via Phrase-based Vector Space Model”, Proceedings of AMIA’02, San Antonio, TX.
[6] M. Kuramochi & G. Karypis (2004), “An Efficient Algorithm for Discovering Frequent Subgraphs”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 9.
[7] Simon M. Lin, Patrick McConnell, Kimberly F. Johnson & Jennifer Shoemaker (2004), “MedlineR: An Open Source Library in R for Medline Literature Data Mining”, Bioinformatics, Vol. 20, Pp. 3659–3661.
[8] R. Hoffmann & A. Valencia (2004), “A Gene Network for Navigating the Literature”, Nat Genet, Vol. 36, Pp. 664.
[9] N. Tiffin, J.F. Kelso, A.R. Powell, H. Pan, V.B. Bajic & W.A. Hide (2005), “Integration of Text- and Data Mining using Ontologies Successfully Selects Disease Gene Candidates”, Nucleic Acids Res, Vol. 33, No. 5, Pp. 1544–1552.
[10] J. Kors, M. Schuemie, B. Schijvenaars, M. Weeber & B. Mons (2005), “Combination of Genetic Databases for Improving Identification of Genes and Proteins in Text”, Biolink Conference.
[11] H. Maier, S. Döhr, K. Grote, S. O'Keeffe, T. Werner, M. Hrabé de Angelis & R. Schneider (2005), “LitMiner and WikiGene: Identifying Problem-Related Key Players of Gene Regulation using Publication Abstracts”, Nucleic Acids Res, Vol. 33, Pp. W779–W782.
[12] G. Varelas, E. Voutsakis, Euripides G.M. Petrakis, Evangelos E. Milios & P. Raftopoulou (2005), “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”, WIDM '05, New York: ACM Press, Pp. 10–16.
[13] A.K. Sehgal & P. Srinivasan (2006), “Retrieval with Gene Queries”, BMC Bioinformatics, 7:220.
[14] H. Liu, Z.Z. Hu, J. Zhang & C. Wu (2006), “BioThesaurus: A Web-based Thesaurus of Protein and Gene Names”, Bioinformatics, Vol. 22, Pp. 103–105.
[15] M.J. Schuemie, B. Mons, M. Weeber & J.A. Kors (2007), “Evaluation of Techniques for Increasing Recall in a Dictionary Approach to Gene and Protein Name Identification”, Journal of Biomedical Informatics, Vol. 40, No. 3, Pp. 316–324.

K. Prabavathy, M.Sc., M.Phil., is a Doctoral Research Scholar in the Department of Computer Science, Manonmanium Sundaranar University, Tirunelveli, Tamil Nadu, India. She completed her M.Phil. in the area of Data Mining and received an MCA degree from Bharathiar University, Coimbatore and an M.Sc. degree from Madurai Kamaraj University, Madurai. She has published a number of papers in reputed journals and conferences. She has about five years of teaching and research experience. Her areas of interest include Data Mining, Bioinformatics and Computer Networks.

Dr. P. Sumathi is working as an Assistant Professor in the PG & Research Department of Computer Science, Government Arts College, Coimbatore, Tamil Nadu, India. She received her Ph.D. in the area of Grid Computing from Bharathiar University. She completed her M.Phil. in the area of Software Engineering at Mother Teresa Women’s University and received her MCA degree at Kongu Engineering College, Perundurai. She has published a number of papers in reputed journals and conferences. She has about sixteen years of teaching and research experience. Her research interests include Data Mining, Grid Computing and Software Engineering.