International Journal of Application or Innovation in Engineering & Management (IJAIEM), Web Site: www.ijaiem.org, Email: editor@ijaiem.org, Volume 3, Issue 5, May 2014, ISSN 2319-4847

COMPUTING THE CONFIDENCE OF REDUNDANT EXTRACTED INFORMATION USING PTQL

B. Bharathi, Department of CSE, JNTUA College of Engineering, Ananthapuram, India

Abstract
Information extraction is a method to extract a particular kind of information from a large volume of text. In incremental information extraction, the intermediate output of the processed data is stored in a parse tree database in the form of parse trees. Whenever information needs to be extracted, extraction is performed on the previously processed data as well as on the new data generated by the component that was amended. Since updates to the parse tree database are not incremental, there is redundancy: information that is the same across documents is processed again, which takes a large amount of time. Therefore, an incremental parse tree database is used, which relies on hashing so that the state of a document is identified by the signatures generated for it. With the incremental parse tree database, redundancy is reduced compared to the parse tree database used earlier.

Keywords: Dependency parser, information extraction, information retrieval, query language.

1. INTRODUCTION
It is estimated that each year more than 600,000 articles are added to the biomedical literature, with close to 20 million publication records stored in the Medline database. Extracting information from such a large corpus of documents is very difficult, so it is important to perform information extraction automatically. Information Extraction (IE) is the process of extracting structured information from unstructured text. The incremental information extraction framework uses a database management system as an essential part; the database management system serves dynamic extraction needs better than file-based storage systems. Text processing components such as named entity recognizers and parsers are deployed over the entire text corpus. The intermediate output of each text processing component is stored in a relational database called the Parse Tree Database (PTDB). The queries used to retrieve information from the PTDB are written in the Parse Tree Query Language (PTQL). If the extraction goal is modified or a component is updated, only the responsible component is rerun over the text corpus and the newly processed data is loaded into the PTDB; retrieval of information is then performed on the affected data rather than on the entire text. Unlike the file-based pipeline approach, the incremental information extraction framework stores the intermediate processed output of each component, which avoids the need to reprocess the complete text corpus. Avoiding such reprocessing is significant for information extraction because it greatly reduces the extraction time.
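To make the stored-intermediate-output idea concrete, the following Python sketch keeps per-sentence component output in relational tables so that an improved component can be rerun without touching the rest of the pipeline. The schema (tables sentences and annotations) and the helper names are illustrative assumptions, not the actual PTDB schema of [17].

import sqlite3

# Illustrative schema only: one table for sentences, one for the
# intermediate output of each text processing component.
conn = sqlite3.connect("ptdb_sketch.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sentences (
    doc_id TEXT, sent_id INTEGER, text TEXT,
    PRIMARY KEY (doc_id, sent_id));
CREATE TABLE IF NOT EXISTS annotations (
    doc_id TEXT, sent_id INTEGER, component TEXT, output TEXT);
""")

def store_component_output(doc_id, sent_id, component, output):
    # Keep the intermediate output of one component (e.g., POS tags, entities).
    conn.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)",
                 (doc_id, sent_id, component, output))

def refresh_component(component, rerun):
    # When a single component is improved, regenerate only its rows
    # instead of reprocessing the whole corpus with every component.
    conn.execute("DELETE FROM annotations WHERE component = ?", (component,))
    for doc_id, sent_id, text in conn.execute(
            "SELECT doc_id, sent_id, text FROM sentences"):
        store_component_output(doc_id, sent_id, component, rerun(text))
    conn.commit()

Extraction queries (PTQL in the actual framework, plain SQL in this sketch) then read from these tables rather than re-parsing the raw documents.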
2. RELATED WORK
Information extraction has been one of the most important research areas over the past several years, with the aim of improving the precision of extraction systems. This section describes how the incremental information extraction framework differs from conventional IE systems and rule-based IE systems.

2.1 Traditional IE Approaches
The popular file-based frameworks UIMA [1] and GATE [2] have the capability to integrate various NLP components for IE, but they do not store the intermediate processed data. The QXtract [5] and Snowball [6] systems use an RDBMS to store and query the extracted results. The Cimple [7] and SystemT [8] systems use join operations in an RDBMS to combine extracted results stored in different database tables. None of these frameworks stores intermediate processed data, so if a component is improved or the extraction goal is changed, all components have to be rerun from the beginning, which consumes additional time and has a high computational cost. To reduce this cost, the most common approach, document filtering, is used: the filtered documents, called promising documents, are used in [5], [9], and [10]. Extraction is performed on these promising documents and the relevant results are retrieved from them. This filtering approach is based on sentences selected individually according to lexical clues, which are provided through the parse tree query language, and the filtering process exploits the efficiency of IR engines.

2.2 Rule-Based IE Approaches
Rule-based IE approaches are used in [11], [12], [13], and [14]. The Avatar system [14] uses the AQL query language, which can perform IE tasks by matching regular expressions, but this query language does not support traversal of parse trees. The DIAL [12], TLM [13], and KnowItNow [15] systems are based on relation extraction and use their own query languages; however, these query languages only support querying data from shallow parsing and do not provide the ability to perform extraction using rich grammatical structures. Cimple [7], SystemT [8], and Xlog [16] use declarative languages: join operations are performed in the RDBMS and rules are then applied to combine the different extracted results. However, these rules are not capable of querying parse trees. MEDIE [11] stores parse trees in a database and provides a query language for extraction over these parse trees; its XML-like query languages, similar to XPath [3] and XQuery [4], are based on the head-driven phrase structure grammar (HPSG) formalism, and link types cannot be expressed in these query languages as they can in PTQL.

3. OVERVIEW OF PROPOSED SYSTEM
The proposed model, "Computing the confidence of redundant extracted information using PTQL", uses an incremental parse tree database (IPTDB), which reduces processing time to a large extent compared with the parse tree database. In the parse tree database, whenever a new goal is added, extraction is applied to the previously processed data as well as to the updated data generated by the component, even when a document contains the same information that is already present in the database. For this reason the incremental parse tree database is chosen.
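As a small illustration of this document-level check, the sketch below skips documents whose content signature already matches the stored one. The use of SHA-256 and an in-memory dictionary are assumptions made for brevity; the proposed system keeps the signatures in the database itself.

import hashlib

stored_signatures = {}  # doc_id -> signature of the already processed text

def needs_reprocessing(doc_id, text):
    # Return True only if the document is new or its content has changed.
    sig = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if stored_signatures.get(doc_id) == sig:
        return False   # identical content: reuse the stored parse trees
    stored_signatures[doc_id] = sig
    return True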
3.1 Incremental Parse Tree Database and Index
In the incremental parse tree database, whenever a document is given, the text is first split into sentences and then into words, and parts of speech are assigned to the words. A parse tree is then constructed for the document. Lucene is used for this process; a detailed description of Lucene is given in [18]. After the text is analyzed, extraction takes place by issuing queries to the database. In the parse tree database [17], even if a newly added document contains the same information that is already present in the database, the information is updated again, because whenever a new goal is added the parse tree database updates everything. In the incremental parse tree database, only the information directly related to the given goal is updated. In the IPTDB, a parse tree is constructed for each sentence of a document and a hash value is calculated for the whole tree. Whenever a new document is added, the hash values are compared. If the hash values are equal, the documents are assumed to be the same, the value is not recomputed, and there is no need to scan them. If the hash values differ, the sub-nodes are checked, since the change must lie in one of the sub-nodes of the tree: the hash values of the sub-nodes are calculated to locate where the difference occurs, and only that value is updated. Because of this incremental parsing, incremental indexing also takes place.

3.1.1 Incremental Indexing
The incremental parse tree database uses hashing to index documents. During incremental indexing, hash values are stored for each document. Incremental indexing skips documents that are already indexed, because the state of a document is identified by the signatures generated through hashing.

3.1.2 Hashing
Hashing is a method to transform a string of words into a usually shorter, fixed-length value. It is used to retrieve items by their hashed key. Since there is a single hashed key, the search time is lower than in the earlier process. To search for any name, the hash value is computed first and compared for a match. The hashed value can be used to store a reference to the original data and to retrieve it when needed. The algorithm for computing the hash values of the nodes can be stated as follows.

3.1.3 Algorithm
Step 1: Initialize a document.
Step 2: Generate a parse tree for each sentence in the document.
Step 3: Generate hash values for the parse tree.
Step 4: Initialize the index with the generated hash values.
Step 5: If the hash value of a sub-node differs, scan that node.
Step 6: Continue the process until all the sub-nodes are calculated.
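A minimal Python sketch of Steps 2-6 follows: each node's hash is computed from its label and the hashes of its children, so two versions of a parse tree can be compared top-down and only the differing sub-nodes need to be re-indexed. The Node class and the choice of SHA-256 are assumptions for illustration, not the implementation of [17].

import hashlib

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.hash = None

def compute_hashes(node):
    # Step 3: hash each node bottom-up from its label and its child hashes.
    child_hashes = [compute_hashes(c) for c in node.children]
    data = node.label + "|" + "|".join(child_hashes)
    node.hash = hashlib.sha256(data.encode("utf-8")).hexdigest()
    return node.hash

def changed_subtrees(old, new):
    # Steps 5-6: descend only where the hash values differ.
    if old.hash == new.hash:
        return []                      # identical subtree: nothing to update
    if len(old.children) != len(new.children):
        return [new]                   # structure changed: re-index this subtree
    diffs = []
    for o, n in zip(old.children, new.children):
        diffs.extend(changed_subtrees(o, n))
    return diffs or [new]              # the difference is at this node itself

After compute_hashes has been run on both the stored tree and the newly parsed tree, changed_subtrees(old_root, new_root) returns only the sub-trees whose index entries need updating, which is what makes the indexing incremental.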
The architecture of the incremental parse tree database is shown in Figure 3.1.

Figure 3.1 System Architecture of the PTQL framework

4. Performance Analysis
The performance of the IPTDB (Incremental Parse Tree Database) is compared with that of the PTDB (Parse Tree Database). For the experiments, corpora of documents of different sizes are considered, and updates are applied to the document source by adding new documents and changing existing documents. The performance advantage of the IPTDB over the PTDB is observed while indexing the tree database over 20 iterations: the time taken by the repeated indexing process is considerably lower in the IPTDB than in the PTDB. The performance advantage is shown in Figure 4.1.

Figure 4.1 IPTDB performance over PTDB under update index time

5. Conclusion
Existing extraction frameworks do not provide the ability to manage intermediate processed information. This leads to unnecessary reprocessing of the entire text collection when the extraction goal is modified or improved, which can be computationally expensive and time consuming. To reduce this reprocessing time, the intermediate processed data is stored in a database, as in the proposed framework; the database stores the data in the form of parse trees. To extract information from these parse trees, the extraction goal written by the user as natural language text is transformed into PTQL, and extraction is then executed on the text corpus. This incremental extraction approach saves much more time than performing extraction by first processing each sentence one at a time with linguistic parsers and then with additional components.

REFERENCES
[1] D. Ferrucci and A. Lally, "UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment," Natural Language Eng., vol. 10, nos. 3/4, pp. 327-348, 2004.
[2] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications," Proc. 40th Ann. Meeting of the ACL, 2002.
[3] J. Clark and S. DeRose, "XML Path Language (XPath)," http://www.w3.org/TR/xpath, Nov. 1999.
[4] "XQuery 1.0: An XML Query Language," http://www.w3.org/XML/Query, June 2001.
[5] E. Agichtein and L. Gravano, "Querying Text Databases for Efficient Information Extraction," Proc. Int'l Conf. Data Eng. (ICDE), pp. 113-124, 2003.
[6] E. Agichtein and L. Gravano, "Snowball: Extracting Relations from Large Plain-Text Collections," Proc. Fifth ACM Conf. Digital Libraries, pp. 85-94, 2000.
[7] A. Doan, J.F. Naughton, R. Ramakrishnan, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao, C. Gokhale, J. Huang, W. Shen, and B.-Q. Vuong, "Information Extraction Challenges in Managing Unstructured Data," ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20, 2008.
[8] R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu, "SystemT: A System for Declarative Information Extraction," ACM SIGMOD Record, vol. 37, no. 4, pp. 7-13, 2009.
[9] P.G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, "Towards a Query Optimizer for Text-Centric Tasks," ACM Trans. Database Systems, vol. 32, no. 4, p. 21, 2007.
[10] A. Jain, A. Doan, and L. Gravano, "Optimizing SQL Queries over Text Databases," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), pp. 636-645, 2008.
[11] Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii, "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases," Proc. 21st Int'l Conf. Computational Linguistics and the 44th Ann. Meeting of the Assoc. for Computational Linguistics (ACL '06), pp. 1017-1024, 2006.
[12] R. Feldman, Y. Regev, E. Hurvitz, and M.
Finkelstein-Landau, "Mining the Biomedical Literature Using Semantic Analysis and Natural Language Processing Techniques," Information Technology in Drug Discovery Today, vol. 1, no. 2, pp. 69-80, 2003.
[13] F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan, "An Algebraic Approach to Rule-Based Information Extraction," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), 2008.
[14] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 14:1-14:53, June 2010.
[15] M. Cafarella, D. Downey, S. Soderland, and O. Etzioni, "KnowItNow: Fast, Scalable Information Extraction from the Web," Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pp. 563-570, 2005.
[16] W. Shen, A. Doan, J.F. Naughton, and R. Ramakrishnan, "Declarative Information Extraction Using Datalog with Embedded Extraction Predicates," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 1033-1044, 2007.
[17] Luis Tari, Phan Huy Tu, Jorg Hakenberg, Yi Chen, Tran Cao Son, Graciela Gonzalez, and Chitta Baral, "Incremental Information Extraction Using Relational Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 24, January 2012.
[18] http://lucene.apache.org

AUTHOR
Boru Bharathi received the B.Tech and M.Tech degrees in computer science and engineering from JNTUA College of Engineering, Ananthapuram, in 2011 and 2013 respectively. Research interests include natural language processing and information retrieval.