International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 3, Issue 5, May 2014
ISSN 2319 - 4847
COMPUTING THE CONFIDENCE OF
REDUNDANT EXTRACTED INFORMATION
USING PTQL
B. Bharathi
Department of CSE, JNTUA College of Engineering, Ananthapuram, India
Abstract
Information extraction is a method of extracting particular kinds of information from large volumes of text. In incremental information extraction, the intermediate output of the processed data is stored in a parse tree database. Whenever information needs to be extracted, extraction is performed on the previously processed data as well as on the data regenerated by the component that was modified. Because updates to the parse tree database are not incremental, it redundantly reprocesses information that is identical across documents, which takes a large amount of time. An incremental parse tree database is therefore chosen; it uses hashing, and the state of each document is identified by the signatures generated for it. By using the incremental parse tree database, redundancy is reduced compared with the parse tree database that was used earlier.
Keywords- Dependency parser, information extraction, information retrieval, query language.
1. INTRODUCTION
It is estimated that more than 600,000 articles are published in the biomedical literature each year, with close to 20 million publication records stored in the Medline database. Extracting information from such a large corpus of documents manually is very difficult, so it is important to automate the extraction of information. Information Extraction (IE) is the process of extracting structured information from unstructured text.
The incremental information extraction framework uses a database management system as an essential part. A database management system serves dynamic extraction needs better than file-based storage systems. Text processing components such as named entity recognizers and parsers are deployed over the entire text corpus. The intermediate output of each text processing component is stored in a relational database called the Parse Tree Database (PTDB). Database queries used to retrieve information from the PTDB are written in the Parse Tree Query Language (PTQL).
If the extraction goal is modified or a component is updated, only the responsible component is rerun over the text corpus and its processed data is loaded into the PTDB. Retrieval of information is then performed on the affected data rather than on the entire text. Unlike the file-based pipeline approach, the incremental information extraction framework stores the intermediate processed output of each component, which avoids the need to reprocess the complete text corpus. Avoiding such reprocessing is significant for information extraction because it greatly reduces extraction time.
2. RELATED WORK
Information extraction has been one of the most important research areas over the past several years, with the goal of improving the precision of extraction systems. This section describes how the incremental information extraction framework differs from traditional IE systems and rule-based IE systems.
2.1 Traditional IE Approaches
Popular file-based IE frameworks such as UIMA [1] and GATE [2] have the capability to integrate various NLP components for IE, but these frameworks do not store the intermediate processing data. Systems such as QXtract [5] and Snowball [6] use an RDBMS to store and query the extracted results. Cimple [7] and SystemT [8] use join operations in the RDBMS to combine results that are stored in different database tables. None of these frameworks stores intermediate processed data, so if a component is improved or the extraction goal is changed, all components have to be rerun from the beginning, which takes additional time and has a high computational cost. To reduce this cost, the most common approach is document filtering, and the filtered documents, called promising documents, are used in [5], [9], and [10]. Extraction is performed on those promising documents and the relevant results are retrieved from them. This filtering approach selects individual sentences based on lexical clues, which can be provided through the parse tree query language, and the filtering process also exploits the efficiency of IR engines.
2.2 Rule-Based IE Approaches
Rule-based IE approaches are used in [11], [12], [13], and [14]. The Avatar system [14] uses the AQL query language, which can perform IE tasks by matching regular expressions, but this query language does not support traversal over parse trees. The DIAL [12], TLM [13], and KnowItNow [15] systems are based on relation extraction and use their own query languages, but these query languages only support querying data produced by shallow parsing; they do not provide the ability to perform extraction over rich grammatical structures. Cimple [7], SystemT [8], and Xlog [16] use declarative languages: join operations are performed in the RDBMS and rules are applied to combine the different extracted results. However, these rules are not capable of querying parse trees. MEDIE [11] stores parse trees, which are based on head-driven phrase structure grammar (HPSG), in a database and provides a query language for extraction over these parse trees. XML query languages such as XPath [3] and XQuery [4] can query such tree structures, but link types cannot be expressed in these query languages as they can in PTQL.
3. OVERVIEW OF PROPOSED SYSTEM
The proposed model, "Computing the confidence of redundant extracted information using PTQL," consists of an incremental parse tree database (IPTDB), which reduces processing time to a large extent compared with the parse tree database. The parse tree database performs redundant work: whenever a new goal is added, extraction is applied to the previously processed data as well as to the updated data generated by the component, even if a document contains the same information that is already present in the database. The incremental parse tree database is therefore chosen.
3.1 Incremental Parse Tree Database and Index
In the incremental parse tree database, whenever a document is supplied, the text is first split into sentences and then into words, and parts of speech are assigned to the words. A parse tree is constructed for each sentence of the document, and Lucene is used to index the processed text. A detailed description of Lucene is given in [18].
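The paper gives no code for this step; the following minimal Java sketch shows how a processed sentence and its parse-tree hash signature might be added to a Lucene index. It assumes Lucene 5 or later, and the field names ("sentence", "treeHash", "docId"), the index path, and the sample values are illustrative, not taken from the paper.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SentenceIndexer {
    public static void main(String[] args) throws Exception {
        // Open (or create) the index directory; "iptdb-index" is an illustrative path.
        FSDirectory dir = FSDirectory.open(Paths.get("iptdb-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // One Lucene document per sentence: the raw text plus the hash
            // signature of its parse tree (computed by a separate step).
            Document doc = new Document();
            doc.add(new TextField("sentence", "p53 interacts with MDM2.", Field.Store.YES));
            doc.add(new StringField("treeHash", "3f2a9cd41b07e6aa", Field.Store.YES));
            doc.add(new StringField("docId", "doc-0001", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```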
After the text is analyzed, extraction takes place by issuing queries against the database. In the parse tree database [17], when a document is added, the information is updated even if the document contains the same information that is already present in the database, because whenever a new goal is added the parse tree database updates everything. The incremental parse tree database, in contrast, updates only the information that is directly related to the given goal.
In the IPTDB, a parse tree is constructed for each sentence of a document, and a hash value is calculated for the whole tree. Whenever a new document is added, the hash values are compared. If the hash values match, the sentences are assumed to be identical, the value is not recomputed, and there is no need to scan them again.
If the hash values do not match, the sub-nodes are checked, because the difference must lie in one of the sub-trees. The hash values of all the sub-nodes are calculated to find where the difference occurs, and only that value is updated.
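As an illustration of this sub-node comparison, the Java sketch below walks two hashed parse trees and collects the nodes of the new tree whose sub-trees differ, so that only those parts need to be updated. The TreeNode class and the method name findChangedNodes are assumptions made for this sketch; they are not defined in the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parse-tree node: each node caches a hash computed from its
// label and its children's hashes, so equal hashes imply equal sub-trees.
class TreeNode {
    String label;
    List<TreeNode> children = new ArrayList<>();
    String hash;
}

public class TreeDiff {
    // Collect the nodes of the new tree whose sub-trees differ from the old tree.
    static void findChangedNodes(TreeNode oldNode, TreeNode newNode, List<TreeNode> changed) {
        if (oldNode != null && oldNode.hash.equals(newNode.hash)) {
            return; // identical sub-trees: nothing to rescan or update
        }
        if (oldNode == null
                || !oldNode.label.equals(newNode.label)
                || oldNode.children.size() != newNode.children.size()) {
            changed.add(newNode); // this node itself changed: replace its whole sub-tree
            return;
        }
        // Same label and shape but a different hash: the change lies in the children.
        for (int i = 0; i < newNode.children.size(); i++) {
            findChangedNodes(oldNode.children.get(i), newNode.children.get(i), changed);
        }
    }
}
```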
Because parsing is incremental in this way, indexing also becomes incremental.
3.1.1 Incremental Indexing
The incremental parse tree database uses hashing to index documents. During incremental indexing, the hash values of each document are stored, and documents that have already been indexed are skipped, because the state of a document is identified by the signature generated from its hashes.
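A minimal sketch of this signature check is shown below, with an in-memory map standing in for the stored signatures; the class and method names are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Signature-based incremental indexing: a document is (re)indexed only when
// its newly computed signature differs from the stored one.
public class IncrementalIndexer {
    private final Map<String, String> storedSignatures = new HashMap<>();

    public boolean needsIndexing(String docId, String newSignature) {
        String old = storedSignatures.get(docId);
        if (newSignature.equals(old)) {
            return false;                              // unchanged document: skip it
        }
        storedSignatures.put(docId, newSignature);     // remember the new state
        return true;                                   // new or changed document: index it
    }
}
```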
3.1.2 Hashing
Hashing is a method of transforming a string of words into a usually shorter, fixed-length value, and items are retrieved using the hashed key. Because lookup uses a single hashed key, the search time is shorter than in the earlier process. To search for a name, the hash value is computed first and compared against the stored values for a match. The hashed value can also store a reference to the original data, which is retrieved when it is needed. The algorithm for computing the hash values of the nodes can be represented as follows.
3.1.3 Algorithm
Step 1: Initialize a document.
Step 2: Generate a parse tree for each sentence in the document.
Step 3: Generate hash values for the parse tree.
Step 4: Initialize the index with the generated hash values.
Step 5: If the hash value of a sub-node differs, scan that differing node.
Step 6: Continue the process until all the sub-nodes have been processed.
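One possible realization of Steps 2 through 4, together with the hash caching needed for Steps 5 and 6, is sketched below in Java. The ParseNode class, the choice of SHA-256, and the method names are assumptions for illustration only; the paper does not specify a particular hash function or data structure.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative parse-tree node used only in this sketch.
class ParseNode {
    String label;
    List<ParseNode> children = new ArrayList<>();
    String hash; // filled in bottom-up by hashNode()
}

public class ParseTreeHasher {
    // Steps 2-3: hash every node bottom-up; a node's hash covers its label and
    // its children's hashes, so the root hash summarizes the whole sentence.
    static String hashNode(ParseNode node) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(node.label.getBytes(StandardCharsets.UTF_8));
        for (ParseNode child : node.children) {
            md.update(hashNode(child).getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        node.hash = hex.toString(); // cached for the sub-node comparison in Steps 5-6
        return node.hash;
    }

    // Step 4: initialize the index with the root hash of each sentence's parse tree.
    static Map<Integer, String> buildIndex(List<ParseNode> sentenceTrees) throws Exception {
        Map<Integer, String> index = new HashMap<>();
        for (int i = 0; i < sentenceTrees.size(); i++) {
            index.put(i, hashNode(sentenceTrees.get(i)));
        }
        return index;
    }
}
```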
The architecture of the incremental parse tree database is shown in Figure 3.1.
Figure 3.1 System Architecture of PTQL framework
4. PERFORMANCE ANALYSIS
The performance of the IPTDB (Incremental Parse Tree Database) is compared with that of the PTDB (Parse Tree Database). For the experiments, a corpus of documents of different sizes is considered, and updates are applied to the document source by adding new documents and changing existing documents. The performance advantage of the IPTDB over the PTDB is observed when indexing the tree database over 20 iterations: the time taken for the repeated indexing process is considerably lower for the IPTDB than for the PTDB. The performance advantage is shown in Figure 4.1.
Figure 4.1 IPTDB performance compared with PTDB in terms of index update time
5. CONCLUSION
Existing extraction frameworks do not provide the ability to manage intermediate processed information. This leads to avoidable reprocessing of the entire text collection whenever the extraction goal is modified or a component is improved, which can be computationally expensive and time consuming. To reduce this reprocessing time, the intermediate processed data is stored in a database, as in the original framework. The database takes the form of parse trees. To extract information from these parse trees, the extraction goal written by the user in natural language text is transformed into PTQL, and extraction is then executed over the text corpus. This incremental extraction approach saves much more time than performing extraction by first processing each sentence one at a time with linguistic parsers and then with additional components.
REFERENCES
[1] D. Ferrucci and A. Lally, "UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment," Natural Language Eng., vol. 10, nos. 3/4, pp. 327-348, 2004.
[2] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications," Proc. 40th Ann. Meeting of the ACL, 2002.
[3] J. Clark and S. DeRose, "XML Path Language (XPath)," http://www.w3.org/TR/xpath, Nov. 1999.
[4] "XQuery 1.0: An XML Query Language," http://www.w3.org/XML/Query, June 2001.
[5] E. Agichtein and L. Gravano, "Querying Text Databases for Efficient Information Extraction," Proc. Int'l Conf. Data Eng. (ICDE), pp. 113-124, 2003.
[6] E. Agichtein and L. Gravano, "Snowball: Extracting Relations from Large Plain-Text Collections," Proc. Fifth ACM Conf. Digital Libraries, pp. 85-94, 2000.
[7] A. Doan, J.F. Naughton, R. Ramakrishnan, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao, C. Gokhale, J. Huang, W. Shen, and B.-Q. Vuong, "Information Extraction Challenges in Managing Unstructured Data," ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20, 2008.
[8] R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, and H. Zhu, "SystemT: A System for Declarative Information Extraction," ACM SIGMOD Record, vol. 37, no. 4, pp. 7-13, 2009.
[9] P.G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, "Towards a Query Optimizer for Text-Centric Tasks," ACM Trans. Database Systems, vol. 32, no. 4, p. 21, 2007.
[10] A. Jain, A. Doan, and L. Gravano, "Optimizing SQL Queries over Text Databases," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), pp. 636-645, 2008.
[11] Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii, "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases," Proc. 21st Int'l Conf. Computational Linguistics and the 44th Ann. Meeting of the Assoc. for Computational Linguistics (ACL '06), pp. 1017-1024, 2006.
[12] R. Feldman, Y. Regev, E. Hurvitz, and M. Finkelstein-Landau, "Mining the Biomedical Literature Using Semantic Analysis and Natural Language Processing Techniques," Information Technology in Drug Discovery Today, vol. 1, no. 2, pp. 69-80, 2003.
[13] F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan, "An Algebraic Approach to Rule-Based Information Extraction," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE '08), 2008.
[14] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 14:1-14:53, June 2010.
[15] M. Cafarella, D. Downey, S. Soderland, and O. Etzioni, "KnowItNow: Fast, Scalable Information Extraction from the Web," Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pp. 563-570, 2005.
[16] W. Shen, A. Doan, J.F. Naughton, and R. Ramakrishnan, "Declarative Information Extraction Using Datalog with Embedded Extraction Predicates," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 1033-1044, 2007.
[17] L. Tari, P.H. Tu, J. Hakenberg, Y. Chen, T.C. Son, G. Gonzalez, and C. Baral, "Incremental Information Extraction Using Relational Databases," IEEE Trans. Knowledge and Data Eng., vol. 24, Jan. 2012.
[18] http://lucene.apache.org
AUTHOR
Boru Bharathi received the B.Tech. and M.Tech. degrees in computer science and engineering from JNTUA College of Engineering, Ananthapuram, in 2011 and 2013, respectively. Research interests include natural language processing and information retrieval.