Applying Key Phrase Extraction to Aid Invalidity Search
Manisha Verma, Vasudeva Varma
SIEL, LTRC, IIIT Hyderabad

Outline
• Introduction
• Related Work
• Motivation and Contribution
• Approaches
• Experiments and Results
• Future Work
• Questions

Introduction

Invalidity Search
• The task is to uncover patents or other published prior art that may render a granted patent invalid.
• Find prior art that the patent examiner overlooked so that a patent can be declared invalid.

Input and Process
• INPUT: a patent application.
• PROCESS: Use existing search engines to find similar work. Manually create queries, go through several documents (articles, granted patents, etc.) and find similar documents.

Related Work
Two ways of approaching the problem:
1. Create a query from a patent and try different retrieval models.
2. Use different models to create a query from a patent, then use an existing retrieval model.
Our work employs the second approach.

Approach 1
• Use the claim text or abstract to create a query from the patent.
• The following have been used to improve recall and precision:
  – Re-ranking using several features
  – Cluster-based pseudo-relevance feedback
  – Scoring based on subtopics, etc.

Approach 2
• Select words/phrases from different sections of a patent.
  – Select words from a patent using tf-idf.
  – Find out which section results in the best queries.
• Assign a weight to each word to mark its importance.
  – Common weighting methods explored are tf and tf-idf.
• Identify the optimal length of the query, i.e. the number of words to keep in a query generated from a patent.
  – Determine the value empirically.

Motivation and Contribution
• Explore and evaluate different ways to select phrases to make queries for patents.
• Though several key phrase extraction approaches have been proposed in the literature, they have not been used to create queries for the invalidity search task.
• Evaluate and analyze the performance of queries created using state-of-the-art unsupervised and supervised key phrase extraction techniques.

Approaches

Key Phrase Extraction Techniques
• Unsupervised
  – TextRank (R. Mihalcea et al.)
  – SingleRank (X. Wan et al.)
  – Tf-Idf
  – Tf
• Supervised
  – RankPhrase (X. Jiang et al.)
  – KEA (I. H. Witten et al.)

Unsupervised Approaches
• TextRank
  – Represent the text as a graph using co-occurrence statistics.
  – Run an iterative algorithm to find the dominant nodes (words) in the graph.
• SingleRank
  – Same approach as TextRank.
  – While in TextRank only phrases containing the top-ranked words are selected, in SingleRank we do not filter out any low-scoring words.

Supervised Approaches
• KEA
  – Use features to represent key phrases.
  – Train a classifier on manually annotated data.
• RankPhrase
  – Treat key phrase extraction as a ranking problem.
  – The same features as in KEA are used.

Training Supervised Approaches
• To annotate patents with key phrases, take some applications with relevance judgments. For every phrase in the document:
  – Fire it as a query.
  – Calculate the MAP and recall of that phrase (using the relevance judgments).
  – Select phrases with high MAP and recall.
  – Prune phrases based on tf-idf scores.
• Use these phrases as the key phrases for the document.
• Use some sample documents annotated with this approach to train the supervised approach.

Experiments and Results

Data
• 1.3 million patents (NTCIR)
• 1000 patent applications
• For each application, a list of patents that claim the same invention is provided.

Unsupervised vs Supervised

Performance on different sections

Results
• The experiments indicate that key phrase extraction techniques do improve invalidity search results.
• Queries created using the unsupervised and supervised approaches perform better than those formed using tf or tf-idf.
• Among the supervised approaches, queries created from phrases extracted by KEA show 29% and 37% improvement in MAP over TextRank and tf-idf respectively.
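
The MAP and recall figures used above, both for scoring candidate phrases during annotation and for reporting results, can be illustrated with a minimal sketch. It assumes the standard IR definitions of average precision and recall and a ranked list without duplicate document IDs; the slides do not show the evaluation code actually used, and the function names and patent IDs below are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list (assumes no duplicate IDs)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def recall(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear in the ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of average precision over (ranked list, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: one phrase fired as a query against the collection.
ranked_results = ["p3", "p1", "p7", "p2"]   # IDs returned by the search engine
judged_relevant = {"p1", "p2", "p9"}        # IDs from the relevance judgments

# Relevant hits at ranks 2 and 4: AP = (1/2 + 2/4) / 3 = 1/3; recall = 2/3.
print(average_precision(ranked_results, judged_relevant))
print(recall(ranked_results, judged_relevant))
```

For a single phrase fired as a query, the score is its average precision; MAP is the mean of that score over a set of queries, for example one query per patent application.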

Future Work
• Weight queries generated by both approaches.
• Try the approaches on different patent collections.
• Explore combinations of the two approaches for query construction.

References
• X. Xue and W. B. Croft. Automatic query generation for patent search. In CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2037–2040, NY, USA, 2009. ACM.
• R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proc. of EMNLP, 2004.
• X. Xue and W. B. Croft. Transforming patents into prior-art queries. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 808–809, NY, USA, 2009. ACM.
• X. Jiang, Y. Hu, and H. Li. A ranking approach to key phrase extraction. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 756–757, NY, USA, 2009. ACM.

Questions?