Developing a Comprehensive Patent-Related Information Retrieval Tool

Abstract

There is an explosive growth of regulatory and related information now available online. This paper reviews the current state of practice in accessing patent-related documents in the form of patents, government regulations, court cases, scientific publications, etc. It proposes an ontology-based framework to retrieve documents from multiple heterogeneous databases. A use case, erythropoietin, is developed to test the framework. A corpus of 135 patents and 30 court cases closely related to the use case is built. Methods are discussed to improve the results obtained through the use of bio-ontologies for document retrieval in the two databases. The challenges faced with respect to the integration of these documents and future plans are briefly discussed.

1. Introduction

Science and technology have long played a role in the regulatory state in our system of government. Administrative agencies such as the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the U.S. Patent and Trademark Office (USPTO), the Nuclear Regulatory Commission (NRC), the Federal Communications Commission (FCC), and the like deal with various science and technology issues. The agencies promulgate regulations that appear in the relevant chapters of the Code of Federal Regulations (CFR), and they interpret these regulations and the applicable United States Code. In addition, the courts (often federal courts) interpret the relevant U.S. statutes and federal regulations (CFR). Moreover, there is often a need to consult additional literature in the form of technical/scientific publications.

Let us consider a few examples. If a company wanted to study the market for acid reflux drugs, it may choose to go to the FDA web site, look for court cases involving these drugs, and also study some relevant technical publications.
Similarly, a start-up company looking to work on therapeutics in the breast cancer space may choose to study patents in this field, determine whether any of those patents were litigated, and review the applicable scientific and technological literature. In each situation, we have a common problem. There is relevant information that must be accessed and which is available in different information domains. In addition, even within one domain, the information may not be easily accessible and searchable. Broadly speaking, we have information on a particular topic in (a) an administrative agency; (b) the court system; (c) the relevant laws and regulations; and (d) other literature such as scientific publications. In order to develop a basic understanding of a topic, a user would desire the ability to search, collate and analyze information across all these information domains.

The past years have seen tremendous advancement and expansion in research and development in various fields of science and technology. New fields continue to emerge and have opened up more opportunities. In 2009 alone, 485,312 patent applications were filed at the USPTO. This implies a tremendous increase in the information available and hence the necessity for knowledge bases which can efficiently manage such explosive growth of information. From a technology firm’s perspective, there are many concerns to address, for which a thorough analysis of existing work in the specific domain and related fields must be done. These concerns include protecting the firm's inventions, staying updated about competitors, taking care not to infringe anyone else’s patents, and checking whether the validity of those patents has been challenged in the USPTO (“Patent Office”) or the courts. The documents to be studied range from patent documents, scientific publications and court cases to other government administrative agency and court documents.
Furthermore, these documents are distributed across many different organizations and databases. For example, there are around 40 different patent issuing authorities across the world and tens of thousands of scientific journals. These domains and databases are very often incompatible with one another, so manually scanning each possible database is not feasible in terms of time and labor. Existing frameworks assist in searching for documents in each individual domain, especially for patent documents and scientific publications. However, frameworks for accessing other legal documents such as court cases and government regulations are rather lacking. Very few efforts have been made to integrate such diverse, heterogeneous documents.

The framework proposed in this paper attempts to provide an integrated approach, with a single interface or multiple compatible interfaces, for retrieving related documents from across many of these incompatible domains and databases. The approach intends to exploit information available in each set of documents that helps improve automated relevancy estimates across incompatible yet related domains. In this work, we develop a use case in the biotechnology arena involving the hormone erythropoietin, which regulates red blood cell production. We develop a set of results for this particular application as an exemplar to illustrate our work and to test the efficacy of our proposed approach. Specifically, we test one part of the framework on two sets of documents: court cases and patents.

This paper is organized as follows: Section 2 presents some common challenges associated with the various types of documents and existing work to address them. Section 3 introduces the use case, and the framework is discussed in Section 4. Section 5 reports selected preliminary results and observations of the use case study. Section 6 briefly discusses continuing efforts and future work.

2. Background and Relevant Work

This section introduces the presently available tools and resources for working with the different document types related to patent regulations. The challenges that we face in accessing each set of documents are briefly discussed.

2.1 Patents

Currently there are over 7 million issued U.S. patents. In 2009, 485,312 patent applications were filed with the USPTO [12]. In addition, there are over 40 different patent issuing authorities across the world, including the European Patent Office and the Japanese and German patent offices. Espacenet is a service offered by the European Patent Office (EPO) which offers search by keywords, inventor names, European classification, etc. [14]. Thomson Innovation offers patent analysis and translation services (Delphion) as well as publication search such as Web of Science [13]. Other sources include Dialog LLC, which is an online information retrieval system, Google Scholar, the USPTO’s website, WIPO [18], etc. Research is also very active in using semantic web technologies to represent patent structures via ontologies and to facilitate content based document retrieval. Some natural language processing (NLP) techniques have also been employed to extract important information such as claim text, chemical compounds, etc. from patent documents [6].

2.2 Court Cases and Litigation Documents

IP litigation is an important part of the patent filing and regulation compliance framework. Information regarding whether a patent has been previously litigated is valuable. There are 94 District Courts and one Court of Appeals for the Federal Circuit (CAFC), which hears patent appeals. PACER (Public Access to Court Electronic Records) is one electronic system to access databases for US court cases [15]. Manually scanning each of these 95 databases is not a feasible option. Currently, PACER requires one to know the party/assignee name or the case number; in other words, it does not allow keyword based search, and it hosts image documents which cannot be searched automatically for text.
Furthermore, keyword based search may not be the most effective due to lack of context.

2.3 Patent File Wrappers and Scientific Publications

File wrappers contain information about the scope of protection, application/patent data, prosecution history and other examination information. All of this can be key information for relevancy estimates in document retrieval. Available resources include Patent Application Information Retrieval (PAIR, public and private) to download patent file wrappers. File wrappers available for download from PAIR are in image form and require extra processing before text can be extracted from them.

Scientific publications cover a very broad set of topics which are spread across many different databases. Available tools which facilitate such a search include PubMed, MEDLINE, Google Scholar, etc. [16]. For an estimate, PubMed contains published articles from over 300 research journals. In addition, there exist conference proceedings and workshop presentations that can hold further valuable information for an organization, a patent examiner, etc.

While there is plenty of ongoing work on document retrieval in each of these independent domains, there is little work on bringing many of these databases together into a common platform to help solve the problem of retrieving highly related sets of documents from various domains.

3. Use Case: Erythropoietin/EPO

The preliminary framework proposed will be demonstrated through the use case “erythropoietin”. The use case will not only help in evaluating the effectiveness of the framework, but also demonstrate how each document specific or domain specific challenge is handled. Erythropoietin is a hormone which controls erythropoiesis, the production of red blood cells in bone marrow. Interest in this hormone has grown rapidly since its discovery because of its functional importance.
External preparation of erythropoietin has made possible the treatment of diseases such as anemia (the inability to produce sufficient red blood cells in the body). Synthetic production of this hormone has led to several commercially available drugs. Amgen Inc., an international biotechnology company, produced the first commercially available drug, Epogen. Amgen holds 5 patents for the production of erythropoietin (US5547933, US5618698, US5621080, US5756349 and US5955422), which have since been cited as well as challenged by others. Other pharmaceutical companies which have shown interest include Hoechst Marion Roussel, Transkaryotic Technologies and Reliance Life Sciences.

This use case contains plenty of documentation and forms a rich corpus with documents from a variety of domains. Following forward and backward citations of the five core patents, we identified 135 closely related patents to complete the use case. The five core patents have been challenged several times in US courts. Approximately 20 court cases have taken place from the late 1980s until today. These sets of patents also reference a set of over 3000 publications. Collectively, these documents make a good corpus to work with and to test the framework.

3.1 Ontologies

The field of biomedicine and biotechnology is growing rapidly, with a huge amount of research being carried out daily. The terminology in use is very extensive and hard to keep up with. There is hence a requirement for a controlled vocabulary to help correlate various existing and new terms. One solution for this is provided by ontologies. An ontology is a formal representation of key concepts or properties in a domain and the various relations connecting these individual entities. This kind of formal representation is very widely used, for example in biomedical informatics, artificial intelligence, etc.

Several efforts are being made to standardize the use and representation of biomedical terminology. A domain specific ontology is more than a simple thesaurus of words; it formalizes the idea of classes and domains. Domain ontologies provide more information about how one concept is related to another. Ontologies can be used as a backbone for content and information retrieval systems. GoPubMed uses Gene Ontology and Medical Subject Headings (MeSH) to access the PubMed database, while others make use of NLP techniques to search the PubMed database [17]. Ontologies have also been used previously for patent retrieval [1-2, 7-8]. Ontologies address or build upon a certain aspect of a domain. Each ontology considers and relates concepts from a different viewpoint. For example, Gene Ontology (GO) organizes its ontologies based on three principles: the concept as a cellular component, as a part of a biological process, and as a molecular function. The ontology shown in Fig. 1 represents ‘erythropoietin receptor binding’ as a molecular function.

Fig 1: Gene Ontology [19]

The ontology describes the process erythropoietin receptor binding as a type of cytokine receptor binding, which is in turn a type of receptor binding, and so on. The tree directly gives us a set of key phrases (“cytokine receptor binding”, “binding”, etc.) which can be matched with the text contained in documents. In addition, the class properties contain definitions and relations for concepts.

Each set of documents is written in a very different writing and formatting style, intended for a different audience. Publications make heavy use of technical jargon, whereas court cases are long and very descriptive, emphasizing the clarity of the statements being made. Patent documents make good use of technical jargon but also employ the writing styles required of legal documents.
Using XML not only allows maintaining these documents in a uniform format, it would also facilitate easy transformation into an ontology based language such as OWL, similar to the work done by Giereth et al. [4-5] and Wanner et al. [3]. Building an ontology for this application is a way to formalize the framework. It will encode knowledge about the various aspects discussed in this paper.

A single ontology may result in a very narrow term expansion. BioPortal is a web based application for accessing and sharing bio-ontologies, provided by the National Center for Biomedical Ontology [19]. It provides a common searchable and interactive interface to over 130 bio-ontologies. Other sources for ontologies and taxonomies include GeneCards, MedTerms, etc. In this paper, we search for all ontologies made available through BioPortal that contain the keyword erythropoietin/EPO either as a concept or as a relation to another concept. We make use of all these ontologies to obtain not only a larger term base but also a broader scope and hence a broader coverage of documents. The results of using these ontologies are shown in the results section.

We choose to index and search the documents using Apache Lucene. Apache Lucene is an open source information retrieval library which offers full-text indexing and search APIs. This software is very widely used in web search utilities. It makes use of the vector space model to represent documents. In the vector space model, every document is represented as an n-dimensional vector where each dimension is a unique word occurring in the text, and the magnitude along each dimension is the frequency of occurrence of the word in the document. The model also filters out very frequent words such as a, the, of, etc., known as stop words, as they generally do not contain any information about the document. Words occurring in the text are trimmed down to their roots. This process is called stemming and makes the model more efficient.
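The vector space representation described above can be sketched in a few lines of Python. This is a minimal illustration and not Lucene's actual implementation: the stop-word list, the `crude_stem` suffix stripper and the `cosine` helper are all assumptions introduced here for demonstration.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real analyzers ship longer lists).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "to", "for"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Map text to a sparse term-frequency vector (term -> count)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = sum(c * c for c in u.values()) ** 0.5
    norm_v = sum(c * c for c in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up sentence used only to exercise the functions.
doc = "The production of erythropoietin regulates the production of red blood cells"
vec = term_vector(doc)
similarity = cosine(term_vector("erythropoietin production"), vec)
```

Each distinct stemmed word becomes one dimension, so `vec` holds a count of 2 for "production" and drops "the" and "of" entirely. A production system would rely on the library's own analyzers rather than these hand-written helpers.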
Apache Lucene allows indexing documents in the form of ‘fields’, which makes it possible to constrain the search to one or more subsections of the document. It uses tf-idf to rank the documents based on the query. Tf-idf is a scoring factor that takes into account the frequency of occurrence of a word in a document (term frequency) and how rarely the word occurs across all the documents (inverse document frequency).

3.2 Document Organization and Indexing

Each type of document is written and made available in a different format from the others. As a first step, these documents are converted into a common format. This makes it easy to manage and perform operations on the documents. The format chosen is XML, and an automated script is written to convert the downloaded HTML files into XML files based on the fields/features of each document. For example, a patent document is re-arranged into XML as shown in Fig 2.

Fig 2: Sample patent XML file

4. Proposed Framework

There are many aspects of multi-domain document search which can be exploited for effective document retrieval. The framework proposed in this paper attempts to achieve this in various stages, which are explained below. Fig. 3 gives the top level picture of the framework. The corpus is a combined document set of various types such as patents, court cases, publications, etc. The framework interacts with the corpus based on the user query and returns the results to the user.

4.1 Keyword Expansion

The field of biotechnology is rapidly expanding. While the number of documents to be managed is increasing immensely, there is a notable lack of standard terminology. This poses an issue for applications that deal with biomedical text, such as search engines. A bag of words model may not recognize a particular relevant document unless the document contains the concept exactly as in the model. One solution to this is the use of bio-ontologies.
Bio-ontologies address this issue by defining concepts and relations. Given a keyword, it is possible to find related concepts by parsing the tree and the class properties. This can give a significant boost to the number of hits.

The goal of this stage is to make use of the ontologies, based on the user query, to maximize the number of document hits. There are several challenges associated with this stage. Picking a relevant ontology is very important, as an irrelevant ontology could result in a pool of irrelevant keywords, affecting the document retrieval at the very beginning. Also, close attention needs to be paid to the diverse usage of terms in each type of document. For example, a court case may use less technical terms than a scientific publication might. A patent normally restricts itself to the topic of concern, but the references it cites may vary widely and will need the ontology to cover a sufficient breadth of terms. We must hence clearly define which ontology will apply to each domain. Selecting irrelevant ontologies could thus lead to inaccurate results.

Fig 3: Proposed Framework

4.2 Independently Search Documents

Having expanded the keywords, we now use the keywords to retrieve documents independently from the different databases. The documents are indexed based on fields such as title, abstract, claims, etc. Individual field indexing allows for different combinations of fields in our keyword search. The goal of this stage is to maximize recall from each set of documents. In this step, we also outline features which are both general and obvious to a type of document. These features are fields such as the inventor names, the hearing date, etc., which will be used in the next stages to re-score the documents and improve relevancy measures. Feature extraction is made easier by the re-formatting of the documents into an XML format as explained in Section 3.2. The features that are extracted from patents and court cases are listed in Table 1. The various scoring methods are discussed in Section 5 as the results are presented.

Table 1: Features extracted from patents and court cases
Document       Features
Patents        Inventor, Assignee, Location, Application Information, US Classification, Examiner, Dates, References, Claims, Top-Occurring keywords
Court Cases    Plaintiff(s), Defendant(s), Date, Judge, District, Location, Claim References, Patent Number References

4.3 Cross-Referencing Features

Certain features/fields of a document quite commonly appear in other documents. For example, a court case/litigation very clearly states the claim of the patent under litigation that is allegedly being infringed. It also states the parties involved and possibly the inventors. All of this information can be used to identify documents which are closely related. If two documents share many of these features, they could be highly relevant to each other. In this stage, features extracted in the previous step are cross-referenced across the different databases to capture the interdependency between them. Fig. 4 shows an extract from US patent 4,677,195 and a court case between Amgen Inc. and Chugai Pharmaceuticals Ltd. (927 F.2d 1200). The court case clearly has references to the corresponding patent’s claims. The two documents are maintained in different formats, but contain information which establishes relevancy between them. This step focuses on cross-referencing all such features from the documents.

There are many features that can be extracted from patents. The features could be patent metadata or a specific claim. The goal is to find and use a set of generalized features, occurrences of which can be found in other types of documents. The output of this stage can be fed back to the previous one iteratively until a good set of documents is retrieved.
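The cross-referencing idea of Section 4.3 can be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation used in this work: the function names, the dictionary layout of a court case and a patent, the patent-number patterns, and the sample text are all hypothetical.

```python
import re

def patent_numbers(text):
    """Find US patent numbers written like 4,677,195 or US5547933 (assumed formats)."""
    pattern = r"\b(?:US\s?)?(\d{1,2},\d{3},\d{3}|\d{7})\b"
    return {m.group(1).replace(",", "") for m in re.finditer(pattern, text)}

def cross_reference(case, patents):
    """Link one court case to retrieved patents through shared features.

    Returns (links, missing): links maps a patent number to the features it
    shares with the case; missing holds cited patents that were not retrieved.
    """
    cited = patent_numbers(case["text"])
    parties = {p.lower() for p in case["parties"]}
    links = {}
    for number, patent in patents.items():
        shared = []
        if number in cited:
            shared.append("patent_number")
        if patent["assignee"].lower() in parties:
            shared.append("assignee")
        if shared:
            links[number] = shared
    return links, cited - set(patents)

# Hypothetical records; the case text is made up for illustration.
case = {
    "parties": ["Amgen Inc.", "Chugai Pharmaceuticals Ltd."],
    "text": "The opinion discusses U.S. Patent No. 4,677,195 and US5547933.",
}
retrieved = {"5547933": {"assignee": "Amgen Inc."}}
links, missing = cross_reference(case, retrieved)
```

In this toy run, `missing` contains the cited patent number that the earlier search stage did not return, which is the kind of signal that could be fed back into retrieval.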
For example, if a relevant court case cites or mentions a patent which was not retrieved in the previous stage, this new information can be used to retrieve the document.

Fig 4: Reference to a claim in a court case

4.4 User Feedback

The search and relevancy methods discussed so far rely on features such as document metadata and cross-referencing of documents. Every user has a specific need and searches for documents in a different context. The framework will allow the user to control or fine tune the results, which could lead to significant performance improvements.

5. Results and Observations

This section summarizes and discusses the results obtained on the patents and court cases. The results are discussed for the first part of the framework, expanding keywords and independent searching of the domains, through the use case.

5.1 Patents

An initial search was performed on the 135 patents using the keyword ‘erythropoietin’. The search was performed on the entire text to capture the occurrence of this keyword anywhere in the document. Selected results are tabulated in Table 2. Although these documents were manually identified as closely related, as it turns out, not all 135 documents were retrieved. The recall was 69%; 90 documents of the total 135 are retrieved using just the keyword “erythropoietin”. This result shows the inadequacy of keyword retrieval and implies further improvements are necessary, such as the use of query expansion and optimization techniques.

Table 2: Subset of results without use of ontology
Patent Number    Score
5955422          0.109
6204247          0.000
6245740          0.018
6270989          0.000
6280977          0.027
6340742          0.113
6420339          0.000
6420340          0.000
6524818          0.009

5.1.1 Boosted Results

A search on BioPortal returned around 11 ontologies that showed occurrence of the keyword ‘erythropoietin’ and ‘EPO’ either in the properties of a concept or as a concept. We make use of these ontologies for the tests.
Common relations such as is_a, component_of, subClass, parent, child, etc. are followed, and the ontology is traversed to collect siblings, children and parents of a particular concept. Relations such as measured_by, RO (relations ontology), etc. are not followed since they generally do not lead to solid concepts or keywords that can be used in searches. Following these relations in all the ontologies returned by the search on BioPortal, a term base of 43 terms and phrases (“erythropoietin receptor binding”, “epoetin alfa”, “recombinant erythropoietin”, etc.) was collected and used to search the databases. For each term, the search was performed with respect to the following combinations: Title only, Abstract only, Claims only, Description only, Title and Abstract, Abstract and Claims, Title and Abstract and Claims, Claims and Description, Abstract and Description, and the full text of the document.

Lucene normalizes the score of a particular field in a document by the length of that field. This implicitly gives shorter fields a higher weight. For example, a document which shows occurrence of a particular term in its title is given a higher weight than another document which shows occurrence of the same term somewhere in its description. Additionally, these weights can be manually boosted if needed. Many options are available to handle multi-word or multi-phrase queries. In our work, if the term has more than one word, then occurrence of any of these words is accepted and ranked accordingly. Table 3 shows the boost in the ranks of the core patents after the term expansion using ontologies. Prior to the term expansion, the core patents were ranked very low and many relevant patents were not retrieved.

Table 3: Results on patents after query expansion (core patents marked with *)
Rank    Patent Number    Score
11      7528104          17.59312
12      7078376          17.48418
13      7625564          14.68403
14      5955422 *        14.58489
15      4667016          14.02017
26      7128913          11.23234
27      5621080 *        11.06053
28      5618698 *        11.05477
29      6548653          10.98611
30      7012130          10.92918
31      5547933 *        10.82067

For the results shown in Table 3, the scoring function simply aggregates the scores for each individual term and each individual field for which it had a non-zero score. The individual terms had equal weights. Searching for the 43 different terms resulted in a recall of 100%, i.e., all 135 documents were retrieved from the database. The 5 core patents have a relatively high rank and show up in the top 35 results; those that fall within the listed rows are marked with an asterisk in the table.

A concern that arises from this result is the effect of using these terms on a larger and more general database. A term as general as protein appears in almost 200,000 documents when searched for in the USPTO database. To ensure that the set of 135 documents maintains a high rank even in a larger and more general database, each term is weighted based on its relation with the original query. Siblings and synonyms may get a high weight when compared to grandparents or other distant relatives. Four heuristic weighting functions were used for the expanded terms. The weighting functions are all step functions which assign a fixed weight to each of the four relations to the query term: synonym, child, parent and grandparent. Fig. 5 shows these functions. All the functions follow the same heuristic that synonyms have a full weight of 1, while terms which are children, parents and grandparents are given lesser weight as the distance increases.

Figure 5: Weighting functions used

The weighting schemes are tested on a corpus of 1156 US patents. These 1156 US patents are a collection of the top 50-100 scoring documents for each of the 43 terms extracted from the ontologies. Fig. 6 compares the recall values as a result of the four different weighting functions used.
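The term expansion and relation-based weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the toy is_a chain follows the hierarchy described for Fig. 1, the synonym entry is hypothetical, and the step weights are assumed values, since the weights actually used in the four functions are not listed here.

```python
# Toy is_a edges (child -> parent), following the chain described for Fig. 1.
IS_A = {
    "erythropoietin receptor binding": "cytokine receptor binding",
    "cytokine receptor binding": "receptor binding",
    "receptor binding": "binding",
}
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"cytokine receptor binding": ["cytokine receptor ligand"]}

# Assumed step weights: full weight for synonyms, less as distance grows.
WEIGHTS = {"query": 1.0, "synonym": 1.0, "child": 0.75,
           "parent": 0.5, "grandparent": 0.25}

def expand(term):
    """Return {expanded term: relation to the query term}."""
    related = {term: "query"}
    for synonym in SYNONYMS.get(term, []):
        related[synonym] = "synonym"
    for child, parent in IS_A.items():
        if parent == term:
            related[child] = "child"
    parent = IS_A.get(term)
    if parent:
        related[parent] = "parent"
        grandparent = IS_A.get(parent)
        if grandparent:
            related[grandparent] = "grandparent"
    return related

def weighted_score(term_scores, related):
    """Sum per-term retrieval scores, each scaled by its relation weight."""
    return sum(WEIGHTS[related[t]] * s
               for t, s in term_scores.items() if t in related)

related = expand("cytokine receptor binding")
# Made-up per-term scores for a single document.
doc_scores = {"erythropoietin receptor binding": 2.0,
              "receptor binding": 1.0,
              "binding": 4.0}
score = weighted_score(doc_scores, related)
```

With equal weights the document would score 7.0; under the assumed step weights the very general term "binding" contributes only a quarter of its raw score, which is the effect the weighting is meant to achieve on a larger, more general database.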
The fourth weighting function gives the least weight to very general concept terms and gives the best results for this test set; it is hence used for collecting further results. Larger differences between these functions could be observed with a larger set of relevant documents.

Figure 6: Comparison of recall values obtained from the four weighting functions

The effectiveness of the weighting schemes can be measured with respect to the relative ranks of the 5 core patents. In the unweighted scheme, these are ranked in the 50-115 range. As we move higher (or lower) in an ontology, away from the query term, the concepts become more general, and the resulting hits are far less specific than those for the more specific concepts. With the use of weighting function 4, the relative ranks go up to the 10-55 range for the same patents. This is a significant improvement and shows that even in a large corpus, the patents most relevant to the user’s query are amongst the top hits. Table 4 shows the results from the fourth weighting function.

Table 4: Improvement in relative ranking of the core patents
Un-weighted                          Weighted
Rank    Patent Number    Score       Rank    Patent Number    Score
51      5659012          21.022      12      5955422          19.598
52      5955422          20.885      13      6489293          19.415
53      5792850          20.848      42      5547933          14.927
54      5441868          20.731      43      7128913          14.927
100     7128913          17.501      44      7078376          14.912
101     5621080          17.398      45      4835260          14.861
102     5618698          17.389      46      5621080          14.793
103     6126917          17.377      47      6521245          14.701
104     7217689          17.163      48      4677195          14.686
105     5547933          17.134      49      5756349          14.673
106     6376218          17.130      50      6340742          14.664
112     5756349          16.915      51      5618698          14.585

5.2 Court Cases

Court cases form an important source of information for patent examiners and researchers, whether to verify the validity of an invention, learn about previously published/protected work, or challenge the claims of an existing patent. Table 5 shows the result of using the 43 search terms to search the database.
The database is a collection of the litigations involving the 5 core patents. The database consists of 30 court cases dated from 1989 to 2009. A full recall is observed. Court cases follow writing guidelines that are different from those of technical bio publications and papers. The documents contain less technical jargon, and hence the use of terms is generally limited to the key concept or claim that is being challenged, such as ‘erythropoietin’ or a statement related to it. In addition, the above documents were collected around the use case and each one of them contains the keyword “erythropoietin”. Hence selecting immediately related terms from ontologies, such as synonyms, is sufficient to retrieve the related court cases.

Table 5: Sample Court Cases retrieved
Court Case                                          Score
Amgen Inc. v. F. Hoffmann-La Roche Ltd - 2008       0.875
Amgen Inc. v. F. Hoffmann-La Roche Ltd - 2009       0.725
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2009    0.497
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2008    0.497
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2001    0.423

6. Conclusions and Future Work

In this paper, we demonstrate the use of ontologies for term expansion for document retrieval in two databases containing 30 court cases and 135 patents. We demonstrate the results through a use case, erythropoietin. We address the lack of existing tools and methods for retrieving relevant documents across different domains. We propose a framework that will serve as the backbone for this form of document retrieval. The scope and results discussed in this paper are limited to expanding the terms based on ontologies and using them to improve retrieval in the two databases. Various weighting methods and approaches are discussed and shown to be effective. The documents of immediate interest get a relatively high score amongst a larger corpus. The results are lenient towards type I errors (false positives) in order to give the user more choices.

Publications and PTO file wrappers are also an important part of the corpus.
They contain information which can be used to improve relevancy measures, such as authors, citations, date of publication, journal or proceeding names, number of citations, etc. Our next research plan is to investigate the effectiveness of this framework for these sets of documents. The features will be cross-referenced and the results will be integrated to return a collective set of relevant documents.

A single user query could have many possible interpretations. For example, one user may be interested in the preparation of a drug, while another might be interested in its commercial use. To account for this variation in user requests, the framework will also include an option for user relevancy feedback. This will ensure the results adhere to the user’s requirement.

Our ultimate goal is to develop an ontology based on these initial tests. The ontology will account for every aspect of the framework, from the expansion of keywords to the cross-referencing of various features, and will serve as the backbone for the framework.

7. Acknowledgements

This research is partially supported by NSF Grant Number 0811975 awarded to the University of Illinois at Urbana-Champaign and NSF Grant Number 0811460 to Stanford University. Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation.

8. References

[1] Ghoula, N., Khelif, K., and Dieng-Kuntz, R., “Supporting Patent Mining by using Ontology-based Semantic Annotations”, Proceedings of the IEEE/WIC/ACM international Conference on Web intelligence, Washington, DC, pp. 435-438, November 2007.
[2] Soo VW, Lin SY, Yang SY, Lin SN, Cheng SL, “A cooperative multi-agent platform for invention based on patent document analysis and ontology”, Expert Systems with Applications, Volume 31(4), pp. 766-775, November 2006.
[3] Wanner L., Baeza-Yates R., Brugmann S., Codina J., Diallo B., Escorsa E., Giereth M., Kompatsiaris Y., Papadopoulos S., Pianta E., Piella G., Puhlmann I., Rao G., Rotard M., Schoester P., Serafini L. and Zervaki V., “Towards content-oriented patent document processing”, World Patent Information, Volume 30(1), pp. 21-23, March 2008.
[4] Giereth, M., Koch, S., Kompatsiaris, Y., Papadopoulos, S., Pianta, E., Serafini, L., and Wanner, L., “A Modular Framework for Ontology-based Representation of Patent Information”, Proceeding of the 2007 Conference on Legal Knowledge and information Systems: JURIX 2007, Vol. 165, pp. 49-58.
[5] Giereth M, Brügmann S, Stäbler A, Rotard M, Ertl T., “Application of semantic technologies for representing patent metadata”, Proceedings of the first international workshop on applications of semantic technologies, 2006.
[6] Sheremetyeva S., “Natural language analysis of patent claims”, Proceedings of the ACL Workshop on Patent Corpus Processing, Sapporo, pp. 66-7, 2003.
[7] Yang SY, Lin SY, Lin SN, Cheng SL and Soo VW, “An Ontology-based Multi-agent Platform for Patent Knowledge Management”, International Journal of Electronic Business Management, Vol. 3(3), pp. 181-192, 2005.
[8] Codina J., Pianta E., Vrochidis S. and Papadopoulos S., “Integration of semantic, metadata and image search engines with a text search engine for patent retrieval”, Proceedings of the Workshop on Semantic Search at the 5th European Semantic Web Conference, pp. 14-28, June 2008.
[9] Kitamura Y., Kashiwase M., Masayoshi F. and Mizoguchi R., “Deployment of an ontological framework of function design knowledge”, Advanced Engineering Informatics, Vol. 18(2), pp. 115-127, 2004.
[10] McCarty, L. T., “Deep semantic interpretations of legal texts”, Proceedings of the 11th international Conference on Artificial intelligence and Law, Stanford, California, pp. 217-224, June 2007.
[11] Tong, R. M., Reid, C. A., Crowe, G. J., and Douglas, P. R., “Conceptual legal document retrieval using the RUBRIC system”, Proceedings of the 1st international Conference on Artificial intelligence and Law, Boston, pp. 28-34, 1987.
[12] USPTO. Accessed on 6/10/2010. http://www.uspto.gov/
[13] Thomson Innovation. Accessed on 6/10/2010. http://www.thomsoninnovation.com/
[14] espacenet. Accessed on 6/10/2010. http://www.espacenet.com/
[15] PACER. Accessed on 6/10/2010. http://www.pacer.gov/
[16] PubMed. Accessed on 6/10/2010. http://www.ncbi.nlm.nih.gov/pubmed/
[17] GoPubMed. Accessed on 6/10/2010. http://www.ncbi.nlm.nih.gov/pubmed/
[18] WIPO. Accessed on 6/10/2010. http://www.wipo.int/portal/index.html.en
[19] BioPortal. Accessed on 6/10/2010. http://bioportal.bioontology.org