siddharth_paper_v4 - Engineering Informatics Group

advertisement
Developing a Comprehensive Patent-Related Information Retrieval Tool
Abstract
There is an explosive growth of regulatory and related
information now available online. This paper reviews the
current state of practice in accessing patent-related
documents in the form of patents, government regulations,
court cases, scientific publications, etc. This paper
proposes an ontology-based framework to retrieve
documents from the multiple heterogeneous databases. A
use case, erythropoietin, is developed to test the framework.
A corpus of 135 patents and 30 court cases closely related
to the use case is built. Methods are discussed to improve
the results obtained through the use of bio-ontologies for
document retrieval in the two databases. The challenges
faced with respect to the integration of these documents
and future plans are briefly discussed.
1. Introduction
The role of science and technology has been in the
regulatory state in our system of government. There
are administrative agencies that deal with various
science and technology issues such as the Food and
Drug Administration (FDA), the Environmental
Protection Agency (EPA), the U.S. Patent and
Trademark Office (USPTO), the Nuclear Regulatory
Commission (NRC), the Federal Communications
Commission (FCC), and the like. The agencies
promulgate regulations that appear in the relevant
chapters of the Code of Federal Regulations (CFR)
and they interpret these regulations and the applicable
United States Code. In addition, the courts (often
federal courts) interpret the relevant U.S. statutes and
federal regulations (CFR). Moreover, there is often a
need to consult additional literature in the form of
technical/scientific publications.
Let us consider a few examples. If a company wanted
to study the market for acid reflux drugs, they may
choose to go to the FDA web site; they may look for
court cases involving these drugs, and they may also
study some relevant technical publications.
Similarly, a start-up company looking to work on
therapeutics in the breast cancer space may choose to
study patents in this field to determine whether some
patents were litigated, and the applicable scientific
and technological literature.
In each situation, we have a common problem. There
is relevant information that must be accessed and
which is available in different information domains.
In addition, even within one domain, the information
may not be easily accessible and searchable. Broadly
speaking, we have information on a particular topic
in (a) an administrative agency; (b) the court system;
(c) the relevant laws and regulations; and (d) other
literature such as scientific publications. In order to
develop a basic understanding of a topic, a user
would desire the ability to search, collate and analyze
information across all these information domains.
The past years have seen a tremendous advancement
and expansion in research and developments in
various fields of science and technology. New fields
continue to emerge and have opened up more
opportunities. In 2009 alone, 485,312 patent
applications were filed at the USPTO. This implies a
tremendous increase in the information available and
hence the necessity for knowledge bases which can
efficiently manage such explosive growth of
information.
From a technology firm’s perspective, there are many
concerns to address for which a thorough analysis of
existing works in specific domain and related fields
must be done. These concerns could be about
protecting their invention, being updated about their
competitors, being careful not to infringe anyone
else’s patents, checking to see if the validity of those
patents has been challenged in the USPTO (“Patent
Office”) or the courts and the like.
The documents to be studied range from patent
documents, scientific publications, court cases to
other government administrative agency and court
documents. Furthermore, these documents are
distributed across many different organizations and
databases. For example, there are around 40 different
patent issuing authorities across the world and tens of
thousands of scientific journals. These different
domains and databases are very often not compatible
with one another and hence a manual scan through
each possible database is not time and labor feasible.
Existing frameworks assist in searching for
documents in each individual domain, especially for
patent documents and scientific publications.
However frameworks for accessing other legal
documents such as court cases and government
regulations are rather lacking. Very little efforts have
been attempted to deal with the integration of such
diverse, heterogeneous documents.
The framework proposed in this paper attempts to
provide an integrated approach, with a single or
multiple compatible interfaces, for retrieving related
documents from across many of these incompatible
domains and databases. The approach intends to
exploit information available in each set of
documents that helps improve automated relevancy
estimate across incompatible but yet related domains.
In this work, we choose to develop a use case in the
biotechnology arena involving the hormone,
erythropoietin, which regulates red blood cell
production. We develop a set of results for this
particular application as an exemplar to illustrate our
work and to test out the efficacy of our proposed
approach. Specifically we test one part of the
framework and for two sets of documents; court cases
and patents.
This paper is organized as follows: Section 2 presents
some common challenges associated with the various
types of documents and existing work to address
them. Section 3 introduces the use case and the
framework is discussed in Section 4. Section 5
reports selected preliminary results and observations
of the use case study. Section 6 briefly discusses
continuing efforts and future works.
2. Background and Relevant Work
This section introduces the presently available tools
and resources for working with the different
document types related to patent regulations. The
challenges that we face with accessing each set of
documents are briefly discussed.
2.1 Patents
Currently there are over 7 million issued U.S Patents.
In 2009, 485,312 patent applications were filed with
the USPTO [12]. In addition, there are over 40
different patent issuing authorities across the world
including European Patent Office, Japanese and
German Patent offices. Espacenet is a service offered
by the European Patent Office (EPO) which offers
search by keywords, inventor names, European
classification etc. [14]. Thompson Innovation offers
patent analysis and translation services (Delphion) as
well as publication search such as Web of Science
[13]. Other sources include Dialog LLC which is an
online information retrieval system, Google Scholar,
USPTO’s website WIPO[18] etc. Research is very
active also in using semantic web technologies to
represent patent structures via ontologies and
facilitate content based document retrieval. Some
natural language processing (NLP) techniques have
also been employed to extract important information
such as claim text, chemical compounds, etc. from
patent documents [6].
2.2 Court Cases and Litigation Documents
IP litigation is an important part of the patent filing
and regulation compliance framework. Information
regarding whether a patent has been previously
litigated is valuable. There are 94 District Courts and
one Court of Appeals (CAFC). PACER (Public
Access to Court Electronic Records) is one electronic
system to access databases for US Court cases [15].
Manually scanning each of these 95 databases is not a
feasible option. Currently, PACER requires one to
know party/assignee name or the case number; in
other words, it does not allow keyword based search
and it hosts image documents which cannot be
searched automatically for text. Furthermore,
keyword based search may not be the most effective
due to lack of context.
2.3 Patent File Wrappers and Scientific
Publications
File Wrappers contain information about scope of
protection, application/patent data, prosecution
history and other examination information. This can
all be key information which can help in relevancy
estimates in document retrieval. Available resources
include Patent Application Information Retrieval
PAIR (public and private) to download patent file
wrappers. File wrappers available for download from
PAIR are in image form and require some extra
processing in order to be able to extract text from
them.
Scientific publications cover a very broad set of
topics which are spanned across many different
databases. Available tools which facilitate such a
search are PubMed, MedLine, and Google Scholar
etc. [16]. For an estimate, PubMed contains published
articles from over 300 research journals. In addition,
there exist conference proceedings and workshop
presentations that can further hold valuable
information for the organization/patent examiner etc.
While there is plenty of work going on with respect
to document retrieval in each of these independent
domains, there is little work to bring together many
of these databases into a common platform to help
solving the problem of retrieving highly related sets
of documents from various domains.
3. Use Case: Erythropoietin/EPO
The preliminary framework proposed will be
demonstrated through the use case “erythropoietin”.
The use case will not only help in evaluating the
effectiveness of the framework, but also demonstrate
how each document specific or domain specific
challenge is handled.
Erythropoietin is a hormone which controls
erythropoiesis, which is the production of red blood
cells in bone marrow. Interest in this hormone grew
rapidly since its discovery and its functional
importance. External preparation of erythropoietin
has made possible the treatment of diseases such as
anemia (the inability to produce sufficient red blood
cells in the body). Synthetic production of this
hormone has led to several commercially available
drugs. Amgen Inc., an international biotechnology
company, produced the first commercially available
drug – Epogen. Amgen holds 5 patents for the
production of erythropoietin – US5547933,
US5618698,
US5621080,
US5756349
and
US5955422, which have since been cited as well as
challenged by others. Other pharmaceutical
companies which have shown interest include
Hoechst Marion Roussel, Transkaryotic Technologies
and Reliance Life Sciences.
Several efforts are being made in order to standardize
the use and representation of biomedical terminology.
A domain specific ontology is more than a simple
thesaurus of words; it formalizes the idea of classes
and domains. Domain ontologies provide more
information about how one concept is related to
another. Ontologies can be used as a backbone for
content and information retrieval systems.
GoPubMed uses Gene Ontology and MEdical SubHeadings (MESH) to access the PubMed database
while others make use of NLP techniques to search
the PubMed database [17]. Ontologies have also been
used previously for patent retrieval [1-2, 7-8].
Ontologies address or build upon a certain aspect of a
domain. Each ontology considers and relates
concepts from a different viewpoint. Consider an
example - Gene Ontology (GO) organizes their
ontologies based on three principles – the concept as
a cellular component, as a part of a biological
process, and its molecular function. The ontology
shown in Fig. 1 represents ‘erythropoietin receptor
binding’ as a molecular function.
This use case contains plenty of documentations and
forms a rich corpus with documents from a variety of
domains. Following forward and backward citations
of the five core patents, we have outlined 135 closely
related patents to complete the use case. The five core
patents have been challenged several times in US
courts. Approximately 20 court cases have taken
place since late 1980’s till today. These sets of
patents also reference a set of over 3000 publications.
Collectively, these documents put together make a
good corpus to work with and test the framework.
3.1 Ontologies
The field of biomedicine and biotechnology is rapidly
growing with huge amount of research being carried
out daily. The use of terminology is very extensive
and hard to keep up with. There is hence a
requirement for a controlled vocabulary to help
correlate various existing and new terms. One
solution for this is provided by ontologies. An
ontology is a formal representation of key concepts or
properties in a domain and the various relations
connecting these individual entities. This kind of
formal representation is very widely used such as in
biomedical informatics, artificial intelligence, etc.
Fig 1: Gene Ontology [19]
The ontology describes the process erythropoietin
receptor binding as a type of cytokine receptor
binding which is in turn a type of receptor binding
and so on. The tree directly gives us a set of key
phrases – “cytokine receptor binding”, “binding” etc.
which can be matched with the text contained in
documents. In addition, the class properties contain
definitions and relations for concepts.
Each set of documents are written in very different
writing and formatting styles, intended for a different
set of audience. Publications make high use of
technical jargons, whereas court cases are long and
very descriptive emphasizing on clarity of the
statements being made. Patent documents make good
use of technical jargons but also employ writing
styles as required by legal documents.
Using XML not only allows maintaining these
documents in a uniform format, it would also
facilitate easy transformation into an ontology based
language such as OWL similar to the work done by
Giereth et.al. [4-5] and Wannar et. al. [3]. Building an
ontology for this application is a way to formalize the
framework. It will have knowledge encoded about
various aspects spoken about in this paper.
A single ontology may result in a very narrow term
expansion. BioPortal is a web based application for
accessing and sharing bio-ontologies by the National
Center for Biomedical Ontologies [19]. It provides a
common searchable and interactive interface to over
130 bio ontologies. Other sources for ontologies and
taxonomies include GeneCards, MedTerms etc. In
this paper, we search for all ontologies made
available through BioPortal that contain the keyword
erythropoietin/EPO either as a concept or as a
relation to another concept. We make use all the
ontologies to not only obtain a larger term base but a
broader scope and hence a broader coverage of
documents. The results of using these ontologies are
shown in the results section.
We choose to index and search the documents using
Apache Lucene. Apache Lucene is an open source
information retrieval library which offers full-text
indexing and search APIs. It makes use of the vector
space model to represent documents. In the vector
space model, every document is represented as an ndimensional vector where each dimension is a unique
word occurring in text, and the magnitude along each
dimension is the frequency of occurrence of the word
in a document. This software is very widely used in
web search utilities. The model also filters out very
frequent words such as – a, the, of etc. known as stop
words as they generally do not contain any
information about the document. Words occurring in
the text are trimmed down to their roots. This process
is called stemming and makes the model more
efficient. Apache Lucene allows indexing documents
in the form of ‘fields’, which makes it possible to
constrain the search to only one or more sub sections
of the document. It uses tf-idf to rank the documents
based on the query. Tf-idf stands is a scoring factor
that takes into account the frequency of occurrence of
a word in a document (term frequency) and the
frequency of occurrence across all the documents
(inverse document frequency).
3.2 Document Organization and Indexing
Each type of document is written and available in a
different format from the other. As a first step, these
documents are converted into a common format. This
makes it easy to manage and perform operations on
the documents. The format chosen is XML, and an
automated script is written to convert the downloaded
HTML files into XML files based on the
fields/features of that document. For example a patent
document is re-arranged into XML as shown in Fig 2.
4. Proposed Framework
There are many aspects of multi-domain document
search which can be exploited for effective document
retrieval. The framework proposed in this paper
attempts to achieve this in various stages which are
explained below. Fig. 3 gives the top level picture of
the framework. The corpus is a combined document
set of various types such as patents, court cases,
publications etc. The framework interacts with the
corpus based on the user query and returns the results
to the user.
4.1 Keyword Expansion
Fig 2: Sample patent XML file
The field of biotechnology is rapidly expanding.
While the number of documents required to be
managed is increasing immensely, there is an
imminent lack of standard terminology. This poses an
issue to applications that deal with biomedical text
such as search engines. A bag of words model may
not recognize a particular relevant document unless
the document contains the concept exactly as in the
model. One solution to this is the use of bioontologies. Bio-ontologies address this issue by
defining concepts and relations. Given a keyword, it
is possible to find related concepts by parsing the tree
and the class properties. This can give a significant
boost to the number of hits.
inventor names, the hearing date, etc., which will be
used in the next stages to re-score the documents and
improve relevancy measures. Feature extraction is
made easier due to the re-formatting of the
documents into an XML format as explained in
Section 3.2. The features that are extracted from
patents and court cases are listed in Table 1. The
various scoring methods are discussed in Section 5 as
the results are presented.
Table 1: Features extracted from patents and
court cases
Document
Patents
Court Cases
Fig 3: Proposed Framework
The goal of this stage is to make use of the
ontologies, based on the user query to maximize the
number of document hits. There are several
challenges associated with this stage. Picking a
relevant ontology is very important as an irrelevant
ontology could result in a pool of irrelevant
keywords, affecting the document retrieval at the
very beginning. Also, close attention needs to be paid
to the diverse usage of terms that are used in each
type of document. For example, a court case may
make use of less technical terms than what a
scientific publication might. A patent normally
restricts itself to the topic under concern; the
references it cites may vary widely and will need the
ontology to cover sufficient breadth of terms. We
must hence clearly define which ontology will apply
to each domain. Selecting irrelevant ontologies could
thus lead to producing inaccurate results.
4.2 Independently Search Documents
Having expanded the keywords, we now use the
keywords to retrieve documents independently from
the different databases. The documents are indexed
based on fields such as title, abstract, claims, etc.
Individual field indexing allows for different
combinations of fields in our keyword search. The
goal of this stage is to try and maximize recall from
each set of documents. In this step, we also outline
features which are both general and obvious to a type
of documents. These features are fields such as the
Features
Inventor, Assignee,
Location, Application
Information, US
Classification, Examiner,
Dates, References, Claims,
Examiner, Top-Occurring
keywords
Plaintiff(s), Defendant(s),
Date, Judge, District,
Location, Claim
References, Patent Number
References
4.3 Cross-Referencing Features
Certain features/fields of a document quite commonly
appear in other documents. For example, a court
case/litigation very clearly states the claim, which is
supposedly being infringed, of the patent under
litigation. It also states the parties involved and
possibly the inventors. All of this information can be
used to identify documents which are closely related.
If two documents have much of these overlapped,
they could be highly relevant to each other. In this
stage, features extracted from the previous step are
cross-referenced across the different databases to
capture the inter dependency between them.
Fig. 4 shows an extract from US patent 4,677,195 and
a court case between Amgen Inc. and Chugai
Pharmaceuticals Ltd. (927 F.2d 1200). The court case
clearly has references to the corresponding patent’s
claims. Both the documents are maintained in
different formats, but contain information which
establishes relevancy between them. This step
focuses on cross-referencing all such features from
the documents. There are many features that can be
extracted from patents. The features could be patent
metadata or a specific claim. The goal is to find and
use a set of generalized features, occurrences of
which can be found in other types of documents. The
output of this stage can be feedback to the previous
one iteratively until a good set of documents is
retrieved. For example, if a relevant court case cites
or mentions a patent which was not retrieved in the
previous stage, this new information can be used to
retrieve the document.
“erythropoietin”. This result shows the inadequacy of
keyword retrieval and implies further improvements
are necessary, such as the use of query expansion and
optimization techniques.
Table 2: Subset of results without use of
ontology
Patent Number
Score
5955422
6204247
6245740
6270989
6280977
6340742
6420339
6420340
6524818
0.109
0.000
0.018
0.000
0.027
0.113
0.000
0.000
0.009
5.1.1 Boosted Results
Fig 4: Reference to a claim in a court case
4.4 User Feedback
The search and relevancy methods discussed so far
rely on features such as document metadata and
cross-referencing of documents. Every user has a
specific need and searches for documents in a
different context. The framework will allow the user
to control or fine tune the results and could lead to
significant performance improvements.
5. Results and Observations
This section summarizes and discusses the results
obtained on the patent and court cases. The results are
discussed for the first part of the framework –
expanding keywords and independent searching of
the domains through the use case.
5.1 Patents
An initial search was performed on the 135 patents
using the keyword ‘erythropoietin’. The search was
performed on the entire text to capture the occurrence
of this keyword anywhere in the document. Selected
results are tabulated as shown in Table 2. Although
these documents were manually identified as closely
related, as it turns out, not all 135 documents were
retrieved. The recall was 69%; 90 documents of the
total 135 are retrieved using just the keyword
A search on BioPortal returned around 11 ontologies
that showed occurrence of the keyword
‘erythropoietin’ and ‘EPO’ either in the properties of
a concept or as a concept. We make use of these
ontologies for the tests. Common relations such as
is_a, component_of, subClass, parent, child etc. are
followed and the ontology is traversed to collect
siblings, children and parents of a particular concept.
Relations such as measured_by, RO (relations
ontology) etc. are not followed since they generally
do not lead to solid concepts or keywords that can be
used in searches. Following these relations in all the
ontologies returned by the search on BioPortal, a
term base of 43 terms and phrases (“erythropoietin
receptor binding”, “epoetin alfa”, recombinant
erythropoietin” etc.) were collected and used to
search the databases.
For each term, the search was performed with respect
to the following combinations – Title only, Abstract
only, Claims only, Description only, Title and
Abstract, Abstract and Claims, Title and Abstract and
Claims, Claims and Description, Abstract and
Description, and the full-text of the document.
Lucene normalizes the score of a particular field in a
document by the length of that field. This implicitly
gives shorter fields a higher weight. For example, a
document which shows occurrence of a particular
term in its title is given a higher weight than another
document which shows occurrence of the same term
somewhere in its description. Additionally, these
weights can be manually boosted if needed.
Many options are available to handle multi-word or
multi-phrase queries. In our work, if the term has
more than one word, then occurrence of any of these
terms is accepted and accordingly ranked. Table 3
shows the boost in the ranks of the core patents after
the term expansion using ontologies. Prior to the term
expansion, the core patents were ranked very low and
many relevant patents were not retrieved.
are children, parents and grandparents are given
lesser weight as the distance increases.
Table 3: Results on patents after query
expansion
Rank
Patent
Number
Score
11
7528104
17.59312
12
7078376
17.48418
13
7625564
14.68403
14
5955422
14.58489
15
4667016
14.02017
26
7128913
11.23234
27
5621080
11.06053
28
5618698
11.05477
29
6548653
10.98611
30
7012130
10.92918
31
5547933
10.82067
For the results shown in Table 3, the scoring function
used is to simply aggregate the scores for each
individual term, and each individual field for which it
had a non-zero score. The individual terms had equal
weights. Searching for the 43 different terms resulted
in a recall of 100%, i.e., all the 135 documents were
retrieved from the database. The 5 core patents have a
relatively high rank and show up in the top 35 results
and are shown in bold in the table.
A concern that arises from this result is the effect of
using these terms on a larger and more general
database. A term as general as protein appears in
almost 200,000 documents when search for in
USPTO. To ensure that the set of 135 documents
maintains a high rank even in a larger and more
general database, each term is weighted based on its
relation with the original query. Siblings and
synonyms may get high weight when compared to
grandparents or other far off relatives.
Four heuristic weighting functions were used for the
expanded terms. The weighting functions are all step
functions which assign a fixed weight to each of the
four relations to the query term – synonym, child,
parent and grandparent. Fig. 5 shows these functions.
All the functions follow the same heuristic that
synonyms have a full-weight of 1 while terms which
Figure 5: Weighting functions used
The weighting schemes are tested on a corpus of
1156 US patents. These 1156 US patents are a
collection of the top 50-100 scoring documents for
each of the 43 terms extracted from the ontologies.
Fig. 6 compares the recall values as a result of the
four different weighting functions used. The fourth
weighting function gives the least weight to very
general concept terms and gives the best results for
this test set and is hence used for collecting further
results. Larger differences in these functions could be
observed with a larger set of relevant documents.
Figure 6: Comparison of recall values
obtained from the four weighting functions
The effectiveness of the weighting schemes can be
measured with respect to the relative ranks for the 5
core patents. In the unweighted scheme, these are
ranked in the 50-115 range. As we move higher (or
lower) in an ontology, away from the query term, the
concepts become more general resulting in a very
general hit ratio when compared to the more specific
concepts. With the use of weighting function-4, the
relative ranks go up to the 10-55 range for the same
patents. This is a significant improvement and shows
that even in a large corpus; the most relevant patents
to the user’s query are amongst the top hits. Table 4
shows the results from the fourth weighting function.
Table 4: Improvement in relative ranking of
the core patents
Un-weighted
Patent
Rank
Score
Number
51
5659012
21.022
12
Weighted
Patent
Score
Number
5955422
19.598
Rank
52
5955422
20.885
13
6489293
19.415
53
5792850
20.848
42
5547933
14.927
54
5441868
20.731
43
7128913
14.927
100
7128913
17.501
44
7078376
14.912
101
5621080
17.398
45
4835260
14.861
102
5618698
17.389
46
5621080
14.793
103
6126917
17.377
47
6521245
14.701
104
7217689
17.163
48
4677195
14.686
105
5547933
17.134
49
5756349
14.673
106
6376218
17.130
50
6340742
14.664
112
5756349
16.915
51
5618698
14.585
5.2 Court Cases
Court cases form an important source of information
for patent examiners and researchers whether to
verify the validity of an invention, learn about
previously published/protected work, or to challenge
the claims of an existing patent. Table 5 shows the
result of using the 43 search terms to search the
database. The database is a collection of the
litigations involving the 5 core patents. The database
consists 30 court cases dated from 1989 – 2009. A
full recall is observed.
Court cases follow writing guidelines that are
different from technical bio publications and papers.
The documents have less technical jargons and hence
the use of terms is generally limited to the key
concept or claim that is being challenged such as
‘erythropoietin’ or a statement related to it. In
addition, the above documents were collected around
the use case and each one of them contains the
keyword
“erythropoietin”.
Hence
selecting
immediately related terms from ontologies such as
synonyms is sufficient to retrieve the related court
cases.
Table 5: Sample Court Cases retrieved
Court Case
Amgen Inc. v. F. Hoffmann-La Roche Ltd - 2008
Amgen Inc. v. F. Hoffmann-La Roche Ltd - 2009
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2009
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2008
Amgen Inc. v. Hoechst Marion Roussel Inc. - 2001
Score
0.875
0.725
0.497
0.497
0.423
6. Conclusions and Future Work
In this paper, we demonstrate the use of ontologies
for term expansion for document retrieval in two
databases containing 30 court cases and 135 patents.
We demonstrate the results through a use case,
erythropoietin. We address the lack of existing tools
and methods for retrieving relevant documents across
different domains. We propose a framework that will
serve as the backbone for such form of document
retrieval.
The scope and results discussed in this paper are
limited to the expanding the terms based on
ontologies and using them to improve retrieval in the
two databases. Various weighting methods and
approaches are discussed and shown to be effective.
The documents of immediate interest get a relatively
high score amongst a larger corpus. The results are
lenient towards type-1 errors in order to give the user
more choices.
Publications and PTO file wrappers also are
important part of the corpus. They contain
information which can be used to improve relevancy
measures such as authors, citations, date of
publication, journal or proceeding names, number of
citations etc. Our next research plan is to investigate
the effectiveness of this framework for these sets of
documents. The features will be cross-referenced and
results will be integrated to return a collective set of
relevant documents.
A single user query could have many possible
interpretations. For example, one user may be
interested in the preparation of a drug, while another
might be interested in the commercial use of it. To
account for this variation in user requests, the
framework will also include an option for user
relevancy feedback. This will ensure the results
adhere to the user’s requirement.
Our ultimate goal is to develop an ontology based on
these initial tests. The ontology will account for every
aspect of the framework, right from the expansion of
keywords to the cross-referencing of various features
and will serve as the backbone for the framework.
7. Acknowledgements
This research is partially supported by NSF Grant
Number 0811975 awarded to the University of
Illinois at Urbana-Champaign and NSF Grant
Number 0811460 to Stanford University. Any
opinions and findings are those of the authors, and do
not necessarily reflect the views of the National
Science Foundation.
8. References
[1] Ghoula, N., Khelif, K., and Dieng-Kuntz, R.,
“Supporting Patent Mining by using Ontology-based
Semantic
Annotations”,
Proceedings
of
the
IEEE/WIC/ACM international Conference on Web
intelligence, Washington, DC, pp. 435-438, November
2007.
[8] Codina J., Pianta E., Vrochidis S. and Papadopoulos S.,
“Integration of semantic, metadata and image search
engines with a text search engine for patent retrieval”,
Proceedings of the Workshop on Semantic Search at
the 5th European Semantic Web Conference, pp. 14-28,
June 2008.
[9] Kitamura Y., Kashiwase M., Masayoshi F. and
Mizoguchi R., “Deployment of an ontological
framework of function design knowledge”, Advanced
Engineering Informatics, Vol. 18(2), pp. 115–127,
2004.
[10] McCarty, L. T., “Deep semantic interpretations of
legal texts”, Proceedings of the 11th international
Conference
on
Artificial
intelligence
and
Law, Stanford, California, pp. 217-224, June 2007.
[11] Tong, R. M., Reid, C. A., Crowe, G. J., and Douglas,
P. R., “Conceptual legal document retrieval using the
RUBRIC system”, Proceedings of the 1st international
Conference on Artificial intelligence and Law, Boston,
pp. 28-34, 1987.
[2] Soo VW, Lin SY, Yang SY, Lin SN, Cheng SL, “A
cooperative multi-agent platform for invention based on
patent document analysis and ontology”, Expert
Systems with Applications, Volume 31(4), pp. 766-775,
November 2006.
[12] USPTO. Accessed on 6/10/2010.
http://www.uspto.gov/
[3] Wanner L., Baeza-Yates R., Brugmann S., Codina J.,
Diallo B., Escorsa E., Giereth M., Kompatsiaris Y.,
Papadopoulos S., Pianta E., Piella G., Puhlmann I., Rao
G., Rotard M., Schoester P., Serafini L. and Zervaki V.,
“Towards
content-oriented
patent
document
processing”, World Patent Information, Volume 30(1),
pp. 21-23, March 2008.
[14] espacenet. Accessed on 6/10/2010.
http://www.espacenet.com/
[4] Giereth, M., Koch, S., Kompatsiaris, Y., Papadopoulos,
S., Pianta, E., Serafini, L., and Wanner, L., “A Modular
Framework for Ontology-based Representation of
Patent Information”, Proceeding of the 2007
Conference on Legal Knowledge and information
Systems: JURIX 2007, Vol. 165, 49-58.
[5] Giereth M, Brügmann S, Stäbler A, Rotard M, Ertl T.,
“Application of semantic technologies for representing
patent metadata”, In proceedings of the first
international workshop on applications of semantic
technologies, 2006
[6] S. Sheremetyeva, Natural language analysis of patent
claims. In: Proceedings of the ACL Workshop on
Patent Corpus Processing, Sapporo, pp. 66–7, 2003.
[7] Yang SY, Lin SY, Lin SN, Cheng SL and Soo VW,
“An Ontology-based Multi-agent Platform for Patent
Knowledge Management”, International Journal of
Electronic Business Management, Vol. 3(3), pp. 181192, 2005
[13] Thomson Innovation. Accessed on 6/10/2010.
http://www.thomsoninnovation.com/
[15] PACER. Accessed on 6/10/2010.
http://www.pacer.gov/
[16] PubMed. Accessed on 6/10/2010.
http://www.ncbi.nlm.nih.gov/pubmed/
[17] GoPubMed. Accessed on 6/10/2010.
http://www.ncbi.nlm.nih.gov/pubmed/
[18] WIPO. Accessed on 6/10/2010.
http://www.wipo.int/portal/index.html.en
[19] BioPortal. Accessed on 6/10/2010.
http://bioportal.bioontology.org
Download