John Tait
Chief Scientific Officer
IRF
• Mihai Lupu of the IRF
• Jian-Han Zhu of University College London
• Jimmy Huang of York University of Canada
• Giovanna Roda and colleagues in the CLEF IP team at Matrixware and the IRF
• Royal Society of Chemistry for generously making their Scientific Journal Collections available to us
IRF Member Services 2
• The content of this talk was planned on the basis that we could discuss the details of TREC CHEM
• Over the weekend it became clear that NIST policy is that the results should not be made public until the TREC conference in the US in November
• Therefore much detail had to be removed
• Introduction to the IRF
• TREC CHEM
– 2009
– Plans for the future
• Summary and Conclusions
An international not-for-profit institution, founded in 2006 and based in Vienna, to promote and facilitate research in large scale information retrieval
To bridge the gap between the needs of the industry and the academic know-how.
To maintain a facility that enables large scale information retrieval and in depth processing of data for research
To bring the latest information retrieval technology to the community of patent professionals and other professional searchers.
A platform initiated by Matrixware which improves the transfer of knowledge between professionals in Intellectual Property and Information Retrieval, and promotes collaboration between experts on the development of new research methodologies for international patent data.
High Recall: a single missed document can invalidate a patent
Session based: a single search may involve days of cycles of results review and query reformulation
Defendable: Process and results may need to be defended in court
CLEF-IP Track
The goal of the CLEF-IP track is to investigate multilingual IR techniques in the Intellectual
Property domain.
• Target data: more than 1 million granted EPO patent documents in three languages (English, German, French)
• Tasks: prior art search, invalidity search
• Test collection constructed using the available EPO prior art reports
• Scientific Members
– Access to data and resources
– Project links to industry
• Industrial Members
– Consultancy and research in IR and IP search
– Training and support: systems evaluation, semantic computing
– Links to academia
• Organised by the US National Institute of Standards and Technology (NIST)
• Has run annually since 1991
– Originally focused on ad hoc text retrieval with long queries
– Regularly extended
• Video
• Web
• Genomics
• Legal
• IRF approached NIST about using our patent data and computing facilities as a means to promote scientific co-operation
• Jian-Han Zhu, then of the UK Open University, approached NIST about Chemistry
• NIST were interested in domain specific retrieval to follow up the Genomics track etc. and helped us get going
• 1.2 million patent files (IRF)
• 59k scientific articles (RSC)
• Technology Survey
– Search for all potentially relevant documents, in both collections.
– 18 manually defined and evaluated topics
• Prior Art
– Search for patents that may invalidate a given patent
– 1000 automatically created and evaluated topics (1000 patent files)
• 15 institutions registered to get the data
– 6 submitted 31 runs for the Technology Survey (TS) task:
• University of Applied Science Geneva, Information Retrieval Laboratory of Dalian University of Technology, Fraunhofer SCAI, Milwaukee School of Engineering, Purdue University, York University
– 8 submitted 59 runs for the Prior Art (PA) task:
• University of Applied Science Geneva, Carnegie Mellon University, Information Retrieval Laboratory of Dalian University of Technology, University of Iowa, Fraunhofer SCAI, Milwaukee School of Engineering, Purdue University, York University
• Basic vector space model
– Different sections, weights on each section
– BM25
• Additional filtering/weighting based on IPC codes
• Linguistic processing
– Emphasis on Noun Phrases
• Concept based search
• Query expansion
– Using Oscar3, MeSH
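To make the "different sections, weights on each section" idea concrete, here is a minimal sketch that scores each patent section separately with Okapi BM25 and sums the weighted section scores. It is an illustration only, not any participant's actual system: the function names, the section names (`title`, `claims`), the weights and the parameter defaults `k1=1.2`, `b=0.75` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query` with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores


def sectioned_scores(query, patents, weights):
    """Weighted sum of per-section BM25 scores (hypothetical combination rule)."""
    total = [0.0] * len(patents)
    for section, w in weights.items():
        docs = [p.get(section, []) for p in patents]
        for i, s in enumerate(bm25_scores(query, docs)):
            total[i] += w * s
    return total


# Toy collection; real runs would use the tokenised patent sections.
patents = [
    {"title": ["aspirin", "synthesis"], "claims": ["acetylsalicylic", "acid"]},
    {"title": ["polymer", "coating"], "claims": ["aspirin", "tablet", "coating"]},
    {"title": ["steel", "alloy"], "claims": ["iron", "carbon"]},
]
scores = sectioned_scores(["aspirin"], patents, {"title": 2.0, "claims": 1.0})
```

With the title weighted twice as heavily as the claims, the first patent (match in the title) outranks the second (match in the claims only), and the third scores zero. IPC-based filtering or query expansion would sit before or after this scoring step.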
• Technology Survey tasks
– 8 chemistry grad students
– 5 experts
– Each topic evaluated by 2 students and 1 expert
• Prior Art tasks
– Automatically evaluated based on citations within patents and family members
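One way to picture the automatic Prior Art evaluation is as a small set computation: for each topic patent, the relevant set is built from the patents it cites, expanded by patent family. The sketch below is an assumption about how such judgments could be derived, not the track's published procedure; `build_qrels`, the input dictionaries and the rule of excluding the topic's own family are all hypothetical.

```python
def build_qrels(topic_ids, citations, family_of):
    """Derive relevance sets from citations, expanded by patent family.

    citations: patent id -> iterable of cited patent ids
    family_of: patent id -> family id (patents missing here form their own family)
    """
    # Invert the family mapping: family id -> set of member patents.
    members = {}
    for pid, fam in family_of.items():
        members.setdefault(fam, set()).add(pid)

    def family(pid):
        fam = family_of.get(pid)
        return members[fam] if fam is not None else {pid}

    qrels = {}
    for topic in topic_ids:
        relevant = set()
        # Citations made by any family member of the topic count for the topic.
        for source in family(topic):
            for cited in citations.get(source, ()):
                relevant |= family(cited)
        # Assumption: a topic's own family is never counted as prior art.
        qrels[topic] = relevant - family(topic)
    return qrels


# Toy data with made-up identifiers.
citations = {"EP1": ["US5"], "US1": ["EP7"]}
family_of = {"EP1": "F1", "US1": "F1", "US5": "F2", "EP5": "F2"}
qrels = build_qrels(["EP1"], citations, family_of)
```

Here `qrels["EP1"]` contains `US5` (cited directly), `EP5` (same family as `US5`) and `EP7` (cited by the topic's family member `US1`).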
• Manual evaluations show some conflicting judgments
– No more than for other manually evaluated topics
• Using entity recognition and synonyms proves successful
– Some groups manually extended the queries
• “Simple methods” (e.g. Lucene-based, BM25) also seem to perform well
– E.g. for Inferred Average Precision they reach 97% of the highest score
• Disclaimer: results analysis is still ongoing
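For readers unfamiliar with the measure, Inferred Average Precision estimates ordinary average precision when only a sample of the documents has been judged. The sketch below shows plain (non-inferred) average precision for one topic, as a simplified stand-in; the function name and the toy ranking are illustrative and do not come from the track's results.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


ap = average_precision(["a", "x", "b", "y"], {"a", "b", "c"})
```

In this toy case the relevant documents appear at ranks 1 and 3 and one is never retrieved, so the value is (1/1 + 2/3) / 3 = 5/9. Averaging this over all topics gives a run-level score that can be compared across systems.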
• Subject to discussion at TREC in November
– Increased numbers of patents
– Include images
– Task extensions/refinements
• Searching for numerical ranges (independent of unit)
• Searching for specific roles of specific chemical components
• The use of Markush structures
• The IRF is promoting collaboration between information retrieval and intellectual property professionals through evaluations and joint technology development projects
• TREC CHEM has provided an objective and independent means of assessing the effectiveness of technologies on two sorts of retrieval tasks
IRF Newsletter Chemistry Issue: http://www.ir-facility.org/the_irf/newsletter
www.ir-facility.org
www.matrixware.com
www.matrixware.net