Automated Gene Summary - BeeSpace

Automated Gene Summary:
Let the Computer Summarize
the Knowledge
Xu Ling
Department of Computer Science
University of Illinois at Urbana-Champaign
The Reality of Scientific Literature
[Figure: growth of the scientific literature over time. x-axis: year, 1965 to 2010; y-axis: 0 to 18,000,000. Annotation: "Hard to keep up manual curation!"]
Automated Gene Summarization
[Figure: a gene summary with one section per aspect: Gene product, Expression, Sequence, Interactions, Mutations, General Functions]
Goal

To retrieve and summarize all the
knowledge about a particular gene from
the literature

Compressing knowledge: enables biologists to
quickly understand the target gene.

Automated curation: explicitly covers multiple
aspects of a gene, such as the sequence
information, mutant phenotypes etc.
Our Solution

Semi-structured summary on multiple aspects







Gene products
Expression pattern
Sequence information
Phenotypic information
Genetic/physical interactions
…
2-stage summarization


Retrieve relevant articles by gene name search
Extract the most informative and relevant sentences for
each aspect.
Text Summary of Gene Abl
System Overview: 2-stage
Gene name
recognition
Sentence
Categorization
Gene Name Recognition

v1: Dictionary-based string match
High recall, low precision

v2: Machine learning methods of gene
name recognition
High precision, low recall

v3: v2 + dictionary-based synonym
expansion
Improved in both recall and precision
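As a rough illustration of the dictionary-based side of gene name recognition, the sketch below matches whole-word aliases against a sentence and maps them back to a canonical symbol. The dictionary contents, the names `GENE_SYNONYMS` and `find_gene_mentions`, and the case-insensitive matching are assumptions made for this example, not the BeeSpace implementation.

```python
import re

# Hypothetical synonym dictionary: canonical symbol -> aliases to search for.
# A real system would load this from a curated resource.
GENE_SYNONYMS = {
    "Abl": ["Abl", "Abelson"],
    "en": ["engrailed"],
}

def find_gene_mentions(sentence, synonyms=GENE_SYNONYMS):
    """Return the canonical symbols whose aliases occur in the sentence."""
    found = set()
    for symbol, aliases in synonyms.items():
        for alias in aliases:
            # Whole-word, case-insensitive match on the escaped alias.
            pattern = r"\b" + re.escape(alias) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(symbol)
                break  # one matching alias is enough for this symbol
    return found
```

Plain string matching like this catches many mentions but also ambiguous ones, which is consistent with the recall/precision trade-off noted for v1.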
Categorization of Retrieved
Sentences

Collect “example sentences” from FlyBase



v1: applying vector space model to construct
aspect “profile”.
v2: applying probabilistic models to factor out
context-specific language.
v3: v2 + biologist labeled training
examples. Real sentence!
Many thanks to Susan Brown’s “Beetle group” for their help!
Example. 1
Example. 2
Gene Summary in BeeSpace v4

To add
General Entity Summarization

General and applicable to summarize
other entities: pathways, protein family, …

General settings:
1. Space: A set of documents to be summarized.
2. Aspects: A set of aspects to define the structure of the summary.
3. Examples: Training sentences for each aspect.
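The three general settings can be pictured as one small container. This is only a sketch of one possible representation; the class name `SummarizationTask` and the helper `aspects_without_examples` are hypothetical and do not come from BeeSpace.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SummarizationTask:
    """Hypothetical container for the three general settings."""
    space: List[str]                      # documents to be summarized
    aspects: List[str]                    # aspects that structure the summary
    examples: Dict[str, List[str]] = field(default_factory=dict)  # training sentences per aspect

    def aspects_without_examples(self):
        """List the aspects that have no training sentences yet."""
        return [a for a in self.aspects if not self.examples.get(a)]
```

Switching from genes to pathways or protein families would then mean supplying a different space, aspect list, and example set rather than changing the summarizer.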
Further Generalization …

Limitations of the categorization problem with
training examples




Predefined aspects, may not fit the need of a particular
user
Only works for a predefined domain and topics
Training examples for each aspect are often unavailable
More Realistic New Setup


Allow a user to flexibly describe each facet with
keywords (1-2): let the user determine what they want
Generate the summary in a semi-supervised way: no
need of training examples
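To make the keyword-described facets concrete, here is a deliberately simple sketch that assigns a sentence to the facet whose keywords it mentions most often. It uses plain keyword counting as a stand-in and is not the semi-supervised method referred to above. The facet keywords are taken from the car-review example in this deck; the function name `assign_facet` is an assumption.

```python
import re

# Facets described by one or two user-supplied keywords each.
FACET_KEYWORDS = {
    "Design": ["design", "style"],
    "Engine": ["engine", "fuel"],
    "Finance": ["finance", "price"],
}

def assign_facet(sentence, facet_keywords=FACET_KEYWORDS):
    """Return the facet with the most keyword hits, or None if there are none."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    best_facet, best_hits = None, 0
    for facet, keywords in facet_keywords.items():
        hits = sum(tokens.count(k) for k in keywords)
        if hits > best_hits:  # ties keep the facet listed first
            best_facet, best_hits = facet, hits
    return best_facet
```

Exact keyword matching misses related wording (for example "styling" for "style"), which is one reason a method that generalizes beyond the literal keywords is attractive.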
Example (1): Consumer vs. Editor
Honda Accord 2006
Columns: Facets | Generated Overview (10k customer rev.) | Editor's Review (1)

Facet: Body Styles, Exterior Design
Generated Overview: Like the minor exterior styling changes from 2005 to 2006. Tried the Camry XLE first, nice ride, but lacked a few features i wanted, like dual zone A/C, and didn't like the wood trim.
Editor's Review: ... Available trim levels include ... The VP provides air conditioning, power windows ...

Facet: Powertrains
Generated Overview: …
Editor's Review: …

Facet: Safety
Generated Overview: …
Editor's Review: …

Facet: Interior Design
Generated Overview: The interior is beautiful - I got all of the features and the navigation is extremely easy to use. Accord's interior is top notch, nice design, clear gauges, comfy seats, lots of storage space
Editor's Review: …The seating arrangements are top-notch, and the interior design and materials quality continue the high-caliber standards ... The car's backseat is among the roomiest in the segment...

Facet: Driving Impressions
Generated Overview: …
Editor's Review: …
Example (2): Different Aspects

What if the users want an overview with different
facets?
Columns: Facets | User Input | Generated Overview

Facet: Design
User Input: design, style
Generated Overview: Like the minor exterior styling changes from 2005 to 2006. Accord's interior is top notch, nice design, clear gauges, comfy seats, lots of storage space

Facet: Engine
User Input: engine, fuel
Generated Overview: …

Facet: Finance
User Input: finance, price
Generated Overview: When I bought it I was amazed at the trim level for the price. It is extremely fun to drive, fit and finish is fantastic, the oversteer could easily be corrected, at the price, it has no peer and is 10k less then a comparable BMW

Facet: Safety
User Input: safety
Generated Overview: …

Facet: Driving
User Input: comfort, fun
Generated Overview: …
Conclusion

The generated summaries are

directly useful to biologists,
and also serve as entry points that let
them quickly navigate the relevant literature
via the BeeSpace analysis environment,
available at
www.beespace.uiuc.edu
Start from Here …

The reverse of automated entity
summarization: automated entity retrieval
Profiling of entities using entity summaries
E.g., what genes are associated with … ?

Build a powerful knowledge base …
Enriched entities under a certain context
E.g., what are the significantly enriched genes in … ?
Entities involved in certain biomedical relations
E.g., what genes interact with gene X?

BeeSpace v5 !
Acknowledgement
Chengxiang Zhai
Bruce Schatz
Xin He
Jing Jiang
Qiaozhu Mei
Moushumi Sarma
Gene Robinson
Vector Space Model (VSM)

Construct a corresponding term vector $V_c$ using
the training sentences for the aspect

The weight of a term $t_i$ in the aspect term vector for aspect $j$:
$w_{ij} = TF_{ij} \cdot IDF_i$, where $TF_{ij}$ = term frequency and $IDF_i = 1 + \log(N/n_i)$ is
the inverse document frequency ($N$ = total number of
documents, $n_i$ = number of documents containing term $t_i$).

Construct a sentence term vector $V_s$ for each
sentence

with the same IDF and TF = number of times a term occurs in
the sentence

Aspect relevance score $S = \cos(V_c, V_s)$.
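The scoring above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BeeSpace code: each training sentence is treated as one "document" when computing IDF, tokenization is a simple lowercase alphanumeric split, and the toy sentences and all function names are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(documents):
    """IDF_i = 1 + log(N / n_i); each input string counts as one document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: 1.0 + math.log(n_docs / n) for term, n in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF*IDF vector; terms missing from the IDF table are dropped."""
    counts = Counter(tokenize(text))
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def aspect_relevance(training_sentences, sentence, idf):
    """S = cos(V_c, V_s): V_c from the aspect's training sentences, V_s from the sentence."""
    v_c = tfidf_vector(" ".join(training_sentences), idf)
    v_s = tfidf_vector(sentence, idf)
    return cosine(v_c, v_s)

# Toy training sentences, invented for illustration only.
EXPRESSION = ["the gene is expressed in the embryo",
              "expression is detected in neurons"]
SEQUENCE = ["the gene encodes a protein of 500 amino acids",
            "the sequence contains a kinase domain"]
IDF = build_idf(EXPRESSION + SEQUENCE)
```

A candidate sentence is scored against every aspect profile, and the highest-scoring sentences per aspect are the natural ones to place in that section of the summary.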