Automated Gene Summary: Let the Computer Summarize the Knowledge

Xu Ling
Department of Computer Science
University of Illinois at Urbana-Champaign

The Reality of Scientific Literature

[Figure: chart with a vertical axis from 0 to 18,000,000 and a horizontal axis from 1965 to 2010]
Hard to keep up with manual curation!

Automated Gene Summarization

[Figure: gene summary with sections Gene product, Expression, Sequence, Interactions, Mutations, General Functions]

Goal
To retrieve and summarize all the knowledge about a particular gene from the literature
- Compressing knowledge: enables biologists to quickly understand the target gene.
- Automated curation: explicitly covers multiple aspects of a gene, such as the sequence information, mutant phenotypes, etc.

Our Solution

- Semi-structured summary on multiple aspects
  - Gene products
  - Expression pattern
  - Sequence information
  - Phenotypic information
  - Genetic/physical interactions
  - …
- 2-stage summarization
  - Retrieve relevant articles by gene name search
  - Extract the most informative and relevant sentences for each aspect

Text Summary of Gene Abl

System Overview: 2-stage

- Gene name recognition
- Sentence categorization

Gene Name Recognition

- v1: Dictionary-based string match (high recall, low precision)
- v2: Machine learning methods of gene name recognition (high precision, low recall)
- v3: v2 + dictionary-based synonym expansion (improved in both recall and precision)

Categorization of Retrieved Sentences

- Collect “example sentences” from FlyBase
- v1: applying a vector space model to construct an aspect “profile”
- v2: applying probabilistic models to factor out context-specific language
- v3: v2 + biologist-labeled training examples
Real sentence! Many thanks to Susan Brown’s “Beetle group” for the help!

Example 1

Example 2

Gene Summary in BeeSpace v4

To add

General Entity Summarization

General and applicable to summarizing other entities: pathways, protein families, …
General settings:
1. Space: A set of documents to be summarized.
2. Aspects: A set of aspects to define the structure of the summary.
3. Examples: Training sentences for each aspect.

Further Generalization …

Limitations of the categorization problem with training examples
- Predefined aspects may not fit the needs of a particular user
- Only works for a predefined domain and topics
- Training examples for each aspect are often unavailable

More Realistic New Setup

- Allow a user to flexibly describe each facet with keywords (1-2): let the user determine what they want
- Generate the summary in a semi-supervised way: no need for training examples

Example (1): Consumer vs. Editor

Honda Accord 2006
Columns: Facets | Generated Overview (10k customer rev.) | Editor's Review (1)

Facet: Body Styles, Exterior Design
  Generated Overview: Like the minor exterior styling changes from 2005 to 2006. Tried the Camry XLE first, nice ride, but lacked a few features i wanted, like dual zone A/C, and didn't like the wood trim. ...
  Editor's Review: Available trim levels include ... The VP provides air conditioning, power windows ...
Facet: Powertrains
  Generated Overview: …
  Editor's Review: …
Facet: Safety
  Generated Overview: …
  Editor's Review: …
Facet: Interior Design
  Generated Overview: The interior is beautiful - I got all of the features and the navigation is extremely easy to use. Accord's interior is top notch, nice design, clear gauges, comfy seats, lots of storage space …
  Editor's Review: The seating arrangements are top-notch, and the interior design and materials quality continue the high-caliber standards ... The car's backseat is among the roomiest in the segment...
Facet: Driving Impressions
  Generated Overview: …
  Editor's Review: …

Example (2): Different Aspects

What if the users want an overview with different facets?
Columns: Facets | User Input | Generated Overview

Facet: Design
  User Input: design, style
  Generated Overview: Like the minor exterior styling changes from 2005 to 2006. Accord's interior is top notch, nice design, clear gauges, comfy seats, lots of storage space
Facet: Engine
  User Input: engine, fuel
  Generated Overview: …
Facet: Finance
  User Input: finance, price
  Generated Overview: When I bought it I was amazed at the trim level for the price. It is extremely fun to drive, fit and finish is fantastic, the oversteer could easily be corrected, at the price, it has no peer and is 10k less then a comparable BMW
Facet: Safety
  User Input: safety
  Generated Overview: …
Facet: Driving
  User Input: comfort, fun
  Generated Overview: …

Conclusion

The generated summaries are directly useful to biologists. They also serve as entry points that let biologists quickly navigate the relevant literature via the BeeSpace analysis environment, available at www.beespace.uiuc.edu

Start from Here …

- The reverse of automated entity summarization: automated entity retrieval
  - Profiling of entities using entity summaries. E.g., what genes are associated with … ?
- Build a powerful knowledge base …
  - Enriched entities under a certain context. E.g., what are the significantly enriched genes in …?
  - Entities involved in certain biomedical relations. E.g., what genes are interacting with gene X?
- BeeSpace v5!

Acknowledgement

Chengxiang Zhai
Bruce Schatz
Xin He
Jing Jiang
Qiaozhu Mei
Moushumi Sarma
Gene Robinson

Vector Space Model (VSM)

- Construct a corresponding term vector $V_c$ for each aspect using the training sentences for the aspect
  - The weight of term $t_i$ in the aspect term vector for aspect $j$: $w_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_i$, where $\mathrm{TF}_{ij}$ is the term frequency and $\mathrm{IDF}_i = 1 + \log(N/n_i)$ is the inverse document frequency ($N$ = total number of documents, $n_i$ = number of documents containing term $t_i$)
- Construct a sentence term vector $V_s$ for each sentence, with the same IDF and TF = number of times a term occurs in the sentence
- Aspect relevance score: $S = \cos(V_c, V_s)$
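
The vector space scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the BeeSpace implementation. It makes several assumptions the slides do not specify: tokens are lowercase alphanumeric runs, each sentence counts as one "document" when computing IDF, the aspect TF is summed over all training sentences of that aspect, and the logarithm is natural. The function names and the toy sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the slides do not specify tokenization.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    # IDF_i = 1 + log(N / n_i), with N documents and n_i documents containing term i.
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: 1.0 + math.log(n_docs / n) for term, n in doc_freq.items()}


def tfidf_vector(texts, idf):
    # Term weight = TF * IDF, where TF is the raw count summed over the given texts.
    tf = Counter()
    for text in texts:
        tf.update(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(aspect_examples, candidates, documents):
    # Score each candidate sentence against the aspect vector Vc, highest first.
    idf = idf_table(documents)
    v_c = tfidf_vector(aspect_examples, idf)
    scored = [(cosine(v_c, tfidf_vector([s], idf)), s) for s in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data for an "expression" aspect.
aspect_examples = [
    "The gene is expressed in the embryonic nervous system.",
    "Expression is detected in the larval brain.",
]
candidates = [
    "Transcripts are expressed in the developing brain.",
    "Mutant alleles cause lethality at the pupal stage.",
]
documents = aspect_examples + candidates

ranked = rank_sentences(aspect_examples, candidates, documents)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

In this toy run the sentence about expression in the brain ranks above the one about mutant lethality, because it shares more weighted terms with the aspect vector. A real system would compute IDF over the whole document space rather than four sentences.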