Automated Information Extraction and Analysis for Information Synthesis
Catherine Blake, M.S., MCompSc
Information and Computer Science
University of California, Irvine
Tammy Tengs, Sc.D
Health Priorities Research Group
School of Social Ecology
University of California, Irvine
Motivation
• Information overload
– more than 12 million references already in MEDLINE
– thousands more each day
– well-articulated queries retrieve many relevant articles
• Most information from an article is not used
• Breast cancer risk factors are not well understood
– Other than age and gender, current risk factors explain only half of the breast cancer occurrence
Method
[Figure: Breast Cancer Studies that Report Smoking. Legend: smoking reported as primary topic or secondary topic; unit: 1 study]
• Traditional meta-analysis considers only the 29 studies where smoking is a primary topic
• Information synthesis considers secondary information from the remaining 58 studies
• Including secondary information can reduce certain publication biases
Implementation and Use
• User study
– 2 groups of scientists in medicine and public health
– observations and interviews during a systematic review
– retrospective analysis after a meta-analysis
• Meta-analysis training
• Literature review
• Framework to support scientists as they retrieve, extract and analyze information from articles
• Supports observed user behaviors and work practices
Information Synthesis Framework
[Figure: framework diagram. Labels: Research Question, Contextual Questions, Selection, Domain Corpus, Extraction, Facts, Concepts, Verification, Analysis, External Data, Collaboration]
Critical components
• Identify information from text
• Verify the information extracted
• Synthesis using meta-analysis
1 What are women with breast cancer exposed to?
Breast cancer studies
For each study
• number of patients
• age of patients
• risk-factor exposure
•…
2 What are women in a similar population exposed to?
External database of risk factors
3 Are these rates significantly different?
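Question 3 above asks whether two exposure rates differ significantly. The poster does not say which test is used, so the following is only a minimal sketch assuming a two-sided two-proportion z-test. The function name `two_proportion_z_test` and the counts are hypothetical and are not data from the studies.

```python
import math


def two_proportion_z_test(exposed_a, total_a, exposed_b, total_b):
    """Two-sided z-test for a difference between two exposure rates.

    Assumption: a pooled two-proportion z-test; the poster does not
    name the statistical test behind question 3.
    """
    rate_a = exposed_a / total_a
    rate_b = exposed_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (exposed_a + exposed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only:
# group A = women with breast cancer, group B = a similar population.
z, p = two_proportion_z_test(exposed_a=300, total_a=1000,
                             exposed_b=250, total_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest that the exposure rate among the women in the breast cancer studies differs from the rate in the comparison population.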
Funded by California Breast Cancer Research Program
Research Goals
(1) Understand how scientists in medicine and public health currently use biomedical literature to answer research questions
(2) Design and implement technology to support the observed information behaviors and work practices
(3) Use the technology to quantify the risk of smoking and breast cancer
Conclusion
Using Information Synthesis to Quantify Breast Cancer Risk Factors
Wanda Pratt, Ph.D.
Information School and Division of Biomedical & Health Informatics
University of Washington
Codebook
• age, gender
• % responses
• location
•…
Pilot implementation
• Automated Extraction
– heuristic approach
– training set: precision = 0.84, recall = 0.86
– test set: precision = 0.68, recall = 0.71
• Automated Analysis
– Random-effects meta-analysis module implemented in Java
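The pilot's analysis module is described only as a random-effects meta-analysis implemented in Java. As an illustration of the technique, not of that module, here is a minimal Python sketch assuming the DerSimonian-Laird estimator. The function name, the choice of estimator, and the input values are all assumptions.

```python
import math


def random_effects_meta_analysis(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird estimator.

    effects   -- per-study effect sizes (e.g. log odds ratios)
    variances -- within-study variances of those effects
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q measures heterogeneity across studies.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    std_err = math.sqrt(1.0 / sum(w_star))
    return pooled, std_err, tau2


# Hypothetical effect sizes and variances, for illustration only.
pooled, std_err, tau2 = random_effects_meta_analysis(
    effects=[0.0, 1.0], variances=[0.01, 0.01])
low, high = pooled - 1.96 * std_err, pooled + 1.96 * std_err
print(f"pooled = {pooled:.2f}, 95% CI = ({low:.2f}, {high:.2f}), tau^2 = {tau2:.2f}")
```

When the studies disagree more than their within-study variances predict, tau^2 grows and the pooled confidence interval widens, which is the main practical difference from a fixed-effect analysis.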
• During a systematic review scientists iterate between retrieval, extraction and analysis
• Most information required is located only within the full-text of an article (not the abstract or title)
• Information Synthesis requires collaboration
• Pilot study demonstrated
– Automated extraction looks promising
– Automated analysis successful
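The pilot reports precision and recall for the automated extraction. The poster does not describe how the extracted facts were matched against a reference, so the sketch below assumes exact matching of (study, field, value) triples against a hand-coded gold standard. The function name and the example facts are hypothetical.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted facts against a gold standard.

    Assumption: a fact counts as correct only if it matches a gold
    fact exactly; the poster does not specify the matching rule.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (study id, field, value) facts, for illustration only.
gold = {
    ("s1", "patients", "120"),
    ("s1", "age", "45-64"),
    ("s2", "patients", "300"),
    ("s2", "smokers", "35%"),
}
extracted = {
    ("s1", "patients", "120"),
    ("s2", "patients", "300"),
    ("s2", "smokers", "53%"),  # wrong value, counts as a false positive
}
precision, recall = precision_recall(extracted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision is the share of extracted facts that are correct, and recall is the share of gold-standard facts that were found.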
Future Work
• Further evaluation of extraction algorithms
• Implementation of verification component
• Final smoking and breast cancer analysis
Related Work
Blake, C. and Pratt, W. (In Press). Collaborative Information Synthesis. American Society for Information Science and Technology (ASIST), 2002, Philadelphia, PA.
Blake, C. (2002). Information Synthesis: A Process used by Scientists in Medicine and Public Health to Overcome Information Overload. Fourth International Conference on Conceptions of Library and Information Science: Emerging Frameworks and Methods (CoLIS 4), Doctoral Forum, Seattle, WA.
Tengs, T. and Osgood, N.D. (2001). The Link Between Smoking and Impotence: Two Decades of Evidence. Preventive Medicine, 32(6), 447-452.