Automated Information Extraction and Analysis for Information Synthesis

Catherine Blake, M.S., MCompSc
Information and Computer Science, University of California, Irvine

Tammy Tengs, Sc.D
Health Priorities Research Group, School of Social Ecology, University of California, Irvine

Wanda Pratt, Ph.D.
Information School and Division of Biomedical & Health Informatics, University of Washington

Funded by California Breast Cancer Research Program

Motivation
• Information overload
  – more than 12 million references already in MEDLINE
  – thousands more each day
  – well-articulated queries retrieve many relevant articles
• Most information from an article is not used
• Breast cancer risk factors are not well understood
  – Other than age and gender, current risk factors explain only half of breast cancer occurrence

Research Goals
(1) Understand how scientists in medicine and public health currently use biomedical literature to answer research questions
(2) Design and implement technology to support the observed information behaviors and work practices
(3) Use the technology to quantify the risk of smoking and breast cancer

[Figure: Breast Cancer Studies that Report Smoking. Legend: smoking reported (1 study); primary topic; secondary topic.]

Method
• Traditional meta-analysis considers only the 29 studies where smoking is a primary topic
• Information synthesis considers secondary information from the remaining 58 studies
• Including secondary information can reduce certain publication biases

Implementation and Use
• User study
  – 2 groups of scientists in medicine and public health
  – observations and interviews during a systematic review
  – retrospective analysis after a meta-analysis
• Meta-analysis training
• Literature review
• Framework to support scientists as they retrieve, extract and analyze information from articles
• Supports observed user behaviors and work practices

Information Synthesis Framework
[Figure: framework diagram. Labels: Research Question, Contextual Questions, Selection, Domain Corpus, Facts, Analysis, Extraction, Concepts, External Data, Verification, Collaboration.]

Critical components
• Identify information from text
• Verify the information extracted
• Synthesis using meta-analysis

Using Information Synthesis to Quantify Breast Cancer Risk Factors
1 What are women with breast cancer exposed to?
  Labels: Breast cancer studies; MEDLINE; Information Synthesis
  For each study
  • number of patients
  • age of patients
  • risk-factor exposure
  • …
2 What are women in a similar population exposed to?
  External database of risk factors
  Codebook
  • age, gender
  • % responses
  • location
  • …
3 Are these rates significantly different?

Pilot implementation
• Automated Extraction
  – heuristic approach
  – training set precision and recall = (0.84, 0.86)
  – test set precision and recall = (0.68, 0.71)
• Automated Analysis
  – Random-effects meta-analysis module implemented in Java

Conclusion
• During a systematic review, scientists iterate between retrieval, extraction and analysis
• Most information required is located only within the full text of an article (not the abstract or title)
• Information Synthesis requires collaboration
• Pilot study demonstrated
  – Automated extraction looks promising
  – Automated analysis successful

Future Work
• Further evaluation of extraction algorithms
• Implementation of verification component
• Final smoking and breast cancer analysis

Related Work
Blake, C. and Pratt, W. (In Press). Collaborative Information Synthesis. American Society for Information Science and Technology (ASIST), 2002, Philadelphia, PA.
Blake, C. (2002). Information Synthesis: A Process Used by Scientists in Medicine and Public Health to Overcome Information Overload. Fourth International Conference on Conceptions of Library and Information Science: Emerging Frameworks and Methods (CoLIS 4), Doctoral Forum, Seattle, WA.
Tengs, T. and Osgood, N.D. (2001). The Link Between Smoking and Impotence: Two Decades of Evidence. Preventive Medicine, 32(6), 447-452.
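
Illustrative sketch: random-effects meta-analysis
The pilot implementation mentions a random-effects meta-analysis module written in Java, but the poster does not say which estimator it uses. The Python sketch below is not that module. It assumes the DerSimonian-Laird estimator, a common choice for random-effects pooling, and it assumes each study contributes one effect size (for example a log odds ratio) with a known variance. The function name `random_effects_meta` and the numbers in the usage section are hypothetical.

```python
import math


def random_effects_meta(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird estimator.

    effects   -- per-study effect sizes (e.g. log odds ratios); assumed input
    variances -- per-study sampling variances, all strictly positive
    """
    k = len(effects)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-study variance, floored at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each study's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    se = math.sqrt(1.0 / sw_star)

    return {
        "pooled": pooled,
        "se": se,
        "tau2": tau2,
        "q": q,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
    }


if __name__ == "__main__":
    # Hypothetical log odds ratios and variances, not data from the poster.
    log_or = [0.10, 0.35, -0.05, 0.22]
    var = [0.04, 0.09, 0.02, 0.06]
    result = random_effects_meta(log_or, var)
    print("pooled log OR:", result["pooled"])
    print("95% CI:", result["ci95"])
    print("tau^2:", result["tau2"])
```

When the studies agree closely, the between-study variance estimate is floored at zero and the result matches a fixed-effect analysis. When they disagree, the larger tau^2 widens the confidence interval and evens out the study weights, which matters when primary-topic and secondary-topic studies are pooled together.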