Information Retrieval and its Application in Biomedicine
Sept 4: Introduction
Hong Yu [1,2], PhD
Susan McRoy [1], PhD
[1] Department of Computer Science
[2] Department of Health Sciences
University of Wisconsin-Milwaukee

What is Information Retrieval?
The field concerned with the acquisition, organization, and searching of knowledge-based information. (Hersh, 2003)
Speed up communication

Information
World Wide Web
Company documentation
Drug descriptions
Medical records
Books
Everything that is text, image, video, or sound and that can be digitized

Information in Biomedicine
Literature (over 17 million publications)
WWW
Electronic medical records
Genomics data – DNA sequences, etc.
Knowledge representation – Gene Ontology
Company databases – Micromedex drug database

IR in Biomedicine
Index Medicus (Billings 1879)
MEDLARS (NLM 1966)
SAPHIRE (Hersh 1990)
PubMed (NLM 1996)
Arrowsmith (Smalheiser 1998)
BioText (Hearst 2003)
BioMedQA (Yu 2006)

Electronic and Open Publishing
The Internet and the Web have had a profound impact on the publishing of knowledge-based information
Most of the literature can be made available electronically
Open access
  – The Bethesda Statement on Open Access Publishing (http://www.earlham.edu/~peters/fos/bethesda.htm) (April 11, 2003)
  – The Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (http://www.zim.mpg.de/openaccessberlin/berlindeclaration.html) (2003)
  – PubMed Central (NLM 2004)

Quality of Information
A lack of quality control
  – Anyone can publish online
  – A wealth of studies has concluded that the quality of healthcare information on the Web is poor
Readability
  – Hard to read

Information Needs and Seeking
Unrecognized needs
  – Clinicians are unaware of an information need or knowledge deficit
Recognized needs
  – Clinicians are aware of the need but may or may not pursue it
Pursued needs
  – Information seeking occurs but may or may not be successful
Satisfied needs
  – Information seeking is successful

Evidence-Based Medicine

What You Will Learn
IR algorithms
  – Indexing
  – Query and retrieval
  – Evaluation
  – Text classification
  – XML retrieval
  – Web retrieval

What You Will Learn (Cont.)
Open-source IR tools
  – What open-source IR tools are available?
      Indexing/retrieval
      Part-of-speech and syntactic parsing
      Semantic parsing
      Discourse relations
      Machine-learning classifiers
  – How to use the tools?

What You Will Learn (Cont.)
State-of-the-art IR systems
  – Baruch 1965 [BLIMP http://blimp.cs.queensu.ca/index.html]
  – SAPHIRE (Hersh 1990): Retrieval
  – MedLEE (Friedman 1994): Extraction
  – PubMed (NLM 1997)
  – Arrowsmith (Smalheiser 1998): Hidden relation discovery tool
  – GENIES (Friedman 2001): Extraction

BioNLP Systems
BioText (Hearst 2003, http://biotext.berkeley.edu/)
  – Retrieval + Categorization
GeneWays (Rzhetsky 2004, http://geneways.genomecenter.columbia.edu/)
  – Extraction + Visualization
TextPresso (Muller 2004, http://www.textpresso.org/)
  – Retrieval + Extraction
iHOP (Hoffmann and Valencia 2005, http://www.ihop-net.org/UniPub/iHOP/)
  – Retrieval
BioMedQA (Yu 2006, http://monkey.ims.uwm.edu/MedQA)
  – Question Answering

Advanced NLP applications

Beyond text: Image and Video
Image classification
  – Finding concepts in captions and annotations
  – Machine learning on textual and visual features
  – Determining salient features in text and image separately and merging the results
Extracting text from images
  – Understanding and correcting OCR (handwriting, equations)
  – Finding text in images
Finding document text related to illustrations
Video retrieval

Beyond Extraction: Experimental Tools

Resources
Annotated collections (GENIA, Medstract, Yapex …)
Ontologies, tools, knowledge bases …
Publications, conferences, evaluations …
Centres and web portals

What We Provide
Textbook
  – Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2007. http://www-csli.stanford.edu/~schuetze/informationretrieval-book.html
Office hours
  – Tuesdays, 3-4 pm, EMS 710, and by appointment
  – Hong Yu, 414-229-3344
  – Susan McRoy, 414-229-6695

What We Expect
Undergraduate:
  – 30% Homework, 35% Midterm exam, 35% Final exam or project
Graduate:
  – 20% Midterm exam, 40% Homework, 40% Project
The project may be done individually or in a team of 2-3 people. The final project will include a software system, a 2-3 page written project report, and an oral presentation. The report should describe the problem, the approach, and the evaluation, and should cite related work where appropriate.