Information Extraction From Medical Records by Alexander Barsky Current Methodology: Broad assessment of patient contained in beginning of chart with references to more specific areas. Specific divisions follow broad assessment. Records are listed in chronological order of activity. Chart Example: . Problem: A patient's medical chart is very detailed and very complex in nature. Any attempt to quickly locate specific information will be met with frustration. Example: . Solution: Create a system that properly extracts wanted information based on a predefined set of parameters. Example: "Hormonal imbalance during puberty". Retrieve all references to hormonal imbalances but only between two specific time periods in medical chart. Tool At our disposal: JAPE : Java Annotation Patterns Engine. Use : pattern matching and semantic extraction GATE : General Architecture for Text Engineering. Use: Information Extraction, document annotation, and XML output. C# : Visual C# Winforms. Use: Medium for conversion between XML and .csv file formats. Solution Methodology: 1. Create corpus of documents in GATE. 2. Introduce rules for information extraction. 3. Annotate documents in corpus. 4. Output annotated documents in XML. 5. Strip file of unnecessary elements and convert to .csv. ANNIE A-Nearly-New-Information-Extraction-System -Tokeniser - splits sentence into simple tokens -Gazetter - identify entity names contained in lists -Sentence Splitter - splits text into sentences based on lists. -Parts of Speech Tagger - identifies text as different POS. -Coreference Matcher- identifies relationships between previously defined entities. Success in Information Extraction is based on integrating most if not all ANNIE components - JAPE : Key to Extraction - JAPE Example - XML Output: - Problem: Too much unorganized information. Solution : XLST to the rescue!!! XLST - Extensible Stylesheet Language Transformations - Add specific rules to seperate needed from unnecessary information. XLST Example -Find all the nodes within the <Lookup>. Add string between the tags. CSV File Type Comma Seperated Value - Used to present information in a tabular system. Useful for analyzing large amount of data in an easy to understand format. Most common program to use it is Excel. . Potential Problem: Regardless of how well all the ANNIE tools are utilized and how well the JAPE rules are defined, proper recall precentage won't ever be exact. Solution: Machine Learning Machine learning is our best chance to increase precision of output results. Training a computer to recognize commonally used reporting phraseology will organize extraction better with more precise, concise outputs. Lucky for us, GATE include plugins to program machine learning.