1 DIALECT TOPIC MODELING FOR IMPROVED CONSUMER MEDICAL SEARCH
Steven P. Crain, BS; Shuang-Hong Yang, MS; Hongyuan Zha, PhD (Georgia Institute of Technology)
Yu Jiao, PhD (Oak Ridge National Lab)

2 OUTLINE
This work was funded in part by Oak Ridge National Lab, Microsoft and Hewlett-Packard.
  Introduction
  Topic Models
  Dialect Topic Models
  Evaluation
  Conclusion

3 CONTEXT: CONSUMER MEDICAL SEARCH
Why do my eyes sometimes beat uncontrollably?
Double-Blind, Randomized, Comparative Study of Meditoxin® Versus Botox® in the Treatment of Essential Blepharospasm. Yun, Kim and Lee (2009)

4 CONTEXT: CONSUMER MEDICAL SEARCH
Why do my eyes sometimes beat uncontrollably?
Tic Disorders and Twitches: Many people at some point experience spasm-like movements of particular muscles. WebMD® (2010)

5 RELATED WORK
  HIQuA: Expand query with related technical terms. Zeng, et al. (2002)
  MedicoPort: Translation based on WebMD®. Zeng, et al. (2006)
  MedSearch: Extract usable query from long textual description. Can & Baykal (2007)
These all fail when the query contains informal words not found in consumer-targeted literature.

6 TOPICAL SEARCH
[Diagram: Informal Query, Technical Document]

7 OUTLINE
  Introduction
  Topic Models
  Dialect Topic Models
  Evaluation
  Conclusion

8 TOPIC MODEL: LATENT DIRICHLET ALLOCATION (LDA)
  #  Topic         Words
  1  Office visit  doctor, see, pain, any, right, bad
  2  Exercise      you, anything, muscle, exercise
  3  Anxiety       people, depression, life, anxiety
  4  Appearance    look, skin, acne, teeth, breath
  5  Gynecology    period, sex, worry, blood, infect

9 LDA WORKS WELL: MANY DOCUMENTS FOR EVERY TECHNICALITY
[Figure: Informal Documents, Technical Documents]

10 LDA WORKS POORLY: GAP BETWEEN INFORMAL AND TECHNICAL DOCUMENTS
[Figure: Informal Documents, Technical Documents]

11 OUTLINE
  Introduction
  Topic Models
  Dialect Topic Models
  Evaluation
  Conclusion

12 TOPIC STRUCTURE
[Diagram, two panels]
  LDA: Topic 1, Topic 2, Topic 3; words shown: twitch, beat, tic, blepharospasm
  Dialect Topic Models (DiaTM): Topic 1, Topic 2; dialects: Informal, Formal, Technical; words shown: twitch, beat, twitch, tic, blepharospasm

13 DIATM GENERATIVE MODEL
  Dialect Mixture: P(dialect|doc)
  Topic Mixture: P(topic|doc)
[Legend: DiaTM additions to LDA]

14 DIATM GENERATIVE MODEL
  Dialect Features
  Topic
[Legend: DiaTM additions to LDA]

15 DIATM GENERATIVE MODEL
  Topic Models
[Legend: DiaTM additions to LDA]

16 EXAMPLE
DiaTM improves topic models based on example documents.
  I'm 17 years old and I am so sick of acne. It’s not white heads. I get pimples that are very painful underneath the skin. I tried all kinds of products but haven’t found one that treats under the skin so if anyone has any advice please help!!!!

17 EXAMPLE
Identify the mixture of topics and dialects in the document.
  Dialect: Informal
  Topics: Appearance, Office visit, Exercise, Anxiety, Gynecology
  (Same post as on slide 16.)

18 EXAMPLE
New words are added to the “informal” version of the “appearance” topic.
  Dialect: Informal
  Topics: Appearance, Office visit, Exercise, Anxiety, Gynecology
  (Same post as on slide 16.)

19 OUTLINE
  Introduction
  Topic Models
  Dialect Topic Models
  Evaluation
  Conclusion

20 DATA SOURCES
Collections: Yahoo! Answers, MeSH®, CDC, WebMD®, PubMed Central
Per Collection:
  Training: 1000 documents
  Validation: 2500 documents
  Testing: 2500 documents
Labeled Data:
  Consumer questions: 200
  Question/Document pairs: 4982
  Judgments: 5854

21 UNSUPERVISED EVALUATION: PERPLEXITY
Perplexity: the uncertainty that remains after accounting for the model.
[Figure: perplexity of DiaTM vs. LDA]
DiaTM is twice as good as LDA at explaining word selection.

22 UNSUPERVISED EVALUATION: KL-DIVERGENCE
Kullback-Leibler divergence measures the difference between probability distributions.
We compare the distribution of topics in more technical collections to the distribution in Yahoo! Answers.
[Figure: probability by topic; series: Base, Low KL, High KL]
DiaTM shares topics across the technicality spectrum slightly better than LDA.

23 SUPERVISED EVALUATION: INFORMATION RETRIEVAL PERFORMANCE
Normalized Discounted Cumulative Gain (nDCG)

24 CONCLUSION
Goal: Overcome the language gap to enable consumers to find technical documents on a topic.
DiaTM uses topic models with dialect-specific versions to bridge the language gap.
Substantial improvement over state-of-the-art topical retrieval.
Further improvements expected from:
  Better handling of consumer vocabulary
  Improved dialect cues
  Document quality consideration

25 THANK YOU!
Contact: s.crain@gatech.edu
Data: https://research.cc.gatech.edu/dmirlab/node/2

26 SOURCES OF LANGUAGE GAPS
Language is Communication
Gaps come from:
  Encoding
  Target Audience

27 GAPS FROM ENCODING
Language is Efficient
Common ideas can be expressed succinctly.
Uncommon ideas must be composed of simpler ideas.
Consumers and professionals have different encoding needs.

28 GAPS FROM TARGET AUDIENCE
  Clarity: Believed to be understandable
  Expedient: Will accomplish goal
  Image: Want to sound informal or technical
The greater the effort, the larger the gap!

29 TOPICAL SEARCH
Advantages:
  Well-established techniques
  Avoid exhaustive consumer-to-technical translation
  Better recall
Disadvantages:
  Imprecise
  Different consumer and technical concepts
  Existing techniques assume limited language gap

30 DIATM DIALECT FEATURES
When adding words to a topic, DiaTM must know which dialect version to add them to.
Word-level features determine the dialect of each word.
We used:
  Word frequency statistics in different collections dominated by different dialects.
  Fraction of occurrences in consumer question vs. answer.
Other suggestions:
  Word length and complexity
  Presence in dialect-specific lexicon

31 MECHANICAL TURK INTER-RATER RELIABILITY
[Figure: x-axis Maximum Rating (0 to 5); y-axes Query-Doc Pairs and Judgment; series: Pairs, High, Low, Mean]

32 SUPERVISED EVALUATION
Precision
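
NOTE: COMPUTING THE SUPERVISED METRICS
The supervised evaluation slides name nDCG and precision as the retrieval measures. The sketch below shows one common way to compute both from graded relevance judgments listed in ranked order. It is an illustration under stated assumptions, not the authors' evaluation code: the slides do not say which nDCG variant was used, so a linear gain with a log2 discount is assumed; the relevance threshold for precision and all function names are hypothetical.

```python
import math


def precision_at_k(relevances, k, threshold=1):
    """Fraction of the top-k ranked results judged relevant.

    `relevances` holds graded judgments in ranked order. A result counts
    as relevant when its grade is >= `threshold` (assumed cutoff).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for grade in relevances[:k] if grade >= threshold)
    return hits / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: linear gain, log2 position discount."""
    return sum(
        grade / math.log2(rank + 2)  # rank is 0-based, so rank 0 divides by log2(2) = 1
        for rank, grade in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant results at all
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical judgments for one consumer question, in ranked order.
    ranked_grades = [3, 2, 3, 0, 1, 2]
    print("P@5   :", precision_at_k(ranked_grades, 5, threshold=2))
    print("nDCG@6:", round(ndcg_at_k(ranked_grades, 6), 4))
```

Normalizing by the ideal ordering of the judged documents makes scores comparable across questions that have different numbers of relevant documents. This matters when only some question/document pairs have judgments.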