1
DIALECT TOPIC MODELING FOR
IMPROVED CONSUMER MEDICAL
SEARCH
Steven P. Crain, BS
Shuang-Hong Yang, MS
Hongyuan Zha, PhD
Georgia Institute of Technology
Yu Jiao, PhD
Oak Ridge National Lab
2
This work was funded in part by Oak Ridge National Lab, Microsoft and Hewlett-Packard.
OUTLINE
Introduction
Topic Models
Dialect Topic Models
Evaluation
Conclusion
3
CONTEXT:
CONSUMER MEDICAL SEARCH
Why do my eyes sometimes beat uncontrollably?
Double-Blind, Randomized, Comparative Study of Meditoxin® Versus Botox® in the Treatment of Essential Blepharospasm
Yun, Kim and Lee (2009)
4
CONTEXT:
CONSUMER MEDICAL SEARCH
Why do my eyes sometimes beat uncontrollably?
Tic Disorders and Twitches
Many people at some point experience spasm-like movements of particular muscles.
WebMD® (2010)
5
RELATED WORK
HIQuA: Expand query with related technical terms. Zeng, et al. (2002)
MedicoPort: Translation based on WebMD®. Zeng, et al. (2006)
MedSearch: Extract usable query from long textual description. Can & Baykal (2007)
These all fail when the query contains informal words not found in consumer-targeted literature.
6
TOPICAL SEARCH
Informal Query
Technical Document
8
TOPIC MODEL:
LATENT DIRICHLET ALLOCATION (LDA)
Topic             Words
1 Office visit    doctor see pain any right bad
2 Exercise        you anything muscle exercise
3 Anxiety         people depression life anxiety
4 Appearance      look skin acne teeth breath
5 Gynecology      period sex worry blood infect
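As a rough illustration of how topics like those above can be learned, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling. It is not the implementation used in this work: the function names, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for the example.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (illustrative)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                 # n(doc, topic)
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # n(topic, word)
    topic_total = [0] * num_topics                               # n(topic)
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: P(topic | doc) * P(word | topic), both smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def top_words(topic_word, n=3):
    """Most frequent words per topic, ties broken alphabetically."""
    return [sorted(tw, key=lambda w: (-tw[w], w))[:n] for tw in topic_word]

# Toy corpus (made up for this sketch).
docs = [
    ["doctor", "pain", "doctor", "see"],
    ["skin", "acne", "skin", "look"],
    ["doctor", "see", "pain"],
    ["acne", "look", "skin"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word))
```

The sampler is random, so the topics it finds on such a small corpus can vary with the seed; real use needs far more documents.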
9
LDA WORKS WELL: MANY DOCUMENTS
AT EVERY LEVEL OF TECHNICALITY
Informal Documents
Technical Documents
10
LDA WORKS POORLY: GAP BETWEEN
INFORMAL AND TECHNICAL DOCUMENTS
Informal Documents
Technical Documents
12
TOPIC STRUCTURE
LDA
  Topic 1: twitch beat tic
  Topic 2: blepharospasm
  Topic 3
Dialect Topic Models (diaTM)
  Topic 1
    Informal: twitch beat
    Formal: twitch tic
    Technical: blepharospasm
  Topic 2
13
DIATM GENERATIVE MODEL
Dialect Mixture: P(dialect|doc)
Topic Mixture: P(topic|doc)
DiaTM additions to LDA
14
DIATM GENERATIVE MODEL
Dialect Features
Topic
DiaTM additions to LDA
15
DIATM GENERATIVE MODEL
Topic Models
DiaTM additions to LDA
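One way to read the diagram labels is as a generative story: each document has a topic mixture and a dialect mixture, and each word is drawn from the dialect-specific version of a topic. The sketch below follows that reading only. It is a simplification, not the published model: it leaves out the priors and the dialect features, and the function name, the `spasm` topic label, and the toy distributions are hypothetical.

```python
import random

def generate_document(topic_mix, dialect_mix, word_dist, length, seed=0):
    """Sample words from P(topic|doc), P(dialect|doc) and P(word|topic, dialect).

    Simplified illustration; the real diaTM model has more structure.
    """
    rng = random.Random(seed)
    topics, topic_p = zip(*topic_mix.items())
    dialects, dialect_p = zip(*dialect_mix.items())
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=topic_p)[0]
        dialect = rng.choices(dialects, weights=dialect_p)[0]
        # Each (topic, dialect) pair has its own word distribution.
        vocab, vocab_p = zip(*word_dist[(topic, dialect)].items())
        words.append(rng.choices(vocab, weights=vocab_p)[0])
    return words

# Toy distributions (made up), echoing the twitch/blepharospasm example.
word_dist = {
    ("spasm", "informal"): {"twitch": 0.5, "beat": 0.5},
    ("spasm", "technical"): {"blepharospasm": 1.0},
}
informal_doc = generate_document(
    {"spasm": 1.0}, {"informal": 0.9, "technical": 0.1}, word_dist, length=8
)
print(informal_doc)
```

Because both dialect versions belong to the same topic, an informal query and a technical document can match on the topic even when they share no words.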
16
EXAMPLE
DiaTM improves topic models based on example documents.
I'm 17 years old and I am so sick of acne. It’s not white heads. I get pimples that are very painful underneath the skin. I tried all kinds of products but haven’t found one that treats under the skin so if anyone has any advice please help!!!!
17
EXAMPLE
Identify the mixture of topics and dialects in the document.
Dialect: Informal
Topics: Appearance, Office visit, Exercise, Anxiety, Gynecology
18
EXAMPLE
New words are added to the “informal” version of the “appearance” topic.
20
DATA SOURCES
Collections: Yahoo! Answers, MeSH®, CDC, WebMD®, PubMed Central
Per Collection:
  Training: 1000 documents
  Validation: 2500 documents
  Testing: 2500 documents
Labeled Data:
  Consumer questions: 200
  Question/Document pairs: 4982
  Judgments: 5854
21
UNSUPERVISED EVALUATION:
PERPLEXITY
The uncertainty that remains after accounting for the model.
[Chart: perplexity of diaTM and LDA]
DiaTM is twice as good as LDA at explaining word selection.
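For reference, perplexity is usually computed as the exponential of the average negative log-probability that the model assigns to each held-out token. The sketch below shows that standard definition; the function name and the toy probabilities are assumptions, and how per-token probabilities are obtained from LDA or diaTM is outside its scope.

```python
import math

def perplexity(token_probs):
    """exp(-(1/N) * sum(log p_i)) over the model's per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# A model that is always torn between 4 equally likely words
# has a perplexity of about 4.
print(perplexity([0.25] * 10))
```

Lower is better: a lower perplexity means the model is less surprised by the words that actually occur.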
22
UNSUPERVISED EVALUATION:
KL-DIVERGENCE
Kullback-Leibler divergence measures the difference between probability distributions.
We compare the distribution of topics in more technical collections to the distribution in Yahoo! Answers.
[Chart: probability by topic for three distributions labeled Base, Low KL and High KL]
DiaTM has slightly better topic sharing across the technicality spectrum than LDA.
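As a concrete illustration, the KL divergence between two topic distributions can be computed as below. This is a minimal sketch: the function name, the `eps` floor for zero probabilities, and the toy distributions are assumptions, and the slides do not say which collection plays the role of `p` and which of `q`.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps
    (an assumption of this sketch) to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )

# Toy topic distributions (made up): a base collection versus a collection
# that uses the topics similarly and one that uses them very differently.
base = [0.4, 0.3, 0.2, 0.1]
similar = [0.35, 0.3, 0.25, 0.1]
different = [0.05, 0.05, 0.1, 0.8]
print(kl_divergence(similar, base), kl_divergence(different, base))
```

A low value means the two collections spread their words over the topics in a similar way, which is the "topic sharing" the comparison looks for.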
23
SUPERVISED EVALUATION:
INFORMATION RETRIEVAL PERFORMANCE
Normalized Discounted Cumulative Gain (nDCG)
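For readers unfamiliar with the metric, here is a minimal sketch of nDCG for one ranked result list. It uses the common exponential-gain, log2-discount variant; the slides do not state which variant was used, so that choice, the function names, and the example ratings are assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 rank discount."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )

def ndcg(relevances, k=None):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ranked = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance ratings of the returned documents, in ranked order (made up).
print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([1, 3, 2]))  # best document ranked second
```

A perfectly ordered list scores 1, and pushing highly rated documents down the ranking lowers the score.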
24
CONCLUSION
Goal: Overcome language gap to enable consumers to find technical documents on a topic.
DiaTM uses topic models with dialect-specific versions to bridge the language gap.
Substantial improvement over state-of-the-art topical retrieval.
Further improvements expected from:
  Better handling of consumer vocabulary
  Improved dialect cues
  Document quality consideration

25
THANK YOU!
Contact: s.crain@gatech.edu
Data: https://research.cc.gatech.edu/dmirlab/node/2
26
SOURCES OF LANGUAGE GAPS
Language is Communication
Gaps come from
  Encoding
  Target Audience
27
GAPS FROM ENCODING
Language is Efficient
  Common ideas can be expressed succinctly.
  Uncommon ideas must be composed of simpler ideas.
Consumers and Professionals have different encoding needs.
28
GAPS FROM TARGET AUDIENCE
Clarity: Believed to be understandable
Expedient: Will accomplish goal
Image: Want to sound informal or technical
The greater the effort, the larger the gap!
29
TOPICAL SEARCH
Advantages
  Well-established techniques
  Avoid exhaustive consumer-to-technical translation
  Better recall
Disadvantages
  Imprecise
  Different consumer and technical concepts
  Existing techniques assume limited language gap

30
DIATM DIALECT FEATURES
When adding words to a topic, DiaTM must know which dialect version to add them to.
Word-level features determine the dialect of each word.
We used
  Word frequency statistics in different collections dominated by different dialects.
  Fraction of occurrences in consumer question vs. answer.
Other suggestions
  Word length and complexity
  Presence in dialect-specific lexicon
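The two features listed under "We used" could be computed along the following lines. This is a sketch under assumptions: the function names, the normalization of relative frequencies into per-collection shares, and the toy token lists are illustrative and are not taken from the DiaTM implementation.

```python
from collections import Counter

def collection_shares(collections):
    """For each word, its relative frequency in each collection,
    normalized so the shares across collections sum to 1."""
    rel = {}
    for name, tokens in collections.items():
        counts = Counter(tokens)
        rel[name] = {w: c / len(tokens) for w, c in counts.items()}
    vocab = {w for freqs in rel.values() for w in freqs}
    shares = {}
    for w in vocab:
        per_coll = {name: rel[name].get(w, 0.0) for name in collections}
        norm = sum(per_coll.values())
        shares[w] = {name: f / norm for name, f in per_coll.items()}
    return shares

def question_fraction(question_tokens, answer_tokens):
    """Fraction of each word's occurrences that fall in questions."""
    q, a = Counter(question_tokens), Counter(answer_tokens)
    return {w: q[w] / (q[w] + a[w]) for w in set(q) | set(a)}

# Toy token lists (made up for this sketch).
collections = {
    "consumer": ["twitch", "twitch", "eye"],
    "technical": ["blepharospasm", "eye", "eye"],
}
print(collection_shares(collections)["twitch"])
print(question_fraction(["twitch", "eye"], ["eye", "spasm"]))
```

A word whose share is concentrated in the consumer collection, or that appears mostly in questions, is a candidate for the informal version of a topic.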
31
MECHANICAL TURK
INTER-RATER RELIABILITY
[Chart: Query-Doc Pairs and Judgment (High, Low, Mean) plotted against Maximum Rating]
32
SUPERVISED EVALUATION
Precision