Defence Presentation

Resolving Power of Search Keys
in MedEval, a Swedish Medical Test
Collection with User Groups:
Doctors and Patients
PhD thesis by Karin Friberg Heppin, Göteborgs Universitet
Opponent:
Prof Dr Stefan Schulz
Freiburg University (Germany)
Background
• More than fifty years of research in Information Retrieval (IR)
• Importance of IR as a key technology for dealing with large
amounts of information in the era of the Internet
• Most IR research is done on English content and standard texts
(mostly newswire)
• Specific issues in
• Other languages: Swedish
• Sublanguages / language registers: Medicine
Example
• Single-word compounds are common in Swedish texts (10%)
• Compare two search expressions:
(1) “narkotikapolitik”
(2) “fotboll”
• What happens if used in a trivial IR setting?
• Does a search for (1) retrieve enough relevant documents?
• What if single-word compounds are decomposed and their components
used as search keys?
• Does a search for “fot” AND “boll” still yield relevant documents?
• Or does it only add noise?
Focus of thesis
• Resource production
• a medical test document collection in Swedish
• IR research questions
• What are the characteristics of good search keys in general?
• Can professional language characteristics be used for optimizing
target-group specific searches?
• Are compounds good search keys or is it better to use their
constituents as search keys?
Organization of thesis
IR background: exhaustive review of the state of the art:
models, evaluation, linguistics, medical IR
Test environment: tools, resources, creation of MedEval test collection
Pilot studies: investigation of the behavior of terms and groups of terms;
analysis of the patient and doctor documents
Literature / Appendix
259 pp.
Main hypothesis
The resolving power of search keys is dependent on their
frequency in the document collection
This should guide the decision whether to use decomposition of
single-word compounds
How is the hypothesis being validated?
• Creation of a medical test collection
• Running IR experiments
• pilot study
• manual inspection and error analysis
Creation of a Test Collection
• Subset of MedLex with medical texts from different sources
totalling 42000 documents
• Two indexes
• original tokens (e.g. “saltkoncentration”)
• tokens and constituents of compounds (e.g.
“saltkoncentration”, “salt”, “koncentration”)
• Document processing: tokenization, lemmatization,
non-lexical decomposition
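The two indexes can be illustrated with a minimal sketch. It is not the MedEval pipeline: the decomposition table `COMPOUND_PARTS`, the toy documents, and the helpers `build_index` and `search_and` are all hypothetical, and the thesis uses an automatic non-lexical splitter plus lemmatization rather than a lookup table.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for an automatic compound splitter.
COMPOUND_PARTS = {
    "saltkoncentration": ["salt", "koncentration"],
    "fotboll": ["fot", "boll"],
}

def build_index(docs, decompose=False):
    """Map each token to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if decompose:
                # Also index the constituents of known compounds.
                for part in COMPOUND_PARTS.get(token, []):
                    index[part].add(doc_id)
    return dict(index)

def search_and(index, terms):
    """Boolean AND: ids of documents containing every search key."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Toy documents (invented for illustration only).
docs = {
    1: "hög saltkoncentration i blodet",
    2: "salt i maten",
    3: "fotboll och skador",
    4: "boll träffade patientens fot",
}

plain_index = build_index(docs)                  # original tokens only
split_index = build_index(docs, decompose=True)  # tokens plus constituents

print(sorted(search_and(plain_index, ["fotboll"])))      # [3]
print(sorted(search_and(split_index, ["fot", "boll"])))  # [3, 4]
```

In this toy setting, searching the constituents in the decomposed index still finds the compound document (3) but also pulls in document 4, which is the kind of noise the example slide asks about.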
Tools used
• Indri search engine:
• inference network approach
• produces ranked output
• complex query syntax (several proximity parameters, individual
weighting, Boolean AND)
• trec_eval: TREC evaluation toolkit
• Query performance analyzer
Example Query Performance Analyzer
Topic collection and relevance
assessment
• 62 topics were acquired by medical students
• relevance assessment done by pooling (suboptimal, but the only
feasible strategy with the given resources):
• interactive searching and judging
(four runs * pool depth 100)
• four grades of relevance judgements
• judgements of target readers: patients vs. physicians vs. both
• adjusted relevance scores
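Pooling can be sketched as follows, assuming each run is simply a ranked list of document ids. The function `build_pool` and the toy runs are hypothetical; the slide only states that four runs were pooled to depth 100.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents of each ranked run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

# Four toy runs for one topic (invented ids), pooled to depth 2
# instead of 100 to keep the example small.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d1"],
    ["d5", "d1", "d6"],
    ["d2", "d3", "d7"],
]

pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd2', 'd3', 'd4', 'd5']
```

Only documents in the pool are judged. Anything outside it (here d6 and d7) stays unjudged, which is why pooling is a compromise.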
Six different scenarios
Creating baseline queries
• Division of the terms of a query into facets (conceptual
aspects) in order to assess the impact of query components
• Using words of the topics + Swedish MeSH synonyms
• Example of a facet:
TREATMENT = #syn(behandla behandling strategi
behandlingsstrategi behandlingsmetod behandlingsalternativ
tillvägagångssätt genomföra)
• Parameter for assessment: Normalized discounted
cumulative gain (nDCG)
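A facet-based baseline query could be assembled as sketched below. `#syn` appears on the slide and `#combine` is a standard Indri operator, but combining the facets with `#combine` is an assumption here, as are the helper names `syn` and `build_query` and the shortened term lists.

```python
def syn(terms):
    """Wrap the terms of one facet in an Indri #syn operator."""
    return "#syn(" + " ".join(terms) + ")"

def build_query(facets, operator="#combine"):
    """Join one #syn group per facet under a single Indri operator.

    `operator` is an assumption; the slides do not say how the
    facets of a baseline query were combined.
    """
    return operator + "(" + " ".join(syn(t) for t in facets.values()) + ")"

# Shortened facets for illustration (terms taken from the slides).
facets = {
    "TREATMENT": ["behandla", "behandling", "behandlingsmetod"],
    "CANCER": ["cancer", "bröstcancer"],
}

print(build_query(facets))
# #combine(#syn(behandla behandling behandlingsmetod) #syn(cancer bröstcancer))
```

Keeping facets as separate entries makes it easy to drop one facet and rebuild the query when its contribution is to be assessed.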
Analysis of the contribution of
• words
• word fragments
• facets
to the query performance (resolving power)
measured in nDCG (normalized discounted
cumulated gain)
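The measurement can be sketched as below. This uses the common log2(rank + 1) discount, which may differ from the nDCG variant used in the thesis, and it reads "resolving power" as the nDCG lost when a facet is removed, which is an interpretation of the slide. The judgements and rankings are invented.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranking, judgements):
    """nDCG of a ranked list of doc ids, given graded gains per doc id."""
    gains = [judgements.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(judgements.values(), reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Four relevance grades (0-3), as in the test collection; toy values.
judgements = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}

# Hypothetical rankings returned for the full query and for the same
# query with one facet removed.
full_ranking = ["d1", "d2", "d4", "d3"]
reduced_ranking = ["d3", "d1", "d4", "d2"]

baseline = ndcg(full_ranking, judgements)
reduced = ndcg(reduced_ranking, judgements)
print(f"nDCG full query:     {baseline:.3f}")
print(f"nDCG facet removed:  {reduced:.3f}")
print(f"drop (contribution): {baseline - reduced:.3f}")
```

A large drop suggests the removed facet contributes strongly to the query. A drop near zero, or a gain, suggests the facet adds little or mainly retrieves noise.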
Test of the suitability of single terms
[Figure; axis label: nDCG]
Measuring resolving power by removing facets
[Figure; axis label: nDCG; annotation: "retrieves noise"]
Recall vs. noise
Quality of search keys (dependent on topic)
[Figure; labels: ineffective keys, effective keys]
Conclusions
• Ineffective search keys are more likely to be found among
terms with very high and very low frequency (statistically
significant), but the effect is not very strong and there
are important exceptions
• Low frequency compounds can benefit from decomposition
• Only split compounds if the constituents have greater
resolving power than the compound
• If the compound has a head-modifier structure, use only
the head, as the modifier is expected to have low
resolving power
• No clear message on how professional language
characteristics can be used for optimizing target-group
specific searches
Questions to the candidate
Question 1
It is interesting that early IR work you cited referred not to the
parameter pair precision/recall, but specificity/sensitivity.
Definitions (rel = relevant, nrel = non-relevant):
sensitivity = recall = rel docs found / rel docs
specificity = nrel docs not found / nrel docs
precision = rel docs found / found docs
Is there a reason why the mainstream IR research abandoned
specificity? Is precision really an unproblematic parameter?
Do you know recent work that reintroduced specificity in IR?
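To make the three measures concrete, here is a small sketch that treats specificity as the share of non-relevant documents that are not retrieved. The function `retrieval_metrics` and the counts for relevant and retrieved documents are invented; only the collection size of 42000 comes from the slides.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Recall (sensitivity), precision and specificity for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)         # relevant and found
    fp = len(retrieved - relevant)         # non-relevant but found
    fn = len(relevant - retrieved)         # relevant but missed
    tn = collection_size - tp - fp - fn    # non-relevant and not found
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Invented example: 50 relevant documents, 100 retrieved, 40 in common.
relevant = set(range(50))
retrieved = set(range(10, 110))
metrics = retrieval_metrics(retrieved, relevant, collection_size=42000)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these numbers recall is 0.8 and precision 0.4, while specificity stays close to 1 because the non-relevant documents vastly outnumber the retrieved ones. That behaviour on large collections is worth keeping in mind when discussing the question.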
Question 2
The F-measure allows different weights to be assigned to precision
and recall. This is important when different user scenarios are
to be studied. There are scenarios in which recall is more
important and the user accepts noise because no relevant
document may be missed. Would it be possible to express
these user scenarios using nDCG?
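The weighting mentioned here is usually written as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 favours recall and beta < 1 favours precision. A minimal sketch, with the function name `f_measure` and the example values chosen for illustration:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

precision, recall = 0.4, 0.8  # invented values for one query
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_measure(precision, recall, beta):.3f}")
```

For a recall-oriented scenario one would pick beta > 1, so that a run with high recall and moderate precision scores higher than under the balanced F1.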
Question 3
You used normalized discounted cumulative gain (nDCG) for
measuring the resolving power of search expressions.
Why did you choose this parameter (and not the widely used F-measure)?
Question 4
You describe a Swedish stemmer that produces a 15 percent
increase in precision. Stemmers are supposed to increase
recall. How can they increase precision?
Question 5
I am surprised that a lexicon-free compound splitter, guided by
indicative consonant sequences, yielded relatively good results.
Do these results also hold for neoclassical compounds, which
are typical for medicine (e.g. “cerebrovaskulär”)?
Could the performance have been improved if you had
combined it at least with a basic lexicon of medical terms?
Question 6
Could you explain a bit more the difference between a
synonym set and a facet? If you include hyponyms, what does
this mean for compounds? Does this mean that a facet for
cancer would look like this:
CANCER = (cancer bröstcancer lungcancer pankreascancer
kolorektalcancer tjocktarmscancer magsäckscancer …)
Question 7
It seems that the conclusions regarding the two language registers
studied are less clear than those of the analysis of the
effectiveness of compound decomposition.
What had you originally expected from the distinction between
registers and what could still be done to achieve the expected
result?
Question 8
The conclusions of your “pilot” studies are based on a careful
dissection of the elements of topic descriptions in order to pick
out characteristic features of the terms and assess their usefulness.
Nevertheless, what you found is still hypothetical.
How could an empirical study be devised that provides stronger
evidence of the validity of your conclusions?
Question 9
On page 34 you address reliability: why does reliability matter
and how can you measure it? Has reliability been assessed in the
construction of your test collection?
Question 10
Your analysis of multiword units seems a bit off-topic. Could
you explain the rationale for this investigation and why this
matters for the discussion of language registers?
Question 11
Question 12
Your conclusion that single-word compounds should be treated
in a differentiated way (according to semantic and statistical
criteria) seems not so original because this is already obvious
when observing the behaviour of Web search engines.
Which related work do you know of that addressed these issues,
what are its conclusions, and what is the specific contribution
of your own research?
Question 13
The test collection you have built is certainly a valuable
resource for further research. What kind of research can you
imagine, or would you like to see, using your test collection?