
Improving Search through
Corpus Profiling
Bonnie Webber
School of Informatics
University of Edinburgh
Scotland
Original motivation

- PhD research (Michael Kaisser) on using lexical resources (FrameNet, PropBank, VerbNet) to improve performance in QA.
- Developed two methods [Kaisser & Webber, 2007].
- Evaluation on the Web and on the AQUAINT corpus produced significantly different results.
- Other research has also found that the same methods, applied to the same input, produce significantly different results on different corpora.
FrameNet
Example of annotated FrameNet data:
(Screenshots from framenet.icsi.berkeley.edu)
Two QA methods
Method 1
- Use resources to generate templates in which the answer might be found.
- Project templates onto quoted strings used directly as search queries.
Method 2
- Use resources to generate dependency structures in which the answer might occur.
- Search on lexical co-occurrence.
- Filter results by comparing the structure of candidate sentences with the structure of the annotated resource sentences.
Method 1
Example:
“Who purchased YouTube?”
Method 1
Extract a simplified dependency structure from the question using MiniPar:
- head: purchase.v
- head\subj: “Who”
- head\obj: “YouTube”
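A minimal sketch of this question-analysis step, using spaCy as a stand-in for MiniPar (an assumption; the original work used MiniPar dependencies), just to make the head/subject/object extraction concrete:

```python
# Hedged sketch: extract a simplified head/subj/obj structure from a question.
# spaCy stands in for MiniPar here; the labels and API are spaCy's, not MiniPar's.
import spacy

nlp = spacy.load("en_core_web_sm")

def question_structure(question: str) -> dict:
    doc = nlp(question)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    struct = {"head": root.lemma_ + ".v" if root.pos_ == "VERB" else root.lemma_}
    for child in root.children:
        if child.dep_ in ("nsubj", "nsubjpass"):
            struct["subj"] = child.text
        elif child.dep_ == "dobj":
            struct["obj"] = child.text
    return struct

print(question_structure("Who purchased YouTube?"))
# e.g. {'head': 'purchase.v', 'subj': 'Who', 'obj': 'YouTube'}
```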
Method 1
Get annotated sentences from FrameNet for
purchase.v:
The company[FE:Buyer] had purchased[lexical unit] several PDMS terminals[FE:Goods] ...
Method 1
The company[FE:Buyer] had purchased[lexical unit] several PDMS terminals[FE:Goods] ...
Use MiniPar to associate the annotated abstract frame structure with a dependency structure:
Buyer[Subject, NP] VERB Goods[Object, NP]
Method 1
Buyer[Subject, NP] VERB Goods[Object, NP]
head=purchase.V, Subject=“Who”, Object=“YouTube”
Buyer[ANSWER] purchase.V Goods[“YouTube”]
Method 1
Generate potential answer templates:
- ANSWER[NP] purchased YouTube
- ANSWER[NP] (has|have) purchased YouTube
- ANSWER[NP] had purchased YouTube
- YouTube (has|have) been purchased by ANSWER[NP]
- ...

Method 1
Use patterns to generate quoted strings as search queries:
“YouTube has been purchased by”
- Extract sentences from snippets.
- Parse sentences.
- If structures match, extract answer:
  “YouTube has been purchased by Google for $1.65 billion.”
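A minimal sketch of this last step, matching a retrieved sentence against a template and pulling out the ANSWER span (a regex stands in for the parse-and-match step described above; no real search API is shown):

```python
# Hedged sketch: match a snippet sentence against an answer template.
# In the actual method the match is confirmed by parsing, not by regex alone.
import re

def extract_answer(template: str, sentence: str):
    pattern = re.escape(template)
    pattern = pattern.replace(re.escape("ANSWER[NP]"), r"(.+)")
    pattern = pattern.replace(re.escape("(has|have)"), r"(?:has|have)")
    m = re.search(pattern, sentence)
    return m.group(1).strip(" .") if m else None

print(extract_answer("YouTube has been purchased by ANSWER[NP]",
                     "YouTube has been purchased by Google for $1.65 billion."))
# 'Google for $1.65 billion' -- the parse/structure match then trims this to the NP
```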

Method 1 (extended)
The company[FE:Buyer] had purchased[LU] several PDMS terminals[FE:Goods] ...
the landowner[FE:Seller] sold[LU] the land[FE:Goods] to developers[FE:Buyer] ...
Create additional paraphrases using all verbs in the original frame & verbs identified through inter-frame relations:
- ANSWER[NP] bought YouTube
- YouTube was sold to ANSWER[NP]
Method 1 (Web-based evaluation)

Accuracy results on 264 (of 500) TREC 2002
questions whose head verb is not "be":
- FN base: 0.181
- all verbs in frame: 0.204
- inter-frame relations: 0.215

- FN base: 0.181
- PropBank: 0.227
- VerbNet: 0.223
- combined: 0.261
Method 1 (further extension)
FN often gives ‘interesting’ examples rather than common ones. So:
- Assume (as default) that verbs display common patterns:
  - Intransitive: [ARG0] VERB
  - Transitive: [ARG0] VERB [ARG1]
  - Ditransitive: [ARG0] VERB [ARG1] [ARG2]
- If one of these patterns is observed in the question but isn’t among those found in FN, just add it.

- combined: 0.261
- combined+: 0.284
Method 1

- Method 1 and its extensions all lead to clear improvements in QA over the web,
- but they may be losing answers by finding only exact string matches:
  “YouTube was recently purchased by Google for $1.65 billion.”
- Method 2 addresses this.
Method 2
Associates each annotated sentence in FN and PB with a set of dependency paths from the head to each of the frame elements.
“The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”.
- head: “purchase”, path = /i
- ARG0: paths = {./s, ./subj}
- ARG1: paths = {./obj}
- TMP: paths = {./mod}
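To make the representation concrete, here is a minimal sketch of the role-to-paths structure (the dataclass is illustrative, not the system’s actual code; the paths are the MiniPar paths shown above):

```python
# Hedged sketch of Method 2's per-sentence representation: the head, its path,
# and a set of dependency paths from the head to each annotated role.
from dataclasses import dataclass, field

@dataclass
class AnnotatedSentence:
    head: str                                       # frame-evoking verb lemma
    head_path: str                                  # path from sentence root to head
    role_paths: dict = field(default_factory=dict)  # role -> set of paths

propbank_example = AnnotatedSentence(
    head="purchase",
    head_path="/i",
    role_paths={
        "ARG0": {"./s", "./subj"},
        "ARG1": {"./obj"},
        "TMP":  {"./mod"},
    },
)
```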
Method 2
Question analysis: same as Method 1.
- Search based on key words from the question: purchased YouTube (no quotes)
- Sentences are extracted from the returned snippets, e.g.:
  “Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.”
- A dependency parse is produced for each extracted sentence.

Method 2
Eight tests comparing dependency paths:
1a Do the candidate and example sentences share
the same head verb?
1b Do the candidate and example sentences share
the same path to the head?
2a In the candidate sentence, do we find one or
more of the example’s paths to the answer role?
2b In the candidate sentence, do we find all of the
example’s paths to the answer role?
Method 2
3a Can some of the paths for the other roles be
found in the candidate sentence?
3b Can all of the paths for the other roles be found
in the candidate sentence?
4a Do the surface strings of the other roles partially
match those of the question?
4b Do the surface strings of the other roles
completely match those of the question?
Method 2
- Each sentence that passes tests 1a and 2a is assigned a weight of 1. (Otherwise 0.)
- For each of the remaining tests that succeeds, that weight is multiplied by 2.
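A small sketch of this scoring scheme exactly as stated above (the test logic itself is abstracted away; only the weighting is shown):

```python
# Hedged sketch of Method 2 scoring: weight 1 if tests 1a and 2a pass,
# then doubled for every further test that passes.
def score_candidate(results: dict) -> int:
    """results maps test ids ('1a' ... '4b') to booleans."""
    if not (results.get("1a") and results.get("2a")):
        return 0
    weight = 1
    for test in ("1b", "2b", "3a", "3b", "4a", "4b"):
        if results.get(test):
            weight *= 2
    return weight

# The worked example later in the talk (1a, 2a, 2b, 3a, 3b pass) scores 8:
print(score_candidate({"1a": True, "2a": True, "2b": True, "3a": True, "3b": True}))  # 8
```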

Method 2
Annotated frame sentence (from PropBank):
“The Soviet Union[ARG0] has purchased roughly
eight million tons of grain[ARG1] this
month[TMP]”.
Candidate sentence retrieved from the Web:
“Their aim is to compete with YouTube, which
Google recently purchased for more than $1
billion.”
N.B. Object relative clause: an exact string match would fail here.
Method 2
Candidate sentence:
- head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
- phrase: “Google”, paths = {./s, ./subj}
- phrase: “which”, paths = {./obj}
- phrase: “YouTube”, paths = {\i\rel}
- phrase: “for more than $1 billion”, paths = {./mod}
PropBank example sentence:
- head: “purchase”, path = /i
- ARG0: “The Soviet Union”, paths = {./s, ./subj}
- ARG1: “roughly eight million tons of grain”, paths = {./obj}
- TMP: “this month”, paths = {./mod}
Method 2
Candidate sentence:
- head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
- phrase: “Google”, paths = {./s, ./subj}
- phrase: “which”, paths = {./obj}
- phrase: “YouTube”, paths = {\i\rel}
- phrase: “for more than $1 billion”, paths = {./mod}
The results of the tests are:
- 1a: OK   1b: -
- 2a: OK   2b: OK
- 3a: OK   3b: OK
- 4a: -    4b: -
This sentence returns the answer “Google”, to which a score of 8 is
assigned.
Method 2
Candidate sentence:
- head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
- phrase: “Google”, paths = {./s, ./subj}
- phrase: “which”, paths = {./obj}
- phrase: “YouTube”, paths = {../..}
- phrase: “for more than $1 billion”, paths = {./mod}
We get a (partially correct) role assignment:
- ARG0: “Google”, paths = {./s, ./subj}
- ARG1: “which”, paths = {./obj}
- TMP: “for more than $1 billion”, paths = {./mod}
Method 2
Evaluation results for Method 2:
- FrameNet: 0.030
- PropBank: 0.159
PropBank outperforms FrameNet because:
- More lexical entries in PropBank
- More example sentences per entry in PropBank
- FrameNet does not annotate peripheral adjuncts
Evaluation
21% improvement on the 264 non-’be’ TREC 2002
questions, when used on the web.
- Method 1 – FrameNet: 0.181
- Method 1 – PropBank: 0.227
- Method 1 – VerbNet: 0.223
- Method 1 – all resources: 0.261
- Method 2 – PropBank: 0.159
- All methods – PropBank: 0.306
- All methods – all resources: 0.367
Problem
Similar levels of improvement were not found when the exact same methods were applied directly to the AQUAINT corpus.
- Method 1 - FrameNet: 0.181 (Web) vs 0.027 (AQUAINT)
- Method 2 - PropBank: 0.159 (Web) vs 0.023 (AQUAINT)
Not an isolated case
Across 9 different IR models, [Iwayama et al, 2003] found similar differences when posing the same queries to
- a corpus of Japanese patent applications (full text)
- a corpus of Japanese newspaper articles
Scores on the two collections (in the order listed above):
- tf: .0227 / .1054
- idf: .1577 / .2443
- log(tf): .1255 / .2266
- log(tf).idf: .2132 / .2853
- BM25: .2503 / .3346
But they don’t speculate on the reason for these results.
What makes for such differences?

- In Kaisser’s case, the form in which information appears in the corpus may match neither the question nor any form derivable from it via FrameNet, PropBank or VerbNet.
What year was Alaska purchased?
- “On March 30, 1867, U.S. Secretary of State William H. Seward reached agreement with Russia to purchase the territory of Alaska for $7.2 million, a deal roundly ridiculed as Seward's Folly.” (APW20000329.0213)
- “But by 1867, when Secretary of State William H. Seward negotiated the purchase of Alaska from the Russians, sweetheart deals like that weren't available anymore.” (NYT19980915.0275)

Hypothesis
Profiling a corpus and adapting search to its characteristics can improve performance in IR and QA.
- Neither new nor surprising: “Genre, like a range of other non-topical features of documents, has been under-exploited in IR algorithms to date, despite the fact that we know that searchers rely heavily on such features when evaluating and selecting documents” [Freund et al, 2006].
- Also cf. [Argamon et al, 1998; Finn & Kushmerick, 2006; Karlgren 2004; Kessler et al, 1997]

What basis for profiling?
Documents can be characterised in terms of
- genre
- register
- domain
These in turn implicate
- lexical choice
- syntactic choice
- choice of referring expression
- structural choices at the document level
- formatting choices

Definitions
- Genre, register, domain are not completely independent concepts.
- Genre: A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form [Orlikowski & Yates, 1994].
- Register: Generalized stylistic choices due to situational features such as audience and discourse environment [Morato et al., 2003].
- Domain: The knowledge and assumptions held by members of a (professional) community.
Assumptions

- In IR, it seems worth characterizing documents directly as to genre (and possibly register).
- Doing so automatically requires characterising, inter alia, significant linguistic features.
- For QA, further benefits will come from profiling the lexical, syntactic, referential, structural and formatting consequences of genre, register and domain, and exploiting these features directly.
Direct use of genre

- [Freund et al, 2006], [Yeung et al, 2007]
- Analysed the behavior of software engineering consultants looking for documents they need in order to provide technical services to customers using the company’s software product.
- A range of genres identified through both user interviews and analysis of the websites and repositories they used.
Direct use of genre
- Manuals
- Presentations
- Product documents
- Technotes, tips
- Tutorials and labs
- White papers
- Best practices
- Design patterns
- Discussions/forums
- ...
Direct use of genre
- Requires manually labelling each document with its genre or recognizing its genre automatically.
- The latter requires characterising genres in terms of automatically recognizable features.
Best practice: Description of a proven methodology or technique for achieving a desired result, often based on practical experience.
- Form: primarily text, many formats, variable length
- Style: imperatives, “best practice”
- Subject matter: new technologies, design, coding
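As an illustration only (this is not X-Site’s actual classifier), a genre classifier over such automatically recognizable cues might be sketched with off-the-shelf tools; the labels, features and training texts below are invented placeholders:

```python
# Hedged sketch of an automatic genre classifier built from surface lexical/stylistic cues.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Best practice: encrypt passwords before storing them. Based on field experience.",
    "Step 1: download the installer. Step 2: accept the license agreement.",
]
train_labels = ["best_practice", "tutorial"]

genre_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram cues to style and subject matter
    LogisticRegression(max_iter=1000),
)
genre_clf.fit(train_docs, train_labels)
print(genre_clf.predict(["Best practice: always validate user input on the server."]))
```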
Direct use of genre (X-Site)
- Prototype workplace search tool for software engineers currently in use [Yeung et al, 2007].
- Provides access to ~8GB of content crawled from the Internet, intranet and Lotus Notes data.
- Exploits:
  - Task profiles
  - Task-genre associations: known +/- relationships between task and genre pairs
  - Automatic genre classifier

Using genre, register, domain in QA

Answers to Qs can be found anywhere, not just
in documents on the specific topic.
Q: When Did the Titanic Sink?
Twelve months have passed since 193 people died aboard
the Herald of Free Enterprise. But time has not eased the
pain of Evelyn Pinnells, who lost two daughters when the
ferry capsized off Belgium. They were among the victims
when the Herald of Free Enterprise capsized off the Belgian
port of Zeebrugge on March 6, 1987. It was the worst
peacetime disaster involving a British ship since the Titanic
sank in 1912.
Using genre, register, domain in QA
- For this reason, IR for QA differs from general IR, using (instead) passage retrieval, quoted strings, etc.
- For the same reason, one may not want to prefilter documents by genre, register or domain labels (as seems useful for IR).
- Rather, it may be beneficial to exploit features of, and patterns in, the linguistic features that realize genre, register and domain.
- What are those features?

Lexical features

- Register strongly affects word choice:
  - MedLinePlus: “runny nose”
  - PubMed: “rhinitis”, “nasopharyngeal symptoms”
  - Clinical notes: “greenish sputum”
    - UMLS: the informal “greenish” doesn’t appear [Bodenreider & Pakhomov, 2004]
- Domain also affects word choice:
  - “smoltification” occurs ~600 times in a corpus of 1000 papers on salmon, but not at all in AQUAINT [Gabbay & Sutcliffe, 2004].
Lexical features

- Register strongly affects type/token ratios:
  - Only 850 core words (+ inflections) in Basic English, so the type/token ratio is very small.
  - Federalist papers: ~0.36
King James Version: And God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven.
Bible in Basic English: And God said, Let the waters be full of living things, and let birds be in flight over the earth under the arch of heaven.
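For concreteness, the type/token ratio can be computed in a few lines (the tokenization here is a crude lowercase word split, an assumption):

```python
# Hedged sketch: type/token ratio as a simple register cue.
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

kjv = ("And God said, Let the waters bring forth abundantly the moving creature "
       "that hath life, and fowl that may fly above the earth in the open "
       "firmament of heaven.")
bbe = ("And God said, Let the waters be full of living things, and let birds be "
       "in flight over the earth under the arch of heaven.")
print(type_token_ratio(kjv), type_token_ratio(bbe))
# Over large samples, the Basic English version's restricted vocabulary drives
# its ratio down much further than a single verse shows.
```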
Lexical features

- IR4QA using either keywords or quoted strings for passage retrieval could benefit from responding to both types of lexical divergence between question and corpus.
Syntactic Features: Voice

Active:
- The Grid provides an ideal platform for new ontology tools and data bases, ...
- Users log-in using a password which is encrypted using a public key and private key mechanism.
Passive:
- Ontologies are recognized as having a key role in data integration on the computational Grid.
- We store ontology files in hierarchical collections, based on user unique identifiers, ontology identifiers, and ontology version numbers.
Syntactic Features

Passive voice is used significantly more often in
the physical sciences than in the social sciences
[Bonzi, 1990].
                     Active (%)   Passive (%)
Air pollution        65.5         34.5
Infectious disease   66.3         33.7
Ed administration    77.9         22.1
Social Psychology    76.6         23.4
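As a corpus-profiling sketch (an assumption about tooling; Bonzi's counts were not produced this way), the passive rate of a document set could be estimated from dependency labels:

```python
# Hedged sketch: estimate what fraction of sentences contain a passive construction,
# using spaCy's passive-subject / passive-auxiliary dependency labels.
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_rate(sentences: list) -> float:
    passive = sum(
        any(tok.dep_ in ("nsubjpass", "auxpass") for tok in doc)
        for doc in nlp.pipe(sentences)
    )
    return passive / len(sentences) if sentences else 0.0

print(passive_rate([
    "Ontologies are recognized as having a key role in data integration.",
    "The Grid provides an ideal platform for new ontology tools.",
]))  # 0.5
```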
Syntactic Features
- Passives are also used significantly often in surgical reports [Bross et al, 1972] and repair reports.
- For agentive verbs, the missing agent is the surgeon (or surgical team) or the repair person:
  “... the skin was prepared and draped ... Incision was made ... Axillary fat was dissected and bleeding controlled ...”
- But not for non-agentive verbs.

Syntactic Features

DURING NORMAL START CYCLE OF 1A GAS TURBINE,
APPROX 90 SEC AFTER CLUTCH ENGAGEMENT, LOW
LUBE OIL AND FAIL TO ENGAGE ALARM WERE RECEIVED
ON THE ACC. (ALL CONDITIONS WERE NORMAL INITIALLY).
SAC WAS REMOVED AND METAL CHUNKS FOUND IN OIL
PAN. LUBE OIL PUMP WAS REMOVED AND WAS FOUND TO
BE SEIZED. DRIVEN GEAR WAS SHEARED ON PUMP
SHAFT.
CasRep: Maintenance & repair report
Syntactic Features: Clause type
- Main clause
- Relative clause:
  In this case, the user (j.bard) has created a private version of the CARO ontology which is shared with stuart.aitken ...
- Participial clause:
  Users log-in using a password which is encrypted using a public key and private key mechanism.
- Infinitive clause:
  ...
Syntactic Features
- Relative clauses are used significantly more often in the social than the physical sciences [Bonzi, 1990].
- Participial clauses are used significantly more often in the physical than the social sciences [Bonzi, 1990].

                     Rel Clauses (%)   Participial Clauses (%)
Air pollution        20.5              29.6
Infectious disease   25.2              28.1
Ed administration    29.2              18.0
Social Psychology    33.0              18.6
Why might this matter?
- Not all the arguments to verbs in different types of clauses are explicit.
- The same techniques cannot be used to recover them:
  - For relative clauses, syntax (attachment) suffices.
  - For participial (and main) clauses, more general context must be assessed.
- The missing argument could be what answers the question.

Structural features

- Document structure variation across genres is greater than variation within genres:
  - “inverted pyramid” structure of a news article
  - IMRD structure of a scientific article
  - step structure of instructions
  - ingredients list + step structure of recipes
  - SOAP structure of clinical records (and systems structure within the Objective section)
Why might this matter?

- Can suggest where to look for information:
  - In news articles, information that defines terms is more likely to be found near the beginning than the end [Joho and Sanderson 2000].
  - In scientific articles, position isn’t a good indicator for definitional material [Gabbay & Sutcliffe 2004].
Why might this matter?

- If information isn’t in its intended section, one might conclude that it’s false, unnecessary, irrelevant, etc., depending on the function of the section:
  - Chocolate chips absent from the ingredients list in a recipe.
  - No mention of “irregular heart beat” in the report on the CV system in the Objective section of a SOAP note.
Conclusion
- The effectiveness of a given technique for IR, QA, IE can vary significantly, depending on the corpus it’s applied to.
- For search among docs (IR), genre and register are clear factors in user relevance decisions.
- For search within docs (QA, IE), they appear significant as well.
- In particular, search that is sensitive to genre and register may yield better performance than search that isn’t.

References

O Bodenreider, S Pakhomov (2003). Exploring adjectival modification in biomedical
discourse across two genres. ACL Workshop on Biomedical NLP, Sapporo, Japan.

S Argamon, M Koppel, G. Avneri (1998). Routing documents according to style. First
International Workshop on Innovative Information Systems, Boston, MA.

L Freund, C Clarke, E Toms (2006). Towards genre classification for IR in the
workplace. First Symposium on Information Interaction in Context (IIiX).

I Gabbay, R Sutcliffe (2004). A qualitative comparison of scientific and journalistic texts
from the perspective of extracting definitions. ACL Workshop on QA in Restricted
Domains. Sydney.

M Iwayama, A Fujii, N Kando, Y Marukawa (2003). An empirical study on retrieval
models for different document genres: Patents and newspaper articles. SIGIR’03,
251-258.

H Joho, M Sanderson (2000). Retrieving descriptive phrases from large amounts of
free text. Proc. 9th Intl Conference on Information and Knowledge Management
(CIKM), pp 180-186.
References

J Karlgren (1999). Stylistic experiments in information retrieval. In T. Strzalkowski
(Ed.), Natural language information retrieval. Dordrecht, The Netherlands: Kluwer.

M Kaisser, B Webber (2007). Question Answering based on semantic roles,
ACL/EACL Workshop on Deep Linguistic Processing, Prague CZ.

B Kessler, G Nunberg, H Schutze (1997). Automatic detection of text genre.
Proc. 35th Annual Meeting, Association for Computational Linguistics, pp 32-38.

P Yeung, L Freund, C Clarke (2007). X-Site: A Workplace Search Tool for Software
Engineers. SIGIR’07 demo, Amsterdam.
Thank you!