
Example Queries for Federated Search
Jan Odijk
CLARIN Federated Search Workshop
Copenhagen, 24 Apr 2013
Overview
• Linguistic Problem
• Federated Search
• Structural Differences
• Search in Lexicons
• Search in Corpora
• Corpora+Lexicon search
• Iterative Corpus Search
• Search in micro-comparative databases?
Linguistic Problem
• Inflection of attributively used adjectives
• Influence of number, gender, case, definiteness (strong/weak inflection), other factors?
• Exceptions to the main rules, in contexts where the main rules predict the –e suffix as the only option, such as
– het bijvoeglijk(?e) naamwoord, lit. the adjectival noun, ‘the adjective’
– het medisch(*e) onderzoek ‘the medical research’
– de medisch(*e) onderzoeker ‘the medical researcher’
– een competent(e) linguïst
• The exceptions are not (all) arbitrary; there are further subregularities, and I want to find out what these are.
• There are similar phenomena in many languages (Germanic, Romance, e.g. Brazilian Portuguese, Menuzzi 1994, ...)
Linguistic Problem
• Odijk, J. (1992). Uninflected Adjectives in Dutch. In R. Bok-Bennema & R. van Hout, Linguistics in the Netherlands 1992, pp. 197-208. Amsterdam: Benjamins.
• Odijk, J. (2012). De structuur van Phrasal Names. Nederlandse Taalkunde, 17(2), 292-298.
• Odijk, J. (2013). Comparative Linguistic Research in the CLARIN Infrastructure. Presentation to be held at the Patterns of Macro- and Micro-Diversity in the Languages of Europe and the Middle East. Computational Issues in Studying Language Diversity: Storage, Analysis and Inference, Groningen, July 2013.
Broaden Empirical Base
• I want to search
1. For relevant examples in large annotated text corpora
2. For related examples in large text corpora, selected on the basis of the results of (1)
3. For relevant examples in microcomparative databases (e.g. MIMORE)
4. For properties of relevant words in dictionaries
5. For synonyms/hyperonyms/hyponyms in semantic lexical databases (e.g. CORNETTO)
Federated Search
• A set of resources R has been selected by the researcher on the basis of metadata. The resources of R can be in different locations
• A Federated Search Engine (FSE) enables search in the resources in R
• For each resource r in R there is a local search engine LSE_r
• A query q_fs can be formulated in an agreed-upon query language (SRU/CQL), e.g. via a Federated Search web application
• q_fs uses ISOCAT data categories (DCs)
• q_fs is sent to the LSE_r for each r in R, and translated there into the query language needed for LSE_r and into the DCs used in r
• Each LSE_r yields a result set for q_fs, in which it translates the DCs used into ISOCAT DCs, and sends it to the FSE, which combines/aggregates the result sets and prepares them for presentation to / saving by the user, possibly via the Federated Search web application
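The flow described on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the CLARIN implementation: the class `LocalSearchEngine`, the function `federated_search`, the shared category names (`word`, `lemma`, `partOfSpeech`) and the two toy resources are all assumptions made for the example, and only category names (not values) are translated.

```python
from dataclasses import dataclass


@dataclass
class LocalSearchEngine:
    """Toy stand-in for an LSE_r (hypothetical, for illustration only)."""
    name: str
    to_local: dict   # shared DC name -> DC name used inside this resource
    records: list    # toy resource content, expressed in local DC names

    def search(self, query):
        # Translate the shared query q_fs into the resource's own DCs.
        local_query = {self.to_local[dc]: value for dc, value in query.items()}
        hits = [rec for rec in self.records
                if all(rec.get(dc) == value for dc, value in local_query.items())]
        # Translate the hits back into shared DCs before returning them.
        to_shared = {local: shared for shared, local in self.to_local.items()}
        return [{to_shared.get(dc, dc): value for dc, value in hit.items()}
                for hit in hits]


def federated_search(query, engines):
    """Send the query to every LSE and aggregate the results, tagging each hit with its source."""
    return [{**hit, "resource": lse.name}
            for lse in engines
            for hit in lse.search(query)]


cgn_like = LocalSearchEngine(
    "toy-cgn",
    {"word": "w", "lemma": "lem", "partOfSpeech": "pos"},
    [{"w": "is", "lem": "zijn", "pos": "verb"},
     {"w": "man", "lem": "man", "pos": "noun"}],
)
folia_like = LocalSearchEngine(
    "toy-folia",
    {"word": "t", "lemma": "lemma", "partOfSpeech": "pos"},
    [{"t": "was", "lemma": "zijn", "pos": "verb"}],
)

results = federated_search({"lemma": "zijn", "partOfSpeech": "verb"},
                           [cgn_like, folia_like])
for row in results:
    print(row)
```

In this sketch the translation tables live with each local engine, which mirrors the slide's assumption that each LSE_r knows how to map between the shared categories and its own.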
Federated Search
• For many resource formats used in NL there is NOT yet a systematic mapping of their DCs to ISOCAT DCs, e.g. TEI, CGN-format, FoLiA, EAF, GTB, WordNet, CELEX, Praat, …
• A special project should be started up for this
– Nationally for national formats (CGN, FoLiA, …)
– Internationally for generic formats (TEI, CELEX, WordNet, …)
Structural Differences
• There are (sometimes trivial) structural
differences between resources.
• Description of an occurrence of Dutch ‘is’:
CGN:
<pw ref="fn000248.20.4"
w="is"
pos="WW(pv,tgw,ev)"
lem="zijn"
....
pq="man"
/>
VU-DNC/FoLiA:
<w xml:id="BAObi1.s.5.w.18">
<t>is</t>
<lemma class="zijn"/>
<pos class="WW(pv,tgw,ev)">
...
</pos>
<t class="ocroutput">is</t>
</w>
Structural Differences
• Do these two descriptions contain the same or
overlapping information?
• ISOCAT alone will not help because there are
differences in structure
• Will the LSEs deal with such structural differences?
• Or is something more general needed for this
(and is this possible?)
Structural Differences
• The CGN pw element means the same as the FoLiA w element
• The CGN w attribute of pw means the same as the FoLiA t element in w
• The CGN pos attribute of pw means the same as the FoLiA class attribute of element pos (part of speech property name)
• The CGN lem attribute of pw means the same as the FoLiA class attribute of element lemma (lemma property name)
• Values inside the CGN pos attribute of pw are identical to and mean the same as values inside the FoLiA class attribute of element pos in element w (values from the CGN-tagset)
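These correspondences can be made concrete by mapping both descriptions onto one common record. The sketch below uses Python's standard `xml.etree.ElementTree`. The two XML strings are simplified versions of the slide's examples (elided parts and the `pq` attribute are left out, and real FoLiA documents use an XML namespace that is ignored here). The function names `from_cgn` and `from_folia` and the record keys are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified versions of the two descriptions of Dutch 'is'.
CGN = '<pw ref="fn000248.20.4" w="is" pos="WW(pv,tgw,ev)" lem="zijn"/>'
FOLIA = """<w xml:id="BAObi1.s.5.w.18">
  <t>is</t>
  <lemma class="zijn"/>
  <pos class="WW(pv,tgw,ev)"/>
  <t class="ocroutput">is</t>
</w>"""


def from_cgn(xml_text):
    """CGN: all information sits in attributes of the pw element."""
    pw = ET.fromstring(xml_text)
    return {"word": pw.get("w"), "pos": pw.get("pos"), "lemma": pw.get("lem")}


def from_folia(xml_text):
    """FoLiA: the same information sits in child elements of w."""
    w = ET.fromstring(xml_text)
    # Assumption: take the <t> without a class attribute as the word form
    # (the second <t> carries class="ocroutput").
    text = next(t.text for t in w.findall("t") if t.get("class") is None)
    return {
        "word": text,
        "pos": w.find("pos").get("class"),
        "lemma": w.find("lemma").get("class"),
    }


print(from_cgn(CGN))
print(from_folia(FOLIA))
```

Both calls yield the same record here, which illustrates the point of the slide: the category values agree, but a structural mapping (attribute versus child element) is still needed on top of any shared category registry.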
Search in Lexicons
• WFT-GTB: Give me entries with PoS=noun of which the headword ends in “tsje”
• GTB, CELEX, CGN-lexicon: Give me entries with PoS=noun and with the headword ending in “tsje”, together with the source (=GTB, CELEX, or CGN-lexicon) in which it was found.
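A minimal Python sketch of the second query, assuming each lexicon has already been mapped onto entries with shared `headword` and `pos` fields. The entries below are invented placeholders, not actual content of GTB, CELEX or the CGN-lexicon, and the function name `search_lexicons` is hypothetical.

```python
import re

# Toy lexicons with invented placeholder entries (not real lexicon content).
LEXICONS = {
    "GTB": [
        {"headword": "hûntsje", "pos": "noun"},
        {"headword": "wachtsje", "pos": "verb"},
    ],
    "CELEX": [
        {"headword": "huis", "pos": "noun"},
    ],
    "CGN-lexicon": [
        {"headword": "boatsje", "pos": "noun"},
    ],
}

ENDS_IN_TSJE = re.compile(r"tsje$")


def search_lexicons(lexicons, pos, headword_pattern):
    """Return (headword, pos, source) for every matching entry in every lexicon."""
    return [
        (entry["headword"], entry["pos"], source)
        for source, entries in lexicons.items()
        for entry in entries
        if entry["pos"] == pos and headword_pattern.search(entry["headword"])
    ]


hits = search_lexicons(LEXICONS, "noun", ENDS_IN_TSJE)
for hit in hits:
    print(hit)
```

Carrying the source name along with every hit is what distinguishes the multi-lexicon query from the single-lexicon one.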
Search in Lexicons
• Search in all resources where the language=nld
• For each resource with language=nld
  – For each word in [‘zeer’, ‘heel’, ‘erg’] with PoS=adj
    • For each sense of the word
      – For each synonym of the sense
        » For each lemma of the synonym
          • Return word, PoS, sense, synonym, lemma, ‘synonym’, resource.name
• And analogously with ‘synonym’ replaced by ‘immediate hyperonym’
• And analogously with ‘synonym’ replaced by ‘hyponym’ (incl. hyponyms of hyponyms)
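The nested loops above translate fairly directly into code. The following Python sketch runs them over a toy in-memory resource. The data model (senses with `lemmas`, `pos` and relation lists), the sense identifiers, the placeholder lemmas `woord_a` and `woord_b`, and the function names are all assumptions for illustration and do not reflect the actual structure or content of CORNETTO or any other resource.

```python
# Toy resources; all senses and relations are invented for illustration.
RESOURCES = [
    {
        "name": "toy-lexicon",
        "language": "nld",
        "senses": {
            "zeer-1": {"lemmas": ["zeer"], "pos": "adj",
                       "synonym": ["pijnlijk-1"], "hyponym": ["hypo-a"]},
            "pijnlijk-1": {"lemmas": ["pijnlijk"], "pos": "adj"},
            "hypo-a": {"lemmas": ["woord_a"], "pos": "adj", "hyponym": ["hypo-b"]},
            "hypo-b": {"lemmas": ["woord_b"], "pos": "adj"},
        },
    },
    {"name": "toy-english-lexicon", "language": "eng", "senses": {}},
]


def related(resource, sense_id, relation, transitive=False):
    """Senses reachable via `relation`; with transitive=True also follow the relation further."""
    seen = []
    queue = list(resource["senses"][sense_id].get(relation, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if transitive:
            queue.extend(resource["senses"][current].get(relation, []))
    return seen


def query(resources, words, pos, relation, transitive=False):
    rows = []
    for res in resources:
        if res["language"] != "nld":
            continue  # only resources with language=nld
        for word in words:
            for sense_id, sense in res["senses"].items():
                if word in sense["lemmas"] and sense["pos"] == pos:
                    for rel_id in related(res, sense_id, relation, transitive):
                        for lemma in res["senses"][rel_id]["lemmas"]:
                            rows.append((word, pos, sense_id, rel_id,
                                         lemma, relation, res["name"]))
    return rows


synonym_rows = query(RESOURCES, ["zeer", "heel", "erg"], "adj", "synonym")
hyponym_rows = query(RESOURCES, ["zeer", "heel", "erg"], "adj", "hyponym",
                     transitive=True)
print(synonym_rows)
print(hyponym_rows)
```

The `transitive` flag covers the "hyponyms of hyponyms" variant; the "immediate hyperonym" variant would be the same call with a `hyperonym` relation and `transitive=False`.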
Question
• Question: Will federated search somehow smartly ‘know’ (e.g. from the metadata) that it has to search in lexicons only, and in fact only in lexicons that contain synonym information? Or will it waste time and effort by searching in all text corpora and in lexicons that do not have synonym information? Or is a smart choice of resources to search in left to the user?
• Similarly:
• Search in CGN: Give me all utterances that contain the word ‘zeer’ with PoS=ADJ spoken by a speaker with age<=7.
• (There are no speakers with age<=7 in CGN; will federated search smartly be able to see this from the metadata, or will it waste time searching?)
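One possible answer to these questions is a metadata pre-filter that runs before any query is dispatched. The Python sketch below shows the idea only. The metadata fields (`type`, `relations`, `min_speaker_age`), the resource names and the age value are hypothetical and are not taken from actual CLARIN metadata or from CGN.

```python
# Hypothetical resource metadata; field names and values are invented.
RESOURCES = [
    {"name": "toy-text-corpus", "type": "corpus", "relations": []},
    {"name": "toy-spoken-corpus", "type": "corpus", "relations": [],
     "min_speaker_age": 18},
    {"name": "toy-dictionary", "type": "lexicon", "relations": []},
    {"name": "toy-wordnet", "type": "lexicon",
     "relations": ["synonym", "hyperonym", "hyponym"]},
]


def lexicons_with_relation(resources, relation):
    """Keep only lexicons whose metadata says they contain the requested relation."""
    return [r["name"] for r in resources
            if r["type"] == "lexicon" and relation in r["relations"]]


def may_have_speaker_up_to(resource, max_age):
    """False only if the metadata rules out any speaker with age <= max_age."""
    min_age = resource.get("min_speaker_age")
    # Unknown minimum age: the resource cannot be excluded on metadata alone.
    return min_age is None or min_age <= max_age


to_search = lexicons_with_relation(RESOURCES, "synonym")
spoken = RESOURCES[1]
print(to_search)
print(may_have_speaker_up_to(spoken, 7))
```

Whether such filtering is done by the federated search engine or left to the user's choice of resources is exactly the open question raised above; the sketch only shows that the decision needs suitably detailed metadata.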
Search in Corpora
• Search in CGN-corpus, VU-DNC, SONAR:
• Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective with vorm=zonder-e, immediately followed by
– A wordtoken with PoS=noun
• Examples are 'het bijvoeglijk naamwoord', 'de gulden snede', 'het ingewikkelder probleem'
• Alternative: just return the subsequence
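As a rough illustration of this query, the Python sketch below matches a sequence of attribute constraints against adjacent tokens of an utterance. The token representation (dictionaries with `word`, `pos`, `vorm`), the PoS labels and the function names are assumptions for the example; a real corpus search engine would use its own query language.

```python
def matches(token, constraints):
    """A token matches if it has every required attribute value."""
    return all(token.get(attr) == value for attr, value in constraints.items())


def find_subsequences(utterance, pattern):
    """Return every run of adjacent tokens that matches the pattern position by position."""
    n = len(pattern)
    return [
        utterance[i:i + n]
        for i in range(len(utterance) - n + 1)
        if all(matches(tok, con) for tok, con in zip(utterance[i:i + n], pattern))
    ]


PATTERN = [
    {"pos": "definite_determiner"},
    {"pos": "adjective", "vorm": "zonder-e"},
    {"pos": "noun"},
]

# Toy utterance with hand-made annotations.
UTTERANCE = [
    {"word": "dit", "pos": "pronoun"},
    {"word": "is", "pos": "verb"},
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "pos": "adjective", "vorm": "zonder-e"},
    {"word": "naamwoord", "pos": "noun"},
]

hits = find_subsequences(UTTERANCE, PATTERN)
words = [[tok["word"] for tok in hit] for hit in hits]
print(words)
```

Returning the whole utterance instead of the subsequence (the first variant of the query) would only require returning `utterance` whenever `find_subsequences` is non-empty.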
Corpus+Lexicon search
• The same as in the preceding example, but now
• the adjective should not end in two syllables that both contain a schwa (expressed as a regular expression over its phonetic_transcription as found in the CGN-lexicon)
• This excludes an example such as 'het ingewikkelder probleem'
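A regular expression of the kind meant here could look as follows. The sketch assumes a transcription convention in which `@` stands for schwa and `-` separates syllables. Both the convention and the three transcriptions are invented for illustration and are not taken from the CGN-lexicon.

```python
import re

# Assumed convention: '@' = schwa, '-' = syllable boundary.
# Matches when the final two syllables each contain a schwa.
TWO_FINAL_SCHWA_SYLLABLES = re.compile(r"(?:^|-)[^-]*@[^-]*-[^-]*@[^-]*$")

# Invented toy transcriptions (not actual lexicon entries).
TRANSCRIPTIONS = {
    "ingewikkelder": "IN-G@-wI-k@l-d@r",
    "bijvoeglijk": "bEi-vux-l@k",
    "gulden": "GYl-d@n",
}


def ends_in_two_schwa_syllables(transcription):
    return TWO_FINAL_SCHWA_SYLLABLES.search(transcription) is not None


kept = [word for word, trans in TRANSCRIPTIONS.items()
        if not ends_in_two_schwa_syllables(trans)]
print(kept)
```

In a combined corpus+lexicon query, this filter would be applied to the adjective of each corpus hit after looking up its transcription in the lexicon.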
Corpus+Lexicon search
• As before, but each result now also includes a value for an additional attribute, with possible values eFormExists, eFormDoesNotExist, eFormExistenceUnknown.
• For the word with PoS=adjective, the value specifies whether a form with property vorm=met-e exists, does not exist, or whether it is unknown whether such a form exists.
Corpus+Lexicon search
• Let wv be the value of the attribute word of the wordtoken with properties PoS=adjective, vorm=zonder-e. Look up the entry/entries for wv for which PoS=adjective in the CGN-lexicon and/or the CELEX lexicon, and determine its lemma (=wl)
  – if not found: result=eFormExistenceUnknown
  – if found
    • look up in CGN/CELEX an entry with PoS=adjective-code and lemma=wl and vorm=met-e
      – if found: result=eFormExists (e.g. (het) bijvoeglijk (naamwoord))
      – if not found: result=eFormDoesNotExist (e.g. (de) gulden (snede))
• This can be done in one very complicated query, or the queries might be put in a series where the results of the first query are filtered by the second query, etc.
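The lookup procedure can be sketched as a small Python function over a toy lexicon. The entry layout (`form`, `lemma`, `pos`, `vorm`), the function name `e_form_status` and the three toy entries are assumptions for illustration, not the actual CGN or CELEX lexicon format.

```python
# Toy lexicon; the entries are hand-made for illustration.
LEXICON = [
    {"form": "bijvoeglijk", "lemma": "bijvoeglijk", "pos": "adjective", "vorm": "zonder-e"},
    {"form": "bijvoeglijke", "lemma": "bijvoeglijk", "pos": "adjective", "vorm": "met-e"},
    {"form": "gulden", "lemma": "gulden", "pos": "adjective", "vorm": "zonder-e"},
]


def e_form_status(wv, lexicon):
    """Classify the adjective word form wv by whether an inflected (met-e) form is listed."""
    # Step 1: find the lemma(s) wl of the adjective entries for wv.
    lemmas = {entry["lemma"] for entry in lexicon
              if entry["form"] == wv and entry["pos"] == "adjective"}
    if not lemmas:
        return "eFormExistenceUnknown"
    # Step 2: look for an adjective entry with lemma=wl and vorm=met-e.
    has_e_form = any(
        entry["pos"] == "adjective"
        and entry["lemma"] in lemmas
        and entry["vorm"] == "met-e"
        for entry in lexicon
    )
    return "eFormExists" if has_e_form else "eFormDoesNotExist"


for word in ["bijvoeglijk", "gulden", "onbekend"]:
    print(word, e_form_status(word, LEXICON))
```

This corresponds to the "series of queries" option: the function would be applied to the adjective of each corpus hit, rather than folding the lexicon lookup into one large query.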
Iterative Corpus Search
• Each result of the previous query is (or contains) a sequence Det ADJ NOUN
• For each result found in the previous query,
– Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective, with lemma=ADJ.lemma and with vorm=met-e, immediately followed by
– A wordtoken with PoS=noun with number=NOUN.number
• Alternative: just return the subsequences
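The iterative step can be sketched as follows: each earlier result is turned into a new pattern, which is then run over the corpus. As in any such sketch, the token representation, the helper names and the toy data are assumptions made for illustration only.

```python
def matches(token, constraints):
    return all(token.get(attr) == value for attr, value in constraints.items())


def find_subsequences(utterance, pattern):
    n = len(pattern)
    return [utterance[i:i + n]
            for i in range(len(utterance) - n + 1)
            if all(matches(t, c) for t, c in zip(utterance[i:i + n], pattern))]


def follow_up_pattern(result):
    """Build the second-round pattern from one Det ADJ NOUN result of the first round."""
    _det, adj, noun = result
    return [
        {"pos": "definite_determiner"},
        {"pos": "adjective", "lemma": adj["lemma"], "vorm": "met-e"},
        {"pos": "noun", "number": noun["number"]},
    ]


# One toy result of the previous query (hand-made annotations).
previous_results = [[
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "lemma": "bijvoeglijk", "pos": "adjective", "vorm": "zonder-e"},
    {"word": "naamwoord", "lemma": "naamwoord", "pos": "noun", "number": "singular"},
]]

# Toy corpus of two utterances.
corpus = [
    [{"word": "de", "pos": "definite_determiner"},
     {"word": "bijvoeglijke", "lemma": "bijvoeglijk", "pos": "adjective", "vorm": "met-e"},
     {"word": "bepaling", "lemma": "bepaling", "pos": "noun", "number": "singular"}],
    [{"word": "de", "pos": "definite_determiner"},
     {"word": "bijvoeglijke", "lemma": "bijvoeglijk", "pos": "adjective", "vorm": "met-e"},
     {"word": "naamwoorden", "lemma": "naamwoord", "pos": "noun", "number": "plural"}],
]

follow_up = [hit
             for result in previous_results
             for utterance in corpus
             for hit in find_subsequences(utterance, follow_up_pattern(result))]
phrases = [" ".join(tok["word"] for tok in hit) for hit in follow_up]
print(phrases)
```

The second toy utterance is not returned because its noun is plural while the noun of the earlier result is singular, which is the effect of the number=NOUN.number constraint.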
Search in MIMORE
• Using the MIMORE search engine (MIMORE web app)
• Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective with vorm=zonder-e, immediately followed by
– A wordtoken with PoS=noun
• Alternative: just return the subsequences
More Examples
• Odijk, J. (2011), "User Scenario Search", internal
CLARIN-NL document, April 13, 2011. [docx]
• Odijk, J. (2011), "Linguistic Research in the
CLARIN Infrastructure", presentation for the
KNAW eHumanities Workshop, NIAS, Wassenaar,
Mar 29, 2011 [ppt]. Abstract contained in
eHumanities Brainstorm Booklet
• Odijk, J.E.J.M. (2012, October 23). Linguistic
Research and the CLARIN Infrastructure. Utrecht,
Digital Humanities Lecture. [ppt]
Thanks for your attention!