
Linguistic Research and CLARIN
Jan Odijk
NIAS 29 Mar 2011
1
Overview
• Introduction
• Basic Facts & Research Questions
• Do the Research
– Consult Grammars
– Select relevant data from multiple sources
– Apply tools to enrich data
– Analyze the data
• Conclusions
2
Introduction
• Suppose you’re a linguistic researcher in
1980 (no internet, no computers,…)
– And libraries would not exist….
• I am a linguistic researcher in 2011
– But no infrastructure for data and tools exists!
– though there are many data and tools
• CLARIN has as its main goal to remedy this
3
Basic Facts
• Heel, erg, and zeer are synonyms (‘very’)
• Zeer, erg can modify verbs, adjectival
predicates and prepositional predicates
• Heel can only modify adjectival predicates
– A: Hij is daar zeer/erg/heel blij mee
– P: Hij is daar zeer/erg/*heel in zijn nopjes mee
– V: Dat verbaast ons zeer/erg/*heel.
4
Basic Facts
• English very is like heel in these respects:
– P: *He is very in love
– A: He is very amorous
– V: It surprised us very *(much)
5
Basic Facts
• Difference:
– not due to semantics
– Purely syntactic
– As far as we know: does not follow from a general
rule
– So it must be ‘learned’ by a child acquiring Dutch
as first language
6
Research Question (1)
• How does a child acquiring Dutch as a first
language get to ‘know’ that zeer and erg can
modify verbs, prepositional and adjectival
predicates?
7
Hypotheses (1)
• Hypothesis 1a
– Once a word is encountered for the first time, a
critical phase (‘training phase’) starts in which the
word properties will be determined based on input;
after this phase the word properties are fixed.
– A sufficient number of actual examples occurring
in this period sets the word properties (positive
evidence)
8
Hypotheses (1)
• Hypothesis 2a
– Once a word is encountered for the first time, its
grammatical properties are initially set by
Semantic Bootstrapping: D (semcat) -> syncat
– A sufficient number of actual examples occurring
in this period will add to the word properties
(positive evidence)
– Sufficient amount of input that is contradictory to
the semantically bootstrapped properties overrules
them
9
Research Question (2)
• How can a child acquiring Dutch as first
language get to ‘know’ that heel cannot
modify prepositional predicates and verbs?
– Children are never taught that it is not possible;
– They are also never or seldom corrected for
language errors, and if they are, they seem to
ignore it (Negative evidence plays no role)
10
Hypotheses (2)
• Hypothesis 1b
– Absence of relevant constructions in the training
phase of a word leads to absence of the property
(indirect negative evidence)
• Hypothesis 2b
– Absence of relevant constructions in the training
phase of a word does not lead to absence of the
property for semantically bootstrapped properties
11
Related Questions
• Do children ever make errors against this?
• Is a ‘training phase’ for word properties real?
• How ‘long’ is this training phase?
• What is a ‘sufficiently large’ number of actual examples?
• Does semantic bootstrapping play a role, and if so, which one?
• Are these words acquired in different language acquisition stages?
12
Related Questions
• Can this be related to the different modification
potential?
• Is there a relation with the fact that zeer
appears to be rather formal, while heel and
erg are not?
13
Related Questions
• adverb-adjective agreement (substandard):
– heel/hele dikke boeken ‘very thick books’
– erg/erge dikke boeken
– zeer/*zere dikke boeken
– Is this somehow related?
• What about other, closely related, words?
14
Consult Grammars
• Currently
– Consult paper and electronic grammars
• ANS and e-ANS e.g. section 15-3-1-1
• In the near Future
– Consult Taalportaal with (I hope/expect)
• All examples formally marked as such
• All examples parsed/tagged, using ISOCAT DCs and
searchable
• Links to (possibly complex) queries to illustrate with real
data from treebanks and other annotated data
15
Find Data
• Which data and tools (LRs) exist that might
contribute to answering these questions?
• Currently:
– you have to search for them in multiple places
– Many relevant data are not publicly visible (you will
find them only through personal contacts)
– Or you have to create them yourself
16
Find Data
• There is no place/site where you query:
– Give me a list of all LRs for the Dutch language
– What is the size of all Dutch text corpora (in
#tokens)
– Give me a list of all Dutch data that contain
children 2-7 years old as speaker
– Give me a list of all Dutch data containing any of
the words heel, zeer, erg
• Not even in most individual data centres
(TST-Centrale, ELRA, LDC, ..)
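As an illustration only (not an existing CLARIN service), a minimal Python sketch of the kind of faceted metadata query meant above, over an invented in-memory catalogue; the field names and sizes are placeholders, and real CLARIN metadata are CMDI records:

```python
# Hypothetical, simplified metadata records; real CLARIN metadata are CMDI/XML.
# Sizes and age ranges below are placeholders, not actual figures.
catalogue = [
    {"name": "CGN",        "language": "Dutch", "type": "speech corpus",
     "tokens": 9_000_000, "speaker_ages": (18, 80)},
    {"name": "Van Kampen", "language": "Dutch", "type": "acquisition corpus",
     "tokens": 600_000, "speaker_ages": (1, 7)},
]

def resources_for_language(cat, language):
    """Give me a list of all LRs for the given language."""
    return [r["name"] for r in cat if r["language"] == language]

def total_tokens(cat, language):
    """What is the size of all text corpora for a language (in #tokens)?"""
    return sum(r["tokens"] for r in cat if r["language"] == language)

def with_child_speakers(cat, language, lo=2, hi=7):
    """Give me a list of all data for a language with speakers aged lo-hi."""
    return [r["name"] for r in cat
            if r["language"] == language
            and r["speaker_ages"][0] <= hi and lo <= r["speaker_ages"][1]]

print(resources_for_language(catalogue, "Dutch"))
print(total_tokens(catalogue, "Dutch"))
print(with_child_speakers(catalogue, "Dutch"))
```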
17
Find Data
• CLARIN
– Provides a flexible framework incl. tools for
making descriptions of LRs (‘metadata’)
• CMDI
– Supports the creation of metadata for LRs
(assistance, execution, payment)
– Supports making these metadata (and the actual
data) visible and accessible via CLARIN portals
18
Find Data
• CLARIN
– Provides facilities for semantic interoperability
• ISOCAT, Relation Registry (coming soon)
– browsing, searching and querying facilities for the
metadata
• Initial prototype: Virtual Language Observatory
– Will enable you to collect the data that are relevant
to you in a virtual collection
– This will save the researcher a lot of time
– It will enlarge the empirical basis for the research
19
Closely Related Words
• Find words that are closely related
– Adverbs that function as an intensifier (‘booster’)
– Are synonymous or co-hyponyms
• In order to determine their properties and
potential further generalizations
– Using e.g.
• Dutch EuroWordnet (currently via ELRA M0016)
• Or Cornetto (via the Dutch HLT-Agency)
– In CLARIN you will be able to access these
resources without even knowing where they are
20
Closely Related Words
Result should be something like:
• abnormaal afschuwelijk akelig bijster bijzonder
bovenmatig buitengemeen buitensporig danig
donders eminent enorm exceptioneel extra
extraordinair extreem fabelachtig fenomenaal
geweldig gigantisch intens kolossaal merkwaardig
mirakels onbeschrijfelijk ongelofelijk ongehoord
ongekend ongemeen onmenselijk onmetelijk
ontzettend onwijs speciaal uitermate uiterst
uitzonderlijk verdraaid verduiveld verrekte
verschrikkelijk vet zeldzaam …..
21
Basic Facts: Correct?
• Check the basic facts
• Check against occurrences in corpora
– Problem: each of the 3 words is ambiguous!
• Erg (4x) = noun (de) ‘erg’; noun (het) ‘evil’; adj+adv
‘unpleasant’; adv ’very’
• Zeer (3x) = noun ‘pain’; adj ‘painful’; adv ‘very’
• Heel (3x) = adj ‘whole’; verb form ‘heal’; adv ‘very’
– PoS-tagged corpus will help somewhat
• But most corpora do not distinguish adj from adv by
category! (searching for PoS bigrams will help slightly)
– A fully-parsed corpus would be ideal
22
Basic Facts: Correct?
– LASSY Small: 1M-word manually verified parsed corpus
– Very Simple Interface to LASSY Small (1yr)
– Plus: Simple Interface to LASSY Small (1m)
– As (web) applications (Web service soon?)
– Queries: erg::mod:; zeer::mod: ; heel::mod:
– Extract from Statistics:
Modifiee   erg   zeer   heel
ADJ        143    268    263
WW          35     49      9
BW           1      1      7
– Query: heel::mod:ww
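For illustration, a rough Python sketch of the same kind of count over dependency triples; the triple format below is invented and is not the actual LASSY export or query syntax:

```python
from collections import Counter

# Invented minimal representation of dependency triples:
# (modifier lemma, relation, PoS tag of the modified head). Not the LASSY format.
triples = [
    ("heel", "mod", "ADJ"),
    ("heel", "mod", "WW"),
    ("zeer", "mod", "ADJ"),
    ("erg",  "mod", "WW"),
    ("erg",  "mod", "ADJ"),
]

# Count, per modifier, how often it modifies heads of each PoS category.
counts = Counter((mod, head_pos) for mod, rel, head_pos in triples if rel == "mod")

for word in ("erg", "zeer", "heel"):
    for pos in ("ADJ", "WW", "BW"):
        print(f"{word}::mod:{pos.lower()}  {counts[(word, pos)]}")
```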
23
Basic Facts: Correct?
• Analysis
– 8 examples are forms that are ambiguous between an adjectival and a verbal participle
• All are adjectival participles, but LASSY analyzes all participles as verbal
– In 1 example heel modifies the adj open from the expression open staan voor, but it is
wrongly analyzed as modifying the verb staan
• CLARIN will offer facilities to make annotations to such corpora
• Same queries could be done
– for the other related words
– on LASSY Large Corpus (2.4 billion words, automatically parsed)
– In the CGN corpus (but it uses a different interface)
• But this will require facilities for ‘batch jobs’ or more complicated queries
(maybe via web services)
24
Acquisition Corpora:
Search
• E.g. data in the CHILDES system (part of TalkBank)
– 7 corpora for Dutch
– But with their own data formats (CHAT) and tools
(CLAN)
• However, also mirrored at MPI and accessible
via (ANNEX/)TROVA (again another interface)
25
Acquisition Corpora:
Search
• Give records for utterances containing erg with
– Corpus: (e.g. Van Kampen Corpus)
– File: (e.g. laura74.cha)
– Line: (e.g. 139)
– Part Role: (e.g. Child)
– Child Gdr: (e.g. female)
– Age: (e.g. 5;6.12)
– UTT: (e.g. “ja , die s erg moeilijk .”)
• Maybe also some preceding/following context
• Map attribute names and values to ISOCAT
26
Acquisition Corpora:
Search
• Corpus: Van Kampen
• File: sarah21.cha
• Line: 630
• Speaker: Child
• Child Gender: Female
• Age: 2;7.16
• UTT: “prinses e(r)g groot !”
27
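For illustration, a much-simplified Python sketch of extracting such records from a CHAT (.cha) file; a real workflow would use CLAN, TROVA or a proper CHAT parser, and the @ID field layout is only roughly sketched here:

```python
import re

def utterances_with(path, word):
    """Very simplified CHAT (.cha) reader: collect speaker ages from @ID header
    lines (roughly: language|corpus|speaker code|age|sex|...) and return
    (line number, speaker code, age, utterance) for utterances containing the
    given word. Real work should use CLAN or a dedicated CHAT parser."""
    ages = {}
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.startswith("@ID:"):
                fields = line.split("|")
                if len(fields) > 3:
                    ages[fields[2]] = fields[3]        # speaker code -> age
            elif line.startswith("*"):                  # utterance tier, e.g. *CHI:
                speaker, _, utt = line.partition(":")
                speaker = speaker.lstrip("*")
                if re.search(rf"\b{re.escape(word)}\b", utt):
                    hits.append((lineno, speaker, ages.get(speaker, "?"), utt.strip()))
    return hits

# Example usage; the file name comes from the slide above.
for record in utterances_with("sarah21.cha", "erg"):
    print(record)
```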
Acquisition Corpora:
Search
• For each child, give a list of pairs (session, age of
the child)
• For each child and each session, give the #occurrences
of zeer, heel, erg (a rough sketch of such a count follows after this list)
• etc., etc.
• Such queries (some example attempts)
– Mixed metadata/content search
– Over multiple resources
– Specific output formats
• are not so easy with the current interfaces!!
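As announced above, a rough Python sketch of the per-session count; the directory layout is invented, and a real run would also separate child speech from adult speech:

```python
import glob
import re
from collections import Counter

WORDS = ("zeer", "heel", "erg")

# Count occurrences of the three intensifiers per session file
# (invented directory layout "VanKampen/*.cha").
for path in sorted(glob.glob("VanKampen/*.cha")):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("*"):                       # utterance tiers only
                tokens = re.findall(r"[a-zà-ÿ()]+", line.lower())
                counts.update(t for t in tokens if t in WORDS)
    print(path, dict(counts))
```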
28
Acquisition corpora:
Search
• Heel is found 153 times in the Van Kampen
corpus
• Erg is found 77 times in the Van Kampen corpus
– But many are irrelevant uses of erg
• PoS-tagging the corpus might be useful
– Search for POS-bigrams (e.g. erg/adj */adj)
– Add lemmas
• Or even full parsing, at least of the adult
speech
29
Acquisition corpora: Parse
• CLARIN-NL
– Web services are being developed
• For PoS-tagging text
• For full parsing of text
• (and many more)
– To be usable by humanities researchers
– in a user-friendly way in workflow systems
• Usefulness depends on
– Size of the data (effort to select manually)
– Quality of the web services
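Purely as a sketch, calling such a PoS-tagging web service might look roughly like this; the endpoint, payload and response format below are hypothetical and do not describe the actual CLARIN-NL services, which were still under development:

```python
import json
import urllib.request

# Hypothetical endpoint; the real CLARIN(-NL) tagging/parsing web services
# have their own URLs, protocols and authentication.
ENDPOINT = "https://example.org/postag"

def pos_tag(sentence):
    """Send one sentence to the (hypothetical) tagging service and return its JSON reply."""
    data = json.dumps({"text": sentence, "language": "nld"}).encode("utf-8")
    req = urllib.request.Request(ENDPOINT, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

print(pos_tag("prinses erg groot !"))
```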
30
Store the found data
• The found and newly created data
– should be stored in a supported format
– With automatically generated metadata
– With automatically generated provenance data
– Using data categories mapped to or from ISOCAT
– For which PIDs are provided
– Stored on a server of a CLARIN-centre
– So that they
• can become proper resources on their own
• Are visible, accessible and interpretable as part of
enriched publications
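A minimal sketch of the kind of record meant here, with illustrative field names only (not actual CMDI components) and a placeholder PID; the input file name is invented:

```python
import hashlib
import json
from datetime import datetime, timezone

def describe(path, creator, source_query):
    """Assemble a minimal metadata + provenance record for a derived data set.
    Field names are illustrative; real CLARIN metadata use CMDI components and
    ISOCAT data categories, and PIDs are issued by a CLARIN centre."""
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    return {
        "resource": path,
        "pid": "hdl:EXAMPLE/" + digest[:8],      # placeholder, not a real handle
        "creator": creator,
        "created": datetime.now(timezone.utc).isoformat(),
        "provenance": {"derived_by": source_query},
        "checksum_sha1": digest,
    }

print(json.dumps(describe("heel_hits.tsv", "J. Odijk",
                          "heel::mod:ww on LASSY Small"), indent=2))
```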
31
Search in CGN / SONAR
• To assess level of formality
– Give absolute and relative frequencies of
heel/hele/erg/erge/zeer as adj by text genre, and
speaker/participants education level
– In CGN (spoken corpus)
– In SONAR (written corpus)
– Idem but for the word + the following PoS-tag
– Idem but in the fully parsed part of CGN and in
LASSY + the PoS tag of the modifiee head
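For illustration, a small Python sketch of such a relative-frequency count, assuming an invented tab-separated export (one token per row: word, PoS, genre); real CGN/SONAR exports and their genre and education metadata look different:

```python
import csv
from collections import Counter, defaultdict

WORDS = {"heel", "hele", "erg", "erge", "zeer"}

hits = defaultdict(Counter)      # genre -> counts of the target words as ADJ
totals = Counter()               # genre -> total number of tokens

# Invented export format: word <TAB> PoS <TAB> genre, one token per line.
with open("tokens_by_genre.tsv", encoding="utf-8") as f:
    for word, pos, genre in csv.reader(f, delimiter="\t"):
        totals[genre] += 1
        if word.lower() in WORDS and pos == "ADJ":
            hits[genre][word.lower()] += 1

# Absolute and relative frequencies per genre.
for genre in sorted(totals):
    for word, n in sorted(hits[genre].items()):
        print(f"{genre}\t{word}\t{n}\t{n / totals[genre]:.6f}")
```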
32
Interpret the data
• Interpret the data in light of the hypotheses
being investigated
• Apply analytical / statistical tools to the data
– CLARIN should support formats of frequently used
statistical packages such as SPSS, R, etc.
• The research will surely lead to new
questions, and so to new queries
• Reach conclusions and publish an open
access enriched publication
33
Broaden the scope
• Do the same for worden/raken (‘become’/
‘get’)
• NP, PP and AP can be predicate
complements
• Worden and raken take predicate
complements
• They are (almost) synonymous
• worden: takes only NP or AP
• raken: takes only AP or PP
34
Broaden the scope
• AP: Zij werd / raakte zwanger
• PP: Zij *werd / raakte in verwachting
• Zij werd / *raakte burgemeester
And repeat the process
35
Conclusions
• There is no adequate infrastructure for LRs
• There are bits and pieces, but
– Finding LRs is not easy
– LRs have their own formats, data categories, user
/ search interfaces
– Limited formal and no semantic interoperability
– Search in combined LRs very difficult if not
impossible
• The full research potential is not exploited
• CLARIN(-NL) attempts to remedy this
36
CLARIN-NL
Thanks for your attention!
http://www.clarin.nl/
37
No Entry!
38
Basic Facts: Correct?
• De omgang met de buren gebeurt op een heel ontspannen manier en de vrouw van de dominee heeft zelfs al Wolderse vlaai leren bakken . (parse)
– heel:ADJ:mod:WW:ontspannen
• De verschijnselen zijn heel verschillend . (parse)
– heel:ADJ:mod:WW:verschillend
• ,, Op het voorterrein ging het nog heel overtuigend . (parse)
– heel:ADJ:mod:WW:overtuigend
• Ze hebben heel gericht en planmatig volkscafés bezocht om daar hun gif te spuien . (parse)
– heel:ADJ:mod:WW:gericht
• Ze is zelfs met een ' meester ' getrouwd : Marc Dassesse _ mevrouw Spiritus-Dassesse zet heel geëmanicipeerd haar meisjesnaam voorop _ is nu een gerenomeerd fiscaal adviseur en hoogleraar aan de ULB . (parse)
– heel:ADJ:mod:WW:geëmanicipeerd
• Gelukkig krijg ik nog heel geregeld te horen : ' Gerard jongen , dat doe je gewoon foùt ' . (parse)
– heel:ADJ:mod:WW:geregeld
• Dat is een heel verrassend resultaat en het stemt tot optimisme . (parse)
– heel:ADJ:mod:WW:verrassend
• De biermarkt is heel versnipperd en wordt overspoeld door nieuwe productlanceringen . (parse)
– heel:ADJ:mod:WW:versnipperd
• Toch staan we hier heel open voor voorstellen . (parse)
– heel:ADJ:mod:WW:staan
39
Metadata search CGN+CHILDES
• Dutch && 2<age<7
• Regexp content search: heel|zeer|erg|erge|hele
• Resultset export to file
• CGN: regexp ^heel$|^erg$|…
• CGN: regexp on WORDS tier + POS
• Download
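A toy Python sketch of the two main steps pictured above (metadata filter, then regexp content search, then export of the result set); the record structure is invented:

```python
import re

# Regexp from the slide above, anchored to whole words.
PATTERN = re.compile(r"\b(heel|hele|zeer|erge?)\b", re.IGNORECASE)

# Invented stand-ins for records coming out of a combined CGN+CHILDES search.
records = [
    {"language": "Dutch", "age_years": 2.6,  "utterance": "prinses erg groot !"},
    {"language": "Dutch", "age_years": 35.0, "utterance": "dat is heel mooi"},
]

# Step 1: metadata filter (Dutch, speaker aged 2-7).
selected = [r for r in records
            if r["language"] == "Dutch" and 2 <= r["age_years"] < 7]

# Step 2: regexp content search on the utterances.
matches = [r for r in selected if PATTERN.search(r["utterance"])]

# Step 3: export the result set to a file.
with open("resultset.txt", "w", encoding="utf-8") as out:
    for r in matches:
        out.write(f"{r['age_years']}\t{r['utterance']}\n")
```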