PaNoLa-Joensuu

PaNoLa:
Parsing Nordic Languages
Eckhard Bick
http://beta.visl.sdu.dk
PaNoLa Goals
1. Integrate existing and stimulate new Constraint Grammar research in the Nordic countries
2. Internet-based grammar teaching, applying the VISL model to different Nordic languages
3. Morphologically and syntactically annotated corpus data
Participants
● University of Southern Denmark (Eckhard Bick, Anette Wulff): Danish CG as well as CGs for 6 other languages
● Oslo University (Janne Bondi Johannessen, Kristin Hagen): Bokmål and Nynorsk CGs
● Helsinki University (Fred Karlsson): Finnish and Swedish CGs
● Göteborg University (Torbjörn Lager): µ-TBL system (corpus-trained automatic CG)
● Tartu University (Heli Uibo, Kaili Müürisep): Estonian CG
● Tromsø University (Trond Trosterud): Sami CG
● The Greenlandic Language Secretariat Oqaasileriffik (Per Langgård)
● Iceland University of Education (Jóhanna Karlsdottir)
● University of the Faroe Islands (Zakaris Hansen)
Project framework
● Funding: Nordic Council of Ministers
● Funded project period:
PaNoLa: January 2002 – December 2003: da, no, sv, fi
PaNoLa-addon: 2004: is, fo, smi, kl
PaNoLa-plus: 2005 (- 2006): is, fo, smi, kl
planned: PaNoLa-neighbour: 2005/6 (- 2007): lit, lav, ru
● Historical basis and ongoing cooperation
[Timeline: PaNoLa (da, no, sv, fi); PaNoLa-addon and PaNoLa-plus (is, fo, smi, kl); PaNoLa-neighbour (lit, lav, ru)]
Project framework
● Network aspect: 4 workshops in Denmark, Norway, Iceland and Sweden
Odense, 19-21 May 2002
Ustaoset, 25-27 October 2002
Reykjavik, 1-2 June 2003
Göteborg, 24-25 October 2003
Odense, 23-26 October 2004
Fefor, 11-13 March 2005
(Tallinn, 1-3 April 2005)
planned: Thorshavn, 16-19 September 2005
● Administration, web server, data integration: VISL/ISK, University of Southern Denmark
● Satellite projects: e.g. Arboretum, GREI, Arborest
Constraint Grammar
● Rule- and lexicon-based robust parsing (Karlsson et al. 1995), methodological paradigm
● Shared conceptual and notational conventions, allowing productive research transfer
● Language-dependent differences: lexicon, rules (inter-Scandinavian comparative payoff?)
● Compiler and rule type differences
● Focus differences: tagging? parsing? semantics? teaching? corpus annotation? QA? NER? ...
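To make the rule-based, robust character of the paradigm concrete, here is a minimal Python sketch of one contextual REMOVE rule applied to ambiguous readings. It is an illustration only: the `Cohort` class, the tag names and the rule itself are assumptions made for this example, and it does not reproduce the syntax of any of the compilers or grammars named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """One word form together with its still-possible readings (tags)."""
    word: str
    readings: list = field(default_factory=list)


def remove_if_previous_is(cohorts, target, previous):
    """Toy REMOVE rule: discard reading `target` from a cohort when the
    preceding cohort is unambiguously `previous`.

    The last remaining reading is never removed, so every word keeps at
    least one analysis even when rules are imperfect.
    """
    for i in range(1, len(cohorts)):
        left, current = cohorts[i - 1], cohorts[i]
        if left.readings == [previous] and target in current.readings:
            if len(current.readings) > 1:
                current.readings.remove(target)
    return cohorts


# Hypothetical ambiguous input: "run" could be a noun or a verb.
sentence = [Cohort("the", ["DET"]), Cohort("run", ["N", "V"])]
remove_if_previous_is(sentence, target="V", previous="DET")
print([(c.word, c.readings) for c in sentence])
```

A real grammar chains thousands of such rules; the point of the sketch is only that disambiguation works by removing readings in context rather than by choosing a single most probable tag.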
Rule formalism and architecture
[Diagram: CG compilers and the systems built on them]
Systems: Oslo-Bergen tagger, Swedish CG, Finnish CG, Estonian CG, DanGram, Sami CG, other VISL languages, µ-TBL
Compilers: cg1-compiler, cg2 compiler, cgx compiler, visl-cg compiler
Compiler notes: sets as targets; barrier conditions; “cg2-like” plus substitute operator for correcting hybrid input; Lingsoft-compatible; needs more rules than cg2
Annotation levels: PoS, syntax, case roles (languages shown: sv, fi, est, smi, no, da)
µ-TBL: Swedish or language-independent trained CG; automatic learning, local context, rule ordering
The Lexical Base
[Diagram: lexical resources behind each system]
Systems: Sami CG, Estonian CG, Swedish CG, Finnish CG, Oslo-Bergen tagger, DanGram, µ-TBL
Lexical layers: TWOL; core lexicon + morphological analyser; valency potential (especially for verbs); semantic sets; NER; full semantic prototype lexicon
µ-TBL: corpus dependent
Theoretical Framework (Syntax)
Traditional CG: flat dependency, word-based form and function tags
[Diagram: routes from CG output to treebank formats]
Tools and formats: Cg2tree (MC); dependency filter (SH); visl-psg; PSG grammar; Redwood; treebank format; editing tools; Visl2penn (EB) to PENN format; Visl2tiger (LN, EB, ..) to TIGER format; search interfaces
Corpora: Korpus90/2000, Oslo-Bergen Corpus, Arboretum
Languages: Danish, Norwegian
Treebank data compatibility
[Table: conversion tools between treebank formats]
Formats: CG, CG-dep, VISL, VISLdep, TIGER, TIGER-dep, MALT-dep, DTAGdep
Single tools: cg2dep; depsplicator; cg2visl (visl-psg + grammar); tree2cg; visl2tiger.pl; tiger2dep.pl; visldep2malt; tigerdep2malt
Pipelines:
cg2visl | visl2tiger.pl
cg2visl | visl2tiger.pl | tiger2dep.pl
cg2dep | visldep2malt
visl2tiger.pl | tiger2dep.pl
visl2tiger.pl | tiger2dep.pl | tigerdep2malt
MALT and DTAG: (NTN tools)
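The pipelines in this matrix are chains of single-step converters, so a chain between two formats can be found by searching a small graph. The Python sketch below shows the idea under stated assumptions: the tool names come from the slide, but the source and target format of each tool is read off its name (for example cg2visl as CG to VISL), only a subset of tools is included, and `pipeline` is a hypothetical helper, not part of the VISL tool set.

```python
from collections import deque

# Direct converters keyed by (source format, target format). The edge
# directions are assumptions inferred from the tool names.
CONVERTERS = {
    ("CG", "CG-dep"): "cg2dep",
    ("CG", "VISL"): "cg2visl",
    ("VISL", "TIGER"): "visl2tiger.pl",
    ("TIGER", "TIGER-dep"): "tiger2dep.pl",
    ("TIGER-dep", "MALT-dep"): "tigerdep2malt",
}


def pipeline(source, target):
    """Return the shortest converter chain from source to target as a
    shell-style pipe string, or None when no chain exists."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, tools = queue.popleft()
        if fmt == target:
            return " | ".join(tools)
        for (src, dst), tool in CONVERTERS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, tools + [tool]))
    return None


print(pipeline("CG", "TIGER-dep"))
```

With these assumed edges, the search returns the same three-step chain the slide lists for going from CG to TIGER-dep.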
Accessibility
● Strong focus on making tools and corpora freely accessible on the internet
● Provide notational and complexity filters to bridge differences between different research and teaching traditions
● VISL's open source philosophy for reconciling academic and commercial use: free compilers and corpora, but allowing for the protection (i.e. commercializability) of grammars, lexica and end-user applications
Related applicative CG-projects
● CG spell/grammar checking (No, Da): Lingsoft / Microsoft
● Named Entity Recognition (Da, No): Nomen Nescio (Nordic Network) 2001-2003
● Treebanks (Da Arboretum, Norwegian plans): Nordic Treebank Network 2003-2004
● Question Answering systems (Da): Aminova Dialogue Systems
● Teaching (e.g. VISL-GYM, VISL-HHX, GREI)
PaNoLa's other leg: CALL
Integrating and strengthening Nordic languages
in the VISL grammar teaching system
● A unified system of grammatical categories and structural analysis for 22 languages (Dienhart 2000 and Bick 2001)
● Color codes and symbolic notation
● Systematic focus on form & function
● Preexisting server and programming infrastructure
● School and university teaching contacts at all levels
● Internet-based games and exercises
● Graded complexity filters
Notational harmonization vs. linguistic differences:
The Greenlandic example
KAL22a) Suumuna naasut qorsuttaat kiilorpassuakkaarlugu nunamut uumassuseqanngitsumut siaruartilertaraa apullu aanniariaraangat siaruaatipallatsittarlugu? (What was it that made kilo after kilo of green plant matter burst forth from the lifeless soil as soon as the weather got warm enough and the last remnants of snow were gone?)
QUE:par
CJT:cl
=S:pron Suumuna #'Which/What'
=fA:icl
==Od:g
===D:n naasut #'of the plants'
===H:n qorsuttaat #'their green (matter)'
==P:v-pcp1 kiilorpassuakkaarlugu #'doing it by the kilo'
=A:g
==H:n nunamut #'the ground'
==D:n uumassuseqanngitsumut #'on the lifeless'
=P:v siaruartilertaraa #'makes it spread'
CJT:cl
=fA:cl
==S:n apullu #'and the snow'
CO:conj _lu
-CJT:cl
=-fA:cl
==P:v aanniariaraangat #'whenever it begins to melt'
=P:v siaruaatipallatsittarlugu #'makes it burst forth'
?
==H:n nunamut #'on the ground'
===R:n('nuna') nuna
===D:in('mut',fleksiver) -mut
==D:n uumassuseqanngitsumut
===R:v('uuma') uuma
===D:in('ssusiq') -ssuse
===D:iv('qar') -qa
===D:iv('ngngit') -nngit
===D:in('Tuq') -su
===D:in('mut',fleksiver) -mut
==P:v aanniariaraangat
===R:v('aak') aan
===D:iv('niar') -nia
===D:iv('riar') -riar
===D:iv('gaangat',fleksiver) -aangat
=P:v siaruaatipallatsittarlugu
==R:v('siaruar') siarua
==D:iv('ute') -ati
==D:iv('pallak') -pallat
==D:iv('tit') -sit
==D:iv('Tar') -tar
==D:iv('lugu',fleksiver) -lugu
Greenlandic word-internal tree structures
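The vertical notation used in these trees encodes depth with leading "=" signs, which makes it simple to read into a nested structure. The following Python sketch is one possible way to do that, not the VISL implementation. It assumes that each node line has the shape FUNCTION:form followed optionally by the word, and it skips lines that do not fit (for instance the "-" markers on discontinuous constituents). The `parse` function and the dictionary layout are hypothetical.

```python
import re

# Assumed node shape: leading "=" signs give the depth, then
# FUNCTION:form, optionally followed by the word and a gloss.
NODE = re.compile(r"^(=*)([A-Za-z]+):([A-Za-z0-9-]+)(?:\s+(.*))?$")

SAMPLE = """\
CJT:cl
=S:pron Suumuna
=fA:icl
==Od:g
===D:n naasut
===H:n qorsuttaat
=P:v siaruartilertaraa
"""


def parse(text):
    """Build a nested dict tree from '='-indented lines."""
    root = {"function": "ROOT", "form": None, "word": None, "children": []}
    stack = [(-1, root)]  # pairs of (depth, node)
    for line in text.splitlines():
        match = NODE.match(line.strip())
        if not match:
            continue  # line does not fit the assumed shape
        depth = len(match.group(1))
        node = {"function": match.group(2), "form": match.group(3),
                "word": match.group(4), "children": []}
        # Climb back up until the stack top is a shallower node.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root


tree = parse(SAMPLE)
clause = tree["children"][0]
print([child["function"] for child in clause["children"]])
```

The same depth-counting approach would apply to the word-internal trees, where roots (R) and derivational or inflectional morphemes (D) take the place of words.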
Teaching corpora
Pedagogically structured
● XML markup for teaching topic and didactic progression
● Finnish and Swedish modelled on Danish and Norwegian example files (comparative possibilities)
● Compatibility with and importability for research treebanks (e.g. Sofie)
Language      Sentences   Words   Words per sentence
Danish        1121+       12029   10.1
Bokmål        766         5629    7.3
Nynorsk       766         5888    7.7
Icelandic     212         1394    6.6
Faroese       178         1609    9.0
Sami          155+        603     3.9
Swedish       106         1153    10.9
Finnish       102         545     5.3
Estonian      100+        596     6.0
Greenlandic   100?        ?       ?
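The words-per-sentence column can be rechecked from the other two columns. The short Python sketch below does this with the figures from the slide, treating counts marked "+" as exact (an assumption) and leaving out Greenlandic, whose word count is not given. The helper name `words_per_sentence` is introduced only for this example.

```python
# (language, sentences, words) as given on the slide. Counts marked "+"
# there are lower bounds and are treated as exact here.
CORPORA = [
    ("Danish", 1121, 12029),
    ("Bokmål", 766, 5629),
    ("Nynorsk", 766, 5888),
    ("Icelandic", 212, 1394),
    ("Faroese", 178, 1609),
    ("Sami", 155, 603),
    ("Swedish", 106, 1153),
    ("Finnish", 102, 545),
    ("Estonian", 100, 596),
]


def words_per_sentence(sentences, words):
    """Average sentence length, rounded to one decimal."""
    return round(words / sentences, 1)


for language, sentences, words in CORPORA:
    print(f"{language:<10} {words_per_sentence(sentences, words):>5}")
```

All rows reproduce the slide's figures except Danish, where 12029 / 1121 gives 10.7 rather than 10.1. This may simply reflect the "+" on the Danish sentence count, i.e. more sentences than the 1121 listed.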
Interactive teaching trees
Grammar games: Labyrinth
Grammar Games: Word Fall
Integrating the CG and CALL legs
● Nordic CG expertise is used to provide live analyses as input for the teaching modules, if necessary by CGI communication between university servers, e.g. Oslo-SDU
● Descriptional harmonization issues (e.g. word class)
● Determine matching complexity (e.g. subclause analysis?)
CG leg evaluation
● CG grammars improve incrementally, so evaluation is less definite than for probabilistic systems, and can change over time.
● Results depend on tag granularity and test genre
● Some numbers:
-- DanGram: F-score 98.65 for PoS, 94.9 for function (Bick 2003)
-- DanGram NER: 5% typing errors, 2% chunking errors
-- Bokmål CG: 97.2% lexical F-score (Hagen & Johannessen 2003)
-- Nynorsk CG: 96.2% lexical F-score
-- SWECG 1.0: recall 99.7% at a precision of 95% (pre-PaNoLa)
-- µ-TBL CG for Swedish: 98.1% lexical accuracy when allowing for 1.04 tags per word (Lager 1999)
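The figures above mix F-scores with recall/precision pairs, which makes them hard to compare directly. As a rough aid, the Python sketch below computes the standard F-score (the weighted harmonic mean of precision and recall) and applies it to the SWECG pair. The resulting value is a derived illustration that assumes the balanced F1 measure; it is not a number reported in the source.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    Both inputs use the same scale (here: percent). beta=1.0 gives the
    balanced F1 measure.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# SWECG 1.0 figures from the slide: precision 95%, recall 99.7%.
print(round(f_score(95.0, 99.7), 2))
```

Under that assumption the SWECG pair corresponds to an F1 of roughly 97.3, which is in the same range as the Bokmål and Nynorsk figures, keeping in mind that tag granularity and test genre differ between the evaluations.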
Teaching leg evaluation
● GREI evaluation: improvement of grammatical skills after using VISL tools (104 children in 7th and 8th grade)
● Same-level tests before and after using VISL/GREI, with test and control groups
● Subjective results: all users thought VISL was more fun (games more than trees), and that their grammatical skills had improved
● Objective results: the test group performed 14.5% better than the control group in 7th grade, 7% better in 8th grade, and 12% better at the secondary level.
● Differences were positive for both PoS and sentence analysis, but more marked for the latter
Teaching corpora differences
across PaNoLa languages
● Preposition frequency: 11% (Bokmål), 11.4% (Danish), 13.4% (Nynorsk), 0.5% (Finnish)
● PoS: “klappe i”, “tage på”, “skrive noget om” are tagged as ADV in the Danish samples, as PRP in the Norwegian ones
● Danish infinitive markers ('at') tagged as CONJ in Norwegian
● Subclass solutions: e.g. Da/Fi distinction between adjunct and argument adverbials, not made by No/Se (fA/As/Ao vs. A)
● Tradition interference: the Swedish analysis had zero constituents, because it was annotated according to the English VISL model
Outlook
● Continued development of Nordic Constraint Grammars and CG applications
● Ongoing CALL service for schools
● Presence of the CG paradigm in other Nordic networks
● “Post-PaNoLa”: VISL adaptations for other minor Nordic languages (Faroese, Icelandic, Sami, Estonian ...)