Resources for multilingual processing
Georgiana Puşcaşu
University of Wolverhampton, UK
Outline

- Motivation and goals
- NLP Methods, Resources and Applications
  - Text Segmentation
  - Part of Speech Tagging
  - Stemming
  - Lemmatization
  - Syntactic Parsing
  - Named Entity Recognition
  - Term Extraction and Terminology Data Management Tools
  - Text Summarization
  - Language Identification
  - Statistical Language Modeling Toolkits
  - Corpora
- Conclusions
Motivation and goals

Motivation
- Most NLP research and resources deal with English
- The Web is multilingual, and ideally the current NLP state of the art should be attained for all languages

Goals
- To present available text-processing methods that can support multilingual NLP
- To offer an inventory of existing tools and resources that can be exploited in order to avoid reinventing the wheel
Text Segmentation

- Electronic text is essentially just a sequence of characters
- Before any real processing, text needs to be segmented
- Text segmentation involves:
  - Low-level text segmentation (performed at the initial stages of text processing)
    - Tokenization
    - Sentence splitting
  - High-level text segmentation
    - Intra-sentential: segmentation of linguistic groups such as Named Entities or Noun Phrases, and splitting sentences into clauses
    - Inter-sentential: grouping sentences and paragraphs into discourse topics
Tokenization

- Tokenization is the process of segmenting text into linguistic units such as words, punctuation, numbers, alphanumerics, etc.
- It is normally the first step in the majority of text processing applications
- Tokenization in languages that are:
  - segmented: considered a relatively easy and uninteresting part of text processing (words are delimited by blank spaces and punctuation)
  - non-segmented: more challenging (no explicit boundaries between words)
Tokenization in segmented languages

- Segmented languages: all modern languages that use a Latin-, Cyrillic- or Greek-based writing system
- Traditionally, tokenization rules are written using regular expressions (see the sketch after this list)
- Problems:
  - Abbreviations: solved by lists of abbreviations (pre-compiled or automatically extracted from a corpus) and guessing rules
  - Hyphenated words: "one word or two?"
  - Numerical and special expressions (email addresses, URLs, telephone numbers, etc.): handled by specialized tokenizers (preprocessors)
  - Apostrophes (they're => they + 're; don't => do + n't): solved by language-specific rules
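As a minimal sketch of the regular-expression approach (in Python, not from the original slides), the tokenizer below handles a few of the problem cases listed above; the patterns and the tiny abbreviation list are illustrative assumptions, not a complete rule set:

    import re

    # Ordered alternatives: specialized patterns (URLs, email addresses,
    # abbreviations) are tried before the generic word rule.
    TOKEN_RE = re.compile(r"""
        https?://\w[\w./-]*       # URLs, handled by a "preprocessor" rule
      | [\w.+-]+@[\w-]+\.[\w.]+   # email addresses
      | (?:Mr|Mrs|Dr|etc)\.       # a (tiny) abbreviation list
      | \d+(?:[.,]\d+)*           # numbers, including decimals like 3.14
      | \w+?(?=n't)               # verb stem before the clitic n't
      | n't | '(?:re|ll|ve|s)     # clitics split off by apostrophe rules
      | \w+(?:-\w+)*              # words, kept whole when hyphenated
      | [^\w\s]                   # any remaining punctuation character
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Dr. Smith doesn't like http://example.com, does he?"))
    # ['Dr.', 'Smith', 'does', "n't", 'like', 'http://example.com',
    #  ',', 'does', 'he', '?']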
Tokenization in non-segmented languages

- Non-segmented languages: Oriental languages
- Problems:
  - tokens are written directly adjacent to each other
  - almost all characters can be one-character words by themselves, but can also form multi-character words
- Solutions (see the sketch below):
  - pre-existing lexico-grammatical knowledge
  - machine learning employed to extract segmentation regularities from pre-segmented data
  - statistical methods: character n-grams
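A common lexicon-driven solution is greedy "maximum matching": at each position, take the longest dictionary word that matches, falling back to a single character. This Python sketch is illustrative only; the toy lexicon stands in for real lexico-grammatical knowledge:

    LEXICON = {"中国", "人民", "银行"}
    MAX_WORD_LEN = 4

    def max_match(text):
        tokens, i = [], 0
        while i < len(text):
            # Try the longest candidate first, down to a single character.
            for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
                if text[i:j] in LEXICON or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens

    print(max_match("中国人民银行"))  # ['中国', '人民', '银行']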
Tokenizers (1)

ALEMBIC
Author(s): M. Vilain, J. Aberdeen, D. Day, J. Burger, The MITRE Corporation
Purpose: Alembic is a multilingual text processing system. Among other tools, it incorporates tokenizers for English, Spanish, Japanese, Chinese, French and Thai.
Access: Free by contacting day@mitre.org

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Ellogon is a multilingual, cross-platform, general-purpose language engineering environment. One of the provided components, adaptable to various languages, performs tokenization. Supported languages: Unicode.
Access: Free at http://www.ellogon.org/

GATE (General Architecture for Text Engineering)
Author(s): NLP Group, University of Sheffield, UK
Access: Free but requires registration at http://gate.ac.uk/

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/
Tokenizers (2)
LT TTT
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: LT TTT is a text tokenization system and toolset which enables users to
produce a swift and individually-tailored tokenisation of text.
Access: Free at http://www.ltg.ed.ac.uk/software/ttt/
MXTERMINATOR
Author(s): Adwait Ratnaparkhi
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html
QTOKEN
Author(s): Oliver Mason, Birmingham University, UK
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtoken.html
SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: SProUT provides tokenization for Unicode, Spanish, Japanese, German,
French, English, Chinese.
Access: Not free. More information at http://sprout.dfki.de/
Tokenizers (3)
THE QUIPU GROK LIBRARY
Author(s): Gann Bierner and Jason Baldridge, University of Edinburgh, UK
Access: Free at https://sourceforge.net/project/showfiles.php?group_id=4083
TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.
Access: Not free. More information at http://www.lingsoft.fi/
Sentence splitting

- Sentence splitting is the task of segmenting text into sentences
- In the majority of cases it is a simple task: . ? ! usually signal a sentence boundary
- However, when a period denotes a decimal point or is part of an abbreviation, it does not always signal a sentence break
- The simplest algorithm is known as 'period-space-capital letter' (not very good performance). It can be improved with lists of abbreviations, a lexicon of frequent sentence-initial words and/or machine learning techniques (see the sketch below)
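A Python sketch of the 'period-space-capital letter' heuristic, improved with a small abbreviation list (both the list and the patterns are illustrative assumptions):

    import re

    ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g.", "i.e."}

    def split_sentences(text):
        sentences, start = [], 0
        # Candidate boundary: ., ? or ! followed by space and a capital.
        for m in re.finditer(r"[.?!](?=\s+[A-Z])", text):
            end = m.end()
            last_word = text[:end].split()[-1]
            if last_word in ABBREVIATIONS:
                continue  # a period ending an abbreviation, not a sentence
            sentences.append(text[start:end].strip())
            start = end
        if start < len(text):
            sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Dr. Smith arrived at 3.30 p.m. today. He left."))
    # ['Dr. Smith arrived at 3.30 p.m. today.', 'He left.']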
Part of Speech (POS) Tagging

- POS tagging is the process of assigning a part-of-speech or lexical class marker to each word in a corpus (Jurafsky and Martin)
- Example, with the tagset {DET, N, V, P}:
  The/DET couple/N spent/V the/DET honeymoon/N on/P a/DET yacht/N
POS Tagger Prerequisites

- A lexicon of words
- For each word in the lexicon, information about all its possible tags according to a chosen tagset
- Different methods for choosing the correct tag for a word:
  - Rule-based methods
  - Statistical methods
  - Transformation Based Learning (TBL) methods
POS Tagger Prerequisites:
Lexicon of words

- Classes of words
  - Closed classes: a fixed set
    - Prepositions: in, by, at, of, …
    - Pronouns: I, you, he, her, them, …
    - Particles: on, off, …
    - Determiners: the, a, an, …
    - Conjunctions: or, and, but, …
    - Auxiliary verbs: can, may, should, …
    - Numerals: one, two, three, …
  - Open classes: new members can be created all the time, therefore it is not possible for all words from these classes to appear in the lexicon
    - Nouns
    - Verbs
    - Adjectives
    - Adverbs
POS Tagger Prerequisites
Tagsets

- To do POS tagging, one needs to choose a standard set of tags to work with
- A tagset is normally sophisticated and linguistically well grounded
- One could pick a very coarse tagset: N, V, Adj, Adv
- A more commonly used set is finer grained: the "UPenn TreeBank tagset" (48 tags)
- Even more fine-grained tagsets exist
POS Tagger Prerequisites
Tagset example – UPenn tagset

 1  CC    Coordinating conjunction
 2  CD    Cardinal number
 3  DT    Determiner
 4  EX    Existential there
 5  FW    Foreign word
 6  IN    Preposition/subord. conjunction
 7  JJ    Adjective
 8  JJR   Adjective, comparative
 9  JJS   Adjective, superlative
10  LS    List item marker
11  MD    Modal
12  NN    Noun, singular or mass
13  NNS   Noun, plural
14  NNP   Proper noun, singular
15  NNPS  Proper noun, plural
16  PDT   Predeterminer
17  POS   Possessive ending
18  PRP   Personal pronoun
19  PRP$  Possessive pronoun
20  RB    Adverb
21  RBR   Adverb, comparative
22  RBS   Adverb, superlative
23  RP    Particle
24  SYM   Symbol (mathematical or scientific)
25  TO    to
26  UH    Interjection
27  VB    Verb, base form
28  VBD   Verb, past tense
29  VBG   Verb, gerund/present participle
30  VBN   Verb, past participle
31  VBP   Verb, non-3rd ps. sing. present
32  VBZ   Verb, 3rd ps. sing. present
33  WDT   wh-determiner
34  WP    wh-pronoun
35  WP$   Possessive wh-pronoun
36  WRB   wh-adverb
37  #     Pound sign
38  $     Dollar sign
39  .     Sentence-final punctuation
40  ,     Comma
41  :     Colon, semi-colon
42  (     Left bracket character
43  )     Right bracket character
44  "     Straight double quote
45  `     Left open single quote
46  ``    Left open double quote
47  '     Right close single quote
48  ''    Right close double quote
POS Tagging
Rule-based methods

- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags
- Leave the correct tag for each word
POS Tagging
Statistical methods (1)

The Most Frequent Tag Algorithm (see the sketch below)
- Training
  - Take a tagged corpus
  - Create a dictionary containing every word in the corpus together with all its possible tags
  - Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities
- Tagging
  - Given a new sentence, for each word, pick the most frequent tag for that word in the corpus
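A Python sketch of the Most Frequent Tag baseline; the toy training corpus is made up for illustration:

    from collections import Counter, defaultdict

    training = [("the", "DT"), ("race", "NN"), ("is", "VBZ"),
                ("to", "TO"), ("race", "VB"), ("the", "DT")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1          # count(word, tag)

    def most_frequent_tag(word):
        # argmax over tags of P(tag|word) = count(word, tag) / count(word)
        return counts[word].most_common(1)[0][0] if word in counts else "NN"

    print([most_frequent_tag(w) for w in ["the", "race", "is"]])
    # ['DT', 'NN', 'VBZ']  ('race' ties NN/VB; Counter keeps first seen)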
POS Tagging
Statistical methods (2)

Bigram HMM Tagger
- Training
  - Create a dictionary containing every word in the corpus together with all its possible tags
  - Compute the probability of each tag generating a certain word, and the probability of each tag being preceded by a specific tag (Bigram HMM Tagger => the probability depends only on the previous tag)
- Tagging
  - Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained after training
  - HMM taggers choose the tag sequence that maximizes this formula: P(word|tag) * P(tag|previous tag)
Bigram HMM Tagging: Example

People/NNS are/VBZ expected/VBN to/TO queue/VB at/IN the/DT registry/NNS
The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN

- How should "queue" be tagged in to/TO queue/??? and the/DT queue/??? ?
- t_k = argmax_k P(t_k|t_{k-1}) * P(w_i|t_k)
  (i = the position of the word in the sequence, k = the index among the possible tags for the word "queue")
- How do we compute P(t_k|t_{k-1})? count(t_{k-1} t_k) / count(t_{k-1})
- How do we compute P(w_i|t_k)? count(w_i t_k) / count(t_k)
- For to/TO queue/???: max[P(VB|TO)*P(queue|VB), P(NN|TO)*P(queue|NN)]
- From the corpus:
  P(NN|TO) = 0.021 * P(queue|NN) = 0.00041 => 0.000007
  P(VB|TO) = 0.34  * P(queue|VB) = 0.00003 => 0.00001
  so "queue" after to/TO is tagged VB (see the sketch below)
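The disambiguation step above, as a one-word Python sketch: choose the tag for "queue" after to/TO by maximizing P(tag|previous tag) * P(word|tag). The probabilities are the ones quoted on the slide:

    p_tag_given_prev = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
    p_word_given_tag = {("queue", "VB"): 0.00003, ("queue", "NN"): 0.00041}

    def best_tag(word, prev_tag, candidates):
        return max(candidates,
                   key=lambda t: p_tag_given_prev[(prev_tag, t)]
                                 * p_word_given_tag[(word, t)])

    print(best_tag("queue", "TO", ["VB", "NN"]))
    # VB  (0.34*0.00003 = 0.0000102 beats 0.021*0.00041 = 0.0000086)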
POS Tagging
Transformation Based Tagging (1)

- A combination of rule-based and stochastic tagging methodologies
  - Like the rule-based approach, because rule templates are used to learn transformations
  - Like the stochastic approach, because machine learning is used, with a tagged corpus as input
- Input:
  - a tagged corpus
  - a lexicon (with all possible tags for each word)
POS Tagging
Transformation Based Tagging (2)

- Basic idea:
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order
- Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among the set of rules, find the one with the highest score
  3. Continue from 2 until the lowest score threshold is passed
  4. Keep the ordered set of rules
- Rules make errors that are corrected by later rules
Transformation Based Tagging
Example

- The tagger labels every word with its most likely tag
  - For example, race has the following probabilities in the Brown corpus:
    P(NN|race) = 0.98
    P(VB|race) = 0.02
- Transformation rules then make changes to tags (see the sketch below)
  - "Change NN to VB when previous tag is TO"
    … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
    becomes
    … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
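A Python sketch of applying learned transformation rules (not Brill's learner itself): start from each word's most likely tag, then apply the rules in order. The single rule below is the slide's example:

    most_likely_tag = {"is": "VBZ", "expected": "VBN", "to": "TO",
                       "race": "NN", "tomorrow": "NN"}

    # (from_tag, to_tag, required_previous_tag)
    rules = [("NN", "VB", "TO")]  # "Change NN to VB when previous tag is TO"

    def tbl_tag(words):
        tags = [most_likely_tag[w] for w in words]
        for frm, to, prev in rules:
            for i in range(1, len(tags)):
                if tags[i] == frm and tags[i - 1] == prev:
                    tags[i] = to
        return list(zip(words, tags))

    print(tbl_tag(["is", "expected", "to", "race", "tomorrow"]))
    # [('is','VBZ'), ('expected','VBN'), ('to','TO'),
    #  ('race','VB'), ('tomorrow','NN')]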
POS Taggers (1)
ACOPOST
Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah
Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.
Platforms: All POSIX (Linux/BSD/UNIX-like OSes)
Access: Free at http://sourceforge.net/projects/acopost/
BRILL’S TAGGER
Author(s): Eric Brill
Purpose: Transformation Based Learning POS Tagger
Access: Free at http://www.cs.jhu.edu/~brill
fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit
primarily oriented towards Natural Language-related tasks (POS tagging, base NP
chunking, text chunking, end-of-sentence detection). It is currently trained for English
and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/
POS Taggers (2)
LINGSOFT
Author(s): LINGSOFT, Finland
Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish,
English, German, Norwegian, Swedish.
Access: Not free. Demos at http://www.lingsoft.fi/demos.html
LT POS (LT TTT)
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: The LT POS part of speech tagger uses a Hidden Markov Model
disambiguation strategy. It is currently trained only for English.
Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html
MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic
linguistic analysis tasks at very high speed and provide relevant information about
words and concepts to volume-intensive applications. Available for: English, French,
Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/
POS Taggers (3)
MXPOST
Author(s): Adwait Ratnaparkhi
Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version
includes a Wall St. Journal tagging model for English, but can also be trained for
different languages.
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html
MEMORY BASED TAGGER
Author(s): ILK - Tilburg University, CNTS - University of Antwerp
Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.
Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt
µ-TBL
Author(s): Torbjörn Lager
Purpose: The µ-TBL system is a powerful environment in which to experiment with
transformation-based learning.
Platforms: Windows
Access: Free at http://www.ling.gu.se/~lager/mutbl.html
POS Taggers (4)
QTAG
Author(s): Oliver Mason, Birmingham University, UK
Purpose: QTag is a probabilistic part-of-speech tagger. Resource files for English and
German can be downloaded together with the tool.
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html
STANFORD POS TAGGER
Author(s): Kristina Toutanova, Stanford University, USA
Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The
downloadable package includes components for command-line invocation and a Java
API both for training and for running a trained tagger.
Platforms: Platform independent
Access: Free at http://nlp.stanford.edu/software/tagger.shtml
SVM TOOL
Author(s): TALP Research Center, Technical University of Catalonia, Spain
Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight implementation of Vapnik's Support Vector Machine by Thorsten Joachims has been used to train the models for Catalan, English and Spanish.
Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/
POS Taggers (5)
TnT
Author(s): Thorsten Brants, Saarland University, Germany
Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second-order Markov models. TnT comes with two language models, one for German and one for English.
Platforms: Platform independent.
Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/
TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart,
Germany
Purpose: The TreeTagger has been successfully used to tag German, English, French,
Italian, Spanish, Greek and old French texts and is easily adaptable to other
languages if a lexicon and a manually tagged training corpus are available.
Access: Free at
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
POS Taggers (6)
Xerox XRCE MLTT Part Of Speech Taggers
Author(s): Xerox Research Centre Europe
Purpose: Xerox has developed morphological analysers and part-of-speech
disambiguators for various languages including Dutch, English, French, German,
Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian,
Polish and Russian.
Access: Not free. Demos at
http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html
YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is the same system that performed best in the CoNLL-2000 Shared Task on chunking and base NP chunking.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/
Stemming

- Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form (not necessarily the base form), which can then be used in the retrieval process
- Frequently, the performance of an IR system will be improved if term groups such as CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process reduces the total number of terms in the IR system, and hence the size and complexity of the data in the system, which is always advantageous
The Porter Stemmer

- A conflation stemmer developed by Martin Porter at the University of Cambridge in 1980
- Idea: the English suffixes (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes
- Can be adapted to other languages (needs a list of suffixes and context-sensitive rules; see the sketch below)
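Not the Porter algorithm itself, but a minimal Python sketch in its spirit: a list of suffix-stripping rules, tried longest first, with a crude minimum-stem-length condition standing in for Porter's context-sensitive rules:

    SUFFIX_RULES = [("ions", ""), ("ing", ""), ("ion", ""),
                    ("ed", ""), ("s", "")]

    def stem(word):
        for suffix, replacement in SUFFIX_RULES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)] + replacement
        return word

    words = ["connect", "connected", "connecting",
             "connection", "connections"]
    print({w: stem(w) for w in words})  # every form maps to 'connect'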
Stemmers (1)
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research,
Greece
Access: Free at http://www.ellogon.org/
FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk,
Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html
HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian,
Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/
Stemmers (2)
LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German,
French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/
SNOWBALL
Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English,
Russian, Romance languages (French, Spanish, Portuguese and Italian), German,
Dutch, Swedish, Norwegian, Danish and Finnish.
Access: Free at http://www.snowball.tartarus.org/
SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese
Access: Not free. More information at http://sprout.dfki.de/
TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/
Lemmatization

- The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME
- Dictionary based (see the sketch below)
  - Input: token + POS
  - Output: lemma
  - Note: needs POS information
  - Example: left+v -> leave, left+a -> left
- Lemmatization amounts to looking for a transformation to apply to a word to obtain its normalized form (word endings: which suffix should be removed and/or added to get the normalized form) => lemmatization can be modeled as a machine learning problem
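A Python sketch of dictionary-based lemmatization: lookup is keyed on (token, POS), which is why POS information is required. The dictionary entries are illustrative:

    LEMMA_DICT = {
        ("left", "v"): "leave",   # 'left' as a verb
        ("left", "a"): "left",    # 'left' as an adjective
        ("came", "v"): "come",
        ("coming", "v"): "come",
    }

    def lemmatize(token, pos):
        # Fall back to the token itself for unknown (token, POS) pairs.
        return LEMMA_DICT.get((token.lower(), pos), token)

    print(lemmatize("left", "v"), lemmatize("left", "a"))  # leave left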
Lemmatizers (1)
CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian,
Finnish.
Access: Not free. Demos at http://www.conexor.fi/
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research,
Greece
Access: Free at http://www.ellogon.org/
FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk,
Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html
MBLEM
Author(s): ILK Research Group, Tilburg University
Purpose: MBLEM is a lemmatizer for English, German, and Dutch.
Access: Demo at http://ilk.uvt.nl/mblem/
Lemmatizers (2)
SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html
TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart,
Germany
Purpose: The TreeTagger has been successfully used for German, English, French,
Italian, Spanish, Greek and old French texts and is easily adaptable to other
languages if a lexicon is available.
Access: Free at
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/
Syntactic Parsing

- Syntax refers to the way words are arranged together and the relationships between them
- Parsing is the process of using a grammar to assign a syntactic analysis to a string of words
- Approaches:
  - Shallow Parsing
  - Dependency Parsing
  - Context-Free Parsing
Shallow Parsing

- Partition the input into a sequence of non-overlapping units, or chunks, each a sequence of words labelled with a syntactic category and possibly a marking to indicate which word is the head of the chunk
- How?
  - A set of regular expressions over POS labels (see the sketch below)
  - Training the chunker on manually marked-up text
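A Python sketch of the regular-expressions-over-POS-labels approach for noun phrase chunks; the tag inventory and pattern are simplified assumptions. Each tag is mapped to one character so that match positions in the encoded string correspond one-to-one to token positions:

    import re

    TAG_CODE = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N"}
    NP_RE = re.compile(r"D?J*N+")  # optional Det, any Adjs, 1+ Nouns

    def np_chunks(tagged):
        coded = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
        return [[w for w, _ in tagged[m.start():m.end()]]
                for m in NP_RE.finditer(coded)]

    sent = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
            ("saw", "VBD"), ("a", "DT"), ("dog", "NN")]
    print(np_chunks(sent))  # [['the', 'quick', 'fox'], ['a', 'dog']]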
Dependency Parsing

- Based on dependency grammars, where a syntactic analysis takes the form of a set of head-modifier dependency links between words, each link labelled with the grammatical function of the modifying word with respect to the head
- The parser first labels each word with all possible function types and then applies handwritten rules to introduce links between specific types and remove other function-type readings
Context-Free (CF) Parsing

- CF parsing algorithms form the basis for almost all approaches to parsing that build hierarchical phrase structure
- CFG example:
    S -> NP VP
    NP -> Det NOMINAL
    NOMINAL -> Noun
    VP -> Verb
    Det -> a
    Noun -> flight
    Verb -> left
- A derivation is a sequence of rules applied to a string that accounts for that string (derivation tree). For example, the grammar above derives "a flight left": S => NP VP => Det NOMINAL VP => a Noun VP => a flight VP => a flight left
- Parsing is the process of taking a string and a grammar and returning one (or more) parse tree(s) for that string
- Treebanks = parsed corpora in the form of trees
Probabilistic CFGs

- Assigning probabilities to parse trees:
  - Attach probabilities to grammar rules
  - The expansions for a given non-terminal sum to 1
  - A derivation (tree) consists of the set of grammar rules that are in the tree
  - The probability of a tree is just the product of the probabilities of the rules in the derivation
- Needed: grammar, dictionary with POS, parser
- The task is to find the maximum-probability tree for an input (see the sketch below)
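A Python sketch of the tree-probability computation: the probability of a parse tree is the product of the probabilities of its rules. The grammar fragment and its probabilities are invented for illustration (in a full PCFG the expansions of each non-terminal would sum to 1):

    RULE_PROB = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("Det", "NOMINAL")): 0.6,
        ("NOMINAL", ("Noun",)): 1.0,
        ("VP", ("Verb",)): 1.0,
        ("Det", ("a",)): 0.3,
        ("Noun", ("flight",)): 0.1,
        ("Verb", ("left",)): 0.2,
    }

    def tree_prob(tree):
        # A tree is (label, [children]); a leaf is a plain string.
        if isinstance(tree, str):
            return 1.0
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = RULE_PROB[(label, rhs)]
        for child in children:
            p *= tree_prob(child)
        return p

    t = ("S", [("NP", [("Det", ["a"]),
                       ("NOMINAL", [("Noun", ["flight"])])]),
               ("VP", [("Verb", ["left"])])])
    print(tree_prob(t))  # 1.0*0.6*0.3*1.0*0.1*1.0*0.2 = 0.0036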
Noun Phrase (NP) Chunkers
fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit
primarily oriented towards Natural Language-related tasks (POS tagging, base NP
chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is
currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/
YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is the same system that performed best in the CoNLL-2000 Shared Task on chunking and base NP chunking.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/
Syntactic parsers
MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic
linguistic analysis tasks at very high speed and provide relevant information about
words and concepts to volume-intensive applications. Available for: English, French,
Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/
Named Entity Recognition

- Identification of proper names in texts, and their classification into a set of predefined categories of interest:
  - entities: organizations, persons, locations
  - temporal expressions: time, date
  - quantities: monetary values, percentages, numbers
- Two kinds of approaches:
  - Knowledge Engineering:
    - rule based
    - developed by experienced language engineers
    - make use of human intuition
    - small amount of training data
    - very time consuming
    - some changes may be hard to accommodate
  - Learning Systems:
    - use statistics or other machine learning
    - developers do not need LE expertise
    - require large amounts of annotated training data
    - some changes may require re-annotation of the entire training corpus
Named Entity Recognition
Knowledge engineering approach

- Identification of named entities in two steps:
  - recognition patterns expressed as WFSA (Weighted Finite-State Automata) are used to identify phrases containing potential candidates for named entities (longest-match strategy)
  - additional constraints (depending on the type of candidate) are used for validating the candidates
- Use of an on-line base lexicon for geographical names and first names
Named Entity Recognition
Problems

- Variation of NEs, e.g. John Smith, Mr. Smith, John
- Since named entities may appear without designators (companies, persons), a dynamic lexicon is used to store such named entities
  Example: "Mars Ltd is a wholly-owned subsidiary of Food Manufacturing Ltd, a non-trading company registered in England. Mars is controlled by members of the Mars family."
- Resolution of type ambiguity using the dynamic lexicon: if an expression can be a person name or a company name (Martin Marietta Corp.), then use the type of the last entry inserted into the dynamic lexicon to make the decision
- Issues of style, structure, domain, genre, etc.
- Punctuation, spelling, spacing, formatting
Named Entity Recognizers (1)
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research,
Greece
Purpose: Available for Unicode.
Access: Free at http://www.ellogon.org/
HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian,
Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/
INSIGHT DISCOVERER EXTRACTOR
Author(s): TEMIS
Purpose: Supported languages: Spanish, Russian, Portuguese, Polish, Italian, Hungarian,
Greek, German, French, English, Dutch, Czech.
Access: Not free. More information at http://www.temis-group.com/
Named Entity Recognizers (2)
LINGPIPE
Author(s): Bob Carpenter, Breck Baldwin, Alias-I
Purpose: Supported languages: Unicode, Spanish, German, French, English, Dutch.
Access: Free at http://www.alias-i.com/lingpipe/
YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is the same system that performed best in the CoNLL-2000 Shared Task on chunking and base NP chunking.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/
Automatic term extraction

- Terms = linguistic labels of concepts
- Concepts = units of thought (a vague definition): if a term represents a unit of thought, its appearance in textual data has to be statistically significant; otherwise, the "unit" nature of the concept the term represents is in question
- Label: different labels can be used for the same concept, and the same label can be used for different concepts
Automatic term extraction

- Lexico-syntactic approaches: use lexical and syntactic patterns:
  - domain-specific prefixes and suffixes (e.g. formaldehyde)
  - part-of-speech sequences (AN; NN; AAN; ANN; NAN; NNN; NPN) (how about ((A|N)+|((A|N)*(N|P)?)(A|N)*)N ?)
  - cue words or phrases (immediate left/right contexts)
- Statistical approaches: different statistical measures (see the sketch below):
  - frequency, relative frequency, tf.idf, etc. (for the whole unit)
  - mutual information, t-score, z-score, etc. (collocation measurement)
  - C-value: combines both internal and external statistical measures
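A Python sketch of one termhood measure from the list above, tf.idf: a candidate term scores higher when it is frequent in a document but rare across the collection. The three toy documents are made up:

    import math
    from collections import Counter

    docs = [["gene", "expression", "gene"],
            ["expression", "of", "interest"],
            ["the", "gene", "sequence"]]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term] / len(doc)          # term frequency
        df = sum(term in d for d in docs)           # document frequency
        return tf * math.log(len(docs) / df)        # tf * idf

    print(tf_idf("gene", docs[0], docs))  # (2/3) * ln(3/2) ≈ 0.27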
Terminology extractors (1)
CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian,
Finnish.
Access: Not free. Demos at http://www.conexor.fi/
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research,
Greece
Purpose: Available for Unicode.
Access: Free at http://www.ellogon.org/
FASTR
Author(s): Christian Jacquemin, Groupe Langage et Cognition, CNRS-LIMSI
Purpose: Available for French and English.
Access: Free at http://www.limsi.fr/Individu/jacquemi/FASTR/
INTEX
Author(s): Max Silberztein, New York University
Purpose: Supported languages: Spanish, Portuguese, Italian, French, English.
Access: Free at http://www.nyu.edu/pages/linguistics/intex/
Terminology extractors (2)
NOMINO
Author(s): Université de Québec à Montréal
Purpose: French and English term extractors.
Access: Free at http://www.ling.uqam.ca/nomino/
PROMEMORIA
Author(s): BridgeTerm
Purpose: Translation memory system with terminology extraction component.
Access: Not free. More information at http://www.bridgeterm.com/en/promem.html
PWA
Author(s): Jörg Tiedemann, Mikael Andersson, Magnus Merkel, Lars Ahrenberg, Anna
Sågvall Hein, Department of Linguistics, Uppsala University; Department of Computer
and Information Science, Linköping University, Sweden
Purpose: Language independent terminology extractor.
Access: Free at http://stp.ling.uu.se/~corpora/plug/pwa/index.html
TerminologyExtractor
Author(s): Etienne Cornu, Chamblon Systems Inc., Cambridge, Ontario, Canada
Purpose: Available for French and English.
Access: Not free. More information at http://www.chamblon.com/terminologyextractor.htm
Terminology extractors (3)
Xerox TermFinder
Author(s): Xerox Multilingual Knowledge Management Solutions
Purpose: Supported languages: Swedish, Spanish, Russian, Portuguese, Norwegian,
Hungarian, German, French, Finnish, English, Dutch, Danish.
Access: Not free. More information at http://www.mkms.xerox.com/
Terminology data management tools (1)
DÉJÀ VU
Author(s): Atril Software
Purpose: Translation memory system with integrated terminology tool.
Access: Not free. Trial version at: http://www.atril.com
DICOMAKER
Author(s): Dalix Software
Access: http://www.dicomaker.com/
EDITERM
Author(s): EDIT INC.
Access: Not free. More information at http://www.editerm.com/indexN.html
LEXSYN
Author(s): Babeling
Access: Not free. Evaluation version at http://www.babeling.com/accueil.html
LOGITERM
Author(s): Terminotix Inc.
Access: Not free. More information at http://www.terminotix.com/eng/index.htm
Terminology data management tools (2)
MULTITERM
Author(s): TRADOS
Purpose: Available as a stand-alone version or as part of the TRADOS TM Workbench
translation memory system.
Access: Not free. More information at http://www.trados.com/products.asp?page=22
MULTITRANS
Author(s): MultiCorpora R&D Inc.
Purpose: Translation memory system with integrated terminology tool.
Access: Not free. More information at http://www.multicorpora.ca
SYSTEM QUIRK
Author(s): School of ECM, University of Surrey, UK
Access: Free at http://www.computing.surrey.ac.uk/SystemQ/
TERMBASE
Author(s): University of Mainz
Access: Free at http://www.fask.uni-MAINZ.de/user/srini/srini.html
TERMSTAR
Author(s): STAR-USA, LLC
Access: Not free. http://www.star-group.net/eng/software/sprachtech/termstar.html
Text Summarization

- Text summarization = the automatic creation of summaries of one or more texts
- Summary = a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s)
- Types of summary:
  - Extracts: summaries created by reusing portions (words, sentences) of the input text(s)
  - Abstracts: summaries created by re-generating the extracted content
Text Summarization
Methodology

There are three stages of automated text summarization:
- Stage 1: Topic identification
  - Using different criteria of importance, the system identifies the most important units (words, sentences, passages). If it simply lists them => extract. If not => Stage 2 and Stage 3 (see the sketch after this list)
  - Criteria of importance:
    - Cue phrase indicator criteria
    - Word and phrase frequency criteria
    - Query and title overlap criteria
    - Combinations of various criteria and scores
- Stage 2: Interpretation or topic fusion: a template representation of the important topics identified at Stage 1
- Stage 3: Summary generation: the information captured in the templates is processed by NLG modules to obtain the summary (abstract)
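A Python sketch of Stage 1 using only the word-frequency criterion: score each sentence by the average corpus frequency of its words and list the top-scoring sentences as the extract. The splitting heuristic and scoring are deliberately crude:

    import re
    from collections import Counter

    def extract(text, n_sentences=1):
        sentences = re.split(r"(?<=[.?!])\s+", text)
        freq = Counter(re.findall(r"\w+", text.lower()))
        def score(sent):
            toks = re.findall(r"\w+", sent.lower())
            return sum(freq[t] for t in toks) / max(len(toks), 1)
        return sorted(sentences, key=score, reverse=True)[:n_sentences]

    text = ("The parser builds trees. The parser uses a grammar. "
            "Summarization is hard.")
    print(extract(text))  # ['The parser builds trees.']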
Text summarizers (1)
BREVITY
Author(s): Art Pollard, Lextek International
Access: Not free, demo available at http://www.lextek.com/brevity/
CAST
Author(s): Constantin Orasan, Laura Hasler, Ruslan Mitkov, University of Wolverhampton, UK
Access: Free at http://clg.wlv.ac.uk/projects/CAST

COPERNIC SUMMARIZER
Author(s): Copernic Technologies
Purpose: Supported languages: Spanish, German, French, English.
Access: Not free, trial available at http://www.copernic.com/en/products/summarizer/index.html
GATE
Author(s): NLP Group, University of Sheffield, UK
Access: Free but requires registration at http://gate.ac.uk/
Text summarizers (2)
LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German,
French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/
MEAD
Author(s): The Center for Language and Speech Processing, Johns Hopkins University,
USA
Access: Free at http://www.summarization.com/mead/
MUST
Author(s): Chin-Yew Lin, Eduard Hovy, ISI, USA
Purpose: MuST performs web access, text summarization and translation into English
from Japanese, Arabic, Spanish, and Indonesian.
Access: Demo at http://www.isi.edu/natural-language/projects/MuST.html
PERTINENCE
Author(s): A. Lehmam, P. Bouvet, Pertinence
Purpose: Available for English, French and Spanish.
Access: Free at http://www.pertinence.net/index.html
Text summarizers (3)
SUMMARIST
Author(s): Eduard Hovy, Chin-Yew Lin, Daniel Marcu, ISI, USA
Purpose: SUMMARIST produces extract summaries in five languages (English,
Japanese, Arabic, Spanish and Indonesian)
Access: Demo at http://www.isi.edu/natural-language/projects/SUMMARIST.html
SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html
SYSTEM QUIRK
Author(s): School of ECM, University of Surrey, UK
Access: Free at http://www.computing.surrey.ac.uk/SystemQ/
Language Identification

- The task of detecting the language a text is written in
- Identifying the language of a text from some of the text's attributes is a typical classification problem
- Two approaches to language identification:
  - Short words (articles, prepositions, etc.)
  - N-grams (sequences of n letters). Best results are obtained with trigrams (3 letters)
Language Identification
Trigram method

The method involves two modules:
- Training Module: source-language texts are turned into language-specific trigram data files, which are then merged into a combined data file covering all languages
- Language Detection Module: given an input text and the combined data file, it outputs the language of the input text
Trigram method
Training module

Given a specific language and a text file written in that language, the training module executes the following steps (see the sketch below):
1. Remove characters that may reduce the probability of correct language identification (! " ( ) [ ] { } : ; ? , . & £ $ % * 0 1 2 3 4 5 6 7 8 9 - ` +)
2. Replace all white spaces with _ to mark word boundaries, then replace any sequence of __ with _ so that double spaces are treated as one
3. Store all three-character sequences within an array, with each having a counter indicating its number of occurrences
4. Remove from the list all trigrams with underscores in the middle ('e_a' for example), as they are considered invalid trigrams
5. Retain for further processing only those trigrams appearing more than x times
6. Approximate the probability of each trigram occurring in a particular language by summing the frequencies of all the retained trigrams for that language, and dividing each frequency by the total sum

This process is repeated for all languages the system should be trained on. All language-specific trigram data files are then merged into one combined training file.
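A Python sketch of the training steps above; the minimum-count threshold and the stripped-character set are simplified:

    import re
    from collections import Counter

    def train_trigrams(text, min_count=2):
        # Steps 1-2: strip unwanted characters, mark word boundaries with _
        text = re.sub(r"[0-9!\"()\[\]{}:;?,.&£$%*`+-]", "", text)
        text = re.sub(r"\s+", "_", text)
        # Step 3: count all three-character sequences
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        # Steps 4-5: drop trigrams with a middle underscore (e.g. 'e_a')
        # and trigrams occurring fewer than min_count times
        counts = {t: c for t, c in counts.items()
                  if t[1] != "_" and c >= min_count}
        # Step 6: normalize counts into probabilities P(trigram|language)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    profiles = {
        "en": train_trigrams("the cat sat on the mat with the hat"),
        "fr": train_trigrams("le chat est sur le tapis avec le chapeau"),
    }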
Trigram method
Language detection module

- Input: text written in an unknown language
- The unknown text sample is processed in a similar way to the training data (i.e. removing unwanted characters, replacing spaces with underscores and then dividing it into trigrams), and for each trained language the probability of the resulting sequence of trigrams is computed. A zero probability is assigned to each unknown trigram (see the sketch below)
- The language is identified by the language trigram data set with the highest combined probability of occurrence
- The fewer characters in the source text, the less accurate the language detection is likely to be
- This method is successful in more than 90% of cases when the input text contains at least 40 characters
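Continuing the training sketch above (reusing train_trigrams and profiles), a Python sketch of the detection step; instead of a hard zero, unknown trigrams get a very small constant probability so the log-probability comparison stays usable:

    import math
    import re

    def detect(text, profiles):
        text = re.sub(r"[0-9!\"()\[\]{}:;?,.&£$%*`+-]", "", text)
        text = re.sub(r"\s+", "_", text)
        trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
        def log_prob(profile):
            return sum(math.log(profile.get(t, 1e-12)) for t in trigrams)
        # Pick the language whose trigram profile gives the input text
        # the highest combined probability.
        return max(profiles, key=log_prob)

    print(detect("the hat sat on the mat", profiles))  # en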
Language Guessers (1)
SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html
LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German,
French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/
TED DUNNING'S LANGUAGE IDENTIFIER
Author(s): Ted Dunning
Access: Free at ftp://crl.nmsu.edu/pub/misc/lingdet_suite.tar.gz
TEXTCAT
Author(s): Gertjan van Noord
Purpose: TextCat is an implementation of the N-Gram-Based Text Categorization
algorithm and at the moment, the system knows about 69 natural languages.
Access: Free at http://grid.let.rug.nl/~vannoord/TextCat/
Language Guessers (2)
XEROX LANGUAGE IDENTIFIER
Author(s): Xerox Research Centre Europe
Purpose: Supported languages: Albanian, Arabic, Basque, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovakian, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh.
Access: Not free. More information at http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html
Statistical language modeling toolkits
CMU - Cambridge Statistical Language Modeling Toolkit
Author(s): Philip Clarkson and Roni Rosenfeld, Carnegie Mellon University, USA
Purpose: The toolkit is a suite of UNIX software tools to facilitate the construction and
testing of statistical language models.
Access: Free at http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
BOW - A Toolkit for Statistical Language Modeling, Text Retrieval, Classification
and Clustering
Author(s): Andrew McCallum, Carnegie Mellon University, USA
Purpose: Bow (or LIBBOW) is a library of C code useful for writing statistical text
analysis, language modeling and information retrieval programs. The current
distribution includes the library, as well as front-ends for document classification
(RAINBOW), document retrieval (ARROW) and document clustering (CROSSBOW).
Access: Free at http://www-2.cs.cmu.edu/~mccallum/bow/
CMU - Cambridge Statistical Language Modeling Toolkit

- The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community
- Some of the tools are used to process general textual data into:
  - word frequency lists and vocabularies
  - word bigram and trigram counts
  - vocabulary-specific word bigram and trigram counts
  - bigram- and trigram-related statistics
  - various backoff bigram and trigram language models
CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (1)

- text2wfreq
  - Input: a text file
  - Output: a list of every word which occurred in the text, along with its number of occurrences
- wfreq2vocab
  - Input: a word-frequency file as produced by text2wfreq
  - Output: a file containing a list of vocabulary words
- text2wngram
  - Input: a text file
  - Output: a list of every word n-gram (n is a parameter) which occurred in the text, along with its number of occurrences
- text2idngram
  - Input: a text file plus a vocabulary file
  - Output: a list of every id n-gram (n-tuples of numbers corresponding to the mapping of the word n-grams relative to the vocabulary) which occurred in the text, along with its number of occurrences
CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (2)

- wngram2idngram
  - Input: a word n-gram file, plus a vocabulary file
  - Output: a list of every id n-gram which occurred in the text, along with its number of occurrences
- idngram2stats
  - Input: an id n-gram file
  - Output: a list of the frequency-of-frequencies for each of the 2-grams, …, n-grams
- mergeidngram
  - Input: a set of id n-gram files
  - Output: one id n-gram file containing the merged id n-grams from the input files
- idngram2lm
  - Input: an id n-gram file and a vocabulary file
  - Output: a language model in either binary format or ARPA format
CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (3)

- binlm2arpa
  - Input: a binary format language model, as generated by idngram2lm
  - Output: an ARPA format language model
- evallm
  - Input: a binary or ARPA format language model, as generated by idngram2lm
  - Output: confirmation or denial that the probabilities of the words in the context supplied by the user sum to one
Corpora - Large collections aimed at the NLP community
LDC (Linguistic Data Consortium)
Access: http://www.ldc.upenn.edu/
ELDA (European Language Resources Association)
Access: http://www.elra.info/
TRACTOR (TELRI Research Archive of Computational Tools and Resources)
Access: http://www.tractor.de/
CLR (Consortium for Lexical Research)
Access: http://crl.nmsu.edu/Tools/CLR/
European Corpus Initiative Multilingual Corpus I (ECI/MCI)
Access: http://www.elsnet.org/resources/eciCorpus.html
MULTEXT: Multilingual Text Tools and Corpora
Access: http://www.lpl.univ-aix.fr/projects/multext/
Electronic Text Collections in Western European Literature
Purpose: Pointers to internet sources for literary texts in the western European
languages other than English: Catalan, Danish, Dutch, Finnish, French, German,
Italian, Norwegian, Old Norse, Portuguese, Provençal, Spanish, Swedish.
Access: Free at http://www.lib.virginia.edu/wess/etexts.html
Other multilingual corpora
CRATER Multilingual Aligned Annotated Corpus
Purpose: Aligned corpus in English, French and Spanish.
Access: http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
EMILLE/CIIL
Purpose: Monolingual written corpus data for 14 South Asian languages (Assamese,
Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi,
Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and
parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi
and Urdu). In addition, the parallel corpus contains the English originals from which
the translations stored in the corpus were derived. All data in the corpus is CES and
Unicode compliant. The EMILLE corpus totals some 94 million words.
Access: Free at http://bowland-files.lancs.ac.uk/corplang/emille/
OPUS
Purpose: An open source parallel corpus, aligned, in many languages, based on free
Linux etc. manuals.
Access: http://logos.uio.no/opus/
Searchable Canadian Hansard French-English parallel texts (1986-1993)
Access: http://rali.iro.umontreal.ca/
European Union Web Server
Access: http://europa.eu.int/
Online multilingual dictionaries
ECTACO
Access: www.ectaco.com
YOURDICTIONARY
Purpose: A comprehensive index of dictionaries available on the web.
Access: http://www.yourdictionary.com/
Lexical resources (wordnets)
Balkanet
Purpose: The Balkanet project aimed at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages (Bulgarian, Czech, Greek, Romanian, Serbian and Turkish).
Access: http://www.ceid.upatras.gr/Balkanet/
EuroWordnet
Purpose: EuroWordNet is a multilingual database with wordnets for several European
languages (Dutch, Italian, Spanish, German, French, Czech and Estonian).
Access: http://www.illc.uva.nl/EuroWordNet/
WordNet
Purpose: WordNet is an online lexical reference system. The wordnets developed as a
result of the Balkanet and EuroWordnet projects are linked to the original Princeton
WordNet to ensure conceptual equivalence.
Access: http://wordnet.princeton.edu/
Treebanks (1)
Penn Treebank
Language: US-English
Size: over 2 million words
Access:
BLLIP WSJ corpus
Language: US-English
Size: 30 million words
Access:
ICE-GB
Language: UK-English
Size: 1 million words
Access:
NEGRA Corpus
Language: German
Size: 20,000 sentences
Access:
Treebanks (2)
TIGER Corpus
Language: German
Size: 700,000 words
Access:
Alpino Dependency Treebank
Language: Dutch
Size: 150,000 words
Access:
The Prague Dependency Treebank 1.0
Language: Czech
Size: 500,000 words
Access:
Bulgarian Treebank
Language: Bulgarian
Size: n/a
Access:
Treebanks (3)
Penn Chinese Treebank
Language: Chinese
Size: 100,000 words
Access:
Danish Dependency Treebank 1.0
Language: Danish
Size: 100,000 words
Access:
Syntactic Spanish Database
Language: Spanish
Size: 1.5 million words
Access:
LDC Korean Treebank
Language: Korean
Size: n/a
Access:
Methods and applications that did not make it into this presentation

- Word Sense Disambiguation
  - Nancy Ide and Dan Tufis
- Anaphora Resolution
  - Dan Cristea, Constantin Orasan and Oana Postolache
- Machine Translation
  - Daniel Marcu and Dragos Stefan Munteanu
- Question Answering
  - Bernardo Magnini and Marius Pasca
Conclusions

- Many resources for textual NLP already exist on the Web and can be exploited and adapted to new languages
- All methods presented today can be adapted to a new language
- Hopefully the present inventory will be of help in your future NLP activity
Thank you!