Presentation - IIIT Hyderabad

Introduction to Computational
Linguistics
Dipti Misra Sharma
IIIT, Hyderabad
<dipti@iiit.ac.in>
IASNLP 05-07-2012
Outline
 Background
 What is Computational Linguistics (CL)?
 What do the Computational Linguists do?
 What are the issues in processing natural
languages?
 What can we do with CL?
 Approaches in CL?
Background
Language is a means of communication
Therefore, one can say
It encodes what is communicated <information>
We apply the processes of
Analysis (decoding) for understanding
Synthesis (encoding) for expression (speaking)
What do we communicate ?
Information (SPAIN delivered a football masterclass at Euro 2012)
Intention <purpose>
Emphasis/focus (Euro 2012 won by Spain/ Spain bags Euro 2012)
Introduces variation
How do we communicate ?
We use linguistic elements such as
Words
(country, park, the, is, Bandipur, of, as, and, considered, National, a,
spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)
Arrangement of the words (Sentences)
Words are related to each other to provide the
composite meaning
(Bandipur National park is a beautiful tourist spot and considered as
one of the best wild life sanctuaries in the country)
How do we communicate ?
Arrangement of sentences (Discourse)
Sentences or parts of sentences are related to each other
to provide a cohesive meaning
*(Considered as one of the best wild life sanctuaries in the country.
It is a national park covering an area of about 874 sq km. Bandipur
National park is a beautiful tourist spot.)
(Bandipur National park is a beautiful tourist spot and considered
as one of the best wild life sanctuaries in the country. It is a
national park covering an area of about 874 sq km)
Languages differ in the way they organise information in
these entities
All of these interact in the organisation of information
What is Computational Linguistics?
 Computational linguistics is the scientific study
of language from a computational perspective.
What does it mean?
Scientific
 Provides explanation for a linguistic or
psycholinguistic phenomenon
Computational
 Develops computational
models/techniques for linguistic phenomena
Human language is the subject of study
In other words
Computational linguistics is the application of
 linguistic theories and
 computational techniques
to problems of natural language processing.
http://www.ba.umist.ac.uk/public/departments/registrars/academic
office/uga/lang.htm
What do the Computational Linguists do?
 Linguistic research
 Develop language models for processing natural
languages
 Develop language resources for NLP
research/applications
 Understand and develop models for analysis and
generation of natural languages by the
computers
So,
A Computational Linguist needs to
understand
 How language works
 What information is available in the
language?
 How languages encode information?
 How this knowledge/information can be
represented for computational
processing?
Information in Language (1/4)
Languages encode information
cuuhe   maarate haiN   kutte
rats    kill           dogs
Hindi sentence is ambiguous
Possible interpretations
Dogs kill rats
Rats kill dogs
However,
English sentence is not ambiguous
Information in Language (2/4)
Ambiguity in Hindi is resolved if,
cuuhe   maarate haiN   kuttoN   ko
rats    kill           dogs     acc
English encodes information in positions
Hindi in morphemes
Languages encode information differently
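The contrast between position and morphemes can be sketched in a few lines of Python. This is only a toy illustration: the function names, the fixed Subject-Verb-Object assumption for English, and the treatment of 'ko' as the only case marker are assumptions made for this sketch, not part of any real analyser.

```python
# Toy illustration: "who did what to whom" is read from word position
# in English and from a case marker in Hindi.
# All names and the tiny verb list are assumptions for this sketch.

def roles_english(tokens):
    """Assume strict Subject-Verb-Object order: position decides the role."""
    subject, verb, obj = tokens
    return {"agent": subject, "action": verb, "patient": obj}

def roles_hindi(tokens, verb_words=("maarate", "haiN")):
    """Assume the accusative marker 'ko' directly follows the patient noun."""
    nouns = []
    patient = None
    for i, tok in enumerate(tokens):
        if tok == "ko" or tok in verb_words:
            continue
        nouns.append(tok)
        if i + 1 < len(tokens) and tokens[i + 1] == "ko":
            patient = tok
    if patient is None:
        # No case marker: every agent/patient assignment stays possible.
        return [{"agent": a, "patient": p} for a in nouns for p in nouns if a != p]
    agent = next(n for n in nouns if n != patient)
    return [{"agent": agent, "patient": patient}]

print(roles_english("dogs kill rats".split()))
print(roles_hindi("cuuhe maarate haiN kutte".split()))      # two readings
print(roles_hindi("cuuhe maarate haiN kuttoN ko".split()))  # one reading
```

Because the Hindi reading depends on the marker and not on position, moving 'kuttoN ko' to the front of the sentence gives the same single reading.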
Information in Language (3/4)
Another example,
This chair has been sat on
– The chair has been used for sitting
– X sat on this chair, and it is known
– The sentence does not mention X
Languages encode information partially
Information in Language (4/4)
English pronouns : he, she, it
Hindi pronoun    : vaha
He is going to Delhi  ==> vaha dillii jaa rahaa hai
She is going to Delhi ==> vaha dillii jaa rahii hai
It broke              ==> vaha TuuTa ??
Information does not always map fully from
one language into another
Conceptual worlds may be different
Differences ?
Words
English       Hindi                                Telugu
boys          laDake/laDakoN
<n,pl>        <n,sg/pl,case>
He/she/it     vaha                                 atanu/aame/adi
is/am/are     hai/huuN/haiN/ho
is going      jaa rahaa hai/rahii hai/rahe haiN
Indian Languages
 Relatively flexible word order
1. a) baccaa    phala    khaataa     hai
      ‘child’   ‘fruit’  ‘eat+hab’   ‘pres’
      The child eats fruits
b) phala baccaa khaataa hai
c) phala khaataa hai baccaa
d) baccaa khaataa hai phala
Some structural differences
English
Declarative : Ravi is coming today
Interrogative : Is Ravi coming today ?
Change in the position of ‘is’ brings the change in meaning
Hindi
Declarative : ravi aaj aa rahaa hai
Interrogative : kyaa ravi aaj aa rahaa hai ?
Word ‘kyaa’ encodes the question information
Alternatively, more natural spoken form in Hindi
ravi aaj aa rahaa hai ? (with appropriate intonation) OR
Ravi aaj aa rahaa hai kyaa?
Post nominal modification
'ing' clauses
I know [the man playing guitar]
Hindi, on the other hand
maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN
Clauses having 'un-' negative constructions
English
Unless you reach there the job will not be done
Hindi
jab tak tum vahaaN nahiiN pahuNcate , kaam
nahiiN hogaa
Languages Differ
Different languages have different
mechanisms/devices to encode information
Some devices are common across certain languages and
some are different
There are alternative ways of expressing the same
meaning within the same language
Languages show preferences for one device over the
others
English exploits ‘position’ for encoding information
Hindi uses ‘words’ more effectively
Thus, differences in grammatical structures
Ambiguity in Natural Language (1/2)
Look at the word 'plot' in the following examples
(a) The plot having rocks and boulders is not good.
(b) The plot having twists and turns is interesting.
'plot' in (a) means 'a piece of land' and
in (b) 'an outline of the events in a story'
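One way to picture how context selects a sense is a simple overlap count between the sentence and a list of cue words per sense. The sketch below is a simplified, Lesk-style heuristic; the cue-word lists are hand-made assumptions for this example and do not come from any real lexicon.

```python
# Simplified overlap-based sense selection.
# The cue words for each sense are assumptions made for this sketch.
SENSES = {
    "plot": {
        "piece of land": {"rocks", "boulders", "soil", "fence", "acre"},
        "outline of a story": {"twists", "turns", "story", "novel", "characters"},
    }
}

def disambiguate(word, sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word of any sense occurs in the sentence.
    """
    context = {tok.strip(".,").lower() for tok in sentence.split()}
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("plot", "The plot having rocks and boulders is not good."))
print(disambiguate("plot", "The plot having twists and turns is interesting."))
```

A real system would need much richer context knowledge than a fixed word list; the point here is only that neighbouring words carry the disambiguating information.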
Ambiguity in Natural Language (2/2)
 Lexical level
 Sentence level
Structural differences between the source language (SL)
and the target language (TL) in a Machine Translation system.
Lexical ambiguity
Lexical ambiguity can be both for
Content words – nouns, verbs etc
Function words – prepositions, TAMs etc
 Content words' ambiguity is of two types
Homonymy
Polysemy
Homonymy
A word has two or more unrelated senses
Example :
I was walking on the bank (river-bank)
I deposited the money in the bank (moneybank)
Polysemy
A word having two or more related senses
Example : English word 'issue', noun
1. The issue is under discussion (muddaa)
2. The latest issue of the journal is out (aNka)
3. He buys stamps on the day of the issue
(vimocan)
4. The couple has no issue even after five years
of marriage (saNtaan)
Information Flow and Ambiguity
1. He scratched a figure on the rock (engrave)
2. She scratched the figure on the rock (scrape)
• Other words in the context make a difference
• Change of 'a' (in 1) to 'the' (in 2) changes the
meaning of 'scratched'
Function words can also
pose problems (1/4)
Function words can also be ambiguous
For example – English preposition 'in'
(a) I met him in the garden
maiN usase bagiice meiN milaa
(b) I met him in the morning
maiN usase subaha 0 milaa
'Ambiguity' here refers to the 'appropriate correspondence' in
the target language.
Function words can also
pose problems (2/4)
1. He bought a shirt with tiny collars.
   usane   chote   kaular    vaalii   kamiiz   khariidii
   ‘he     tiny    collars   with     shirt    bought’
   ‘with’ gets translated as ‘vaalii’ in Hindi
2. He washed a shirt with soap.
   usane   saabun   se     kamiiz   dhoii
   ‘he     soap     with   shirt    washed’
   ‘with’ gets translated as ‘se’.
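The choice between 'vaalii' and 'se' can be sketched as a lookup keyed on the kind of noun that follows 'with'. The semantic classes, the small noun lexicon and the function name below are assumptions made for illustration; the form 'vaalii' is also tied to the feminine noun 'kamiiz' in the example and would change with other head nouns.

```python
# Toy transfer rule for English 'with' -> Hindi.
# NOUN_CLASS and HINDI_FOR_WITH are hypothetical, hand-made tables.
NOUN_CLASS = {
    "soap": "instrument",
    "knife": "instrument",
    "collars": "attribute",
    "pockets": "attribute",
}
HINDI_FOR_WITH = {"instrument": "se", "attribute": "vaalii"}

def translate_with(sentence):
    """Pick a Hindi equivalent of 'with' from the class of the final noun.

    Returns None if the sentence has no 'with' or the noun is not in the lexicon.
    """
    tokens = sentence.lower().rstrip(".").split()
    if "with" not in tokens:
        return None
    head = tokens[-1]  # assume the phrase after 'with' ends the sentence
    return HINDI_FOR_WITH.get(NOUN_CLASS.get(head))

print(translate_with("He bought a shirt with tiny collars."))
print(translate_with("He washed a shirt with soap."))
```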
Function words can also
pose problems (3/4)
TAM Markers mark tense, aspect and modality
– Consist of inflections and/or auxiliary verbs in Hindi
– An important source of information
– Narrow down the meaning of a verb (eg. lied, lay)
Function words can also
pose problems (4/4)
English Simple Past vs Habitual
1a. He stayed in the guest house during his visit to our
University in Jan (rahaa)
1b. He stayed in the guest house whenever he visited
us (rahataa thaa)
2a. He went to the school just now (gayaa)
2b. He went to the school everyday (jaataa thaa)
Sentence level ambiguity
I met the girl in the store
+ Possible readings
a) I met the girl who works in the store
b) I met the girl while I was in the store
Time flies like an arrow.
+ Possible parses:
a) Time flies like an arrow (N V Prep Det N)
b) Time flies like an arrow (N N V Det N)
c) Time flies like an arrow (V N Prep Det N) (flies are like an
arrow)
d) Time flies like an arrow (V N Prep Det N) (manner of
timing)
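The readings listed above all start from the fact that several words belong to more than one category. A minimal sketch, assuming a toy lexicon with the categories used on this slide, enumerates every category sequence the words allow; a grammar would then have to decide which of these sequences actually form a sentence.

```python
from itertools import product

# Toy lexicon: the possible categories per word are assumptions for this sketch.
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "Prep"],
    "an": ["Det"],
    "arrow": ["N"],
}

def candidate_tag_sequences(sentence):
    """Return every combination of categories licensed by the lexicon."""
    tokens = sentence.lower().rstrip(".").split()
    return [tuple(tags) for tags in product(*(LEXICON[t] for t in tokens))]

sequences = candidate_tag_sequences("Time flies like an arrow.")
print(len(sequences))
for seq in sequences:
    print(" ".join(seq))
```

With three two-way ambiguous words the lexicon alone allows 2 x 2 x 2 = 8 sequences, of which only a few correspond to the parses on the slide.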
Thus,
Languages encode information differently
Languages code information only partially
Tension between BREVITY and PRECISION
Brevity wins leading to inherent ambiguity at different
levels
Human beings use
World knowledge
Context (both linguistic and extra-linguistic)
Cultural knowledge and
Language conventions to resolve ambiguities
Can all this knowledge be provided to the
machine ?
Computational Linguistics aims for this.
How to provide this knowledge ? (1/2)
Analyse language at various levels (word, phrase, sentence
etc)
Build Tools for analysing the natural language at various
levels in a text
POS tagger (category marking)
Morphological analysers (analysis of a word)
Morphological generators (word generators)
Chunkers (shallow parsers)
Parsers (syntactic analysis)
Filters (markers for special expressions)
Sense Disambiguation Algorithms
Etc
The tools need linguistic knowledge
How to provide this knowledge ? (2/2)
Build language resources
Machine Readable Lexicon
Rules for various levels of linguistic analysis
Computational Grammars
Mapping rules for the concerned language pair
for an MT system
Sense Disambiguation Rules
Annotated corpora
Etc
POS Tagger
What is a POS?
Take the following English sentence
My old friend Ram recently bought a book on Indian snakes for his
cousin from London from the new bookshop .
Each word in the above sentence belongs to a word class (also
called a Part Of Speech (POS))
The class to which a word may belong is based on its
morphological and syntactic behavior
Morphological
Kind of affixes a word takes, for example,
boy, boys; girl, girls; book, books (noun class)
Syntactic
How it is distributed in a sentence
He chairs the next session (verb)
The chairs are new (noun)
Why is POS relevant in CL/NLP ? (1/2)
• Word class information of a given word in a
  sentence helps to predict its neighbour
• WSD (word sense disambiguation)
  He runs a mile every day (verb)
  Their team made 250 runs (noun)
  Time flies like an arrow (n v prep det n)
• Helps in further processing – chunking, morph
  pruning, sentence parsing
• IR (information retrieval)
POS tagged sentence
My         possessive pronoun
old        adjective
friend     noun
Ram        proper noun
recently   adverb
bought     verb
a          determiner
book       noun
on         preposition
Indian     adjective
his        possessive pronoun
cousin     noun
from       preposition
London     proper noun
,          punctuation
from       preposition
the        determiner
new        adjective
bookshop   noun
POS Tagging Approaches

Rule Based

Statistical

Transformation Based
Rule Based POS Tagging

Two-stage architecture
(Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971)
Stage 1
assign POS by referring to the dictionary
Eg Dictionary entry for the English word 'that'
that : Conj, Adv, Pronoun
Stage 2
disambiguate, using manually crafted rules
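The two stages can be sketched as follows. The dictionary entries, the tag names and the two context rules are assumptions made for this example (they use the 'chairs' sentences from the earlier slide); they are not the rules of any of the cited systems.

```python
# Stage 1: dictionary lookup. Stage 2: hand-written disambiguation rules.
# DICTIONARY and the rules are hypothetical, for illustration only.
DICTIONARY = {
    "he": ["PRON"],
    "the": ["DET"],
    "chairs": ["NOUN", "VERB"],
    "next": ["ADJ"],
    "session": ["NOUN"],
    "are": ["VERB"],
    "new": ["ADJ"],
}

def stage1(tokens):
    """Assign every dictionary tag; unknown words default to NOUN (an assumption)."""
    return [list(DICTIONARY.get(t.lower(), ["NOUN"])) for t in tokens]

def stage2(candidates):
    """Resolve ambiguous words using the tag chosen for the previous word."""
    tags = []
    for i, cands in enumerate(candidates):
        prev = tags[i - 1] if i > 0 else None
        if len(cands) == 1:
            tags.append(cands[0])
        elif prev == "DET" and "NOUN" in cands:
            tags.append("NOUN")      # rule: after a determiner, prefer noun
        elif prev == "PRON" and "VERB" in cands:
            tags.append("VERB")      # rule: after a pronoun, prefer verb
        else:
            tags.append(cands[0])    # fallback: first dictionary entry
    return tags

def tag(sentence):
    tokens = sentence.split()
    return list(zip(tokens, stage2(stage1(tokens))))

print(tag("He chairs the next session"))
print(tag("The chairs are new"))
```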
Statistical


Taggers use probabilities for tagging
The tagger picks the most likely tag for a
given word in a context

HMM based algorithms are most commonly used
for POS tagging task

Requires manually tagged corpus
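A minimal sketch of HMM-style tagging with the Viterbi algorithm is given below. In practice the transition and emission probabilities are estimated from a manually tagged corpus; here they are hand-set, made-up numbers chosen only so that the example runs, and the tag names and the floor value for unseen events are likewise assumptions.

```python
import math

TAGS = ["PRON", "DET", "NOUN", "VERB"]
FLOOR = 1e-6  # stand-in probability for unseen events (assumption)

# P(tag | previous tag); "<s>" marks the sentence start. Made-up values.
TRANS = {
    "<s>":  {"PRON": 0.40, "DET": 0.40, "NOUN": 0.15, "VERB": 0.05},
    "PRON": {"VERB": 0.80, "NOUN": 0.10, "DET": 0.05, "PRON": 0.05},
    "DET":  {"NOUN": 0.90, "VERB": 0.03, "DET": 0.02, "PRON": 0.05},
    "NOUN": {"VERB": 0.60, "NOUN": 0.20, "DET": 0.10, "PRON": 0.10},
    "VERB": {"DET": 0.50, "NOUN": 0.35, "PRON": 0.10, "VERB": 0.05},
}
# P(word | tag). Made-up values.
EMIT = {
    "PRON": {"he": 1.0},
    "DET":  {"the": 0.6, "a": 0.4},
    "NOUN": {"runs": 0.2, "team": 0.4, "mile": 0.4},
    "VERB": {"runs": 0.5, "made": 0.5},
}

def viterbi(words):
    """Return the most likely tag sequence for a list of lowercase words."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        p = TRANS["<s>"].get(tag, FLOOR) * EMIT[tag].get(words[0], FLOOR)
        best[0][tag] = (math.log(p), None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], FLOOR))
            best[i][tag] = max(
                (best[i - 1][prev][0] + math.log(TRANS[prev].get(tag, FLOOR)) + emit, prev)
                for prev in TAGS
            )
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi("he runs a mile".split()))
print(viterbi("the team made runs".split()))
```

The word 'runs' comes out as a verb after a pronoun and as a noun after a verb, because the tagger scores whole tag sequences and not single words.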
Annotating Corpus for POS


Annotated corpora are useful for developing
statistical POS taggers
Tagging scheme
Set of POS Tags
Guidelines for the annotators

The tagged corpora should be
High quality (in terms of tagging accuracy)
Consistent
POS Tags for English

English

Penn Tree Bank – 45 tags

C5 - Lancaster – 61 tags – used in CLAWS
Basic tagset used for BNC
http://view.byu.edu/bnc_tags.htm
C7 – 147 tags – Leech
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
Penn Treebank Tags
My         PP$
old        JJ
friend     NN
Ram        NNP
recently   RB
bought     VBD
a          DT
book       NN
on         IN
Indian     JJ
snakes     NNS
his        PP$
cousin     NN
from       IN
London     NNP
,          ,
from       IN
the        DT
new        JJ
bookshop   NN
in         IN
town       NN
POS Tags for Indian Languages

Objective
To arrive at a standard POS and Chunk
tagging scheme for all Indian
languages
Assumption
Commonality in Indian Languages

Issues in Tag Set Design (1/2)

Linguistic knowledge coarse vs fine

Syntactic function vs lexical category (for
POS tags)

New tags vs tags close to existing English
tags

Should be comprehensive/complete
Issues in Tag Set Design (2/2)




Simple
Less effort in manual tagging
Number of tags
Common for all Indian languages
Linguistic Knowledge :
Fine vs Coarse (1/2)
Example
Only noun (NN) laDakA, laDake, laDakoM, laDakI, laDakiyAM,
laDakiyoM
OR
Noun with gender, number, case information
(NNM) laDakA, laDake, laDakoM,
(NNMS) laDakA, laDake
(NNMP) laDake, laDakoM,
(NNMSD) laDakA,
(NNMSO) laDake,
(NNMPD) laDake,
(NNMPO) laDakoM
The decision has implications for the size of
corpora and machine learning
Linguistic Knowledge :
Fine vs Coarse (2/2)

Alternatives
Coarse - NN (advantages/disadvantages)
 Fine - NNMSD
(advantages/disadvantages)
 Hierarchical

Example: NN_m_sg_d
Hierarchical tag set provides the possibility
for underspecification
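Underspecification with a hierarchical tag can be sketched as prefix handling on the tag string. The field order (category, gender, number, case) is inferred from the example NN_m_sg_d and is an assumption, as are the helper names.

```python
# Hierarchical tags as underscore-separated levels, e.g. NN_m_sg_d.
# FIELDS is an assumed reading of the levels in the example tag.
FIELDS = ("category", "gender", "number", "case")

def parse_tag(tag):
    """Split a hierarchical tag into named levels; missing levels are left out."""
    return dict(zip(FIELDS, tag.split("_")))

def coarsen(tag, depth):
    """Keep only the first `depth` levels of the tag."""
    return "_".join(tag.split("_")[:depth])

def matches(fine_tag, underspecified_tag):
    """True if the underspecified tag agrees with the fine tag level by level."""
    fine = fine_tag.split("_")
    coarse = underspecified_tag.split("_")
    return fine[:len(coarse)] == coarse

print(parse_tag("NN_m_sg_d"))
print(coarsen("NN_m_sg_d", 2))
print(matches("NN_m_sg_d", "NN"), matches("NN_m_sg_d", "NN_f"))
```

An annotator or a tagger that cannot decide the case can stop at NN_m_sg, and a coarse NN view of the corpus remains available by truncation.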
Considerations

POS tagger is NOT a replacement for a
morph analyzer

Coarse analysis to begin with

Expandable if needed

If the information can be obtained from
elsewhere, it need not be included in the
POS tag
Syntactic function vs lexical
category
Example
harijana    bAlaka
‘harijan’   ‘child’
Decision : Lexical category

Helps achieve


Consistency in annotation
Better learning
New tags vs tags close to existing
English tags
New tags
Noun, Pron, Adj, Adv
 Familiar tags (Penn Treebank tags)
NN, PRP, JJ, RB

Decision : Penn tags for common lexical
types
New tags for certain IL specific
cases
Comprehensive/Complete


All the lexical items occurring in a sentence
should be marked for their POS, including
punctuation.
If the language has some special cases,
these should also be captured –
Reduplications in ILs
Simple




Why simple ?
The tags are designed for some manual
annotation
Ease of learning
Consistency in annotation
Less Effort in Manual Tagging

The annotators should not have to


Write too much
Take too many steps in annotating a lexical item
Number of Tags



Number of tags makes a difference both for
the man and the machine
For the man in decision making
For the machine in learning for automatic
tagging
Common for All Indian Languages
Indian languages belong to various language
families
 Share linguistic features
However,
 There are differences



Some languages have quotatives, some don't
Some have classifiers, some don't
Chunking
What forms a chunk ?
Non-recursive phrase
((det adj noun))
Partial structure without distorting the
dependencies
Include inflections
(postposition/auxiliaries) with a lexical
category
Example : ((mere choTe bhaaii ne))_NP
((jaa rahaa hai))_VG
Chunker
A Chunker automatically groups words in a sentence as
chunks and labels them
((My old friend Ram))_NP ((recently bought))_VG ((a
book))_NP on ((Indian snakes))_NP for ((his
cousin))_NP from ((London))_NP from ((the new
bookshop))_NP.
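A very small rule-based chunker that reproduces the grouping above can be sketched as follows. It assumes the input is already POS tagged with the Penn-style tags shown earlier, and it simply groups adjacent words whose tags fall in the same hand-picked set; the two tag sets are assumptions for this example. Note that it keeps the adverb inside the verb group, as in the example above.

```python
# Group adjacent tokens into NP / VG chunks by tag class.
# NP_TAGS and VG_TAGS are hand-picked assumptions for this sketch.
NP_TAGS = {"PP$", "DT", "JJ", "NN", "NNS", "NNP"}
VG_TAGS = {"RB", "VBD"}

def chunk(tagged):
    """Return a list of (label, words); label None means outside any chunk."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VG" if tag in VG_TAGS else None
        if label is not None and chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)   # extend the current chunk
        else:
            chunks.append((label, [word]))
    return chunks

def render(chunks):
    """Format chunks in the ((words))_LABEL notation used on the slides."""
    return " ".join(
        f"(({' '.join(words)}))_{label}" if label else " ".join(words)
        for label, words in chunks
    )

tagged = [
    ("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
    ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
    ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"), ("for", "IN"),
    ("his", "PP$"), ("cousin", "NN"), ("from", "IN"), ("London", "NNP"),
    ("from", "IN"), ("the", "DT"), ("new", "JJ"), ("bookshop", "NN"),
]
print(render(chunk(tagged)))
```

One limitation of this sketch is that two noun chunks standing next to each other with nothing in between would be merged into one.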
IL Chunk Tags (1/2)
NP     noun chunk            bahut acchii kitaab
JJP    adjective chunk       bahut sundar sii
RBP    adverb chunk          dhiire – dhiire
NEGP   chunk for negatives   nahiiN
CCP    conjunct chunks       raam aur shyaam
BLK    miscellaneous         interjections etc
IL Chunk Tags (2/2)
VGF     Finite verb chunk                    jaa rahaa hai
VGNF    Non finite verb chunk                jaate hue
VGINF   Infinitive verb chunk                jaanaa
VGNN    Gerunds                              jaanaa
FRAGP   Discontiguous fragments of a chunk   raama (meraa bhaaii) ne
Some Issues
How to chunk the following ?
Adverbs
within a verb chunk or separately
Eg ((recently bought)) or ((recently)) ((bought))
Punctuations
Particles – hii (only), to, bhii (also) etc
Current approach
For punctuation – chunk them with the
preceding chunk
Adverbs – chunk them separately
Particles – chunk them with the chunk to
which they belong
((raam ne bhii)) ((jaa hii rahaa thaa))
Issues
• Verb Negation
1. nahiiN jaa rahaa ‘not going’
2. kahaa hii nahiiN ‘just did not mention’
3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)
4. binaa yaha baata kahe ‘without saying this’
5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai
‘Not only this, in fact, this is also found in writing’
Current approach
For cases 1 to 3, chunk NEG with the verb group
For 4, chunk the NEG separately in a chunk
For 5, also a separate NEGP chunk will work
NOUN NEGATION ???
Chunking Co-ordinate Constructions
1. word1 CC word2
raam aur shyaam
((raam))_NP ((aur))_CCP ((shyaam))_NP
2. phrase CC phrase
meraa bhaaii shyaam aur tumhaaraa bhaaii mohan
((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa
bhaaii mohan))_NP
3. clause CC clause
Discontiguous Phrases
What about cases such as ' X (Y) Z' ?
where X = noun, Y = a phrase, Z = postposition
raam (meraa dillii vaalaa bhaaii) ne
OR
isa 'upanyaas – samraaT' shabda kaa
FRAGP
Chunking Conjunct Verbs
Conjunct verbs
A verb composed of a noun/adj and a
verb (sviikaar karnaa 'accept')
Should the conjunct verbs be tagged as a single
chunk or two chunks?
'prawIkSA karanA' ‘to wait’, 'kSamA karanA' ‘to forgive’ etc
What about genitives ?
raam kaa betaa
'son of Ram'
usakaa betaa
'his/her son'
mere bhaaii raam kaa betaa
'my brother Ram's son'
iske pahale
'before this'
mez ke uupar
'above/on the table'
ravi ke saath
'with Ravi'
Chunking Numbers/Quantifiers (1/2)
Numerals, quantifiers may occur as follows
a) ek laDakaa 'one boy'
b) 1 laDakaa '1 boy'
c) pahalaa laDakaa 'first boy'
d) karoDoN log 'crores (tens of millions) of people'
e) 1962 meiN 'in 1962'
Chunking Numbers/Quantifiers (2/2)
The POS tags for numerals and quantifiers are
QC (numerals) and QF (other quantifiers) in IL
POS tagset
Example (d) and (e) in the previous slide show
cases where the quantifier is behaving like a
noun
The issue :
Should the quantifiers in cases such as (d) and
(e) be tagged as a Q* or as NN since the
chunk itself is a noun chunk ?
Summary
For annotating POS and Chunk a scheme needs
to be designed
While doing so following issues need to be
considered.
Definition of 'chunk'
Elements which together can form a chunk
type
Whether to include postpositions, punctuations
etc inside a chunk or form them as
independent chunks
POS/Chunk tag labels
Approaches in Computational
Linguistics (for Tools)
Two major approaches
Rule based
Requires manually crafted rules
Explicit linguistic knowledge
Needs manual time and effort
Trained manpower
High precision
Less robust
Approaches in Computational
Linguistics (for Tools)
Data driven approach
Uses statistical methods or machine learning
Requires less human effort
Often requires large scale data sources
(manually annotated corpora, lexicons etc)
Linguistic knowledge is implicit
More adaptive to noisy text
More robust
Computational Linguistics Application
Areas
Is useful for
Communication between
Man-machine
Question answering systems, interactive railway
reservation
Text summarization
Web applications
Intelligent search engines
Cross lingual search
Man – man
Machine translation