Introduction to Computational Linguistics
Dipti Misra Sharma
IIIT, Hyderabad
<dipti@iiit.ac.in>
IASNLP 05-07-2012

Outline
• Background
• What is Computational Linguistics (CL)?
• What do Computational Linguists do?
• What are the issues in processing natural languages?
• What can we do with CL?
• Approaches in CL

Background
• Language is a means of communication
• Therefore, one can say it encodes what is communicated <information>
• We apply the processes of
  – Analysis (decoding) for understanding
  – Synthesis (encoding) for expression (speaking)

What do we communicate ?
• Information (SPAIN delivered a football masterclass at Euro 2012)
• Intention <purpose>
• Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)
  – Introduces variation

How do we communicate ?
We use linguistic elements such as
• Words
  (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)
• Arrangement of the words (Sentences)
  – Words are related to each other to provide the composite meaning
  (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)

How do we communicate ?
• Arrangement of sentences (Discourse)
  – Sentences or parts of sentences are related to each other to provide a cohesive meaning
  *(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.)
  (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km)
• Languages differ in the way they organise information in these entities
• All of these interact in the organisation of information

What is Computational Linguistics?
Computational linguistics is the scientific study of language from a computational perspective.
What does it mean?
• Scientific
  – Provides explanation for a linguistic or psycholinguistic phenomenon
• Computational
  – Develops computational models/techniques for linguistic phenomena
• Human language is the subject of study

In other words
Computational linguistics is the application of linguistic theories and computational techniques to problems of natural language processing.
http://www.ba.umist.ac.uk/public/departments/registrars/academic office/uga/lang.htm

What do Computational Linguists do?
• Linguistic research
• Develop language models for processing natural languages
• Develop language resources for NLP research/applications
• Understand and develop models for analysis and generation of natural languages by computers

So, a Computational Linguist needs to understand
• How language works
• What information is available in the language
• How languages encode information
• How this knowledge/information can be represented for computational processing

Information in Language (1/4)
Languages encode information
    cuuhe   maarate haiN   kutte
    rats    kill           dogs
• The Hindi sentence is ambiguous
  Possible interpretations
  – Dogs kill rats
  – Rats kill dogs
• However, the English sentence is not ambiguous

Information in Language (2/4)
Ambiguity in Hindi is resolved if,
    cuuhe   maarate haiN   kuttoN   ko
    rats    kill           dogs     acc
• English encodes information in positions
• Hindi encodes it in morphemes
Languages encode information differently

Information in Language (3/4)
Another example,
    This chair has been sat on
– The chair has been used for sitting
– X sat on this chair, and it is known
– The sentence does not mention X
Languages encode information partially

Information in Language (4/4)
    English pronouns     Hindi pronoun
    he, she, it          vaha
He is going to Delhi ==> vaha dillii jaa rahaa hai
She is going to Delhi ==> vaha dillii jaa rahii hai
It broke ==> vaha TuuTa ??
• Information does not always map fully from one language into another
• Conceptual worlds may be different

Differences ?
Words
    English        Hindi                                Telugu
    boys <n,pl>    laDake/laDakoN <n,sg/pl,case>
    He/she/it      vaha                                 atanu/aame/adi
    is/am/are      hai/huuN/haiN/ho
    is going       jaa rahaa hai/rahii hai/rahe haiN

Indian Languages
Relatively flexible word order
1. a) baccaa    phala     khaataa     hai
      ‘child’   ‘fruit’   ‘eat+hab’   ‘pres’
      The child eats fruits
   b) phala baccaa khaataa hai
   c) phala khaataa hai baccaa
   d) baccaa khaataa hai phala

Some structural differences
English
  Declarative   : Ravi is coming today
  Interrogative : Is Ravi coming today ?
  Change in the position of ‘is’ brings the change in meaning
Hindi
  Declarative   : ravi aaj aa rahaa hai
  Interrogative : kyaa ravi aaj aa rahaa hai ?
  The word ‘kyaa’ encodes the question information
Alternatively, a more natural spoken form in Hindi
  ravi aaj aa rahaa hai ? (with appropriate intonation)
  OR
  ravi aaj aa rahaa hai kyaa ?

Post nominal modification
'ing' clauses
  I know [the man playing guitar]
Hindi, on the other hand
  maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN

Clauses having 'un-' negative constructions
English
  Unless you reach there the job will not be done
Hindi
  jab tak tum vahaaN nahiiN pahuNcate , kaam nahiiN hogaa

Languages Differ
• Different languages have different mechanisms/devices to encode information
• Some devices are common across certain languages and some are different
• There are alternative ways of expressing the same meaning within the same language
• Languages show preferences for one device over the others
  – English exploits ‘position’ for encoding information
  – Hindi uses ‘words’ more effectively
• Thus, differences in grammatical structures

Ambiguity in Natural Language (1/2)
Look at the word 'plot' in the following examples
(a) The plot having rocks and boulders is not good.
(b) The plot having twists and turns is interesting.
'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'

Ambiguity in Natural Language (2/2)
• Lexical level
• Sentence level
• Structural differences between the source language (SL) and the target language (TL) in a Machine Translation system
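
Illustration: resolving 'plot' from context
The two 'plot' sentences in Ambiguity in Natural Language (1/2) differ only in the words around 'plot', so one simple way to pick a sense is to count which sense's typical neighbours appear in the sentence. The following is a minimal Python sketch under stated assumptions: the cue word lists and the name disambiguate_plot are hypothetical and hand-made for this example, and real sense disambiguation would need far richer knowledge.

```python
# Hypothetical, hand-picked cue words for the two senses of 'plot'.
SENSE_CUES = {
    "a piece of land": {"rocks", "boulders", "land", "soil", "fence"},
    "an outline of the events in a story": {"twists", "turns", "story", "novel", "film"},
}


def disambiguate_plot(sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present, i.e. the context does not help.
    """
    # Lowercase the words and strip simple punctuation before comparing.
    words = {w.strip(".,?!").lower() for w in sentence.split()}
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate_plot("The plot having rocks and boulders is not good."))
    print(disambiguate_plot("The plot having twists and turns is interesting."))
```

The sketch only uses the linguistic context; world knowledge and cultural knowledge are not modelled at all.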
Lexical ambiguity
Lexical ambiguity can be both for
• Content words – nouns, verbs etc
• Function words – prepositions, TAMs etc
Content words' ambiguity is of two types
• Homonymy
• Polysemy

Homonymy
A word has two or more unrelated senses
Example :
  I was walking on the bank (river bank)
  I deposited the money in the bank (money bank)

Polysemy
A word has two or more related senses
Example : English word 'issue', noun
1. The issue is under discussion (muddaa)
2. The latest issue of the journal is out (aNka)
3. He buys stamps on the day of the issue (vimocan)
4. The couple has no issue even after five years of marriage (saNtaan)

Information Flow and Ambiguity
1. He scratched a figure on the rock (engrave)
2. She scratched the figure on the rock (scrape)
• Other words in the context make a difference
• Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'

Function words can also pose problems (1/4)
Function words can also be ambiguous
For example – English preposition 'in'
(a) I met him in the garden
    maiN usase bagiice meiN milaa
(b) I met him in the morning
    maiN usase subaha 0 milaa
'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

Function words can also pose problems (2/4)
1. He bought a shirt with tiny collars.
   usane chote kaular vaalii kamiiz khariidii
   ‘he tiny collars with shirt bought’
   ‘with’ gets translated as ‘vaalii’ in Hindi
2. He washed a shirt with soap.
   usane saabun se kamiiz dhoii
   ‘he soap with shirt washed’
   ‘with’ gets translated as ‘se’

Function words can also pose problems (3/4)
TAM markers mark tense, aspect and modality
– Consist of inflections and/or auxiliary verbs in Hindi
– An important source of information
– Narrow down the meaning of a verb (e.g. lied, lay)

Function words can also pose problems (4/4)
English Simple Past vs Habitual
1a. He stayed in the guest house during his visit to our University in Jan (rahaa)
1b. He stayed in the guest house whenever he visited us (rahataa thaa)
2a. He went to the school just now (gayaa)
2b. He went to the school every day (jaataa thaa)

Sentence level ambiguity
I met the girl in the store
+ Possible readings
  a) I met the girl who works in the store
  b) I met the girl while I was in the store
Time flies like an arrow.
+ Possible parses:
  a) Time flies like an arrow (N V Prep Det N)
  b) Time flies like an arrow (N N V Det N)
  c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
  d) Time flies like an arrow (V N Prep Det N) (manner of timing)

Thus,
• Languages encode information differently
• Languages encode information only partially
• Tension between BREVITY and PRECISION
• Brevity wins, leading to inherent ambiguity at different levels

Human beings use
• World knowledge
• Context (both linguistic and extra-linguistic)
• Cultural knowledge and
• Language conventions
to resolve ambiguities
Can all this knowledge be provided to the machine ?
Computational Linguistics aims for this.

How to provide this knowledge ? (1/2)
• Analyse language at various levels (word, phrase, sentence etc)
• Build tools for analysing the natural language at various levels in a text
  – POS tagger (category marking)
  – Morphological analysers (analysis of a word)
  – Morphological generators (word generators)
  – Chunkers (shallow parsers)
  – Parsers (syntactic analysis)
  – Filters (markers for special expressions)
  – Sense Disambiguation Algorithms
  – Etc
• The tools need linguistic knowledge

How to provide this knowledge ? (2/2)
Build language resources
• Machine Readable Lexicon
• Rules for various levels of linguistic analysis
• Computational Grammars
• Mapping rules for the concerned language pair for an MT system
• Sense Disambiguation Rules
• Annotated corpora
• Etc

POS Tagger
What is a POS?
Take the following English sentence
  My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop .
Each word in the above sentence belongs to a word class (also called a Part Of Speech (POS))
The class to which a word belongs is based on its morphological and syntactic behavior
• Morphological
  – Kind of affixes a word takes, for example, boy, boys; girl, girls; book, books (noun class)
• Syntactic
  – How it is distributed in a sentence
    He chairs the next session (verb)
    The chairs are new (noun)

Why is POS relevant in CL/NLP ? (1/2)
• Word class information of a given word in a sentence helps to predict its neighbour
• Word sense disambiguation (WSD)
    He runs a mile every day (verb)
    Their team made 250 runs (noun)
    Time flies like an arrow (n v prep det n)
• Helps in further processing – chunking, morph pruning, sentence parsing
• Information retrieval (IR)

POS tagged sentence
    My         possessive pronoun
    old        adjective
    friend     noun
    Ram        proper noun
    recently   adverb
    bought     verb
    a          determiner
    book       noun
    on         preposition
    Indian     adjective
    snakes     noun
    for        preposition
    his        possessive pronoun
    cousin     noun
    from       preposition
    London     proper noun
    ,          punctuation
    from       preposition
    the        determiner
    new        adjective
    bookshop   noun

POS Tagging Approaches
• Rule Based
• Statistical
• Transformation Based

Rule Based POS Tagging
Two-stage architecture algorithms (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971)
• Stage 1: assign POS by referring to the dictionary
  E.g. dictionary entry for the English word 'that'
    that    Conj, Adv, Pronoun
• Stage 2: disambiguate, using manually crafted rules

Statistical
• Taggers use probabilities for tagging
• The tagger picks the most likely tag for a given word in a context
• HMM based algorithms are most commonly used for the POS tagging task
• Requires a manually tagged corpus

Annotating Corpus for POS
Annotated corpora are useful for developing statistical POS taggers
• Tagging scheme
  – Set of POS tags
  – Guidelines for the annotators
• The tagged corpora should be
  – High quality (in terms of tagging accuracy)
  – Consistent

POS Tags for English
• Penn Tree Bank – 45 tags
• C5 - Lancaster – 61 tags – used in CLAWS; basic tagset used for BNC
  http://view.byu.edu/bnc_tags.htm
• C7 – 147 tags – Leech
  http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

Penn Treebank Tags
    My    old   friend   Ram   recently   bought   a    book   on   Indian   snakes
    PP$   JJ    NN       NNP   RB         VBD      DT   NN     IN   JJ       NNS

    for   his   cousin   from   London   ,   from   the   new   bookshop   in   town
    IN    PP$   NN       IN     NNP      ,   IN     DT    JJ    NN         IN   NN

POS Tags for Indian Languages
Objective
  To arrive at a standard POS and Chunk tagging scheme for all Indian languages
Assumption
  Commonality in Indian Languages

Issues in Tag Set Design (1/2)
• Linguistic knowledge: coarse vs fine
• Syntactic function vs lexical category (for POS tags)
• New tags vs tags close to existing English tags
• Should be comprehensive/complete

Issues in Tag Set Design (2/2)
• Simple
• Less effort in manual tagging
• Number of tags
• Common for all Indian languages

Linguistic Knowledge : Fine vs Coarse (1/2)
Example
Only noun
  (NN) laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM
OR
Noun with gender, number, case information
  (NNM)   laDakA, laDake, laDakoM
  (NNMS)  laDakA, laDake
  (NNMP)  laDake, laDakoM
  (NNMSD) laDakA
  (NNMSO) laDake
  (NNMPD) laDake
  (NNMPO) laDakoM
The decision has implications for the size of corpora and machine learning

Linguistic Knowledge : Fine vs Coarse (2/2)
Alternatives
• Coarse - NN (advantages/disadvantages)
• Fine - NNMSD (advantages/disadvantages)
• Hierarchical
  Example: NN_m_sg_d
  A hierarchical tag set provides the possibility for underspecification
Considerations
• POS tagger is NOT a replacement for a morph analyzer
• Coarse analysis to begin with
• Expandable if needed
• If the information can be obtained from elsewhere, it need not be included in the POS tag

Syntactic function vs lexical category
Example
    harijana    bAlaka
    ‘harijan’   ‘child’
Decision : Lexical category
Helps achieve
• Consistency in annotation
• Better learning

New tags vs tags close to existing English tags
• New tags: Noun, Pron, Adj, Adv
• Familiar tags (Penn Treebank tags): NN, PRP, JJ, RB
Decision :
• Penn tags for common lexical types
• New tags for certain IL specific cases

Comprehensive/Complete
• All the lexical items occurring in a sentence should be marked for their POS, including punctuation.
• If the language has some special cases, these should also be captured
  – Reduplications in ILs

Simple
Why simple ?
• The tags are designed for some manual annotation
• Ease of learning
• Consistency in annotation

Less Effort in Manual Tagging
The annotators should not have to
• Write too much
• Take too many steps in annotating a lexical item

Number of Tags
Number of tags makes a difference both for the man and the machine
• For the man in decision making
• For the machine in learning for automatic tagging

Common for All Indian Languages
• Indian languages belong to various language families
• Share linguistic features
• However, there are differences
  – Some languages have quotatives, some don't
  – Some have classifiers, some don't

Chunking
What forms a chunk ?
• Non-recursive phrase ((det adj noun))
• Partial structure without distorting the dependencies
• Include inflections (postposition/auxiliaries) with a lexical category
Example :
  ((mere choTe bhaaii ne))_NP ((jaa rahaa hai))_VG

Chunker
A chunker automatically groups words in a sentence as chunks and labels them
  ((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.

IL Chunk Tags (1/2)
    NP      noun chunk             bahut acchii kitaab
    JJP     adjective chunk        bahut sundar sii
    RBP     adverb chunk           dhiire – dhiire
    NEGP    chunk for negatives    nahiiN
    CCP     conjunct chunks        raam aur shyaam
    BLK     miscellaneous          interjections etc

IL Chunk Tags (2/2)
    VGF     Finite verb chunk                     jaa rahaa hai
    VGNF    Non finite verb chunk                 jaate hue
    VGINF   Infinitive verb chunk                 jaanaa
    VGNN    Gerunds                               jaanaa
    FRAGP   Discontiguous fragments of a chunk    raam (meraa bhaaii) ne

Some Issues
How to chunk the following ?
• Adverbs
  – Within a verb chunk or separately
    E.g. ((recently bought)) or ((recently)) ((bought))
• Punctuation
• Particles – hii (only), to, bhii (also) etc
Current approach
• Punctuation – chunk it with the preceding chunk
• Adverbs – chunk them separately
• Particles – chunk them with the chunk to which they belong
    ((raam ne bhii)) ((jaa hii rahaa thaa))

Issues
• Verb Negation
1. nahiiN jaa rahaa ‘not going’
2. kahaa hii nahiiN ‘just did not mention’
3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)
4. binaa yaha baata kahe ‘without saying this’
5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai ‘Not only this, in fact, this is also found in writing’
Current approach
• For cases 1 to 3, chunk NEG with the verb group
• For 4, chunk the NEG separately in a chunk
• For 5, also a separate NEGP chunk will work
NOUN NEGATION ???

Chunking Co-ordinate Constructions
1. word1 CC word2
   raam aur shyaam
   ((raam))_NP ((aur))_CCP ((shyaam))_NP
2. phrase CC phrase
   meraa bhaaii shyaam aur tumhaaraa bhaaii mohan
   ((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP
3. clause CC clause

Discontiguous Phrases
What about cases such as 'X (Y) Z' ?
where X = noun, Y = a phrase, Z = postposition
  raam (meraa dillii vaalaa bhaaii) ne
OR
  isa 'upanyaas – samraaT' shabda kaa
FRAGP

Chunking Conjunct Verbs
Conjunct verbs
• A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')
• Should the conjunct verbs be tagged as a single chunk or two chunks?
  'prawIkSA karanA' ‘to wait’, 'kSamA karanA' ‘to forgive’ etc

What about genitives ?
  raam kaa betaa 'son of Ram'
  usakaa betaa 'his/her son'
  mere bhaaii raam kaa betaa 'my brother Ram's son'
  iske pahale 'before this'
  mez ke uupar 'above/on the table'
  ravi ke saath 'with Ravi'

Chunking Numbers/Quantifiers (1/2)
Numerals, quantifiers may occur as follows
a) ek laDakaa 'one boy'
b) 1 laDakaa '1 boy'
c) pahalaa laDakaa 'first boy'
d) karoDoN log 'crores (tens of millions) of people'
e) 1962 meiN 'in 1962'

Chunking Numbers/Quantifiers (2/2)
• The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in the IL POS tagset
• Examples (d) and (e) in the previous slide show cases where the quantifier is behaving like a noun
• The issue : Should the quantifiers in cases such as (d) and (e) be tagged as a Q* or as NN since the chunk itself is a noun chunk ?

Summary
• For annotating POS and chunks, a scheme needs to be designed
• While doing so, the following issues need to be considered
  – Definition of 'chunk'
  – Elements which together can form a chunk type
  – Whether to include postpositions, punctuation etc inside a chunk or treat them as independent chunks
  – POS/Chunk tag labels

Approaches in Computational Linguistics (for Tools)
Two major approaches
Rule based
• Requires manually crafted rules
• Explicit linguistic knowledge
• Needs manual time and effort
• Trained manpower
• High precision
• Less robust

Approaches in Computational Linguistics (for Tools)
Data driven approach
• Uses statistical methods or machine learning
• Requires less human effort
• Often requires large scale data sources (manually annotated corpora, lexicons etc)
• Linguistic knowledge is implicit
• More adaptive to noisy text
• More robust

Computational Linguistics Application Areas
Is useful for communication between
• Man – machine
  – Question answering systems, interactive railway reservation
  – Text summarization
  – Web applications
    Intelligent search engines
    Cross lingual search
• Man – man
  – Machine translation
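
Illustration: rule based vs data driven POS tagging
The rule based and data driven approaches described in the preceding slides can be contrasted on the POS tagging task. The following is a minimal Python sketch under stated assumptions: LEXICON, the single disambiguation rule, the toy CORPUS and all function names are hypothetical. The data driven part is a most-frequent-tag baseline that ignores context, a deliberate simplification and not the HMM based taggers mentioned in the slides.

```python
from collections import Counter, defaultdict

# Stage 1 resource: toy dictionary of possible tags per word (hypothetical entries).
LEXICON = {
    "he": ["PRP"], "the": ["DT"], "next": ["JJ"], "session": ["NN"],
    "are": ["VBP"], "new": ["JJ"], "chairs": ["NNS", "VBZ"],
}


def rule_based_tag(words):
    """Rule based: dictionary lookup, then one hand-written disambiguation rule."""
    tags = []
    for word in words:
        options = LEXICON.get(word.lower(), ["NN"])  # unknown words default to NN
        if len(options) == 1:
            tag = options[0]
        else:
            # Hypothetical rule: noun reading after a determiner, verb reading otherwise.
            wanted = "N" if tags and tags[-1] == "DT" else "V"
            tag = next((t for t in options if t.startswith(wanted)), options[0])
        tags.append(tag)
    return list(zip(words, tags))


def train_most_frequent_tag(tagged_sentences):
    """Data driven: learn each word's most frequent tag from an annotated corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def most_frequent_tag(model, words, default="NN"):
    """Tag each word with its learned tag; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy annotated corpus (hypothetical), in the spirit of the 'runs' examples.
CORPUS = [
    [("He", "PRP"), ("runs", "VBZ"), ("a", "DT"), ("mile", "NN")],
    [("She", "PRP"), ("runs", "VBZ"), ("daily", "RB")],
    [("Their", "PRP$"), ("team", "NN"), ("made", "VBD"), ("250", "CD"), ("runs", "NNS")],
]

if __name__ == "__main__":
    print(rule_based_tag("He chairs the next session".split()))
    print(rule_based_tag("The chairs are new".split()))
    model = train_most_frequent_tag(CORPUS)
    print(most_frequent_tag(model, "He made 250 runs".split()))
```

In this sketch the hand-written rule tags 'chairs' correctly in both sentences but covers nothing outside its small lexicon. The learned baseline needs no rules, yet it tags 'runs' as a verb in "He made 250 runs" because it never looks at the neighbouring words, which is the gap that context-sensitive statistical taggers are meant to close.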