Lecture 2
Regular Expressions and Automata in Language Analysis
CS 4705

Statistical vs. Symbolic (Knowledge Rich) Techniques
• How much linguistic knowledge do our representations and algorithms need to have to do ‘successful’ NLP?
  – Bill hit John.
  – John, Bill hit.
• 80/20 Rule: when do we need to worry about the other 20%?

Today
• Review some of the simple representations and ask ourselves how we might use them to do interesting and useful things
  – Regular Expressions
  – Finite State Automata
• Think about the limits of these simple approaches: when do we need more?

Uses of Regular Expressions in NLP
• Simple but powerful tools for large corpus analysis -- ‘shallow’ processing
  – What word is most likely to begin a sentence?
  – What word is most likely to begin a question?
  – How often do people end sentences with prepositions?
• With other simple statistical tools, allow us to
  – Obtain word frequency and co-occurrence statistics
  – Build simple interactive applications (e.g. Eliza)
  – Authorship: Who wrote Shakespeare’s plays? The Federalist papers? The Unabomber letters?

Review
RE              Matches               Possible use
/./             Any character         A non-blank line
/\./, /\?/      A ‘.’, a ‘?’          A statement, a question
/[bckmsr]/      Any of these chars    Rhyme: /[bckmrs]ite/
/[a-z]/         Any l.c. letter       Rhyme: /[a-z]ite/
/[A-Z]/         Any u.c. letter       /[A-Z][a-z]*/
/[^A-Z]/        Any non-u.c. char     /[^A-Z][a-z]*/

RE              Description                            Uses?
/a*/            Zero or more a’s                       /(very[ ])*/
/a+/            One or more a’s                        /(very[ ])+/
/a?/            Optional single a                      /(very[ ])?/
/cat|dog/       ‘cat’ or ‘dog’                         /[a-z]* (cat|dog)/
/^[Nn]o$/       A line with only ‘No’ or ‘no’ in it
/\bun\B/        Prefixes                               Words prefixed by ‘un’ (nb. union)

RE plus                 E.g.
/kitt(y|ies)/           Morphological variants of ‘kitty’ -- but
/ (.+ier) and \1 /      Patterns: happier and happier, fuzzier and fuzzier, classifier and classifier

Substitutions (Transductions)
• E.g.
unix sed or the ‘s’ operator in Perl
  – s/regexp1/pattern/
  – s/I am feeling (.+)/You are feeling \1 ?/
  – s/I gave (.+) to (.+)/Why would you give \2 \1 ?/
  – s/You are (.+)[.]*/Why would you say that I am \1?/
  – s/([1]?[0-9]) o’clock ([AaPp][. ]*[Mm][. ]*)/\1:00 \2/
• How would you convert to 24-hour clock?
  – s/[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/ <guess>

Examples
• Predictions from a news corpus:
  – Which candidate for President is mentioned most often in the news? Is going to win?
  – What stock should you buy?
  – Which White House advisers have the most power?
• Language use:
  – Which form of comparative is more frequent: ‘Xer’ or ‘more X’?
  – Which pronouns occur most often in subject position?
  – How often do sentences end with infinitival ‘to’?
  – What words most often begin and end sentences?
  – What are the 20 most common words in your email? In the news? In Shakespeare’s plays?
• Emotional language:
  – What words indicate what emotions?
    • Happiness
    • Anger
    • Confidence
    • Despair
  – How can we identify emotions automatically?

Finite State Automata
• FSAs recognize the regular languages represented by regular expressions
  – SheepTalk: /baa+!/
[Figure: q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4, with an a-loop on q3]
• Directed graph with labeled nodes and arc transitions
• Five states: q0 the start state, q4 the final state, 5 transitions

Formally
• FSA is a 5-tuple consisting of
  – Q: set of states {q0,q1,q2,q3,q4}
  – Σ: an alphabet of symbols {a,b,!}
  – q0: a start state in Q
  – F: a set of final states in Q {q4}
  – δ(q,i): a transition function mapping Q x Σ to Q
[Figure: the SheepTalk FSA, repeated]
• FSA recognizes (accepts) strings of a regular language
  – baa!
  – baaa!
  – baaaa!
  – …
• Tape metaphor: will this input be accepted?
  a b a ! b

State Transition Table for SheepTalk
         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅
(∅ marks a missing transition: the machine rejects.)

Non-Deterministic FSAs for SheepTalk
[Figure: two non-deterministic FSAs for SheepTalk, each over states q0 to q4]
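One way to run a non-deterministic FSA without ever committing to a wrong arc is to keep the whole set of states the machine could be in after each input symbol. The sketch below does that in Python. The arc set is an assumption, since the slide's diagrams are not recoverable: it uses the common textbook variant in which state 2 has two arcs on ‘a’. The names NFSA and nfsa_accepts are hypothetical and chosen only for illustration.

```python
# Hypothetical NFSA for SheepTalk (/baa+!/). The arc set is an assumption:
# state 2 has two outgoing arcs on 'a', which is the choice point.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # either loop on state 2 or move on to state 3
    (3, "!"): {4},
}
START = 0
FINAL = {4}


def nfsa_accepts(tape: str) -> bool:
    """Return True if some path through the NFSA accepts the whole tape."""
    current = {START}  # every state the machine could be in right now
    for symbol in tape:
        following = set()
        for state in current:
            following |= NFSA.get((state, symbol), set())
        if not following:  # every alternative has died out
            return False
        current = following
    return bool(current & FINAL)


if __name__ == "__main__":
    for tape in ["baa!", "baaaa!", "ba!", "baa", "aba!b"]:
        print(tape, nfsa_accepts(tape))
```

Because all alternatives are carried forward together, no backup states or look-ahead are needed; the cost is that each step may touch several states instead of one.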
Problems of Non-Determinism
• At any choice point, we may follow the wrong arc
• Potential solutions:
  – Save backup states at each choice point
  – Look-ahead in the input before making choice
  – Pursue alternatives in parallel
  – Determinize our NFSAs (and then minimize)
• FSAs can be useful tools for recognizing (and generating) subsets of natural language
  – But they cannot represent all NL phenomena (center embedding: The mouse the cat chased died.)
  – Simple vs. linguistically rich representations….
  – How do we decide what we need?

FSAs as Grammars for Natural Language
[Figure: FSA over states q0 to q6 with arcs labeled the, rev, hon, dr, mr, ms, mrs, pat, l., robinson]
• If we want to extract all the proper names in the news, will this work?
  – What will it miss?
  – Will it accept something that is not a proper name?
  – How would you change it to accept all proper names without false positives?
  – Precision vs. recall….

Summing Up
• Regular expressions and FSAs can represent subsets of natural language as well as regular languages
  – Both representations may be impossible for humans to understand for any real subset of a language
  – But they are relatively easy to use for small subsets
  – Can be hard to scale up: when many choices at any point (e.g. surnames)
• Next time: Read Ch 3
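As a recap of the claim that regular expressions and FSAs describe the same languages, here is a minimal Python sketch that runs the SheepTalk transition table as a table-driven recognizer and compares it with the regular expression /baa+!/. It assumes that every table cell other than the arcs b: 0→1, a: 1→2, a: 2→3, a: 3→3 and !: 3→4 means "no transition". The names TABLE, fsa_accepts and regex_accepts are hypothetical.

```python
import re
from itertools import product

# SheepTalk transition table; a missing entry is read as "no transition"
# (an assumption about how the table's empty cells should be interpreted).
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
FINAL = {4}
SHEEP_RE = re.compile(r"baa+!")


def fsa_accepts(tape: str) -> bool:
    """Walk the table one symbol at a time, rejecting on a missing entry."""
    state = 0
    for symbol in tape:
        state = TABLE[state].get(symbol)
        if state is None:
            return False
    return state in FINAL


def regex_accepts(tape: str) -> bool:
    """Accept only if the whole tape matches the regular expression."""
    return SHEEP_RE.fullmatch(tape) is not None


if __name__ == "__main__":
    # Compare both recognizers on every string over {a, b, !} up to length 6.
    disagreements = [
        "".join(chars)
        for length in range(7)
        for chars in product("ab!", repeat=length)
        if fsa_accepts("".join(chars)) != regex_accepts("".join(chars))
    ]
    print("disagreements:", disagreements)
```

The exhaustive comparison only covers short strings, so it is a sanity check rather than a proof of equivalence.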