Lecture 2 Regular Expressions and Automata in Language Analysis CS 4705

Statistical vs. Symbolic (Knowledge Rich)
Techniques
• How much linguistic knowledge do our
representations and algorithms need to have to do
‘successful’ NLP?
– Bill hit John.
– John, Bill hit.
• 80/20 Rule: when do we need to worry about the
other 20%?
Today
• Review some of the simple representations and
ask ourselves how we might use them to do
interesting and useful things
– Regular Expressions
– Finite State Automata
• Think about the limits of these simple approaches:
when do we need more?
Uses of Regular Expressions in NLP
• Simple but powerful tools for large corpus
analysis -- ‘shallow’ processing
– What word is most likely to begin a sentence?
– What word is most likely to begin a question?
– How often do people end sentences with prepositions?
• With other simple statistical tools, allow us to
– Obtain word frequency and co-occurrence statistics
– Build simple interactive applications (e.g. Eliza)
– Authorship: Who wrote Shakespeare’s plays? The
Federalist papers? The Unabomber letters?
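The corpus questions above can be approximated with a few regular expressions. The following is a minimal sketch on a made-up toy corpus; the sentence splitter, the `first_word` helper and the short preposition list are illustrative assumptions, not part of the lecture.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real study would read a large text file.
corpus = (
    "The cat sat on the mat. What did the dog look at? "
    "The dog barked. Who is this for? The end."
)

# Naive sentence splitter: break after '.', '?' or '!' followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.?!])\s+", corpus) if s]

def first_word(sentence):
    """Return the lower-cased first alphabetic token, or None if there is none."""
    m = re.match(r"[A-Za-z']+", sentence)
    return m.group(0).lower() if m else None

# What word most often begins a sentence? A question?
starts = Counter(first_word(s) for s in sentences)
question_starts = Counter(first_word(s) for s in sentences if s.endswith("?"))

# Small illustrative preposition list (an assumption, far from exhaustive).
PREPOSITIONS = r"(?:at|for|in|on|to|with)"
ends_with_prep = [
    s for s in sentences if re.search(r"\b" + PREPOSITIONS + r"[.?!]$", s)
]

print(starts.most_common(1))
print(sorted(question_starts))
print(len(ends_with_prep), "of", len(sentences), "sentences end in a preposition")
```

On a real corpus the splitter would need to handle abbreviations such as "Dr." and quoted speech, which is exactly where this kind of shallow processing starts to strain.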
Review
RE            Matches              Possible use
/./           Any character        A non-blank line
/\./, /\?/    A ‘.’, a ‘?’         A statement, a question
/[bckmsr]/    Any of these chars   Rhyme: /[bckmrs]ite/
/[a-z]/       Any l.c. letter      Rhyme: /[a-z]ite/
/[A-Z]/       Any u.c. letter      /[A-Z][a-z]*/
/[^A-Z]/      Any non-u.c. char    /[^A-Z][a-z]*/
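The patterns in this review table can be tried directly in Python's `re` module. The sketch below uses made-up word lists and sentences; wrapping the rhyme patterns in `re.fullmatch` (so that they must cover the whole word) is a choice made here, not something the table specifies.

```python
import re

# /./ matches any character, so a search succeeds on every non-blank line.
lines = ["hello", "", "x"]
non_blank = [ln for ln in lines if re.search(r".", ln)]

# /\./ and /\?/ match a literal period or question mark.
is_statement = bool(re.search(r"\.$", "It rained."))
is_question = bool(re.search(r"\?$", "Did it rain?"))

words = ["bite", "kite", "lite", "site", "white", "Rite"]
# /[bckmrs]ite/: one letter from the listed set, then "ite".
narrow = [w for w in words if re.fullmatch(r"[bckmrs]ite", w)]
# /[a-z]ite/: any single lower-case letter, then "ite".
broad = [w for w in words if re.fullmatch(r"[a-z]ite", w)]

# /[A-Z][a-z]*/: an upper-case letter followed by lower-case letters.
capitalized = re.findall(r"\b[A-Z][a-z]*\b", "Bill hit John and ran")

# /[^A-Z]/: any character that is not an upper-case letter.
non_upper = re.findall(r"[^A-Z]", "aB1")

print(non_blank, narrow, broad, capitalized, non_upper)
```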
RE            Description                            Uses?
/a*/          Zero or more a’s                       /(very[ ])*/
/a+/          One or more a’s                        /(very[ ])+/
/a?/          Optional single a                      /(very[ ])?/
/cat|dog/     ‘cat’ or ‘dog’                         /[a-z]* (cat|dog)/
/^[Nn]o$/     A line with only ‘No’ or ‘no’ in it
/\bun\B/      Prefixes                               Words prefixed by ‘un’ (nb. union)
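A short sketch of how the quantifiers, alternation, anchors and word-boundary operators in this table behave in Python's `re` module. The test strings and the `matches` helper are invented for illustration; appending "good" to the "very" patterns is also an assumption, made only so the three quantifiers can be compared on the same inputs.

```python
import re

def matches(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

inputs = ("good", "very good", "very very good")
star = [matches(r"(very[ ])*good", s) for s in inputs]   # zero or more
plus = [matches(r"(very[ ])+good", s) for s in inputs]   # one or more
opt = [matches(r"(very[ ])?good", s) for s in inputs]    # zero or one

# /[a-z]* (cat|dog)/: a (possibly empty) lower-case word, a space, then cat or dog.
pets = [m.group(0) for m in re.finditer(r"[a-z]* (cat|dog)", "a black cat and one dog")]

# /^[Nn]o$/ with MULTILINE: lines consisting only of "No" or "no".
only_no = re.findall(r"^[Nn]o$", "No\nnot now\nno\nNo way", flags=re.MULTILINE)

# /\bun\B/: "un" at the start of a word and followed by more word characters.
# As the table warns, this also catches "union", where "un" is not a prefix.
un_words = re.findall(r"\bun\B\w*", "unhappy union sun un undo")

print(star, plus, opt, pets, only_no, un_words)
```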
RE plus               E.G.
/kitt(y|ies)/         Morphological variants of ‘kitty’ -- but
/ (.+ier) and \1 /    Patterns: happier and happier, fuzzier and fuzzier, classifier and classifier
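Grouping and the back-reference `\1` carry over unchanged to Python. The sketch below runs both patterns on invented inputs; it also shows that the second pattern accepts "classifier and classifier", which is not a comparative.

```python
import re

# /kitt(y|ies)/: the group lists the two alternative endings.
kitty = re.compile(r"kitt(y|ies)")
forms = [w for w in ("kitty", "kitties", "kitten", "kitt") if kitty.fullmatch(w)]

# / (.+ier) and \1 /: \1 must repeat exactly what group 1 matched.
repeat = re.compile(r" (.+ier) and \1 ")
text = " it grew happier and happier while the classifier and classifier ran "
hits = [m.group(1) for m in repeat.finditer(text)]

# A mismatched pair is rejected because \1 is not the same string.
mismatch = repeat.search(" happier and fuzzier ")

print(forms, hits, mismatch)
```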
Substitutions (Transductions)
• E.g. unix sed or ‘s’ operator in Perl
– s/regexp1/pattern/
– s/I am feeling (.+)/You are feeling \1 ?/
– s/I gave (.+) to (.+)/Why would you give \2 \1 ?/
– s/You are (.+)[.]*/Why would you say that I am \1?/
– s/([1]?[0-9]) o’clock ([AaPp][. ]*[Mm][. ]*)/\1:00 \2/
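The substitutions above translate almost directly into Python's `re.sub`, which uses the same `\1` and `\2` syntax in the replacement string. This is a minimal Eliza-style sketch: the `RULES` list and the `respond` function are hypothetical names, and applying only the first matching rule is an assumption made here.

```python
import re

# (pattern, replacement) pairs copied from the slide; \1 and \2 are back-references.
# The apostrophe in "o'clock" is written as a plain ASCII quote here.
RULES = [
    (r"I am feeling (.+)", r"You are feeling \1 ?"),
    (r"I gave (.+) to (.+)", r"Why would you give \2 \1 ?"),
    (r"You are (.+)[.]*", r"Why would you say that I am \1?"),
    (r"([1]?[0-9]) o'clock ([AaPp][. ]*[Mm][. ]*)", r"\1:00 \2"),
]

def respond(utterance):
    """Apply the first rule whose pattern occurs in the utterance."""
    for pattern, replacement in RULES:
        if re.search(pattern, utterance):
            return re.sub(pattern, replacement, utterance)
    return utterance  # no rule applies: return the input unchanged

print(respond("I am feeling sad"))
print(respond("I gave flowers to Mary"))
print(respond("You are rude"))
print(respond("Call at 7 o'clock PM"))
```

Because `(.+)` is greedy, the trailing `[.]*` in the third rule never gets to consume a final period; the period ends up inside `\1` instead.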
• How would you convert to 24-hour clock?
– s/[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/
<guess>
Examples
• Predictions from a news corpus:
– Which candidate for President is mentioned most often
in the news? Is going to win?
– What stock should you buy?
– Which White House advisers have the most power?
• Language use:
– Which form of comparative is more frequent: ‘Xer’ or
‘more X’?
– Which pronouns occur most often in subject position?
– How often do sentences end with infinitival ‘to’?
– What words most often begin and end sentences?
– What are the 20 most common words in your email? In
the news? In Shakespeare’s plays?
• Emotional language:
– What words indicate what emotions?
• Happiness
• Anger
• Confidence
• Despair
– How can we identify emotions automatically?
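Two of the language-use questions above (most common words, ‘Xer’ versus ‘more X’) can be sketched with a tokenizing regular expression and a counter. The text is a made-up example, and both comparative patterns are deliberately crude assumptions: the ‘er’ pattern would also match non-comparatives such as "other" or "water".

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a news corpus.
text = (
    "The market is more stable today. The smaller firms grew faster, "
    "and the outlook is more hopeful than the forecast."
)
lowered = text.lower()

# Tokenize into lower-case words and count them.
tokens = re.findall(r"[a-z']+", lowered)
top_words = Counter(tokens).most_common(3)

# Crude comparative detectors: any word ending in "er", and "more" plus a word.
er_forms = re.findall(r"\b[a-z]+er\b", lowered)
more_forms = re.findall(r"\bmore [a-z]+\b", lowered)

print(top_words)
print(er_forms, more_forms)
```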
Finite State Automata
• FSAs recognize the regular languages represented
by regular expressions
– SheepTalk: /baa+!/
[Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-labeled self-loop on q3]
• Directed graph with labeled nodes and arc transitions
• Five states: q0 the start state, q4 the final state, 5
transitions
Formally
• FSA is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of symbols {a,b,!}
– q0: a start state in Q
– F: a set of final states in Q {q4}
– δ(q,i): a transition function mapping Q x Σ to Q
[Figure: the SheepTalk FSA, states q0 to q4 with arcs labeled b, a, a, a, !]
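The 5-tuple can be written down as a small data structure. This is a sketch only: the `FSA` class, its field names and the `is_well_formed` check are hypothetical, and δ is stored as a partial mapping (a missing key means "no transition") rather than a total function on Q x Σ.

```python
from dataclasses import dataclass

@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: str           # q0
    finals: frozenset    # F
    delta: dict          # (state, symbol) -> state; a missing key means no transition

# The SheepTalk machine for /baa+!/ with its five transitions.
SHEEP = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset("ab!"),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",  # self-loop: any number of further a's
        ("q3", "!"): "q4",
    },
)

def is_well_formed(m):
    """Check that every component only mentions declared states and symbols."""
    return (
        m.start in m.states
        and m.finals <= m.states
        and all(
            q in m.states and s in m.alphabet and r in m.states
            for (q, s), r in m.delta.items()
        )
    )

print(is_well_formed(SHEEP), len(SHEEP.delta))
```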
• FSA recognizes (accepts) strings of a regular
language
– baa!
– baaa!
– baaaa!
– …
• Tape metaphor: will this input be accepted?
[Tape: a b a ! b]
State Transition Table for SheepTalk
        Input
State   b   a   !
0       1   0   0
1       0   2   0
2       0   3   0
3       0   3   4
4       0   0   0
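A table like this can drive a deterministic recognizer directly: start in state 0, look up the next state for each tape symbol, and accept only if the tape ends in the final state. The sketch below assumes that the 0 entries in the table (other than the row label) mean "no legal transition" and stores them as `None`; the `TABLE`, `FINAL` and `recognize` names are hypothetical.

```python
# Rows are states 0 to 4; columns are the input symbols b, a and !.
# Assumption: a 0 entry in the slide's table marks "no legal transition",
# so it is stored here as None.
TABLE = {
    0: {"b": 1, "a": None, "!": None},
    1: {"b": None, "a": 2, "!": None},
    2: {"b": None, "a": 3, "!": None},
    3: {"b": None, "a": 3, "!": 4},
    4: {"b": None, "a": None, "!": None},
}
FINAL = {4}

def recognize(tape):
    """Return True if the whole tape is accepted by the SheepTalk table."""
    state = 0
    for symbol in tape:
        next_state = TABLE[state].get(symbol)  # None for illegal or unknown symbols
        if next_state is None:
            return False  # no arc to follow: reject immediately
        state = next_state
    return state in FINAL  # accept only if we stop in a final state

for tape in ("baa!", "baaaa!", "ba!", "aba!b"):
    print(tape, recognize(tape))
```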
Non-Deterministic FSAs for SheepTalk
[Figure: two non-deterministic FSAs for SheepTalk, each over states q0 to q4 with arcs labeled b, a and !]
Problems of Non-Determinism
• At any choice point, we may follow the wrong arc
• Potential solutions:
– Save backup states at each choice point
– Look-ahead in the input before making choice
– Pursue alternatives in parallel
– Determinize our NFSAs (and then minimize)
• FSAs can be useful tools for recognizing – and
generating – subsets of natural language
– But they cannot represent all NL phenomena (center
embedding: The mouse the cat chased died.)
– Simple vs. linguistically rich representations….
– How do we decide what we need?
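Of the potential solutions to non-determinism listed above, "pursue alternatives in parallel" is the easiest to sketch: track the set of all states the machine could be in after each symbol. The NFSA below is an assumed version of SheepTalk in which q2 has both an a self-loop and an a arc to q3; the exact arcs in the slide's figure may differ, and the `NFSA`, `START`, `FINALS` and `accepts` names are hypothetical.

```python
# Assumed non-deterministic SheepTalk machine: on 'a', q2 may stay or move on.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},  # the choice point
    ("q3", "!"): {"q4"},
}
START, FINALS = "q0", {"q4"}

def accepts(tape):
    """Simulate every alternative at once by tracking a set of current states."""
    current = {START}
    for symbol in tape:
        # Union of all states reachable from any current state on this symbol.
        current = set().union(*(NFSA.get((q, symbol), set()) for q in current))
        if not current:
            return False  # every alternative has died
    return bool(current & FINALS)

for tape in ("baa!", "baaaa!", "ba!", "baa"):
    print(tape, accepts(tape))
```

Because no single path is committed to, the recognizer never follows a wrong arc and never needs to back up; the cost is bookkeeping for a set of states instead of one.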
FSAs as Grammars for Natural Language
[Figure: FSA over states q0 to q6 with arcs labeled the, rev, hon, dr, mr, ms, mrs, pat, l. and robinson]
• If we want to extract all the proper names in the
news, will this work?
– What will it miss?
– Will it accept something that is not a proper name?
– How would you change it to accept all proper names
without false positives?
– Precision vs. recall….
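The precision and recall questions above can be made concrete on a toy example. The pattern below is not the FSA from the figure: it is a simplified, hypothetical extractor that requires one of the figure's titles followed by capitalized words, and the text and gold labels are invented.

```python
import re

# Hypothetical extractor: a title, an optional first name, an optional initial,
# then a capitalized surname.
TITLES = r"\b(?:Rev|Hon|Dr|Mrs|Mr|Ms)\.?"
PATTERN = re.compile(TITLES + r" (?:[A-Z][a-z]+ )?(?:[A-Z]\. )?[A-Z][a-z]+")

text = (
    "Dr. Pat L. Robinson met Ms Robinson in Paris. "
    "Later Kim Lee called Mr. Big Apple Tours."
)
# Hand-labelled person names for this toy text (an assumption for illustration).
gold = {"Dr. Pat L. Robinson", "Ms Robinson", "Kim Lee"}

found = {m.group(0) for m in PATTERN.finditer(text)}
true_positives = found & gold

precision = len(true_positives) / len(found)  # how much of what we found is right
recall = len(true_positives) / len(gold)      # how much of what is right we found

print(sorted(found))
print(round(precision, 2), round(recall, 2))
```

Here the extractor misses "Kim Lee" (no title, so recall suffers) and wrongly accepts "Mr. Big Apple" (a company name, so precision suffers), which is the trade-off the slide asks about.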
Summing Up
• Regular expressions and FSAs can represent subsets of
natural language as well as regular languages
– Both representations may be impossible for humans to understand
for any real subset of a language
– But they are relatively easy to use for small subsets
– Can be hard to scale up: when many choices at any point (e.g.
surnames)
• Next time: Read Ch 3