Introduction to Syntactic Analysis after John Bryant

What Is Natural Language?
• Form
– Written
– Sound
– Motion (Sign Language)
• Bridged to Meaning
– Factual meaning (what the form literally asserts)
– Pragmatic meaning (what the speaker wanted the hearer
to know).
• In Some Context
– Shared world knowledge
– Common situation
– Shared knowledge of the discourse
What is Syntax?
• The Way Words Are Put Together
– For example, Determiners come before Nouns in English.
• Constituency
– How words group together to behave as a single unit.
• Grammatical Relations
– E.g. What word functions as the subject of the sentence.
• Subcategorization and dependency
– How particular words constrain the sentence.
Why Are We Interested In It?
Beyond the scientific interest in the structure of language,
syntax is important because it tells us (along with the
words) what the sentence means. Syntactic modification
is indicative of semantic modification.
Modeling Syntax
The standard approach for modeling syntax is to treat
natural language as a formal language.
Using Formal Languages for
Describing Syntax
• Problematic
– Different ways of specifying formal languages have different
levels of expressive power.
– Much care must be taken to choose a mechanism that is
expressive enough, but not too expressive.
• But Necessary
– Knowledge of a process like language must be formalized
for computational methods to be effective.
• Which type of formal language is the right one?
What Is a Formal
Language?
• A (possibly infinite) set of strings
– String here means a sequence of words or symbols.
– Mary had a little lamb could be a string in some set.
• Defined by a set of rules
– The rules are a compact way of representing which
strings belong to the set.
– They provide a strict mathematical definition of which
strings are in the set, and which are not.
– They are called the grammar of the language.
– Allowing these rules to be more complex lets us define
more complex sets of strings.
More Precisely…
• A finite set of terminals
– Terminals are the atomic symbols in our language (the words).
• A finite set of nonterminals
– A nonterminal is a special symbol that refers to a chunk of
terminals and nonterminals. (a.k.a. a constituent)
– Nonterminals are the syntactic categories of the language.
• A set of rules
– For defining how the symbols can be grouped/ordered
• A designated start symbol
– This is the symbol from which rule application must originate.
An Example Language
• Terminals: {b}
• Nonterminals: {S}
• A designated start symbol: {S}
• A set of rules: {S → bS; S → b}
– The rules are read “S goes to bS” or “S goes to b”
– Can be interpreted in both directions, either as saying S can
be rewritten into a bS or that bS can be reduced to an S.
• This grammar generates every string that consists of
one or more b’s and nothing else.
Things We Can Do With a
Formal Language
• Determine if a particular string is in the language.
– By trying to derive it. Deriving a string just means finding a
mapping, via the grammar rules, between the start symbol and the
string. It is also called parsing.
• Generating all the strings in the language
– Trying every possible rule combination from the start symbol
allows us to check that we only allow the “good” strings.
• Compare it to other formal languages
– Different ways of defining the rules lead to different amounts of
expressive power.
Deriving the string ‘bbb’
going top down.
1) S → bS
2) S → b
S is the designated start symbol.
• From S
– Rule 2 gives b. Using rule 2 here is the wrong move because there are no more nonterminals to rewrite and we have not derived ‘bbb’.
– Rule 1 gives bS. So instead we use rule 1, which at least gives us a nonterminal to expand.
• From bS
– Rule 2 gives bb.
– Rule 1 gives bbS.
• From bbS
– Rule 2 gives bbb. Using rule 2 here is right because we will have matched the desired string, and don’t have any more nonterminals to deal with.
Top Down Parsing as Search
• The initial state is the designated start symbol
• The states are combinations of terminals and
nonterminals derivable from S
• The operators are the grammar rules.
– Any chunk of a state that matches the left hand side of a
rule can be replaced by the right hand side of that rule.
• The goal state is the input string without extra
nonterminals.
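The search formulation above can be sketched in a few lines of Python. This is only an illustration for the example language S → bS, S → b. The function name top_down_derive, the breadth-first strategy and the length-based pruning are assumptions made here, not part of the original slides.

```python
from collections import deque

# Rules of the example language: S -> bS and S -> b.
RULES = [("S", "bS"), ("S", "b")]

def top_down_derive(target, start="S"):
    """Breadth-first search from the start symbol.

    Returns the list of states on a path to `target`,
    or None if no derivation exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        state = path[-1]
        if state == target:          # goal: the input string, no nonterminals left
            return path
        for lhs, rhs in RULES:
            # Rewrite the leftmost occurrence of the rule's left hand side.
            i = state.find(lhs)
            if i == -1:
                continue
            new_state = state[:i] + rhs + state[i + len(lhs):]
            # Prune states longer than the target: no rule in this
            # grammar ever shortens a state.
            if len(new_state) <= len(target) and new_state not in seen:
                seen.add(new_state)
                queue.append(path + [new_state])
    return None

print(top_down_derive("bbb"))   # ['S', 'bS', 'bbS', 'bbb']
print(top_down_derive("ba"))    # None
```

The `seen` set is what keeps the search from revisiting states, and the pruning step only works because this particular grammar has no rules that shrink a state.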
Deriving the string ‘bbb’
going bottom up.
1) S → bS
2) S → b
[Figure: bottom-up search tree rooted at bbb. Its states include bbS, bSb, Sbb, bSS, SbS, SSb, SSS, bS, Sb, SS and S. The successful path is bbb, bbS, bS, S.]
Start with the input string, and try to find the start symbol.
Bottom Up Parsing as Search
• The initial state is the input string
• The states are combinations of terminals and
nonterminals
• The operators are the grammar rules.
– Any chunk of the state that matches the right hand side
of a grammar rule can be replaced by the left hand side
of that rule.
• The Goal state is the designated start symbol.
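Bottom-up search can be sketched the same way, with the rules applied in the reducing direction. This is again only an illustration for the example language. The name bottom_up_reduce and the breadth-first strategy are assumptions made here.

```python
from collections import deque

# Rules of the example language: S -> bS and S -> b.
RULES = [("S", "bS"), ("S", "b")]

def bottom_up_reduce(string, start="S"):
    """Breadth-first search from the input string toward the start symbol.

    Returns the list of states on a successful path, or None.
    """
    queue = deque([[string]])
    seen = {string}
    while queue:
        path = queue.popleft()
        state = path[-1]
        if state == start:           # goal: the designated start symbol
            return path
        for lhs, rhs in RULES:
            # Replace each occurrence of a right hand side by its left hand side.
            i = state.find(rhs)
            while i != -1:
                new_state = state[:i] + lhs + state[i + len(rhs):]
                if new_state not in seen:
                    seen.add(new_state)
                    queue.append(path + [new_state])
                i = state.find(rhs, i + 1)
    return None

print(bottom_up_reduce("bbb"))   # ['bbb', 'bbS', 'bS', 'S']
```

Most of the states the search visits (SSS, SbS, Sb and so on) are dead ends, which is the wasted work that the search tree for ‘bbb’ illustrates.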
Parsing as Search
• Using search appears to have drawbacks.
– Repeated states (infinite search trees)
– Exponential time with respect to the length of the desired string
– Ambiguity: Is the derivation we found the right one?
• Actual Natural Language Parsers
– Keep a table of states (a chart) so as not to repeat them
– The chart allows the parser to keep track of multiple
derivations which makes it possible to deal with ambiguity.
– With the chart, we also don’t get caught in infinite loops.
• The chart makes parsing polynomial.
– Even with ambiguous grammars
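One well-known chart-based method is the CYK algorithm, sketched below as a recognizer. It is not necessarily the chart parser the slides have in mind. It assumes the grammar is in Chomsky normal form, so the example grammar S → bS | b is rewritten here with a helper nonterminal B (S → B S | b, B → b). The helper nonterminal and the names UNARY, BINARY and cyk_recognize are assumptions for this sketch.

```python
from itertools import product

# Chomsky-normal-form version of S -> bS | b.
# Assumption: B is a helper nonterminal introduced for this sketch.
UNARY = {"b": {"S", "B"}}          # word -> categories that produce it
BINARY = {("B", "S"): {"S"}}       # (left, right) -> categories that produce the pair

def cyk_recognize(words, start="S"):
    """Return True if `words` can be derived from `start`."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every category that spans words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(UNARY.get(word, ()))
    # Fill longer spans from shorter ones; each cell is computed once.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n]

print(cyk_recognize(list("bbb")))   # True
print(cyk_recognize(list("bab")))   # False
```

The three nested loops over span width, start position and split point give cubic time in the length of the input, and because every cell stores a set of categories, several derivations of the same span can coexist in the chart.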
More on the Rules
They can schematically be represented as:
α → β
where α and β are ordered lists of terminals and nonterminals.
Constraining the number of terminals and nonterminals in α
and β constrains the expressive power of the rules, i.e. the
more complex we let α and β be, the more complex our
languages can get.
Context Free Grammar
Is a type of grammar that constrains the rules such that:
α → β
α can only be a single nonterminal.
β can be any number of terminals and nonterminals.
Some flavor of Context Free Grammar is usually used
to recognize English syntax.
A Tiny NL CFG
Using context free grammar rules, we can make a tiny
natural language grammar.
The Lexicon
Noun → soul | pipe | fiddlers | bowl
ProperNoun → King Cole
Verb → was | called | plays | play
Adjective → old | merry | three
Article → a | the
Possessive → his
Conjunction → and
Preposition → for
Pronoun → he
The Lexicon is the list of words that we support, organized by
part of speech. These words are the terminal symbols.
The Syntax Rules
S → NP VP
| S Conjunction S
NP → Adjective* ProperNoun
| Possessive Adjective* Noun
| Article Adjective* Noun
| Pronoun
VP → Verb NP | Verb PP
PP → Preposition NP
The * means any number of the preceding symbol (including none).
NP, VP, and PP stand for Noun Phrase, Verb Phrase and
Prepositional phrase. They are the constituents in our grammar
as well as some of the constituents of actual English.
What’s a Constituent?
• Consider the noun phrase
– A sequence of words surrounding a noun referring to something
– The screaming monkey; The laptop on the table;
• How do we know these words form a constituent?
– Noun phrases can all appear before a suitable verb
– The screaming monkey grabbed my tie.
– The laptop on the table beeps when it’s low on power.
• But each piece can’t appear before a verb
– Screaming grabs…*; the beeps…*; on beeps…*
• There is other evidence for constituency
A Tiny NL CFG
Lexicon
Noun → soul | pipe | fiddlers | bowl
ProperNoun → King Cole
Verb → was | called | play | plays
Adjective → old | merry | three
Article → a | the
Possessive → his
Conjunction → and
Preposition → for
Pronoun → he
Grammar Rules
S → NP VP
| S Conjunction S
NP → Adjective* ProperNoun
| Possessive Adjective* Noun
| Article Adjective* Noun
| Pronoun
VP → Verb NP | Verb PP
PP → Preposition NP
The complete tiny grammar. It can generate lines from the Old
King Cole nursery rhyme.
Parse Trees
• When a parser derives a string
– It also outputs the associated parse tree(s).
– Parse trees are different from the search tree that was used
to find a derivation in that a parse tree just shows the
successful rule applications, ignoring the order in which
they were applied.
• A parse tree is the graphical representation of
the derivation of a sentence.
– Each node represents a rule used in the derivation
– Getting the parse tree out of the search tree is basically just
equivalent to remembering the operators that led to a
successful parse.
A Parse Tree With Our Grammar
[S [NP [Adj Old] [PropNoun King Cole]]
   [VP [Verb was]
       [NP [Art a] [Adj merry] [Adj old] [Noun soul]]]]
Constituency
(Graphically Speaking)
[S [NP [Adj Old] [PropNoun King Cole]]
   [VP [Verb was]
       [NP [Art a] [Adj merry] [Adj old] [Noun soul]]]]
The constituents of this S node are the NP and VP.
The children of a node are referred to as its constituents, i.e.
each nonterminal on the rhs of a rule is a constituent of the lhs.
CFGs Are Useful
They let us model syntactic phenomena like word
order and constituency.
But are CFGs the right way?
Let’s take a closer look at our grammar…
A Tiny NL CFG
Lexicon
Noun → soul | pipe | fiddlers | bowl
ProperNoun → King Cole
Verb → was | called | play | plays
Adjective → old | merry | three
Article → a | the
Possessive → his
Conjunction → and
Preposition → for
Pronoun → he
Grammar Rules
S → NP VP
| S Conjunction S
NP → Adjective* ProperNoun
| Possessive Adjective* Noun
| Article Adjective* Noun
| Pronoun
VP → Verb NP | Verb PP
PP → Preposition NP
One way of measuring a grammar’s performance is to
see if it generates unwanted sentences.
Generated Sentences
Old King Cole was a merry old soul.
A merry old soul was he.
He called for his pipe.
He called for his bowl.
He called for his three fiddlers.
The fiddlers play for old King Cole.
The fiddlers plays for old King Cole.*
The subject and verb disagree about whether the subject
should be singular or plural.
With our grammar, any verb will do.
[S [NP [Art The] [Noun fiddlers]]
   [VP [Verb ?]
       [PP [Prep for]
           [NP [Adj merry] [Adj old] [PropNoun King Cole]]]]]
Any combination of verb and noun is fine according to our
grammar. In other words, any verb is derivable regardless of
whether it agrees with the noun.
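The over-generation can be seen by running the tiny grammar as a recognizer. The sketch below is a hand-written recognizer for exactly these rules, not a general parser. It treats “King Cole” as a single token and handles the left-recursive rule S → S Conjunction S as a loop over conjoined clauses. The function names tag, parse_np, parse_vp and accepts are assumptions for this illustration.

```python
LEXICON = {
    "soul": "Noun", "pipe": "Noun", "fiddlers": "Noun", "bowl": "Noun",
    "king cole": "ProperNoun",
    "was": "Verb", "called": "Verb", "play": "Verb", "plays": "Verb",
    "old": "Adjective", "merry": "Adjective", "three": "Adjective",
    "a": "Article", "the": "Article", "his": "Possessive",
    "and": "Conjunction", "for": "Preposition", "he": "Pronoun",
}

def tag(sentence):
    """Map each word to its category; None if a word is not in the lexicon."""
    text = sentence.lower().rstrip(".").replace("king cole", "king_cole")
    tags = [LEXICON.get(word.replace("_", " ")) for word in text.split()]
    return None if None in tags else tags

def parse_np(tags, i):
    """Return the position just after an NP starting at i, or None."""
    if i >= len(tags):
        return None
    if tags[i] == "Pronoun":                       # NP -> Pronoun
        return i + 1
    head = "ProperNoun"                            # NP -> Adjective* ProperNoun
    if tags[i] in ("Article", "Possessive"):       # NP -> Article/Possessive Adjective* Noun
        head, i = "Noun", i + 1
    while i < len(tags) and tags[i] == "Adjective":
        i += 1
    return i + 1 if i < len(tags) and tags[i] == head else None

def parse_vp(tags, i):
    """Return the position just after a VP starting at i, or None."""
    if i >= len(tags) or tags[i] != "Verb":
        return None
    i += 1
    if i < len(tags) and tags[i] == "Preposition":  # VP -> Verb PP
        return parse_np(tags, i + 1)
    return parse_np(tags, i)                        # VP -> Verb NP

def accepts(sentence):
    """True if the sentence is NP VP, optionally conjoined with further clauses."""
    tags = tag(sentence)
    if tags is None:
        return False
    i = 0
    while True:
        i = parse_np(tags, i)
        i = parse_vp(tags, i) if i is not None else None
        if i is None:
            return False
        if i == len(tags):
            return True
        if tags[i] != "Conjunction":
            return False
        i += 1

print(accepts("The fiddlers play for old King Cole."))    # True
print(accepts("The fiddlers plays for old King Cole."))   # True
```

Both calls print True: nothing in the rules looks at whether the verb agrees with the subject, so the ungrammatical sentence is accepted.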
How do we solve this
problem?
• Maybe we don’t…
– Allowing the grammar to over-generate is fine for some
applications.
– Allowing over-generation makes life harder after the
parser because it means that we will have many more
parses for the same sentence.
• Assuming that we do want to fix it…
– We need to build the distinctions we need into the
grammar.
Agreement
• Number
– Singular vs plural : “They play” vs “They plays”*
• Person
– 1st person, 2nd person, 3rd person : “I am” vs “You am”*
• Case
– nominative vs accusative: “I hit him” vs “I hit he”*
• Gender
– In languages like German, every noun has a gender, and
the adjectives and articles must mark this gender.
– “Ein kleines Hündchen” vs “Eine kleine Hündchen”*
Subcategorization
• Verbs usually have a default number of things
they like to refer to
– Fred slept. (intransitive)
– I hit Paul. (transitive)
– The screaming monkey gave Anne a book. (ditransitive)
• Verbs also have preferences for other types of
constituents
– Tom walked into the café. (a path)
– I thought the screaming monkey was dead. (a sentence)
• These preferences are called the verb’s
subcategorization.
Subcategorization
• The verb “hit” is said to subcategorize for an NP.
– The subject must always be there, so it isn’t mentioned.
– The term “subcategorize” is used because we are breaking verbs up into
subcategories based upon their semantic requirements.
• Verb subcategorization is also a source of
overgeneration problems.
– Tom slept Lindsay the puck.***
– Tom washed Lindsay the puck.***
• But there is some freedom.
– Tom hit Lindsay the puck.
– Regina sneezed the napkin off the table.
Fixing the Lexicon
SgNoun → soul | pipe | bowl
PlNoun → fiddlers
SgProperNoun → King Cole
SgArticle → a | the
PlArticle → the
3rdSgNomPronoun → he
3rdPlPronoun → they
1stSgPronoun → I
Within the lexicon, it’s necessary to indicate with new nonterminal symbols all the distinctions we would like to make.
1stSgIntrans → sleep
3rdSgIntrans → sleeps
3rdPlIntrans → sleep …
1stSgTrans → play
3rdSgTrans → plays
3rdPlTrans → play …
Verbs also need to be marked with their subcategorization.
Updating the Grammar Rules
Updating the lexicon is not the worst of it though!
NP → Article Adjective* Noun
For each of the combinations, we need to enforce
agreement, turning just one NP rule into two NP rules:
3rdSgNP → SgArticle Adjective* SgNoun
3rdPlNP → PlArticle Adjective* PlNoun
Similar changes must be made for the other NP rules as
well as the VP rules.
Updating the Grammar Rules
But then changing to new NP and VP nonterminals
means that our S → NP VP rule now needs to be
updated for all the possible legal combinations.
S → 1stSgNP 1stSgVP
| 3rdSgNP 3rdSgVP
| 3rdPlNP 3rdPlVP
| …
It’s already annoying to have to deal with this, and
we don’t even have a large grammar!
It’s Unsatisfying
• Adding lots of syntactic categories works, but
we lose a lot of elegance in our syntax rules.
• All the different nonterminals make the
grammar harder to maintain.
• Once the grammar reaches a certain level of
complexity, supporting agreement,
subcategorization etc. makes the number of
rules explode.
An Alternative Approach
• Leave all the syntactic categories the same
– Using the old categories allows us to keep our syntax
rules simple.
• But add a data structure to each nonterminal
– This data structure can hold our special syntactic features
like agreement.
• Change the parsing process to also deal with
these data structures
– The grammar rules would indicate to the parser how to
interact with these data structures.
Feature Structures
• Simple Role, Filler data structure
– Basically a table that associates a particular value for a
particular feature (or role)
• Each lexical rule can set the values for the
relevant roles in its associated feature structure.
– This data structure can hold the agreement features.
• The syntactic rules then just need to make sure
that each constituent has features that are in
agreement.
Basic Feature Structure
A new rule for “I”
Pronoun → I
number ← SG
person ← 1st
– The top part of the rule is the old CFG rule.
– The next two lines set the agreement features.
– The ← denotes assignment to the feature listed on the lhs.
The corresponding fstruct
[ number : SG
  person : 1st ]
– This data structure is attached to the nonterminal during parsing so that the parser can use the information.
– The feature is on the lhs of the colon and the value is on the rhs of the colon.
Complex Feature Structures
A new rule for “I”
Pronoun → I
agreement.number ← SG
agreement.person ← 1st
The corresponding fstruct
[ agreement : [ number : SG
                person : 1st ] ]
Features can be filled by feature structures too.
Reentrant Feature Structure
[ NP : [ agreement : {1} [ number : [ ]
                           person : 3rd ] ]
  Article : [ agreement : {1} ]
  Noun : [ agreement : {1} ] ]
The {1} is a pointer. It constrains the Article.agreement,
Noun.agreement and NP.agreement features to be the same.
All three features are filled by the exact same value. Another
name for this connection between the slots is co-indexation.
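One way to picture co-indexation is as a shared reference. In the sketch below, nested Python dicts stand in for feature structures (an assumption of this illustration), and the three agreement slots all point at the same dict object.

```python
# One agreement structure, shared by reference (the {1} in the figure).
agreement = {"person": "3rd"}          # number is still unspecified

np_fstruct = {
    "NP": {"agreement": agreement},
    "Article": {"agreement": agreement},
    "Noun": {"agreement": agreement},
}

# Filling in the number through one path...
np_fstruct["Noun"]["agreement"]["number"] = "SG"

# ...makes it visible through every co-indexed path.
print(np_fstruct["NP"]["agreement"])        # {'person': '3rd', 'number': 'SG'}
print(np_fstruct["Article"]["agreement"] is np_fstruct["NP"]["agreement"])   # True
```

Copying the dict instead of sharing it would give three values that merely happen to be equal, which is weaker than the sameness that co-indexation requires.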
Updating the Grammar Rules
NP → Article Adjective* Noun
Article.agreement ↔ Noun.agreement
NP.agreement ↔ Noun.agreement
NP.agreement.person ← 3rd
The ↔ is the operator responsible for co-indexation. Because it
ensures sameness, it is the operator used to guarantee
compatibility between each constituent.
The last two constraints listed in the rule are there to percolate the
information about the noun up to the NP so that the sentence rule
will be able to check agreement between the subject and verb.
Feature Structure Unification
• To check the compatibility of two fstructs
– Two feature structures are compatible if they have the same
value for every feature they have in common (or if one or both
leave the value unspecified).
– This process of checking compatibility is called unification.
• Unification
– Is a recursive process that takes two feature structures and
either returns the combined feature structure if they are
compatible or it returns failure.
– Base case: Two values unify if they are the same string.
– Recursive Case: Two feature structures unify if for each feature
they have in common, those values unify.
– The resulting feature structure just adds the features they don’t
have in common to the resulting structure.
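The recursive definition above can be sketched directly. In this simplified illustration, feature structures are nested Python dicts, atomic values are strings, and None stands for an unspecified value. Those representation choices, the names unify and UnificationFailure, and the omission of reentrancy (co-indexed pointers) are all assumptions of the sketch.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures are incompatible."""

def unify(a, b):
    """Return the unification of feature structures a and b.

    Values are strings, nested dicts, or None (unspecified).
    The inputs are not modified. Reentrancy is not modelled.
    """
    if a is None:                      # an unspecified value unifies with anything
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)               # start with the features of a
        for feature, value in b.items():
            if feature in result:      # shared feature: the values must unify
                result[feature] = unify(result[feature], value)
            else:                      # feature only in b: just add it
                result[feature] = value
        return result
    if a == b:                         # base case: identical atomic values
        return a
    raise UnificationFailure(f"{a!r} does not unify with {b!r}")

left = {"subject": {"case": "nom", "agreement": {"person": "3rd"}}}
right = {"subject": {"case": None,
                     "agreement": {"person": "3rd", "number": "SG"}}}
print(unify(left, right))
# {'subject': {'case': 'nom', 'agreement': {'person': '3rd', 'number': 'SG'}}}
```

Replacing the None under case with "acc" makes the call raise UnificationFailure, because the two structures then assign different values to the same feature.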
Unification Example
agreement : {1}person : 3rd 


case
:
nom
 
subject : 
agreement : {1}  
case :



 

subject
:

agreement



=


person : 3rd   
:

number
:
SG


agreement : {1}



case
:
nom




subject : 
person : 3rd  

agreement : {1}





number : SG   
It’s ok if the two features structures have different features, the result is just
the union of the features. The empty value unifies with anything.
Unification Failure
[ agreement : {1}
  subject : [ case : nom
              agreement : {1} ] ]
unified with
[ subject : [ case : acc
              agreement : [ person : 3rd
                            number : SG ] ] ]
=
FAILURE!
But if both feature structures do have the same feature, except with different
values, that will cause a unification failure.
Free word order languages
Some languages mainly use marking and agreement rather than word order.
Latin is famous for this, as is Turkish.
German and Russian do so to some degree.
English: The good girl loves the poor boy.
Latin: Puella bona amat puerum parvum.
Russian: Xoroshaya devochka liubit bednovo malchika.
German: Das gute Mädchen liebt den armen Jungen.
Where Do We Go From Here?
• Remember, what we really want is the
meaning of the sentence
– There are representational issues.
– What knowledge needs to be represented for a language
understanding system?
– How does the syntax interact with the semantics?
• The next lecture will address these issues
– Hint: Notice that we’re not forced to limit our features
to syntactic ones. We could also put semantic features
in the feature structures…