lai01 - CLAIR - University of Michigan

September 7, 2000
Language and Information
Handout #1
(C) 2000, The University
of Michigan
Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Introduction
Demos
• Google
• AskJeeves
• OneAcross
• Systran
Some Statistics
• Business e-mail sent per day in the US: 2.1 billion
• Spam per day: 7 billion
• First class mail per year: 107 billion
• Text on Internet (2/99): > 6 TB
• indexed: 16% (Lawrence and Giles, Nature 400, 1999)
• Dialog (www.dialog.com): 9 TB
• Average college library: 1 TB
• More statistics: http://www.cyberatlas.internet.com
Languages
• Languages: 39,000 languages and dialects (22,000 dialects
in India alone)
• Top languages: Chinese/Mandarin (885M), Spanish
(332M), English (322M), Bengali (189M), Hindi (182M),
Portuguese (170M), Russian (170M), Japanese (125M)
• Source: www.sil.org/ethnologue, www.nytimes.com
• Internet: English (128M), Japanese (19.7M), German
(14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
• Usage: English (1999-54%, 2001-51%, 2003-46%, 2005-43%)
• Source: www.computereconomics.com
Syllabus
• Introduction to the course and linguistic background
– The study of language. Computational Linguistics and Psycholinguistics.
• Elementary probability and statistics
– Describing data. Measures of central tendency. The z score. Hypothesis
testing.
• Information theory
– Entropy, joint entropy, conditional entropy. Relative entropy and mutual
information. Chain rules.
• Data compression and coding
– Entropy rate. Language modeling. Examples of codes. Optimal codes.
Huffman codes. Arithmetic coding. The entropy of English.
Syllabus
• Clustering
– Cluster analysis. Clustering of terms according to semantic similarity.
Distributional clustering.
• Concordancing and collocations
– Concordances. Collocations. Syntactic criteria for collocability.
• Literary detective work
– The statistical analysis of writing style. Decipherment and translation.
• Information extraction
– Message understanding. Trainable methods.
• Word sense disambiguation and lexical acquisition
– Supervised disambiguation. Unsupervised disambiguation. Attachment
ambiguity. Computational lexicography.
Syllabus
• Part-of-speech tagging [*]
– Statistical taggers. Transformation-based learning of tags. Maximum
entropy models. Weighted finite-state transducers.
• Question answering
– Semantic representation. Predictive annotation.
• Text summarization
– Single-document summarization. Multi-document summarization.
Language models. Maximal Marginal Relevance. Cross-document
structure theory. Trainable methods. Text categorization.
• Other topics
– Text alignment. Word alignment. Statistical machine translation.
Discourse segmentation. Text categorization. Maximum entropy
modeling.
Assignments
• Problem sets
– The assignments will involve analysis of Web-based
data using both manual and automated techniques
• Project
– Data analysis and/or programming involved
• Midterm
– A mixture of short-answer and essay-type questions
• Final
– A mixture of short-answer and essay-type questions
Projects
Each student will be responsible for designing and completing a research project that
demonstrates the ability to use concepts from the class in addressing a practical
problem for humanities computing. A significant part of the final grade will depend on
the project assignment. Students will need to submit a project proposal, a progress
report, and the project itself. Students can elect to do a project on an assigned topic, or
to select a topic of their own.
The final version of the project will be put on the World Wide Web, and will be
defended in front of the class at the end of the semester (procedure TBA).
In some cases (and only with instructor’s approval), students may be allowed to work
in pairs, e.g., students with different backgrounds may collaborate on a larger project.
Readings
• Textbook:
– Oakes, Chapter 1, pages 1 – 10, 24 – 35
• Additional readings
– M&S, Chapter 2, pages 39 – 54
– M&S, Chapter 3, pages 81 – 113
Computational Linguistics
Syntactic categories
• Substitution test:
Joseph eats {Chinese | hot | fresh | vegetarian} food.
• Open (lexical) and closed (functional)
categories:
open: No-fly-zone, yadda yadda yadda
closed: the, in
Morphology
The dog chased the yellow bird.
• Parts of speech: eight (or so) general types
• Inflection (number, person, tense…)
• Derivation (adjective-adverb, noun-verb)
• Compounding (separate words or single word)
• Part-of-speech tagging
• Morphological analysis (prefix, root, suffix, ending)
Part of Speech Tags
From Church (1991) - 79 tags
NN    singular noun
IN    preposition
AT    article
NP    proper noun
JJ    adjective
,     comma
NNS   plural noun
CC    conjunction
RB    adverb
VB    un-inflected verb
VBN   verb +en (taken, looked (passive, perfect))
VBD   verb +ed (took, looked (past tense))
CS    subordinating conjunction
Part of Speech Tags
From Tzoukermann and Radev (1995) - 258 tags
r     RP      partitive article
s     S       particle
sx    SX      particle
a     T       nominal
u     U       proper noun
v1p   V1PPI   verb 1st person plural   present indicative
v1p   V1PPM   verb 1st person plural   present imperative
v1p   V1PPC   verb 1st person plural   present conditional
v1p   V1PPS   verb 1st person plural   present subjunctive
v1p   V1PFI   verb 1st person plural   future indicative
v1p   V1PII   verb 1st person plural   imperfect indicative
v1p   V1PSI   verb 1st person plural   simple-past indicative
v1p   V1PIS   verb 1st person plural   imperfect subjunctive
v2p   V2PPI   verb 2nd person plural   present indicative
v2p   V2PPC   verb 2nd person plural   present conditional
Jabberwocky (Lewis Carroll)
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
Nouns
• Nouns: dog, tree, computer, idea
• Nouns vary in number (singular, plural),
gender (masculine, feminine, neuter), case
(nominative, genitive, accusative, dative)
• Latin: filius (m), filia (f), filium (object)
German: Mädchen
• Clitics (‘s)
Pronouns
• Pronouns: she, ourselves, mine
• Pronouns vary in person, gender, number, case (in
English: nominative, accusative, possessive, 2nd
possessive, reflexive)
Joe bought him an ice cream.
Joe bought himself an ice cream.
• Anaphors: herself, each other
Determiners and Adjectives
• Articles: the, a
• Demonstratives: this, that
• Adjectives: describe properties
• Attributive and predicative adjectives
• Agreement: in gender, number
• Comparative and superlative (derivative and periphrastic)
• Positive form
Verbs
• Actions, activities, and states (throw, walk, have)
• English: four verb forms
• tenses: present, past, future
• other inflection: number, person
• gerunds and infinitive
• aspect: progressive, perfective
• voice: active, passive
• participles, auxiliaries
• irregular verbs
• French and Finnish: many more inflections than English
Other Parts of Speech
• Adverbs, prepositions, particles
• phrasal verbs (the plane took off, take it off)
• particles vs. prepositions (she ran up a bill/hill)
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: if, because, that,
although
• Interjections: Ouch!
Phrase-structure Grammars
Alice bought Bob flowers.
Bob bought Alice flowers.
• Constituent order (SVO, SOV)
• imperative forms
• sentences with auxiliary verbs
• interrogative sentences
• declarative sentences
• start symbol and rewrite rules
• context-free view of language
Sample Phrase-structure
Grammar
S   → NP VP
NP  → AT NNS
NP  → AT NN
NP  → NP PP
VP  → VP PP
VP  → VBD
VP  → VBD NP
PP  → IN NP

AT  → the
NNS → drivers
NNS → teachers
NNS → lakes
VBD → drank
VBD → ate
VBD → saw
IN  → in
IN  → of
NN  → cake
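A grammar this small can be tried out with a few lines of code. The following is a minimal Python sketch, not part of the original handout: the names `RULES`, `LEXICON`, `derives`, and `grammatical` are illustrative, and the last rewrite rule is assumed to read PP → IN NP. It only answers whether a word sequence can be derived from S; it does not build parse trees.

```python
from functools import lru_cache

# Rewrite rules of the sample grammar (PP -> IN NP is an assumed reading).
RULES = {
    "S": [("NP", "VP")],
    "NP": [("AT", "NNS"), ("AT", "NN"), ("NP", "PP")],
    "VP": [("VP", "PP"), ("VBD",), ("VBD", "NP")],
    "PP": [("IN", "NP")],
}

# Lexical rules: part-of-speech tag -> words.
LEXICON = {
    "AT": {"the"},
    "NNS": {"drivers", "teachers", "lakes"},
    "VBD": {"drank", "ate", "saw"},
    "IN": {"in", "of"},
    "NN": {"cake"},
}


@lru_cache(maxsize=None)
def derives(symbol, words):
    """Return True if `symbol` can be rewritten as the tuple `words`."""
    if symbol in LEXICON:
        return len(words) == 1 and words[0] in LEXICON[symbol]
    for rhs in RULES[symbol]:
        if len(rhs) == 1:
            # Unary rule: the child must cover the same span.
            if derives(rhs[0], words):
                return True
        else:
            left, right = rhs
            # Binary rule: try every split into two non-empty parts.
            for k in range(1, len(words)):
                if derives(left, words[:k]) and derives(right, words[k:]):
                    return True
    return False


def grammatical(sentence):
    """Check whether a whitespace-separated sentence is derivable from S."""
    return derives("S", tuple(sentence.lower().split()))


print(grammatical("the drivers ate the cake"))  # True
print(grammatical("drivers the ate"))           # False
```

Because every binary split gives each child a strictly shorter span, the left-recursive rules (NP → NP PP, VP → VP PP) do not cause infinite recursion.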
Phrase-structure Grammars
• Local dependencies
• Non-local dependencies
• Subject-verb agreement
The students who wrote the best essays were given a reward.
• wh-extraction
Should Derek read a magazine?
Which magazine should Derek read?
• Empty nodes
Dependency: Arguments and
Adjuncts
Sally watched the kids in the car.
• Event + dependents (verb arguments are usually NPs)
• agent, patient, instrument, goal - semantic roles
• subject, direct object, indirect object
• transitive, intransitive, and ditransitive verbs
• active and passive voice
Phrase Structure Ambiguity
• Grammars are used for generating and parsing sentences
• Parses
• Syntactic ambiguity
• Attachment ambiguity: Visiting relatives can be boring.
The children ate the cake with a spoon.
• High vs. low attachment
• Garden path sentences: The horse raced past the barn fell.
• Is the book on the table red?
Ungrammaticality vs. Semantic
Abnormality
* Slept children the.
# Colorless green ideas sleep furiously.
# The cat barked.
Semantics and Pragmatics
• Lexical semantics and compositional semantics
• Hypernyms, hyponyms, antonyms, meronyms and
holonyms (part-whole relationship, tire is a
meronym of car), synonyms, homonyms
• Senses of words, polysemous words
• Homophony (bass).
• Collocations: white hair, white wine
• Idioms: to kick the bucket
Discourse Analysis
• Anaphoric relations:
1. Mary helped Peter get out of the car. He thanked her.
2. Mary helped the other passenger out of the car.
The man had asked her for help because of his foot
injury.
• Information extraction problems (entity cross-referencing)
Hurricane Hugo destroyed 20,000 Florida homes.
At an estimated cost of one billion dollars, the disaster
has been the most costly in the state’s history.
Pragmatics
• The study of how knowledge about the world and
language conventions interact with literal
meaning.
• Speech acts
• Research issues: resolution of anaphoric relations,
modeling of speech acts in dialogues
Other Research Areas
• Linguistics is traditionally divided into phonetics,
phonology, morphology, syntax, semantics, and
pragmatics.
• Sociolinguistics: interactions of social organization and
language.
• Historical linguistics: change over time.
• Linguistic typology
• Language acquisition
• Psycholinguistics: real-time production and perception of
language
Ambiguities in Natural Language
• address, resent, entrance, number
• Lee: Wait to buy IBM
(http://cnnfn.cnn.com/2000/07/19/investing/q_talking_stocks/)
• Pfizer to buy Warner-Lambert in $90-billion deal
(http://detnews.com/2000/business/0002/07/02080007.htm)
My Research Interests
• Text summarization (especially, of multiple
documents)
• Text categorization and clustering
• Information extraction
• Question answering
Main Research Forums
• Conferences: ACL, SIGIR, ANLP, Coling,
EACL/NAACL, AMTA/MT Summit, ICSLP/Eurospeech
• Journals: Computational Linguistics, Natural Language
Engineering, Information Retrieval, Information
Processing and Management, ACM Transactions on
Information Systems
• University centers: Columbia, CMU, UMass, MIT, UPenn,
USC/ISI, NMSU, Brown, Michigan, Maryland, Edinburgh,
Cambridge, Saarbrücken, Kyoto, and many others
• Industrial research sites: AT&T, Bell Labs, IBM, Xerox
PARC, SRI, BBN/GTE, MITRE, Microsoft
• Startups: Nuance, Ask.com, Inxight
Mathematical Foundations
Probability Spaces
• Probability theory: predicting how likely it is that
something will happen
• basic concepts: experiment (trial), basic outcomes, sample
space $\Omega$
• discrete and continuous sample spaces
• for NLP: mostly discrete spaces
• events
• $\Omega$ is the certain event while $\emptyset$ is the impossible event
• event space - all possible events
Probability Spaces
• Probabilities: numbers between 0 and 1
• Probability function (distribution): distributes a
probability mass of 1 throughout the sample space
$\Omega$.
• Example: coin is tossed three times. What is the
probability of 2 heads?
• Uniform distribution
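The coin example can be worked by enumerating the sample space. This is a small illustrative Python sketch (the variable names are not from the handout); it assumes a fair coin, so that all eight outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 2**3 = 8 equally likely sequences of three tosses.
sample_space = list(product("HT", repeat=3))

# Event of interest: exactly two heads.
event = [outcome for outcome in sample_space if outcome.count("H") == 2]

# Under the uniform distribution, P(event) = |event| / |sample space|.
p_two_heads = Fraction(len(event), len(sample_space))
print(p_two_heads)  # 3/8
```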
Conditional Probability and
Independence
• Prior and posterior probability
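$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
[Figure: Venn diagram of the sample space $\Omega$ with events A and B overlapping in $A \cap B$]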
Conditional Probability and
Independence
• The chain rule:
$P(A_1 \cap \dots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid \bigcap_{i=1}^{n-1} A_i)$
• This rule is used in many ways in statistical NLP, more
specifically in Markov Models.
• Two events are independent when $P(A \cap B) = P(A)P(B)$
• Unless P(B)=0 this is equivalent to saying that P(A) = P(A|B)
• If two events are not independent, they are considered
dependent
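The definition of independence can be checked mechanically on a small sample space. The sketch below is an illustration added to the handout's material: it uses two fair dice, and the event names (`first_even`, `sum_is_7`, `sum_is_8`) are hypothetical examples, not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely ordered outcomes of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event (a predicate on outcomes), uniform distribution."""
    return Fraction(sum(1 for o in OMEGA if event(o)), len(OMEGA))


def independent(a, b):
    """Check the definition P(A and B) = P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)


def first_even(o):
    return o[0] % 2 == 0


def sum_is_7(o):
    return o[0] + o[1] == 7


def sum_is_8(o):
    return o[0] + o[1] == 8


print(independent(first_even, sum_is_7))  # True
print(independent(first_even, sum_is_8))  # False
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.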
Bayes’ Theorem
• Bayes’ theorem is used to calculate P(B|A) given
P(A|B).
$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \mid B)\, P(B)}{P(A)}$
Random Variables
• Simply a function:
$X: \Omega \to \mathbb{R}^n$
• The numbers are generated by a stochastic process with a
certain probability distribution.
• Example: the discrete random variable X that is the sum of
the faces of two randomly thrown dice.
• Probability mass function (pmf) which gives the
probability that the random variable has different numeric
values:
$p(x) = P(X = x) = P(A_x)$ where $A_x = \{\omega \in \Omega : X(\omega) = x\}$
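For the two-dice example, the pmf can be tabulated by counting the outcomes that map to each value. This is a minimal Python sketch under the assumption of fair dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Basic outcomes: ordered pairs of faces of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# The random variable X maps each outcome to the sum of its faces.
counts = Counter(a + b for a, b in outcomes)

# p(x) = P(X = x): the share of outcomes whose sum equals x.
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

print(pmf[7])             # 1/6
print(sum(pmf.values()))  # 1
```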
Random Variables
• If a random variable X is distributed according to
the pmf p(x), then we write $X \sim p(x)$
• For a discrete random variable, we have that:
$\sum_i p(x_i) = \sum_i P(A_{x_i}) = P(\Omega) = 1$
Expectation and Variance
• Expectation = mean (average) of a random variable.
• If X is a random variable with a pmf p(x), such that
$\sum_x |x|\, p(x) < \infty$, then the expectation is:
$E(X) = \sum_x x\, p(x)$
• Example: rolling one die
• Variance = measure of whether the values of the random
variable tend to be consistent over trials or to vary a lot.
$Var(X) = E((X - E(X))^2) = E(X^2) - E^2(X)$
• Standard deviation = square root of variance
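The one-die example can be computed directly from the definitions. The sketch below assumes a fair die; the helper name `expectation` is illustrative. It also confirms that the two forms of the variance agree.

```python
from fractions import Fraction
from math import sqrt

# pmf of one fair die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum over x of g(x) * p(x)."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(pmf)                                  # E(X)
var = expectation(pmf, lambda x: (x - mean) ** 2)        # E((X - E(X))^2)
var_alt = expectation(pmf, lambda x: x * x) - mean ** 2  # E(X^2) - E^2(X)

print(mean)  # 7/2
print(var)   # 35/12
print(round(sqrt(var), 3))  # standard deviation, about 1.708
```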
Expectation and Variance
• Composition of functions:
$E(g(Y)) = \sum_y g(y)\, p(y)$
• Examples:
If g(Y) = aY + b, then E(g(Y)) = aE(Y) + b
E(X+Y) = E(X) + E(Y)
E(XY) = E(X)E(Y), if X and Y are independent
Joint and Conditional
Distributions
• Joint (multivariate) probability distributions:
p(x,y) = P(X = x , Y = y)
• Marginal pmf:
$p_X(x) = \sum_y p(x, y)$
$p_Y(y) = \sum_x p(x, y)$
• If X and Y are independent:
$p(x, y) = p_X(x)\, p_Y(y)$
Joint and Conditional
Distributions
• Conditional pmf in terms of the joint distribution:
$p_{X|Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)}$ for y such that $p_Y(y) > 0$
Determining P
• Estimation
• Example “The cow chewed its cud”
• Relative frequency
• Parametric approach (doesn’t work for
distribution of words in newspaper articles
in a particular topic category)
• Non-parametric approach
The Binomial Distribution
• The number r of successes out of n trials given
that the probability of success in any single trial is
p:
$B(r; n, p) = \binom{n}{r}\, p^r (1-p)^{n-r}$, where $\binom{n}{r} = \frac{n!}{(n-r)!\, r!}$
• Example: tossing a (possibly weighted) coin
n times.
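The binomial formula translates directly into code. A minimal Python sketch follows; the function name `binomial_pmf` is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


# Fair coin tossed three times: probability of exactly two heads.
print(binomial_pmf(2, 3, 0.5))  # 0.375

# A weighted coin (p = 0.3, a made-up value) tossed ten times:
# the probabilities over r = 0..10 should sum to 1.
total = sum(binomial_pmf(r, 10, 0.3) for r in range(11))
```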
Pascal’s Triangle
        1
       1 1
      1 2 1
     1 3 3 1
    1 4 6 4 1
The Normal Distribution
• Describes a continuous distribution
$n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$
• Standard normal distribution: when $\mu = 0$ and $\sigma = 1$
• In statistics, normal distribution is often used to
approximate the binomial distribution. It should only be
used when np(1-p) > 5
• In NLP, such assumptions are unwise. Example: “shade
tree mechanics”
Statistics for Corpus Linguistics
Statistics for Corpus Linguistics
• Descriptive statistics: how to describe data
• Describing relationships: the Chi-square
test, correlation, regression
• Information theory: information, entropy,
coding, redundancy, optimal codes, mutual
information
Measures of Central Tendency
• Mode: the most frequent score in a data set
• Median: central score of the distribution
• Mean: average of all scores
Examples
• Split “Moby Dick” into 135 files (“pages”).
• Occurrences of the word “the” in the first
15 pages:
Data: 17 125 99 300 80 36 43 65 78 259 62 36 40 120 45
Mean: 93.67
Median: 65
Mode: 36
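The three measures above can be reproduced with Python's standard `statistics` module. This is only a sketch for checking the figures; the list `counts` holds the 15 values given on the slide.

```python
import statistics

# Occurrences of "the" in the first 15 pages (values from the slide).
counts = [17, 125, 99, 300, 80, 36, 43, 65, 78, 259, 62, 36, 40, 120, 45]

mean = statistics.mean(counts)
median = statistics.median(counts)
mode = statistics.mode(counts)

print(round(mean, 2))  # 93.67
print(median)          # 65
print(mode)            # 36
```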
Probabilities
• p = a/n, where a is the number of successes,
and n is the number of trials.
• The sum of all probabilities is 1:
$\sum_i p(i) = 1$
• Independent probabilities (product of
probabilities): P(a AND b) = P(a) * P(b)
Binomial Coefficient
$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$
The probability of r successes in n trials, when the probability of success in a single trial is p, is:
$\binom{n}{r}\, p^r q^{n-r}$
where q = 1 - p
Related Concepts
• For binomial distributions:
– standard deviation is the square root of the product of n, p, and q: $\sigma = \sqrt{npq}$
– mean is n × p
• Normal distributions:
– approximate the binomial for large values of n
– asymptotically bell-shaped curves
Skewed Normal Distributions
• Positively skewed (most of the data is below the
mean)
• Negatively skewed (the opposite)
• Bimodal distributions
• In corpus analysis: the number of letters in a word
or the length of a verse in syllables is usually
positively skewed
• Lognormal distributions
Central Limit Theorem
When samples are repeatedly drawn from a
population, the means of the samples are approximately
normally distributed around the population mean, provided
the samples are large enough. This holds whether or not
the underlying distribution is normal.
Measures of Variability
• Variance = $\frac{\sum (x - \mu)^2}{N - 1}$
• Range
• Standard deviation is the square root of the variance
• Semi inter-quartile range (25%-75% range): Columbia
ACT scores (26-30)
Data: 17 125 99 300 80 36 43 65 78 259 62 36 40 120 45
Mean: 93.67
Median: 65
Variance: 6729.52
Standard Deviation: 82.03
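The variance and standard deviation figures can be checked the same way. In this sketch, `statistics.variance` and `statistics.stdev` use the N-1 denominator, matching the formula on the slide; the range is added for completeness.

```python
import statistics

# Occurrences of "the" in the first 15 pages (values from the slide).
counts = [17, 125, 99, 300, 80, 36, 43, 65, 78, 259, 62, 36, 40, 120, 45]

variance = statistics.variance(counts)   # sum((x - mean)^2) / (N - 1)
std_dev = statistics.stdev(counts)       # square root of the variance
value_range = max(counts) - min(counts)  # largest minus smallest value

print(round(variance, 2))  # 6729.52
print(round(std_dev, 2))   # 82.03
print(value_range)         # 283
```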
z-score
• A measure of how far a value is from the mean, in
terms of standard deviations
• Example: $\mu = 93$, $\sigma = 82$. Let’s consider a page
with 144 occurrences of the word “the”. The z-score for that page is:
z = (144-93)/82 = 0.62
• Using the table on pages 258-259 of Oakes, we
find that about 27% of the distribution lies above this value, so the new page is at roughly the 73rd percentile
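When a printed table is not at hand, the same lookup can be approximated in code. The sketch below is an addition to the handout: it computes the z-score and evaluates the standard normal cumulative distribution through `math.erf`. The function names are illustrative, and it assumes the counts are normally distributed.

```python
import math


def z_score(x, mean, std_dev):
    """Distance of x from the mean, in standard deviations."""
    return (x - mean) / std_dev


def normal_cdf(z):
    """Share of a standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = z_score(144, 93, 82)
below = normal_cdf(z)

print(round(z, 2))          # 0.62
print(round(below, 2))      # 0.73
print(round(1 - below, 2))  # 0.27
```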
Hypothesis Testing
• Applies when two data sets are both normally distributed and their
means and standard deviations are known
• Example: Francis and Kucera reported that the mean
sentence length in government documents is 25.48 words,
while in the Present-Day Edited American English corpus,
the mean length is 19.27 words only
Hypotheses
• Null hypothesis: that the difference can be
explained in terms of chance and natural
variability
• Statistical significance: when a difference at least this large would
occur less than 5% of the time if the null hypothesis were true
T-testing
• Tests the difference between two groups for
normally-distributed interval data
• The t-test is normally used with small samples:
less than 30 items
• The one-sample study compares a sample mean
with an established population mean
$t_{obs} = (\bar{x} - \mu) / \mathrm{stderr}$
Example 1
• Mixed corpus: 2.5 verbs per sentence with 1.2
standard deviation
• Scientific corpus: 3.5 verbs per sentence with 1.6
standard deviation
• number of sentences in the scientific corpus: 100
• standard error in scientific corpus: $1.6/\sqrt{100} = 0.16$
• observed value of t = (3.5-2.5)/0.16 = 6.25
Example 1 (Cont’d)
• Number of degrees of freedom: in the example: 99
• Use table on page 260 of Oakes
• Find value: 1.671
• The observed value of t is larger, therefore the null
hypothesis can be rejected
Tests for Difference
$t_{obs} = (\bar{x}_1 - \bar{x}_2) / \mathrm{stderr}$
$\mathrm{stderr}^2 = s_1^2 / n_1 + s_2^2 / n_2$
Control (n=8)   Test (n=7)
10              8
5               1
3               2
6               1
4               3
4               4
7               2
9
Example 2
$\mathrm{stderr} = \sqrt{\frac{2.27 \times 2.27}{7} + \frac{2.21 \times 2.21}{8}} = \sqrt{0.736 + 0.611} = \sqrt{1.347} = 1.161$
t = (6-3)/1.161 = 2.584
Example 2 (Cont’d)
• Number of degrees of freedom:
7 + 8 - 2 = 13
• critical value of significance at the 5 per cent level
is 2.16
• Since the observed value is greater than 2.16, we
can reject the null hypothesis
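The calculation in Example 2 can be sketched in Python from the summary figures alone. The function name `t_from_summary` is illustrative. Assigning mean 6 and standard deviation 2.21 to the group of 8, and mean 3 and 2.27 to the group of 7, is an inference from the figures shown rather than something the slides state. The critical value 2.16 is the one quoted above.

```python
import math


def t_from_summary(mean1, s1, n1, mean2, s2, n2):
    """t statistic for a difference of means, from summary statistics."""
    stderr = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / stderr


t = t_from_summary(6, 2.21, 8, 3, 2.27, 7)
degrees_of_freedom = 8 + 7 - 2
critical_value = 2.16  # 5 per cent level for 13 degrees of freedom (from the slide)

# Without rounding the intermediate standard error, t is about 2.585;
# the slide's 2.584 comes from dividing by the rounded value 1.161.
print(round(t, 3))
print(degrees_of_freedom, t > critical_value)  # 13 True
```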
Parametric and Non-parametric
Tests
• Four scales of measurement: ratio, interval,
ordinal, nominal
• parametric tests (e.g., t-test): interval or ratio-scored dependent variables; assumes independent
observations; usually normal distributions only
• non-parametric tests: mostly for frequencies and
rank-ordered scales; any type of distributions; less
powerful than parametric tests
Chi-square Test
• Relationship between the frequencies in a
display table
• Null hypothesis: no difference in
distribution (all distributions are equal)
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Special cases
• When the number of degrees of freedom is 1, as in
a 2x2 contingency table, Yates’s correction factor
is used.
• If O > E, subtract 0.5 from O; otherwise, add 0.5 to O (the correction
reduces |O - E| by 0.5).
• If E < 5, results are not reliable.
Two-dimensional Contingency
Table
          X = yes   X = no
Y = yes   a         b
Y = no    c         d
Expected value = (Row total x column total) / Grand total number of items
$\chi^2 = \frac{N\,(|ad - bc| - N/2)^2}{(a+b)(c+d)(a+c)(b+d)}$
Third Person Singular Reference (O)
                       Japanese   English   Total
Ellipsis                    104         0     104
Central pronouns             73       314     387
Non-central pronouns         12        28      40
Names                       314       291     605
Common NPs                  205       174     379
Total                       708       807    1515
Third Person Singular Reference (E)
                       Japanese   English   Total
Ellipsis                   48.6      55.4     104
Central pronouns          180.9     206.1     387
Non-central pronouns       18.7      21.3      40
Names                     282.7     322.3     605
Common NPs                177.1     201.9     379
Total                       708       807    1515
$(O-E)^2/E$ for the Two Languages
                       Japanese   English
Ellipsis                   63.2      55.4
Central pronouns           64.4      56.5
Non-central pronouns        2.4       2.1
Names                       3.5       3.0
Common NPs                  4.4       3.9
$\sum$ = 258.8; df = (5-1) x (2-1) = 4
--> different at the 0.001 level
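The chi-square statistic for this table can be recomputed from the observed counts. This is a minimal Python sketch; the names `observed` and `chi_square` are illustrative. Because it does not round the expected values, the total comes out near 258.5 rather than the 258.8 obtained by summing the rounded cells.

```python
# Observed counts (Japanese, English) for each reference type.
observed = {
    "Ellipsis": (104, 0),
    "Central pronouns": (73, 314),
    "Non-central pronouns": (12, 28),
    "Names": (314, 291),
    "Common NPs": (205, 174),
}


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Expected value = row total x column total / grand total.
            expected = row_total * col_total / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, df


chi2, df = chi_square(observed)
print(round(chi2, 1), df)
```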
Rank Correlation
• Pearson - continuous data
• Spearman’s rank correlation coefficient - non-continuous variables
$\rho = 1 - \frac{6 \sum d^2}{N (N^2 - 1)}$
Example
S   X      Y      X'   Y'   d   d²
1   894    80.2   2    5    3   9
2   1190   86.9   1    2    1   1
3   350    75.7   6    6    0   0
4   690    80.8   4    4    0   0
5   826    84.5   3    3    0   0
6   449    89.3   5    1    4   16
$\rho = 1 - \frac{6 \times 26}{6\,(6^2 - 1)} = 1 - \frac{156}{210} \approx 0.26$ (0.3 to one decimal place)
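The ranking and the coefficient can be reproduced in a few lines. The following is an illustrative Python sketch (the function names are not from the handout). It ranks the largest value first, as in the table, and assumes there are no tied values.

```python
def ranks_descending(values):
    """Rank values so that the largest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (N * (N^2 - 1))."""
    n = len(xs)
    rank_pairs = zip(ranks_descending(xs), ranks_descending(ys))
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in rank_pairs)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


x = [894, 1190, 350, 690, 826, 449]
y = [80.2, 86.9, 75.7, 80.8, 84.5, 89.3]

print(ranks_descending(x))            # [2, 1, 6, 4, 3, 5]
print(round(spearman_rho(x, y), 3))   # 0.257
```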
Linear Regression
• Dependent and independent variables
• Regression: used to predict the behavior of
the dependent variable
• Needed: $\mu_X$, $\mu_Y$, X, b = slope of Y(X)
$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2}$
$Y' = \mu_Y + b\,(X - \mu_X)$
Example
Section   X     Y     X²      XY
1         22    20    484     440
2         49    24    2401    1176
3         80    42    6400    3360
4         26    22    676     572
5         40    23    1600    920
6         54    26    2916    1404
7         91    55    8281    5005
TOTAL     362   212   22758   12877
Example (Cont’d)
$b = \frac{(7 \times 12877) - (362 \times 212)}{(7 \times 22758) - (362 \times 362)} = \frac{90139 - 76744}{159306 - 131044} = \frac{13395}{28262} = 0.474$
$a = \mu_Y - b\,\mu_X = 5.775$
Y' = 5.775 + 0.474 X
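The slope and intercept can be recomputed from the seven (X, Y) pairs in the table. This is a minimal Python sketch with an illustrative function name, `fit_line`. It takes the intercept as the mean of Y minus b times the mean of X, which is what the prediction formula Y' = μY + b(X - μX) implies.

```python
def fit_line(xs, ys):
    """Least-squares line Y' = a + b * X from paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # mean of Y minus b times mean of X
    return a, b


x = [22, 49, 80, 26, 40, 54, 91]
y = [20, 24, 42, 22, 23, 26, 55]

a, b = fit_line(x, y)
print(round(b, 3))  # 0.474
print(round(a, 3))  # 5.775
```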
N-gram Models
Word Prediction
• Example: “I’d like to make a collect …”
• “I have a gub”
• augmentative communication systems
• “He is trying to fine out”
• “Hopefully, all with continue smoothly in my absence”
• “They are leaving in about fifteen minuets to go to her
house”
• “I need to notified the bank of [this problem]”
• Language model: a statistical model of word sequences
Counting Words
• Brown corpus (1 million words from 500 texts)
• Example: “He stepped out into the hall, was delighted to
encounter a water brother” - how many words?
• Word forms and lemmas. “cat” and “cats” share the same
lemma (also tokens and types)
• Shakespeare’s complete works: 884,647 word tokens and
29,066 word types
• Brown corpus: 61,805 types and 37,851 lemmas
• American Heritage 3rd edition has 200,000 “boldface
forms” (including some multiword phrases)
Unsmoothed N-grams
• First approximation: each word has an equal probability to
follow any other. E.g., with 100,000 words, the probability
of each of them at any given point is .00001
• “the” - 69,971 times in BC, while “rabbit” appears 11
times
• “Just then, the white …”
$P(w_1, w_2, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \dots w_{n-1})$
Bigram model:
Replace $P(w_n \mid w_1 w_2 \dots w_{n-1})$ with $P(w_n \mid w_{n-1})$
Markov Models
• Assumption: we can predict the probability of
some future item on the basis of a short history
• Bigrams: first-order Markov models
• Bigram grammars: as an N-by-N matrix of
probabilities, where N is the size of the vocabulary
that we are modeling.
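An unsmoothed bigram model of this kind can be estimated by relative frequency: the count of a word pair divided by the count of its first word. The sketch below is illustrative only. The three-sentence corpus is a made-up toy, the `<s>` start marker is an added convention, and the function names are hypothetical.

```python
from collections import Counter


def bigram_model(sentences):
    """Estimate P(w_n | w_n-1) by relative frequency (no smoothing)."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}


def sentence_probability(model, sentence):
    """Product of bigram probabilities; unseen bigrams get probability 0."""
    tokens = ["<s>"] + sentence.lower().split()
    p = 1.0
    for pair in zip(tokens, tokens[1:]):
        p *= model.get(pair, 0.0)
    return p


# Toy corpus, invented for illustration.
corpus = ["the white rabbit ran", "the white house", "the rabbit ran"]
model = bigram_model(corpus)

print(model[("the", "white")])                              # 2 of 3 "the" are followed by "white"
print(sentence_probability(model, "the white rabbit ran"))  # 1 * 2/3 * 1/2 * 1
print(sentence_probability(model, "the house ran"))         # 0.0 (unseen bigram)
```

The stored probabilities correspond to the non-zero cells of the N-by-N matrix; any pair absent from the dictionary is an implicit zero.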
Relative Frequencies
            a   aardvark   aardwolf   aback   …   zoophyte   zucchini
a           X   0          0          0       …   X          X
aardvark    0   0          0          0       …   0          0
aardwolf    0   0          0          0       …   0          0
aback       X   X          X          0       …   X          X
…           …   …          …          …       …   …          …
zoophyte    0   0          0          X       …   0          0
zucchini    0   0          0          X       …   0          0