An introduction to statistical methods with language data, in R Overview

advertisement
An introduction to statistical methods with language
data, in R
Diana Santos
ILOS
d.s.m.santos@ilos.uio.no
November 2013
Overview
Motivation, warning, disclaimer, terminology
Some methods that can be used with corpora
Some methods that can be used with questionnaires
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
2 / 48
Some quotes about use of measures, or figures
Stefan Evert, 2006:
Statistical methods give only numbers - it is linguistic
interpretation that gives them meaning.
Eugenie C. Scott, 2013 (free rendering):
From hypothesis to textbook... the iterative scientific method,
get data, test, find alternative explanations (critical thinking),
test again, publish, have others testing, have others come with
alternative explanations, get some scientific consensus, translate
into textbook science to teach to students as scientific
discoveries.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
3 / 48
Some methods that can be used with corpora
Using statistical artillery
1
Comparison of two proportions
2
Correlation of two properties
3
Classifying instances using a set of properties
4
Grouping several elements in sets or clusters based on a set of
measures
All this use measures over features, that is, some quantification (counting)
and in some cases the notion of geometrical space.
No magic, and one needs a solid linguistic analysis of the features used, in
addition to a solid mathematical analysis of the mathemeatical
presuppositions.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
4 / 48
Statistics in linguistics and social sciences at a glance
the things one counts are discrete: either yes or no, or categorical
so, what we can check is the influence and interaction of factors
detective work: are things related? are there other causes or things
we should control?
interaction (and correlation) is not causality
To test these things, one computes particular numbers based on our
observations. So it is important that we know what we are observing, that
we can interpret what we see.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
5 / 48
Going backwards
We have a file with observations, with each feature in one column. We ask
for a statistical model (and depending on the kind of observations, there
are different ones), and the numbers associated to that model tell us
things like
some features are good to explain the variation
some features have to be seen together with others
how well this model allows us to foresee other cases
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
6 / 48
Operationalization: how do you really do this?
First you create a “table”, or better a dataframe in R, which is a
computational object that includes, for each observation/unit, a set of
values, organized in columns, and obtained from your corpus (or from
your field observations).
R provides a lot of machinery to deal with such “tables” (of numbers
or values). Both for counting (arithmetics), for visualization, and for
statistical processing.
The tables people usually see in papers are already the result of
processing these dataframes, for example contingency tables.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
7 / 48
What is a statistical test?
(in frequentist statistics) A test has always two elements:
1
a test statistic – a function of the sample that we will compute based
on our sample);
2
and a rejection region – we will reject the hypothesis if the test
statistic lies in that region
The p-value is the probability that the test statistic has this value if the
nulll hypothesis, H0 , is true. The lower the p-value (also called the
observed significance level), the more comfortable we are in rejecting the
null hypothesis.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
8 / 48
What is a statistical test?
En test består alltid av to deler:
1
en teststatistikk (en funksjon av utvalget som ligger til grunn for å
forkaste eller ikke forkaste en hypotese);
2
en forkastningsregion: alle verdier som forkaster hypotesen
chisq.test(TABELL)
Pearson’s Chi-squared test with Yates’ continuity correction
data: tabell
X-squared = 33.112, df = 1, p-value = 8.7e-09
P-verdien er sannsynligheten, når H0 er sann, for at teststatistikken
har denne verdien eller strider enda mer mot H0
Jo mindre P-verdien er, desto mer strider dataene mot H0
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
10 / 48
More on p-values
P-values are also called observed significance levels (“observert
signifikansnivå”), because they can represent the smallest value from
which we reject the null hypothesis (H0 )
One can thus also define beforehand
P-verdi <= α - reject H0 at the alpha level
P-verdi > α - do not reject H0 at the alpha level
Very important: Something can be statistically significant but not
practically significant!
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
11 / 48
Exploratory data analysis (fra Cohen, 1995:46)
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
12 / 48
Men vi har bare noen tall... I
Det finnes begrep for å beskrive en populasjon, og et utvalg.
Sentralitet Gjennomsnitt, median, mode: når man bare kan bruke ett
tall for å beskrive mange
Dispersjon Varians, standardavvikk, variasjonsbredde (“range”),
interkvartil-intervall, variasjonskoeffisient
Forventningsretthet (bias) Måler asymetri (positiv eller negativ): kurven
går mer til høyre eller til venstre?
Kurtosis Måler spissing sammenliknet med normalfordelingen
I statistikk prøver vi å finne de virkelige verdiene ved å måle noen
observasjoner som gir oss “estimatorverdier”. Estimatorene er det vi bruker
for å nå de underliggende parametrene: de virkelige tallene vi er ut etter.
For å finne gode estimatorer, finnes det to metoder: momentmetoden, og
MLE (maximum likelihood estimation).
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
13 / 48
Parametric statistics
Parametric methods use known probability distributions, that only require
that we know the values of the parameters. Families of distributions (one
element for each parameter value).
Distribution
Binomial
Poisson
Gaussian
t
F
χ2
Parameters
p
λ
µ, σ
ν - degrees of freedom
ν1 , ν2
ν - degrees of freedom
If we don’t know the distribution function, there are non-parametric
methods.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
14 / 48
One and two-tailed tests
Pairing with a null hypothesis (where there is no effect or interaction or
difference), there is always an alternative hypothesis, Ha .
two-tailed, either way rejects the hypothesis
one tailed: there is one direction we expect
The null hypothesis has to be precise (so that we can mathematically
study its distribution).
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
15 / 48
A statistician can never be sure
There is a compromise between two kinds of errors, type I or type II:
Tabela: Reality
Decision
Accept H0
Reject H0
H0 is true
correct
Type I error, α
H0 is not true
Type II error, β
correct
One therefore defines also the power of a statistical test, 1 − β, in addition
to the α level (for the P-values, or significance values), also called
significance level.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
16 / 48
Study of passive
What could one do about studying the passive in Portuguese (or another
language) using corpora?
Absolute frequencies?
Relative? (what is the unit? Is this a meaningful proportion?
Distribution – by which feature?
Let us study the proportion of passives in different genres
Let us study the distribution of passives for different verbs
Let us study the distribution of passive for different tenses
prop.test(PROPORTION1,PROPORTION2)
prop.test(c(120,81), c(140,100))
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
18 / 48
The simplest example
Two books (one genuine, the other tentatively assigned to Aristotle) have
the following distribution of the part of speech of the last word in the
sentence, in the first 100 sentences. Assuming that this statistic (POS of
the last word) is sound, what can we conclude from the table? (A: others,
S: nouns, V: verbs)
PoS
Rhetorics
R. to Aleksander
S
28
27
V
32
52
A
40
21
We want to know whether the two samples come from the same
population (=the books written by Aristotle). We use χ2
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
19 / 48
The simplest example: manually, and with R
Computing expected if the null hypothesis were true:
PoS
Rhetorics
R. to Aleksander
S
27.2
27.5
Computing the χ2 statistic, which is χ2 =
V
42
42
A
30.5
30.5
(Oi −Ei )2
i=1
Ei
PN
arist<-matrix(c(28,27,32,52,40,21),ncol=3)
colnames(arist)<-c("noun", "verb", "other")
row.names(arist)<-c("Ret","R. to Alex")
chisq.test(arist)
Pearson’s Chi-squared test with Yates’ continuity correction
data: arist
X-squared = 10.6981, df = 2, p-value = 0.004753
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
21 / 48
Examples of use of a t-test
Differences between means: is the mean in our sample just like the known
mean (12)?
t.test(HEDGES, mu=12)
Are F1 frequencies of men and women different?
t.test(F1S~gender, paired=FALSE)
t.test(F1S[gender=="M"], F1S[gender=="F"], paired=FALSE)
Are the length differences in translation consistently higher?
t.test(Length~OrigOrTrans, paired=TRUE)
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
23 / 48
Contingency tables (krysstabeller)
This is probably the most common way to present (linguistic) data. But it
is important to note that one can ask R to create the contingency tables
for us, provided one has one observation per line in a dataframe.
In statistics contingencies are all possible things that can happen.
See R commands table, xtabs, prop.table.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
24 / 48
Type informasjon
Akkurat som vi har adjektiv, navn og verb i språk (med forskjellige
egenskaper og bruksmåter), finnes det også forskjellige typer (med
forskjellige operasjoner tilknyttet) i programmeringsspråk, og i statistikk
finnes det forskjellige metoder for ulike typer informasjon (forskjellige
skalanivåer):
kategori (trenavn, kortstokk, kjønn, nasjonalitet, ...) – kategoriske
variabler, også kalt nominale variabler (nominalskala)
kategori som kan rangeres (mye, lite, ingenting; barn, ungdom,
voksne) – ordinale variabler (rangskala eller ordinalskala)
kontinuerlige (vekt, temperatur, alder, ...) – det er mulig å regne med
dem aritmetisk
“meningsfattige” tall (vilkårlige tall, slik temperatur er, eller noen
skolekaraktere), også kalt intervallvariabler (intervallskala)
meningsfulle tall, hvor null betyr noe: bare i slike tilfelle kan man dele
tallene seg imellom, og snakke om dobbelt så stor (forholdskala)
Det er viktig å skjønne disse forskjellene, selv om man stort sett kan
representere alt med tall. Men man må tolke det med skjønn.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
25 / 48
ANOVA and logistic regression
If your explanations are discrete/categorical: ANOVA, and you get an
analysis of variance table
more than one level of one factor: one way ANOVA
more than one factor: two/multi-way ANOVA
more than one factor measured on the same subjects: repeated
measures ANOVA
If also what you measure is discrete/categorical, and you get an
analysis of deviation table: logistic regression (or generalized linear
models)
there is no difference between one predictor or more than one predictor
more than one factor measured on the same subjects: repeated
measures ANOVA
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
26 / 48
Mixed effects models
Life is never simple... often you can interpret your data in different ways.
fixed effects: there are true values which are known, and influence
just the mean
random effects: the numbers are themselves a sample of a larger
population, and influence only the variation
One estimates fixed effects, but predicts random effects. Crawley (1995:
179):
It is difficult without lots of experience to know when to use
categorical explanatory variables as fixed effects and when as
random effects.
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
27 / 48
8
8
ANOVA eksempel (1)
●
●
●
●
●
6
●
● ●
●
2
●
●
●
●
● ●
●
●
●
●
●
●
4
● ●
concentração
●
●
4
●
●
●
●
●
●
● ●
●
2
6
●
●
●
●
0
●
0
concentração
●
5
10
15
20
ordem
Diana Santos (UiO)
5
10
15
20
ordem
Oslo, minicourse on statistics
18 november 2013
28 / 48
ANOVA eksempel (1): resultat
> summary(aov(ozono~jardim))
Df Sum Sq Mean Sq F value Pr(>F)
jardim
1
20 20.0000
15 0.001115 **
Residuals
18
24 1.3333
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
30 / 48
ANOVA eksempel (2)
●
●
●●
●
●
5
5
●
●
●
10
●●●
●
●●●
●
15
●
4
20
ordem
Diana Santos (UiO)
●
●
3
●
1
●
2
●
crescimento
4
●
3
2
1
crescimento
5
●
●
●
6
6
●
●
●
●
●
●●
●
●
5
●
●
●
●
10
●●●
●
●●●
●
15
20
ordem
Oslo, minicourse on statistics
18 november 2013
31 / 48
ANOVA eksempel (2): resultat
> summary(aov(Growth~Photoperiod)
Df Sum Sq Mean Sq F value Pr(>F)
Photoperiod 3
7.12
2.375
1.462 0.255
Residuals
20 32.50
1.625
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
33 / 48
Litt bedre forklart om Anova
Repetisjon: Anova er variansanalyse når uavhengige variabler er
kategoriske (med faktorer som har forskjellige nivåer).
ANOVA kan forklares som å teste forskjellen mellom
xij = µ + τ i + ij
og
xij = µ + ij
som kalles modell for behandling og model for ikke-behandling
(“treatments” and “no treatments” models), som er nullhypotesen.
Følgende begrep er essensielle i regresjonsanalyse: de tre korrigerte
summene SSX, SSY, SSXY, og SSE og SSR
P 2 (P y )2
SSY:
y − n , corrected sum of squares
P P
P
SSXY:
xy − xn y , corrected sum of products
SSR:
SSXY 2
SSX ,
regression sum of squares
SSE:
(y − a − bx)2 , sum of the squares of the residuals
(SSE = SSY − SSR)
P
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
34 / 48
ANOVA: avhengighet av mange faktorer
Teknisk beskrevet: ANOVA tester om de forskjellige gjennomsnittene som
finnes i ulike kontekster, er avhengig av kontekstene, eller bare er fri
variasjon fra utvalg til utvalg.
ANOVA utfører analysen ved å sammenlikne varianser for de
forskjellige gjennomsnittene og for globale gjennomsnitt
For at man kan anvende metoden og stole på resultatene, må
variansene til de to delpopulasjonene være tilstrekkelig like
ANOVA-tabellen viser prosentandelen av varians som er “forklart” av
de forskjellige faktorene. Ideelt sett, er variabelen normalt fordelt i
hver delpopulasjon
ANOVA generaliserer t-testen, og bruker F-fordelingen (F = t 2 )
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
35 / 48
Metodologiske kommentarer I
Nå har vi sett mange måter å tilpasse en modell til dataene på.
Men, det som er enda viktigere, er å finne ut hva slags data vi har, hva
slags eksperimenter ble gjort, og hva er antakelser.
En viktig ting er måten vi skaffet oss dataene på:
factorial experiments for hvert faktor har man fått uavhengige data for
dem
split-plot experiments forskjellige behandlinger er brukt til forskjellige
størrelser, med en hierarkiske design (“fixed effects”, som har
konsekvenser for gjennomsnittet)
nested designs forskjellige mål tas fra det samme individet (“random
effects”, som har konsekvenser for varians)
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
36 / 48
Metodologiske kommentarer II
For å unngå statistiske feil må man spesifisere hva som er knyttet til hva, i
modellbygging med en Error term. For ellers antar modellene
feiluavhengighet.
De meste kompliserte modellene kalles “mixed effects” modeller, fordi de
har både fixed effects og random effects.
Man estimerer “fixed effects”
og forutsier “random effects”.
For mixed effects bruker man da i R lme eller glme.
Her kan det nevnes Clarks “Language as fixed-effect fallacy” (1973).
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
37 / 48
18 november 2013
38 / 48
Liars, and statisticians
Beware of:
scale-dependent correlations (Crawley 2005 page 99)
Simpson’s paradox
Diana Santos (UiO)
Oslo, minicourse on statistics
On the mismatch of statistical terminology
Error structure: because the fisrt incertainties were about a “fixed”
measurement
Regression: because of the regression to the middle in height (genetics)
Treatments: because the first ANOVA were used in medicine
Sample: stichprøve, amostra, come from different metaphors
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
39 / 48
Practical part
1
Obtain several contingency tables from dataframes.
2
Compute descriptive statistics of datasets.
3
Use χ2
4
Use t-test, and proportion test
5
Check the Simpson paradox
6
Try one one-way ANOVA and one repeated measures ANOVA
7
Try a multifactorial analysis
Diana Santos (UiO)
Oslo, minicourse on statistics
18 november 2013
40 / 48
Download