How to make semantic maps with R
(maps based on contextual features of exemplars)
by Natalia Levshina
version as of 06.09.2015
1. Introduction
This tutorial illustrates a general approach to making probabilistic MDS-based semantic
maps, which represent the similarities between exemplars of one or more constructions in one
language as proximities on a low-dimensional Multidimensional Scaling map. These similarities are
based on overlapping semantic and syntactic features of the exemplars. The approach is illustrated
by a case study that models the semantic space of two Dutch causative auxiliaries,
doen ‘do’ and laten ‘let’, on the basis of 31 syntactic, semantic and morphological features.
If you use the code from this tutorial, please refer to:
Levshina, N. 2011. Doe wat je niet laten kan [Do what you cannot let]: A usage-based analysis of
Dutch causative constructions. PhD diss., University of Leuven.
This approach can be particularly useful in the following cases:
a) you want to identify clusters of senses, rather than individual semantic features that help you
explain lexical or grammatical variation;
b) you want to investigate the areas where the borders between lexical or grammatical categories
are fuzzy, and where they are clear-cut;
c) you want to identify the prototypical core and periphery of lexical or grammatical categories. See
an example in Levshina, Natalia. In press. An integrative exemplar-based model of semantic
structure: The Dutch causative construction with laten. In J. Yoon & S. Th. Gries (Eds.), Construction
Grammar beyond English: Observational and experimental approaches;
d) ‘small n, large p’, i.e. there are many variables and relatively few observations;
e) the contextual variables are highly intercorrelated;
f) the data contain many missing values, which makes other methods, such as regression analysis or
Multiple Correspondence Analysis, more difficult to use.
2. Data
The data for the case study can be downloaded as the R data object Dutch.Rdata from my personal
website http://www.natalialevshina.com/statistics.html. Dutch analytic causatives consist of two
verbal components, the Causative Auxiliary (doen “do” and laten “let”) and the infinitive, which is
called the Effected Predicate. There are also several nominal slots: the Causer, the Causee and the
Affectee (in case of transitive Effected Predicates). Consider an example:
(1) De  generaal liet    het leger de  stad vernielen.
    the general  let.PST the army  the city destroy.INF
    “The general ordered the army to destroy the city.”
where de generaal “the general” is the Causer, liet “let” is the Causative Auxiliary, het leger “the
army” is the Causee, de stad “the city” is the Affectee and vernielen “destroy” is the Effected
Predicate. For more information about the constructions, see Levshina (2011).
The data frame Dutch contains 100 examples of Dutch analytic causatives (rows) with causative
auxiliaries doen “do” and laten “let” (variable ‘Aux’) coded for 31 various contextual (semantic,
syntactic and morphological) features from Levshina (2011).
> str(Dutch)
'data.frame':   100 obs. of  32 variables:
 $ ClauseTense: Factor w/ 5 levels "Fut","Past","PastPerf",..: 2 4 3 ...
 $ SyntFun    : Factor w/ 3 levels "InfClause","Pred",..: 2 2 2 ...
 $ Clause     : Factor w/ 5 levels "Add","Adv","Compl",..: 4 4 2 ...
 $ Sent       : Factor w/ 2 levels "Decl","Q": 1 1 1 ...
 $ Adv        : Factor w/ 10 levels "Degree","Dur",..: 5 5 5 ...
 $ Modal      : Factor w/ 4 levels "kunnen","moeten",..: 4 3 3 ...
 […]
 $ Aux        : Factor w/ 2 levels "doen","laten": 2 1 1 ...
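Since missing values are one of the motivations for this approach, it can be useful to check how many there are before computing distances. The following is a minimal sketch on a made-up data frame; the object toy and its values are hypothetical. With the real data, one would presumably apply the same calls to Dutch after loading the file with load("Dutch.Rdata").

```r
# Hypothetical toy data with the same kind of structure as Dutch:
# categorical features plus the auxiliary in the last column
toy <- data.frame(
  Sent  = factor(c("Decl", "Decl", "Q", NA)),
  Modal = factor(c("kunnen", NA, NA, "moeten")),
  Aux   = factor(c("laten", "doen", "doen", "laten"))
)

# Number of missing values per variable
na.count <- colSums(is.na(toy))
na.count

# Share of exemplars with at least one missing feature value
incomplete <- mean(!complete.cases(toy))
incomplete
```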
3. R code
3.1. First, install (if needed) and load the packages that you will require to reproduce the code in this
tutorial.
library(cluster)
library(smacof)
library(MASS)
3.2. Make a distance matrix by using Gower distances (function daisy() in the package cluster).
For a data frame with categorical data, this function compares the values of categorical variables in
each pair of exemplars and turns the similarities into distances. Do not forget to exclude the column
with the auxiliaries.
Dutch.dist <- daisy(Dutch[, -32])
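To see what the distance between two exemplars means for purely categorical data, the computation can be sketched by hand: the Gower distance is the share of mismatching features among the features observed in both exemplars. The toy data frame and the helper gower.nominal() below are hypothetical illustrations, not part of the original analysis.

```r
# Hypothetical toy exemplars coded for three categorical features
toy <- data.frame(
  Tense = c("Past", "Pres", "Past"),
  Sent  = c("Decl", "Decl", "Q"),
  Modal = c("kunnen", NA, "kunnen"),
  stringsAsFactors = TRUE
)

# Gower distance for nominal features: the share of mismatching features
# among those that are non-missing in both exemplars
gower.nominal <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  mean(a[ok] != b[ok])
}

m <- sapply(toy, as.character)   # character matrix, one row per exemplar
n <- nrow(m)
d.manual <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) d.manual[i, j] <- gower.nominal(m[i, ], m[j, ])
round(d.manual, 3)

# Optional cross-check against daisy(), if the package cluster is installed
if (requireNamespace("cluster", quietly = TRUE)) {
  print(round(as.matrix(cluster::daisy(toy, metric = "gower")), 3))
}
```

Exemplars 1 and 2 differ in one of the two features observed in both (Modal is missing in exemplar 2), so their distance is 0.5. A missing value therefore shrinks the basis of comparison instead of removing the exemplar.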
3.3. Perform Multidimensional Scaling (MDS) with the iterative majorization algorithm with the help
of smacofSym() from the package smacof. MDS is a dimensionality-reduction technique that
represents distances from a distance matrix as proximities on a low-dimensional map. We will use
the non-metric (‘ordinal’) method (note that it may take some time to run). This means that the
algorithm tries to preserve the order of the original distance values, rather than the numeric values
of the distances.
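The idea of stress as badness of fit can be illustrated with base R alone. The sketch below uses classical MDS (cmdscale()) on made-up numeric data as a stand-in; it is not the ordinal smacofSym() fit used in this tutorial, and the helper stress1() is a hypothetical name for a stress-1-type measure computed on the raw distances, without the monotone transformation of ordinal MDS.

```r
set.seed(1)
pts <- matrix(rnorm(20 * 4), ncol = 4)   # 20 hypothetical points in 4 dimensions
d <- dist(pts)                           # the "original" distances

# Squared differences between original and map distances,
# relative to the squared map distances (square root taken)
stress1 <- function(d, conf) {
  d.orig <- as.vector(d)
  d.map  <- as.vector(dist(conf))
  sqrt(sum((d.orig - d.map)^2) / sum(d.map^2))
}

# Badness of fit for maps with 1 to 4 dimensions
stress <- sapply(1:4, function(k) stress1(d, cmdscale(d, k = k)))
round(stress, 3)

# Rank agreement between original and map distances in two dimensions;
# ordinal MDS aims to preserve this ordering rather than the values
rho <- cor(as.vector(d), as.vector(dist(cmdscale(d, k = 2))),
           method = "spearman")
rho
```

The values shrink as dimensions are added and reach (numerically) zero once the map has as many dimensions as the toy data, which is why a scree plot is needed to trade fit against interpretability.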
First, we perform several MDS analyses with varying number of dimensions (from 1 to 10) and create
a scree plot that will help us decide on the optimal number of dimensions. To automatize the
procedure, you can use the following code:
stress <- sapply(1:10, function(x) smacofSym(Dutch.dist, type = "ordinal",
ndim = x)$stress)
stress
[1] 0.39763392 0.23978828 0.16912232 0.13058995 0.10507390 0.08707420
[7] 0.07432792 0.06379829 0.05506099 0.04882822
plot(1:10, stress, type = "b", xlab = "n of dimensions", ylab = "stress",
main = "Scree plot of stress in ordinal MDS")
[Plot: Scree plot of stress in ordinal MDS; x-axis: n of dimensions; y-axis: stress]
Figure 1. Scree plot of stress in ordinal MDS.
The result is shown in Figure 1. Usually, one tries to identify the place where the plot ‘elbows’, but
here there is no clear elbow. In what follows, we will investigate the three-dimensional
solution, which has a stress value of 0.17. The reader is encouraged to try to interpret other
dimensions. The three-dimensional solution can be created as follows:
Dutch.mds <- smacofSym(Dutch.dist, type = "ordinal", ndim = 3)
3.4. Visualize the data in the form of a semantic map. First, we create a map with all exemplars
represented as points. To plot dimensions 1 and 2, one can use the following code:
plot(Dutch.mds$conf, main = "Exemplars of doen and laten: Dim 1 and 2")
[Plot: Exemplars of doen and laten: Dim 1 and 2; x-axis: D1; y-axis: D2]
Figure 2. Exemplars of Dutch analytic causatives. Dimensions 1 and 2.
In order to plot the second and third dimensions, one can do the following:
plot(Dutch.mds$conf[, 2:3],
     main = "Exemplars of doen and laten: Dim 2 and 3")
The result is displayed in Figure 3.
[Plot: Exemplars of doen and laten: Dim 2 and 3; x-axis: D2; y-axis: D3]
Figure 3. Dimensions 2 and 3.
3.5. Interpret the dimensions. The dimensions on the MDS maps can be interpreted with the help of
two methods. The first one involves fitting a set of linear regression models and finding the variables
that are the most correlated with the MDS dimensions. The second one is based on mapping of the
semantic and syntactic features on the MDS space and a visual examination of the configurations.
Method A. Simple linear regression analyses
One can run a series of simple linear regression analyses with the coordinates of the points on a
given dimension as the response and each semantic or syntactic variable in turn as the predictor.
One can decide which semantic and syntactic features are relevant by comparing the adjusted R2 or
other goodness-of-fit statistics. To do it automatically, you can use the following code:
y <- Dutch.mds$conf[, 1] # the coordinates on the first dimension
adjR2 <- sapply(1:31, function(x)
  summary(lm(y ~ Dutch[, x]))$adj.r.squared)
res <- data.frame(colnames(Dutch)[-32], adjR2)
The most important variables are as follows:
res[order(-adjR2), ]
   colnames.Dutch...32.       adjR2
31           EPTransNew 0.590518116
14            CausedSem 0.546254492
                  CeSem 0.409542722
22               CeSynt 0.408975441
16             CeEnergy 0.397741704
25               CePers 0.374181248
28               AffPOS 0.202812703
23                CePOS 0.172476340
6                 Modal 0.097648966
15             CeIntent 0.082501050
11            Coref_new 0.077670553
17               CrSynt 0.058471400
30                AffNo 0.052711913
20               CrPers 0.045935659
9                     …
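The ranking above can also be wrapped in a small function, so that the same procedure is easy to repeat for other dimensions and does not depend on hard-coded column positions. This is only a sketch on simulated data: the function name rank.features() and the toy variables Valency and Noise are hypothetical.

```r
set.seed(42)
n <- 60
# Hypothetical data: 'Valency' is built to be related to the first
# coordinate, 'Noise' is not
toy <- data.frame(
  Valency = factor(rep(c("Intr", "Tr", "Ditr"), each = n / 3)),
  Noise   = factor(sample(c("a", "b"), n, replace = TRUE))
)
shift <- c(Intr = -0.5, Tr = 0.5, Ditr = 0.3)
conf <- cbind(D1 = unname(shift[as.character(toy$Valency)]) + rnorm(n, sd = 0.2),
              D2 = rnorm(n))

# Adjusted R-squared of one simple regression per feature, sorted
rank.features <- function(coord, data) {
  adjR2 <- sapply(names(data), function(v)
    summary(lm(coord ~ data[[v]]))$adj.r.squared)
  sort(adjR2, decreasing = TRUE)
}

res.d1 <- rank.features(conf[, "D1"], toy)
res.d1
```

With the real data, a call such as rank.features(Dutch.mds$conf[, 2], Dutch[, -32]) would presumably give the corresponding ranking for the second dimension.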
This suggests that the variable EPTransNew, which describes the valency of the Effected Predicate, is
the variable most strongly associated with the first (horizontal) dimension. To interpret the
direction of the association, one can check the summary of the linear regression model with EPTransNew:
> summary(lm(y ~ EPTransNew, data = Dutch))
Call:
lm(formula = y ~ EPTransNew, data = Dutch)

Residuals:
     Min       1Q   Median       3Q      Max
-0.68343 -0.16952 -0.02443  0.13837  0.94399

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.25605    0.03554  -7.204 1.26e-10 ***
EPTransNewTr    0.71747    0.05978  12.003  < 2e-16 ***
EPTransNewDitr  0.49366    0.28655   1.723   0.0881 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2843 on 97 degrees of freedom
Multiple R-squared:  0.5988,    Adjusted R-squared:  0.5905
F-statistic: 72.38 on 2 and 97 DF,  p-value: < 2.2e-16
The positive coefficients of EPTransNew = “Tr” and EPTransNew = “Ditr” show that the exemplars
with transitive and ditransitive Effected Predicates (e.g. make X kill Y or make X give Y Z) have higher
values on Dimension 1 than the observations with the reference level (EPTransNew = "Intr", i.e.
intransitive predicates, e.g. make X go), although the estimate for ditransitives is not significant
at the 0.05 level. Other important variables are the semantic class of the
caused event (physical and social on the left; mental on the right) and some properties of the Causee
(inanimate patient-like unmarked Causees on the left; agentive, animate and marked by instrumental
door ‘by’ on the right). This dimension thus contrasts more direct causation (on the left) and less
direct causation (on the right).
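When reading such a summary, it helps to remember that each coefficient of a categorical predictor is the difference between the mean of that level and the mean of the reference level. The sketch below uses made-up coordinates and labels. Note that factor() orders levels alphabetically by default, so "Ditr" would be the reference level; that "Intr" was set as the reference in the real data (for instance with relevel()) is an assumption based on the output above.

```r
# Hypothetical coordinates and valency labels
y <- c(-0.4, -0.2, -0.3, 0.5, 0.4, 0.6, 0.1, 0.3)
valency <- factor(c("Intr", "Intr", "Intr", "Tr", "Tr", "Tr", "Ditr", "Ditr"))
levels(valency)                            # alphabetical by default
valency <- relevel(valency, ref = "Intr")  # make "Intr" the reference level

fit <- lm(y ~ valency)
coef(fit)

# Each coefficient equals the difference between a group mean and the
# mean of the reference level; the intercept is the reference mean itself
group.means <- tapply(y, valency, mean)
group.means - group.means["Intr"]
```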
Method B. Visual inspection
One can also visualize the values of this and other variables on the MDS map by using different
colours:
plot(Dutch.mds$conf, type = "n", main = "Valency of Effected Predicate")
points(Dutch.mds$conf[Dutch$EPTransNew == "Intr", ], pch = 16, col = "red")
points(Dutch.mds$conf[Dutch$EPTransNew == "Tr", ], pch = 16, col = "blue")
points(Dutch.mds$conf[Dutch$EPTransNew == "Ditr", ], pch = 16, col = "green")
legend("topright", c("Intr", "Tr", "Ditr"), pch = 16,
       col = c("red", "blue", "green"))
The result is shown in Figure 4.
[Plot: Valency of Effected Predicate; legend: Intr, Tr, Ditr; x-axis: D1; y-axis: D2]
Figure 4. Valency of Effected Predicates: Dimensions 1 and 2.
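For variables with many levels, writing one points() call per level becomes tedious. One possible shortcut, sketched here on simulated data, is to keep a named vector of colours and index it by the factor values, which yields one colour per exemplar. The objects conf, valency and palette.val are hypothetical.

```r
set.seed(7)
# Hypothetical two-dimensional configuration and valency labels
conf <- cbind(D1 = rnorm(30), D2 = rnorm(30))
valency <- factor(sample(c("Intr", "Tr", "Ditr"), 30, replace = TRUE),
                  levels = c("Intr", "Tr", "Ditr"))

# One colour per factor level; indexing by the level name gives
# one colour per exemplar
palette.val <- c(Intr = "red", Tr = "blue", Ditr = "green")
point.col <- palette.val[as.character(valency)]

plot(conf, pch = 16, col = point.col, main = "Valency of Effected Predicate")
legend("topright", legend = names(palette.val), pch = 16, col = palette.val)
```

Exemplars with a missing value for the variable get the colour NA and are simply not drawn, which matches the behaviour of the subsetting approach only approximately, so the result is worth checking when a variable has many gaps.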
Repeating the procedure for the second dimension, one finds that it is strongly
associated with the semantic class of the Causer. Exemplars with abstract Causers (as well as nominal
and 3rd person ones) tend to be located at the bottom, whereas those with human Causers and some
related classes are found at the top (also 1st person and pronominal Causees). On the third dimension,
exemplars with singular, pronominal and 1st person Causees have higher values than those with
plural, nominal and 3rd person Causees.
3.6. Plot the exemplars of specific constructions (here: doen and laten). In order to represent the
exemplars of doen and laten with different symbols, on dimensions 1 and 2, one can use the
following code (see Figure 5):
plot(Dutch.mds$conf, type = "n",
     main = "Exemplars of doen and laten: Dim 1 and 2")
points(Dutch.mds$conf[Dutch$Aux == "doen", ], pch = 16, col = "red")
points(Dutch.mds$conf[Dutch$Aux == "laten", ], pch = 16, col = "blue")
legend("topright", c("doen", "laten"), pch = 16, col = c("red", "blue"))
[Plot: Exemplars of doen and laten: Dim 1 and 2; legend: doen, laten; x-axis: D1; y-axis: D2]
Figure 5. Exemplars of doen and laten. Dimensions 1 and 2.
For dimensions 2 and 3, you can use the following code (see Figure 6):
plot(Dutch.mds$conf[, 2:3], type = "n",
     main = "Exemplars of doen and laten: Dim 2 and 3")
points(Dutch.mds$conf[Dutch$Aux == "doen", 2:3], pch = 16, col = "red")
points(Dutch.mds$conf[Dutch$Aux == "laten", 2:3], pch = 16, col = "blue")
legend("topright", c("doen", "laten"), pch = 16, col = c("red", "blue"))
[Plot: Exemplars of doen and laten: Dim 2 and 3; legend: doen, laten; x-axis: D2; y-axis: D3]
Figure 6. Exemplars of doen and laten. Dimensions 2 and 3.
One can also visualize the density of the exemplars (similar to density of population) by using 2D
contour plots. See the result in Figure 7 for dimensions 1 and 2 and in Figure 8 for dimensions 2 and
3.
dens.doen <- kde2d(Dutch.mds$conf[Dutch$Aux == "doen", 1],
                   Dutch.mds$conf[Dutch$Aux == "doen", 2])
dens.laten <- kde2d(Dutch.mds$conf[Dutch$Aux == "laten", 1],
                    Dutch.mds$conf[Dutch$Aux == "laten", 2])
plot(Dutch.mds$conf, main = "Contour plot: Dim 1 and 2", pch = 16, col =
"grey")
contour(dens.doen, add = TRUE, col = "red")
contour(dens.laten, add = TRUE, col = "blue")
legend("topright", c("doen", "laten"), col = c("red", "blue"), lty = 1)
[Plot: Contour plot: Dim 1 and 2; contour lines for doen and laten; x-axis: D1; y-axis: D2]
Figure 7. Contour plot of doen and laten. Dimensions 1 and 2.
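It may help to look at what kde2d() returns: a list with the grid coordinates x and y and a matrix z of estimated densities, which contour() then draws. By default, each call builds its grid from the range of its own data, so the two constructions above are evaluated on different grids. The sketch below, on simulated coordinates, passes shared limits through the lims argument so that the densities are comparable cell by cell. The objects grp.a, grp.b and the helper peak() are hypothetical.

```r
library(MASS)
set.seed(3)
# Hypothetical coordinates for two groups of exemplars
grp.a <- cbind(rnorm(40, mean = -0.5, sd = 0.3), rnorm(40, sd = 0.3))
grp.b <- cbind(rnorm(60, mean =  0.5, sd = 0.3), rnorm(60, sd = 0.3))
all.pts <- rbind(grp.a, grp.b)

# Shared limits c(xmin, xmax, ymin, ymax): both densities use the same grid
lims <- c(range(all.pts[, 1]), range(all.pts[, 2]))
dens.a <- kde2d(grp.a[, 1], grp.a[, 2], n = 50, lims = lims)
dens.b <- kde2d(grp.b[, 1], grp.b[, 2], n = 50, lims = lims)

str(dens.a)   # grid coordinates x, y and the density matrix z

# Grid location where a group is densest
peak <- function(dens) {
  idx <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
  c(x = dens$x[idx[1]], y = dens$y[idx[2]])
}
peak(dens.a)
peak(dens.b)
```

The peak locations give a rough numerical counterpart to the innermost contour lines and can be reported alongside the plot.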
dens.doen <- kde2d(Dutch.mds$conf[Dutch$Aux == "doen", 2],
                   Dutch.mds$conf[Dutch$Aux == "doen", 3])
dens.laten <- kde2d(Dutch.mds$conf[Dutch$Aux == "laten", 2],
                    Dutch.mds$conf[Dutch$Aux == "laten", 3])
plot(Dutch.mds$conf[, 2:3], main = "Contour plot: Dim 2 and 3", pch = 16,
col = "grey")
contour(dens.doen, add = TRUE, col = "red")
contour(dens.laten, add = TRUE, col = "blue")
legend("topright", c("doen", "laten"), col = c("red", "blue"), lty = 1)
[Plot: Contour plot: Dim 2 and 3; contour lines for doen and laten; x-axis: D2; y-axis: D3]
Figure 8. Contour plot of doen and laten. Dimensions 2 and 3.
4. Interpretation
The visual examination suggests that the exemplars of doen have lower values on dimension 1
(intransitive Effected Predicates, non-agentive/inanimate Causees, physical and social caused events)
and lower values on dimension 2 (abstract Causers) than the exemplars of laten (transitive Effected
Predicates, animate/agentive Causees, mental caused events; human Causers). Thus, the
construction with doen is primarily associated with non-intentional direct causation, whereas the
construction with laten typically designates intentional interpersonal indirect causation. The
differences in the position of the exemplars of both constructions with regard to the third dimension
are less clear, however.
Of course, the analyses presented above are only exploratory. One needs hypothesis-testing
methods, such as logistic regression, in order to test the observations made on the basis of the visual
inspection.
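As a rough illustration of such a follow-up test, the sketch below fits a logistic regression of the auxiliary on a single feature. The counts are invented for the example and do not come from the Dutch data. With the real data, one would presumably use Aux and the relevant contextual variables from the data frame Dutch.

```r
# Hypothetical counts: 50 intransitive and 50 transitive exemplars
toy <- data.frame(
  Valency = factor(rep(c("Intr", "Tr"), each = 50), levels = c("Intr", "Tr")),
  Aux     = factor(c(rep(c("doen", "laten"), c(40, 10)),
                     rep(c("doen", "laten"), c(10, 40))))
)

# With a factor response, glm() models the probability of the second
# level, here "laten"
fit <- glm(Aux ~ Valency, data = toy, family = binomial)
summary(fit)$coefficients

# Odds ratio of laten (vs. doen) for transitive vs. intransitive predicates
odds.ratio <- exp(coef(fit)["ValencyTr"])
odds.ratio
```

In these invented counts the odds of laten are 10/40 for intransitive and 40/10 for transitive predicates, so the odds ratio is 16; a positive and significant coefficient of this kind would support the pattern seen on the map.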