How to make semantic maps with R
(maps based on contextual features of exemplars)
by Natalia Levshina
version as of 06.09.2015

1. Introduction

This tutorial illustrates a general approach to making probabilistic MDS-based semantic maps, which represent the similarities between exemplars of one or more constructions in one language as proximities in a low-dimensional Multidimensional Scaling (MDS) map. These similarities are based on overlapping semantic and syntactic features of the exemplars. The approach is illustrated by a case study that shows how one can model the semantic space of two Dutch causative auxiliaries, doen ‘do’ and laten ‘let’, on the basis of 31 syntactic, semantic and morphological features.

If you use the code from this tutorial, please refer to:
Levshina, N. 2011. Doe wat je niet laten kan [Do what you cannot let]: A usage-based analysis of Dutch causative constructions. PhD diss., University of Leuven.

This approach can be particularly useful in the following cases:
a) you want to identify clusters of senses, rather than individual semantic features that help you explain lexical or grammatical variation;
b) you want to investigate the areas where the borders between lexical or grammatical categories are fuzzy, and where they are clear-cut;
c) you want to identify the prototypical core and periphery of lexical or grammatical categories. See an example in Levshina, Natalia. In press. An integrative exemplar-based model of semantic structure: The Dutch causative construction with laten. In J. Yoon & S. Th. Gries (Eds.), Construction Grammar beyond English: Observational and experimental approaches;
d) the data are ‘small n, large p’, i.e. there are many variables and relatively few observations;
e) the contextual variables are highly intercorrelated;
f) the data contain many missing values, which makes other methods, such as regression analysis or Multiple Correspondence Analysis, more difficult to use.

2. Data

The data for the case study can be downloaded as the R data object Dutch.Rdata from my personal website http://www.natalialevshina.com/statistics.html.

Dutch analytic causatives consist of two verbal components, the Causative Auxiliary (doen “do” or laten “let”) and the infinitive, which is called the Effected Predicate. There are also several nominal slots: the Causer, the Causee and the Affectee (in the case of transitive Effected Predicates). Consider an example:

(1) De  generaal liet    het leger de  stad vernielen.
    the general  let.PST the army  the city destroy.INF
    “The general ordered the army to destroy the city.”

Here de generaal “the general” is the Causer, liet “let” is the Causative Auxiliary, het leger “the army” is the Causee, de stad “the city” is the Affectee and vernielen “destroy” is the Effected Predicate. For more information about the constructions, see Levshina (2011).

The data frame Dutch contains 100 examples of Dutch analytic causatives (rows) with the causative auxiliaries doen “do” and laten “let” (variable ‘Aux’), coded for 31 contextual (semantic, syntactic and morphological) features from Levshina (2011).

    > str(Dutch)
    'data.frame': 100 obs. of 32 variables:
     $ ClauseTense: Factor w/ 5 levels "Fut","Past","PastPerf",..: 2 4 3
     $ SyntFun    : Factor w/ 3 levels "InfClause","Pred",..: 2 2 2
     $ Clause     : Factor w/ 5 levels "Add","Adv","Compl",..: 4 4 2
     $ Sent       : Factor w/ 2 levels "Decl","Q": 1 1 1 ...
     $ Adv        : Factor w/ 10 levels "Degree","Dur",..: 5 5 5 ...
     $ Modal      : Factor w/ 4 levels "kunnen","moeten",..: 4 3 3...
     ... ... ...
     […]
     $ Aux        : Factor w/ 2 levels "doen","laten": 2 1 1 ...

3. R code

3.1. First, install (if needed) and load the packages that you will need to reproduce the code in this tutorial.

    # install.packages(c("cluster", "smacof", "MASS"))  # only if not yet installed
    library(cluster)  # daisy(): Gower distances
    library(smacof)   # smacofSym(): Multidimensional Scaling
    library(MASS)     # kde2d(): two-dimensional kernel density estimates

3.2. Make a distance matrix by using Gower distances (function daisy() in the package cluster).
For a data frame with categorical data, this function compares the values of the categorical variables in each pair of exemplars and turns the similarities into distances. Do not forget to exclude the column with the auxiliaries (column 32, Aux).

    Dutch.dist <- daisy(Dutch[, -32])  # drop column 32 (Aux)

3.3. Perform Multidimensional Scaling (MDS) with the iterative majorization algorithm with the help of smacofSym() from the package smacof.

MDS is a dimensionality-reduction technique that represents distances from a distance matrix as proximities on a low-dimensional map. We will use the non-metric (‘ordinal’) method (note that it may take some time to run). This means that the algorithm tries to preserve the rank order of the original distance values, rather than the numeric values of the distances.

First, we perform several MDS analyses with a varying number of dimensions (from 1 to 10) and create a scree plot that will help us decide on the optimal number of dimensions. To automate the procedure, you can use the following code:

    # stress of the ordinal MDS solution for 1 to 10 dimensions
    stress <- sapply(1:10, function(x)
      smacofSym(Dutch.dist, type = "ordinal", ndim = x)$stress)
    stress
     [1] 0.39763392 0.23978828 0.16912232 0.13058995 0.10507390 0.08707420
     [7] 0.07432792 0.06379829 0.05506099 0.04882822

    plot(1:10, stress, type = "b", xlab = "n of dimensions", ylab = "stress",
         main = "Scree plot of stress in ordinal MDS")

Figure 1. Scree plot of stress in ordinal MDS.

The result is shown in Figure 1. Usually, one tries to identify the place where the plot ‘elbows’, but here we do not find any such place. In what follows, we will investigate the three-dimensional solution, which has a stress value of 0.17. The reader is encouraged to try to interpret other dimensions. The three-dimensional solution can be created as follows:

    Dutch.mds <- smacofSym(Dutch.dist, type = "ordinal", ndim = 3)

3.4. Visualize the data in the form of a semantic map.
First, we create a map with all exemplars represented as points. To plot dimensions 1 and 2, one can use the following code:

    plot(Dutch.mds$conf, main = "Exemplars of doen and laten: Dim 1 and 2")

Figure 2. Exemplars of Dutch analytic causatives. Dimensions 1 and 2.

In order to plot the second and third dimensions, one can do the following:

    plot(Dutch.mds$conf[, 2:3], main = "Exemplars of doen and laten: Dim 2 and 3")

The result is displayed in Figure 3.

Figure 3. Dimensions 2 and 3.

3.5. Interpret the dimensions.

The dimensions on the MDS maps can be interpreted with the help of two methods. The first one involves fitting a set of linear regression models and finding the variables that are most strongly associated with the MDS dimensions. The second one is based on mapping the semantic and syntactic features onto the MDS space and visually examining the resulting configurations.

Method A. Simple linear regression analyses

One can run a series of simple linear regression analyses with the coordinates of the points on a given dimension as the response and each of the semantic and syntactic variables in turn as the predictor. One can decide which semantic and syntactic features are relevant by comparing the adjusted R² or other goodness-of-fit statistics. To do it automatically, you can use the following code:

    y <- Dutch.mds$conf[, 1]  # the coordinates on the first dimension
    # one simple regression per feature (columns 1 to 31); keep the adjusted R-squared
    adjR2 <- sapply(1:31, function(x) summary(lm(y ~ Dutch[, x]))$adj.r.squared)
    res <- data.frame(colnames(Dutch)[-32], adjR2)

The most important variables are as follows (the row number of CeSem is missing in the source):

    res[order(-adjR2), ]
       colnames.Dutch...32.       adjR2
    31           EPTransNew 0.590518116
    14            CausedSem 0.546254492
                      CeSem 0.409542722
    22               CeSynt 0.408975441
    16             CeEnergy 0.397741704
    25               CePers 0.374181248
    28               AffPOS 0.202812703
    23                CePOS 0.172476340
    6                 Modal 0.097648966
    15             CeIntent 0.082501050
    11            Coref_new 0.077670553
    17               CrSynt 0.058471400
    30                AffNo 0.052711913
    20               CrPers 0.045935659
    9                     …

This suggests that the variable EPTransNew, which describes the valency of the Effected Predicate, is the most strongly associated with the horizontal dimension (Dimension 1). To interpret the direction, one can check the summary of the linear regression model with EPTransNew:

    > summary(lm(y ~ EPTransNew, data = Dutch))

    Call:
    lm(formula = y ~ EPTransNew, data = Dutch)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.68343 -0.16952 -0.02443  0.13837  0.94399

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)    -0.25605    0.03554  -7.204 1.26e-10 ***
    EPTransNewTr    0.71747    0.05978  12.003  < 2e-16 ***
    EPTransNewDitr  0.49366    0.28655   1.723   0.0881 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.2843 on 97 degrees of freedom
    Multiple R-squared:  0.5988, Adjusted R-squared:  0.5905
    F-statistic: 72.38 on 2 and 97 DF,  p-value: < 2.2e-16

The positive coefficients of EPTransNew = “Tr” and EPTransNew = “Ditr” show that the exemplars with transitive and ditransitive Effected Predicates, e.g. make X kill Y or make X give Y Z, have higher values on Dimension 1 than the observations with the reference level (EPTransNew = “Intr”, i.e. intransitive predicates, e.g. make X go). Note, however, that only the coefficient for “Tr” is significant at the 0.05 level. Other important variables are the semantic class of the caused event (physical and social on the left; mental on the right) and some properties of the Causee (inanimate, patient-like, unmarked Causees on the left; agentive, animate Causees marked by the instrumental preposition door ‘by’ on the right). This dimension thus contrasts more direct causation (on the left) and less direct causation (on the right).

Method B. Visual inspection

One can also visualize the values of this and other variables on the MDS map by using different colours:

    plot(Dutch.mds$conf, type = "n", main = "Valency of Effected Predicate")
    # drop = FALSE keeps a matrix even if a level has only one exemplar;
    # otherwise points() would receive a plain vector and misplace the point
    points(Dutch.mds$conf[Dutch$EPTransNew == "Intr", , drop = FALSE], pch = 16, col = "red")
    points(Dutch.mds$conf[Dutch$EPTransNew == "Tr", , drop = FALSE], pch = 16, col = "blue")
    points(Dutch.mds$conf[Dutch$EPTransNew == "Ditr", , drop = FALSE], pch = 16, col = "green")
    legend("topright", c("Intr", "Tr", "Ditr"), pch = 16, col = c("red", "blue", "green"))

The result is shown in Figure 4.

Figure 4. Valency of Effected Predicates: Dimensions 1 and 2.

Repeating the procedure for the second dimension, one can learn that the latter is strongly associated with the semantic class of the Causer. Exemplars with abstract Causers (as well as nominal and 3rd person) tend to be located at the bottom, whereas those with human Causers and some related classes are found at the top (also 1st person and pronominal Causees). On the third dimension, Causees that are singular, pronominal and 1st person have higher values than those that are plural, nominal and 3rd person.

3.6. Plot the exemplars of specific constructions (here: doen and laten).

In order to represent the exemplars of doen and laten with different colours on dimensions 1 and 2, one can use the following code (see Figure 5):

    plot(Dutch.mds$conf, type = "n", main = "Exemplars of doen and laten: Dim 1 and 2")
    points(Dutch.mds$conf[Dutch$Aux == "doen", ], pch = 16, col = "red")
    points(Dutch.mds$conf[Dutch$Aux == "laten", ], pch = 16, col = "blue")
    legend("topright", c("doen", "laten"), pch = 16, col = c("red", "blue"))

Figure 5. Exemplars of doen and laten. Dimensions 1 and 2.
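The same plotting pattern (an empty plot, one colour per group, a legend) recurs for every variable and every pair of dimensions, so it can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name plot_groups and the toy objects toy_conf and toy_aux are hypothetical and do not come from the tutorial's data.

```r
# Hypothetical helper: plot an MDS configuration with one colour per group.
plot_groups <- function(conf, groups, dims = 1:2, cols = NULL, main = "") {
  groups <- factor(groups)
  lev <- levels(groups)
  # default: palette colours 2, 3, ... (one per level)
  if (is.null(cols)) cols <- seq_along(lev) + 1
  # drop = FALSE keeps a two-column matrix in all cases
  xy <- conf[, dims, drop = FALSE]
  plot(xy, type = "n", main = main,
       xlab = paste0("D", dims[1]), ylab = paste0("D", dims[2]))
  points(xy, pch = 16, col = cols[as.integer(groups)])
  legend("topright", lev, pch = 16, col = cols)
  invisible(setNames(cols, lev))  # level-to-colour mapping that was used
}

# Toy data standing in for Dutch.mds$conf and Dutch$Aux (assumption)
set.seed(1)
toy_conf <- matrix(rnorm(60), ncol = 3)
toy_aux <- rep(c("doen", "laten"), each = 10)
used <- plot_groups(toy_conf, toy_aux, dims = 2:3,
                    cols = c("red", "blue"),
                    main = "Toy example: Dim 2 and 3")
```

With the objects from this tutorial, the corresponding call would presumably be plot_groups(Dutch.mds$conf, Dutch$Aux, dims = 1:2, cols = c("red", "blue")).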
For dimensions 2 and 3, you can use the following code (see Figure 6):

    plot(Dutch.mds$conf[, 2:3], type = "n", main = "Exemplars of doen and laten: Dim 2 and 3")
    points(Dutch.mds$conf[Dutch$Aux == "doen", 2:3], pch = 16, col = "red")
    points(Dutch.mds$conf[Dutch$Aux == "laten", 2:3], pch = 16, col = "blue")
    legend("topright", c("doen", "laten"), pch = 16, col = c("red", "blue"))

Figure 6. Exemplars of doen and laten. Dimensions 2 and 3.

One can also visualize the density of the exemplars (similar to population density) by using 2D contour plots. See the result in Figure 7 for dimensions 1 and 2 and in Figure 8 for dimensions 2 and 3.

    # 2D kernel density estimates for each auxiliary on dimensions 1 and 2
    dens.doen <- kde2d(Dutch.mds$conf[Dutch$Aux == "doen", 1],
                       Dutch.mds$conf[Dutch$Aux == "doen", 2])
    dens.laten <- kde2d(Dutch.mds$conf[Dutch$Aux == "laten", 1],
                        Dutch.mds$conf[Dutch$Aux == "laten", 2])
    plot(Dutch.mds$conf, main = "Contour plot: Dim 1 and 2", pch = 16, col = "grey")
    contour(dens.doen, add = TRUE, col = "red")
    contour(dens.laten, add = TRUE, col = "blue")
    legend("topright", c("doen", "laten"), col = c("red", "blue"), lty = 1)

Figure 7. Contour plot of doen and laten. Dimensions 1 and 2.

    # the same for dimensions 2 and 3
    dens.doen <- kde2d(Dutch.mds$conf[Dutch$Aux == "doen", 2],
                       Dutch.mds$conf[Dutch$Aux == "doen", 3])
    dens.laten <- kde2d(Dutch.mds$conf[Dutch$Aux == "laten", 2],
                        Dutch.mds$conf[Dutch$Aux == "laten", 3])
    plot(Dutch.mds$conf[, 2:3], main = "Contour plot: Dim 2 and 3", pch = 16, col = "grey")
    contour(dens.doen, add = TRUE, col = "red")
    contour(dens.laten, add = TRUE, col = "blue")
    legend("topright", c("doen", "laten"), col = c("red", "blue"), lty = 1)

Figure 8. Contour plot of doen and laten. Dimensions 2 and 3.

4. Interpretation

The visual examination suggests that the exemplars of doen have lower values on dimension 1 (intransitive Effected Predicates, non-agentive/inanimate Causees, physical and social caused events) and lower values on dimension 2 (abstract Causers) than the exemplars of laten (transitive Effected Predicates, animate/agentive Causees, mental caused events; human Causers). Thus, the construction with doen is primarily associated with non-intentional direct causation, whereas the construction with laten typically designates intentional, interpersonal, indirect causation. The differences between the two constructions on the third dimension are less clear, however.

Of course, the analyses presented above are only exploratory. One needs hypothesis-testing methods, such as logistic regression, in order to test the observations made on the basis of the visual inspection.
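As a purely illustrative sketch of such a follow-up test, the code below fits a binary logistic regression that predicts the auxiliary from the valency of the Effected Predicate. It runs on simulated toy data, not on the Dutch data frame: the object toy, the variable name EPTrans and the probabilities used in the simulation are assumptions made only so that the example is self-contained.

```r
# Toy data (assumption): 200 simulated exemplars with a two-level valency feature
set.seed(42)
n <- 200
toy <- data.frame(
  EPTrans = factor(sample(c("Intr", "Tr"), n, replace = TRUE),
                   levels = c("Intr", "Tr"))
)
# Simulated tendency: laten is more likely with transitive predicates
p_laten <- ifelse(toy$EPTrans == "Tr", 0.8, 0.4)
toy$Aux <- factor(ifelse(runif(n) < p_laten, "laten", "doen"),
                  levels = c("doen", "laten"))

# With a factor response, glm() models the probability of the second
# level (here: laten) against the first level (doen)
fit <- glm(Aux ~ EPTrans, data = toy, family = binomial)
coefs <- summary(fit)$coefficients
print(coefs)
```

A positive and significant estimate for EPTransTr would indicate that transitive Effected Predicates increase the odds of laten. With the real data, one could replace toy with Dutch and use the variables identified in Section 3.5 as predictors, for example Aux ~ EPTransNew; bear in mind that sparse levels and missing values may require additional care.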