Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification

Christian Hennig and Tim F. Liao∗

Department of Statistical Science, UCL, Department of Sociology, University of Illinois

August 3, 2010

Abstract

Data with mixed type (metric/ordinal/nominal) variables can be clustered by a latent class mixture model approach, which assumes local independence. Such data are typical in social stratification, which is the application that motivates the current paper. We explore whether the latent class approach groups similar observations together and compare it to dissimilarity based clustering (k-medoids). The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the BIC, average silhouette width and the Calinski and Harabasz index. The comparison is based on a philosophy of cluster analysis that connects the problem of choosing a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. According to this philosophy, model assumptions serve to understand such implications but are not taken to be true. It is emphasised that researchers implicitly define the “true” clustering and number of clusters by the choice of a particular methodology. It is illustrated that even if there is a true model, a clustering that does not attempt to estimate this truth may be preferable. The researcher has to take the responsibility to specify the criteria on which such a comparison can be made. The application of this philosophy to data from the 2007 US Survey of Consumer Finances implies some techniques to obtain an interpretable clustering in an ambiguous situation.

Keywords: mixture model, k-medoids clustering, dissimilarity design, number of clusters, interpretation of clustering

∗ Research Report No. 308, Department of Statistical Science, University College London. Date: August 2010.

1 Introduction

In this paper we explore the use of formal cluster analysis methods for social stratification based on mixed type data with continuous, ordinal and nominal variables. Two quite different approaches are compared, namely a latent class/finite mixture model for mixed type data (Vermunt and Magidson, 2002), in which different clusters are modelled by underlying distributions with different parameters (mixture components), and a dissimilarity based approach not based on probability models (k-medoids or “partitioning around medoids”, Kaufman and Rousseeuw, 1990) with different methods to estimate the number of clusters. The application that motivated our work on mixed type data is social stratification, in which such data typically arise (as in social data in general). The focus of this paper is on the statistical side, including some general thoughts about the choice and design of cluster analysis methods that could be helpful for general cluster analysis in a variety of areas. Another publication, in which the sociological background is emphasised and discussed in more detail, is in preparation. The philosophy behind the choice of a cluster analysis method in the present paper is that it should be driven by the way concepts like “similarity” and “belonging together in the same class” are interpreted by the subject-matter researchers, and by the way the clustering results are used. This can be difficult to decide in practice.
The concept of social class is central to social science research, either as a subject in itself or as an explanatory basis for social, behavioral, and health outcomes. The study of social class has a long history, from the social investigation by the classical social thinker Marx to today's ongoing academic interest in issues of social class and stratification for both research and teaching purposes (e.g., Grusky, Ku, and Szelényi 2008). Researchers in various social sciences use social class and social stratification as explanatory variables to study a wide range of outcomes, from health and mortality (Pekkanen et al. 1995) to cultural consumption (Chan and Goldthorpe 2007). When social scientists employ social class or stratification as an explanatory variable, they follow either or both of two common practices, namely using one or more indicators of social stratification such as education and income, or using some version of occupational class, often aggregated or grouped into a small number of occupational categories. For example, Pekkanen et al. (1995) compared health outcomes and mortality between white-collar and blue-collar workers; Chan and Goldthorpe (2007) analyzed the effects of social stratification on cultural consumption with a variety of variables representing stratification, including education, income, occupation classification, and social status (a variable they operationalized themselves). The reason for researchers routinely using some indicators of social class is simple: There is no agreed-upon definition of social class, let alone a specific agreed-upon operationalization of it. Neither is the usage of social stratification unified. Various concepts of social classes are present in the sociological literature, including a “classless” society (e.g., Kingston, 2000), a society with a gradational structure (e.g., Lenski, 1954) and a society in which discrete classes (interpreted in various, but usually not data-based ways) are an unmistakable social reality (e.g., Wright, 1997). The question to be addressed by cluster analysis is not to settle the issue conclusively in favour of a certain concept, but rather to “let the data speak” concerning the issues discussed in the literature. It is of interest whether clear clusters are apparent in data consisting of indicators of social class, but also how these data can be partitioned in a potentially useful and interpretable way even without claiming that these classes are necessarily “undeniably real”; they may rather serve as an efficient reduction of the information in the data and as a tool to decompose and interpret inequality. In this way, multidimensional manifestations of inequality can be structured. A latent class model was proposed for this by Grusky and Weeden (2008) and applied to (albeit one-dimensional) inequality data by Liao (2006). It is also of interest how clusterings of relevant data relate to theoretical concepts of social stratification applied to the data. A problem is that typically in data used for social stratification there is no clear separation between clusters on the metric variables, whereas categorical variables may create artificial gaps. Similar data have been analysed by multiple correspondence analysis (e.g., Chapter 6 of Le Roux and Rouanet, 2010), on which a cluster analysis can be based. This, however, requires continuous variables to be categorised and seems more suitable with a larger number of categorical variables.
A main task of the present paper is to relate the characteristics of different cluster analysis methods to the subject matter. When comparing the methods, the main focus is not whether the assumption of an underlying mixture probability model is justified or not. Whereas such an assumption can probably not be defended, the clustering outcomes of estimating latent class models may still make sense, depending on whether the underlying cluster concept is appropriate for the subject matter (the present study is somewhat ambiguous about this question). The different methods are therefore compared based on the characteristics of their resulting clusterings. “Model assumptions” are taken into account in order to understand these characteristics properly, not in order to be verified or refuted. Important characteristics are the assumption of “local independence” in latent class clustering (see Section 2) and the question whether the methods are successful in bringing similar observations together in the same cluster. In Section 2 latent class clustering is introduced. Section 3 discusses the philosophy underlying the choice of a suitable cluster analysis methodology. Dissimilarity based clustering requires a dissimilarity measure, the design of which is treated in Section 4. Based on this, Section 5 introduces partitioning around medoids along with some indexes to estimate the number of clusters. Section 6 presents a comparative simulation study. In Section 7, the methodology is applied to data from the US Survey of Consumer Finances, and a concluding discussion is given in Section 8.

2 Latent class clustering

This paper deals with the cluster analysis of data with continuous, ordinal and nominal variables. Denote the data w_1, ..., w_n, w_i = (x_i, y_i, z_i), x_i \in \mathbb{R}^p, y_i \in O_1 \times \ldots \times O_q, z_i \in C_1 \times \ldots \times C_r, i = 1, ..., n, where O_j, j = 1, ..., q are ordered finite sets and C_j, j = 1, ..., r are unordered finite sets. A standard method to cluster such datasets is latent class clustering (Vermunt and Magidson 2002), where w_1, ..., w_n are modelled as i.i.d., generated by a distribution with density

    f(w) = \sum_{h=1}^{k} \pi_h \varphi_{a_h, \Sigma_h}(x) \prod_{j=1}^{q} \tau_{hj}(y_j) \prod_{j=1}^{r} \tau_{h(q+j)}(z_j),   (2.1)

where w = (x, y, z) is defined as w_i above without subscript i. Furthermore \sum_{h=1}^{k} \pi_h = 1, \varphi_{a,\Sigma} denotes the p-dimensional Gaussian density with mean vector a and covariance matrix \Sigma (which may be restricted, for example to be a diagonal matrix), and \sum_{y \in O_j} \tau_{hj}(y) = \sum_{z \in C_j} \tau_{h(q+j)}(z) = 1, \pi_h \ge 0, \tau_{hj} \ge 0 \; \forall h, j. A way to use the ordinal information in the y-variables is to restrict, for j = 1, ..., q,

    \tau_{hj}(y) = \frac{\exp(\eta_{hjy})}{\sum_{u \in O_j} \exp(\eta_{hju})}, \qquad \eta_{hjy} = \beta_{j\xi(y)} + \beta_{hj}\,\xi(y),   (2.2)

where \xi(y) is a score for the ordinal values y. This is based on the adjacent-category logit model (see Agresti, 2002), as used in Vermunt and Magidson (2005). The score can be assigned by use of background information, certain scaling methods (see, e.g., Gifi, 1990), or as standard scores 1, 2, ..., |O_j| for the ordered values if there is no further information about \xi. In some sense, through \xi, ordinal information is used at interval scale level, as in most techniques for ordinal data. Note that in (2.2) it is implicitly assumed that ordinality works by either increasing or decreasing monotonically the mixture component-specific contribution to \tau_{hj}(y) through \beta_{hj}, which is somewhat restrictive.
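To make (2.2) concrete, here is a minimal R sketch (R is also used for the dissimilarity based methods later in the paper) computing the component-specific category probabilities for one ordinal variable with standard scores; the function name and all parameter values are made up for illustration and are not part of any package.

    ## Component-specific ordinal category probabilities as in (2.2), using
    ## standard scores xi(y) = 1, ..., |O_j|. beta.cat: category parameters
    ## beta_{j xi(y)} (one per level); beta.hj: component-specific coefficient.
    tau.ordinal <- function(beta.cat, beta.hj) {
      xi <- seq_along(beta.cat)        # standard scores 1, 2, ..., |O_j|
      eta <- beta.cat + beta.hj * xi   # eta_{hjy} as in (2.2)
      exp(eta) / sum(exp(eta))         # probabilities sum to 1
    }
    ## A positive beta.hj shifts probability mass monotonically upwards,
    ## a negative one downwards, illustrating the restriction noted above:
    tau.ordinal(beta.cat = c(0, 0, 0, 0), beta.hj = 0.8)
    tau.ordinal(beta.cat = c(0, 0, 0, 0), beta.hj = -0.8)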
There are several alternative approaches for modelling the effect of ordinality (see, e.g., Agresti, 2002), which incorporate different restrictions such as a certain stochastic order of mixture components (Agresti and Lang, 1993). The latter is particularly difficult to justify in a multidimensional setting in which components may differ in a different way for different variables (or not at all for some). The parameters of (2.1) can be fitted by the method of maximum likelihood (ML) using the EM- or more sophisticated algorithms, and given estimators of the parameters (denoted by hats), points can be classified into clusters (in terms of interpretation identified with mixture components) by maximising the estimated posterior probability that observation w_i was generated by mixture component h under a two-step model for (2.1) in which first a mixture component \gamma_i \in \{1, ..., k\} is generated with P(\gamma_i = h) = \pi_h, and then w_i given \gamma_i = h according to the density

    f_h(w) = \varphi_{a_h, \Sigma_h}(x) \prod_{j=1}^{q} \tau_{hj}(y_j) \prod_{j=1}^{r} \tau_{h(q+j)}(z_j).

Using this, the estimated mixture component or cluster for w_i is

    \hat{\gamma}_i = \arg\max_h \hat{\pi}_h \hat{f}_h(w_i),   (2.3)

where \hat{f}_h denotes the density with all parameter estimators plugged in. The number of mixture components k can be estimated by the Bayesian Information Criterion (BIC). All this is implemented in the software package LatentGOLD (Vermunt and Magidson, 2005; note that its default settings include a certain pseudo-Bayesian correction of the ML-estimator in order to prevent the likelihood from degenerating). The meaning of (2.1) is that, within a mixture component, the continuous variables are assumed to be Gaussian, the discrete (ordinal and nominal) variables are assumed to be independent of the continuous variables and independent of each other (“local independence”), and the ordinal variables are assumed to be distributed as given in (2.2). The question arises why it is justified to interpret the estimated mixture components as “clusters”. The model formulation apparently does not guarantee that the points classified into the same mixture component are similar, which usually is taken as a cluster-defining feature. In the following, we investigate this question and some implications. This includes a discussion of the definition of “similarity” in the given setup, and the comparison of latent class clustering with an alternative similarity-based clustering method. (See Celeux and Govaert, 1991, for an early attempt to relate latent class clustering to certain dissimilarities.)

3 Some clustering philosophy

The merit of the model-based view of statistics is not that the models really hold, and that the methods derived as optimal from the model-based point of view (for example ML) are therefore really the best methods that can be used. Statisticians do not believe that the models are really true, and although it is accepted that statistical models may fit the data better or worse, and there is often a case for using a better-fitting model, it is misleading to discuss model-based methodology as if it were crucial to know whether the model is true.
[Figure 1: Artificial dataset from a 3-component Gaussian mixture (components indicated on the left side) with optimal clustering (right side) from a Gaussian mixture model according to the BIC with within-component covariance matrices restricted to be diagonal.]

The models are rather used as an inspiration to find methodology that otherwise would not have been found. But ultimately a model assumption is only one of many possible aspects that help to understand what a statistical method does. The ultimate goal of statistics cannot be to find out the true underlying model, because chances are that such a model does not exist (the whole frequentist probability setup is based on the idealisation of infinite repetition, and the Bayesian approach uses similar idealisations through the backdoor of “exchangeability”, see Hennig, 2009). Therefore, methodology that is not based on a statistical model may compete with model-based methodology in order to analyse the same data, as is the case in cluster analysis (for example, most hierarchical agglomerative methods such as complete linkage are not based on probability models). The idea of “true clusters” is similarly misleading as the idea of the “true underlying model”. Neither the data alone, nor any model assumed as true, can determine what the “true” clusters in a dataset are. This always depends on what the researchers are looking for, and on their concept of “belonging together”. This can be nicely illustrated in the case of the Gaussian mixture model (the latent class model (2.1) above with q = r = 0). The left side of Figure 1 shows a mixture of three Gaussian distributions with p = 2. Of course, in some applications, it makes sense here to define the appropriate number of clusters as 3, with the mixture components corresponding to the clusters. However, if the application is social stratification, and the variables are income and some status indicator, for example, this is not appropriate, because it would mean that the same social stratum (interpreted to correspond to mixture component no. 1) would contain the poorest people with the lowest status as well as the richest people with the highest status. The cluster concept imposed by the Gaussian mixture model makes a certain sense, but it does not necessarily bring the most similar observations together, which, for some applications, may be inappropriate even if the Gaussian mixture model were true (more examples for this can be found in Hennig, 2010).
Gaussian mixtures are very versatile in approximating almost any density, so that it is in fact almost irrelevant whether data have an obvious “Gaussian mixture shape” in order to apply clustering based on the Gaussian mixture model. The role of the model assumption in cluster analysis is usually (apart from some special applications in which the models can be directly justified) not about the connection of the model to the “truth”, but about formalising what kind of cluster shape the method implies. One alternative to fitting a fully flexible Gaussian mixture is to assume a model in which all within-component covariance matrices are assumed to be diagonal matrices, i.e., the (continuous) variables are assumed to be locally independent. This gives more sensible clusters for the dataset of Figure 1 (see right side) for social stratification, because clusters can no longer be characterised by features of dependence (“income increases with status” for cluster no. 1, comprising low income/low status as well as high income/high status), even though we know that this model is wrong for the simulated dataset. It still does not guarantee that observations are always similar within clusters, because variances may still differ between components, see Figure 1. One may argue that covariance matrices should be even more strongly restricted (for example to be equal across components), but this may be too restrictive for at least some uses of social stratification, in which one would want to distinguish very special distinctive classes (with potentially small within-class variation) from more general classes with larger variation. Without claiming that “diagonal covariance matrices” are the ultimately best approach, we stick to it here (but present a non-model-based alternative later). Another reason for this is that it extends the local independence assumption for ordinal and nominal variables in the latent class model (2.1) to continuous variables. Actually, local independence is the only component-defining assumption in that model for nominal variables. As a side remark, it was already noted in Section 2 that (2.2) is restricted and does not allow density peaks in the middle of the order in the component-specific part. The problem with this is not that (2.2) therefore cannot fit the true distributions of ordinal data (actually, ordinality is a feature of their interpretation, not of their distribution), but it seems to be too restrictive to fit a certain natural (though not unique) concept of what an ordinality-based cluster should look like (although potential remedies for this are not the topic of the current paper). Assuming a mixture of locally independent distributions for nominal data basically means that latent class clustering decomposes dependences between nominal variables into several clusters within which there is independence.
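The effect of the diagonal restriction on continuous variables can be tried out directly. The following R sketch simulates a dataset in the spirit of Figure 1, with one strongly correlated component, and fits Gaussian mixtures with diagonal within-component covariance matrices, choosing the number of components by the BIC; it uses the mclust package, and all parameter values are made up for illustration (they do not reproduce Figure 1 exactly).

    library(MASS)     # mvrnorm for simulating multivariate Gaussians
    library(mclust)   # Gaussian mixture modelling with BIC-based selection
    set.seed(1)
    ## One elongated, strongly correlated component plus two spherical ones:
    x <- rbind(mvrnorm(200, c(3, 3), matrix(c(2, 1.9, 1.9, 2), 2, 2)),
               mvrnorm(150, c(1.5, 5), 0.2 * diag(2)),
               mvrnorm(150, c(5, 1.5), 0.2 * diag(2)))
    ## "VVI": component-wise varying volume and shape, but diagonal
    ## covariance matrices, i.e. local independence of the variables.
    fit <- Mclust(x, G = 1:9, modelNames = "VVI")
    fit$G                      # number of components chosen by the BIC
    table(fit$classification)  # the correlated component is typically split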
Considering the right panel of Figure 1 (our intuition about “clusters” is usually shaped by Euclidean variables), such a decomposition of dependences could make some sense, although it cannot be taken for granted that within-cluster dissimilarity will be small (dissimilarity based clusterings of this dataset are shown in Figure 3).

[Figure 2: Left side: artificial dataset from a 3-component Gaussian mixture. Middle: subset of data from the US Survey of Consumer Finances 2007, income against savings amount (some outliers are outside the plot range and are not shown because they would dominate the plot too much). Right side: same dataset with log-transformed variables (all observations shown).]

An advantage of the local independence assumption in terms of interpretation is that it makes sure that clusters can be interpreted via the within-cluster marginal distributions of the variables alone, which determine the component-wise distributions. Finding out whether the mixture components in a latent class model bring together similar observations requires the definition of dissimilarity between them, which is done in Section 4. Often in cluster analysis it is required to estimate the number of clusters k. In the latent class cluster model (2.1), this can be done by the BIC, which penalises the loglikelihood for every k with (1/2) log(n) times the number of free parameters in the model. It can be expected that the BIC estimates k consistently if the model really holds (though as far as we know this is only proven for a much simpler setup, see Keribin 2000). If the model does not hold precisely, it is quite problematic to use such a result as a justification of the BIC, because the true situation may be best approximated by very many, if not infinitely many, mixture components if there are only enough observations, which is not of interpretative value. In general, the problem of estimating the number of clusters is notoriously difficult. It requires a definition of what the true clusters are, which looks straightforward only if there is a simple enough true underlying model such as (2.1), and it has been illustrated above that even in this case it is not as clear as it seems to be. It generally depends on the application; sometimes clusters are required to be separated by gaps the size of which depends on the application, sometimes clusters are not allowed to contain large within-cluster dissimilarities, and in some applications it is required that clusterings remain stable under small changes of the data, whereas this is unnecessary in other applications. Sometimes the idea of “truth” is connected to the idea of an unobservable true underlying model, sometimes it is connected to some external information, and sometimes “truth” is only connected to the observable data.
Unfortunately, up to now, the vast majority of the literature is not explicit enough about how cluster analysis methods and methods to estimate the number of clusters connect to the underlying cluster concept, which has to be decided by the researcher. Deciding about a criterion to “estimate” the true number of clusters often actually rather means defining this number. Most such criteria can be expected to give a proper estimate in situations that seem intuitively absolutely clear (see the left side of Figure 2; all methods discussed in the present paper yield the same clustering with k = 3 for this dataset), but datasets in social stratification are far more often so complex that there is no clear clustering that immediately comes to mind when looking at the data, in this sense supporting the idea of a gradational structure of society. In the middle (and on the right side, in log-transformed form) of Figure 2, only two variables from a (random) subset of data from the US Survey of Consumer Finances 2007 are shown; assume that social strata should be somewhat informative, so there should probably be two or more of them even if there are no clear “gaps” in the data. However, it is illusory, at least for two-dimensional data, to expect that a formal cluster analysis method can reveal clearer grouping structure than what our eyes can see in such a plot. If, as in the middle or on the right side of Figure 2, k cannot be determined easily by looking at the graph, it can be expected that criteria to estimate k run into difficulties as well and may come up with a variety of numbers. For higher dimensional data, of course it is possible that a clearer grouping structure exists than what can be seen in two-dimensional scatterplots of the data, but generally it cannot be taken for granted that a clear and more or less unique grouping can be found. Therefore the researchers have to live with the situation that different methods can produce quite different clustering solutions without clearly indicating which one is best. Some researchers may believe that using a formal criterion to determine the number of clusters is more “objective” and therefore better than fixing it manually, but this only shifts subjectivity to the decision about which criterion to use. Most criteria to estimate the number of clusters can only be heuristically motivated. The penalised likelihood used in the BIC seems to be reasonable for theoretical reasons in such cases, but it is quite difficult to understand what its implications are in terms of interpretation, and why its exact definition should be in any sense optimal. There are certain dissimilarity-based criteria (see Sections 5 and 6) that may be seen as more directly appealing, but any of these may behave oddly in certain situations. Usually, they are only convincing for comparisons over a fairly limited range of values of k and may degenerate for too large k (and/or k = 1), so that the researcher should make at least a rough decision about the order of magnitude k should have for the given application (if mixture models are to be applied, one may also use a prior distribution over the numbers of clusters in a Bayesian approach; such a prior would then not be about “belief” but rather about “tuning” the method in order to balance desired numbers of clusters against “what the data say”).
In some situations in which formal and graphical analysis suggests that it is illusory to get an objective and well justified estimation of the number of clusters, this may be an argument to fix the number of clusters at some value roughly seen to be useful, even though there is no strong subject-matter justification for any particular value. Note that usually the literature implies that fixing the number of clusters means that its true value is known for some reason, but in practice the difficult problem of estimating this number is often avoided (or done in a far from convincing way) even if there are no strong reasons for knowing the number. A particular difficulty with mixed type data is that the standard clustering intuition of most people is determined by the idea of “clumps” and “separation” in Euclidean space, but this is inappropriate for discrete ordinal and nominal data. Such data can in principle be represented in Euclidean space (standard scores can be used for ordinal variables, and nominal variables can be decomposed into one indicator variable for each category, so that no inappropriate artificial quantitative information is added). But if this is done, there are automatically “clumps” and “gaps”, because the observations clump on the (usually small) number of admissible values (see, for example, Figures 4 and 5). Whereas in principle such data can still be analysed using Gaussian mixtures or other standard methods for Euclidean data, the discreteness may produce clustering artifacts. This makes it difficult to have a clear intuition about what clustering means for discrete data. The “local independence” assumption in latent class clustering looks attractive because it at least yields a clear formal description of a cluster, which, however, may not agree with the aim of the researcher. In conclusion, the researchers should not hope that the data will tell them the “objectively true” clustering if they only choose the optimal methodology, because the choice of methodology implicitly defines what the true clusters are. Therefore, this choice requires several decisions of the researchers on how to formalise the aim of clustering in the given application. In social stratification, the researchers cannot expect the data to determine what the true social strata are, but they need to define first how social strata should be diagnosed from the data (what “dissimilarity” means, and what the underlying cluster concept is).

4 Defining dissimilarity

In order to discuss whether or not latent class clustering (or any other clustering method) puts similar observations together into the same cluster, a formal definition of “dissimilarity” is needed (dissimilarity measures are treated as dual to similarity measures here). As with the choice of the clustering method, this is a highly nontrivial task that depends on decisions of the researcher, because the measure should reflect what is taken as “similar” in a given application (see Hennig and Hausdorf, 2006, for some general discussion of dissimilarity design). The crucial task for mixed type data is how to aggregate and how to weight the different variables against each other. Variable-wise dissimilarities can be aggregated, for example, in a Euclidean or in a Gower/Manhattan-style.
The Euclidean distance between two objects x_i, x_j on p continuous variables, x_i = (x_{i1}, ..., x_{ip}) and analogously for x_j, is defined as

    d_E(x_i, x_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2} = \sqrt{\sum_{l=1}^{p} d_l(x_{il}, x_{jl})^2},

where d_l is the standard dissimilarity (absolute value of the difference) on variable l. The so-called Gower distance (Gower, 1971) aggregates mixed type variables in the same way as the Manhattan or L_1-distance aggregates continuous variables:

    d_G(x_i, x_j) = \sum_{l=1}^{p} d_l(x_{il}, x_{jl}).

Variable weights w_l can easily be incorporated in both aggregation schemes by multiplying the d_l with constants depending on l, which is equivalent to multiplying the variables by w_l. Although d_G seems to be the more direct and intuitive aggregation method (at least if standard transformations of Euclidean space such as rotations do not seem to be meaningful because of incompatible meanings of the variables) and was recommended by Gower for mixed type data, there is an important advantage of d_E for datasets with many observations and not so many variables, as are common in social stratification. Many computations for Euclidean distances (such as the clara clustering method and the Calinski and Harabasz index, see Section 5) can be carried out directly from the “observations × variables”-matrix and do not need the handling of an (often too large) full dissimilarity matrix. Therefore Euclidean aggregation is preferred here. The following discussion will be in terms of the variables, which are then aggregated as in the definition of d_E. The definition of d_E can be extended to mixed type variables in a similar way in which Gower extended d_G. Ordinal variables can be used with standard scores. Of course, alternative scores can be used if available. An alternative that preserves the ordinal nature of the data but is often dubious in terms of interpretation is to rank the variable values so that mean ranks are used for ties. This introduces larger dissimilarities between neighbouring categories with many observations than between neighbouring categories with few observations. It depends on the application whether this is suitable, but it runs counter to the intuition behind clustering to some extent, because it introduces large dissimilarities between “densely populated” sets of neighbouring categories, which may be regarded as giving rise to clusters. The different values of a nominal variable should not carry numerical information, and therefore nominal variables should be replaced by binary indicator variables for all their values (let m_j denote the number of categories of variable j; technically only m_j − 1 binary variables would be needed to represent all information, but in terms of dissimilarity definition, leaving one of the categories out would lead to asymmetric treatment of the categories). The variables then need to be weighted (or, equivalently, standardised by multiplying them with constant factors; adding constants to “center” variables can be done as well but is irrelevant for dissimilarity design) in order to make them comparable for aggregation. There are two aspects of this, the statistical aspect (the variables need to have comparable distributions of values) and the substantive aspect (subject matter knowledge may suggest that some variables are more or less important for clustering).
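As a minimal illustration of the two aggregation schemes with weights (the function names are ours, not from any package):

    ## Variable-wise dissimilarities d_l aggregated Euclidean-style (d_E) and
    ## Gower/Manhattan-style (d_G), with weights w_l; weighting the d_l is
    ## equivalent to multiplying variable l by w_l.
    d.E <- function(xi, xj, w = rep(1, length(xi))) sqrt(sum((w * (xi - xj))^2))
    d.G <- function(xi, xj, w = rep(1, length(xi))) sum(w * abs(xi - xj))
    xi <- c(1.2, 0, 1); xj <- c(0.7, 1, 1)
    d.E(xi, xj); d.G(xi, xj)          # unweighted
    d.E(xi, xj, w = c(1, 0.5, 2))     # down-/up-weighting single variables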
For most of the rest of this section, the substantive aspect is ignored, but it should not be forgotten that after applying the following considerations for statistical reasons, further weighting factors can be incorporated for substantive reasons (a subtle example is given in Section 7), and it will also turn out that the two aspects cannot be perfectly separated. Before weighting, it also makes sense to think about transformations of the continuous variables. The main rationale for transformation in the philosophy adopted here is that transformation makes sense if distances on the transformed variable better reflect the “interpretative distance” between cases in terms of the application. It is for example not the aim of transformation here to make data “look more normal” for its own sake, although this sometimes coincides with a better reflection of interpretative distances. For example, for the variables giving income and savings amount in the middle of Figure 2, log-transformations were applied (right side of Figure 2). The distributional pattern of the transformed data looks somewhat healthier to statisticians (and there are no outliers dominating the plot anymore, as there were for the untransformed variables), but the main argument for the transformation is that in terms of social stratification, it makes sense to allow proportionally higher variation within high-income and/or high-savings clusters; the interpretative difference between two people with yearly incomes of $2m and $4m is not clearly larger than, but rather about equal to, the interpretative difference between $20,000 and $40,000. Of course, transformations like log(x + 1) may be needed to deal with zeroes in the data, and researchers should feel encouraged to come up with more creative ideas (such as piecewise linear transformations to compress some value ranges more than others) if these add something from the “interpretative” subject-matter perspective. Transformations that are deemed sensible for dissimilarity definition should also be applied before running latent class/Gaussian mixture clustering. There are various ways of standardisation to make the variation of continuous variables comparable, which comprise for example

• range standardisation (for example to [0, 1]),
• standardisation to unit variance,
• standardisation to unit interquartile range (or median absolute deviation).

The main difference here is how the methods deal with extreme observations. A major disadvantage of range standardisation is that it is governed by the two most extreme observations, and in the presence of outliers this can mean that pairwise differences on such a variable between a vast majority of observations could be approximately zero and only the outliers are considerably far away from the rest. This is problematic if valuable structure (in terms of the interpretation) is expected among the non-outliers. On the other hand, range standardisation guarantees that the maximum within-variable distances are equal over all variables. The opposite is to use a robust statistic for standardisation such as the interquartile range, which is not affected by extreme outliers. This has a different disadvantage in the presence of outliers, because if there are extreme outliers on a certain variable, the distances between these outliers and the rest of the observations on this variable can still be very large, and outliers on certain variables may dominate distances on other variables when aggregating.
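A small sketch of the behaviour just described, with one made-up variable containing a gross outlier:

    set.seed(1)
    x <- c(rnorm(100), 50)                      # 100 ordinary values, one outlier
    x.range <- (x - min(x)) / (max(x) - min(x)) # range standardisation
    x.var   <- x / sd(x)                        # unit variance
    x.iqr   <- x / IQR(x)                       # unit interquartile range
    ## Under range standardisation the non-outliers are squeezed into a tiny
    ## range; under IQR standardisation their spread stays large, but so does
    ## the outlier's distance to all of them:
    diff(range(x.range[1:100]))     # spread of the non-outliers, close to 0
    diff(range(x.iqr[1:100]))       # spread of the non-outliers, much larger
    max(x.iqr) - max(x.iqr[1:100])  # outlier still very far from the rest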
In the present paper standardisation to unit variance is adopted, which is a compromise between the two approaches discussed before. Variable-wise extreme outliers are problematic under any approach, though, and it is preferable to handle them by transformation in advance (Winsorising, see Tukey 1962, may help). Categorical variables have to be standardised so that the variable-wise distances between their levels are properly comparable to the distances between observations on continuous variables with unit variance. Nominal variables are discussed first. Assume that for a single original binary variable there are two variables in the dataset, namely the dummy indicator variables for both of its levels (as discussed above, this is necessary for symmetric treatment of categories for general nominal variables, although it would not be necessary for binary variables). E(X_1 − X_2)^2 = 2 holds for i.i.d. random variables X_1, X_2 with variance 1. For an originally nominal variable with I categories, let Y_{ij}, i = 1, ..., I, be the value of observation j on dummy variable i. A rationale to standardise the dummy variables Y_i is to achieve \sum_{i=1}^{I} E(Y_{i1} − Y_{i2})^2 = q E(X_1 − X_2)^2 = 2q with some factor q. The rationale for this is that in the Euclidean distance d_E, \sum_{i=1}^{I} (Y_{i1} − Y_{i2})^2 is aggregated with (X_1 − X_2)^2. It may seem natural to set q = 1, so that the expected contribution from the nominal variable equals, on average, the contribution from continuous variables with unit variance. However, if the resulting dissimilarity is used for clustering, there is a problem with q = 1, namely that, because the distance between two identical categories is zero, it makes the difference between two different levels of the nominal variable (potentially much) larger than E(X_1 − X_2)^2, and therefore it introduces wide gaps (i.e., subsets between which there are large distances), which could force a cluster analysis method into identifying clusters with levels of the categorical variable too easily. Therefore we rather recommend q = 1/2, though larger values can be chosen if it is deemed, in the given application, that the nominal variables should carry higher weight. A small comparison can be found in Section 6.3. q = 1/2 implies that, for an originally binary variable for which the probability of both categories is about 1/2 (which implies that the number of pairs of observations in the same category is about equal to the number of pairs of observations in different categories), the effective distance between the two categories is about equal to E(X_1 − X_2)^2 = 2 (and correspondingly lower for variables with more than two categories). There is another consideration regarding the standardisation of the Y_i-variables so that

    \sum_{i=1}^{I} E(Y_{i1} − Y_{i2})^2 = 2q.   (4.1)

(Y_{i1} − Y_{i2})^2 can only be 0 or 1 for dummy variables, but the expected value depends on the category probabilities. As a default, for standardisation these can be taken to be 1/I for each category. An obvious alternative is to estimate them from the data. However, it may not be desired that the effective distance between two categories depends on the empirical category distribution in such a way; it would for example imply that the distance between categories would be much larger for a binary variable with only a very small probability for one of the categories. Whether this is appropriate can again only be decided taking into account the meaning and interpretative weight of the variables.
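The following sketch turns a nominal variable into dummy columns scaled so that (4.1) holds with uniform category probabilities 1/I, for which the unscaled sum of expected squared differences is 2(1 − 1/I); the function name is ours.

    ## Dummy columns for a nominal variable, scaled so that (4.1) holds with
    ## uniform category probabilities 1/I: unscaled, sum_i E(Y_i1 - Y_i2)^2
    ## = 2(1 - 1/I), so each dummy is multiplied by sqrt(q / (1 - 1/I)).
    nominal.dummies <- function(z, q = 1/2) {
      z <- factor(z)
      I <- nlevels(z)
      D <- model.matrix(~ z - 1)   # one indicator column per category
      D * sqrt(q / (1 - 1/I))
    }
    z <- factor(c("a", "b", "b", "c", "a", "c"))
    nominal.dummies(z)
    ## Squared distance between two observations in different categories is
    ## 2q / (1 - 1/I), e.g. 1.5 for I = 3 with q = 1/2 (and 2 for I = 2):
    sum((nominal.dummies(z)[1, ] - nominal.dummies(z)[2, ])^2)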
For ordinal variables Y with standard coding and I categories, we suggest

    E(Y_1 − Y_2)^2 = 2q, \qquad q = \frac{1}{1 + 1/(I − 1)}   (4.2)

as the rationale for standardisation, which for binary variables (I = 2; for binary variables there is no difference between ordinal and nominal scale type) yields the same expected variable-wise contribution to the Euclidean distance as (4.1), and q → 1 for I → ∞ means that with more levels the expected contribution converges toward that of a continuous variable. The same considerations as above hold for the computation of the expected value. To summarise, the overall dissimilarity used here is defined by

• Euclidean aggregation of variables,
• suitable transformation of continuous variables,
• standardisation of (transformed) continuous variables to unit variance,
• using I dummy variables for each nominal variable, standardised according to (4.1) with q = 1/2,
• using standard coding for each ordinal variable, standardised according to (4.2),
• additional weighting of variables according to subject matter requirements.

5 Dissimilarity based clustering

A dissimilarity based clustering method that is suitable as an alternative to latent class clustering for social stratification data is “partitioning around medoids” (Kaufman and Rousseeuw, 1990), implemented as function pam in the add-on package cluster for the software system R (www.r-project.org). pam is based on the full dissimilarity matrix and may therefore require too much memory for large datasets (n ≈ 20,000 as in the example dataset in Section 7 is quite typical for social stratification examples). “Clustering large applications” (function clara in cluster) is an approximative version for Euclidean distances that can be computed for much larger datasets. There are many alternative dissimilarity based clustering methods in the literature (see, for example, Kaufman and Rousseeuw, 1990, Gordon, 1999), such as the classical hierarchical ones (single or complete linkage), but most of these require full dissimilarity matrices and are unfeasible for too large datasets. pam and clara minimise (approximately), for a dissimilarity measure d, the objective function

    g(w_1^*, ..., w_k^*) = \sum_{i=1}^{n} \min_{j \in \{1,...,k\}} d(w_i, w_j^*)   (5.1)

by choice of k medoids w_1^*, ..., w_k^* from w. Note that this is similar to the popular k-means clustering, but somewhat more flexible in terms of cluster shapes, more robust by using d instead of d^2 (see Kaufman and Rousseeuw, 1990; gross outliers could still cause problems and should rather be handled using transformations in the definition of the dissimilarity), and more appropriate for mixed type data because the medoids are not, as in k-means, computed as mean vectors, but are required to be members of the dataset, so no means are computed for nominal and ordinal categories. The difference is illustrated in Figure 3. pam was used here instead of clara because the dataset is small enough to make this computationally feasible. The pam and k-means clusterings (both with the number of clusters optimised by the CH-criterion between 2 and 9, see below) are similar (which they are quite often), but cluster 8 in the k-means solution seems to include too many points from the diagonal mixture component, causing a big gap within the cluster. The corresponding pam cluster 4 only contains one of these points. This happens because the pam criterion can tolerate larger within-cluster dissimilarities in cluster 5 (corresponding to cluster 7 in the k-means solution), and because a single point from the diagonal component has such an influence on the cluster mean (but not on the cluster medoid) that further points are included under k-means.
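In R, running these methods on a matrix prepared as in Section 4 looks as follows (a minimal sketch, with a made-up stand-in for the prepared data matrix):

    library(cluster)
    set.seed(1)
    ## Stand-in for a numeric matrix whose columns have been transformed,
    ## standardised and weighted as in Section 4 (continuous columns plus
    ## scaled dummy and ordinal columns).
    dat <- matrix(rnorm(500 * 4), 500, 4)
    d <- dist(dat)                   # Euclidean aggregation of the columns
    pam.cl <- pam(d, k = 3)          # needs the full dissimilarity matrix
    clara.cl <- clara(dat, k = 3, samples = 50)  # subsample-based, for large n
    table(pam.cl$clustering, clara.cl$clustering)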
[Figure 3: Artificial dataset with pam clustering with k = 9 (left side) and 8-means clustering (right side), number of clusters chosen optimally according to the CH-criterion for both methods.]

There are several possibilities to estimate the number of clusters for clara. Some of them are listed (though treated there in connection with k-means) in Sugar and James (2003); an older but more comprehensive simulation study is given in Milligan and Cooper (1985). Taking into account the discussion in Section 3, it is recommended to use a criterion that allows for direct interpretation and cannot only be justified based on model-based theory. In the present paper, three such criteria are used (a computational sketch of all three is given below):

Average Silhouette Width (ASW) (Kaufman and Rousseeuw, 1990). For a partition of w into clusters C_1, ..., C_k, let

    s(i, k) = \frac{b(i, k) − a(i, k)}{\max(a(i, k), b(i, k))}

be the so-called “silhouette width”, where, for w_i \in C_h,

    a(i, k) = \frac{1}{|C_h| − 1} \sum_{w_j \in C_h} d(w_i, w_j), \qquad b(i, k) = \min_{C_l \not\ni w_i} \frac{1}{|C_l|} \sum_{w_j \in C_l} d(w_i, w_j).

The ASW estimate k_{ASW} maximises \frac{1}{n} \sum_{i=1}^{n} s(i, k). The rationale is that b(i, k) − a(i, k) measures how well chosen C_h is as a cluster for w_i. If, for example, b(i, k) − a(i, k) < 0, w_i is further away, on average, from the observations of its own cluster C_h than from those of the cluster C_l \not\ni w_i minimising the average dissimilarity from w_i. A gap between two clusters would make b(i, k) − a(i, k) large for the points of these clusters, whereas splitting up a homogeneous data subset could be expected to decrease b(i, k) more strongly than a(i, k).

Calinski and Harabasz index (CH) (Calinski and Harabasz, 1974). The CH estimate k_{CH} maximises the CH index \frac{B(k)(n − k)}{W(k)(k − 1)}, where

    W(k) = \sum_{h=1}^{k} \frac{1}{|C_h|} \sum_{w_i, w_j \in C_h} d(w_i, w_j)^2, \qquad B(k) = \frac{1}{n} \sum_{i,j=1}^{n} d(w_i, w_j)^2 − W(k).

In the original definition, assuming Euclidean distances, B(k) is the between-cluster means sum of squares and W(k) is the within-clusters sum of squared distances from the cluster means. The form given here is equivalent but can be applied to general dissimilarity measures.
The index is attractive for direct interpretation because in clustering it is generally attempted to make the between-cluster dissimilarities large and the within-cluster dissimilarities small at the same time. These need to be properly scaled in order to reflect how they can be expected to change with k. This index was quite successful in the simulations of Milligan and Cooper (1985), which indicates that \frac{B(k)(n−k)}{W(k)(k−1)}, derived from a standard F-statistic, is a good way of scaling. However, it should be noted that using squared dissimilarities in the index makes its use together with clara look somewhat inconsistent; it is more directly connected to k-means. As far as we know, it is not discussed in the literature how to scale a ratio of unsquared between-cluster and within-cluster dissimilarities in order to estimate the number of clusters, and even if this could be done properly, another reason in favour of CH is that it can be more easily computed for large datasets, because a complete dissimilarity matrix is not required and CH can easily be computed from the overall and within-cluster covariance matrices. Furthermore, Milligan and Cooper (1985) used the index with various clustering methods, some of which were not based on squared dissimilarities.

Pearson version of Hubert's Γ (PH). The PH estimator k_Γ maximises the Pearson correlation ρ(d, m) between the vector d of pairwise dissimilarities and the binary vector m that is 0 for every pair of observations in the same cluster and 1 for every pair of observations in different clusters. It therefore measures, in some sense, how good the clustering is as an approximation of the dissimilarity matrix. This is not exactly what is of interest in social stratification, and therefore this criterion will not be used directly for estimating the number of clusters here, but will serve as an external (though not exactly “independent”) criterion to compare the solutions yielded by the other approaches. Hubert's Γ (Baker and Hubert 1975) was originally defined in a way similar to the above definition, but with Goodman and Kruskal's rank correlation coefficient Γ instead of the Pearson correlation. The Pearson version (as proposed under the name “Normalized Γ” in Halkidi, Batistakis and Vazirgiannis, 2001) is used here, because it is computationally demanding to compute the original version for even moderately large datasets (n > 200 or so). A side effect of using the Pearson version is that large dissimilarities within clusters are penalised more, but it is affected more by outliers than the original version.

Although the problem is clearest with CH, none of the three indexes is directly connected to the clara objective function (for example by adding a penalty term to it, as the BIC does with the loglikelihood). There is a certain tradition in non-model based clustering and cluster validation of using indexes for estimating the number of clusters that are not directly connected to the clustering criterion for fixed k. One could argue against this by saying that if an index formalises properly what the researcher is interested in, it should be optimal to optimise this index for both given k and over a range of different values of k. However, one could also argue that if the aim of the study is somewhat imprecise (as in social stratification), having something that was obtained by combining two different (but reasonable) criteria could be more trustworthy (cf. the two clusterings in Figure 3). Also, optimisation over a huge number of possible partitions is more difficult than optimisation over a small number of admissible values for k, and on the other hand it is much more complicated to obtain theoretical results about estimating k than about clustering with fixed k, so what is good for one of these tasks is not necessarily suitable for the other one.
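The three indexes can be computed from a dissimilarity matrix and a clustering as in the following sketch; ASW uses the silhouette function of the cluster package, whereas the CH and PH functions are our own minimal implementations of the forms given above.

    library(cluster)
    asw <- function(d, cl) mean(silhouette(cl, d)[, "sil_width"])
    ch <- function(d, cl) {            # CH in the general dissimilarity form
      d2 <- as.matrix(d)^2
      n <- nrow(d2); k <- length(unique(cl))
      W <- sum(sapply(unique(cl), function(h) {
        ih <- which(cl == h)
        sum(d2[ih, ih]) / length(ih)   # within-cluster term for cluster h
      }))
      B <- sum(d2) / n - W             # between-cluster part
      B * (n - k) / (W * (k - 1))
    }
    ph <- function(d, cl) {            # Pearson version of Hubert's Gamma
      m <- as.dist(outer(cl, cl, "!=") * 1)  # 1 for pairs in different clusters
      cor(as.vector(as.dist(as.matrix(d))), as.vector(m))
    }
    set.seed(1)
    x <- matrix(rnorm(200 * 2), 200, 2); d <- dist(x)
    cl <- pam(d, k = 3)$clustering
    asw(d, cl); ch(d, cl); ph(d, cl)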
Note that all three indexes could be distorted by gross outliers, which therefore need to be handled by transformation in the definition of the dissimilarities. It also has to be pointed out that none of the three indexes applies to k = 1, which means that they estimate k ≥ 2. Assuming that the use that is made of social stratification requires that there are at least two strata, this is not a problem. A strategy to distinguish a homogeneous dataset from k ≥ 2, applicable with any index, is to simulate 1000 datasets, say, of size n, from some null distribution (for example a Gaussian or uniform, potentially categorical or independent mixed continuous-categorical distribution), cluster them with k = 2 fixed, and estimate k = 1 if the observed index value for k = 2 is below the 95% quantile of the simulated index values, assuming implicitly that if in fact k > 2, k = 2 can already be expected to give a significant improvement compared to k = 1.

6 A simulation study

6.1 Data generating models

In order to compare latent class clustering and clara (in the version defined above), a simulation study with eight different data generating models was carried out. In the study, we focused on models with continuous and nominal variables only. For the generation of datasets it is irrelevant whether the interpretation of the levels of categorical variables is nominal or ordinal, but the clustering methods were applied so that categorical variables were treated as nominal. The reason for this is that the potential for designing data generating models with potentially informative simulation outcomes is vast. Therefore it seemed reasonable to suppress the complication added by investigating the mixing of three types of variables, which will be a topic of further research, in order to understand certain situations in detail rather than to cover everything superficially. Data were generated according to (2.1) with different choices of parameters defining the eight different models. Further restrictions were made. There were always two continuous variables, and with the one exception of M7, only one of the continuous variables (X1) was informative about the mixture components, whereas the other variable (X2) was distributed according to a standard Gaussian distribution in all mixture components. Only one sample size was simulated for each model, and the sample sizes were, for computational reasons, generally smaller than those met in typical social stratification data. Below, for component h, n_h denotes the number of observations, and µ_h and σ_h^2 denote the mean and variance of X1.

M1 - 2 components clearly separated in Gaussian variables, for each of them 2 components clearly separated in categorical variables.

Component 1: n_1 = 150, µ_1 = 0, σ_1^2 = 2. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.8   0.1   0.1   0
    Z2                     0.4   0.2   0.2   0.2
    Z3                     0.8   0.1   0.1   0

Component 2: n_2 = 100, µ_2 = 0, σ_2^2 = 2. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0     0.1   0.1   0.8
    Z2                     0.2   0.2   0.2   0.4
    Z3                     0     0.1   0.1   0.8

Component 3: n_3 = 200, µ_3 = 5, σ_3^2 = 1.
Distributions of categorical variables as in Component 1.

Component 4: n_4 = 100, µ_4 = 5, σ_4^2 = 1. Distributions of categorical variables as in Component 2.

M2 - 2 components clearly separated in Gaussian variables, for each of them 2 components not so clearly separated in categorical variables.

Component 1: n_1 = 150, µ_1 = 0, σ_1^2 = 2. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.5   0.2   0.2   0.1
    Z2                     0.4   0.2   0.2   0.2
    Z3                     0.4   0.2   0.2   0.2

Component 2: n_2 = 100, µ_2 = 0, σ_2^2 = 2. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.2   0.1   0.2   0.5
    Z2                     0.2   0.2   0.2   0.4
    Z3                     0.2   0.2   0.2   0.4

Component 3: n_3 = 200, µ_3 = 5, σ_3^2 = 1. Distributions of categorical variables as in Component 1.

Component 4: n_4 = 100, µ_4 = 5, σ_4^2 = 1. Distributions of categorical variables as in Component 2.

M3 - 2 overlapping components in Gaussian variables, for each of them 2 components not clearly separated in categorical variables. As M2, but with σ_1^2 = σ_2^2 = 3, σ_3^2 = σ_4^2 = 2.

M4 - 2 components clearly separated in Gaussian variables, for each of them 2 components not so clearly separated in categorical variables, with 6 categorical variables.

Component 1: n_1 = 150, µ_1 = 0, σ_1^2 = 1. Distributions of categorical variables:

                           Levels
    Categorical variable   1      2      3      4
    Z1                     0.5    0.2    0.2    0.1
    Z2                     0.4    0.2    0.2    0.2
    Z3                     0.4    0.2    0.2    0.2
    Z4                     0.5    0.2    0.2    0.1
    Z5                     0.4    0.2    0.2    0.2
    Z6                     0.25   0.25   0.25   0.25

Component 2: n_2 = 100, µ_2 = 0, σ_2^2 = 1. Distributions of categorical variables:

                           Levels
    Categorical variable   1      2      3      4
    Z1                     0.2    0.1    0.2    0.5
    Z2                     0.2    0.2    0.2    0.4
    Z3                     0.2    0.2    0.2    0.4
    Z4                     0.2    0.1    0.2    0.5
    Z5                     0.2    0.2    0.2    0.4
    Z6                     0.25   0.25   0.25   0.25

Component 3: n_3 = 200, µ_3 = 5, σ_3^2 = 1. Distributions of categorical variables as in Component 1.

Component 4: n_4 = 100, µ_4 = 5, σ_4^2 = 1. Distributions of categorical variables as in Component 2.

M5 - 4 strongly overlapping components in Gaussian variables, with supporting information from a single categorical variable.

Component 1: n_1 = 150, µ_1 = 0, σ_1^2 = 3. Distribution of the categorical variable:

                           Levels
    Categorical variable   1     2      3      4
    Z1                     0.9   0.05   0.05   0.1

Component 2: n_2 = 100, µ_2 = 1, σ_2^2 = 3. Distribution of the categorical variable:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.1   0.8   0.1   0

Component 3: n_3 = 200, µ_3 = 4, σ_3^2 = 2. Distribution of the categorical variable:

                           Levels
    Categorical variable   1     2      3      4
    Z1                     0     0.05   0.05   0.9

Component 4: n_4 = 100, µ_4 = 5, σ_4^2 = 2. Distribution of the categorical variable:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.1   0     0.8   0.1

M6 - 3 components in Gaussian variables, two of which are far away from each other, but with a large-variance component in between, for each of them 2 components not clearly separated in categorical variables.

Component 1: n_1 = 150, µ_1 = 0, σ_1^2 = 1. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2      3      4
    Z1                     0.9   0.05   0.05   0
    Z2                     0.4   0.2    0.2    0.2

Component 2: n_2 = 150, µ_2 = 0, σ_2^2 = 1. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.2   0.2   0.2   0.4
    Z2                     0.1   0.7   0.1   0.1

Component 3: n_3 = 150, µ_3 = 6, σ_3^2 = 1. Distributions of categorical variables as in Component 1.

Component 4: n_4 = 150, µ_4 = 6, σ_4^2 = 1. Distributions of categorical variables:

                           Levels
    Categorical variable   1     2     3     4
    Z1                     0.2   0.2   0.2   0.4
    Z2                     0.1   0.1   0.7   0.1

Component 5: n_5 = 50, µ_5 = 3, σ_5^2 = 4. Distributions of categorical variables as in Component 1.

Component 6: n_6 = 50, µ_6 = 3, σ_6^2 = 4.
Distributions of categorical variables:

    Level    1      2      3      4
    Z1      0.2    0.2    0.2    0.4
    Z2      0.1    0.1    0.1    0.7

M7 - 6 components with non-diagonal covariance matrices, i.e., X1 and X2 dependent. n_1, ..., n_6, µ_1, ..., µ_6 and the distributions of the categorical variables are as in M6. The means of X2 are 0 in every component. The covariance matrix for X1 and X2 in component h is σ_h^2 Σ, where

    Σ = ( 1    0.8 )
        ( 0.8  1   ),

σ_1^2 = σ_2^2 = 0.5, σ_3^2 = σ_4^2 = 4, σ_5^2 = σ_6^2 = 2.

M8 - scale mixture of 2 components in the Gaussian variables, with clear clustering information in categorical variables (note, however, that according to Goodman (1974) the components are not identifiable from the categorical variables alone in this situation, because there are not enough levels for only two variables; the Gaussian mixture makes the overall partition identifiable).

Component 1: n_1 = 200, µ_1 = 0, σ_1^2 = 1. Distributions of categorical variables:

    Level    1      2      3
    Z1      0.9    0.1    0
    Z2      0.8    0.1    0.1

Component 2: n_2 = 300, µ_2 = 0, σ_2^2 = 3. Distributions of categorical variables:

    Level    1      2      3
    Z1      0      0.1    0.9
    Z2      0.1    0.1    0.8

From every model 50 datasets were generated (the small number is mainly due to the computational complexity of fitting latent class clustering).

6.2 Clustering methods and quality measurement

Latent class clustering (LCC) as explained in Section 2 and clara as explained in Section 5, based on Euclidean distances as defined in Section 4, were applied for numbers of clusters k between 2 and 9. The optimal number of clusters was selected by the BIC for latent class clustering. For clara, two different methods to determine the number of clusters were applied, namely CH and ASW, see Section 5. As opposed to ASW and CH, the BIC can theoretically estimate the number of clusters to be 1; it was recorded during the simulations how often this would have happened if 1 had been included as a valid number of clusters. The results of this are not shown because it hardly ever happened.

In order to run the whole simulation study involving different methods and statistics in R, the latent class model was fitted by our own R implementation using the R package flexmix (Leisch 2004), which allows users to implement their own drivers for mixture models not already covered by the package. The EM algorithm is run 10 times from random initial clusterings and the best solution is taken. According to our experience, the resulting EM algorithm generally behaves very similarly to the one in LatentGOLD, although the implemented refinements allow LatentGOLD to find slightly better solutions for some datasets. clara was computed by the function clara in the R package cluster. Default settings were used for both packages.

Several criteria were used in order to assess the quality of the resulting clusterings. The quality of recovery of the "true" partition was measured by the adjusted Rand index (RAND; Hubert and Arabie, 1985). This compares two partitions of the same set of objects (namely here the simulated partition as given in Section 6.1 and the partition yielded by the clustering method). A value of 1 indicates identical partitions, 0 is the expected value if the partitions are independent, and negative values indicate agreement worse than expected by chance.
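The adjusted Rand index is easily computed from the contingency table of two partitions; ready-made implementations exist (e.g., adjustedRandIndex in the R package mclust), but a minimal self-contained sketch is:

    ## Adjusted Rand index (Hubert and Arabie, 1985) between two partitions:
    ## 1 = identical partitions, 0 = expected value under independence.
    adj_rand <- function(cl1, cl2) {
      tab <- table(cl1, cl2)
      a <- sum(choose(rowSums(tab), 2))   # pairs together in partition 1
      b <- sum(choose(colSums(tab), 2))   # pairs together in partition 2
      idx <- sum(choose(tab, 2))          # pairs together in both
      exp_idx <- a * b / choose(sum(tab), 2)
      (idx - exp_idx) / ((a + b) / 2 - exp_idx)
    }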
However, recalling the remarks in Section 3, recovery of the true data generating process is not seen as the main aim of clustering here (in reality such truth is not available anyway, at least not in social stratification). Therefore, it was also measured how well the methods succeeded in bringing similar observations together. A difficulty here is that the criteria by which this can be measured coincide with the criteria that can be used to estimate the number of components based on dissimilarities, see Section 5. The three criteria ASW, CH and PH were applied to all final clustering solutions. ASW and CH were also used to estimate the number of clusters, and it is interesting to see how solutions that are not based on these criteria compare with those for which the respective criterion was optimised, whereas PH was added as an external criterion.

It should be noted that the comparisons are, in several respects, not totally fair, although one could argue that the sources of unfairness are somewhat balanced. Data generation according to (2.1) gives latent class clustering some kind of advantage (particularly regarding RAND) in the simulations, because it makes sure that its model assumptions are met, except for M7, where the covariance matrices for the continuous variables were not diagonal. On the other hand, obviously clara/CH is favoured by the CH criterion, clara/ASW is favoured by the ASW criterion, and clara is by definition generally more closely associated with the dissimilarity based criteria. There is a crucial problem with designing a "fairer" comparison for applications in which "discovering the true underlying data generating mechanism" is not the major aim (this is a problem for cluster validation criteria in general). If there were a criterion that optimally formalised what is required in a given application, one could argue that the clustering method of choice should optimise this criterion directly, which would automatically mean that this criterion could not be used for comparison. Therefore, the more "independent" of the clustering method a criterion is, the less relevant it is expected to be.

Regarding the estimation of the number of clusters, the average estimated number of clusters ANC and its standard deviation SNC were computed. Furthermore, variable impact was evaluated in order to check to what extent clusterings were dominated by certain (continuous or nominal) variables. In order to do this, the clustering methods were applied to all datasets with one variable at a time left out, and the adjusted Rand index was computed between the resulting partition and the one obtained on the full dataset. Values close to 1 here mean that omitting a variable does not change the clustering much, and therefore that the variable has a low impact on the clustering.

6.3 Results

The results of the simulation study are given in Tables 1 and 2. Overall, they are inconclusive. The quality of the methods depends strongly on the criterion by which quality is measured, and to some extent on the simulation setup. Typical standard deviations of simulated values were 0.03 for ASW, 30 for CH, 0.06 for PH and 0.09 for RAND (these varied roughly proportionally with the average and were usually lower for LCC/BIC); they have to be divided by √50 in order to get (very roughly) estimated standard deviations for the averages, but a more realistic idea of the precision of the comparisons can be obtained by paired tests, as sketched below.
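Such a paired comparison only requires the per-dataset values of a criterion for two methods; a minimal sketch, with hypothetical length-50 vectors asw_claraASW and asw_claraCH of ASW values:

    ## paired t-test on per-dataset criterion values of two methods
    t.test(asw_claraASW, asw_claraCH, paired = TRUE)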
For example, a paired t-test comparing the ASW values between clara/ASW and clara/CH in M3 yields p = 0.002, comparing the two (quite close) CH values yields p = 0.003, and comparing clara/ASW and clara/CH according to ASW (average difference 0.004) in M2 is still significant at p = 0.009, whereas the ASW result of LCC/BIC in the same model cannot be distinguished significantly from either of the other two (both average differences 0.002, both p > 0.7).

As could be expected, clara/ASW was best according to ASW and LCC/BIC according to RAND (often strongly so). clara/CH was best according to CH in most situations, but LCC/BIC surprisingly outperformed it according to CH in M2 (where it also achieved a very good ASW result) and M3. It matched clara/CH according to CH in M6 and did at least better than clara/ASW in M5. In some other setups LCC/BIC performed much worse than the dissimilarity based methods according to all three dissimilarity based criteria (particularly M7 and M8). In some setups (M2, M3, M4, M6), LCC/BIC was optimal according to PH. In M4, this seems slightly odd, because LCC/BIC did much worse according to ASW and CH. Overall, this means that LCC/BIC is not necessarily worse than the dissimilarity based methods in grouping similar observations together, but this cannot be relied upon. Unfortunately it is difficult to see what the models M1, M7 and M8 (in which LCC/BIC did badly according to the dissimilarity based criteria) have in common, and what separates them from M2, M3 and M6, in which LCC/BIC did well. An explanation of some of the results is that the components in the Gaussian variables in M7 and M8 cannot properly be interpreted as generating similar within-component observations. M2, M3 and M6 are the models with the strongest impact of the first (cluster separating) Gaussian variable on the LCC/BIC clustering. Interestingly, this impact, and accordingly the dissimilarity based quality of LCC/BIC, is lower in M1, in which the mixture components are even more strongly separated by X1 than in M3; but in M1 the stronger information in the categorical variables seems to act as a distraction.
                    Criterion
M1           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.236   110  0.477  0.447   5.4  2.82
clara/CH    0.213   134  0.433  0.453   2.8  0.82
LCC/BIC     0.179   109  0.427  0.844   3.5  0.50

M2           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.199   129  0.418  0.315   2.9  1.74
clara/CH    0.195   135  0.408  0.333   2.2  0.59
LCC/BIC     0.197   150  0.459  0.492   2.3  0.45

M3           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.187   109  0.368  0.218   3.0  1.95
clara/CH    0.179   113  0.356  0.232   2.4  0.70
LCC/BIC     0.162   120  0.377  0.383   2.3  0.48

M4           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.129  72.7  0.357  0.295   3.2  2.32
clara/CH    0.124  79.0  0.343  0.307   2.3  0.60
LCC/BIC     0.090  60.2  0.396  0.507   3.2  0.47

M5           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.395   189  0.498  0.338   6.2  3.33
clara/CH    0.314   243  0.445  0.342   2.2  0.43
LCC/BIC     0.281   230  0.425  0.414   2.0  0.00

M6           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.297   269  0.503  0.326   2.7  1.39
clara/CH    0.289   288  0.476  0.339   2.0  0.14
LCC/BIC     0.276   288  0.511  0.370   2.1  0.33

M7           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.303   236  0.510  0.271   4.0  2.76
clara/CH    0.294   270  0.518  0.287   2.5  0.50
LCC/BIC     0.229   219  0.363  0.366   2.2  0.39

M8           ASW    CH     PH    RAND   ANC   SNC
clara/ASW   0.314   124  0.518  0.218   6.9  2.17
clara/CH    0.275   142  0.483  0.397   3.9  0.87
LCC/BIC     0.203   101  0.270  0.928   2.0  0.00

Table 1: Average values over 50 simulated datasets from each of models M1-M8 of average silhouette width (ASW), Calinski/Harabasz (CH), Pearson version of Hubert's Γ (PH) and adjusted Rand index between clustering and "truth" (RAND) ("large is good" for all of these), together with the average estimated number of clusters (ANC) and its standard deviation (SNC), for the clustering solutions by clara/ASW, clara/CH and LCC/BIC.

             Variable impact (Rand)
M1            X1     X2     Z1     Z2     Z3
clara/ASW  0.212  0.481  0.503  0.531  0.521
clara/CH   0.168  0.537  0.556  0.611  0.556
LCC/BIC    0.447  0.978  0.481  0.981  0.477

M2            X1     X2     Z1     Z2     Z3
clara/ASW  0.067  0.475  0.505  0.575  0.524
clara/CH   0.055  0.520  0.531  0.623  0.563
LCC/BIC    0.019  0.888  0.920  0.916  0.856

M3            X1     X2     Z1     Z2     Z3
clara/ASW  0.101  0.378  0.436  0.406  0.415
clara/CH   0.098  0.403  0.432  0.419  0.427
LCC/BIC    0.088  0.729  0.751  0.827  0.759

M4            X1     X2     Z1     Z2     Z3     Z4     Z5     Z6
clara/ASW  0.069  0.395  0.419  0.472  0.451  0.468  0.445  0.521
clara/CH   0.056  0.409  0.435  0.499  0.496  0.496  0.459  0.540
LCC/BIC    0.131  0.799  0.652  0.718  0.727  0.680  0.703  0.786

M5            X1     X2     Z1
clara/ASW  0.341  0.507  0.447
clara/CH   0.129  0.739  0.647
LCC/BIC    0.124  0.983  0.590

M6            X1     X2     Z1     Z2
clara/ASW  0.073  0.744  0.796  0.771
clara/CH   0.013  0.861  0.859  0.840
LCC/BIC    0.014  0.991  0.951  0.652

M7            X1     X2     Z1     Z2
clara/ASW  0.162  0.555  0.697  0.653
clara/CH   0.075  0.625  0.765  0.778
LCC/BIC    0.173  0.887  0.934  0.906

M8            X1     X2     Z1     Z2
clara/ASW  0.388  0.393  0.492  0.476
clara/CH   0.448  0.582  0.505  0.543
LCC/BIC    0.805  0.992  0.510  0.818

Table 2: Average over 50 simulated datasets of adjusted Rand index between the clustering on the full dataset and the clustering with one variable omitted. Values near zero mean that the variable has a strong impact.
M1, q = 1     X1     X2     Z1     Z2     Z3    overall RAND
clara/ASW   0.445  0.548  0.449  0.440  0.428      0.385
clara/CH    0.756  0.806  0.419  0.788  0.397      0.431

Table 3: Average over 50 simulated datasets of adjusted Rand index between the clustering on the full dataset and the clustering with a variable omitted, for model M1 with q = 1 in (4.1). The last column gives the adjusted Rand index between the clustering on the full dataset and the true mixture component membership.

Consistently, clara/CH did better than clara/ASW according to RAND, whereas clara/ASW did better than clara/CH according to PH. Generally, results produced by clara/ASW are further away from LCC/BIC than those of clara/CH. This can be explained by the closer connection between CH and the Gaussian likelihood. clara/CH sometimes comes closer to clara/ASW in terms of ASW than clara/ASW comes to clara/CH in terms of CH, which could be taken as a slight advantage of clara/CH. A stronger argument against clara/ASW comes from the lack of stability in the estimation of k highlighted by the consistently high SNC values. Although there is no "best" number of clusters if the dissimilarity based criteria are taken as defining the truth, it is still worrying that clara/ASW comes up with almost erratically varying estimates of k for data generated by the same model. This cannot be fully explained by the fact that the other two methods tend to estimate the lower bound k = 2 very often, which makes it easier to achieve a lower value of SNC. Still, comparing LCC/BIC in M1 with clara/ASW in M2, even a small ANC does not prevent clara/ASW from high variation (SNC).

Obviously, if the major aim is to recover the "true" model, LCC/BIC is to be preferred based on its superior RAND performance. Where RAND is not good in absolute terms (M3, M6, M7), the Gaussian components are strongly overlapping and a brilliant performance cannot be expected (although the information in the categorical variables helped less than one could have hoped). Furthermore, LCC/BIC always comes up with the most stable estimation of k (minimum SNC), although ANC is not always close to the true number of mixture components (which the BIC attempts to estimate, as opposed to ASW and CH). Model M8 illustrates particularly clearly how model recovery measured by RAND can run counter to grouping similar observations together.

Table 2 shows that all methods tend to be dominated by X1, which separates the Gaussian components, except in M8, where there is no mean difference between them. A general pattern is that LCC/BIC tends to totally ignore variables (Rand index close to 1) that do not contribute to the separation of the mixture components. This is a very crucial difference, because it makes LCC/BIC more suitable for finding underlying clusters in the data generating process, whereas the dissimilarity based methods are to be preferred if all variables are seen as important and the clustering method is still expected to differentiate between clusters in variables where no clear "gap" can be found. In social stratification, this seems to make some sense. Generally, the variable impact seems to be more balanced for the dissimilarity based methods, but there are exceptions (for example clara/CH in M6, where none of the methods seemed to make much sense of the categorical variables).
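The variable impact measure underlying Tables 2 and 3 is simple to compute; a sketch using clara from the R package cluster and the adj_rand function above (x is assumed to be the numeric data matrix after dummy coding and scaling):

    ## leave-one-variable-out impact: ARI near 1 = low impact of variable j
    library(cluster)
    variable_impact <- function(x, k) {
      full <- clara(x, k)$clustering
      sapply(seq_len(ncol(x)), function(j)
        adj_rand(full, clara(x[, -j, drop = FALSE], k)$clustering))
    }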
Whereas the impact of the categorical variables on LCC/BIC was often low (particularly where they did not discriminate strongly between "true" latent classes), it is surprising that there is a considerable impact of Z1 for M5 and M8, where the grouping in the categorical variables is theoretically not identifiable. Apparently non-identifiable groupings in the categorical variables are still helpful if they correspond roughly to what goes on in the continuous variables. The choice of q = 1/2 in Section 4 was successful in preventing clara from being dominated by the gaps between the categorical variables, although their impact may be seen as a little too weak, and one may try out some 1/2 < q < 1 as an alternative. On the other hand, in most of the models, for reasons of identifiability of the latent class model, the categorical variables were designed to be globally strongly dependent (though independent within components), which means that the variable impact of each single categorical variable was presumably limited by the fact that others carried similar information.

For the results in Table 3, data from model M1 were fitted by clara with q = 1. This leads to a more uniform variable impact for clara/ASW, but the impact of the continuous variables (particularly the truly informative X1) seems much too low for clara/CH. The dissimilarity based criteria cannot be compared with the values in Table 1 because q = 1 changes the definition of the dissimilarities, but according to RAND (the only "external" criterion left for this comparison), clara/CH is better than clara/ASW, and both are worse than their counterparts with q = 1/2.

7 Application to US Consumer Finances data

For this section, the methods were applied to data from the 2007 US Survey of Consumer Finances. The survey was sponsored by the Board of Governors of the Federal Reserve System in cooperation with the Statistics of Income Division of the Internal Revenue Service in the US, with the data collected by the National Opinion Research Center at the University of Chicago. See Kennickell (2000) for a detailed introduction to the methodology of the survey. The original dataset has 20,090 observations (individuals). For the analysis presented here, 17,560 observations (males without missing values) were used. There were six variables, namely

lsam: log(x + 1) of the total amount of savings as of the average of last month (treated as continuous),

linc: log(x + 1) of the total income of 2006 (treated as continuous),

cacc: the number of checking accounts that one has; this is ordinal with 6 levels (corresponding to no/1/2/3/4 or 5/6 or more accounts; precise differences between large numbers do not seem to matter here),

sacc: the number of savings accounts, coded as cacc,

hous: housing, nominal with 9 levels, namely (in order of coding) "neither owns nor rents", "inapplicable", "owns or is buying/land contract", "pays rent", "condo", "co-op", "townhouse association", "retirement lifetime tenancy" and "own only part",

life: whether or not one has life insurance (binary).

Obviously, these data can only be used to deal with certain aspects of social stratification, basically those connected to income, savings and certain assets; occupation and education are not covered, although the methodology is general enough to cover them as well with suitable data.
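The variable construction can be sketched as follows; the raw column names savings, income2006 and ncheck in the data frame raw are hypothetical placeholders.

    ## log(x+1) transforms; account counts aggregated into 6 ordered levels
    raw$lsam <- log(raw$savings + 1)
    raw$linc <- log(raw$income2006 + 1)
    raw$cacc <- cut(raw$ncheck, breaks = c(-1, 0, 1, 2, 3, 5, Inf),
                    labels = 1:6, ordered_result = TRUE)  # 0/1/2/3/4-5/6+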
Furthermore, in order to interpret the results properly, survey design and representativity of the data would need to be taken into account, but this is ignored in the present paper.

Note that the housing levels are very unequally distributed. "Owns" (72.6%) and "pays rent" (17.9%; categories 3 and 4 in Fig. 4) are by far the strongest categories. Furthermore, categories such as "co-op" and "neither owns nor rents" are in some sense in between the former two in terms of similarity. In order to emphasise the importance of categories 3 and 4 for dissimilarity based clustering, we weighted the two dummy variables belonging to these categories by 2 and then reweighted all dummy variables belonging to hous in order to keep (4.1). This increases the effective distance between categories 3 and 4 compared to all other distances; a sketch is given below. An alternative method to incorporate non-uniform distances between categories would be to replace the dummy variables by the results of a one- or two-dimensional scaling of a dissimilarity matrix between the variable levels and to treat these as continuous (or ordinal).

For the ordinal variables cacc and sacc we used standard scores and not rank scores (see Section 4), because rank scores would put the highest categories closer to the next lower ones in terms of effective distance, owing to the lower frequencies in these categories. This would not be appropriate because the higher categories had already been defined by aggregating several numbers of accounts in order to make the interpretative dissimilarities between neighbouring categories approximately equal. Further variable weighting was not applied, although one could argue that for most aims the information in lsam is more crucial than that in the two ordinal variables sacc and cacc (Table 5 shows that at least for the 12-cluster solutions sacc and cacc did not have a strong impact anyway).
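A rough sketch of this reweighting; df is a hypothetical data frame with hous as a factor, and the final rescaling constant is only a placeholder for the normalisation required by (4.1), whose exact form is defined in Section 4.

    ## up-weight the dummies for "owns" (level 3) and "pays rent" (level 4)
    D <- model.matrix(~ hous - 1, data = df)   # one dummy column per level
    w <- rep(1, ncol(D)); w[c(3, 4)] <- 2      # double weight for levels 3 and 4
    Dw <- sweep(D, 2, w, `*`)
    Dw <- Dw * sqrt(ncol(D) / sum(w^2))        # placeholder rescaling for (4.1)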
Figure 4: Subset of the data from the US Survey of Consumer Finances 2007 (scatterplot matrix) with the clara/CH clustering and "uniformly jittered" categorical variables to avoid overplotting.

LCC/BIC, clara/CH and clara/ASW were applied to this dataset with k ranging from 2 to 20. We used sampsize=1000 in the function clara, which is slower but more stable than the default value. The LCC/BIC solution was computed by LatentGOLD using its default settings. Estimation of k turned out to be somewhat unsatisfactory, because the BIC for the latent class model kept improving beyond k = 20, so that k̂ = 20 was taken as optimal for LCC/BIC. clara/CH yielded k̂ = 2, at the lower bound. According to the ASW, the solutions with 3 and 11 clusters are best, with similar values of the ASW; we concentrate on the 11-cluster solution here, which is better on the data used in Table 4, although not on the subsets of 1,000 observations used internally by clara. As emphasised before, in social stratification the criteria do not estimate a "true" k, but rather define what a suitable number of clusters could be. The researcher therefore has to decide whether the suggested number(s) of clusters are appropriate for the aim of the analysis. The results from the three methods suggest that there is no clear optimum between 2 and 20. Depending on the point of view, solutions at the upper bound could be preferable (if high differentiation is required), or the very rough but nicely interpretable solution with k = 2. The ASW result suggests k = 11 if a solution "somewhere in between" is desired.

We also computed the LCC and clara solutions for k = 12 fixed. Because the estimation of the number of clusters is problematic as discussed above, it may make sense to fix k, taking into account background knowledge and considerations about what the clustering should be used for, where available. Here we used Wright's (1985) number of classes as a benchmark. However, clara's 12-cluster solution is worse than the one with k = 11 according to all criteria.

Whereas it is fast to compute clara solutions with 2-20 clusters even for n = 17,560, computing LCC/BIC in LatentGOLD is more cumbersome, taking about 5-15 minutes for each value of k on our machine. Therefore variable impact (Table 5) was only computed for k = 12 fixed. For better plotting, and in order to evaluate the dissimilarity based criteria, which require the computation of the full dissimilarity matrix, a random subset of size n = 2,000 was drawn. The dissimilarity based criteria were computed for the clusterings restricted to the random subsample as shown in Table 4 (CH could be computed on the full dataset). Again, it strongly depends on the criterion which solution is preferred.
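A sketch of the computations behind Table 4, using clara and cluster.stats from the R packages cluster and fpc; x stands for the numeric data matrix after transformation and weighting (an assumption of this sketch), and cluster.stats returns, among others, avg.silwidth (ASW), ch (CH), pearsongamma (PH), average.within (AW) and average.between (AB):

    library(cluster); library(fpc)
    ## ASW for each k, evaluated on clara's internal subsamples of size 1000
    asw <- sapply(2:20, function(k)
      clara(x, k, sampsize = 1000)$silinfo$avg.width)
    ## dissimilarity based criteria for one solution, on a random subsample
    cl  <- clara(x, 11, sampsize = 1000)$clustering
    sub <- sample(nrow(x), 2000)
    st  <- cluster.stats(dist(x[sub, ]), cl[sub])
    st[c("avg.silwidth", "ch", "pearsongamma",
         "average.within", "average.between")]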
The clara/ASW solution is the best according to the "external" criterion PH and is not bad according to CH either. clara/CH and clara/ASW are both much better than the solution with k = 12, whereas for LCC, k = 12 is better than the BIC solution according to the dissimilarity based criteria. Both LCC solutions are much worse than the clara solutions according to these criteria, though. Given that a clustering solution with k ≈ 12 may be of particular theoretical interest, the statistics in Table 4 seem to suggest the clara/ASW solution with 11 clusters.

                        Criterion
Method         k    ASW    CH    PH     AW     AB
clara/ASW     11   0.328  4984  0.490  1.19   2.86
clara/CH       2   0.305  7416  0.442  2.18   3.16
LCC/BIC       20   0.010  1474  0.404  1.60   2.85
clara fixed   12   0.278  4678  0.435  1.21   2.82
LCC fixed     12   0.145  2582  0.409  1.64   2.87

Table 4: Number of clusters and dissimilarity based criteria for clusterings by clara/ASW, clara/CH and LCC/BIC, and for clara and LCC with k = 12 fixed, on the US Survey of Consumer Finances 2007 data (ASW, PH, AW and AB were evaluated on a random subsample).

             Variable impact (Rand)
Method        lsam   linc   cacc   sacc   hous   life
clara fixed  0.523  0.470  0.708  0.729  0.491  0.492
LCC fixed    0.284  0.503  0.581  0.827  0.556  0.687

Table 5: Adjusted Rand index between the clustering on all variables (full dataset) and the clustering with one variable omitted. Values near one mean that the variable has almost no impact.

Additionally, the average within-cluster dissimilarity (AW) and the average between-cluster dissimilarity (AB) were computed, although these cannot properly be compared over different values of k. It is remarkable, though, that LCC/BIC has a larger AW than clara/ASW and about the same AB, even though both AW and AB are generally expected to decrease with increasing k, consistent with the large values for the 2-cluster solution of clara/CH. clara with k = 12 is worse in both respects (larger AW, smaller AB) than clara/ASW with k = 11. A k-means solution was computed as well (not shown, but quite similar to clara); with both k = 11 and k = 12 it yielded better values than clara with k = 12 and worse ones than clara/ASW.

The random subset is shown in Figure 4 with the categorical variables jittered by adding some random uniformly distributed noise (note that two of the plots in Figure 2 are based on the same subset, but with observations with income or savings amount zero omitted). The clara/CH clustering with 2 clusters is shown as well. It can be seen that cluster 1 corresponds strongly (but not totally) to individuals without savings or with very modest savings. Consequently, most of them do not have a savings account, or have only a very small number of them. "Rented accommodation" (26% vs. 11.1% in cluster 2) is overrepresented in this cluster. The median incomes are $58,000 in cluster 1 and $120,000 in cluster 2, although there are a few individuals in cluster 1 with a quite high income. The other clusterings can be interpreted in terms of the variables as well.
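Figure 5 below is based on classical MDS of the dissimilarity matrix of the subsample; a minimal sketch, with d the dissimilarity matrix and clustering the vector of cluster labels:

    ## first two dimensions of classical MDS (Torgerson, 1958)
    mds <- cmdscale(as.dist(d), k = 2)
    plot(mds, pch = as.character(clustering),
         xlab = "MDS dimension 1", ylab = "MDS dimension 2")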
Figure 5: First two dimensions of multidimensional scaling of the subset of data from the US Survey of Consumer Finances 2007, with the LCC clustering with k = 12 (left) and the clara/ASW clustering with k = 11 (right).

Without giving the details here, for the most convincing clustering, clara/ASW, the 11 clusters can be interpreted, in order of the numbering given by clara, as

1. middle-class spenders (characterised by, on average, more checking accounts than the other middle-class clusters; n1 = 2,555 observations),

2. average upper-class (n2 = 1,986),

3. average middle-class (n3 = 3,881),

4. middle-class savers (more savings and savings accounts than the other middle-class clusters; n4 = 1,517),
5. average working-class (paying rent; n5 = 1,673),

6. middle-class self-reliants (no life insurance but large savings; n6 = 1,090),

7. barely-making-it working-class (no life insurance, very little savings; n7 = 1,515),

8. upper-class spenders (n8 = 1,467),

9. working-class without own home (and little savings; n9 = 686),

10. working-class home owners (n10 = 1,055),

11. not working (no income; n11 = 135).

Note that the upper, middle, and working classes are distinguished by their income levels.

Applying the bootstrap stability assessment from Hennig (2007) for k = 11 fixed, clusters 1, 2, 6, 7, 8 and 10 turn out to be very stable (γ̂_C > 0.8 with 20 bootstrap repetitions), clusters 3, 4, 5 and 11 are fairly stable (0.8 > γ̂_C > 0.5), and only the second smallest cluster, 9, is very unstable (γ̂_C = 0.23).

Figure 5 shows the first two dimensions of classical multidimensional scaling (MDS; Torgerson, 1958) for LCC with k = 12 and clara/ASW. The MDS is dominated by the two continuous variables lsam and linc. Cluster "a" (no. 11) in both solutions consists of the individuals without income. These are separated by a clear gap from the observations with minimum income and therefore form a cluster according to both methods. The parallel diagonal lines in the lower left are the individuals without savings, the parallels being caused by different levels of housing. clara puts some individuals without savings together with some with low savings in clusters 1, 5, 7 and 9, whereas the corresponding LCC clusters 1 and 4 only contain individuals without savings. According to the MDS plots, the zero-income and zero-savings groups yield the clearest separation. More detail can be seen by looking up the other variables (not shown; a matrix plot such as Figure 4 for more clusters is much better with colours on a big screen). The LCC solution is more clearly dominated by the continuous variables, as Table 5 shows. This implies that Figure 5, which is dominated by the same variables, shows the differences between the LCC clusters more clearly than those between the clara clusters, in spite of the fact that the latter are better according to the dissimilarity based criteria. This can be seen by looking at further MDS dimensions (not shown). Note that the cluster sizes for the LCC/BIC solution are much less balanced: the largest LCC/BIC cluster comprises 5,068 observations, whereas there are eight clusters with fewer than 100 observations each. Variable impact (Table 5) looks more balanced for clara (which does not put much weight on cacc and sacc), whereas LCC is strongly dominated by the savings amount, with more emphasis on checking accounts and less on savings accounts than clara; this makes some sense because the savings account information is strongly related to the amount of savings anyway.

As discussed before, there is no clear optimal clustering and the number of clusters is ambiguous. Therefore formal cluster analysis here (and in other datasets of similar type that we have analysed) does not yield social classes of strong theoretical value, although it is interesting to compare the clusters with those that have been suggested in the literature, and the clusterings can also be helpful where "social class" is used as an auxiliary variable in empirical studies investigating other response variables of interest. Despite the lack of theoretical justification for these classes, their advantage is that they are directly connected to empirical data.
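The cluster-wise bootstrap stability assessment used above is implemented as clusterboot in the R package fpc; a minimal sketch for the 11-cluster clara solution (x as before; the stability values correspond to the γ̂_C above):

    library(fpc)
    ## Hennig (2007): cluster-wise bootstrap stability, here with 20 repetitions
    cb <- clusterboot(x, B = 20, clustermethod = claraCBI, k = 11,
                      usepam = FALSE, count = FALSE)
    cb$bootmean   # cluster-wise stability; values > 0.8 indicate very stable clusters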
8 Conclusion

The results of the simulation study do not clearly indicate one method as optimal, not even uniformly according to a given dissimilarity based criterion (although latent class clustering is better than the dissimilarity based methods at recovering a true underlying latent class model). clara/ASW, however, did badly in terms of stability, and clara/CH looks preferable in that respect. Research is required in order to develop a criterion that works similarly to CH but is based on unsquared dissimilarities.

The good thing about the dissimilarity based indexes is that they can be evaluated for the dataset to be analysed in practice. This means that for a given dataset, LCC/BIC, clara/ASW and clara/CH can all be computed and compared by the desired index, in order to see whether in the given situation LCC/BIC does a good enough job of grouping similar observations. Sometimes data subsetting may be needed for this, because evaluating some of the criteria (though not running the clustering methods themselves, as long as Euclidean dissimilarities are used) requires handling huge dissimilarity matrices. For the dataset in Section 7, the clara/ASW solution looks best in spite of the stability problems observed in the simulation study.

In any case, many application-based decisions have to be made, particularly regarding weighting, transformation and standardisation of the variables. The analysis of the US Survey of Consumer Finances 2007 illustrates that in real situations the decision about an appropriate number of clusters (or at least an appropriate range over which to search) requires strong subjective (aim-dependent) input as well, in spite of the fact that estimation criteria with some scientific justification exist. The automatic clustering methods do not free the scientist from defining how the variables should contribute to what constitutes a social class, and what properties are desired for a "good" stratification. It may seem an attractive feature of LCC/BIC that it can be run without making such decisions, but this is deceptive. In order to find out whether, for a given dataset, the method does something sensible in terms of grouping similar observations, which cannot be taken for granted as the examples and simulations show, a dissimilarity measure is required, and the decisions mentioned above have to be made.

More research is required concerning some of these decisions (choice of q, treatment of outliers), a more systematic characterisation of the behaviour of the methods (the current simulation study was quite successful in highlighting various patterns of behaviour of the methods, but not comprehensive enough to show in detail what behaviour is to be expected under which circumstances), and a more comprehensive treatment of ordinal variables.

References

Agresti, A. (2002) Categorical Data Analysis. Second Edition, Wiley, New York.

Agresti, A. and Lang, J. (1993) Quasi-symmetric latent class models, with application to rater agreement. Biometrics 49, 131-139.

Baker, F. B. and Hubert, L. J. (1975) Measuring the Power of Hierarchical Cluster Analysis. Journal of the American Statistical Association 70, 31-38.

Calinski, T. and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis. Communications in Statistics 3, 1-27.

Celeux, G. and Govaert, G. (1991) Clustering criteria for discrete data and latent class models. Journal of Classification 8, 157-176.

Chan, T. W. and Goldthorpe, J. H.
(2007) Social Stratification and Cultural Consumption: The Visual Arts in England. Poetics 35, 168-190.

Gifi, A. (1990) Nonlinear Multivariate Analysis. Wiley, Chichester.

Goodman, L. A. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 215-231.

Gordon, A. D. (1999) Classification (2nd ed.). Chapman & Hall/CRC, Boca Raton.

Gower, J. C. (1971) A general coefficient of similarity and some of its properties. Biometrics 27, 857-871.

Grusky, D. B., Ku, M. C. and Szelényi, S. (2008) Social Stratification: Class, Race, and Gender in Sociological Perspective. Westview, Boulder, CO.

Grusky, D. B. and Weeden, K. A. (2008) Measuring Poverty: The Case for a Sociological Approach. In: Kakwani, N. and Silber, J. (eds.): Many Dimensions of Poverty. Palgrave-Macmillan, New York, 20-35.

Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) On Clustering Validation Techniques. Journal of Intelligent Information Systems 17, 107-145.

Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis 52, 258-271.

Hennig, C. (2009) A Constructivist View of the Statistical Quantification of Evidence. Constructivist Foundations 5, 39-54.

Hennig, C. (2010) Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification 4, 3-34.

Hennig, C. and Hausdorf, B. (2006) Design of dissimilarity measures: a new dissimilarity measure between species distribution ranges. In: Batagelj, V., Bock, H.-H., Ferligoj, A. and Ziberna, A. (eds.): Data Science and Classification. Springer, Berlin, 29-38.

Hubert, L. and Arabie, P. (1985) Comparing Partitions. Journal of Classification 2, 193-218.

Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data. Wiley, New York.

Kennickell, A. B. (2000) Wealth Measurement in the Survey of Consumer Finances: Methodology and Directions for Future Research. Working paper (May), http://www.federalreserve.gov/pubs/oss/oss2/method.html

Keribin, C. (2000) Consistent estimation of the order of a mixture model. Sankhya A 62, 49-66.

Kingston, P. W. (2000) The Classless Society. Stanford University Press, Stanford, CA.

Leisch, F. (2004) FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11(8), 1-18.

Lenski, G. E. (1954) Status Crystallization: A Non-Vertical Dimension of Social Status. American Sociological Review 19, 405-413.

Le Roux, B. and Rouanet, H. (2010) Multiple Correspondence Analysis. SAGE, Thousand Oaks, CA.

Liao, T. F. (2006) Measuring and Analyzing Class Inequality with the Gini Index Informed by Model-Based Clustering. Sociological Methodology 36, 201-224.

Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159-179.

Pekkanen, J., Tuomilehto, J., Uutela, A., Vartiainen, E. and Nissinen, A. (1995) Social Class, Health Behaviour, and Mortality among Men and Women in Eastern Finland. British Medical Journal 311, 589-593.

Sugar, C. A. and James, G. M. (2003) Finding the Number of Clusters in a Dataset: an Information-Theoretic Approach. Journal of the American Statistical Association 98, 750-763.

Torgerson, W. S. (1958) Theory and Methods of Scaling. Wiley, New York.

Tukey, J. W. (1962) The Future of Data Analysis. The Annals of Mathematical Statistics 33, 1-67.

Vermunt, J. K. and Magidson, J. (2002) Latent class cluster analysis. In: Hagenaars, J. A. and McCutcheon, A. L. (eds.): Applied Latent Class Analysis. Cambridge University Press, Cambridge, 89-106.

Vermunt, J. K. and Magidson, J.
(2005) Technical Guide for Latent GOLD 4.0: Basic and Advanced. Statistical Innovations Inc., Belmont, MA.

Wright, E. O. (1985) Classes. Verso, London.

Wright, E. O. (1997) Class Counts: Comparative Studies in Class Analysis. Cambridge University Press, Cambridge.