Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods.

PREFACE

For a number of years I have worked with teaching about, implementing and writing computer programs representing fairly sophisticated mathematical models. Some problems have then emerged which I have not seen any systematic attempts to answer. A basic observation is that a program for, say, factor analysis, multidimensional scaling, or whatever will always grind out an answer. With increasing sophistication in software development, advanced programs are to a larger and larger extent available to users with perhaps less than complete insight into the mathematics of the underlying algorithms. There does not seem to be any necessary reason for deploring this state of affairs, since any computer program is a tool which can be used without detailed knowledge of how it is built. It is, however, often a very difficult task to figure out the meaning of the output from a program applied to a specific set of data. There exists what may perhaps be regarded as a paradoxical state of affairs: a discrepancy between, on the one hand, the amount of sophisticated mathematics used in many programs and, on the other hand, the absence of a developed rationale for answering many concrete questions which the user will want to ask when he has applied an advanced program. The novice or uninitiated may well be awestruck by what appear to be highly developed tools. But when he asks questions such as: is there any structure in the data, is it worthwhile to try to interpret the output, what is really the dimensionality of these data, there does not exist any explicit rationale for answering these questions. The user is typically left with more or less intuitively based rules of thumb. Of course an expert will be able to give reasonable answers based on his long experience. But the use of advanced computer programs would be much more convenient if the knowledge of the expert could be replaced with explicit rules.

The aim of the present work is to outline an approach which I hope will lead to simple, explicit rules for answering such questions as those exemplified above, which currently are answered on a more or less intuitive basis. Unfortunately this does not imply that the present work can be regarded as a textbook. For one thing the general approach is only applied to one set of methods, namely multidimensional scaling. Furthermore the methodology is not worked out in detail for many types of applications of multidimensional scaling, and finally there is as yet no report of how the methodology works for empirical data. At present this work will mainly be of interest to the expert in methodology and the user of multidimensional scaling, though it is hoped that the reader with a general interest in methodology will find the general approach of interest. Some comments on the separate chapters may be of help to the general reader in that they will indicate which of the sections can be skipped.

The basic idea in the present work is really quite simple. Empirical data are regarded as infested with noise, and the aim of the model (program) is regarded as the "removing" of noise, "purifying" the data, or more literally: if noise obscures the underlying structure or moves the data away from the latent structure, applying an optimal model will move the result back towards the true structure.
The general background of this idea is sketched in Chapter 1 (which the expert can just skim); discussion of critical views on this seemingly Platonic view is reserved for the final Concluding remarks. Chapter 2 is the basic chapter in the present work, spelling out the main idea in some detail. Since the idea represents a conceptual framework that can be applied to a large variety of different models it may be regarded as a model for models, or a metamodel. There are three basic relations: the relation between the latent structure and the data, the relation between the data and the output, and finally the relation between the output and the latent structure. Since these three relations have not previously been systematically discussed in the literature, the first three sections of Chapter 3 review and classify the scattered comments of relevance. These sections are fairly technical and may be skipped by the reader with mainly general interests. Section 3.4, however, represents what is perhaps a novel approach to the justification of a new index and may be of general interest. In Chapter 4 the first two sections are again classificatory, Sections 4.3 and 4.4 are fairly technical, and then Section 4.5 directly picks up the threads from Chapter 2 and leads to the results of practical interest, which are presented in Sections 4.6 and 4.7. An attempt has been made to capture the basic procedure and the main results in terms of graphical diagrams.

These four chapters in Part 1 represent an attempt to flesh out an approach which has not been made explicit previously, though it has been implicit in some current literature. In the terminology of G.A. Kelly this part tightens and clarifies an approach which hopefully in the future may make it easier to evaluate results from complex programs. Part 2 is different in that it raises rather than answers questions. Chapter 5 deals with tree structures; then in Ch. 6 there are some frankly speculative attempts to sketch a more general model than at present exists. In Chapter 6 there is a variety of illustrations from cognitive (and clinical) psychology which are used to justify the search for more complex models. It is also hoped that this chapter may inspire closer collaboration between experts in the development of technical tools for data analysis and those whose technical expertise is manifested in the alleviation of human distress, the clinical psychologists.

Looking back on the years spent on the present work there is the deep realization that it would not have been possible to finish this work without the help of many colleagues and friends. First of all I wish to express my gratitude to Gudrun Eckblad, who patiently worked through many drafts and did invaluable service in removing or clarifying several obscure passages. In discussions she also pointed out to me several implications of the basic framework which were unclear to me. It is now difficult for me to sort out and rank order the help provided by several other colleagues who critically commented on separate sections of this work and provided invaluable encouragement. The list below is not complete, but my gratitude to each of the persons listed is deeply felt: Rolv Blakar, Carl Erik Grenness, Paul Heggelund, Steinar Kvale, Thorleif Lund, Erik Paus, Ragnar Rommetveit, Jon Martin Sundet, Astrid Heen Wold and Joseph Zinnes.
A very special gratitude is due to the staff at the Computer Centre of the University of Oslo. Understaffed and overworked as they were, there always seemed to be someone available to help me utilize our CDC machine and to debug errors in the program system I had to develop. And, finally, this research has in part been supported by grants from the Norwegian Research Council for Science and the Humanities.

Finn Tschudi, October 1972.

CONTENTS

PREFACE

Part I  A METAMODEL AND APPLICATIONS TO DIMENSIONAL MODELS
1. Introduction
1.1 Multivariate models and the research process. Statement of problems
1.2 Comments on formal and content oriented approaches
1.3 General properties of data reduction models
1.4 Type of model and psychological theory
1.41 Spatial (dimensional) models
1.42 Tree structure model (hierarchical)
2. A metamodel for data reduction models
2.1 A metamodel
2.2 The extended form of the metamodel. Empirical and theoretical purification
3. Nonmetric multidimensional scaling and the metamodel
3.1 Nonmetric algorithms and criteria for apparent fit (AF)
3.2 Methods of introducing error and indices of noise level (NL)
3.3 On indices of true fit (TF)
3.4 Direct judgments of true fit. An empirical approach
4. Beyond stress - how to assess results from multidimensional scaling by means of simulation studies
4.1 Two approaches in simulation studies
4.2 Classification of variables in simulation studies
4.3 Comparison of algorithms
4.31 Choice of initial configuration and local minima
4.32 Comparing MDSCAL, TORSCA and SSA-1
4.33 Metric versus nonmetric methods
4.4 Previous simulation studies and some methodological problems. Analytical versus graphical methods. Unrepeated versus repeated designs
4.5 Implications from the metamodel
4.6 Evaluation of precision. Construction of TF-contours from AF-stress
4.7 Evaluation of dimensionality and applicability. Application of the extended form of the metamodel

Part II  ANALYSIS OF A TREE STRUCTURE MODEL AND SOME STEPS TOWARDS A GENERAL MODEL
5. Johnson's hierarchical clustering schemes, HCS
5.1 A presentation of hierarchical clustering schemes
5.2 HCS and the Guttmann scale
5.3 A dimensional representation of the objects in HCS (for binary trees) - a tree grid matrix
6. Filled, unfilled and partially filled spaces
6.1 A discussion of HCS and spatial models
6.2 The inadequacy of tree structure models. Comments on tree grid matrices, G.A. Miller's semantic matrices and G.A. Kelly's Rep Grid
6.3 Outline of a general model
6.4 Comments on the general model
6.41 Some technical problems. The metamodel and the general model
6.42 The general model as a conceptual model, new directions for psychological research

CONCLUDING REMARKS
Appendix. Main features of the program system
References

List of figures

Chapter 1.
Fig. 1. Schematic diagram of the research process, based on Coombs (1964, Fig. 1.1).
Fig. 2. Illustration of relation between 1 and 2 sets of objects.
Fig. 3. A classification (typology) represented as a tree.
Fig. 4. Tree structure resulting from a hierarchical cluster analysis of latency data for visual discrimination of pairs of letters by adult subjects. Based on Gibson (1970, p. 139).

Chapter 2.
Fig. 1. A metamodel representing the relations between latent structure (L), manifest (M) and reconstructed data (G).
Fig. 2. Extended form of metamodel (for repeated measurements). The figure illustrates empirical purification.
Fig. 3. Illustrations of possible lack of equivalence between empirical purification and theoretical purification.
Fig. 4. Construct network for extended form of metamodel.

Chapter 3.
Fig. 1. Illustration of error process used by Young (1970).
Fig. 2. Illustrations of different categories of true fit for 1-dimensional configurations, n = 20.
Fig. 3. Illustrations of different levels of true fit for 2-dimensional configurations, n = 20.

Chapter 4.
Fig. 1. Schematic illustration of the relation between (NL) and (TF, AF).
Fig. 2. Sample of results from simulation studies showing the relation between TF-categories and AF-stress for selected values of n and t.
Fig. 3. TF contours from AF-stress for 1-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.
Fig. 4. TF contours from AF-stress for 2-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.
Fig. 5. TF contours from AF-stress for 3-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.
Fig. 6. Relation between TF (expressed as categories and as correlations) and n for configurations in 1, 2 and 3 dimensions when stress = 0.
Fig. 7. Relations between AF-stress and TF-categories for 7, 12 and 25 points and crossvalidation results for 1-dimensional configurations.
Fig. 8. Relations between AF-stress and TF-categories for 9, 12 and 25 points and crossvalidation results for 2-dimensional configurations.
Fig. 9. Relations between AF-stress and TF-categories for 9, 12 and 25 points and crossvalidation results for 3-dimensional configurations.
Fig. 10. TF contours from NL for 1-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.
Fig. 11. TF contours from NL for 2-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.
Fig. 12. TF contours from NL for 3-dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.
Fig. 13. Some comparisons between the relation of AF and NL based on Figs. 3-5 and Figs. 10-12 and the relation of AF and NL in the original results.
Fig. 14. Schematic representation of the design used for the extended form of the metamodel.
Fig. 15. Application of the extended form of the metamodel when the analysis is done in varying dimensionalities for a given true dimensionality, t. A schematic illustration of expected relative size of Theoretical Purification, TP, based on theoretical correlations, and Empirical Purification, EP, based on empirical correlations.
Fig. 16. Curves showing how the amount of purification, Est(TP), depends upon n and t.

Chapter 5.
Fig. 1. An example of a HCS and the corresponding tree representation.
Fig. 2. Illustration of a nested sequence of sets.
Fig. 3. Different presentations of the same tree structure.
Fig. 4. Illustration to the proof of having to represent a HCS in n-1 dimensions or the l_∞ metric.

Chapter 6.
Fig. 1. A tree with 4 objects and the corresponding 3-dimensional spatial representation.
Fig. 2. Representation of a mixed class and dimensional structure, 3 classes and one continuous dimension.

Part I  A METAMODEL AND APPLICATIONS TO DIMENSIONAL MODELS

Chapter 1. INTRODUCTION

1.1 Multivariate models and the research process. Statement of problems

Multivariate models are finding increasing application in all branches of psychology, as for instance testified by the handbook edited by Cattell (1966). The classical example of multivariate models, factor analysis, is still probably the most popular of multivariate models. The most important methodological contributions in recent years, however, are nonmetric models. The modern period of development of nonmetric models may be dated from the time when Shepard (1962a, 1962b) first described an algorithm in the form of a computer program for nonmetric multidimensional scaling. Shepard's work must partly be seen against the background of metric multidimensional scaling, which again may be regarded as an outgrowth of factor analysis, cfr. Torgerson (1958). An equally important background for Shepard's work was earlier work on nonmetric models, notably by Coombs and Guttmann. In a most interesting published letter Guttmann (1967) describes some of this work and also the unfortunate lack of nonmetric approaches in the 1950's. Recently there has, however, been an extremely rapid development of nonmetric models, to a large extent inspired by Shepard's contributions. The most exciting recent development is conjoint measurement, which for instance provides an approach to testing theories stated as metric functions (e.g. Hull's vs. Spence's theories on the relation between drive, habit strength and incentive) without assuming more than ordinal properties of the response measures, cfr. Krantz and Tversky (1971). Another consequence of applying conjoint measurement is that it in many cases makes it unnecessary to use more or less arbitrary transformations in analysis of variance. One basic contrast between the newer nonmetric approach and the older metric approach is whether transformations are left open to discovery or whether they are more or less arbitrarily imposed on the data. The recent developments in nonmetric models have occurred jointly with perhaps equally important reformulations of the basis for psychological measurement. Today one can only glimpse the far-reaching implications these new developments may have both for experimentation and theory building in psychology (cfr. Krantz, 1972).

The present work focuses on some methodological problems which are common to both metric and nonmetric models. More specifically, a point of departure for the present work is the fact that research making use of multivariate models involves a series of decisions which often may be partly arbitrary. The aim of the present work is to outline a conceptual framework which provides methods to aid the process of making inferences from the output of analyses and to guide some of the decisions to be made.
This conceptual framework will here mainly be applied to nonmetric multidimensional scaling; hopefully it may in the future be applied to other multivariate models, nonmetric as well as metric. Coombs (1964, p. 4) has a "flow diagram from the real world to inferences", which will serve as a point of departure for specifying the problems to be studied here. For our purposes it will be convenient further to subdivide Coombs' "phase 3" - since this is the focus for the present work - into three steps, labelled 3a, 3b and 3c in Fig. 1.

Fig. 1. Schematic diagram of the research process, based on Coombs (1964, Fig. 1.1). 3a, b and c correspond to Coombs' phase 3 ("inferential classification").

We have nothing to add to the brief treatment of phase 1 by Coombs (1964, p. 4):

The universe of potential observations contains all of the things the behavioural scientist might choose to record. If an individual is asked whether he would vote for candidate A, the observer usually records his answer, yes or no: but we might ask why the time it took him to answer is not of interest, or whether there was a change in respiration, or in his galvanic skin responses or what he did with his hands, and so on. From this richness the scientist must select some few things to record, and this is called phase 1 in the diagram.

While acknowledging the importance of this phase in the research process, Coombs has nothing further to say about it: "Phase 1, perhaps the most important of all, the decision as to what to observe, is beyond the scope of this theory." (op. cit. p. 5.)

Phase 2 concerns one of the important contributions of Coombs' Theory of data, the distinction between "recorded observations" and "data". Coombs reserves the term "data" for that which is analyzed and points out that "the same observations may frequently be interpreted as one of two or more different kinds of data. The choice is an optional decision by the scientist and represents a creative step on his part...." (op. cit. p. 4). On the other hand, as we shall presently see, a variety of different kinds of observations may be mapped into the same kind of data.

Coombs interprets "data" in terms of two dichotomies¹: whether the objects in the study may be conceived as consisting of one or two sets of points, and whether the data may be interpreted as an order or proximity relation on the points. This gives four kinds of data; the main focus in this work is on the type of data which can be conceived as proximity relations on one set of points, "similarities data" in Coombs' terminology. This is motivated by the fact that the structure of such data is both simpler and more general than that used in factor analysis, which is the most common multivariate model used. Similarities data are simpler because only one set of points is involved, in contrast to factor analysis, where there are two sets of points (usually identified as "persons" and "tests"). The greater generality of similarities data is apparent from an argument made by Coombs (1964, Ch. 24), where he points out that the case of two sets of points may be regarded as an offdiagonal submatrix from a more complete (intact) matrix with just one set of points. We might for instance subsume persons and tests under a more general concept, "objects".
It is readily apparent that the two diagonal submatrices - persons x persons and tests x tests - will contain no observations. It is only in the offdiagonal submatrix, persons x tests, that we have observations on relations between objects. This is illustrated in Fig. 2.

Fig. 2. Illustration of relation between 1 and 2 sets of objects. a) An intact similarities matrix (1 set of objects). b) An offdiagonal submatrix (2 sets of points) from a hypothetical matrix with 1 set of "objects".

One consequence of the greater simplicity is that the solution is unique up to a similarity transformation for similarities data. The problem of "oblique" vs. "orthogonal" transformation, prominent in discussions of factor analysis, is not relevant for similarities data since only orthogonal transformations are permissible (cfr. Cliff, 1966, p. 41). It should, however, be pointed out that we will not have any special discussion of rotation in the present work. The simplicity of similarities data will make it easier to concentrate on the basic relations which are dealt with in the conceptual framework presented in Ch. 2. The consequences of this conceptual framework can then be explored in detail for similarities data.

A study of similarities data is also of substantial interest, as is evidenced by the wide variety of experimental observations which may be mapped into similarities data. Direct judgements of similarity/dissimilarity of each of the n(n-1)/2 pairs of objects are obvious examples. Some examples are the study of hue by Ekman (1954), shape of U.S. states (Shepard and Chipman, 1970), and facial expressions (Abelson and Sermat, 1962). Latency of response in discriminating pairs of stimuli is an interesting alternative response measure, cfr. for instance Gibson (1970). Other types of examples include: overlap indices for pairs of words in classical free association studies, cfr. Deese (1962); sorting words into any number of piles and for each pair recording the number of subjects sorting them together, cfr. Miller (1969); and similarity of profiles, e.g. for colour names applied to each of a set of spectral colours, cfr. Shepard and Carroll (1966). Further examples include relations between journals, which may be indexed by the amount of mutual references, cfr. Coombs et al. (1970, Ch. 3), see also Xhignesse and Osgood (1967), and substitution errors in learning Morse signals, see Shepard (1963)². A systematic study of phase 2, in the present context the various procedures which may give similarities data, is outside the scope of the present study; we now turn to phase 3.

¹ In the 1964 version of Coombs' data theory there were actually three dichotomies, but the more recent presentation (Coombs et al., 1970) is somewhat simpler in that only two dichotomies are required.

In Coombs' flow diagram phase 3 is labelled "inferential classification of individuals and stimuli" and further described: "phase 3 involves the detection of relations, order and structure which follows as a logical consequence of the data and the model used for analysis." (op. cit. p. 5.) A general perspective on the research process is provided by a further quotation:

the scientist enters each of these three phases in a creative way in the sense that alternatives are open to him and his decisions will determine in a significant way the results that will be obtained from the analysis.
Each successive phase puts more limiting boundaries on what the results might be. (op. cit. p. 5.)

Coombs' description of phase 3 is highly condensed, and by breaking it up we can study separately different choice points within this phase. While we do not wish to reduce the importance of the scientist's "creativity", the major aim of this work is to provide an approach which may help the scientist to make the basis of his choices as explicit as possible. Each of the three subdivisions which we have made of phase 3 corresponds to a major problem to be illuminated.

The first step (phase 3a) in the process of arriving at "inferential classification" is of course to select a model. Usually this will be a spatial (dimensional) model. There is then the problem of the applicability of such a model. Will it give a faithful representation of the data? Or is the model inappropriate for the given data? Coombs does not consider alternatives to a spatial model, but a more recently proposed tree structure model (Johnson, 1967) does provide an alternative. While the present work mostly treats a spatial model, some discussion of Johnson's model is included. By presenting two different types of models for the same kind of data we wish to emphasize that a choice point exists which is often overlooked.

Spatial models represent a type of model which may take on different forms, and thus we have a further choice to make in phase 3b (Fig. 1). For one thing we may select from a variety of different distance functions (though mainly we concentrate on the usual Euclidean space). Another choice of form is the proper dimensionality. It is always possible to have solutions in several different dimensionalities. We will try to outline a foundation for choosing the proper dimensionality. As will be further discussed later, the decisions as to applicability and dimensionality will in practice not be made independently of each other. In order to conclude that a spatial model is inapplicable for a given set of data, it will in most cases be necessary to show that the model is inapplicable for all relevant dimensionalities.

² The last two examples give asymmetric data matrices, conditional proximity matrices in Coombs' terminology. The number of times signal j will be given as an answer when signal i is presented will generally be different from the number of times signal i will be given as an answer to signal j, to give one example of asymmetry. The models to be discussed for similarities data do, however, require symmetry. If these models are to be applied to asymmetric matrices, the latter must then be converted to symmetric matrices by some averaging procedure (this is usually done).

Having settled on a specific form of a data reduction model, what Coombs (1964) called "detection of relations, order and structure" is not an automatic process; there still remains the final phase 3c, to make inferences from the output of the analysis. At this stage the output is usually represented as a configuration of points in a spatial diagram. One example is the "semantic structures" in the well-known study of a case of multiple personality by Osgood and Luria (1954). Fifteen concepts for each of the three personalities were presented in three-dimensional semantic space. Another example from a quite different area of psychology is the study of nine Munsell colours by Torgerson (1952, 1958).
The results were here presented in a two-dimensional diagram (the dimensions being value and chroma). Regardless of the subject matter, an important consideration in the final phase is what we will call the precision of the output. The concept of precision implies statistical considerations. The naive point of departure for this concept is questions of the type: How "good" is this configuration, how much confidence can I have in it? Acknowledging that there will always be random error from various sources infesting the results, the question can be restated: If the study were repeated under similar conditions, to what extent would the resulting configuration be identical?

The concept of precision has some similarity to the concept of "behaviour domain validity" in the reformulation of classical reliability theory by Tryon (1957). By precision we will mean the extent to which the location of points in an observed configuration is the same as their "true" positions. In terms of Tryon's conceptualization the observed configuration corresponds to an actually obtained set of test scores, while the true positions correspond to "behaviour domain scores". It is well known that the square root of the reliability coefficient is an index of behaviour domain validity (discrepancy between domain and observed scores). There is, however, no index for the precision of a configuration. One of the major problems in this work is to construct an index of precision (discrepancy between true and observed configuration) and show how this index can be estimated in practice. Extensive simulation studies (Monte Carlo methods) are necessary for this purpose.

How will knowledge of precision be related to the process of making inferences? Generally speaking, the better the precision, the more inferences can be drawn from the "fine structure" of the configuration. Conversely, if precision is only marginally satisfactory, only the coarse structure of the configuration can be used for making inferences. In the latter case only major clusters can be tentatively identified. With close to perfect precision one can also use structural relations within clusters as a basis for inferences.

1.2 Comments on formal and content oriented approaches.

Two major approaches to the problems of applicability, dimensionality and precision may be described. One approach relies on general indices which are independent of the more or less specific hypotheses which the scientist may entertain. This will be called a formal approach and will be discussed first. The most important such index is "goodness of fit". This is an overall measure of how well the output matches the data. For similarities data nonmetric multidimensional scaling is today the most common method used for analyzing the data, and goodness of fit is usually assessed by stress, as described by Kruskal (1964a, 1964b) in his important improvement of Shepard's original program. Though nonmetric multidimensional scaling and stress are extensively discussed in Section 3.1, some discussion of stress at this point will serve to highlight weaknesses of the current formal approach. For the evaluation of stress in practice Kruskal first states that stress is a normed "residual sum of squares", and since it is dimensionless it can be thought of as a percentage. Different ranges of stress are described as follows:
20% and above: Unlikely to be of interest.
15% and above: We must still be cautious.
10% - 15%: We wish it were better.
Alternatively, stress in the range 10% - 20%: Poor.
From 5% - 10%: Satisfactory or fair.
Below 5%: Impressive.
From 2.5% - 5%: Good.
Less than 2.5%: Excellent.
0%: Perfect.

These descriptions are extracted from Kruskal (1964a, p. 3) and (1964b, p. 32). The underlined phrases (poor, fair, good, excellent, perfect) represent the most frequently quoted part of Kruskal's description. The background for his description is described as follows: "Our experience with experimental and synthetic data suggests the following verbal evaluation". As regards the problem of applicability of the model, the answer based on stress would be to draw a cutting point somewhere in the neighbourhood of a stress of 20% and regard all values above this as evidence against the applicability of the model. Similarly, the only guide available to judge the precision of the output configuration is the phrases "poor", "fair", "good", "excellent".

Kruskal (1964a, p. 16) is more explicit on the problem of dimensionality, where he suggests an "elbow criterion". First one plots a graph showing how stress decreases as dimensionality, t, increases (this curve will further be referred to as a stress curve). If adding an extra dimension gives a relatively small improvement in stress, this may be noticeable as an elbow in the stress curve, and this elbow may then point to the appropriate value of t. The problem here is well known in factor analysis ("when to stop factoring"), and similar to the problem of looking for an elbow in the stress curve is the process of inspecting a curve showing how eigenvalues (characteristic roots) decrease for successive factors extracted. It is a well-known strategy to look for "breaks" or elbows in such curves. This is recognized as a process which may require fairly subtle skills, and Green (1966) even suggests detailed studies of how accomplished "root starers" go about their task in order to write simulation programs to "mechanize" this task. Here we note that the elbow criterion very often fails. Whether this is due to limited applicability of the criterion itself or to lack of skill in applying it will be only incidentally treated, since the conceptual framework outlined in Ch. 2 will suggest an alternative way of using a stress curve to determine dimensionality.

Notice that from a more general point of view there is an implicit conflict between two goals in determining dimensionality. On the one hand, a frequently stated goal of parsimony requires a low dimensionality. Generally, however, low dimensionality will give high stress. This conflicts with the goal of not doing too much violence to the data. This latter goal, that the final solution remains close (faithful) to the data, is easier to satisfy the higher the dimensionality. Higher dimensionality implies more parameters, and with a sufficient number of parameters of course any body of data can be faithfully accounted for. We will later show that the apparent conflict between parsimony and faithfulness stems from placing too much emphasis on goodness of fit. In the conceptual framework to be proposed the conflict will be resolved. A more important shortcoming of basing conclusions primarily on stress is that a variety of simulation studies have shown that a given stress value has very different implications depending upon the number of objects, n, and the dimensionality, t. The conceptual framework to be developed will serve to interrelate and extend these studies.
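The stress-curve procedure behind the elbow criterion is easy to sketch in outline. The fragment below is only an illustration of the idea, not of Kruskal's program: classical (Torgerson) metric scaling stands in for the nonmetric algorithm, the fit index is a metric analogue of stress formula 1 (the monotone-regression step is omitted), and the data are a small synthetic configuration with invented noise, so none of the printed values should be read as Kruskal's stress.

```python
import numpy as np

def torgerson(D, t):
    """Classical (metric) scaling of a distance matrix D into t dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    w, V = np.linalg.eigh(B)
    keep = np.argsort(w)[::-1][:t]             # t largest eigenvalues
    return V[:, keep] * np.sqrt(np.maximum(w[keep], 0.0))

def stress_like(D, X):
    """Metric analogue of Kruskal's stress formula 1 for configuration X."""
    fitted = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(D.shape[0], k=1)
    return np.sqrt(((fitted[iu] - D[iu]) ** 2).sum() / (fitted[iu] ** 2).sum())

rng = np.random.default_rng(2)
truth = rng.normal(size=(12, 2))               # a 2-dimensional "truth", n = 12
D = np.linalg.norm(truth[:, None, :] - truth[None, :, :], axis=-1)
E = rng.normal(scale=0.1, size=D.shape)
D = np.abs(D + (E + E.T) / 2)                  # add symmetric noise
np.fill_diagonal(D, 0.0)

for t in range(1, 6):                          # the stress curve
    print(t, round(stress_like(D, torgerson(D, t)), 3))
```

With a 2-dimensional "truth" the printed curve typically drops sharply from t = 1 to t = 2 and then flattens, which is the elbow one looks for; how such a curve, or any single stress value, should be interpreted still depends heavily on n and t, which is precisely the shortcoming just noted.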
Based on this framework it will then be possible to offer concrete alternative guidelines for answering the problems of applicability, dimensionality and precision, based on all three parameters: stress, n and t.

Finally it is evident that the description of how to use the stress index by Kruskal, quoted previously, is clearly of a provisional nature. The description has undoubtedly been of help in many cases, but the criticism can be raised that it is scarcely an impressive improvement over purely nonquantitative criteria.

The present approach will still be a formal approach, since the meaning of a particular configuration of points will not be considered. In practice the scientist will in most cases be most concerned with precisely such meaning, or what we here choose to call the content. Some comments on content orientation and its relation to the formal approach in multivariate research are therefore appropriate. Recognizing the limitation of the current formal approach, for instance to the problem of dimensionality, an alternative criterion of "interpretability" is often proposed, for instance by Kruskal (1964a, p. 16). "Interpretability" is not elucidated by Kruskal, but the concept appears to be equivalent to looking at a configuration and seeing to what extent it makes sense. Faced with configurations in differing dimensionalities one picks the one which makes most sense. This approach is probably also the most common in connection with the problem of applicability. If none of the configurations in the relevant dimensionalities makes sense, the scientist will probably conclude that a spatial model is not applicable for his data.

Before commenting on the relations between a formal approach and a content orientation, a somewhat more detailed account of the process of "making sense of" the results is necessary. The extent to which a scientist makes sense out of results from an analysis clearly depends upon how the results fit in with his preconceptions as to what the results might have been. Henryson (1957) gives a useful description of various ways of using factor analysis which is equally relevant for multidimensional scaling. First of all he distinguishes between descriptive and explanatory use of factor analysis. Descriptive use is not relevant for our purpose, but the further subdivision of explanatory use into hypothesis testing and hypothesis generating is highly relevant. Emphasizing hypothesis testing appears to be equivalent to placing multivariate models within a strict hypothetical-deductive framework. In such cases the content of the configuration is completely specified a priori. The study of Munsell colours by Torgerson (1958) may be used as an example; here the Munsell system provides a framework for specifying precisely how the colours are expected to be related. On the other hand, hypothesis generation is equivalent to explorative research. Thurstone (1947, p. 56) stressed the explorative nature of factor analysis: "The new methods have a humble role. They enable us to make only the crudest first map of a new domain". The study by Osgood and Luria (1954) may serve as an example of exploratory use. It is, however, important to emphasize that the above distinction does not delineate two different kinds of research. As Henryson (1957, p. 91) points out:

It is not possible to maintain any clear distinction between testing and generating of hypotheses.
By hypothesis testing some of the hypotheses can be disproved, and at the same time a basis can be found for establishing of new ones. These, in their turn, require new tests, and so on. Nor is the generating of hypotheses a completely isolated procedure. It is usually based on some form of more or less implicit hypotheses, which become evident in the selection of variables…

This points the way to a dialectic view of the process of interpretation. The scientist has more or less clearly preformed images (expectations) as to how the results will turn out, and these expectations guide his preliminary interpretations. On the other hand, the results will stimulate more or less pronounced changes in the images. In predominantly explorative research the images will be of low clarity, or sometimes we might even say that the results are not matched with images but stimulate generative processes. If such processes do not provide a meaningful and subjectively acceptable structure, the scientist will conclude that the results are worthless or that the model was not applicable. In strict hypothesis testing the scientist will have a completely preformed image. In Piagetian terms we might say that in this case the question is whether the results can be assimilated within the present structure or not. For explorative research the emphasis is mainly on how the cognitive structure of the scientist accommodates to the empirical results. In practice the process of interpretation will be a very complicated interplay of assimilative and accommodative processes.

Should interpretation - which is here regarded as a content oriented approach - guide decisions on applicability, dimensionality and precision? One could regard this as a "subjective" (vague, difficult to communicate) criterion, which ought to be replaced by an "objective" (precise, quantitative, communicable) criterion (a formal approach). One can indeed regard much of the history of multivariate methods as attempts to replace "subjective" with "objective" criteria for decisions. An argument can be made that our three main problems are best settled by a purely formal approach, which will then provide an optimal framework for interpretation. This approach provides a different perspective on the relation between "subjective" and "objective" approaches. To illustrate this we first point out some dangers connected with unfettered interpretation.

There is first what we might call "overinterpretation". Imagination may run wild. In discussing the Rorschach test, Smedslund (1967) suggests that the capacity of some specialists to make sense out of this type of verbal material may be compared to how some people manage to "see" goldfish swim in conformity with music being played. More to our point, there is a disturbing report of how accomplished experts can make sense out of randomly constructed configurations (Armstrong and Soelberg, 1968). This danger may be particularly salient when the points represent verbal material, since for any scientist extensive networks inter-relating concepts exist. Stimulated by for instance psychoanalytic thinking one may then easily make sense out of practically any configuration. A quotation from Osgood and Luria (1954, p. 588)
may illustrate this: "To rhapsodise, Eve Black finds PEACE OF MIND through close identification with a God-like therapist (MY DOCTOR, probably a father figure for her), accepting her HATRED and FRAUD as perfectly legitimate aspects of the God-like role." A knowledge of the precision of the configuration may indicate whether the clusters forming the basis for such an interpretation can be reliably identified. This will be a necessary condition for having confidence in interpretations (though of course not a sufficient one). If precision is fairly low, this will serve to temper speculation; on the other hand, a high precision may encourage more detailed interpretation.

At the opposite pole from overinterpretation there is the danger of what we may call "false neglect". If one is tempted to throw away the results because the output does not seem to make sense, a high precision should encourage continued attempts at interpretation.

While overinterpretation and false neglect are easiest to illustrate within mainly explorative designs, similar processes may occur when the research is closer to conventional hypothesis testing. First, a knowledge of precision is necessary to decide to what extent only the crude aspects of the hypothesised structure can be verified. If the precision is very low the results may be simply irrelevant to the hypothesis; highly incongruent results cannot refute the hypothesis in this case. Recalling that research rarely conforms to a strict hypothetical-deductive model, we can outline a general type of choice facing the scientist when he has some commitment to a detailed hypothesis. In most cases the results will show both features supporting the hypothesis and features deviating from it. Correspondingly, the scientist can choose whether he will emphasize the support for the hypothesis or the deviations. A knowledge of the precision may give valuable information guiding this decision. The previously quoted example from Torgerson (1958, Ch. 11) may be used as an illustration. Comparing Fig. 4 and Fig. 8 one is first of all impressed by the pronounced similarity between the Munsell placement (representing the hypothesis) and the empirical outcome (Fig. 8). On the other hand, there are some deviations; in Fig. 8 stimulus 2 is somewhat displaced and, to a lesser extent, stimulus 5. If now the precision is extremely high, these discrepancies might suggest taking a closer look at whether research should be undertaken which might result in revision of the Munsell system. On the other hand, with a more moderate precision there might be no basis for expecting closer fit than actually observed, and in such a case there would be no reason for ascribing any significance to the deviations.

Concerning dimensionality, "interpretability" does not seem to be a desirable criterion. If a detailed hypothesis exists, a formal approach might provide a valuable independent check on the hypothesized dimensionality. If interpretability in this case were used as the criterion of dimensionality, one might miss the possibility that a formal approach might have provided a different answer. This danger seems even more pronounced when the research is more exploratory. In this case no clear image of the outcome exists, and the hazard of ending up with a wrong dimensionality is added to the hazards of overinterpretation and false neglect.
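The comparison just described - matching an observed configuration against a hypothesized placement and asking whether the residual discrepancies deserve attention - can be given a minimal computational illustration. The sketch below uses Procrustes superimposition as one possible congruence measure; the nine "stimuli", their hypothesized grid placement and the displacements of stimuli 2 and 5 are all invented for illustration, and the disparity value printed is not the precision index developed later in this work.

```python
import numpy as np
from scipy.spatial import procrustes

# Hypothesized placement of nine stimuli on a 3 x 3 grid (illustrative values only,
# not Torgerson's Munsell coordinates).
hypothesis = np.array([[v, c] for v in (1.0, 2.0, 3.0) for c in (1.0, 2.0, 3.0)])

# "Observed" configuration: the hypothesis plus small random error, with stimuli
# 2 and 5 deliberately displaced.
rng = np.random.default_rng(0)
observed = hypothesis + rng.normal(scale=0.05, size=hypothesis.shape)
observed[1] += [0.6, -0.4]        # stimulus 2 displaced
observed[4] += [0.3, 0.2]         # stimulus 5 displaced, less so

# Procrustes superimposition rotates, reflects and rescales one configuration onto
# the other; the returned disparity is the remaining sum of squared discrepancies.
_, _, disparity = procrustes(hypothesis, observed)
print("disparity after Procrustes fit:", round(disparity, 4))
```

Whether a disparity of this size should be read as evidence against the hypothesized placement is exactly the question an estimate of precision is meant to answer.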
The general position taken here is quite similar to the case made by Armstrong and Soelberg (1968). The distinction between a formal and a content oriented approach must be regarded as tentative. A more satisfactory description of the relation between them would require a psychological description of how research actually occurs. Unfortunately this is a neglected field (the studies of "root staring" previously referred to would cover one small and perhaps insignificant area in this field). It is to be hoped that studies will be undertaken to give a detailed description of how experienced scientists actually go about completing "phase 3" (cfr. Fig. 1). This might in turn give rise to even better guidelines for improving future research. The motivation for the present work is, however, that improving the present formal approach will be of help for research even though a satisfactory formulation of the relation between this approach and "substantive", content oriented considerations is lacking.

1.3 General properties of data reduction models.

The conceptual framework to be developed in this work has grown out of a more general interest in exploring a specific view on multivariate techniques. This view regards nonmetric scaling as one example of a broad class of methods which not only serve to give a maximally comprehensible representation of a complex set of data, but also have substantial theoretical implications. As expressed by Coombs (1964, p. 6):

The entire process of measurement and scaling may be said to be concerned with data reduction. We have perhaps an overwhelming number of observations and their comprehensibility is dependent upon their reduction to measurement and scales. This is a mechanical process but only after buying a (miniature) theory of behaviour.

"Data reduction" and "model" (miniature theory) in the term "data reduction models" are highly interrelated. The most general starting point for any model for data analysis is the belief that there is some structure (constraints) in the elements in the data matrix. This point of view is clearly brought forth by Shepard (1963a, p. 33-34) when he discusses substitution errors in learning Morse signals:

now presumably there is some structure or pattern underlying [these (36·35)/2 = 630 offdiagonal entries]. Otherwise there would be no purpose in presenting these numbers separately, their means and variance would alone suffice. But of course we do not believe that these numbers are independent and random.

Inspection is not a sufficient method to grasp this structure:

Unfortunately, though, man's information processing system is notoriously unable to discern any pattern in an array of [630] numbers by inspection alone. Therefore in order to find out what these numbers can tell us about the way this system processes dot and dash symbols we must first supplement this natural processing system with artificial machinery.

It is such "artificial machinery" that we call data reduction models. Successful application of such a model then rests upon assumptions about the structure (constraints):

the entire set of numbers is in some way constrained or redundant. This in turn implies that this set can in principle be recovered from a smaller set of numbers. And if the recovery is sufficiently complete, then the smaller set of numbers contains all the information in the original larger set. In fact we might [then] be said to have captured the latent structure behind the manifest data. (underlined here), (op. cit. p. 34).
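The arithmetic behind "recovered from a smaller set of numbers" is easy to make concrete. In the minimal sketch below (plain numpy; the two-dimensional latent space and the random coordinates are assumptions made for illustration, not Shepard's analysis of the Morse data), 36 objects placed in two dimensions are described by 72 coordinates, yet those coordinates determine all 630 pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 36, 2                               # 36 objects, a hypothetical 2-dimensional latent space

A = rng.normal(size=(n, t))                # coordinate matrix: n * t = 72 numbers
D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)   # full distance matrix

n_pairs = n * (n - 1) // 2                 # number of distinct offdiagonal entries
print(f"{n * t} coordinates determine all {n_pairs} pairwise distances")
```

If proximities really behave like distances in a low-dimensional space, the 630 observed numbers are in this sense redundant; if they do not, no small coordinate matrix will reproduce them well.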
A data reduction model can then be said to imply a specific theoretical view of the kind of constraints in the data. The theoretical content is embodied in the type of latent structure and also in how the relation between the latent structure and the manifest data³ is conceptualized. The symbol L will denote the latent structure or model; M will be used to denote the data, the input to a data reduction model. A further basic concept is that in practice there can never be a simple, direct relation between L and M; there is always measurement error or noise to be coped with. A very interesting notion is that a data reduction model may somehow strip away the noise which infests the data, so that the output may give a truer picture. This point of view is clearly expressed by Shepard (1966, p. 308):

The analysis can be regarded in part as a technique for purifying noisy data. A spatial representation [output from the analysis] of the type considered here can be both more reliable and more valid than the fallible data from which it was derived. (underlined here).

A similar point of view is expressed by Coombs et al. (1970, p. 32): to construct a scale from noisy data, one needs a scaling method that "removes" the error and provides means of estimating the "true" scale values.

I have not seen any systematic attempt to develop procedures to investigate whether it really is the case that data reduction models may "purify data" in the sense that the output "is more reliable and valid". If this possibility exists, this logically implies that there is also the possibility that a data reduction model may distort the data in the sense of giving a less valid representation than M. A major aim of the present investigation is to attempt to clarify conditions under which data reduction models may distort or purify data.

The discussion so far may tentatively be summarised by listing the following possible properties of the output from data reduction models:

a) The output is simpler than M - it may be regarded as a reduced description of M since it contains fewer numbers.
b) From the output one may - more or less completely - reconstruct (recover, reproduce) M.
c) The output from a data reduction model may give a "truer picture" than M in the sense that noisy data are purified.
d) The output may reveal underlying - latent - structure. The symbol L will denote the latent structure or model.
e) The output may give information about psychological processes.

Of these properties we especially wish to emphasize c). Generally it seems reasonable to assume that if a data reduction model may give a false or distorted picture, it can in this case not reveal underlying structure and thus not give valid information on psychological processes. Conversely, purification may tentatively be listed as a necessary (but perhaps not sufficient) condition for a data reduction model to reveal underlying structure. Provided our data reduction model contains a valid conception of the processes underlying the data, we expect purification to occur. On the other hand, if the model implies an inappropriate theory we might expect distortion to occur. c) is thus related to the larger problem of evaluating theories. Clearly procedures for evaluating such theories should have priority over the often stressed need for "comprehensibility" or a simpler, more manageable description than M.
(What use can a "simple" description have if it distorts the data?) In other words we wish to emphasize that c) is also more basic than a).

³ The term "manifest data" might be regarded as a redundant misnomer, since data always are "manifest". "Manifest" is, however, a useful contrast to "latent", and the terminology is seen to have some currency in the literature.

So far the quotations supporting the listed properties have been drawn from multidimensional scaling. The general description given is, however, also applicable to other types of multivariate models. Concerning for instance factor analysis, Thurstone refers to this type of model as a method for "revealing underlying order". He asks the reader to imagine a correlation matrix where various performance measures are intercorrelated, and:

our question now is to determine whether these relations can be comprehended in terms of some underlying order which is simpler than the whole table of several hundred experimentally determined coefficients of correlations (Thurstone, 1947, p. 57).

There is also the aim to "discover whether the variables can be made to exhibit some underlying order that may throw light on the processes that produce the individual differences" (op. cit. p. 57) - cfr. e) above.

The basic term "latent structure" is probably best known from Lazarsfeld's work. In his major exposition of "latent structure analysis" (LSA) (Lazarsfeld, 1959) he repeatedly refers to "underlying constructs": "Locating of objects cannot be done directly. We are dealing with latent characteristics in the sense that their parameters must somehow be derived from manifest characteristics. Our problem is to infer this latent space from manifest data." (op. cit. p. 490.) It is interesting to note that LSA may be regarded as a data reduction model with the same general properties as the model discussed more explicitly in the article by Shepard quoted above. Lazarsfeld is concerned with "the restrictions put on the interrelations within the data by the assumptions of the model" (op. cit. p. 507) - in Shepard's terminology this corresponds to "redundancy" in the data. When discussing an application of LSA to academic freedom Lazarsfeld is clearly concerned with how the latent structure may throw light on psychological processes (op. cit. p. 528-532). Finally, an important step in LSA is, from a set of latent parameters, to compute "fitted manifest parameters" (op. cit. p. 507). This corresponds to b) above. Since we have especially stressed c), the following quotation is of special interest: "We know that many individual idiosyncrasies will creep into the answers. -- In the manifest property space [corresponding to our M] we are at the mercy of these vagaries. But in the latent space --- we can take them into account and thus achieve a more "purified" description". (op. cit. p. 490, underlined here.) Since LSA deals with quite different data structures than similarities data, this shows the general character of the concept of data reduction models as elaborated in a) - e).

The central concepts of this section, purification and distortion, are relevant to the presently unsolved problem of evaluating applicability of multidimensional scaling models. (An axiomatic formulation by Beals et al. (1968) has met with limited success, cfr. Zinnes, 1969.)
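The purification question itself can be phrased as a small simulation, which is essentially the strategy pursued in Part I. The sketch below (plain numpy; the latent configuration, the error model and the use of classical Torgerson scaling in place of a nonmetric algorithm are all assumptions made for illustration) generates a latent configuration L, degrades its distances into "manifest" data M, reconstructs a configuration from M, and then compares how well M and the reconstruction each agree with L. Purification, in the sense of property c), would show up as the last correlation exceeding the second:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 15, 2

# Latent structure L: a configuration of n points in t dimensions, and its distances
L = rng.normal(size=(n, t))
true_d = np.linalg.norm(L[:, None, :] - L[None, :, :], axis=-1)

# Manifest data M: the true distances perturbed by symmetric random error
noise = rng.normal(scale=0.3, size=(n, n))
M = np.abs(true_d + (noise + noise.T) / 2)
np.fill_diagonal(M, 0.0)

# Reconstructed configuration: classical (Torgerson) scaling of M in t dimensions
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (M ** 2) @ J                  # double-centered squared distances
w, V = np.linalg.eigh(B)
keep = np.argsort(w)[::-1][:t]               # t largest eigenvalues
X = V[:, keep] * np.sqrt(np.maximum(w[keep], 0.0))
rec_d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

iu = np.triu_indices(n, k=1)                 # offdiagonal pairs only
print("apparent fit  r(M, reconstruction):", round(np.corrcoef(M[iu], rec_d[iu])[0, 1], 3))
print("noise level   r(L-distances, M)   :", round(np.corrcoef(true_d[iu], M[iu])[0, 1], 3))
print("true fit      r(L-distances, rec.):", round(np.corrcoef(true_d[iu], rec_d[iu])[0, 1], 3))
```

Whether, and under what conditions of n, dimensionality and noise level, the reconstruction really comes out closer to L than M does is precisely what the simulation studies of Chapter 4 are designed to map out.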
Logical clarification of the terms "purification" and "distortion" will be the main task for the next chapter. This chapter will be concluded with a section giving an introduction to the two types of models we are going to discuss in detail.

1.4 Type of model and psychological theory.

The most important aspect of any data reduction model is how the nature of L is conceived within the model. Two broad classes of models may be distinguished, both of which are geometrical. The best known is some kind of spatial (or dimensional) model; the other type of model will be called a tree structure model.

1.41 Spatial (dimensional) models.

In a spatial model the objects are represented as a configuration of points in a space of specific dimensionality. The basic relation between points in space is the mathematical relation distance. There are three axioms a function must satisfy in order to be a distance function:

d_ij = 0 if and only if i = j
d_ij = d_ji   (distance is a symmetric function)
d_ij + d_jk ≥ d_ik

The third is the most important axiom and is usually referred to as the triangular inequality. The sum of two distances in a triangle cannot be less than the third, or, stated otherwise, the indirect distance between two points (i, k via j) cannot be less than the direct distance, d_ik. The most well-known distance function is the Euclidean distance, well known from plane geometry in high school:

(1)  d_ij = [ Σ_{l=1..t} (a_il − a_jl)² ]^(1/2)

where t is the number of dimensions and a_il is the coordinate of point i on dimension l. This is, however, only a special case of a more general class of distance functions, usually referred to as Minkowski metrics:

(2)  d_ij = [ Σ_{l=1..t} |a_il − a_jl|^p ]^(1/p),   p ≥ 1

Beals et al. (1968) prefer the term power metrics, which seems better since it is a more descriptive term. In the mathematical literature the term l_p metrics is widely used. For the power (p) equal to 2, equation (2) is readily seen to give the usual Euclidean function. Two other special cases of equation (2) are of interest. For p = 1 the metric becomes the so-called city-block distance:

(3)  d_ij = Σ_{l=1..t} |a_il − a_jl|

For this model "we must think of the shortest distance between two points (stimuli) as passing along lines parallel to the axes: metaphorically speaking, we must always go around the corner to get from one stimulus to the other" (Attneave, 1950, p. 521). For the other limiting case, p = ∞, d_ij reduces to:

(4)  d_ij = max_l |a_il − a_jl|

which is called the l_∞ metric or dominance metric. In this space only the largest component difference counts; all the others are neglected. A coordinate matrix A of dimensionality (n, t), with a_il as an element, is the primary output from a spatial data reduction model. For p ≥ 1 the l_p metrics are known to satisfy the triangular inequality. From equation (2) it is readily apparent that the first two distance axioms are satisfied.

Why should spatial models be relevant to similarities data? And does the power (p) have any possible psychological significance (since spatial models are mostly based on power metrics)? The first question is answered by Shepard, who points out:

There is a rough isomorphism between the constraints that seem to govern all of these measures of similarity on the one hand, and the metric axioms on the other.
Why should spatial models be relevant to similarities data? And does the power (p) have any possible psychological significance (since spatial models are mostly based on power metrics)? The first question is answered by Shepard, who points out: There is a rough isomorphism between the constraints that seem to govern all of these measures of similarity on the one hand, and the metric axioms on the other. In particular, to the metric requirement that distance be symmetric, there is the corresponding intuition that if A is near B then B is also near A. To the metric requirement that the length of one side of a triangle can not exceed the sum of the other two, there is the corresponding intuition that, if A is close to B and B to C, then A must be at least moderately close to C. -- this in turn invites an attempt to carry the powerful quantitative machinery that has developed around the concept of distance to the intuitively defined notion of proximity. (Shepard, 1962a, p. 126.)

As for the second question raised above, Shepard has again made important contributions. He has drawn a distinction between analyzable and non-analyzable stimuli. For non-analyzable stimuli the different dimensions are not phenomenologically given for the person judging similarity or difference. The standard example here is pure colours, where the dimensions hue, brightness and saturation are not immediately given. For such stimuli Shepard suggests that judgements may be mediated by a Euclidean model. For analyzable stimuli, however, he suggests that the component differences will not be combined according to a Euclidean model, but that the city block model may be more appropriate. For further discussion of the implications of this for verbal learning and decision making see Shepard (1963b, 1964). Torgerson drew a similar distinction in his 1958 book and later (Torgerson, 1965) suggests that a Euclidean model may be appropriate for purely perceptual processes while the city block model will be more appropriate for cognitive processes. For the other special case of the power metrics Coombs et al. (1970, p. 64) state:

The p = ∞ model corresponds to Lashley's principle of dominant organization. He (1942) proposed that the mechanism of nervous integration may be such that when any complex of stimuli arouses nervous activity, that activity is immediately organized and certain stimulus properties become dominant for reaction while others become ineffective. This model, called the dominance model, is suggested by experiments in which some one stimulus property appears to be dominant in exerting stimulus control of behaviour. The distance between two points in such a metric is the greatest of their differences on the component dimensions.

A general perspective is provided by the following quotation from Coombs (1964, p. 248-249):

Any model which presumes to make a multidimensional analysis of a data matrix is by its very nature a theory about how these components [the coordinate differences in equation (2)] are put together to generate the behaviour. Any theory about a composition function [cfr. p in equation (2)] is a theory about behaviour. This, it seems to me, is what makes the subject interesting and important. The components in and of themselves are static, inert and just descriptive, until a composition model imbues them with life. Perhaps most of psychological theory can be expressed in the context of a search for composition models.

1.42 Tree structure model (hierarchical).

While in spatial models objects are regarded as points in a space, a basic notion in a tree structure is the concept of an object as belonging to one of a set of non-overlapping classes. In the present work "class", "type" and "cluster" will be treated as synonymous concepts.
In the field of clinical psychology one may, albeit very roughly, distinguish between type theories versus dimensional (or trait) theories. Horney, Jung, Kretschmer and Freud may for instance be regarded as theorists describing persons as falling within separate types, while Allport, Cattell and Guilford may be taken to exemplify theorists preferring to describe persons as varying along a number of dimensions. Considering psychiatric classifications within the framework of data reduction models Torgerson writes: "It is possible to categorize the investigators in two classes: the typologists and the dimensionalists" (Torgerson, 1967, p. 179).

In Fig. 3 the notion of representing a classification as a tree is illustrated. "Tree" is used in the strict graph theoretical sense of the term in this work, that is, a tree is a connected graph with no loops. Whenever a tree structure is used as a model, the objects will be represented as terminal nodes or leaves in the graph, cfr. level 0 above. The branches from the leaves to the nodes at level 1 indicate which of the objects are included in each of the three classes. The final node, level 2 above, is the root node which serves to delineate the domain of interest in a specific study. From the point of view of Stevens' well known classification of scales, a typology is equivalent to a nominal scale. This is often regarded as a rather primitive way of describing structure. Another way of stating this point of view is that a tree with just two levels is in most cases not very interesting psychologically. The tree concept becomes much more powerful by considering not only a simple classification (as we have done so far), but also subclasses within classes and further sub-subclasses within subclasses etc. This way of thinking is perhaps best known from taxonomic schemes in biology, where we at a fairly “high” level have phyla, then classes within a phylum, orders within a class, genera within an order, species within a genus and finally as leaves specific creatures. In psychology tree structures are used rather infrequently compared with spatial models. Miller (1969) has used a tree structure model in semantics - partly inspired by taxonomic schemes - to account for sorting data. He has also reanalyzed some of the free association data of Deese (1962) with such a model. Since tree structures are not generally well known in psychological research, a detailed example is given below. The example (from Gibson, 1970) also illustrates that a tree model may give information on psychological processes (cfr. e), p. 10).

Fig. 4. Tree structure resulting from a hierarchical cluster analysis of latency data for visual discrimination of pairs of letters by adult subjects. Based on Gibson (1970, p. 139).

The branches in this example may be considered as graphemic features. The branches we have labelled are commented on by Gibson: "The first split separates the letters with diagonality [a2, which subsumes the cluster (MNW)] from all the others [a1, which subsumes (CGEFPR)]. On the left branch [b1] the “round” letters, C and G, next split off from the others [b2 - (EFPR)]. At the third branch, the square right-angular letters, E and F [c1], split off from letters differentiated from them by curvature [c2]”.
The tree structure for seven year old children was similar, but not quite the same. Gibson's summarizing comments are especially interesting from the point of view of throwing light on psychological processes: The result “suggests to me that children at this stage may be doing straightforward sequential processing of features, while adults have progressed to a more Gestalt-like processing picking up higher orders of structure given by redundancy and tied relations" (op. cit. p. 139).

How can a tree structure be extracted from a symmetric data matrix? The basic notion underlying all classification is that objects within a cluster are close together while objects belonging to different clusters are less close together. Extending this notion to subclasses implies that the objects represented in clusters at the lowest levels (closest to the leaves) are closest together. Conversely, a specific pair of objects are further apart the closer to the root one must move in order to find a cluster which contains the pair. While this concept of closeness applied to trees is not recent, it seems to have been first formalized in distance concepts by the highly important work of Johnson (1967). His formalization also resulted in simple algorithms which can be applied to noisy data. Johnson's concept of hierarchical clustering schemes, HCS, which is isomorphic to tree structures, will be extensively treated in Ch. 5.

Chapter 2. A METAMODEL FOR DATA REDUCTION MODELS

2.1 A metamodel.

A basic thesis in this work is that data reduction models may be regarded as (miniature) psychological theories. While this work mainly deals with similarities data, in Section 1.3 a set of properties of data reduction models is outlined which transcends similarities data. In Section 1.4 two alternative geometric models are briefly sketched, spatial models and tree structure models. The introduction to these models specifically illustrates the possible substantive relevance of the models. For convenience in the following discussion the properties of data reduction models outlined in Section 1.3 are briefly restated here.

a) Parsimony. The output is simpler than M (the data input to the model).
b) Reconstruction. From the output more or less complete recovery of M is possible.
c) Purification. The output may give a truer, more purified description and thus be said to:
d) reveal latent structure (L), and
e) give information on psychological processes.

While e) above most directly ties in with the basic thesis on data reduction models (that they are substantive theories), we have chosen to emphasize c) since it may be regarded as a necessary condition for d) and e). An investigation of c) is then relevant to the general problem of evaluating theories (applicability). As discussed in Ch. 1 the usual approach to evaluating data reduction models is to compute indices of goodness of fit and evaluate such indices either more or less intuitively or in the light of statistical sampling distributions. Goodness of fit will be seen to be directly related to b) and a) above. The task of this chapter is to outline a conceptual framework which will interrelate the usual goodness of fit approach and purification (and distortion). This conceptual framework is called a "metamodel". The term metamodel should be taken to imply a conceptual framework relevant to a wide variety of types of model.
The metamodel will serve as a useful heuristic guide to general insights on data reduction models and also for suggesting new methods for answering problems of applicability, dimensionality and precision. The basic and simplest form of the metamodel is presented in Fig. 1.

Fig. 1. A metamodel representing the relations between latent structure (L), manifest (M) and reconstructed data (G). NL - Noise Level. AF - Apparent Fit, goodness of fit, stress, degree of recovery of M from G. TF - True Fit, degree of recovery of L by output G.

Before discussing the interrelated concepts “apparent fit”, "noise level" and “true fit”, some preliminary comments concerning L, M and G are necessary.

Concerning the data M, we have so far considered this as a square, symmetric matrix for similarities data. It will, however, be useful to consider M as being in vector mode. Any square, symmetric matrix may be strung out as a vector, for instance in the sequence (2,1), (3,1), (3,2), (4,1), (4,2), etc. This sequence excludes the diagonal, and the vector consists of (n choose 2) = n(n−1)/2 elements. Since we exclude asymmetric matrices the symmetric values (1,2), (1,3), etc. are safely ignored. Other types of data matrices than similarities data may be strung out as vectors in different ways. Unless otherwise specified, M will from now on always refer to a vector with (n choose 2) elements. An element in this vector will be referred to as m_ij.

While only one mode needs to be considered for M, two modes must be distinguished for both G and L. For spatial models the output G is usually considered in the mode of configuration, an (n, t) matrix of coordinates where n is the number of points and t the number of dimensions. If this mode is specifically intended, the symbol G(C) will be used. g_il will refer to a specific element (coordinate) in G(C). The coordinates g_il may be inserted in equation (2), Ch. 1. This will give reconstructed distances which we here consider as another mode of G. These reconstructed distances will be in strict one to one correspondence with the data elements. As for M we consider these reconstructed distances as a vector (with (n choose 2) elements for similarities data). If this mode is specifically intended, the symbol G(V) will be used. G will serve as a generic symbol. G will be used if the context makes the specific mode intended apparent, or if it is not essential to specify the mode. There are the same two modes for L as for G. The configuration mode will be denoted L(C), L(V) will denote the distance mode, and L will be used as a generic symbol in the same way as G will be used.4

Though the concept latent structure will be discussed in more detail later, we may here note that L can be specified in more or less detail. At the most general level L may refer to just any type of latent structure. The barest amount of specification is to state the type of L, e.g. whether L is intended to denote a spatial model or a tree structure model. On the next level of specificity we may indicate a specific form of L, e.g. dimensionality and type of metric space for spatial models. Finally, what may be called the content of L may also be specified, that is concrete values of the elements in L(C). In simulation studies it is always necessary to have complete specification of the content of L(C).

Turning now to the connecting lines in Fig. 1, a central feature is that these lines can be regarded as “sizes of discrepancy”.
Consider first the relation between M and G. The extent to which one can reconstruct M from G is a question of the discrepancy between G(V) and M. The closer G(V) is to M, the more satisfactory is the "goodness of fit". Algorithms for finding G are more or less directly aimed at optimizing fit (closeness), or what amounts to the same: minimizing the discrepancy between M and G(V). Sometimes the discrepancy may even be interpreted literally as distance; this was for instance the approach taken in defining the main problem in factor analysis by Eckart and Young (1936, p. 212), who formulated the problem in terms of a least squares solution to minimizing the distance between the data matrix and a matrix of reduced rank. Kruskal’s nonmetric algorithm is directed at minimizing stress, his index of discrepancy. Since Kruskal, however, treats M strictly as an ordinal scale, stress cannot directly be regarded as the distance between G(V) and M. The term discrepancy is somewhat less precise than the term distance, but longer lines in the metamodel will always be taken to imply larger discrepancies.

4 In Section 5.1 we discuss how a tree is algebraically represented and how one can compute from L(C) the corresponding distances, L(V). There is then the same relation between a tree output, G(C), and the reconstructed distances, G(V). Concerning references to single elements, double subscripts will always be used. For elements in M - and associated vectors to be considered in Section 3.1 - there can be no confusion since there is only one mode to consider. For G(V) convention dictates using d_ij to denote elements. Finally the context will make clear whether subscripted l’s refer to elements in L(C) or L(V).

Since we may also from a tree output compute reconstructed values, G(V), it is possible to compute goodness of fit also for a tree structure model. It may, however, be noted that the algorithms for finding the tree output are not explicitly directed at minimizing the discrepancy between G(V) and M. The important point in this context is that a tree structure model may be discussed within the same general framework as spatial models, which has not previously been done. According to Coombs (1966, personal communication) any data reduction model consists of "a theory of behaviour and an algorithm for applying the theory". We are thus faced with the double problem of evaluating the theory (in this context how L is specified), and, since for similarities data there are a variety of competing algorithms, there is also the problem of evaluating algorithms. A central thesis in the present work is that goodness of fit cannot provide sufficient answers to these problems, cfr. the critical comments on stress in Section 1.2. Consider first a basic claim for nonmetric methods, that a configuration with metric properties can be recovered just from ordinal data. The only way of demonstrating this is by first "knowing the answer" and then showing that from the ordinal information in M it is possible to get back essentially what we started from. In terms of the present terminology this implies starting with a complete specification of L(C), then computing L(V) and then setting M equal to L(V). In the analysis all but the ordinal information in M is then ignored. The question of “recovery” is then answered by comparing G(C) with L(C), and to the extent that these are similar, the answer is satisfactory.
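The "knowing the answer" scheme just described may be sketched as follows (Python; the configuration, the use of Euclidean distances and the placeholder for the scaling program itself are illustrative assumptions, not part of the original text):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import rankdata

    # A hypothetical, completely specified true configuration L(C): n = 4 points, t = 2.
    L_C = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 2.0],
                    [2.0, 2.0]])

    # L(V): the n(n-1)/2 true distances strung out as a vector.  pdist uses one
    # fixed ordering of the pairs; any ordering will do as long as it is used
    # consistently for L(V), M and G(V).
    L_V = pdist(L_C)

    # Noise-free case: the data M are set equal to L(V), and only the rank
    # order of M is retained in a nonmetric analysis.
    M = L_V.copy()
    ordinal_M = rankdata(M)

    # some_nonmetric_program(ordinal_M) stands here for MDSCAL, SSA-1 or TORSCA;
    # its output G(C) would then be compared with L(C) to assess recovery.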
Indeed, this basic scheme was followed in the first example given both by Shepard (1962b) and Kruskal (1964a), as will be discussed in more detail later. At present we note that the discrepancy between G and M is not relevant in answering the present question of recovery. Even if this discrepancy is zero, it might still be the case that the underlying metric configuration is very incompletely matched by G. The main point of the metamodel is to draw attention to the fact that it is more basic to know how far G is from L than to know how far G is from M. To highlight this difference we have borrowed a pair of terms from Sherman (1970). He used the term "apparent fit" (AF) for stress (goodness of fit) and “true fit” (TF) for the discrepancy between L and G. True fit has variously been called “accuracy of solution”, "metric determinacy", “recoverability of metric structure”, and ”metricity”. We think "true fit” is to be preferred since it highlights the major importance of assessing this discrepancy; furthermore it contrasts conveniently with “apparent fit”.

The case we have discussed so far, setting M equal to L(V), is not realistic as a model for empirical data, since this case makes no allowance for noise or measurement errors. In terms of the metamodel this case corresponds to zero discrepancy between L(V) and M, or noise level (NL) = 0, and we can then conceive of L and M as coinciding. Conversely we can conceive of various amounts of noise as equivalent to various sizes of discrepancies (length of NL in Fig. 1). Methods of simulating error processes which result in various noise levels will be discussed in Section 3.2; at present we simply note that a set of data which contains much measurement error can be pictured as far removed from L, the line NL will be long. We can now give a literal interpretation of the highly abstract concept “purification”. Any error process removes M from L; purification implies that the result G moves back closer to L. In order to give precise meaning to this concept it is necessary to use the same index for NL and TF. Since we have stressed that all three basic terms in the metamodel may be presented in the same mode (vector mode), any index used to express NL may in principle be applied also to TF. Provided the same index is used for NL and TF, we then have the following simple definitions.5

5 Unless otherwise specified a smaller value of any index for NL, TF and AF will always imply a "better" value or closer congruence between the corresponding pair of L, M and G.

Purification: TF < NL
Distortion:   TF > NL

Detailed results on purification will be presented later. We shall then see that for a given noise level purification increases with n, a result highly consonant with the general expectation that it pays to have a lot of data. From the metamodel we should, however, then expect that stress should increase with n, since the more noise is stripped away, the further the result G must be removed from M (of course in the direction of L). This result strongly points out the advantage of using the metamodel as a conceptual framework. If the emphasis is solely on stress it would be very disconcerting to have stress increasing with n when the results really get better. We have just implied that without the guidance provided by the metamodel one may fall prey to a pseudoconflict between the goal of having high n and the goal of a good apparent fit.
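The definitions just given can be summarized in a small sketch (Python; the root-mean-square index used here is only one of the candidate indices discussed in Ch. 3, and the function names are of course not part of the original text):

    import numpy as np

    def discrepancy(x, y):
        """Root-mean-square discrepancy between two vectors in the same (vector) mode.
        Smaller values imply closer congruence, cfr. footnote 5."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sqrt(np.mean((x - y) ** 2))

    def classify(L_V, M, G_V):
        NL = discrepancy(L_V, M)     # noise level: discrepancy between L and the data
        TF = discrepancy(L_V, G_V)   # true fit: discrepancy between L and the reconstruction
        AF = discrepancy(M, G_V)     # apparent fit: discrepancy between data and reconstruction
        verdict = "purification" if TF < NL else "distortion"
        return NL, TF, AF, verdict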
A similar type of pseudoconflict has already been discussed in Section 1.2. It will be recalled that low dimensionality best satisfies parsimony (gives a high degree of data reduction) but generally gives high stress, whereas the reverse is the case for high dimensionality. To see this as a “conflict” between the goal of parsimony and the goal of remaining close to the data is nothing but a recognition of the limitation of the latter goal. We will not place special emphasis on either of these two goals but argue that the superordinate goal of searching for the best true fit will provide a new approach to the vexing problem of dimensionality.

The answer to the central question on purification - does it really occur? - has already been anticipated. The concept has some currency in the literature, but has not been found to be specified concretely before. This will be done here after discussing various indices for TF and NL in Ch. 3. Concerning the main finding, that purification generally will be found to occur, it may be objected that this really is a trivial finding, that it is just a consequence of a quite general “averaging” process. Purification may be seen as analogous to the general fact that for a sample of measurements generated by a certain stochastic process, the mean may be a better representation than the separate observations. To take a different example, a regression line may give a better representation of a relation than the whole scatter diagram. For nonmetric models the case with no noise will of course always involve some distortion. When L and M coincide, G should also coincide with them if there were no distortion. The fact that G will not exactly coincide with L in this case implies some distortion (the amount of distortion in this case will be discussed later). When the limiting case of no noise involves distortion, it is not trivial that the usual case with more or less noise will imply purification. There is, however, a more important way to answer the objection of "triviality". The fact is that no objection really has been stated. A data reduction model implies a conception of the type of constraints in the data. If this conception is valid we may generally expect purification. If, however, this conception is not valid, distortion may be expected. Concerning the simple statistical examples just mentioned it may first be pointed out that it is not trivial that a regression line will give a "better representation". If there actually is a non-monotone relation, a regression line may be said to distort the relation. Quite often summary statistics such as the mean and variance are regarded as not having theoretical implications, just being convenient "descriptive statistics". This point of view is strongly contested by Mandelbrot (1965). Discussing “a class of long tailed probability distributions” he points out that the Pareto distribution is useful in describing a variety of phenomena and that for this distribution the second moment is not finite, and thus the variance in any sample will not be of any use as a descriptive statistic. He even points out that forms of the Pareto distribution where the first moment does not exist may also be useful - in this case not even the mean of a sample would be of any use but would just obscure the underlying processes generating the distribution. The position taken here is that purification is a useful concept because the contrasting phenomenon, distortion, may also occur.
This may generally be expected to occur when an inappropriate model is applied to a specific set of data. This will be illustrated by showing that when a tree structure model is applied and L is of a spatial type, G will be further removed from L than M is, as will be briefly commented upon in the Concluding remarks. A different type of application of the present conceptual framework is the previously mentioned problem of evaluating algorithms. Which of the competing algorithms for similarities data is best? If for a wide variety of different true configurations, L, method A consistently gives better values of TF than method B, method A is clearly to be preferred. Notice that for this type of problem a comparison of the different goodness of fit indices is not relevant. Partly they may not be comparable (concerning for instance the Shepard-Kruskal approach versus Guttmann-Lingoes, it will be shown in Section 3.1 that the different indices of goodness of fit are not strictly monotonically related). But even if they were strictly comparable, the important issue in comparing the methods is not which one gives output closest to M, but rather which method gives output closest to the true configuration L. In other words, degree of purification is proposed as a criterion for comparing methods; the method with the highest degree of purification should be considered best.

We have argued that TF is more basic than the usual goodness of fit, AF. Since in empirical applications L is usually not known, TF must be estimated. If it were possible it would be desirable to work out mathematically the joint sampling distributions of TF and AF indices, but the mathematical problems seem to be insurmountable. Consequently a simulation approach will be used. Estimates of TF from AF and n and t are then proposed as the answer to the problem of precision, see Section 4.6. In simulation studies L must be completely specified. From L (and usually some error process) synthetic or "artificial" data are then generated and the solution G is evaluated from the point of view of L, which may then be called an external true criterion. While L is not known in empirical applications, cases may exist where the scientist is able to completely specify his image of L before analysis of his data. This is what we in Section 1.2 referred to as "strict hypothesis testing". Torgerson's study of Munsell colours was used as an example. In this case the Munsell classification may be regarded as an external empirical criterion. Let us label such a criterion L' and insert it in the metamodel (cfr. Fig. 1); one may then compute the discrepancies TF’ and NL’. If it turns out that G is closer to L’ than M is to L’, we may say that the hypothesis L' implies a purification of the data and our confidence in L' will increase. On the other hand it might be the case that G moved away from L', or that distortion occurred. In this case general confidence in L’ (or the theories generating L’) would not seem warranted. Theories should account for data, not lead to distorted representations. This appears to be a novel approach to the problem of evaluating completely specified hypotheses in multivariate models. In Section 1.2 we did, however, argue that research which seeks a yes or no answer to the validity of a completely specified hypothesis is rather atypical.
By also considering the discrepancy between M and G, more precise questions can profitably be asked. From the observed AF, TF can be estimated; this estimate can then be compared with TF'. Even though purification might have occurred, it might still be the case that TF' is substantially worse than TF. One might then conclude that though L’ points in the right direction, some revision is called for. This may stimulate a revised conception of L', say L’’, which may then be used as a basis for new empirical data, and the same process may continue with L’’ substituted for L'.

2.2 The extended form of the metamodel. Empirical and theoretical purification.

The simple form of the metamodel in Fig. 1 has been found useful for clarifying the weaknesses of relying too heavily on goodness of fit; estimation of true fit is proposed as an alternative. While the discussion referred to both the problem of dimensionality (form of L) and applicability (type of L), no methods were suggested to answer these problems. We will now discuss extensions of the metamodel which have implications for both of these problems. In order to estimate true dimensionality, solutions in several dimensionalities are required. These solutions might be generated from one set of data (giving a stress curve). In order to check applicability it is, however, necessary to have several sets of data generated by a given L. We first consider this case; a simple interpretation is repeated measurements on individual data. In Fig. 2 we visualize the structure when we have two sets of data, M1 and M2, and furthermore two corresponding outputs, G1 and G2.

Fig. 2. Extended form of the metamodel (for repeated measurements). The figure illustrates empirical purification, that is: Rel(G1, G2) < Rel(M1, M2).

In Fig. 2 we have diagrammed the case where G1 and G2 are closer than M1 and M2. The discrepancy between various G’s (or various M's) will generally be denoted by Rel (for relation), it being understood that the closer the pair of terms being related, the smaller is the value of Rel. Concrete indices for assessing these relations will be discussed in Ch. 3. In Fig. 2 we then have: Rel(G1, G2) < Rel(M1, M2). A major problem is now whether it is generally justified to conclude from the inequality above to purification, and thus that a valid model has been applied. We will argue that not only is it possible to conclude from the inequality in Fig. 2 to purification, but that the reverse implication also holds good, that is: purification generally implies that the inequality in Fig. 2 will be satisfied. Notice first that purification as defined on p. 18 can never be directly observed in empirical work, since with real data L is always unknown. This is not the case for the inequality in Fig. 2. This inequality can always be empirically investigated whenever we have repeated measurements. Consequently this inequality will be referred to as empirical purification. When both empirical purification and purification as defined on p. 18 are under discussion, the latter will be referred to as theoretical purification. An equivalence thesis may now be stated: Empirical purification and theoretical purification are logically equivalent, that is, the occurrence of one implies the occurrence of the other. We will state a preliminary case for this thesis by arguing against the conditions which would invalidate it.
The two invalidating conditions are diagrammed in Fig. 3.

Fig. 3. Illustrations of possible lack of equivalence between empirical purification and theoretical purification.

a) Empirical distortion and theoretical purification (cfr. Fig. 3a)

Consider first M1 and M2 as composed of L plus "large" error components. Similarly G1 and G2 may be considered as composed of the same L plus "smaller" error components. This is a direct consequence of theoretical purification. If the error terms are not too much correlated, simple psychometric theory tells us that Rel(G1, G2) < Rel(M1, M2). In this case theoretical purification will imply empirical purification, and the possibility diagrammed in Fig. 3a) will not occur. On the other hand M1 and M2 could "guide" G1 and G2 in separate directions, yet both G1 and G2 could be closer to L than M1 and M2. This could come about if G1 and G2 each capitalized on specific noise components. For this case to occur, however, substantial correlations between error components in M and in G would probably be necessary.

b) Empirical purification and theoretical distortion (cfr. Fig. 3b)

This does not seem at all likely to occur. If L is inappropriate as a model for M1 and M2, there is no basis for "guiding" G1 and G2 closer to each other. To the extent that it is safe to rule out this possibility, it is legitimate to conclude that empirical purification will imply theoretical purification. Results substantiating the equivalence between empirical and theoretical purification will be presented later. We shall then see that the 5 vectors L(V), M1, M2, G1(V), G2(V) have a structure of a Spearman type, where L corresponds to the general factor. To the extent that this holds true, the 5 vectors might be diagrammed as points on a straight line. L(V) can be considered as a unit vector, with G1(V) and G2(V) closest to L(V), as these are more saturated with the general factor than M1 and M2. A more practically useful approach to the problem of applicability is sketched on p. 23.

Consider now analyses performed in several different dimensionalities. For each dimensionality we get a pair of results (G1^t, G2^t), where the superscript t is an index for dimensionality. A simple way of estimating dimensionality is to pick the value of t where G1^t and G2^t are closest. Different dimensionalities correspond to different forms of L. More generally we then offer the following tentative selection rule: The true form of L is found by selecting the form of L which corresponds to the highest degree of empirical purification.

If, regardless of the form of L, there is no empirical purification, this may indicate that the wrong type of L has been selected, that the type of model chosen is not applicable to the specific set of data. We can now see why the simple form of the metamodel in Fig. 1 is insufficient for throwing light on the problem of applicability. The reason is that no independent estimate of NL is possible. We can then not know whether for instance a high stress indicates that the model is inappropriate - that is, that we have used a wrong theory - or whether the bad results are simply due to highly unreliable data. By increasing NL, stress will generally increase.
If one is tempted to conclude that a very high stress indicates that the model is inappropriate, high NL will then always be an alternative explanation. The situation is quite different when we know Rel(M1, M2), because from this NL can be estimated, as will be shown in Section 4.7. Furthermore TF can be estimated from Rel(G1, G2) by the same procedure. All the concepts we have introduced when discussing the extended form of the metamodel may be pictured as a redundant network of constructs as in Fig. 4.

Fig. 4. Construct network for the extended form of the metamodel.

When the method of estimation of NL from Rel(M1, M2) is the same as the method of estimation of TF from Rel(G1, G2), it follows that there must be the same ordering of TF and NL as of Rel(M1, M2) and Rel(G1, G2). This is merely another way of stating a conclusion previously indicated by a different argument, that empirical purification and theoretical purification are logically equivalent. The redundancy in Fig. 4 points to general ways of testing the appropriateness of the model. Suppose that in the apparently appropriate dimensionality (t0) there is a fairly high stress (AF|t0). From this stress one may estimate NL. If on the other hand Rel(M1, M2) implies much less noise in the data than AF|t0 does, this may indicate that the model does not apply to the data at hand. The point is that the conversion from Rel(M1, M2) to NL will not be theoretically neutral, but implies a specific conception as to the nature of L. If Rel(M1, M2) then turns out to imply much too little error, this indicates that there is more structure in the data than is captured by the specific type of L being hypothesized. In such cases we should also expect empirical distortion. A general way of evaluating applicability becomes possible through a comparison of the two values for degree of purification in a given case, the empirically given purification and the estimated theoretical one. A necessary condition for such a comparison is that empirical and theoretical purification are expressed in the same units. In Section 4.7 this will be accomplished by converting the relations in empirical purification to the same units as those used for theoretical purification.

When only one set of data is available one may, as previously discussed, plot a stress curve and look for an elbow in this curve. An alternative strategy, however, is for each separate stress value to estimate true fit in the corresponding dimensionality. This implies trying out several hypotheses as to the form of L and estimating true fit for each form. A simple solution to the problem of dimensionality is then to select the dimensionality which corresponds to the lowest (best) estimated value of true fit. Formulated as a general rule: When several estimates of TF are available - each estimate assuming a specific form of L - the true form of L is that which corresponds to the lowest value of true fit. This will be seen to be a generalization of the selection rule on p. 22, since Rel(G1, G2), and thus empirical purification, is assumed to be monotone with true fit. The general rule given above and the more specific selection rule correspond to supplementary strategies for finding the proper dimensionality. Simulation studies to be discussed in Section 4.7 will form the basis for evaluating the success of these strategies, considered both separately and jointly.
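The two supplementary strategies just described can be sketched as follows (Python; rel() stands for any of the discrepancy indices discussed in Ch. 3, and estimate_TF is only a placeholder for the regression estimates of Section 4.6 - neither is part of the original text):

    def select_t_by_empirical_purification(G1, G2, rel):
        """G1, G2: dicts mapping each trial dimensionality t to the output
        vectors G1(V)^t and G2(V)^t from two replicated data sets.
        Pick the t in which the two outputs agree best."""
        return min(G1, key=lambda t: rel(G1[t], G2[t]))

    def select_t_by_estimated_true_fit(AF, n, estimate_TF):
        """AF: dict mapping each trial dimensionality t to the observed stress.
        Pick the t with the lowest estimated true fit."""
        return min(AF, key=lambda t: estimate_TF(AF[t], n, t))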
Summary

In this chapter a metamodel has been presented as a conceptual framework for data reduction models generally and specifically for simulation studies. True fit - the discrepancy between results from the analysis and the true, latent structure - replaces apparent fit (conventional goodness of fit indices) as the central concept. Noise is an important concept in the metamodel: noisy data are assumed to be purified when the analysis is guided by an appropriate theoretical conception of the type of redundancies in the data. An optimal goal for simulation studies is to provide decision rules which can be applied in practical work. Deciding on the applicability of the model and the problem of dimensionality have been discussed. A further problem is to evaluate competing algorithms with the same general purpose. This includes both comparison between various nonmetric algorithms and also comparing metric and nonmetric algorithms. In the next chapter the three relations - noise level, apparent fit and true fit - which here have been discussed in a general way, will be treated in detail. The complexity of simulation studies will be apparent in Ch. 4. In that chapter steps are taken to realize the goals outlined for simulation studies.

Chapter 3. NONMETRIC MULTIDIMENSIONAL MODELS AND THE METAMODEL

3.1 Nonmetric algorithms and criteria for apparent fit (AF)

The most popular nonmetric multidimensional scaling method today is probably the one described by Kruskal (1964a, 1964b), who set out to improve Shepard's original program. Since Shepard has abandoned his original program in favour of Kruskal’s MDSCAL program (cfr. Shepard, 1966, Shepard and Chipman, 1970) we will not give any attention to the details of Shepard's first program. Later Guttmann and Lingoes with their "smallest space analysis" (SSA-1) and further Torgerson and Young (TORSCA) have offered programs with the same purpose as Kruskal's, see Lingoes (1965, 1966), Guttmann (1968), Young and Torgerson (1967), Young (1968a). Whether these methods have any advantages compared with Kruskal's method will be discussed in Section 4.32. Common to all the nonmetric methods is that no specific functional relationship between data and distances is assumed. The only assumption is that there is a monotone relation between distances and data. For the nonmetric methods the distances of the obtained configuration are computed so that the order of the data is optimally reproduced (stress is an index telling how well this is accomplished).6 Reproducing order is in contrast to the older metric methods where the aim was to reproduce not only the order but the actual values of the data. Since a configuration obtained from nonmetric methods has essentially the same properties as a configuration from a metric method, we can say that the nonmetric enterprise replaces the strong interval or ratio assumptions of metric methods by the much weaker ordinal assumptions. The strongest claim which can be made for the nonmetric methods is that the weaker assumptions of this approach do not in general imply any loss of information. Simulation studies relevant to this claim will be reported in Section 4.33. The present survey of the main features of the currently most popular nonmetric algorithms borrows heavily from the recent work by Lingoes and Roskam (1971).
This is by far the most detailed mathematical exposition of the main methods available and also gives fascinating glimpses of the sometimes acrimonious debates among the persons chiefly involved in developing the methods. As implied by the introductory remarks in this section the essence of the nonmetric methods is captured by the concept of monotonicity. Corresponding to a vector of dissimilarities M, with elements m_ij, is a monotonically related vector ∆ with elements δ_ij. ∆ "replaces" M in all algebraic formulations in the algorithms. The general principle of monotonicity may be stated in two slightly different forms:

6 The term "similarities", which describes the kind of data the present work focuses on, is used in two different senses. Partly it is used in a generic sense, as a general term describing symmetric data matrices. In this sense it subsumes "similarities" and "dissimilarities" as properties of the data in specific examples. In the specific sense high data values of dissimilarities correspond to large distances while the reverse is the case for similarities. For metric methods similarities are usually treated as scalar products, while dissimilarities are treated as distances which are then converted to scalar products. The presentation of nonmetric algorithms is simplified by considering dissimilarities as the basic case. Data where high values correspond to low distances are then assumed first to be converted to dissimilarities by sorting them in descending order.

Strong monotonicity: whenever m_ij < m_kl then δ_ij < δ_kl
Weak monotonicity: whenever m_ij < m_kl then δ_ij ≤ δ_kl

When there are ties in the data a further distinction is whether tied data shall imply tied δ values or not. The weaker case, where it is allowed to have tied data values and untied δ values, is, following Kruskal's preference for this approach, called the “primary approach to ties”. The stricter requirement that tied data values shall imply tied δ values is then called the “secondary approach to ties”.

Primary approach to ties: if m_ij = m_kl then either δ_ij = δ_kl or δ_ij ≠ δ_kl
Secondary approach to ties: if m_ij = m_kl then δ_ij = δ_kl

∆ (and thus M, though only indirectly through the principle of monotonicity) is one of the terms in the second basic concept in nonmetric methods, the loss function. This function is a general expression for the discrepancy between the distances - a vector D with elements d_ij computed from a (trial) configuration - and the values in ∆:

d_{ij} = \left( \sum_{k=1}^{t} |g_{ik} - g_{jk}|^{p} \right)^{1/p}

where G7 is a (trial) configuration in t-dimensional space. The loss function is then defined:

Loss = \sum (d_{ij} - \delta_{ij})^{2} / \sum d_{ij}^{2}

Loss may be regarded as synonymous with the earlier discussed “goodness of fit” concept. (Badness of fit would logically be a preferable term, since the smaller the value the better the fit.) The aim of nonmetric algorithms may now be formulated as finding the D which minimizes Loss. Note that in Loss there are two sets of unknowns, both D and ∆. An iterative process to be discussed later is necessary to find the optimal D. The basic distinction between the Kruskal and the Guttmann-Lingoes approach to nonmetric scaling lies in the way ∆ is defined. Kruskal uses the symbol D̂ for his version of ∆ and defines the d̂_ij as the numbers which (for a given D) minimize Loss while maintaining weak monotonicity with the data.
In his widely adopted terminology the quantity to be minimized is “stress”, and his celebrated stress formula is:

S = \left[ \sum (d_{ij} - \hat{d}_{ij})^{2} / \sum d_{ij}^{2} \right]^{1/2}

The Lingoes-Roskam Loss function is immediately seen to be a generalization of Kruskal’s stress.

7 G (and g_ik) are used both to denote a trial configuration and, as in Ch. 2, the final output configuration from the analysis.

In Kruskal’s approach finding the d̂_ij requires a separate minimization process. First the dissimilarities in M are arranged in increasing order. The set of distances, D, is partitioned into “the smallest possible blocks such that each block consists of a subset of distances whose subscripts are associated with consecutive values in M. d̂_ij is set equal to the average of the distances in the block to which it belongs” (Lingoes and Roskam, 1971, p. 26). Starting with the distances corresponding to the lower ordered dissimilarities, distances are merged into blocks until each block average is larger than the preceding block average and smaller than the succeeding block average. When the process starts, each distance is regarded as a separate block. Whenever a block average is smaller than the preceding block average (not “down-satisfied” in Kruskal's terminology) or larger than the succeeding block average (not "up-satisfied”), a merging of the corresponding blocks takes place. This process will minimize stress for a given set of distances, and the set D̂ is weakly monotonic with the data. Briefly, this may be described as a “block partition” definition of ∆.

In the Guttmann-Lingoes approach (embodied in the program SSA-1) ∆ is denoted by the symbol D* and is known as Guttmann’s rank images. D* is defined as a permutation of the distances such that D* shall maintain the rank order of the dissimilarities. When the dissimilarities are sorted from low to high, “the rank images are simply obtained by sorting the distances from low to high and placing them in the cells corresponding to the ranked cells of M” (op. cit. p. 47). The way D* is constructed automatically implies that the rank images must satisfy strong monotonicity, unlike Kruskal's d̂ values. The computation of rank images and block partition values is illustrated in Table 1. The example is taken from an analysis of an order 4 matrix; the analysis of such matrices is discussed in detail in Section 4.32.

Table 1. Computation of rank images, D*, and block partition values, D̂

M - data             1.0      2.0      3.0      4.0      5.0      6.0
D - distances        .523     .689     .400     1.089    1.612    1.212
1. blocking*         (.523)   (.689)   (.400)   (1.089)  (1.612)  (1.212)
2. blocking          (.523)   (.544    .544)    (1.089)  (1.612)  (1.212)
3. blocking          (.523)   (.544    .544)    (1.089)  (1.412   1.412)
D̂ - final blocking   .523     .544     .544     1.089    1.412    1.412
D* - rank images     .400     .523     .689     1.089    1.212    1.612

*Parentheses indicate blocks.

It is evident that the distribution of D* is identical to the distribution of D (since D* is a permutation of D) and thus all moments of the distribution are equal. The way D̂ is constructed implies that the d̂ values have the same first moment as the d values, that is Σd̂_ij = Σd_ij, whereas the higher moments will generally be different. Only in the perfect case, when there is perfect monotonicity and thus d̂_ij = d_ij for every pair (and S = 0), will the distributions be identical in Kruskal’s algorithm.
In this case each distinct d_ij will have a corresponding distinct d̂_ij which will be a separate block. For large stress values there will be large blocks and thus a high degree of ties will exist in the d̂ values. It is exactly this property which is described by the concept “weak monotonicity”, which contrasts with the strong monotonicity of the rank images. In the Guttmann-Lingoes program two related formulas are used to evaluate goodness of fit:

\Phi = \sum (d_{ij} - d^{*}_{ij})^{2} / 2 \sum d_{ij}^{2}

K^{*} = \left[ 1 - (1 - \Phi)^{2} \right]^{1/2}

From the apparent similarity between the formulas for S and Φ it has been an unfortunate practice to think that Φ and S bear a simple relation to each other. Young and Appelbaum (1968, p. 9) write: "It can be seen immediately that 2Φ = S²", and Lingoes (1966, p. 13) in discussing an application of SSA-1: "since Kruskal's normalized stress S = (2Φ)^{1/2}". The first textbook treatment of nonmetric scaling to appear states: "Φ is closely related to Kruskal's S, the relation being Φ = ½·S², except that d̂_ij is estimated somewhat differently" (Coombs et al., 1970, p. 71). The implied caveat in the last quotation should not be neglected. Indeed, the different definitions of d̂ and d* preclude any simple relation between S and Φ. In order to illustrate this one could study the difference between S* = (2Φ)^{1/2} and Kruskal’s S. Instead we choose to compare K* with S, since “in practice S* is almost identical to K* in numerical value" (Roskam, 1969, p. 15). If for instance S* = .2191 then K* = .2178. From the definitions of K* and S* it is readily seen that K*/S* = (1 − Φ/2)^{1/2}. Lingoes (1967) has proposed K* (and Φ) as a general index of pattern similarity, applicable also when methods other than SSA have been used. Conversely Kruskal’s S could also be applied to evaluate the outcome of SSA-1. K* and S may thus be seen as alternative ways of evaluating the outcome of a specific analysis, independently of the algorithm employed. In the example in Table 1, K* = .2667, whereas for the same solution (same distances) S = .1406. This reflects a general tendency: for the same solution K* will be substantially larger than S (except of course when there is perfect fit). The smaller value of S is due to the fact that a separate minimization process is involved when the d̂ values (which enter the stress formula) are computed; in contrast, no minimization is involved when computing the rank images which enter the formula for K*.

Having now discussed how ∆ may be computed and the corresponding evaluation of Loss, we turn to the iterative process for finding D. A special problem is how this process is to be started. This is known as the "initial configuration" problem. The basic distinction to be made concerning this problem is whether the initial configuration is arbitrary or non-arbitrary. In the former case an initial configuration is defined which has no specific relation to the dissimilarities whatever. One widely used such configuration is described by Kruskal (1964b, p. 33). An arbitrary initial configuration can also be generated by some random device. Non-arbitrary initial configurations in some way utilize the information in the dissimilarities. Ways of constructing non-arbitrary initial configurations will be discussed in Section 4.32. The problem of the initial configuration is usually discussed in terms of "local minima". The latter concept is clarified by an overall abstract view of the iterative approach.
Consider first ∆ as fixed; Loss is then a function of nt variables (the coordinates in G). This space of nt dimensions is called "configuration space" by Kruskal (1964b, p. 30), in contrast to the more usual "model space" of t dimensions. In configuration space each point represents an entire configuration. This space will generally have several minima, the over-all minimum being the global minimum. The iterative procedure searches for a minimum, but has no way of knowing whether a given minimum is a local minimum or the desired global minimum. Consequently the procedure may be trapped at an undesirable local minimum. There is a widespread feeling that with arbitrary initial configurations the process will generally start "far away" from the desired global minimum, and that this will lead to an uncomfortably high probability of being trapped at suboptimal local minima. Whether this really is the case, and whether it is to be recommended to start with non-arbitrary initial configurations, will be discussed in Section 4.31.

Once the iterative process - in one way or another - is started, there finally remains the problem of how the configuration - and consequently the distances - is to be changed from one iteration to the next. Young and Appelbaum (1968) distinguish between two approaches. One is to start with a definition of an iterative formula and then define "best fit" without specifying any particular relation between the two. This is said to characterize the Guttmann-Lingoes approach (SSA-1) and also the Young-Torgerson approach (TORSCA). “On the other hand, Kruskal defines his notion of best fit, and then, on the basis of the definition derived the best possible iterative formulation” (op. cit. p. 9). We might add that Kruskal used the standard “negative gradient method", also called the "method of steepest descent". This involves finding the derivative of the stress function and moving "downwards" in the direction of the gradient. Young and Appelbaum thus distinguish between what may be called an extrinsic relation of fit and iteration (SSA-1, TORSCA) vs. an intrinsic one (MDSCAL). This distinction should not, however, be taken to imply too much, since Young and Appelbaum show that the iterative formulas for SSA-1, MDSCAL and TORSCA are of the same class and can all be stated:

g_{ik}^{\,v+1} = g_{ik}^{\,v} + \alpha \sum_{j=1}^{n} \frac{d_{ij} - \delta_{ij}}{d_{ij}} \left( g_{jk}^{\,v} - g_{ik}^{\,v} \right)

where v is the iteration number (cfr. op. cit. equations 5, 18 and 19). Notice that when d_ij is larger than δ_ij the points i and j are too far apart and tend to be pulled together, and vice versa, so as to make d_ij and δ_ij more similar. The point i is of course subject to similar influences from all the other points in the configuration, such that all these separate influences are summed. Each pair of points is thus subjected to some amount of "stress"; in fact a physical analogue of how all the separate stress components jointly act to change a given configuration is found in Kruskal and Hart (1966). The iterative formula above applies to all three methods only in the case of Euclidean space, and consequently the distinction between an extrinsic and an intrinsic relation between fit and iteration is of no practical importance in this case. For non-Euclidean space, however, the above iterative formula does not apply to Kruskal's MDSCAL. MDSCAL may then well turn out to be superior to the other methods for non-Euclidean spaces if an intrinsic relation is the best approach.
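The main steps just described - fitting ∆ by monotone regression, evaluating stress, and moving the configuration - may be illustrated by a much simplified sketch (Python). This is only a schematic illustration of the kind of computation involved, not any of the actual programs; the fixed step size, the absence of ties and of zero distances, and the restriction to the Euclidean case are all simplifying assumptions.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def monotone_regression(d, order):
        """Block partition ("up-and-down blocks") fit: values d-hat that are
        weakly monotone with the data order and minimize stress for given d."""
        vals = list(d[order])                        # distances in the data order
        blocks = [[v] for v in vals]
        i = 0
        while i < len(blocks) - 1:
            if np.mean(blocks[i]) > np.mean(blocks[i + 1]):   # violation: merge blocks
                blocks[i] += blocks.pop(i + 1)
                if i > 0:
                    i -= 1                           # re-check against the preceding block
            else:
                i += 1
        fitted = np.concatenate([[np.mean(b)] * len(b) for b in blocks])
        d_hat = np.empty_like(fitted)
        d_hat[order] = fitted                        # back to the original pair ordering
        return d_hat

    def one_iteration(G, M, alpha=0.2):
        """One gradient-style step in the Euclidean case."""
        d = pdist(G)                                 # d_ij from the trial configuration G
        order = np.argsort(M)                        # rank order of the dissimilarities
        d_hat = monotone_regression(d, order)
        stress = np.sqrt(np.sum((d - d_hat) ** 2) / np.sum(d ** 2))
        ratio = squareform((d - d_hat) / d)          # (d_ij - delta_ij)/d_ij as a symmetric matrix
        # g_ik(v+1) = g_ik(v) + alpha * sum_j [(d_ij - delta_ij)/d_ij] (g_jk - g_ik)
        G_new = G + alpha * (ratio @ G - ratio.sum(axis=1, keepdims=True) * G)
        return G_new, stress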
The coefficient α is called a step-size coefficient. All the programs employ different strategies for computing this. This is extensively discussed by Lingoes and Roskam (1971), and here it is sufficient to note that comparison between algorithms in the non-Euclidean case is made more difficult by the fact that the step-size coefficient may be a confounding factor. The major features discussed here are summarized in Table 2. In this table the most salient similarities and differences between the three major current nonmetric programs are stated.

Table 2. A survey of major features of nonmetric algorithms

                                              SSA-1           MDSCAL                   TORSCA
Monotonicity                                  Strong          Weak                     Weak
Approach to tied dissimilarities              Primary         Optional (primary) 1)    Secondary
Definition of ∆ (values monotone
with dissimilarities)                         Rank images     Block partition          Block partition
Initial configuration                         Non-arbitrary   Optional (arbitrary) 1)  Non-arbitrary
Relation between iterative formula
and fit function 2)                           Extrinsic       Intrinsic                Extrinsic

1) The most frequently used options are stated in parentheses.
2) If important, then only for non-Euclidean distances.

This analysis of various programs in terms of distinctive features encourages a more modular approach. Several combinations of features not represented by the current programs may then suggest themselves. This is extensively discussed by Lingoes and Roskam; indeed, the whole aim of their study may be said to be to break current programs up into separate components and then to recombine the components in an optimal way. They discuss a variety of features not even hinted at in the present simplified presentation. A variety of mixed strategies is also discussed, that is, shifting between various combinations of features. Their strategy when analyzing various matrices may be said to treat features as "independent variables". These are then evaluated in terms of the "dependent variables" S (or K*). Our primary concern is different. We wish to replace stress by true fit. Stress is then treated as a predictor variable, true fit being the variable to be predicted. From the point of view of the metamodel stress is only one of several possible indices for the concept apparent fit. An obvious alternative index for AF is K* (even if SSA-1 is not used). Yet another alternative would be to use a rank correlation between M and G(V) as an index of AF. This index was provided as collateral information on goodness of fit in earlier versions of SSA-1, cfr. Guttmann (1968, p. 478). Since, however, rank correlation neglects the metric properties of G(V), a better alternative might be to compute the linear correlation between G(V) and D*. Guttmann (1968, p. 481) points out that the iterations tend to make the relation between G(V) and D* linear and to minimize the alienation from the regression line through the origin in the diagram plotting G(V) against D*. Preliminary explorations have revealed that all these indices are highly interrelated and thus serve equally well to predict true fit. This does not rule out the possibility that special cases might exist where the specific index chosen for AF might make a difference, but at present it seems a very good first approximation to regard AF as a unitary concept and use stress as the basic index for this concept.

3.2 Methods of introducing error and indices of noise level (NL)

Noise or stochastic variations from various sources are ubiquitous in psychological research.
How, then, are noise processes specified in simulation studies? Do these specifications correspond to psychological processes? In the present context a more important question is to what extent it really makes a difference for the applicability of results from simulation studies, whether there is such a correspondence or not. Not only the type of noise must be specified, it is also necessary to specify different amounts of noise, and finally there is the question as to what kind of index is to be used to describe a given amount of noise. In terms of the metamodel this section discusses the relation between L and M. In what way is M specified as a distortion of L and how shall the discrepancy between L and M be indexed? We shall see that there are many ways of introducing error and for each of the different ways there are several possibilities of numerically expressing the resulting noise level. First different types of noise are discussed, then amounts and indices. A discussion of some different ways of introducing errors is found in Young (1970). First we describe the method preferred by Young, this method will be extensively applied in Ch. 4. The procedure is illustrated in Fig. 1. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 31 Fig. 1 .Illustration of error process used by Young (1970). The pair i, j with coordinates (li1, li2); (lj1, lj2) represents two points in the true configuration, L (C). The line labelled 1ij represents the true distance between these points. Normal random deviates are added to each coordinate for each point. This gives the error- perturbed positions i' and j'. The distance between these positions, mij, may then be regarded as an element in the data vector M. It is important to note that different random errors are added to a point i for each distance where i is involved, cfr. the subscripts to ε in Fig. 1. In Young's approach the variance of the random normal deviates, εijk, depends upon the whole configuration, L (C), and is independent of points and dimensions. The more general case, where the variance of ε maybe different for different points and/or dimensions, will be briefly discussed in the Concluding remarks. Following Ramsay (1969), Young points out that the error process used by him leads to a non-central chi-square distribution of the dissimilarities, where the parameter of non-centrality is related to the true distance between two points, cfr. Ramsay (1969, equation (1), p.171). Using Ramsay’s terminology, this distribution will simply be referred to as the "distance distribution". The error process corresponds to a multidimensional extension of Thurstone case V and will further be referred to as the RamsayYoung process. This process implies that all distances will more likely be over-evaluated than under-evaluated. "In the limiting case of zero distance it is certain that the estimate will not be an under-estimate. On the other hand, as the true distance becomes larger the probability of an over-estimate approaches the probability of an under-estimate". (Young, 1970, p. 461). If we translate this into classical test theory we see that a basic assumption of this theory is violated. Dissimilarity corresponds to observed score and distance to true score, the difference to error score. 
That small distances will generally give (relatively) large positive errors (over-evaluation) and larger distances smaller positive errors imply that true and error scores are negatively correlated contrary to all psychometric test theory. Results to illustrate this will be presented in Table 3 and further discussed in Comment 1), p. 40. On the basis of Ramsay's results Young points out that the error process is equivalent to first completing true distances then "add error to the distances, where the error has a non-central chisquare distribution”. (op.cit. p. 461). He then discusses why one should not use the normal distribution to add error to the true distances (as Kruskal (1964a) did in an example to be discussed in the next section). Young states (op. cit. p. 461): It seems hardly true that a "real subject" would be as likely to underestimate a small distance as a large distance", and he further argues: “The non-central chi-square distribution seems to reflect the psychological aspects of the judgment processes more accurately than the normal distribution (op.cit. p. 462.) Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 32 The reason given for this is that the distance distribution has a parameter corresponding to the number of dimensions of the stimuli, while there is no corresponding parameter for the normal distribution. Ramsay (1969, p. 170) goes one step further than Young and quotes data supporting this error process: "The distance distribution predicts the sort of nonlinearity which has been observed in varying degrees in some scaling studies. This appears in a positively accelerated relationship between dissimilarities and corresponding interpoint distance... This phenomenon was especially evident in the study by Indow and Kanazawa (1960)”. We may comment that Indow and Kanazawa used a metric model, but essentially the same relation appeared when Kruskal (1964a, p. 19) reanalyzed these data with MDSCAL. Ramsay (1969, p. 170) concludes that "this is one of the most important discrepancies between the distance and normal distributions. The normal distribution predicts a linear relationship between estimator and interpoint distance”. While this non-linearity may be of theoretical interest we shall later in this section see that for some purposes the consequences of the non-linearity are negligible. In a more recent investigation Wagenaar and Padmos (1971) do use the normal distribution, albeit in a multiplicative rather than additive way. Each distance was multiplied by a separate random element with mean 1 and standard deviation x. For this procedure "one should note that the standard deviation of the actual error is proportional to the distance” (op.cit. p. 102). On the other hand, in the study by Indow and Kanazawa (1960) the relation between “scaled sense distance” (dissimilarity) and "SD of scaled distance" is strongly non-monotone, an inverted U curve (op.cit. p. 331). Since furthermore the relation between dissimilarity and distance is practically linear in the region where SD decreases as a function of “scaled sense distance”, these data seem to run strongly counter to the Wagenaar-Padmos model. Unfortunately the results from Indow and Kanazawa do not quite rule out the validity of the WagenaarPadmos procedure. The judgment process might have been of a “two stage” nature. 
First there might have been an error process analogous to the Wagenaar-Padmos type, then a monotone transformation could have produced the non-linearities found by Indow and Kanazawa. It will be useful to attempt to give an exhaustive list of the specifications to be made for a complete description of an error process. This will serve to clarify the differences between the Ramsay-Young process and the Wagenaar-Padmos procedure, and also to point out other possibilities to explore. Nine types of decisions may be distinguished. It will, however, turn out that some of the decisions are only relevant for specific choices on other decisions. 1. Where error is introduced. a) Error introduced to coordinates (Ramsay-Young). b) Error introduced to distances (Wagenaar-Padmos). This may perhaps be thought to be a spurious distinction since, as we discussed, adding normal deviates to coordinates in the Ramsay-Young process is mathematically equivalent to adding a value from non-central chi-square distribution to each distance. This does not, however, imply general equivalence between 1a) and 1b) since there are cases where an error process is specified in complete detail for e.g. la), and there is no defined equivalence in 1b) (and vice versa). Generally 1a) vs. 1b) may correspond to different psychological. Processes. “The variation in perceived difference or similarity either must arise from variation in perception of the stimuli in psychological space [corresponding to 1a)] or it must arise directly from difference perception [corresponding to 1b)]". (Ramsay, 1969, p. 181). 2. Constancy of coordinate errors (only relevant for 1a)). a) Error coordinates for point i differ from point j to point k. b) Error coordinates for point i the same for all other points. 2b) will simply give a new configuration and is dismissed by Young (1970, p. 460): "This was not done because there exists a perfect solution for the set of error-perturbed distances, the error-perturbed configuration”. In the concluding remarks we will, however, find a place for this option when considering further extensions of the metamodel. 2a) is illustrated in Fig. 1. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 33 3. Addition of "extra dimensions” (only relevant for 1a)). a) No additional dimensions added. b) Add one or more additional "error" dimensions. 3b) is also dismissed by Young since "it would have produced distances which could be precisely recovered in a space whose dimensionality was equal to the number of true plus error dimensions”. (op.cit. p. 460). Lingoes and Roskam (1971), however, use precisely this method in the study of “metricity" (true fit in the present terminology) included in their work. As will be discussed in the Concluding remarks their enthusiasm for such studies is highly limited and it is then perhaps not surprising that they have no discussion to justify their choice of method. Though the studies to be discussed in Ch. 4 similar to Young's are based on 3a), we will find a place for 3b) in the more general forms of the metamodel in the Concluding remarks. 4. Arithmetic operation to introduce noise. a) Additive. b) Multiplicative. Young does not consider the multiplicative case, this is, however, used by Wagenaar and Padmos. As already pointed out, their method will give increased error variance for larger distances. This property makes it doubtful whether the multiplicative case can be generally meaningful if introduced to coordinates. 
A necessary condition would then be that the absolute size of the coordinates was meaningful, and thus that there was a natural zero point for the configuration. In multidimensional scaling it is practically always assumed that the zero point is arbitrary, usually it is conventionally placed at the centroid of points. 5. Type of distribution of error. a) Normal distribution. b) Other types. No alternative to using the normal distribution as basic seems to have occurred in the literature on multidimensional scaling. This, however, is not the case for onedimensional scaling, where for instance the Bradley-Terry-Luce model (discussed by Coombs, 1964) is a prominent alternative to the Thurstone model. In the future it may perhaps be profitable to explore alternatives to the normal distribution. As is the case for one-dimensional models theorizing may be given a different direction by exploring alternatives to the often poorly specified assumptions underlying use of the normal distribution. The next three decisions all concern homogeneity (or heterogeneity) of error variance. 6. Error variance for dimensions. a) Error variance assumed the same for all dimensions. b) Error variance assumed to differ according to dimensions. The only argument for 6a) is of course its greater simplicity. As Ramsay (1969, p. 181) points out, it “is rather like assuming that absolute sensitivity is the same for all relevant properties of the stimuli ...this assumption is likely to be false in some situations". 7. Error variance for points. a) Error variance assumed the same for all points. b) Error variance assumed to be different for different points. As for 6. there are likely to be situations where the simplest case is likely to be false. Especially when the objects are very complex, as for instance persons or words, some objects are likely to be intrinsically more difficult to judge and this will probably be reflected in larger error variance for these objects. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 8. 34 Error variance for distances (only relevant for 1b)) a) Error variance assumed independent of size of distance. b) Error variance assumed to depend upon size of distance. In most cases the outcome of 8. cannot be decided independently of the other decisions but will simply be a consequence of other decisions made, for instance 1b), 4b) and 5a) as used by Wagenaar and Padmos imply 8b). If one, however, should choose to disregard Young and Ramsay’s misgivings concerning adding normal deviates directly to the distances then 4a) and 8b) might be an alternative to the Wagenaar-Padmos procedure. 9. Monotone transformation added. a) No specific monotone transformation for producing dissimilarities. b) Monotone transformation added before arriving at the dissimilarities to be analyzed. Young (1970) chose 9b), that is, he used a two-stage process in producing the dissimilarities to be analyzed. After the Ramsay-Young process he chose an arbitrary monotone transformation (squaring and adding a constant). It may at first seem surprising that this step was included. Previously we have stressed that only the ordinal information in the dissimilarities is relevant, which should imply that the output should be independent of any final monotone transformation. Consequently 9b) should be a completely redundant step in investigating true fit. 
There is, however, a peculiarity in the TORSCA algorithm so that: "it is possible that various monotonic transformations may result in different final configurations with differing values of stress" (op.cit. p. 464). To circumvent the problem of local minima the TORSCA algorithm starts by constructing an initial configuration from the “raw” dissimilarities and this is the reason why 9b) might make a difference in the output configuration. In Section 4.32 we shall see that the problem of local minima in most cases can be satisfactorily solved without using more than ordinal properties of the dissimilarities. This will make it unnecessary to worry about possible effects of "various monotonic transformations”. We now turn to a brief discussion of the problems involved in testing whether a specific set of noise specifications corresponds to psychological processes. Two kinds of implications of noise specifications have already been mentioned, there is first the relation between distances and dissimilarities (a plot of this relation is usually called a Shepard diagram). A second kind of implication is how the variance of dissimilarities depends upon the size of the underlying distances. If now implications of a given set of noise specifications are confirmed, our confidence in this set may increase (as Ramsay made use of the Indow-Kanazawa study). If, however, the implications are not confirmed, the situation is quite problematic. To see this it is necessary to have in mind the distinction between two types of consequences of noise processes for the relation between L(V) and M. First the dissimilarities in M may be shuffled relative to the true distances in L(V). This will for instance be reflected in decreased rank correlation between L (V) and M. Such consequences we here call ordinal rearrangement. Second, metric relations may be different in M than in L(V), metric rearrangement. A given ordinal rearrangement will correspond to a large set of different metric rearrangements. Any given ordinal rearrangement is not changed by adding a monotone transformation, as in 9b), whereas the metric rearrangement is then changed. So far only metric consequences of noise processes seem to have been derived, and these consequences have then been tested by metric assumptions of the dissimilarities. If these metric consequences are not satisfied, this does not, however, rule out the possibility that the ordinal consequences of the noise processes may be satisfied. It may be the case that the psychological error process corresponds to the specifications made, but that there are additional monotone transformations in the human information processing. If it is merely the metric consequences of a given noise process which are violated, this is no argument against using simulation studies based on the given noise process, since only ordinal properties are used in nonmetric analysis. There does not seem to have been any attempts to specify exclusively ordinal properties of specific noise processes. It is then possible that different noise processes may give similar ordinal consequences. On the other hand it seems likely that for instance the Wagenaar-Padmos procedure and the YoungRamsay process may have different ordinal properties. The former procedure will be likely to produce Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 
35 high degrees of shuffling of larger distances and relatively low degrees of shuffling of small distances, whereas the Young-Ramsay process is more likely to produce equal amounts of shuffling for small and large distances. A complicating consequence in deriving ordinal consequences of noise processes is that such consequences generally will depend upon the configuration and the distribution of distances which the configuration implies. In regions where distances are well spaced there will generally be less shuffling than in regions where distances are more tightly clustered. Deriving consequences of various noise processes - especially ordinal consequences - and devising appropriate experimental tests is a seriously neglected field. The question of whether a given set of noise specifications corresponds to psychological processes has two different aspects. First, relevant empirical studies obviously have consequences for psychological theory. There has for instance been some interest in specifically isolating a monotone transformation by for instance studying to what extent such a transformation can be recovered. Several examples are given by Shepard, (1962b, Figs.4, 5 and 6). In direct judgments of dissimilarity where the subjects are asked for numerical estimates such a monotone transformation would correspond to a kind of "psychophysical function". Nonmetric multidimensional scaling may provide a new approach to the moot question of the form of such functions. It should, however, be emphasized that merely asking about the form of monotone transformations in the information processing is equivalent to focussing on 9b) and neglecting all other components of noise specifications. Indeed, in the present framework, we will not regard this as inducing noise at all since there will be a perfect monotone relation between distances and dissimilarities. In this case we regard L and M as coinciding. In the typical case of discrepancy between L and M it may be quite difficult to disentangle consequences of noise processes as specified by 1-8 above and any consequence of a specific monotone (non-linear) transformation in the information processing. Recall for instance that a nonlinear relation may arise, not because of any specific monotone transformation, but simply as a byproduct of the Ramsay-Young process. However; interesting empirical testing of noise processes may be in itself, such studies do not necessarily have any relevance for the main concern in the present work. What is of importance in the present context is whether the relation between stress and true fit depends upon how error processes are specified. Hopefully there will be no pronounced interactions. If this is the case, the user of the results to be presented in Ch. 4 does not need to worry about the kinds of error processes producing his data. If, however, there are serious interactions, empirical testing of noise processes will be highly relevant. In this case procedures for identifying the type of error process must be worked out and for each type of error process a separate procedure for estimating true fit from apparent fit must then be worked out. Alternatively, the more pronounced the interactions may turn out to be, the weaker must be the form of any general conclusions. Obviously anything even resembling a comprehensive exploration of various types of error processes in simulation studies will be prohibitive. 
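For the reader who nevertheless wishes to explore the consequences of particular processes, a minimal sketch of the two procedures most often referred to above is given below, in Python with NumPy. The sketch follows the decisions 1a), 2a), 3a), 4a), 5a) for the Ramsay-Young process and 1b), 4b), 5a) for the Wagenaar-Padmos procedure, and standardizes the error proportion E by the standard deviation of the true distances, as Young does; the function names are of course only illustrative.

    import numpy as np
    from itertools import combinations

    def true_distances(L):
        """Vector of interpoint distances, L(V), from a configuration L(C)."""
        return np.array([np.linalg.norm(L[i] - L[j])
                         for i, j in combinations(range(len(L)), 2)])

    def ramsay_young(L, E, rng):
        """Independent normal deviates added to the coordinates of each point,
        separately for every pair (cfr. Fig. 1); sigma_e = E * sigma_V."""
        sigma_e = E * true_distances(L).std()
        m = []
        for i, j in combinations(range(len(L)), 2):
            pi = L[i] + rng.normal(0.0, sigma_e, size=L.shape[1])
            pj = L[j] + rng.normal(0.0, sigma_e, size=L.shape[1])
            m.append(np.linalg.norm(pi - pj))
        return np.array(m)

    def wagenaar_padmos(L, x, rng):
        """Each true distance multiplied by a normal random factor with mean 1
        and standard deviation x (the "fractional random error")."""
        lv = true_distances(L)
        return lv * rng.normal(1.0, x, size=lv.size)

Comparing, for instance, rank correlations between L(V) and the two kinds of M vectors within subsets of small and large distances is one way of studying the ordinal consequences discussed above.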
Furthermore, at present purely theoretical work seems more important than simulation studies concerning effects of various error processes. At present we will mainly use the Ramsay-Young process; in Section 4.7 a comparison with the Wagenaar-Padmos procedure is included. In the final chapter implications of more general error processes are briefly discussed. We now turn to the problem of how various amounts of noise, and thus sizes of NL, are produced. For both the Ramsay-Young process and the Wagenaar-Padmos procedure amounts of noise are defined as a proportion. We first discuss the Ramsay-Young process, where the discrepancy between L and M clearly depends upon the variance, σ_e², of the error terms (ε in Fig. 1) added to the coordinates, relative to the spread of the points in the true configuration. Young (1970, p. 465) defines noise level, E, as "the proportion of error introduced":

(1) E = σ_e / σ_V

where σ_V "denotes the standard deviation of the true distances and serves as a standardization" (op.cit. p. 462). This is by no means a natural choice of standardization term. An obvious alternative would have been to use the standard deviation of the configuration, σ_C,8 in the denominator of the definition of E. The latter alternative is discussed by Young, but no convincing rationale is given for the choice he made. He does for instance point out that σ_V simultaneously will tend to increase with dimensionality and decrease with mean true distance and thus "unless the mean distance is changing in a way to compensate, we might expect that the effective proportion of error variance is confounded with dimensionality" (op.cit. p. 463). Table 3 and comment 2) on p. 40 will illustrate that such compensation does take place.9 It is, however, important to note that "noise level" for a given proportion, E0, is not the same whether σ_C or σ_V is used as standardization. It turns out that for a given configuration σ_V is substantially smaller than σ_C. This implies that for a given configuration the error variance for E0 will be smaller when σ_V is used than when σ_C is used. Results to substantiate this are presented in Table 3 (cfr. comment 3) on p. 41). Unless one is very careful in specifying precisely the way E is defined, one may risk not getting comparable values if results are reported in terms of E. Turning to the Wagenaar-Padmos procedure, there is no problem of specifying a standardization term since a multiplicative model is used. While their procedure, as pointed out, is similar to the Ramsay-Young process in defining noise level in terms of a proportion (or in their terminology a "fractional random error", Wagenaar and Padmos, 1971, p. 102), it should be stressed that their fractional error is not comparable to E as defined by Young, cfr. Tables 3 and 4 and comment 5) on p. 41. We can now see that there are several disadvantages with the approach of defining amounts of noise exclusively in terms of some proportion. The preceding discussion implies that different ways of specifying the proportion do not give comparable results; furthermore there is the problem that for any given specification amounts of noise may depend upon irrelevant parameters, as for instance dimensionality and the Minkowski constant. Evidently some index different from a proportion must be applied to substantiate these statements.
The main such index used in the present work is simply the correlation between L(V) and M, r(L, M). If now for some specification of error proportion r(L, M) shows that noise depends upon some irrelevant parameter, this is not an unproblematic statement, since it assumes that in some way r(L, M) is more "basic". Without making such an assumption one could alternatively say that it is r(L, M) which depends upon the irrelevant parameter, and that the amount of noise by definition is the same when the error proportion is the same. We regard r(L, M) as the basic index of noise level, NL, since it will turn out to be fairly simple to relate this to other concepts - specifically the key concept purification - which does not seem to be the case for any definition of an error proportion. Furthermore there does not seem to be any satisfactory way of estimating error proportion from retest data. Wagenaar and Padmos (1971, p. 105) refer to their proportion as "measurement error" and imply that it can be estimated: "if the measurement error is known beforehand for instance on the basis of repeated measurements", but no procedure for "knowing measurement error" is described.

8 Young uses σ_c where we use σ_V, and σ_t where we use σ_C. The present use of subscripts is consistent with the terminology introduced in Section 2.1, where C was introduced to denote the configuration mode (coordinates) and V was introduced to denote the vector (distance) mode.

9 While this makes a given level of E comparable across dimensionalities, E is not comparable across different Minkowski values, p. This is because σ_V decreases as p increases. Increase in p tends to produce more homogeneous distances, since then more and more only one of the coordinates "counts" in producing any given distance; the variance among coordinates in different dimensions loses in importance. Conversely, the variance of distances from a given configuration will be maximal for the city-block model, p = 1, since the coordinates in all dimensions then contribute maximally to the distances. Since the simulation studies in Ch. 4 will be limited to the Euclidean case, we will, however, not give further details on this problem here.

On the other hand we shall see that it is quite simple to estimate r(L, M) from retest ("parallel form") reliability. Conceptually this corresponds to the extended form of the metamodel, cfr. Section 2.2, where we considered several M vectors generated from a single L. In the discussion of the Ramsay-Young process we showed that it leads to correlations between true and error scores, and thus error scores for parallel forms will also tend to be correlated. At present, however, we choose to neglect this and use simple correlational theory. Assuming then that what M1 and M2 have in common is completely described by L, the partial correlation between M1 and M2 holding L constant will be 0, which is expressed below:

r(M1, M2) = r(L, M1) \cdot r(L, M2)

If we now assume that r(L, M1) is approximately equal to r(L, M2), or alternatively that we are interested in the geometric mean of these two noise levels, r(L, M), we then have:

(2) r(L, M) = \sqrt{r(L, M1) \cdot r(L, M2)} = \sqrt{r(M1, M2)}

A very important advantage of this line of reasoning is that exactly the same argument can be made concerning the relations between true fit and the two output configurations G1 and G2.
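A minimal sketch of how equation (2) - and, analogously, equation (3) below - might be applied to retest data is given here; it is only an illustration of the algebra, not a recommended estimation routine, and the simple guard against a negative sample correlation is an assumption of the sketch.

    import numpy as np

    def estimated_noise_level(m1, m2):
        """Equation (2): with two 'parallel' data vectors M1, M2 generated from
        the same latent L, the geometric-mean noise level r(L, M) is estimated
        as the square root of the retest correlation r(M1, M2)."""
        r_m1m2 = np.corrcoef(m1, m2)[0, 1]
        return np.sqrt(max(r_m1m2, 0.0))   # guard against a negative sample correlation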
The use of the correlation r(L, G) as an index for true fit will be discussed in detail in the next section; here we just state the corresponding equation:

(3) r(L, G) = \sqrt{r(L, G1) \cdot r(L, G2)} = \sqrt{r(G1, G2)}

Notice that the terms of equations (2) and (3) give precise meaning to the two left-pointing arrows in Fig. 4, Ch. 2. If we should compare the present concepts to conventional psychometric theory, this might be done as in the outline below:

              Validity      Reliability
  Data        r(L, M)       r(M1, M2)
  Output      r(L, G)       r(G1, G2)

Recalling a quotation from Shepard in Ch. 1 where he pointed out that the output can be both more reliable and valid than the data from which it was derived (p. 10), we now see that the thesis of equivalence between theoretical and empirical purification stated in Section 2.2 can give precise meaning to Shepard's statement. r(G1, G2) > r(M1, M2) and r(L, G) > r(L, M) imply precisely more reliable and valid output than data. Results to substantiate the general validity of equations (2) and (3) and further implications of these equations will be presented in Section 4.7. A very important objection to the use of correlation as the basic index for NL must now be discussed. A usual correlation coefficient assumes interval scale properties of both variables. There can be no objection to treating L(V) as an interval scale; concerning M, however, we have repeatedly stressed that an essential feature of nonmetric models is that the output is independent of whatever monotone transformation of the dissimilarities we may consider. It is well known that such freedom will completely play havoc with the product moment correlation. Have we then forfeited our claim to staying within the nonmetric framework by introducing r(L, M)? First we note that, provided the relation between L and M is essentially linear, it is defensible to use correlation. As a matter of fact the studies in Section 4.6 use r(L, M). This can be defended because the non-linearity for the Ramsay-Young process turns out to have little, if any, consequence in the present context. For one thing, the non-linearity just turns up at the ends of the Shepard diagrams; in most of the region linearity is excellent, cfr. the previously referred to Fig. 16 in Kruskal (1964a, p. 19). In order to study this non-linearity more precisely, M vectors from 8 configurations with 20 points (representing 4 different noise levels) in each of 3 dimensionalities (1, 2 and 3) (altogether 24 configurations) were generated, and plots were made of the relations between L and M. (The Ramsay-Young process was used.) Inspecting these 24 plots it was hardly possible to detect any non-linearity, though the effects in Kruskal's figure were present, although somewhat less pronounced. This may justify using r(L, M) for the Ramsay-Young process, but it does not justify general applicability of the results based on using r(L, M). (The reader might ask why, if linearity has to be assumed, one should use nonmetric models at all; this will, however, be answered in Section 4.33.) In practice the safest assumption will be that there is generally not a linear relation between L and M, and this will also make the use of retest correlation to estimate r(L, M), as in equation (2), highly dubious. Even if the Ramsay-Young process may have some validity, there is always the possibility that there will be additional monotone transformations in the information processing.
A solution would be to find some transformation of M which would offset the distortions induced by using correlation if there is non-linearity. A simple solution which was attempted was to use rank correlation. This solution has the attractive feature that in some data collection methods the data are given in the form of ranks, an example is the ranking procedure used by Shepard and Chipman (1970). In some preliminary investigations it did, however, turn out that using rank correlations gave very crude fit for equations (2) and (3). A much better solution is to use the principle of monotonicity to find a transformation of M so that it will always be defensible to use linear correlation. It will be recalled that there are two different approaches to monotonicity, rank images, corresponding to strong monotonicity, and block partition values, corresponding to weak monotonicity. Of these two approaches rank images were found preferable, since this will not tie values in the transformed data when the raw data are untied. In the simulation approach there are two possibilities for rank image transformation of M, since both L or G may be used as a basis for transforming M. Instead of assigning l, 2, 3 etc. to the ranked values of M as in rank correlation, we substitute for the ranked values of M the values of L(V) sorted in ascending order. Thus the ordinal information in M is preserved but a new distribution is obtained which is identical to the distribution of L and this insures maximal linearity. G(V) may be used in the same way, the resulting transformations will be labelled M*L and M*G respectively. For the 24 configurations previously mentioned, M*L was computed and correlated both with L and with M. The lowest correlation between M and M*L was .997, differences between r(L, M) and r(L, M*L) showed up first in the fourth decimal place. This further testifies to the linearity observed in the plots. Since in practice, however, L is unknown there is no counterpart in usual empirical studies to using M*L. It does, however, turn out that M*G and M*L give essentially identical results, even for relatively high levels of noise. In Section 4.7 we report results comparing M*G and M to check the validity of equation (2), cfr. also Section 4.33. It may be noticed that M*G is just another symbol for D* if SSA-1 is used. By here using M*G as a symbol, it is emphasized that the rank image transformation can be used independently of the Guttmann-Lingoes algorithm. Another advantage of using M*G may be mentioned. If one is interested in the type of relation in a Shepard diagram, this relation tends to be obscured by the presence of errors. Instead of plotting data against distances one may plot data against M*G. When using the correct form of the model M*G will be linearly related to the true distances. The relation between M and M*G will reveal the type of monotone transformation producing M unencumbered by errors, since there is a perfect rank correlation between M and M*G. When we further report results based on r(L, M) and r((M1, M2) it should be understood that this is valid because the noise process being used does not produce significant non-linearities. However, when there is reason to doubt linearity, M*G should be used, this will then give the necessary linearization. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 
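The rank image transformation itself is simple enough that a small sketch may be useful; the following is only an illustration (ties in the raw data are here broken arbitrarily), with the target vector being L(V) for M*L or G(V) for M*G.

    import numpy as np

    def rank_image(m, target):
        """Rank image transformation of the data vector m.

        The ordinal information in m is kept, but its values are replaced by
        the values of the target vector sorted in ascending order, so that the
        transformed data have exactly the distribution of the target and a
        linear correlation with it becomes defensible."""
        out = np.empty(len(m), dtype=float)
        order = np.argsort(m)            # positions of m from smallest to largest
        out[order] = np.sort(target)     # smallest target value to smallest m, etc.
        return out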
Returning now to the relation between proportion of error and the resulting correlation, r(L, M), it may be useful to regard the proportion of error as the intended noise level, while r(L, M) may be regarded as the actually obtained noise level. This distinction is similar to the familiar statistical distinction between a universe parameter and a sample value. Using the same error proportion repeatedly on the same configuration will give slightly different values of r(L, M), since the random error process will start with a different value each time. This conceptualization may point to a way of deciding what the best way to standardize the error proportion is in the Ramsay-Young process. Should σ_C or σ_V be used as the denominator in the definition of E, cfr. equation (1)? One criterion of a specific method being the best is that the variance of the obtained r(L, M) values for a given value of E is smaller than for competing methods (since for a given value of E the intended variance is 0). A related alternative is that the best method should have minimal variance within a given level of E, while the variance between levels should be maximal. This will lead to a higher correlation between E and r(L, M) for the best method. Some preliminary results using this approach failed, however, to show any clear-cut difference between using σ_C or σ_V for standardization, and consequently Young's procedure (using σ_V) is used throughout in the Ramsay-Young process. The problem of describing the discrepancy between L and M may be regarded as the problem of finding an index for the relation between an interval scale (L) and an ordinal scale (M) variable. We have suggested using correlation either because in some cases M is linearly related to L, or else we have suggested transforming M to produce linearity. The stress coefficient can, however, be regarded as an alternative approach to describing the relation between an interval and an ordinal scale variable. Consequently, stress should be well suited as an alternative index for NL. There is one advantage of using stress as an index for NL, which is related to the problem of local minima. Recall that in terms of Kruskal's formulation of his algorithm, the true configuration, L, will be one point in the configuration space, and thus one of the possible solutions to M. The algorithm searches for the global minimum, which (if found) by definition will be G. When noise is introduced, L cannot be expected to be the global minimum. The stress of G will then be the lowest possible stress. Consequently the stress of L, NL-stress,10 cannot be less than the stress of G, AF-stress, but will generally be higher. If conversely AF-stress is larger than NL-stress, we know that the global minimum has not been found (at least one point with lower stress than G exists, namely L). If then AF-stress is less than NL-stress, this is an indication that the program has not been trapped in a local minimum (though of course it is by no means conclusive evidence). Another possible advantage of using stress is that stress can also be used as an indication of true fit. Just as stress can be used as an indication of the discrepancy (L(V), M), it can be used in exactly the same way to indicate the discrepancy (L(V), G(V)). This makes it possible to study purification not only by basing indices for NL and TF on correlation but also on stress. It may be of advantage to show that the general conclusions on purification are independent of just one type of index.
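A sketch of this diagnostic use of stress is given below. It uses the weak monotonicity (block partition) approach, here carried out with the isotonic regression routine from scikit-learn; the actual programs differ in many details, and the function name is only illustrative.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def kruskal_type_stress(data, distances):
        """Stress of a set of distances with respect to a data vector:
        the disparities are the weak-monotone (block partition) regression
        of the distances on the rank order of the data."""
        order = np.argsort(data)
        d = np.asarray(distances, dtype=float)[order]
        d_hat = IsotonicRegression(increasing=True).fit_transform(np.arange(d.size), d)
        return np.sqrt(np.sum((d - d_hat) ** 2) / np.sum(d ** 2))

    # NL-stress is the stress of the true distances L(V) with respect to M,
    # AF-stress the stress of the output distances G(V) with respect to M.
    # If AF-stress exceeds NL-stress, the global minimum cannot have been reached.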
Since, however, equations (2) and (3) will turn out to be powerful tools, correlation will be used as the basic index for both NL and TF. In the next section we review the various indices which have been used for TF, where it will be evident that correlation is by no means the only alternative worth exploring.

10 NL-stress is a convenient way to express a specific index used for a general concept. The same convention applies to TF and AF and other specific indices referred to.

Table 3. Illustrations of relations between different procedures for inducing noise, n = 12. In rows 1) - 6) each cell is a root mean square correlation based on results, r(L, M), from 10 different random configurations. For a given dimensionality, t, the numbers in each column of the table are based on the same configurations. For each error proportion, different configurations were used. Rows 7) and 8) report mean correlations between error and true scores for the M vectors in row 1) and row 4), respectively.

  Error proportion, E              .05   .10   .15   .20   .25   .30   .35   .40   .45   .50
  t = 1
  1) Ramsay-Young, σ_V standard   .998  .990  .979  .962  .945  .919  .899  .873  .844  .824
  2) Ramsay-Young, σ_C standard   .997  .986  .969  .947  .921  .887  .856  .829  .970  .767
  3) Wagenaar-Padmos              .997  .985  .967  .950  .921  .887  .863  .829  .770  .739
  t = 2
  4) Ramsay-Young, σ_V standard   .998  .990  .976  .964  .944  .921  .897  .870  .827  .801
  5) Ramsay-Young, σ_C standard   .997  .987  .967  .951  .923  .895  .869  .835  .794  .755
  6) Wagenaar-Padmos              .992  .971  .943  .893  .851  .799  .789  .694  .685  .656
  7) r(L, M - L), t = 1           -.08  -.01  -.06  -.15  -.10  -.18  -.17  -.15  -.18  -.12
  8) r(L, M - L), t = 2           -.05  -.05  -.10   .03  -.07  -.13  -.10  -.13  -.14  -.15

Table 4. Overall means for results presented in Table 3. Means for rank correlations are included.

                                  Linear correlation        Rank correlation
                                  t = 1      t = 2          t = 1      t = 2
  Ramsay-Young, σ_V standard      .925       .921           .909       .917
  Ramsay-Young, σ_C standard      .898       .901           .879       .897
  Wagenaar-Padmos                 .895       .835           .916       .835

Comments to the results presented in Tables 3 and 4.

1) Correlations between true and error scores are presented in rows 7) and 8), based on σ_V standardization of the error term in the Ramsay-Young process. The corresponding results for σ_C were practically identical and are not reported. Increasing the error proportion beyond 0.50 tends to slightly increase the numerical value of the negative correlation. In the region E = 0.50-1.50 there may be correlations in the range -.20 to -.30. Further increases in error proportion, however, will reduce the numerical value of the true-error score correlation. As E increases without bounds r(L, M) will approach 0, the configuration is drowned in noise, and the correlation between true and error score will also approach 0. The implications of this will be further discussed in the next chapter. For the Wagenaar-Padmos procedure there were, as expected, no significant departures from 0 in the correlations between true and error scores.

2) Comparing row 1) with 4), we see that the compensation referred to on p. 36 does take place; the values are closely similar, and this is reflected in the very small difference in mean values in Table 4, .925 vs. .921. Other studies, also using 3-dimensional configurations, show the same similarity as here reported.

3) Comparing row 1) with 2), likewise row 4) with 5), we see that, as stated on p. 36,
standardization with σ_C gives consistently lower correlation, though the overall mean differences in Table 4 do not seem too impressive. As for σ_V standardization, there is no interaction with dimensionality.

4) As will be further discussed in Section 4.6, Table 3 does not cover the whole range of E of interest. For the Ramsay-Young process with σ_V standardization, values of E in the range .50-1.00 may be of interest; even values of E in the range 1.00-1.50 may give rise to acceptable solutions if n is not too small. E = 1.00 roughly corresponds to r(L, M) = .53, and E = 1.50 to r(L, M) = .31. Values of E larger than 1.50 almost invariably produce M vectors where L is beyond recovery.

5) For the Wagenaar-Padmos procedure there is a pronounced interaction with dimensionality: the values in row 6) are consistently lower than the values in row 3). The reason for this is not clear, neither is it clear why the rank correlation is substantially larger than the linear correlation for t = 1 (.916 vs. .895). This is highly significant (in 9 of 10 conditions the rank correlation was higher). It is possible that the Wagenaar-Padmos procedure may be very sensitive to the detailed form of the distribution of distances.

3.3 On indices of true fit (TF)

Although it seems important to agree on an appropriate measure of the discrepancy between G and L, this topic has never been treated systematically in the literature. In no case has there been any attempt to justify the type of index used. Broadly we may distinguish two different types of indices: 1. Indices based on the distance mode, that is the discrepancy between L(V) and G(V). 2. Indices based on the configuration mode. In this case it will be assumed that G has been orthogonally rotated to maximal similarity with L. Indices can then be computed to express the discrepancy between L(C) and G(C). Indices belonging to the first class are far more frequent than indices of the second class, and will be treated first. In his by now classical 1962 papers, Shepard used - as the first example illustrating the use of his method - as true configuration 15 points in 2 dimensions, the coordinates taken from Coombs and Kao (1960). The distances were subjected to a monotone transformation (but no noise was added). In multidimensional scaling the output configuration is only determined up to a similarity transformation. Origo can be freely chosen (translation), likewise the unit (dilation), and finally one may freely rotate orthogonally, since all these transformations leave the order of the distances unchanged. Shepard placed the origo at the centroid of the points (as is usually the case and will be assumed throughout this work). The dilation factor was constrained by setting the mean of the interpoint distances to unity, and the configuration was then rotated to maximal similarity with the true configuration, L. As a measure of true fit Shepard (1962b, p. 221) computed the "normalized mean square discrepancy between the reconstructed and the true distance":

(4) TF_1 = \sum (l_{ij} - d_{ij})^2 / \sum l_{ij}^2

Without further knowledge of this particular index the reported value of .00013 would be hard to evaluate. It seems like a comfortably small value, but a graphical presentation is far more informative: "That this discrepancy is indeed extremely small is shown graphically in fig. 3" (op.cit. p. 223). With perfect true fit the reconstructed configuration should coincide exactly with the true configuration. In Shepard's Fig. 3,
where L(C) is represented by crosses and G(C) by circles, it is evident that this is essentially the case: circles and crosses coincide. Indeed, it is hard to imagine any situation where the minute discrepancies would make any difference. Since Shepard used an errorless example this is as expected, and it substantiates the basic claim for nonmetric scaling, that metric information on coordinates is implied by rank order information on dissimilarities. Presenting his improved version of Shepard's program, Kruskal (1964a) used the same true configuration as Shepard, but in this case the dissimilarities were first distorted by addition of a normal deviate to each (transformed) distance. After transforming the reconstructed configuration, Kruskal chose to compute a "percentage difference". In the present terminology Kruskal's formula can be written:

(5) TF_2 = \sum (l_{ij} - d_{ij})^2 / \sum ((l_{ij} + d_{ij})/2)^2

While the numerator is the same as in equation (4), the normalizing factor is different, so that the sizes of the two indices are not comparable. Comparing Fig. 11 in Kruskal's paper with the corresponding figure in Shepard's paper, it is evident that the fit is somewhat worse in Kruskal's case, though for all practical purposes the amount of fit illustrated by Kruskal would be acceptable. In the next section we present figures to illustrate various levels of true fit, Figs. 2 and 3. Since Kruskal had added noise to his dissimilarities it is to be expected that the fit would decrease. If Kruskal and Shepard had used the same index for true fit, we would also have a numerical estimate for the amount of loss of true fit suffered by the addition of noise. None of the indices originally used by Shepard and Kruskal seem to have been used in later studies. An index related to equation (4) would be the stress of G with respect to L. The general form of the formula is identical, but stress would give smaller values since block partition values would then be substituted for the d_ij values, and as previously mentioned this implies a separate minimization process. This can be written:

(6) TF_3 = \sum (l_{ij} - \hat{d}_{ij})^2 / \sum l_{ij}^2

where \hat{d}_{ij} denotes the block partition values. Lingoes and Roskam (1971, p. 168) use yet another index:

(7) TF_4 = 1 - (\sum l_{ij} d_{ij})^2 / (\sum l_{ij}^2 \cdot \sum d_{ij}^2)

The only index which seems to have appeared in more than a single study is linear correlation - which also pertains to the distance mode - or some transformation of correlation. The first example is Shepard (1966), where, however, only the case with no noise was studied. Root mean square was used to compute mean indices. In a much more extensive design - to be discussed in detail in Section 4.4 - Young (1970) defined "metric determinacy" (true fit in the present terminology) as the "squared correlation between the true and the reconstructed distances" (op.cit. p. 458). When computing mean indices Fisher's z-transformation was used; this transformation was also used in various analyses of variance "to improve the normality of the distribution" (op.cit. p. 470). In his various attempts to predict true fit from stress, dimensionality and number of points, Fisher's z was used as the index to be predicted. In elaborating Young's design his pupil Sherman (1970, p. 22) used yet another function of a true fit correlation, TF-r.
He wanted a function of r maximally linearly related to stress, and through trial and error he found that the coefficient of alienation,

cal = \sqrt{1 - r^2}

was more closely linearly related to stress than were r, r^2 or Fisher's z transformation of r. As will be further discussed in Section 4.6, linearity is desirable for purposes of estimating true fit. It turns out, however, that Sherman's transformation, cal, severely violates linearity when the whole range of NL values is included. When the same type of transformation is applied to 1 - cal, linearity is restored practically over the whole range. The formula for this transformation, K, is:

(8) K = 1 - \sqrt{1 - (1 - cal)^2} = 1 - \sqrt{1 - (1 - \sqrt{1 - r^2})^2}

A further transformation of K was found desirable to adjust linearity for values of r very close to 1.0 and for values of r less than .70. These transformations will be further discussed in the next section, and data showing that these transformations do give linearity are presented in Section 4.6. Any transformation of r does of course presuppose that r basically is a valid measure of true fit. In his 1969 review of scaling, Zinnes strikes a highly critical note concerning the use of correlation as an index for true fit: "Shepard's use of the correlation coefficient as a measure of accuracy here seems unfortunate. While the correlation coefficient is a useful index when two variables are crudely related, it is practically useless when the variables agree on the ordering of the stimuli. This property of the correlation coefficient is amply demonstrated by Abelson and Tukey" (op.cit. p. 465). This critique may be stated too strongly. For one thing the general relevance of many of the examples referred to from Abelson and Tukey (1963) may be questioned, since they use a variety of rather bizarre linear orderings. Secondly, it is not clear from the critical comments made by Zinnes when correlation becomes useless. In the cases of interest in the present work, the variables will in most cases not agree on the ordering of the stimuli (rank correlation will be less than 1). On the other hand, the variables will not be "crudely related". A possible implication of Zinnes' criticism may be that r is too insensitive in the region close to 1; small departures from 1 may be of large practical interest. The transformations used in the present work "dramatize" very small differences in r when r is close to 1 to a much higher degree than Sherman's cal transformation. Fig. 6 in Section 4.6 illustrates this for r > .99, cfr. also the general survey of the transformations in Table 7 in the next section. We may, however, conclude that the use of correlation is not unproblematic. One feature shared by all indices of type 1 is an indirect quality. There are n(n - 1)/2 elements involved in the index. In contrast, for an index of type 2 there will be nt elements, which generally will be a much smaller number than n(n - 1)/2. A type 2 index is "closer to" the visual impression of discrepancies in figures where true and reconstructed configurations are juxtaposed. We first discuss one such index, then turn to a more general discussion of criteria for evaluation of indices.
The familiar coefficient of congruence, C_0, can be given some simple interpretations:

C_0 = \sum\sum l_{ik} g_{ik} / \sqrt{\sum\sum l_{ik}^2 \cdot \sum\sum g_{ik}^2}

It is convenient to fix the unit of G so that \sum\sum l_{ik}^2 = \sum\sum g_{ik}^2, and the formula for C_0 then simplifies to:

C_0 = \sum\sum l_{ik} g_{ik} / \sum\sum l_{ik}^2

Another useful version of this formula is:

C_1 = \sqrt{1 - C_0}

Consider now the discrepancies between each pair of corresponding points in, for example, Fig. 4. For each of the n pairs, say pair i,

\sum_{k=1}^{t} (l_{ik} - g_{ik})^2

indicates the total error for all coordinates (in Euclidean space simply the squared "error distance"). Summing across all pairs of points, norming by the total sum of squares of coordinates, and finally taking the square root, we get a normed root mean square of coordinate errors, for short NCE:

NCE = \sqrt{\sum\sum (l_{ik} - g_{ik})^2 / \sum\sum l_{ik}^2}

Since

\sum\sum (l_{ik} - g_{ik})^2 = 2(\sum\sum l_{ik}^2 - \sum\sum l_{ik} g_{ik})

we get:

NCE = C_1 \cdot \sqrt{2}

Consider next the root mean square of distance errors normed by the root mean square interpoint distance, for short "normed distance error", NDE. The mean square interpoint distance can be shown to be 2 \sum\sum l_{ik}^2 / (n - 1), and then:

NDE = \sqrt{\sum\sum (l_{ik} - g_{ik})^2 / n} \; / \; \sqrt{2 \sum\sum l_{ik}^2 / (n - 1)} = C_1 \cdot \sqrt{(n - 1)/n}

Finally we may give a more abstract interpretation by considering L and G as two points in nt-dimensional space, what Kruskal (1964b, p. 30) calls "configuration space". The distance between L and G is simply

\sqrt{\sum\sum (l_{ik} - g_{ik})^2}

Norming this by the length of L, \sqrt{\sum\sum l_{ik}^2}, we get the normed error length, NEL. The formula for NEL is the same as for NCE:

NEL = C_1 \cdot \sqrt{2}

In practice the coordinates of G in the preceding equations are determined by orthogonal rotation of the preliminary output to maximal similarity with L. One algorithm for accomplishing this is described by Cliff (1966), who maximizes \sum\sum l_{ik} g_{ik}, which is equivalent to maximizing C_0. This algorithm is used in the present work. Though Cliff does not explicitly describe an overall goodness of fit index, C_0 is a natural choice, and has been applied by Shepard and Chipman (1970, p. 11) under the name "coefficient of agreement". Confronted with a variety of criteria as in the present section, one needs a set of methodological rules by which to judge the appropriateness of any proposed index, either considered singly or relative to other indices. The following list of desiderata is presented very tentatively:

a) Predictability. We may prefer the index which lends itself best to prediction from other variables (for instance stress, dimensionality, n). In a very general context, Cattell (1962) argues that the variables which are most clearly related to other variables are more "basic" than others; this may be called Cattell's criterion.

b) Intuitive appeal. Admittedly this sounds like a very subjective and ambiguous criterion. It is, however, listed since it may be a useful criterion when the domain is uncharted, as in the present case. By presenting figures as done by Shepard and Kruskal (other examples will be given in the next section), weaknesses of current indices might become apparent and other alternatives suggested. The next section demonstrates how this criterion may be given experimental definition and discusses further possibilities implied by this approach.

c) Simple boundaries. Indices where 0 and 1 are boundaries may be preferable.
The stress measure has for instance been criticized because there is no well defined upper boundary. While the three criteria above may have general applicability, the next two are designed to illuminate specific problems.

d) Comparability with index for noise level. In order to give operational specification to the concept of "purification" it is necessary to use the same kind of index both for noise level and true fit. Since data are only given in vector form, C_0 is then by definition inappropriate. Correlation and all other indices of type 1 may, however, be used, since all three terms in the metamodel can be stated in vector form. In the present work correlation will be used as a basis for studying purification. Stress, however, has been found to give similar results.

e) Component breakdown. For some purposes it may be useful to break down the total discrepancy into separate components for each point (or for each dimension). This is fairly simple for C_0 (or rather C_1). It is then convenient to define each component so that the total discrepancy is a root mean square of the components. The discrepancy for point i, h_i, is:

(9) h_i = \sqrt{\sum_{k=1}^{t} (l_{ik} - g_{ik})^2 \cdot B_0}

where B_0 is defined so that C_1 = \sqrt{\sum_{i=1}^{n} h_i^2 / n}, that is B_0 = n / (2 \sum\sum l_{ik}^2).

Similarly one may study discrepancies for each dimension:

(10) h_k = \sqrt{\sum_{i=1}^{n} (l_{ik} - g_{ik})^2 \cdot B_1}

where B_1 is defined so that C_1 = \sqrt{\sum_{k=1}^{t} h_k^2 / t}, that is B_1 = t / (2 \sum\sum l_{ik}^2).

For equations (4), (5) and (6) it is also simple to give a similar breakdown for points. Equation (4) will be used as an example. For point i the discrepancy is:

(11) h_i = \sqrt{\sum_{j=1}^{n} (l_{ij} - d_{ij})^2 \cdot B}

where B is defined so that TF_1 = \sum h_i^2 / n, that is B = n / \sum l_{ij}^2.

It does not, however, appear possible to give a similar breakdown for contributions of separate dimensions for equations (4), (5) and (6). A disadvantage of correlation, and also of equation (7), is that it does not easily admit of component breakdown. A complete investigation of all the indices would be prohibitive. As an example of the general approach we conclude this section by comparing correlation and C_1 with respect to Cattell's criterion. Correlation is selected not only because it is usually used, but because it may be a powerful tool by virtue of equations (2) and (3). C_1 is selected because it represents a quite different approach. It may have a somewhat more direct quality than r, but perhaps a more important reason for including this index is that it easily admits of component breakdown. The ideal situation would be if both r and C_1 gave equally convincing results both from the point of view of predictability and intuitive appeal. The indices would then function in a complementary way. Each could be put to work for special purposes where the other one did not apply (r for the study of purification, C_1 for studying component breakdown), yet true fit could be retained as a unitary concept by virtue of the two general desiderata. The relation between r and C_1 was studied in six different conditions generated by two values of n (10 and 15) and three values of t (1, 2 and 3). For each of the six different conditions 25 different random configurations were analyzed (5 different values of E (noise level) and 5 replications for each noise level). This study actually was a replication of the study done by Young (1970), and detailed results are given in Section 4.5.
It was found that 1 − r² and C1 were practically linearly related and Table 5 reports the relation between these two indices.

Table 5. The relation between correlation and congruence as alternative indices for true fit. The correlational index is 1 − r², the congruence index $C_1 = \sqrt{1 - C_0}$. Each cell reports a correlation based on 25 different configurations.

                 t - dimensionality
   n            1        2        3
   10         .986     .970     .978
   15         .996     .972     .982

   Root mean square of the 6 correlations above: .981
   Correlation between 1 − r² and C1 across 150 configurations (merged correlation): .972

The correlations seem to be satisfactorily high. It is of special importance to note that the merged correlation is not appreciably lower than the averaged correlation (.972 vs. .981). If the merged correlation had been substantially lower this would have indicated that the relation between r and C1 as indices of true fit depended upon "irrelevant" parameters (such as n and t). In this case true fit based on r would have been a different concept than true fit based on C1. The present results indicate that we may (tentatively) regard true fit as a unitary concept.

Since r and C1 are not perfectly related it is possible that one of the indices is more predictable than the other. There are two main predictor variables, NL (where we also use correlation) and AF (where we use stress). The results to be reported are based on rank correlations between these predictors and the criteria TF - r and TF - C0.11 For NL the root mean square correlation with TF - r is .944 whereas the corresponding correlation with TF - C0 is .929. In 5 of the 6 conditions TF - r is the most predictable criterion. For AF the root mean square correlation with TF - r is .933, the corresponding correlation with TF - C0 is .927; for this criterion, however, TF - r was more predictable in only 2 of the 6 conditions. The present results indicate that from the point of view of predictability, if there are any differences at all, these differences will probably favour correlation as the basic index for true fit.

We may further note that C1 does not have a clearly defined upper boundary. Even if there is no common structure for G and L, C1 will be less than 1 since the rotational procedure will capitalize on noise. The higher the dimensionality and the lower the value of n, the further the expected upper boundary for C1 will fall below 1.0.

It does not seem likely that any of the other indices mentioned in this section would give appreciably different results. There might, however, be special circumstances where the specific index used may make some difference. One could conceive of elaborate designs using a variety of indices for each of the three basic relations NL, AF and TF. Influenced by Cattell's criterion, perhaps also using the logic propounded by Campbell and Fiske (1959) in their multitrait-multimethod approach, one might work out more precise criteria for deciding, first, whether it was reasonable to regard each of the three basic relations as a unitary concept, and second, if so, what the best index for each relation would be. At present, however, it is difficult to see that much could be gained by such refinements. We conclude that it is legitimate to use correlation as the basic index for true fit. If n is not too low C1 may be used for special purposes, for instance if it is desired to study component breakdown.
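The near-linear relation reported in Table 5 is easy to probe on synthetic data. The sketch below is again an illustrative Python fragment with hypothetical names: the noise process is a simple Gaussian perturbation of the coordinates rather than the Ramsay-Young process used in the actual study, and TF - r is taken here as the correlation between the interpoint distances of L and G.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import pdist

def c1_index(L, G):
    """C1 = sqrt(1 - C0) after rotation to congruence and fixing the unit of G."""
    L = L - L.mean(axis=0); G = G - G.mean(axis=0)
    R, _ = orthogonal_procrustes(G, L)
    G = (G @ R) * np.sqrt((L ** 2).sum() / (G ** 2).sum())
    return np.sqrt(1.0 - (L * G).sum() / (L ** 2).sum())

rng = np.random.default_rng(1)
n, t = 15, 2
one_minus_r2, c1 = [], []
for noise in (0.05, 0.1, 0.2, 0.4, 0.8):          # five noise levels
    for _ in range(5):                            # five replications per level
        L = rng.normal(size=(n, t))
        G = L + noise * rng.normal(size=(n, t))   # crude stand-in for an analyzed output
        r = np.corrcoef(pdist(L), pdist(G))[0, 1]
        one_minus_r2.append(1.0 - r ** 2)
        c1.append(c1_index(L, G))

# Correlation between the two indices across configurations; typically high, cf. Table 5.
print(np.corrcoef(one_minus_r2, c1)[0, 1])
```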
Since C1 may be somewhat easier to give an intuitive interpretation the next section will give descriptions both in terms of r and C1.

11 Using linear correlation gave numerically highly similar results, though more clearly favouring TF - r. This might, however, be due to the fact that C1 may not be maximally linearly related with the predictor variables used; consequently results based on rank correlations are reported here.

3.4 Direct judgments of true fit. An empirical approach

In the experiment to be discussed in this section subjects judged varying degrees of true fit in visually presented figures. This study has three major purposes.

a) Check intuitive appeal of numerical indices. Intuitive appeal was mentioned as one of the criteria for evaluating indices on p. 45. Hopefully there will be no serious discrepancies between the ordering given by judgments and the ordering on the basis of numerical indices. This will give increased confidence in these indices. If on the other hand there are pronounced discrepancies between the numerical and judgmental orderings the situation will be quite problematic. Further theoretical and empirical analyses will then be necessary to bring out possible flaws of present indices which may then suggest improved ones.

b) Define a cutoff point between acceptable and unacceptable degrees of true fit. At some point the discrepancy between L(C) and G(C) will be so pronounced that the results will be useless. Is it possible to find a "natural" boundary between acceptable and unacceptable degrees of true fit?

c) Provide true fit categories. This is a broader concern than just defining a cutoff point. The present study is the main basis for providing categories to replace the use of stress and the associated verbal labels provided by Kruskal, cfr. p. 5-6. (How the correct category in practice may be estimated is discussed in Section 4.6.) A specific advantage of the present approach is that visual presentations may provide a useful perceptual anchoring to these categories. This may reduce some arbitrariness associated with interpretation of categories merely described by general verbal labels.

Stimulus material and judgment tasks. Two sets of pairs of configurations were used as stimulus material. For 2 dimensional configurations 23 pairs of (L, G) configurations were constructed, presented in the same way as the previously referred to examples by Kruskal (1964a) and Shepard (1962b). Fig. 3 gives four examples of the 2 dimensional pairs used. Each configuration consisted of 20 points. The pairs were constructed to reflect several sources of variance: first, they covered the whole range of TF-values (the actual distribution will be given later in Table 7); second, both systematic (circular and rectangular arrangements) and random configurations were used. Similar principles were followed in constructing 22 pairs of 1 dimensional configurations; four examples from this set are presented in Fig. 2, see p. 52. In Fig. 2 the pairs have been reduced 10% in size from the original size used in the experiment, in Fig. 3 the reduction in size is 50%. Each pair was supplied with an arbitrary label for convenience in recording the judgements; of course no other information about the pairs was given. Five colleagues served as judges. All of them had some training in multivariate techniques.
For each series the subjects were asked to judge how well one configuration served as a description of the other - how well they matched each other. The pairs were to be rated on a scale from 100 (perfect) to 0 (worst) so that scale distances reflected differences in how well the pairs matched. In essence this instruction asked for both rank order and interval scale information; the latter type of information will, however, be ignored since the subjects complained that this aspect of the task was quite difficult. Finally they were asked to partition the scale which they had used into 5 categories: "Excellent", "good", "fair", "poor" (barely acceptable), "off" (unacceptable). Comments made during the judgments and subsequent discussion with the judges will play an important part in the following presentation and discussion of the results. For each of the three main problems there will also be supplementary elaborative and technical comments.

Results and discussion

Intuitive appeal. Concerning the intuitive appeal of r as an index of true fit, the results were very encouraging. For the 2 dimensional series the rank correlation between the subjects' rankings and TF - r ranged from .925 to .960 with median value .935. For the 1 dimensional series the results were even better, rank correlations ranging from .955 to .990 with median .980, and the deviations from perfect correspondence showed little consistency between subjects. During the discussions the pairs were sorted on the basis of the numerical indices and the judges were encouraged to criticize this ordering on the basis of their own deviations from it. This produced little in the way of consistent answers. These qualitative observations from discussions with the judges further support the validity of using r as an index of true fit.

As would be expected from the results reported in the previous section, the present design was inadequate to give differentiating information on r vs. C1 as indices for true fit, as the rank correlation between these indices was above .99. It may, however, turn out to be possible to construct stimulus material where r and C1 will give quite different orderings of the pairs. A possible basis for such discrepancies is suggested by comments made by the judges. Occasionally they experienced conflicts - how should few large errors be weighed against many small ones? Detailed studies of judgments by experienced subjects where the pairs differ markedly in the distribution of errors might reveal flaws with r and/or C1 as indices of true fit. The construction of the present stimulus material precluded a more detailed study of such problems since this conflict was not pronounced. In the present cases the judgment of correspondence may be a fairly simple, predominantly perceptual process. It might be mentioned that five 2 dimensional pairs - all within the acceptable range - were perfectly rank ordered by a 6 year old boy. The tasks were performed with little hesitation. The main reason that the present judgments give such clear results may be that the error process used when generating the stimulus material assumed the same error variance for all points. Where this assumption is highly questionable, the present results must be used with extreme caution, cfr. also the comments on specification 7 on p. 33.

Cutoff point.
The judgments showed a remarkable agreement not only concerning rank order but also in providing a boundary differentiating acceptable from unacceptable degrees of fit. On the basis of the judgments, r = .707 was found to be the best boundary (the precise value was decided on the basis that for this value of r half the variance in L and G is common variance). Cross tabulating this boundary against judgments of acceptability gave the results presented in Table 6.

Table 6. Concordance between judgments of acceptability by 5 subjects and a numerical criterion (r > .707).

               Number of pairs       Number of pairs
               judged acceptable     judged unacceptable     Sums
r < .707                4                     61              65
r > .707              157                      3             160
Sums                  161                     64              225

% concordance: (61 + 157)/225 = 96.9

We see that the judgments by the subjects and the mathematical criterion concorded in 96.9% of the cases. For the 1 dimensional series, 7 of the 22 pairs were in the unacceptable region (r < .707) and for the 2 dimensional series correspondingly 6 of 23 pairs. The high intersubjective agreement is not due to a "lumping together" of unacceptable pairs in contrast to acceptable ones. For both series all subjects had close to perfect rank ordering also within the unacceptable range. It was thus not that they simply differentiated between "degrees of structure" and "no structure"; rather, at a certain point the degree of structure seemed so "diluted" that it no longer seemed worthwhile to try to interpret the configuration. Examples of unacceptable pairs are shown in Figs. 2d) and 3d).

A further problem is whether it is possible to find any support for the presently proposed cut-off point in the literature. As we shall see in Section 4.4, both Shepard (1966) and Young (1970) would perhaps prefer TF - r to be higher than .90. It is, however, not possible to give much weight to their views since they are stated obliquely, lacking in precision. It is, although by a fairly indirect procedure, possible to find support for the cutoff point here proposed in a paper by Cliff (1966, p. 41). In the context of defining the number of factors - by determining whether a factor is identifiable across parallel data sets (after orthogonal rotation to congruence) - he writes: "Factors are matched one-by-one and correlations between loadings on the corresponding factors are determined…. Preliminary experience indicates that a correlation of .75 is minimal if the factors are to have recognizably the same interpretations." We now take the value .75 also to apply to whole configurations (a value for a configuration can be regarded as averaged across dimensions). It is then possible to find a value of r roughly equivalent to C0 = .75. First it must be noted that congruence between parallel data sets, C0(G1, G2), as in Cliff's procedure, will be lower than congruence between a data set and a true configuration, C0(L, G). Unfortunately one does not have the same simple relation between C0(G1, G2) and C0(L, G) as between r(M1, M2) and r(L, M), cfr. equation (2). By generating two different M vectors for each of a number of L configurations the relation between C0(G1, G2) and C0(L, G) can be studied, and it was found that C0(G1, G2) = .75 corresponded to C0(L, G) = .85. From extensive studies of the relation between r and C0 this was finally found to correspond to approximately r = .72, a value remarkably close to our proposed cutoff point of r = .707.

True fit categories.
The final step is to divide the acceptable region into categories. There was, however, no uniform agreement on the range to be encompassed by the descriptive labels supplied. Two subjects did for instance use "poor" in a way which in interview turned out to be equivalent to: "I would never try to interpret my data if the fit was that bad. Neither would I pay attention to any conclusions - however tentatively stated - others might care to make from such data". On this basis - in agreement with the subjects - their "poor" category was included under "off", and further "poor" was found to be too ambiguous to retain as a descriptive label. The discussions with the subjects suggested three categories in the acceptable range: Excellent, good and fair. Granted that there will be different requirements in different fields of psychology, the following descriptions of the categories are tentatively offered.

Excellent: This is the required fit if the fine grain of the data is very important. This will be the case in advanced fields of psychology. If for instance one should want to recommend changes in the Munsell colour system on the basis of multidimensional scaling, "excellent" true fit would be required. Fig. 3a illustrates a case very close to this category, cfr. also Fig. 2a.

Good: This will be the fit typically found in good work in social psychology and personality. Not only major clusters, but also rough ordering in various directions can be interpreted fairly confidently. Fig. 3b illustrates an example at the lower boundary of this category.

Fair: For this level of fit major clusters can be identified though there will be some error in allocation of members. The provisional nature of interpretations of clusters should be stressed. Figs. 2c and 3c provide illustrations of the worst level of fit where interpretation may be possible.

The discussions with the subjects just suggested ranges for boundaries between categories; the precise boundaries were dictated by numerical considerations. When the three acceptable categories were compared with the K transformation, equation (8), it was found that the "good" category covered a much larger domain than the "fair" category. On this basis the "good" category was further subdivided into two categories: "very good" and "good". Table 7 summarizes the statistical properties of the final set of categories and also gives the distribution of the figures in the stimulus material in terms of these categories.

Table 7: True fit categories and distribution of figures used in the experiment.

                                      Excellent   Very good    Good     Fair    Unacceptable
1) r - correlation                      .994        .976       .922     .707        0
2) K = √(1 − (1 − √(1 − r²))²)          .457        .623       .790     .956       1.0
3) TF - categories                    −.5 – 1.0     2.0        3.0      4.0        4.5
4) C0 - congruence*                     .998        .99        .96      .84         -
5) C1 - normed distance error*          .05         .10        .20      .40         -
No. of 1 dimensional pairs               5           2          3        5          7      Sum 22
No. of 2 dimensional pairs               2           0          6        9          6      Sum 23

*) Approximate values.

Unfortunately the "very good" category was not represented in the 2 dimensional series (but see Fig. 2b for a 1 dimensional example). The subdivision of the preliminary "good" category made the range of each of the three middle categories equal in terms of the K transformation. There are two different ways of using the description of true fit. The simplest is just to ask to which category a given solution belongs; perhaps this will be sufficient in most cases.
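As a small illustration of this simplest, category-lookup use of Table 7, the following Python sketch (with function and variable names of my own choosing, not the author's) turns an observed true-fit correlation into the K value of equation (8) and the corresponding verbal label, using the boundaries in rows 1) and 2) of the table.

```python
import math

def k_transform(r):
    """Equation (8): K = sqrt(1 - (1 - sqrt(1 - r^2))^2)."""
    return math.sqrt(1.0 - (1.0 - math.sqrt(1.0 - r * r)) ** 2)

# Lower r-boundaries of the acceptable categories, from row 1) of Table 7.
CATEGORY_BOUNDS = [
    ("excellent", 0.994),
    ("very good", 0.976),
    ("good",      0.922),
    ("fair",      0.707),
]

def tf_category(r):
    """Verbal TF-category for an observed true-fit correlation r."""
    for label, bound in CATEGORY_BOUNDS:
        if r >= bound:
            return label
    return "unacceptable"

for r in (0.998, 0.95, 0.80, 0.60):
    print(r, round(k_transform(r), 3), tf_category(r))
```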
The K transformation does, however, invite more precise numerical descriptions. This will for instance be useful in Section 4.6 when we discuss with what accuracy one can estimate true fit from stress, n and t. It may be convenient to have simpler numerical representations of category boundaries; this is provided in row 3) in Table 7. In the range K = .457 to K = .956 this is a linear transformation where 1.0 corresponds to .457 and 4.0 corresponds to .956. From the point of view of the end categories the K transformation was not completely satisfactory. This transformation would make the "excellent" category more than twice the size of the middle categories, and the "unacceptable" region would be negligible. As will be seen from the results in Section 4.6, making the "excellent" category 50% larger than the middle ones, and the "unacceptable" category 50% smaller, gives maximally linear relations with stress. This implies using a different linear transformation for each of the end categories. The term TF-categories will be used to refer both to the numerical description and to the proposed labels for the different categories. Statements where TF is assigned a numerical value always refer to the above discussed linear transformations of the K values.

Concluding comment. At the present state of development of multivariate techniques some arbitrariness and lack of precision in describing the true fit index is unavoidable. There are bound to be individual differences among researchers as to the requirements they will make in different cases. Further progress may be expected as experience is gained with the proposed index. It will then be important to gain systematic knowledge of what actually occurs when researchers interpret configurations in specific cases. Exactly what topological (and other) properties do they stress in what contexts? How important may for instance serious misplacements of a few points versus small misplacements of many points be in concrete cases? Knowledge on questions like this may help the construction of more refined indices and will be in line with the plea in Ch. 1 for obtaining descriptions of how research actually occurs.

Fig 3: Illustration of different levels of true fit for 2 dimensional configurations, n = 20. ° represents true position, ● reconstructed position. Corresponding points are connected by straight lines. * a) should be classified as "very good", but is labelled "excellent" since it is very close and it was found desirable to represent the best category.

Chapter 4 BEYOND STRESS – HOW TO ASSESS RESULTS FROM MULTIDIMENSIONAL SCALING BY MEANS OF SIMULATION STUDIES

4.1 Two approaches in simulation studies. This chapter contains the promised guidelines for the problems of applicability, dimensionality and precision. Since these are based on simulation studies the first step is to point out two different strategies which have been employed in previous simulation studies. The first approach is influenced by the perhaps unfortunately strong emphasis on statistical significance testing in much of psychological research.
Systematic attempts to provide information on how to assess stress were made independently of each other by Stenson and Knoll (1969) and Klahr (1969), who provided tables and graphs giving expected stress values as functions of n and t. The sampling distribution of stress for given values of n and t is simple to generate by analysis of a sufficient number of random sets, each set for instance being a permutation of the first $\binom{n}{2}$ integers. Klahr reported complete sampling distributions; Stenson and Knoll just reported average values, and also provided some rough rules of thumb for how far stress should deviate from the average to be regarded as statistically significant. The conventional advice to be expected from such studies is to provide stress values corresponding to the customary 5% limit and recommend this as a cutoff point separating acceptable from unacceptable solutions. This is for instance done in the most recent study of this type (Wagenaar and Padmos, 1971). Concerning this approach we first note that it represents a very different strategy for separating acceptable from unacceptable solutions than the approach used in Section 3.4. Results comparing the consequences of these two approaches will be discussed in Section 4.6. At present we note that in many cases the scientist will not primarily be interested in whether or not he can safely reject the null hypothesis of no structure. Just as one will not be content to know that the reliability of a test significantly departs from zero, but wants an estimate of the amount of internal structure, we believe that the researcher in multidimensional scaling typically will be concerned with the amount of structure in his material. This corresponds to the second approach in simulation studies, where we offer TF-categories (as discussed in Section 3.4) as an index for amount of structure. For a given number of points, n, and dimensionality, t, we shall see that stress (AF) and TF are functions of the amount of noise added to the true configuration. Analyzing random sets corresponds to a special case of this more general approach. If a given true configuration is subjected to increasingly higher levels of noise the configuration will eventually be "drowned" in noise. As noise level increases without bounds we end up with a random set of dissimilarities. Analysis of random sets may thus be regarded as a special case of the more general approach of studying consequences of noise levels. This will be apparent in the results in Section 4.6.

4.2 Classification of variables in simulation studies. Very briefly the steps in simulation studies are: construct L, introduce error to get M, analyse M to get G, then finally predict TF from AF (and n and t). Though this may sound simple we shall see that there are a large number of variables to explore. The parameter space is of such a staggering complexity that anything even remotely resembling a factorial design is completely out of the question. The list of variables presented below is, however, intended to be exhaustive. We indicate which of the variables will be explored in the present work. The variables which are listed and not explored may then point to further studies. The classification offered below also serves as a brief summary of Chapter 3. The basic distinction in the classification of variables is: A.
Specifications necessary to make when running a simulation study, what may here more loosely be referred to as independent variables. B. Consequences of the specifications made in A, loosely here referred to as dependent variables. It will be convenient to further subdivide the independent variables into three different classes: A1. Simple quantitative variables. A2. Complex quantitative variables. A3. Qualitative variables. where the three classes differ in complexity. The main focus of the present work is to outline methods to deal with main effects and interactions for a subset of the variables in A1. Main effects and especially interactions where variables from the more complex classes are involved are a nuisance since this will limit the relevance of the present simulation studies for many experimental situations. We now consider variables which belong in each of the three classes: A1. Simple quantitative variables. n - number of points in the configuration E - degree of error, error proportion T - dimensionality of true configuration m - dimensionality of analyzed configuration p - Minkowski constant of true configuration (type of metric space) pl - Minkowski constant of analyzed configuration. In Section 4.6 we will set m = t, corresponding to the simplest form of the metamodel, cfr. Section 2.1. In Section 4.7 we then explore several values of m for a given value of t and analyse the resulting stress curves as implied by the extended form of the metamodel in Section 2.2. Exploring several values of pl for a given value of p, may be relevant to finding the true type of metric space, cfr. for instance the analysis of colour space by Kruskal (1964a, p. 23-24). This strategy also produces “stress curves”, such stress curves will, however, not be studied in the present work which is limited to the Euclidean case, that is p = pl = 2. A2. Complex quantitative variables. A2.1 - relative variance in different dimensions of the true configuration. A2.2 - relative error variance for points and/or dimensions. A2.3 - degree of ties in the dissimilarities. Concerning A2.1 it is clear that generally configurations will vary in the extent to which they have equal variance in all dimensions. So far systematic studies have been limited to the special case of equal variance in all dimensions. Studying the more general case will probably give results "between" those which to now have been obtained for various values of t. Results for a "flattened" twodimensional configuration may for instance be expected to fall between the results for t = 1 and t = 2. A2.2 has already been discussed, cfr. specifications 6 to 8 in the discussion of noise level in Section 3.2. Concerning A2.3, the present investigation will be limited to the study of sets of dissimilarities without ties. Notice, however, that the metamodel implies a promising approach to the problem of how to treat ties. As implied by the brief treatment in Section 3.1 the primary approach to ties - which allows untied Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 56 distances for tied data without downgrading stress - will automatically give lower stress than the secondary approach. It is by no means, however, evident which approach will produce best true fit. This will be the subject for a separate investigation. In the present study we consistently make the simplest possible choices for all the variables in A2. 
The simulation studies will be limited to configurations which have equal variance in all dimensions, error variance will be assumed homogenous, and there will be no ties. In a variety of experimental situations (perhaps most?) these assumptions will be unrealistic and it seems likely that generalizations from the simplest cases to the more complex will be tenuous indeed. These simplifications are probably the single most important limitation of the results to be discussed in Section 4.6 and 4.7. Future studies will deal with the more complex cases. Probably the best strategy will be to do simulation studies tailormade to the most reasonable specifications in given experimental contexts. In the present introductory work the best strategy seems to be to concentrate on the simplest cases to bring forth the general logic of the present approach as clearly as possible. A3. Qualitative variables. A3.1 - Definition of configuration. A3.2 - Type of error process. A3.3 - Type of algorithm. For each of the variables in A2 one can conceive of a set of quantitative parameters to describe them, this does not, however, seem possible for the variables in A3. Consider first A3.1 which perhaps may be considered as a domain of variables. One obvious distinction is between configurations defined by a random process versus by some systematic procedure. One procedure has been to use subsets of a fixed list of uniform random numbers from Coombs and Kao (1960) as coordinates (Coombs -Kao coordinates have been used both by Shepard (1966) and Young (1970)). A highly related approach is to simply use some routine generating uniform random numbers. In Section 4.4 we shall see that these two procedures have been reported to give quite different results. If this really is the case it is doubtful whether results from random configurations can be applied to systematic configurations, for instance a circular configuration. For more examples see the configurations studied by Spaeth and Guthery (1969). To the extent that results depend upon type of configuration their general relevance will be highly limited. A3.2 has already been extensively discussed, cfr. specifications 1 to 5 in Section 3.2. For both A3.1 and A3.2 one would hope that the results will be largely independent of a concrete set of specifications, in other words that there are neither main effects nor interactions associated with these qualitative variables. Obviously it is practically impossible to investigate this with any degree of completeness. The strategy to be pursued here is to compare a few examples differing in many aspects of A3.1, A3.2 and A3.3 and on this basis make some tentative generalizations.12 A3.3 has been discussed in detail in Section 3.1. The simplest case would be if one algorithm turned out to be better than all the others, or if they turned out to be equally good. The choice between them could then be made on the basis of other considerations, the most relevant such consideration in most cases would probably be computer time. If it turned out that some algorithms gave best results for some conditions, other algorithms for other conditions, we would have a fairly messy situation, making life more difficult for users of multidimensional scaling. In Section 4.3 we compare the currently most popular algorithms from the point of view of the metamodel. B. Dependent variables. As a result of making specifications for Al, A2 and A3, there will be values of the three basic relations NL, AF and TF. 
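A minimal end-to-end sketch of the unrepeated design in its simplest form (m = t, Euclidean, homogeneous error) may help fix the three basic relations just mentioned. The Python fragment below is only schematic and not the author's procedure: sklearn's nonmetric MDS is used as a generic stand-in for MDSCAL or TORSCA, the error process is a simple lognormal perturbation of the true distances rather than the Ramsay-Young process, the stress value is recomputed post hoc as an approximation to the program's own, and all names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.isotonic import IsotonicRegression

def kruskal_stress(dissim, dist):
    """Kruskal's stress formula 1: monotone-regression disparities vs. distances."""
    disparities = IsotonicRegression().fit_transform(dissim, dist)
    return np.sqrt(((dist - disparities) ** 2).sum() / (dist ** 2).sum())

rng = np.random.default_rng(2)
n, t, noise = 15, 2, 0.3

L = rng.normal(size=(n, t))                                   # construct L
true_d = pdist(L)
M = true_d * np.exp(noise * rng.normal(size=true_d.shape))    # introduce error to get M
G = MDS(n_components=t, metric=False, dissimilarity="precomputed",
        n_init=4, random_state=0).fit_transform(squareform(M))  # analyse M to get G
rec_d = pdist(G)

NL = np.corrcoef(true_d, M)[0, 1]        # noise level: relation between L and M
AF = kruskal_stress(M, rec_d)            # apparent fit: relation between M and G (stress)
TF = np.corrcoef(true_d, rec_d)[0, 1]    # true fit: relation between G and L
print(NL, AF, TF)
```

Repeating such runs over many configurations, noise levels, n and t is what produces the material from which TF is later predicted from AF (and n and t).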
Since these relations have been discussed in detail in Ch. 3 a summary comment is sufficient here. It will be recalled that for each of these three relations there are a variety of indices to 12 See the study of systematic configurations in Section 4.7 for an example of this strategy for A3.1 and A3.2 Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 57 give them numerical expression. Preliminary results strongly indicate that results are independent of the specific indices chosen, this justifies treating each of the three basic relations as unitary concepts. Theoretical reasons indicate linear correlation as the basic index for NL and TF, stress is a convenient choice for AF since it is so well known. Occasionally, however, other indices will be used to illustrate more specific points. Notice that though in the present classification of variables it is logical to classify AF as a dependent variable, it will in Section 4.6 be most convenient to consider AF, n and t as independent variables and just TF as the dependent variable. Having now summarized A and B there is a final distinction to be made: to differentiate between unrepeated and repeated designs. As the terms are used here in the former case a given true configuration is only analyzed once. This corresponds to the simple case of the metamodel and is the only type previously reported in the literature. To simplify, t will be assumed known in this case and we will set m = t. For repeated designs a given true configuration will be analyzed several times, each time the actually introduced error components will be different. This corresponds to the extended form of the metamodel. Unrepeated designs are discussed in Section 4.6, repeated ones in Section 4.7. These two types of designs are regarded as supplementary, that is they should give the same results on the aspects where they are comparable. A discussion of previous contradictory results in simulation studies in Section 4.4 will bring this out. From the point of view of our three main problems: precision, dimensionality and applicability the next section –comparing algorithms - may seem as a digression. Yet before proceeding with our major concern it is of importance to explore whether there is one tool (algorithm) better suited for our main purpose than alternative ones. 4.3 Comparison of algorithms. Before discussing the broad question of comparing algorithms, it is first necessary to discuss the widespread practice of using Kruskal's algorithm with an arbitrary initial configuration. 4.31 Choice of initial configuration and local minima. Concerning local minima Kruskal (1964b, p. 35) writes: "Experience shows that this is not a serious difficulty because a solution which appears satisfactory is unlikely to be merely a local minimum”. If it is feared that a given solution does not represent an overall minimum it is recommended to start again, for instance by "using a variety of different random configurations to repeatedly scale the same data …if a variety of different starting configurations yield essentially the same local minimum, with perhaps an occasional exception …..then there is little to worry about” (Kruskal, 1967, p. 23). From the point of view of the user this state of affairs is, however, clearly a nuisance. 
No unequivocal criteria of “satisfactory solution" is presented by Kruskal, furthermore having to run the same data repeatedly makes the program much more inconvenient to apply, both from the point of view of time used to preparing runs and in terms of computer time. There is now a fair amount of evidence that the risk of being trapped in a local minimum may be very high if Kruskal's initial configuration is used. For 120 twodimensional error free configurations Shepard (1966, p. 297 note 6) reported that in about 15% of the cases it proved necessary to begin again from a differing staring position. On the other hand Spaeth and Guthery (1969, p. 509) in their analysis of 19 one and twodimensional simple geometric shapes reported that: "in well over half the runs MDSCAL landed at a minimum other than the true overall minimum." Wagenaar and Padmos (1971, p.102) more generally report that "without precautions a considerable number of calculations end in local minima", and the analysis in one dimension "generally tended to end in local minima when the initial configuration was chosen randomly." (op.cit.p.103.) Lingoes and Roskam (1971) defined as "major local minimum" a solution deviating more than .025 K* units (cfr. p. 52 ##) from the best obtainable. For 40 random cases (10 points generated in five dimensions and analyzed in two dimensions) there were 45% major local minima (op.cit. Table 4, p. 91). Finally they report that for 40 Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 58 cases generated and analyzed in two dimensions there were 35% major local minima in two dimensions and 20% for the onedimensional solutions (op.cit. Table 8, p. 97). As part of the present simulation studies a set of 25 random configurations of 15 points in 2 dimensions, covering 5 noise levels, were analyzed independently in 1, 2 and 3 dimensions. The following results were obtained: for 3 dimensional solutions l2%, for two- dimensional 40% and for 1 dimensional 84% major local minima. As might be expected local minima were reflected not only in higher stress for non-arbitrary initial configurations, but also in substantially worse true fit. Further aspects of this study which compared the main nonmetric algorithms will be given in Section 4.32. Concerning the results of local minima when arbitrary initial configurations are used, it should be borne in mind that the number of such minima may be reduced by starting in a “too high” dimensionality and going down. For 10 cases of 10 points in two dimensions Lingoes and Roskam (1971, p.93) found no local minimum in two dimensions when they started in 5 dimensions, whereas there were 7 local minima when working directly in two dimensions. A more satisfactory way of avoiding the problem would be to start with an initial configuration directly related to the dissimilarities, that is a non-arbitrary initial configuration. The most elaborate such initial configuration is the procedure in TORSCA (Young, 1968a). Initially dissimilarities13 are treated as distances and converted to scalar products by way of the well known formula presented by Torgerson (1958, p.258) and eigenroots and vectors are computed. This is usually referred to as the Young Householder -Torgerson decomposition. From this configuration Euclidean distances are computed and the best monotonic transformation of the dissimilarities is computed. 
The transformed dissimilarities are again converted to scalar products and the process is repeated until no improvement is possible, this then is the starting point for the nonmetric algorithm. This procedure has the undesirable feature of not being completely independent of metric properties, and “it is possible that a presumption based on the metric of the data may get one into local minimum traps”. (Lingoes and Roskam, 1971, p.128). Furthermore the procedure is very timeconsuming. A much simpler procedure which has some of the same features is first to convert the dissimilarities to rank numbers,14 and then simply treat these rank numbers as distances which are converted to scalar products and factor analyzed. This is equivalent to using only the first step in the iterative procedure for the initial configuration in TORSCA, but in such away that the results are independent of any monotone transformation. The procedure may be called a rank factor analysis approach. Lingoes and Roskam (1971, p.42 - 43) recommend the same procedure except for first squaring the rank numbers. They do not, however, report any results from this procedure. The SSA-1 initial configuration has a related but somewhat more involved rationale. For details on this procedure see Lingoes and Roskam (1971, p.39 - 42). In Section 4.32 there will be no further discussion of Kruskal's algorithm with an arbitrary initial configuration, but the studies to be reported there are relevant to the problem of comparing the merits of the various non- arbitrary initial configurations. The present survey confirms the conclusions by Lingoes and Roskam (1971, p.132) who strongly points out that arbitrary starts in Kruskal’s MDSCAL are to be avoided. 4.32 Comparing, MDSCAL. TORSCA and SSA-1 As argued in Section 2.1 the central issue when comparing various nonmetric algorithms is which one of them gives best true fit. Though in practice an underlying true configuration is not known, it would be reasonable to extrapolate from simulation studies if over a variety of different circumstances one algorithm persistently did better than another. 13 Similarities are inverted, values of 0 are then unacceptable. 14 Similarities are then sorted in descending order, dissimilarities in ascending order. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 59 The different features of algorithms, as outlined in Section 3.1, have no intrinsic psychological content, at least no one has argued for the specific theoretical relevance of e.g. choosing weak vs. strong monotonicity as basic in a given algorithm. The present section may then be said to deal with strictly technical problems, the battle between the algorithms is to be fought on purely pragmatic grounds. There is a widespread feeling that the main nonmetric algorithms largely give if not identical, then for practical purposes close to indistinguishable results. Yet there have been surprisingly few studies systematically comparing algorithms from the point of view of true fit. Young (1970) includes a comparison of' TORSCA with results from MDSCAL presented by Shepard (1966). This comparison only deals with the error-free case and will be commented in Section 4.6. As previously pointed out Lingoes and Roskam (1971) mainly restrict themselves to comparing indices of apparent fit, though they do include one comparison between some algorithms from the point of view of' true fit. 
This comparison, however, is restricted to variants within their hybrid MINI-SSA15 program.

Analysis of order 4 matrices. A priori it seems reasonable that any marked differences between the algorithms would be most likely to show up for small values of n. Partly from this point of view we have been persuaded by the argument made by Lingoes and Roskam (1971, p. 99) that analysis of all distinct order 4 matrices may highlight problems only hinted at in analysis of larger matrices. Analysis of order 4 matrices is facilitated by the fact that there are just 30 distinct order 4 matrices (cfr. the proof op.cit. p. 105-107). Notice, however, that analysis of order 4 matrices is of only marginal relevance from the point of view of our main criterion since for these matrices no external L can be said to exist. Our main concern in this study was to try out the rank factor analysis approach to initial configurations in Kruskal's algorithm and also to compare TORSCA (excluded by Lingoes and Roskam) with the other variants.16 The results are presented in Table 1.

15 For the uninitiated the acronym needs spelling out: Michigan (Lingoes), Israel (Guttmann), Netherlands (Roskam) Integrated Smallest Space Analysis. As recognized by Lingoes and Roskam the acronym ought to include a reference to Kruskal since his block partition approach to monotonicity is one of the options in MINISSA.

16 A minor concern was to check that the program versions presently used were identical to the ones used by Lingoes and Roskam by comparisons where possible. We shall later see that unfortunately there is reason to doubt whether error-free programs always have been used in the published literature.

Table 1. Mean results from analysis of 30 distinct order 4 matrices.

Algorithm   Initial configuration                         S          K*
MDSCAL      Kruskal arbitrary initial configuration     .106 a)       -
MDSCAL      rank factor analysis*                       .077 b)       -
MDSCAL      SSA-1 output                                .060 c)     .108 b)
TORSCA                                                  .064 b)       -
SSA-1                                                   .065 b)     .111 c)
Lingoes-Roskam over all minima**)                       .053 a)     .095 a)

a) Result taken from Lingoes-Roskam (1971, Table 11, p. 108)
b) Own result
c) Result from Lingoes-Roskam confirmed in own study
*) A possible weakness in MDSCAL may be revealed by the fact that occasionally the stress of the initial configuration did not decrease, but dramatically increased. In these cases the stress of the initial configuration was used in the mean. The same phenomenon was occasionally observed in analysis of random sets of 6 points; the same strategy was used then. This phenomenon has, however, never been observed with the thousands of other cases where rank factor analysis has been used in the course of the present investigation.
**) These values represent "the best solutions obtained over all possible algorithms and initial configurations tried in minimizing K* or S" (op.cit. p. 100).

The following conclusions are tentatively offered on the basis of the results presented in Table 1:
1) There is something to be gained by the factor analysis initial configuration compared with Kruskal's arbitrary starts (stress = .077 vs. stress = .106). The other alternatives do, however, appear to be clearly superior since they all give lower stress than .077.
2) If MDSCAL is provided with optimal start configurations (perhaps output from SSA-1?) not very much can be gained by trying a variety of alternatives.
In practice the latter strategy would probably be prohibitive in terms of the time required both to prepare runs and in terms of computer time.

Algorithms compared with respect to true fit. Our next task is now to present results with larger values of n, using TF-categories as the major criterion. We shall then see to what extent the tentative conclusions from the previous analysis must be modified. For each of the conditions (a specific combination of n and t) several random configurations were generated. In each condition different levels of noise were used so that the whole range of TF-category values was observed in each condition. Before discussing the results with noise some comments on results for error-free data are made. 5 sets of error-free dissimilarities were included in each condition. The comments serve to clarify why results from these sets are excluded from the main results.

Remarks on results from error-free data. Inclusion of these cases would have given the TORSCA analysis of M an unfair advantage since then straight Euclidean distances are inputted, and the Young-Householder-Torgerson resolution gives not only a solution with stress = 0, but a solution with perfect true fit. Unfortunately this has not been clear in the literature. Commenting on results from Spaeth and Guthery (1969), Lingoes and Roskam (1971, p. 130) point out that "either the decomposition was not carried out correctly or the TORSCA monotone distance algorithm is capable of messing up a perfect fit". The present results conclusively show that the TORSCA monotone algorithm is working perfectly in all cases of analysis of Euclidean distances. As intended TORSCA then works as a metric method in these cases and the expected perfect true fit has been found. Consequently there must have been some error in the Spaeth-Guthery implementation of TORSCA.

A minor technical point is that generally MDSCAL17 is superior to the other algorithms for error-free but monotonically distorted distances. In such cases TORSCA has in the present work not generally been found to give a solution with stress = 0,18 while this generally is the case for MDSCAL. Likewise SSA-1 rarely ends up with perfect apparent fit; this may, however, be due to one of the criteria for termination in this algorithm - unlike the TORSCA and MDSCAL algorithms, there are no options for adjusting criteria for termination in SSA-1. Stress = 0 for MDSCAL corresponds to slightly improved true fit compared with the other algorithms; this will be further commented on in relation to Figure 6.

Results from data with noise.

Table 2. Mean results of TF-categories for different algorithms and different combinations of n and t. n = 6, t = 1 is based on 35 different random configurations; for the other conditions 20 different random configurations were used. Noise was introduced for all the configurations. The Ramsay-Young error process was used.

Algorithm   Initial configuration       n=6     n=7     n=9     n=9     n=10    n=5
                                        t=1     t=1     t=2     t=3     t=1     t=2
MDSCAL      rank factor analysis       3.125   3.370   2.984   3.570   2.014     -
MDSCAL      SSA-1 output                 -       -     3.059   3.531     -       -
TORSCA*     m_ij                         -       -     2.992   3.510   2.221     -
TORSCA      rank numbers               2.784   2.943   2.967   3.553   2.022     -
TORSCA      m_ij = m²_ij + 10**        3.243   3.269     -       -       -       -
SSA-1                                    -       -     3.000   3.477   2.204   2.140

* For TORSCA using various monotone transformations of the original raw data is equivalent to different initial configurations for other algorithms, cfr. p. 58.
** This specific transformation is included because it was used by Young (1970). From the results presented in Table 2 we see that the tentative conclusions from Table 1 do not have general validity. Though slight, the difference disfavouring MDSCAL for n = 6 and n = 7 is, however, probably significant. In view of this and the poorer performance of MDSCAL for order 4 matrices we will not recommend MDSCAL when n is small. The main simulation studies in Section 4.6 use TORSCA for n = 6 and n = 7, and for other conditions MDSCAL and TORSCA are both used. For the other conditions, however, the use of MDSCAL (with rank factor analysis) does not give noticeably different results from the other algorithms. Notice specifically that the elaborate procedure used in order 4 matrices, to use the output of SSA-1 as the initial configuration for MDSCAL, does not improve the much simpler procedure of rank factor analysis to give the initial configuration. The main conclusion to be gained from Table 2 is that with an initial configuration from rank factor analysis, Kruskal's MDSCAL has not been significantly improved on by any of the more recent algorithms. Specifically concerning TORSCA the results clearly show that rank numbers may just as well be used as input data.. Since otherwise TORSCA and MDSCAL are practically identical, the present results show that the timeconsuming construction of an initial configuration in TORSCA really is superfluous. The studies to be discussed in Section 4.6 and 4.7 are almost exclusively based on MDSCAL. Since there are no clear advantages of other algorithms it is reasonable to make the choice on the basis of computer time and MDSCAL is here far superior to the other algorithms. 17 Referring just to MDSCAL it will further be understood that rank factor analysis has supplied the initial configuration 18 Notice also that in Young (l970, Table 3, p. 466) stress ought to be 0 in the first column since this corresponds to the error-free case, but of the 15 means in this column only one is given as .0000 in the other cases stress varies between .000l and .0117. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 62 MDSCAL versus SSA-1. An unresolved problem. A further analysis of the results of comparing SSA -1 and MDSCAL may yet bring out problems worthy of further study. First we should note that for the 60 comparisons between MDSCAL and SSA -1 the closely similar means reflect results that were highly similar for each single configuration. In 50 of the 60 cases the difference in TF was less than .20, the two largest differences were .38 and .59 probably only the latter difference could be of any importance in applied work. Secondly a detailed analysis of the 40 cases, where both SSA -1 and MDSCAL using the output from SSA -1 as the initial configuration were used does bring out some further problems. In the following discussion it is necessary to keep in mind a distinction made by Lingoes and Roskam (1971). They point out that the goal of minimizing K*(based on rank images) is not necessarily the same goal as minimizing stress (based on block partition values). The first goal is called minimizing goodness of measurement, the second goal minimizing goodness of fit. (The intended connotations of “measurement” vs. “fit” are not made clear). We will just refer to minimizing K* vs. minimizing S, wishing to emphasize that these two approaches may sometimes represent different targets. 
The first point to note is that - perhaps not unexpectedly – SSA -1 does not minimize stress. For n = 9, t = 2 mean stress of SSA-1 output was .1027 while the final MDSCAL output had stress of .0946. For n = 9, t = 3 the comparable results were .0647 vs. .05540. While these differences may not seem dramatic the results are highly significant because in every one of the 40 cases MDSCAL reduced stress. Since the final output from MDSCAL in this case was mostly, (practically) identical to the output from MDSCAL starting with rank factor analysis, it seems reasonable to assume that the MDSCAL/SSA -1 approach succeeded in arriving at a global minimum in what may be called the block partition configuration space. If now minimizing stress had been an optimal goal from the point of view of true fit, we would have expected to find that the MDSCAL/SSA -1 would show better true fit than just the SSA-1 solutions. But this very clearly did not happen, if anything it was rather the case that improving stress made true fit worse. Cfr. 3.059 vs. 3.000 and 3.551 vs. 3.477 in Table 2). The major conclusion to be drawn from this analysis then is: Minimizing stress is not necessarily an optimal goal from the point of view of achieving the best possible true fit. While MDSCAL minimizes stress, the next question is whether on the other hand SSA -1 minimizes K*. If this was the case there would be a basis for asking under what conditions what target is optimal from the point of view of true fit. If now SSA -1 does minimize K* further iterations on the output from SSA -1 can do nothing but increase K*, the iterations will then move the output away from .the global minimum in the “rank image configuration space”. Consequently the next step was to compute K* for the MDSCAL/SSA-1 solutions. The results were both clear and puzzling. For n = 9, t = 3. mean K* for SSA -1 was .0865 for MDSCAL/SSA -1 .0850. For n = 9, t = 2 the results were .1397 vs. .1425 respectively. As indicated by these means it is entirely unpredictable whether MDSCAL will increase K* or not. In 21 cases MDSCAL increased K* and in the other 19 cases it decreased K*. These results clearly indicate that it is not the case that SSA -1 generally minimizes K*. The puzzling problem is then why SSA -1 did not do worse than MDSCAL. SSA -1 minimises neither K* nor S yet it works as well as any other algorithm from the point of view of true fit. The present results do not permit a more detailed analysis of this problem since the differences in true fit are small and one can not expect strict relations between targets for minimizations and true fit. One may however, speculate that improving SSA -1 (by using some variant in MINI-SSA?) will lead to minimization of K* and that there are a large number of cases where this is a better target than minimizing S. Yet another possibility is that there may be still other targets that may profitably be explored from the point of view of true fit. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 63 Conclusions. The present choice of mainly using MDSCAL is not contraindicated by the variety of analyses presented by Lingoes and Roskam (1971). There are three major investigations in that report which are of relevance: a) From a large number of analyses of random cases they conclude that when provided with a good initial configuration "MDSCAL operates as satisfactorily as any other procedure we investigated". (op. cit. p. 98). 
b) Analysis of order 4 matrices have already been discussed, this confirmed a suspicion in our own data that MDSCAL may not be quite satisfactory when n is very small (7 or less). c) In their study of “metricity” (true fit) MDSCAL was not used but one of the variants within MINI-SSA used (op. cit. Table G, p.128) seem to differ only in the choice of initial configuration and step size calculation. The difference between this version and the others which were tested turned out to be negligible. Still there is every reason to reiterate the plea for more research. We would suggest working with a few simple systematic configurations, experimenting with a variety of algorithmic strategies always evaluating any strategy from the point of view of true fit. Such investigations may well give rise to different conclusions than the present general "no difference" verdict. The estimates of TF from stress, n and t to be presented in Section 4.6 can be applied regardless of whether MDSCAL or TORSCA has been used. Even though we have not made any claims for superiority of these methods in relation to SSA -1, it must be emphasized that the estimates can not be applied if SSA - 1 is used. In Section 3.1 we were at some pains to point out that K* and S will give quite different results. To illustrate this we may mention that for order 4 matrices the relation between K* and S is well described by K* = 1.8 x S. For n = 9, t = 2 and 3 the relation was K* = 1.3 x S and for n above 15 the relation is generally K* = 1.2 x S (one may guess that the relation will tend to l as n increases). It would, however, probably be difficult to find a general conversion formula from K* to S. Another approach would be to compute stress of' SSA –1 output directly.) Since this as previously discussed gives higher values of stress without giving worse true fit such a strategy would lead to biased estimates as the estimates of true fit would be worse than really warranted. 4.33 Metric vs. nonmetric methods A major advantage of the nonmetric methods compared with metric methods is that the latter may give too many dimensions and thus give rise to misleading interpretations. This will be the case if a metric interpretation of the data is not appropriate. A favourite example is the study of colour vision by Ekman (1954), where a metric analysis produced five factors. This was a somewhat surprising result in view of classical colour theory. Nonmetric reanalyses of Ekman’s data have consistently given the familiar colour circle, a two-dimensional representation (Shepard 1962b, Kruskal 1964a, Guttmann 1966, Coombs 1964). Torgerson (1965) reviews other studies on colour vision during the 1950’s where metric analyses gave disturbing results and later (Torgerson, 1967) he points to simulation studies showing that if true distances are distorted by some monotone transformation and the resulting “distances” analyzed metrically, too many dimensions will result. This clearly demonstrates the advantage of nonmetric studies if there are nonlinearities in the data. On the other hand one may ask, what if the re are no non- linearities in the data, will metric analysis then give better results than nonmetric? We know that this must be the case if there is no noise, since as repeatedly stated this always implies some (even if slight) distortion when nonmetric methods are used, but perfect true fit for metric methods. But what about cases where there are linear relations between dissimilarities and distances, and also noise? 
Notice that this question is a very good illustration of how irrelevant AF-stress is in answering many questions. A nonmetric analysis will invariably turn out with a solution with lower stress than the metric solution19; it may, however, be the case that the nonmetric algorithm just messes up the best obtainable solution by capitalizing on noise. If the researcher feels "fairly" certain that there are no pronounced nonlinearities in his data he may face a difficult choice: should he use a metric analysis and perhaps gain in precision (if nonmetric methods capitalize on noise and thus distort relative to the metric solution)? But if his presumption of linearity was incorrect he may risk ending up with too many dimensions.

This problem is the point of departure for the simulation studies to be reported here. For n = 20, 20 different random configurations were analyzed for t = 1, 2 and 3, altogether 60 different configurations. In each of the conditions various levels of noise were introduced so as to cover the whole range of values of TF-categories of interest. It will be recalled from Section 3.2 that there may be some (even if slight) nonlinearity when the Ramsay-Young error process is used. In the present study we did not wish to introduce any bias whatever disfavouring the metric approach, so we used the rank image approach to correct for any nonlinearities. The study then also serves the subsidiary purpose of providing more information on the linearity of the relation between L and M for the Ramsay-Young process20. As shown in Table 3 three different versions of M were analysed metrically, first M, then M*L and finally M*G.21

Table 3. Mean results of TF-categories for MDSCAL vs. metric methods. Each mean is based on 20 different random configurations, n = 20. Noise was introduced for all the configurations. The Ramsay-Young error process was used.

                                           t
    Algorithm           What is analyzed   1       2       3
    MDSCAL              M                  2.730   2.535   2.743
    Factor analysis x   M                  3.014   2.999   3.171
    Factor analysis     M*L                2.886   2.913   3.087
    Factor analysis     M*G                2.913   2.885   3.056
    x) Young-Householder-Torgerson decomposition.

The results were quite surprising. It had been anticipated that on "home ground" so to speak the metric approach would surpass the nonmetric approach, but the results speak very clearly otherwise. Looking at the results in more detail the impression from Table 3 is strikingly confirmed. For each of the 60 configurations the true fit of the MDSCAL solution was compared with the best of the three metric solutions. In 8 cases it was possible to find a metric solution which outdid the nonmetric one; the largest of these 8 differences was, however, a trifling .011, a difference hardly likely to make any practical difference. It may be mentioned that for some of the conditions reported in the previous section similar comparisons were made between metric and nonmetric approaches, with the same outcome; the present results are thus not restricted to a relatively high value of n.

19 The present studies partly used MDSCAL with a metric solution as start configuration and stress always decreased. As expected from results reported in the previous section this gave identical results to those obtained when rank factor analysis supplied the initial configuration.

20 Comparing r(L, M) with r(L, M*L) and r(L, M*G) failed to show any case of discrepancy likely to be serious in practice.
21 As expected, and as also implied by Table 3, M*G and M*L were consistently highly interrelated; the correlation between these two vectors never strayed below .99. This shows that while M*L may theoretically be the most desirable transformation of M, M*G, which can always be computed in practice, is a perfectly viable alternative.

These results give a very clear answer to the problem of whether one should use a metric or nonmetric approach if one for some reason is in doubt. Provided there is noise in the data (and who can doubt that for empirical data?) there is nothing to lose by using a nonmetric algorithm; on the contrary, it is highly likely that the results will be substantially better.

Perhaps this unqualified recommendation of nonmetric methods should now be somewhat tempered. Part of the motivation for the study reported in this section was provided by some very important experiments reported by Torgerson (1965). He showed that there were cases where the structure of the data was not revealed, but rather distorted, by using nonmetric models. He used geometric figures which reflected both qualitative and quantitative dimensions. In these cases the nonmetric programs partly distorted the underlying structure "by eliminating the contribution [of qualitative dimensions] entirely and then capitalizing on error" (op. cit. p. 389). Is there then a contradiction between the results reported in this section, which come out clearly in favour of nonmetric methods, and Torgerson's strong tempering of unbridled enthusiasm for nonmetric methods? We do not think so. Spelling this out, however, requires a more synthetic view of types of models for similarities data. This is the topic for Ch. 6.

4.4. Previous simulation studies and two methodological problems. Analytical versus graphical methods. Unrepeated versus repeated designs.

In this section the systematic simulation studies most directly leading up to the main results presented in Section 4.6 will be reviewed. In that section detailed results will be presented, extending the results of the studies dealt with in this section. The present section will mainly focus on two methodological problems.

The first systematic simulation study was done by Shepard (1966). He worked with two-dimensional configurations. For each n (varying from 3 to 45) he constructed 10 random configurations from the Coombs-Kao coordinates. From the results he concludes that (op. cit. p. 299) "while the reconstruction of the configuration can occasionally be quite good for a small number of points, it is apt to be rather poor (for n less than eight say)". For n = 7 minimal r = .919 and root mean square = .980. "As n increases, however, the accuracy of the reconstruction systematically improves until even the worst of the ten solutions becomes quite satisfactory with ten points [min r = .992] and, for all practical purposes, essentially perfect with 15 or more points." We note that Shepard does not commit himself as to how low a correlation must be in order for TF to be unacceptable, but it appears that a correlation of say .90 would be considered unacceptable by him. From the point of view of the metamodel, Shepard's study corresponds to one special case, the case where there is no noise. As previously stated, even though TF will be acceptable in this case there will always be some distortion.
L and M will coincide since there is no noise, while there will be some discrepancy between L and G. The other special case of the metamodel is the analysis of random sets, as briefly discussed in Section 4.1. In practice the intermediate cases will be of greatest interest. The first systematic study of such cases was probably done by Kruskal. "Kruskal (personal communication) has also investigated the robustness of the solution by imposing random deviations on artificially generated data. Generally a moderately high level of added 'noise' can be sustained before the recovered configuration suffers serious deterioration" (Shepard, 1966, p. 308).

The first complete report giving details on the amount of noise and the resulting deterioration is given by Young (1970). He studied five levels of n (6, 8, 10, 15, 30), three levels of t (1, 2, 3) and five levels of E (0, 0.10, 0.20, 0.35, 0.50). The configurations were generated by using partly overlapping subsets of the Coombs-Kao coordinates. This generated configurations which are not independent of each other. Possible consequences of this will be discussed later in this section. For each of the 5 x 3 x 5 = 75 combinations means of TF and AF were computed across 5 different random configurations. TF-correlations were converted to Fisher's z, which were used in regression analysis; mean z values were converted back to squared correlations. Young plotted the relation between (n, AF) and (n, TF) separately for each level of E and t. For TF the main effects were as expected: TF improved with increasing n, deteriorated as E increased, and finally deteriorated as t increased. For AF the results showed the same pattern for E and t as for TF. But increasing n leads to an increase (deterioration) of AF-stress. The fact that increasing n gave opposite results for TF and AF seemed to be a cause of some concern for Young: "if one relies heavily on stress the unfortunate situation exists that as he diligently gathers more and more data about an increasingly larger number of stimuli, he will become less and less confident in the nonmetrically reconstructed configuration, even though it is more accurately describing the structure underlying the data" (op. cit. p. 471). This result is of course exactly as expected from the point of view of the metamodel. For a given NL, G will move back towards L as n increases. Young's results can be taken to show definitely that it is not appropriate to try to interpret stress without regard for the number of points involved. Concerning definition of a lower limit for an acceptable true fit correlation, Young is no more explicit than Shepard. In illustrating his main conclusion, that "nonmetric multidimensional scaling is able to recover the metric information of a data structure, even when the structure contains error" (op. cit. p. 470), he implies that values of his index higher than .80 (which corresponds to correlation higher than .91) are acceptable (op. cit. p. 471). Noting the regularities in his data Young tried in several ways to develop regression equations which would enable one to estimate TF from stress, n and t: "although some of the attempts were able to account for more than 90% of the variance in z, it was not possible to develop a regression equation which permitted reasonable interpolation and extrapolation" (op. cit. p. 412).
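Young's averaging procedure mentioned above – converting each TF correlation to Fisher's z, averaging, and converting the mean back to a squared correlation – is simple to state exactly. The following is a minimal sketch in modern notation; the function name and the example values are illustrative only.

    import numpy as np

    def mean_squared_correlation(rs):
        """Average TF correlations via Fisher's z: r -> z = arctanh(r),
        average the z values, convert the mean back, and report the
        squared correlation."""
        z = np.arctanh(np.asarray(rs, dtype=float))
        r_back = np.tanh(z.mean())
        return r_back ** 2

    # Five replications of one (n, t, E) cell (illustrative values only)
    print(round(mean_squared_correlation([0.998, 0.993, 0.989, 0.995, 0.991]), 4))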
By looking at a previous report (Young, 1968b) we can see why the attempts failed, and this may then suggest a different approach to the estimation of TF. Since TF-z was not linearly related to n, a term for points squared was added. This, however, led to a non-monotone relation between n and TF for a given value of AF. Beyond n = 22 his table showed that TF deteriorates with increasing n for a given stress value. This is inconsistent with Young's main results, which showed improved TF as n increases. A reasonable conclusion to draw from this is that while there is sufficient regularity in the results to justify attempts to estimate TF, this regularity is not easily captured by regression analysis. An alternative is to use graphical methods. While this approach perhaps gives less accuracy than an optimal analytical method, it may be very difficult to find that optimal analytical method. The great advantage of graphical methods is far greater flexibility than the usual analytical methods; there is also a greater closeness to the data than when working with analytical methods. Since we are concerned not only with the relation between AF and TF, but also with the relation between NL and TF, maximal flexibility will be very important. In Section 4.6 a graphical approach will be used. This approach will be evaluated in terms of the conventional criteria: multiple correlation and crossvalidation.

The remaining problem in this section is a discussion of a disturbing failure by Sherman (1970) to replicate some of Young's results. Apparently trivial differences in defining true configurations have been reported to give pronounced discrepancies in the results. These discrepancies will be discussed in terms of the distinction between unrepeated and repeated designs (cfr. p. 107). Sherman used the same basic design22 as Young (he has been one of Young's coworkers). The main difference is that, unlike Young, Sherman generated true configurations which were all completely independent of each other. While there were striking over-all similarities in the results, the discrepancies are important in the present context. There were three types of discrepancies (cfr. Sherman, 1970, Fig. 16 and Fig. 17, pp. 52-53).

a) For a given error proportion Sherman failed to find that stress uniformly increased with n. While not pronounced, this result is quite disturbing from the point of view of the metamodel, which leads us to expect G to move closer back to L with increasing n and thus AF-stress consistently to increase with n (as Young found).

22 Actually Sherman's design was far more extensive than Young's. Using the terminology on p. ## Sherman for p = 2 analyzed each M vector with pl = 1, 2 and 3, and for each value of t (1, 2 and 3) M was analyzed with m = 1 and 3.

b) For given values of (n, t, E) Sherman found both TF and AF clearly worse than Young did. This implies that Sherman's results would give a more pessimistic view than Young's as to the possibilities of recovering true configurations from noisy data.

c) For all plots of (n, AF) and (n, TF) Sherman found more irregular curves than Young.

Sherman attributes all these discrepancies to the differences in the ways the true configurations were generated: "all of our configurations were random and independent. There were overriding dependencies in his (Young's) similarities data which reduced variance and lessened the effect of random error" (op. cit. pp. 51, 54).
It is, however, by no means clear how this could produce all the above-mentioned discrepancies. In terms of the distinction between unrepeated and repeated designs Young's design may be labelled a partially repeated design. Since his configurations were not completely independent his design is not an unrepeated one (as Sherman's clearly is); on the other hand the configurations were not completely dependent (identical), so neither is Young's design a repeated one. We now argue that neither discrepancy a) nor b) above is likely to occur as a consequence of unrepeated vs. repeated designs. If this is the case there is even less reason to ascribe main effects to the distinction between repeated and partially repeated designs, as done by Sherman.

A basic premise in the argument is that the major systematic source of variance in AF and TF is the error proportion E producing between-cell variance. Generally the variance within cells (specific combinations of n, t and E) may be regarded as generated by two minor sources of variance. One such source is that each replication within a cell will give a separate pattern of random error. This source of within-cell variance is common to both unrepeated and repeated designs. The second source of within-cell variance is the influence of different true configurations. This source is of course restricted to unrepeated designs. Sherman's argument may now be restated as a claim that different true configurations not only contribute to within-cell variance but also produce systematic between-cell effects, cfr. a) and b) above. A much more likely possibility is that different true configurations mainly contribute to within-cell variance and also contribute somewhat to unsystematic between-cell variance. This will produce more irregular curves, so Sherman's argument may have some validity for c) above. If it really were the case that different true configurations produced pronounced systematic between-cell variance, then two different repeated designs, each based on replications from just a single configuration, would produce markedly discrepant results. Incidentally, if this were the case any general attempt to predict TF would be doomed to failure.23

The basic design in Section 4.6 is an unrepeated design which gave results closely resembling Young's and thus gives support to the critical comments on Sherman's analysis. Unfortunately, however, it does not explain why he got such deviant results. Hopefully this discussion serves to underscore not only the distinction between repeated and unrepeated designs, but also the wider concern of general replicability of results. Since unexpected error may creep in when complex chains of programs are used (as is necessary when investigating true fit), replicability is a major concern. In the next section the metamodel is put to use and implications are drawn which give a simple framework for assimilating the relations between the main variables and lead directly to the type of analysis done in Section 4.6. To illustrate the general relations we report some of Young's results in detail and compare them with our own to show that what we consider essential features are replicable.

23 By using the technical apparatus of analysis of variance one could get more precise information on the effect of varying true configurations. More complex designs (for instance repeated design within cells, unrepeated between cells) might also be illuminating. We do, however, feel that analysis of variance may not be the optimal tool in an explorative work like the present. The results to be reported in Section 4.6 show overriding consistencies in the results. At present the best strategy seems to be to focus on such consistencies. Later research may then bring out the full complexities.
4.5 Implications from the metamodel.

In the following treatment it is convenient to consider (n, NL) as independent variables and (TF, AF) as dependent variables. Dimensionality will be considered fixed. A schematic representation of implications for two levels of n and two levels of NL is given in Fig. 1.

Fig. 1. Schematic illustration of the relation between (n, NL) and (TF, AF). Note that AF is of the same size in a) and d) and that TF is the same in b) and c).

Fig. 1 illustrates the two sets of main effects to be expected from the metamodel:

a) Increase n. TF decreases (improves) and AF increases ("looks worse"). In other words G moves away from M towards L, that is, purification increases with n.

b) Decrease NL (reduce noise). Both TF and AF decrease (improve).

For a given combination of (n, t), NL, AF and TF are all highly intercorrelated. One might say that for any such combination there is just one degree of freedom. This is a basic fact which makes estimation of TF from AF (or from knowledge of NL) possible. Before proceeding with the joint implications of a) and b) it may be advantageous to present empirical support for these effects. Tables 4 and 5 also serve to confirm the critical analysis of Sherman's failure to replicate Young's findings discussed in the previous section. The only difference between our own procedure and that of Young is the present use of an unrepeated design in contrast to Young's partially repeated design.

Table 4. Stress as a function of error proportion, E, number of points, n, and dimensionality, t. A replication of results reported by Young (1970, Table 3, p. 467).

    n = 10
                   E:   0       .10     .20     .35     .50     means
    t = 1  own        .0000   .0348   .0925   .1746   .2230   .1050
           Young      .0005   .0407   .0895   .1565   .2215   .1017
    t = 2  own        .0004   .0126   .0520   .0733   .1098   .0504
           Young      .0023   .0162   .0455   .0926   .1378   .0591
    t = 3  own        .0006   .0087   .0338   .0591   .0752   .0355
           Young      .0055   .0121   .0336   .0654   .0738   .0381

    n = 15
                   E:   0       .10     .20     .35     .50     means
    t = 1  own        .0004   .0473   .1070   .1822   .2696   .1213
           Young      .0013   .0530   .1099   .1777   .2617   .1208
    t = 2  own        .0007   .0300   .0680   .1394   .1750   .0826
           Young      .0025   .0310   .0741   .1257   .1729   .0813
    t = 3  own        .0015   .0224   .0532   .0927   .1233   .0586
           Young      .0117   .0271   .0591   .1120   .1366   .0673
Table 5. True fit as a function of error proportion, E, number of points, n, and dimensionality, t. A replication of results reported by Young (1970, Table 3, p. 467). The measure of fit is here mean z values (from correlations) converted back to squared correlations.

    n = 10
                   E:   0       .10     .20     .35     .50     means
    t = 1  own        .9958   .9876   .9844   .9011   .9089   .9719
           Young      .9998   .9954   .9826   .9419   .9281   .9903
    t = 2  own        .9968   .9896   .9539   .9117   .7489   .9671
           Young      .9965   .9887   .9567   .8986   .7157   .9643
    t = 3  own        .9910   .9823   .9428   .8037   .6864   .9417
           Young      .9863   .9808   .9342   .8082   .6806   .9340

    n = 15
                   E:   0       .10     .20     .35     .50     means
    t = 1  own        .9986   .9968   .9865   .9646   .9057   .9884
           Young      .9993   .9951   .9870   .9621   .9062   .9918
    t = 2  own        .9996   .9920   .9780   .8961   .8222   .9826
           Young      .9900   .9950   .9744   .9486   .8135   .9833
    t = 3  own        .9988   .9885   .9552   .8772   .7373   .9700
           Young      .9946   .9841   .9625   .8790   .7327   .9588

That the difference in design does not produce marked differences in the results is borne out by inspection of Tables 4 and 5. In each of the tables there are 6 pairs of means, and in each table 3 of the pairs have the higher value in Young's results, 3 in our own results. We now turn to the main effects. Concerning a) we see that as n increases from 10 to 15 TF improves for 1, 2 and 3 dimensional solutions. This is most readily seen by comparing the means. Not all the comparisons for separate noise levels are in the expected direction, but perfect results can not be expected for a quite limited range of n. For stress the higher mean values for n = 15 than for n = 10 are completely supported by the results for separate noise levels both for Young and in the present results. Concerning b), inspection of each separate column verifies this effect in both investigations.

Consider now the joint implications of a) and b) above. From the point of view of TF there must be a compensatory relation between n and NL, since an increase in n can be balanced by an increase in NL. This is illustrated in Fig. 1 b) and c). For TF a convenient way to summarize these relations might then be to draw a set of indifference curves, for instance one such curve for each of the proposed category boundaries in Section 3.4. A set of such curves will be referred to as TF contours from NL. TF contours from NL are presented in Figs. 10, 11 and 12 in Section 4.6. These figures may for instance be useful in estimating TF from a known (or presumed) reliability, or conversely they may be used to estimate the reliability necessary to achieve a desired level of TF. From the point of view of AF there must also be a compensatory relation between n and NL, but opposite of that for TF. An increase in n can be balanced by a decrease in NL, as illustrated in Fig. 1 a) and d). This is perhaps the most dramatic way of underscoring the fact that stress can not be interpreted independently of n (and NL). For a high n and low NL a given stress value may indicate close to perfect fit, while the same stress value for low n and high NL may indicate a practically worthless solution. In practice one will not, as here, consider AF as a "dependent variable", but as a predictor of TF. Just as we can construct TF contours from NL we can construct TF contours from AF-stress. Since NL and AF are highly correlated (for given (n, t) combinations) we would expect these two sets of contours to be similar in appearance. One might indeed say that the two sets of contours are nothing but alternative ways of presenting the constraints among TF, AF and NL. The positive slope in the TF contours constructed from AF-stress indicates that in order to maintain a given TF value, increase in AF must be compensated by increase in n.
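The contours to be presented in Section 4.6 are, in effect, a look-up device: given an observed stress value, n and t, one reads off the estimated TF category. A minimal sketch of that reading-off logic is given below; the boundary values in the code are invented for illustration only – the actual values are the smoothed curves of Figs. 3 to 5.

    import numpy as np

    # Hypothetical boundaries for one dimensionality: stress at each TF-category
    # boundary (rows) tabulated against n (columns).  Real values would be read
    # from the smoothed contours of Figs. 3 to 5, not these made-up numbers.
    ns = np.array([10, 15, 20, 30])
    boundaries = {              # stress at the boundary below each TF category
        1: [0.02, 0.04, 0.06, 0.09],
        2: [0.05, 0.08, 0.11, 0.15],
        3: [0.09, 0.13, 0.17, 0.22],
        4: [0.14, 0.19, 0.24, 0.30],
    }

    def estimate_tf_category(stress, n):
        """Interpolate each contour to the given n and return the first
        category whose boundary the observed stress falls below."""
        for category, curve in boundaries.items():
            if stress <= np.interp(n, ns, curve):
                return category
        return 5                      # worst category

    print(estimate_tf_category(stress=0.12, n=18))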
Since TF contours from AF-stress provide a convenient way of estimating TF from AF and are our answer to the problem of precision, a detailed report of how these contours, cfr. Figs. 3, 4 and 5, actually are constructed will be given in Section 4.6. In principle the TF contours from NL contain information on amount of purification, but in a rather inconvenient form, since in these contours quite different units are used for NL and TF. More convenient – and practically useful – ways of representing amounts of purification will be presented in Section 4.7.

This section will be concluded by some supplementary comments on the metamodel. Generally we have two ordinal relations on the three lines in the metamodel:

TF < NL (purification)
AF < NL (in Section 3.2 we argued that AF-stress < NL-stress (p. 39), and since we have generally found different indices for the same relation to be highly correlated, this inequality should be fairly generally valid).

The metamodel would be much more powerful and simpler to use if these two ordinal relations could be supplemented by linearity. That would imply that G could be represented between L and M on the line LM. Unfortunately there are considerations which rule out such linearity. b) implies that as NL increases both TF and AF increase. As NL moves towards its maximal value (corresponding to r(L, M) = 0) TF will move towards its maximal value. This value must be represented as equal in length to the maximal value for NL, since the maximal value of TF will also be represented by a correlation of 0, r(L, G) = 0. Increasing NL forces G away from L, but since AF also increases, G is simultaneously forced away from L and M and consequently away from the LM line. When maximally distant from M, G will be equally distant from both L and M. Concerning a), one might perhaps think that theoretically it could be possible for TF to decrease with increasing n without simultaneously increasing AF. TF could for instance decrease just by G moving closer to the LM line (if this movement took place along a circular curve AF would stay constant). But if such were the case it would not be possible for G to move arbitrarily close to L as n increases without bounds. This would seem to be an intuitively natural boundary condition. Unlike Sherman (1970), cfr. a) p. 128, we have found no reason to doubt the general validity of a) in our rather extensive simulation studies. It is to be hoped that further research on the metamodel may turn up more powerful general statements than appear possible at present.

4.6 Evaluation of precision. Construction of TF contours from AF-stress

This and the following section will present the material likely to be of most use for empirical analyses making use of multidimensional scaling. Here dimensionality will be taken for granted, and analyzed dimensionality will here be the same as true dimensionality. Since TF contours from NL are tightly interlocked with TF contours from stress, this section also presents TF contours from NL. This serves to bring out the joint (TF, AF, NL) structure and provides an additional approach to validate the procedure here used. As reported in Section 4.5 a replication of some of the results presented by Young (1970) for n = 10 and 15, t = 1, 2 and 3 was done.
Since the results were found satisfactory, cfr. Tables 4 and 5, the TF contours from stress could in principle have been constructed from Young's results, and in fact part of his results were utilized. There were, however, two reasons making additional runs necessary. First, since only mean results were reported by Young, it would not have been possible to check how close estimated and known true fit would be for individual configurations. Second, Young's design did not include sufficiently high noise levels. Additional runs were made for error proportions E = .75, 1.00, 1.25, 1.50 and 2.00. Since Young had no values of n between 15 and 30, several runs were also made for n = 20. Finally, additional runs were made for n = 6 and n = 8. Altogether 450 configurations were run in order to construct the contours. These 450 configurations were distributed as follows in the 5 TF categories: 60, 62, 81, 119 and 128 from "Excellent" to "Unacceptable", that is with a pronounced emphasis on the "Fair" and "Unacceptable" categories. The preliminary replication of Young's n = 6 and n = 7 used TORSCA. All other runs were made with MDSCAL using rank factor analysis to give the initial configuration. An unrepeated design was used. The coordinates for each configuration were selected from a uniform random distribution. The Ramsay-Young error process was used.

For each condition (separate combination of n and t) the first step was to compute the mean of AF and TF over the 5 replicates for each value of E. For each condition a TF-AF curve was plotted. A sample of such curves is presented in Fig. 2.

Fig. 2. Sample of results from simulation studies showing the relation between TF-categories and AF-stress for selected values of n and t. The points represent means.

These curves serve to illustrate that the TF-categories transformation does give satisfactory linear relations with stress (linearity is perhaps even more clearly revealed in Figs. 7 to 9; these figures will be commented on later). The curves also illustrate why conventional linear regression is inadequate: such regression is based on an additive model which requires parallel lines, and the curves are clearly not parallel. One might ask whether there is any specific reason why TF correlation should be transformed to give a linear relation with stress, and the answer is the pragmatic one that linear relations are easier to work with. Values of AF for each TF category boundary were read from the (TF, AF) plots. In this process irregularities in the curves were smoothed out, and this smoothing was simplified by the fact that the curves were basically linear. For each dimensionality a preliminary table from the smoothed (TF, AF) curves was made. Each column of such a table represented a specific TF-category boundary and contained the corresponding stress values for the different levels of n included in the study. To a minor extent these tables were filled out from Young's results; where both the present simulation studies and Young's results gave values for the same cells, the results showed remarkably close correspondence. A plot of a column in such a table is then a TF contour, cfr. Figs. 3-5. Actually the plots from the preliminary tables did not produce quite so smooth curves as in these figures. Before arriving at the final TF contours some additional smoothing was found necessary; this was again checked back with the original (TF, AF) plots.
It was hoped that the smoothing processes would reduce noise in the curves. As may be seen from Fig. 2, however, the amount of smoothing required was not extensive.

Fig. 3. TF contours from AF-stress for 1 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.

Fig. 4. TF contours from AF-stress for 2 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.

Fig. 5. TF contours from AF-stress for 3 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of AF and n. Also included is a curve showing the 5% significance level.

Fig. 6. Relation between TF (expressed as categories and correlations) and n for configurations in 1, 2 and 3 dimensions when stress = 0.

Fig. 7. Relations between AF-stress and TF-categories for 7, 12 and 25 points and crossvalidation results for 1 dimensional configurations.

Fig. 8. Relations between AF-stress and TF-categories for 9, 12 and 25 points and crossvalidation results for 2 dimensional configurations.

Fig. 9. Relations between AF-stress and TF-categories for 9, 12 and 25 points and crossvalidation results for 3 dimensional configurations.

Points from the curves in Fig. 6 are read off to provide the values at the TF-categories for stress = 0 in the (TF, AF) curves in Figs. 7 to 9. The curves in Fig. 6 were constructed in the same way as the TF contours from AF. First preliminary values for the curves were read off from the original (TF, AF) curves as presented in Fig. 2. A similar smoothing process was used as that described for constructing TF contours from AF-stress. A minor technical comment may be made concerning the results presented in Fig. 6. Young (1970, p. 466) stated that MDSCAL and TORSCA are equally adept at recovering the underlying true configuration in the errorless case. A closer look at the curve for t = 2 in Fig. 6 may lead to some qualification of that statement, at least for higher values of n. For the n = 30, t = 2 condition Young found for instance root mean square correlation24 .999914 vs. Shepard .999998. In terms of the TF-categories scale this difference is not as minor as it appears at first sight; it reflects roughly .4 units in the TF-categories scale. The difference for n = 15, Young .99951 vs. Shepard .99991, corresponds to roughly .2 units in the TF-categories scale. These differences do represent a slight, but nevertheless clearly discernible, superiority of MDSCAL in the errorless case, a point also mentioned in the comparison of algorithms in Section 4.32.
24 In the errorless case and for high n different random configurations give remarkably similar TF-correlations, so that the way these correlations are averaged is unimportant.

The curves in Fig. 6 are, as previously mentioned, somewhat influenced by also using results from Young. Generally, however, they give a picture of recovery falling between the results of Young and Shepard. That the present results verify the superiority of MDSCAL for error-free data may be discerned in Fig. 8, where the crossvalidation results both for n = 12 and n = 25 reveal that the points for stress = 0 read from Fig. 6 fall short of the precision in the crossvalidation results. While more extensive investigations would probably lead to revision of the curves in Fig. 6, it is hard to see that these revisions would have much, if any, practical significance.

Turning now to the other special case, the analysis of random sets, there are as previously mentioned several studies reported in the literature. Unfortunately, none of the studies were sufficiently complete to be directly included in Figs. 3 to 5. The study by Klahr (1969) did not include values of n beyond 16; Wagenaar and Padmos (1971) stopped at n = 12. On the other hand Stenson and Knoll (1969) chose to compute averages from just three cases; this could only suggest rough rules of thumb for testing statistical significance. The first step was then to run 50 random sets with 30 points in 1, 2 and 3 dimensions. Especially for 1 dimension there was a clear difference, as the present results gave a mean of .515, while the value from Stenson and Knoll (1969, Fig. 1, p. 123) was .53. In view of the small standard deviation this was judged to be a significant difference. A possible explanation could be that the present approach, using rank factor analysis to give the initial configuration, avoided some local minima. Consequently, 50 sets of random data for n = 6, 8, 10, 12, 16, 20, 30 were analyzed in 1, 2 and 3 dimensions. Especially for 1 dimension the present results consistently gave stricter values of stress than previous studies. While for instance Wagenaar and Padmos for n = 12 report the 5% value as .395, the present results gave .37. Using previously published results might give too many false rejections of the null hypothesis; the present results are somewhat more exacting. Plotting the results from my own studies did not give quite as smooth curves as those presented in Figs. 3 to 5, and some smoothing was done.

We may now see quite simply how the analysis of random sets is a limiting case of varying noise levels. The points with maximal stress on the curves in Fig. 2 correspond to points on the random set contours in Figs. 3 to 5. Turning forward to Fig. 13 may perhaps serve to clarify this even further. The terminal points on the (NL, AF) curves are directly read off from the random set contours. In Fig. 5 we may for instance see that for n = 10 the expected stress value for random sets is .10. In Fig. 13 we then see that when NL is maximal this value is used as the last point on the corresponding (NL, AF) curve. From Figs. 3 to 5 it is now easy to see what practical difference it makes to use the conventional 5% limit vs. the cutoff point of TF = 4 arrived at in Section 3.4. For 1, 2 and 3 dimensional configurations the 5% criterion is stricter for n less than 12, whereas the reverse is the case for n greater than 12.
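The 5% significance curves included in Figs. 3 to 5 rest on the stress distributions of such random sets. Once a batch of null stress values has been computed for a given (n, t) – by whatever scaling program is in use – the cutoff itself is just a lower percentile, as the following minimal sketch indicates; the simulated values in the example are invented and only stand in for actual random-set runs.

    import numpy as np

    def five_percent_cutoff(null_stress_values):
        """Given stress values obtained by scaling many sets of random
        (structureless) data for a fixed (n, t), return the 5th percentile.
        An observed stress below this cutoff would be called significant
        at the conventional 5% level."""
        return np.percentile(np.asarray(null_stress_values, dtype=float), 5)

    # Illustrative only: pretend these are 50 stress values from random sets.
    rng = np.random.default_rng(1)
    simulated = rng.normal(loc=0.40, scale=0.02, size=50)
    print(round(five_percent_cutoff(simulated), 3))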
We shall later see that the confidence one can have in estimates of TF for n less than 12 is unfortunately not completely satisfactory, so we do not have any definite advice as to what criterion to use in such cases. Leaving out marginal cases for small values of n, there are strong arguments to be made for the present approach as a better alternative than conventional hypothesis testing. A basic critique against the customarily reported "p-value" is that it says nothing about what should be of major concern, namely the strength of the relationship, or in the present analysis the amount of structure. The criticism directed at conventional hypothesis testing by Edwards et al. (1963), and more generally discussed by Bakan (1966), is highly relevant in the present context. The reader who clings to the magical number 5 should rest assured when he convinces himself from Figs. 3 to 5 that the presently proposed cutoff point is clearly stricter when n is greater than 12. Our advice would be to look at Figs. 2 and 3 in Section 3.4; the user may then decide for himself whether he will accept the presently proposed criterion, or whether (as might not be unlikely in many cases) he may prefer an even stricter criterion before he looks further at his configuration. It is to be hoped that the present approach will be seen in the much more fruitful context of estimating strength of relation (here amount of structure) rather than the sterile emphasis on just achieving statistical significance. One purpose of the approach in the present work is to help steer work in multidimensional scaling away from the trap of overemphasizing statistical significance testing, which probably has done much to make psychological research as sterile as it often is. As Bakan (1966) points out, one reason for the popularity of statistical tests of significance is that it relieves the researcher of making explicit the basis for his decisions. We join Bakan in trying to restore the responsibility of the researcher for his decisions and hope that the present approach may contribute in that direction.

Turning now to the evaluation of the present procedure for estimating true fit, it will be most convenient first to present graphically the results from a crossvalidation study, then to bring out the interlocking structure (TF, NL, AF) by also bringing in the TF from NL contours, and finally to present a numerical evaluation of the present procedure. Since some amount of smoothing was involved in the construction of the TF contours from AF-stress, there is an undefined number of degrees of freedom in the procedure, and also the possibility that the smoothing mainly uses features unique to the results from the 450 configurations and that extrapolation to other values of n would give substantially different results. Consequently crossvalidation is very important. Will estimates of true fit for values of n not included in the original study be as good as estimates in the original study? The original study used values of n = 6, 10, 15, 20 and 30. For the crossvalidation study n = 7 for 1 dimensional configurations, n = 9 for 2 and 3 dimensional configurations, and n = 12 and n = 25 for 1, 2 and 3 dimensional configurations were deemed adequate values25. For each condition 5 levels of E were used, and 5 replications were run for each (n, t, E) combination, making a total of 225 configurations in the crossvalidation study.
For 1 dimensional configurations E = 0, .20, .50, 1.00, 1.50 and for 2 and 3 dimensional configurations E = 0, .20, .35, .50, 1.00. From the TF contours from AF-stress the relevant functions relating TF-categories and AF-stress were constructed. This step has already been specified as necessary to arrive at more precise estimates of TF, cfr. Figs. 7 to 9. If the procedure really does permit interpolation, then the results from the crossvalidation study should fall on the (TF, AF) curve constructed on the basis of Figs. 3 to 5. In each of Figs. 7 to 9 a circle represents the joint (TF, AF) mean from 5 crossvalidation configurations. Generally the correspondence of these means with the independently constructed (TF, AF) curves will be seen to be excellent.

When constructing TF contours from NL the first step was to find a transformation of r(L, M) which was linearly related to TF-categories. It was found that

NL1 = √(1 − r(L, M)²)

was a good transformation for this purpose. The steps in constructing the TF contours from NL exactly paralleled those for TF contours from AF-stress and will not be further discussed. The resulting contours are shown in Figs. 10 to 12.

25 For the results for 6 points in 2 and 3 dimensions it was not possible to estimate TF from AF. We recommend n = 8 as the lowest value to use for dimensionality greater than 1.

Fig. 10. TF contours from NL for 1 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.

Fig. 11. TF contours from NL for 2 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.

Fig. 12. TF contours from NL for 3 dimensional configurations. Each curve shows a TF category boundary (contour) as a function of NL and n.

Since by equation (2) in Ch. 3 r(L, M)² = r(M1, M2), we then get

r(M1, M2) = 1 − NL1².

For instance NL1 = .20 corresponds to r(M1, M2) = .96. For convenience both NL1 and r(M1, M2) are given in Figs. 10 to 12. This permits a simple way of estimating TF from knowing e.g. test-retest reliability. Notice for instance that for high values of n adequate levels of TF should result for surprisingly low values of r(M1, M2). If for instance n = 30 and t = 1, one should get a very good true fit even with a reliability of, say, .68. The relation between true fit and reliability will be further pursued in the next section.

In the present context we wish to emphasize that one may combine the results from TF contours from AF with results from TF contours from NL to get results on (NL, AF) curves. These curves should be linear, as there is linearity both in the (TF, AF) curves and in the (TF, NL) curves. To give a concrete example, consider the condition n = 15, t = 1 and take the contour TF = 2. We then get:

AF = .205 from Fig. 3
NL = .47 from Fig. 10

and then the point (.47, .205) is located on the corresponding (NL, AF) curve from the original data in Fig. 13.

Fig. 13. Some comparisons between the relation of AF and NL based on Figs. 3-5 and Figs. 10-12, and the relation of AF and NL in the original results.
Notice both the linearity and the generally excellent fit this process gives on the representative curves in Fig. 13. The concordance in Fig. 13 provides a very good overall check that the smoothing involved in constructing both sets of TF contours has not removed us appreciably from the original data. Fig. 13 also serves to bring out a difference between the present estimation process and usual regression techniques. In usual regression techniques there is an asymmetry in that there are "two regression lines". There is, however, no asymmetry in the present procedure; the results presented in Fig. 13 imply that regardless of what is regarded as predictor and what is regarded as the variable to be predicted, the present procedure will give the same results.

One drawback of using a graphical procedure is that the numerical evaluation of the estimation procedure is a bit tedious. The following steps are necessary: From the TF contours from AF-stress construct (TF, AF) curves for all relevant conditions. Entering such a curve with an observed value of AF one then reads off from the (TF, AF) curve the estimated value of TF. This will subsequently be referred to as TF|AF. An exactly parallel procedure is used to estimate TF from NL; these estimates will be referred to as TF|NL. A necessary condition for the present procedure to be valid is that there is a fairly high correspondence between these estimates of TF and the known true fit. Since in many cases one will just be interested in the extent to which the present procedure gives the correct TF category, we first present the % concordance for each condition.

Table 6. Percent of cases in each condition where the estimation procedure gives correct category placement (concordance).
TF : short for TF category
TF|AF : TF categories estimated graphically from AF-stress
TF|NL : TF categories estimated graphically from NL
N : number of configurations in a given condition

    a) Original study (450 configurations)

    n :  6   8  10  10  10  15  15  15  20  20  20  30  30  30
    t :  1   2   1   2   3   1   2   3   1   2   3   1   2   3
    N : 40  20  50  25  40  50  40  40  30  35  20  20  20  20

    TF, TF|AF :    55  75  70  72  68  86  85  83 100  80  85  80  90  90
    TF, TF|NL :    70  95  76  80  90
    TF|AF, TF|NL : 70  70  82  85  70  92  93  85 100  86  90  75  95  85  90  93  79 100  90  85  95 100  95

    b) Crossvalidation study (225 configurations)

    n :  7   9   9  12  12  12  25  25  25
    t :  1   2   3   1   2   3   1   2   3
    N : 25  25  25  25  25  25  25  25  25

    TF, TF|AF :    60  76  44  80  84  76  92  84  84
    TF, TF|NL :    72  80  72  88  72  88  96  84  92
    TF|AF, TF|NL : 68  88  56  92  72  80  92  88  76

The first thing to notice from Table 6 is that there are marked differences between the results for n less than, say, 12 and for n greater than or equal to 12. This is especially pronounced in the column of most importance in the present context, the concordance between TF and TF|AF. For n less than 12 the percent never goes above 80, whereas for n greater than 12 there is just one stray condition where the percent drops to 76 (n = 12, t = 3 in Table 6b). Consequently summary statistics will be presented separately for these two regions of n. These summary statistics include linear correlation, and since TF, TF|AF and TF|NL are all expressed in the same units (the TF-categories scale), the root mean square discrepancy between each of the three pairs is also a relevant statistic.
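These three summary statistics – linear correlation, root mean square discrepancy on the TF-categories scale, and percent concordance of category placement – are straightforward to compute once known and estimated values are collected. A minimal sketch follows; the arrays are illustrative and are not data from the present study, and taking the nearest integer as the category placement is a simplification of the graphical reading used here.

    import numpy as np

    def evaluate_estimates(tf_known, tf_estimated):
        """Compare known TF-category values with estimated ones:
        linear correlation, root mean square discrepancy (both on the
        TF-categories scale) and percent concordance of category placement
        (nearest integer taken as the category, for illustration only)."""
        known = np.asarray(tf_known, dtype=float)
        est = np.asarray(tf_estimated, dtype=float)
        correlation = np.corrcoef(known, est)[0, 1]
        rms = np.sqrt(np.mean((known - est) ** 2))
        concordance = 100.0 * np.mean(np.round(known) == np.round(est))
        return correlation, rms, concordance

    # Illustrative values only.
    tf = [1.2, 2.1, 3.4, 4.0, 2.8, 1.6]
    tf_from_af = [1.5, 2.0, 3.0, 4.3, 2.6, 2.1]
    print(evaluate_estimates(tf, tf_from_af))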
Table 7. Summary of results from a) original study and b) crossvalidation study.
TF : TF-categories
TF|AF : TF-categories estimated graphically from AF-stress
TF|NL : TF-categories estimated graphically from NL

    Type of comparison:          TF, TF|AF        TF, TF|NL        TF|AF, TF|NL
                                  a)     b)        a)     b)        a)     b)
    Correlation
      n < 12                    .927   .920      .961   .975      .956   .949
      n ≥ 12                    .984   .983      .991   .991      .993   .991
      Total                     .965   .967      .989   .986      .980   .980
    Root mean square discrepancy
      n < 12                     .51    .52       .36    .30       .39    .41
      n ≥ 12                     .28    .30       .22    .25       .17    .20
      Total                      .39    .39       .29    .27       .28    .29
    Percent concordance
      n < 12                     .67    .56       .81    .71       .91    .93
      n ≥ 12                     .86    .83       .75    .90       .83    .85
      Total                      .79    .74       .75    .87       .86    .79
    Number of configurations
      n < 12                     175     75
      n ≥ 12                     275    150
      Total                      450    225

These results seem highly encouraging. Notice that the correlations are analogous to multiple correlations. The graphical procedure used to estimate TF from AF implicitly takes account of n and t. Young (1970, p. 472) reported that numerous attempts at multiple regression generally accounted for 75 to 85% of the variance, occasionally more than 90%. In a previous report (Young, 1968, p. 22) he reported a maximal multiple correlation of .965; by a curious coincidence this is exactly the value found in the original study done here. In view of our previous discussion of unrepeated designs (as the present) vs. partially repeated designs (as Young's) this is especially encouraging, since that analysis would lead us to suspect more irregularity in the present type of design. That this does not seem to be the case would then indicate that the effect of completely different configurations is not very marked. This should be one basis for ascribing some generality to the present results. One might perhaps object that comparison with Young's maximal multiple correlation is somewhat unfair, since in the present investigation a larger range of values was observed (E covered a much larger range). Restricted range (as in Young's study) might reduce correlation. This, however, is probably balanced by a tendency observed here for the standard error of estimates to increase somewhat for the higher values of TF-categories. Furthermore the root mean square discrepancy and % concordance, which are not influenced by range in the same way as correlation is, bear out the present favourable results.

Comparing the original and crossvalidational results confirms the graphical correspondence displayed by Figs. 7 to 9. Both for correlation and root mean square discrepancy there are no differences in the results.26 The simplest way to summarize the confidence one can have in the estimates of TF would be to say that, provided n is larger than 12, the odds in favour of the present procedure leading to the correct TF category are better than 4 to 1. Yet of course there is room for improvement and especially refinement. The most important task would be to try to replace the present graphical procedure by analytical methods. This would imply finding analytical expressions for the coefficients of the linear equations in the (TF, AF) curves, cfr. Figs. 7 to 9. These coefficients would have to be expressed as functions of n and t. Provided such analytical expressions could be found it would be a very simple task to plug an estimate of TF into MDSCAL.
The present procedure could be improved further by finding a more elegant transformation of r than TF-categories. After first transforming r to give K, cfr. equation (8) in Ch. 3, it will be recalled from Section 3.4 that K was further transformed by a broken curve composed of three linear segments. A smooth transformation would be aesthetically more satisfying and perhaps also lead to some improvement. A somewhat puzzling aspect of the present study is that there is a perhaps slight but highly consistent tendency for estimates of TF from stress to be poorer than estimates of TF from NL. The use of for instance K* (based on rank images) would hardly improve the estimates, as in numerous conditions K* and S were correlated and the correlation stayed well above .99. It may be the case – as implied by the discussion of the fine grain of the results in Section 4.32 – that minimizing S is not generally an optimal target from the point of view of true fit and that this is somehow related to the poorer performance of stress than of NL as a predictor variable. The major conclusion of the present section, however, should be to focus on the fact that, largely inspired by Young, it has been possible to reach a goal which he first formulated. A major further goal should probably at first not be refinement, but rather extension. The reader may wish to review Section 4.2 to bring into focus that we have explored only a very simple case of the general possibilities. A major task would be to explore the more complex cases to provide a better basis for applying the present results to the variety of empirical studies using multidimensional scaling.

4.7 Evaluation of dimensionality and applicability. Application of the extended form of the method

In order to provide information on dimensionality and applicability the extended form of the metamodel, cfr. Section 2.2, must be applied. Since the design used in this section has a fairly complex structure, it will be convenient first to have a general outline of the design. After the presentation of this outline, implications of the extended form of the metamodel will be schematically represented. The representation will then serve as a framework for discussion of results from two simulation studies.

26 One might notice a small difference in the present concordances. This is probably a consequence of different distributions of TF-categories in the original and crossvalidational study. A closer investigation of such detail would at present add little or nothing to the overall picture. Further information on crossvalidation is included in Section 4.7.

Fig. 14. Schematic representation of the design used for the extended form of the metamodel. t – true dimensionality; m, m1 and m2 – analyzed dimensionality. Mij and Gij : empirical correlations (r(Mi, Mj) and r(Gi, Gj)). NL and TF : theoretical correlations (r(L, M) and r(L, G)). Superscripts refer to analyzed dimensionality. With only two parallel forms there will be values only in the cells marked with *.

There are two basic features of the design used in the present section. The first is that for each true configuration, L, several vectors, M1, M2, ..., Ms, are generated. Each M vector is generated by a different stream parameter for the random process. In the present studies these M vectors are generated from different noise levels, E1, E2, ..., Ene; furthermore there are a number of replications, rep., for each noise level, so that the total number, s, of M vectors for a given L is s = ne x rep.
This corresponds to a far more extensive design than will usually be the case in practice, when there will usually be only two parallel forms, M1 and M2. In contrast to a single retest reliability, r(M1, M2), the present design generates a correlation matrix, designated Mij in Fig. 14. The design used here permits a detailed study of a single configuration, and corresponds to what in Section 4.2 was labelled a repeated design. Notice that the present design may have a parallel in empirical research if one is willing to assume that s individuals for e.g. a set of physical stimuli have the same underlying structure. Since the correlation between any two M vectors can be observed in empirical research, the Mij correlations are here called empirical correlations. This is in contrast to the correlations between L and M, which usually can not be observed in empirical research (barring the special case of a completely specified hypothesis, cfr. p. 20), and consequently these correlations are here called theoretical correlations.

The second basic feature of the design is that each configuration is analyzed in several dimensionalities, m = 1, 2 and 3. Each value of m generates a set of G vectors, and for each such value an inter-correlation matrix can be computed, cfr. the Gij matrices in Fig. 14. Correlations between two G vectors can also be observed in empirical research and are also called empirical correlations here. These correlations contrast with the correlations between L and G (TF), the latter being labelled theoretical correlations. The bidiagonals (Mi, Gi^m) correspond to AF, here indexed by stress. One major task of this section is to show how comparison of Gij values for the same pair (i, j) but for different values of m can solve the problem of dimensionality, together with information from the corresponding AF values. The final task is then to show how comparison of the appropriate Gij^t with the corresponding Mij value can throw light on applicability. A schematic representation of implications from the metamodel, both for theoretical and empirical correlations, is presented in Fig. 15.

Fig. 15. Application of the extended form of the metamodel when the analysis is done in varying dimensionalities for a given true dimensionality, t. Schematic illustration of expected relative size of Theoretical Purification, TP, based on theoretical correlations, and Empirical Purification, EP, based on empirical correlations. NL|Mij = r(Mi, Mj) converted to the TF-categories scale. TF|Gij = r(Gi, Gj) converted to the TF-categories scale. Superscripts refer to analyzed dimensionality. Hatched lines signify that there are no unequivocal relations to be expected between the position of G relative to M and L.

Fig. 15 illustrates most of the implications from the extended form of the metamodel which will be tested in this section. These implications have all been more or less directly discussed in Ch. 2.

Implications for theoretical correlations

a) TF^t < TF^(m≠t)
A basic premise in the metamodel is that the highest purification (lowest TF value) will result when the analysis is guided by the correct assumption as to the form of L.
This implies that the best TF value will be found when analyzed dimensionality, m, is equal to true dimensionality, t.

b) TF^t < NL

This is nothing but a restatement of the thesis that theoretical purification (TP = NL - TF^t) does occur, cfr. Section 2.1, and as stated above TP is highest in the correct dimensionality.

c) TF^(m<t) > NL

If the analysis is done in too low dimensionality there is an insufficient number of degrees of freedom to represent the information in L, consequently part of the structure will be lost and we will expect negative values of TP, that is theoretical distortion. In Fig. 15 this is represented by for instance G1 for t = 2 being further removed from L than M is.

Implications for empirical correlations

a1) TF|Gij^t < TF|Gij^(m≠t)

a1) is proposed as a basic rule for finding the correct dimensionality and states that (since the lower the TF value the closer the correspondence) the correct dimensionality is found by simply picking the value of t where the correlation between the corresponding G vectors is highest. a1) is just a restatement of the selection rule stated on p. 22.

b1) TF|Gij^t < NL|Mij

This is just a restatement of the thesis that empirical purification (EP = NL|Mij - TF|Gij^t) does occur. Since NL|Mij is the same regardless of m, the inequality a1) implies that EP will be highest in the correct dimensionality. Negative values of EP denote empirical distortion.

Note the close parallelism between a) and a1), likewise for b) and b1). This parallelism is related to the equivalence between EP and TP stated below. It is, however, difficult to state whether we should expect an inequality

c1) TF|Gij^(m<t) > NL|Mij

parallel to c), that is whether analysis in too low dimensionality will result in empirical distortion or not. If for instance two M vectors generated from a two-dimensional configuration are analyzed in 1 dimension it could be the case that the analysis happened to come out with more or less the same one-dimensional configuration in both cases, and it would then not necessarily be the case that r(Gi^1, Gj^1) would be less than r(Mi, Mj) (if this were the case we would per definition have empirical distortion). This uncertainty is represented by hatched lines in Fig. 15.

Relation between empirical and theoretical correlations, equivalence between EP and TP

Perhaps the most central thesis in this work is the thesis of equivalence between TP and EP (assuming correct dimensionality). This thesis was first stated on p. 21 and then given more precise definition on p. 37 in the discussion of indices for NL in Section 3.2 and is for convenience repeated here:

NL equation:  r(L, M) = [r(L, Mi) · r(L, Mj)]^1/2 = [r(Mi, Mj)]^1/2
TF equation:  r(L, G^t) = [r(L, Gi^t) · r(L, Gj^t)]^1/2 = [r(Gi^t, Gj^t)]^1/2

TP is the difference between corresponding terms of the NL and the TF equation, and EP the corresponding difference between their rightmost (empirical) terms, it being understood that TP and EP are both expressed in TF-categories. The equivalence stated above will be relevant to the problem of applicability.

Two simulation studies have been done to check the implications of the extended form of the metamodel. One study deals with systematic configurations, the other with random configurations. Before presenting the results both studies are described.

Description of simulation studies.

Unless otherwise specified results are reported separately for each configuration for systematic configurations. Each configuration is then a separate condition.
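Parenthetically, the one-factor (Spearman) structure underlying the NL and TF equations above is easy to illustrate numerically: with independent noise on two parallel forms the empirical correlation r(Mi, Mj) is reproduced by the product r(L, Mi) · r(L, Mj), i.e. r(L, M) is approximately the square root of the retest correlation. The sketch below is a minimal illustration under assumed additive Gaussian noise; the particular values of n and the noise magnitudes are invented and do not reproduce the R-Y or W-P noise processes used in the studies described here and below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent distance vector L for n = 15 points (105 pairwise distances).
n_pairs = 15 * 14 // 2
L = rng.uniform(1.0, 10.0, size=n_pairs)

def noisy_form(L, noise_sd, rng):
    """One 'parallel form' M: the latent distances plus independent error
    (a simplified stand-in for the R-Y or W-P noise processes)."""
    return L + rng.normal(0.0, noise_sd, size=L.shape)

M_i = noisy_form(L, 1.5, rng)
M_j = noisy_form(L, 1.5, rng)

r = lambda a, b: np.corrcoef(a, b)[0, 1]

# One-factor (Spearman) structure: the empirical correlation should approximately
# equal the product of the two theoretical correlations,
# r(Mi, Mj) ~ r(L, Mi) * r(L, Mj), i.e. r(L, M) ~ sqrt(r(Mi, Mj)).
print(r(M_i, M_j), r(L, M_i) * r(L, M_j))
```

The two printed numbers agree up to sampling error, which is the sense in which the theoretical correlation can be estimated from a retest correlation.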
For random configurations a condition includes results for 5 different true configurations. There are 8 conditions for systematic configurations; 4 different true configurations, all with n = 12, are analyzed separately with two different error procedures, the Ramsay-Young process (R-Y) and the Wagenaar-Padmos procedure (W-P), cfr. Section 3.2 for a description of these noise procedures. The 4 configurations are:

t = 1: i) a line with equal distance between neighbouring points; ii) a line where the distance between neighbouring points successively increases.
t = 2: i) a circular configuration; ii) a lattice configuration.

This study represents an example of the strategy described in the discussion of "complex qualitative variables" in Section 4.2, to study a few examples differing in many respects. There are 6 conditions for random configurations, only the R-Y error process is used. For both n = 15 and n = 20, 1, 2 and 3 dimensional configurations were analyzed. A specific feature of this study is that there is no trace of interval assumptions of the elements in the M vectors since all results are based on rank image transformations of M, cfr. p. 38.27 This study is analyzed in more detail than systematic configurations.

27 Actually rank image transformations, based on G, were computed for all values of m. There was a slight but fairly consistent tendency for the Mij intercorrelations to be highest when the correct dimensionality was used. This may provide a further approach to dimensionality, but this possibility has not been explored in any detail. Generally the results were fairly similar for all values of m, and for m = t the results were as expected very close to the Mij correlations. In practice dimensionality may first be determined by inequality a1); it is then sufficient to compute the rank image transformation for t, and only these results will be reported. In Figs. 14 and 15 one may replace the expressions with M with corresponding expressions for M*, cfr. also the survey of terminology in Table 8.

Further features of the two studies are outlined below:

                                                          Systematic config.    Random config.
ne - nr. of noise levels (E)                                      4                   3
values of E: R-Y                                           .0 .20 .35 .50        .10 .35 .50
values of E: W-P                                           .10 .20 .35 .40            -
rep - nr. of replications for each value of E                     3                   2
s = ne x rep - nr. of M vectors for each L                       12                   6
con - nr. of configurations in each condition                     1                   5
N(T) = con x s - nr. of observations in means for
theoretical correlations                                         12                  30
N(E) = con x s(s-1)/2 - nr. of observations in means for
empirical correlations                                           66                  75

Both these studies provide further crossvalidation results for the contours discussed in the previous section; this will be separately discussed at the end of the section. The results will be reported in four steps. First, results based on theoretical correlations are reported. Second, results based on empirical correlations and TF estimated from stress are reported. These two latter types of results provide the opportunity to see how well the present approach contributes to solving the problem of dimensionality. Third, results are presented on the relation between empirical and theoretical purification, the latter based on results presented in Section 4.6. This illustrates our approach to applicability. Finally crossvalidation results are summarized and compared to the results presented in Section 4.6. All results are presented in the TF-categories scale.
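As a concrete illustration of this design, the sketch below generates s = ne x rep parallel forms from a single latent configuration and assembles the Mij matrix of Fig. 14. The additive noise model and the particular noise magnitudes are illustrative stand-ins only; they are not the R-Y or W-P procedures themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20                               # number of points
n_pairs = n * (n - 1) // 2
L = rng.uniform(1.0, 10.0, n_pairs)  # hypothetical latent distance vector

noise_sd = {"E1": 0.5, "E2": 1.75, "E3": 2.5}   # illustrative stand-ins for E = .10, .35, .50
rep = 2

# s = ne x rep parallel forms M, each generated by its own noise stream
M = [L + rng.normal(0.0, sd, n_pairs) for sd in noise_sd.values() for _ in range(rep)]
s = len(M)                                       # = ne x rep = 6

# The Mij matrix of Fig. 14: empirical correlations between all pairs of forms
M_ij = np.corrcoef(np.vstack(M))

N_E = s * (s - 1) // 2     # nr. of i < j empirical correlations per configuration
print(s, N_E)              # 6, 15  (x 5 configurations gives the 75 of the random study)
print(np.round(M_ij, 2))
```

The design quantities s, N(T) and N(E) in the outline above follow directly from such a scheme once the number of configurations per condition is fixed.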
For convenience Table 8 summarizes the main symbols used.

Table 8. Survey of terminology.

What is converted                                              Expressed in TF-categories
Theoretical correlations:
  r(L, M)                                                      NL
  r(L, M*^t_G)                                                 NL|M*^t_G
  r(L, G^m)                                                    TF^m
  NL - TF^t  or  NL|M*^t_G - TF^t                              TP
Empirical correlations:
  r(Mi, Mj)                                                    NL|Mij
  r(Mi*^t_G, Mj*^t_G)                                          NL|Mij*^t_G
  r(Gi^m, Gj^m)                                                TF|Gij^m
  NL|Mij - TF|Gij^t  or  NL|Mij*^t_G - TF|Gij^t                EP
TF|AF^m: graphical estimate of TF from AF-stress, cfr. Figs. 3-5.

Note that superscripts always refer to analysed dimensionality, m. For each value of true dimensionality, t, m always takes on the values 1, 2 and 3. M* refers to rank image transformations of M, cfr. p. 38.

Results based on theoretical correlations.

The main results are presented in Table 9.

Table 9. Results based on theoretical correlations.

a) Systematic configurations

                     Ramsay-Young error process         Wagenaar-Padmos error process
                     t = 1            t = 2             t = 1            t = 2
Configuration        i)      ii)      i)      ii)       i)      ii)      i)      ii)
NL                   2.64    2.58     2.67    2.60      3.06    2.84     3.51    3.46
TF3                  2.90    2.65     2.26    2.36      3.33    3.11     3.14    3.24
TF2                  2.52    2.30     1.80    1.90      3.22    3.02     2.56    2.79
TF1                  1.48    1.46     4.33    4.25      2.00    1.65     4.33    4.28
TP                   1.16    1.12      .87     .70      1.06    1.19      .95     .67

b) Random configurations

               n=15, t=1   n=15, t=2   n=15, t=3   n=20, t=1   n=20, t=2   n=20, t=3
NL|M*^t_G        2.72        2.78        2.79        2.72        2.71        2.70
TF3              2.85        2.62        2.42        2.83        2.34        2.05
TF2              2.59        2.21        3.38        2.60        1.89        3.44
TF1              1.75        3.75        4.13        1.62        4.04        4.11
TP                .97         .57         .37        1.10         .82         .66

It is readily seen in Table 9 that the implications stated on p. 92 hold good:

a) TF is best when m = t, TF^t < TF^(m≠t). For both random and systematic configurations the TF value for m = t is smaller than the other TF values for each condition.

b) Theoretical purification does occur, TF^t < NL (or NL|M*^t_G). As might be expected TP is more pronounced for higher values of n and for lower values of t. As we shall see in more detail later this implies that for otherwise similar conditions it is easier to check applicability of the model the higher the value of n and the lower the value of t. (This should not be unexpected since these conditions imply more constraints on the data, and generally the more constraints, the more vulnerable any model will be.)

c) Theoretical distortion will always occur when analyzed dimensionality is too low, TF^(m<t) > NL (or NL|M*^t_G). This is very marked in the present results. Generally the amount of distortion is for instance more than one unit in the TF scale if a two-dimensional configuration is analysed in one dimension. For example for the R-Y error process, t = 2, configuration i) and m = 1 the distortion is 4.33 - 2.67 = 1.66.

On the other hand if analyzed dimensionality is too high there is no consistent tendency either in the direction of purification or in the direction of distortion. The general validity of the three inequalities a), b) and c) is strikingly confirmed by looking at the individual results. For systematic configurations the relations between the means are based on 4 x 12 = 48 comparisons for each value of t, totally 96 comparisons for this study.
For random configurations there are a total of 6 x 30 = 180 comparisons, making a grand total of 276 comparisons. a) is violated in 7 of the 276 cases or 2.5%, b) is violated in 4 cases or 1.5%. For c) 1 dimensional configurations are not relevant, and there are thus 168 relevant comparisons; in 12 of these, or 7.1%, c) was violated. It should be noted that the violations were by no means randomly distributed: for t = 1 and for n = 20, t = 2 there were no violations whatever, furthermore there were no violations for the lowest level of E (highest reliability), that is the violations occurred only with low reliability. This pattern is perhaps not too surprising and we shall see the same pattern repeating itself in the more practically important results dealing with how to assess dimensionality.

Results based on empirical correlations.

The main results are presented in Table 10.

Table 10. Results based on empirical correlations.

a) Systematic configurations

                     Ramsay-Young error process         Wagenaar-Padmos error procedure
                     t = 1            t = 2             t = 1            t = 2
Configuration        i)      ii)      i)      ii)       i)      ii)      i)      ii)
NL|Mij               2.84    2.78     2.88    2.84      3.33    3.03     3.69    3.62
TF|Gij3              2.90    2.78     2.45    2.63      3.10    2.65     3.38    3.33
TF|Gij2              2.54    2.45     1.97    2.09      2.78    2.36     2.80    2.98
TF|Gij1              1.58    1.54     4.04    3.90      2.26    1.74     4.05    4.07
EP                   1.27    1.24      .91     .74      1.07    1.29      .89     .64

b) Random configurations

               n=15, t=1   n=15, t=2   n=15, t=3   n=20, t=1   n=20, t=2   n=20, t=3
NL|Mij*^t_G      3.05        3.10        3.12        3.04        3.03        3.02
TF|Gij3          3.01        2.90        2.74        2.95        2.65        2.28
TF|Gij2          2.75        2.50        2.91        2.70        2.10        2.71
TF|Gij1          1.92        2.75        3.55        1.76        3.40        3.07
EP               1.12         .61         .38        1.28         .93         .75

Corresponding to a) and b), a1) and b1) as stated on p. 92 are readily seen to be verified. In every condition:

a1) TF|Gij^t < TF|Gij^(m≠t)   and   b1) TF|Gij^t < NL|Mij (or NL|Mij*^t_G)

c1) may have some validity for systematic configurations but not generally for random configurations.

The important question is now the extent to which a1) is valid for each single comparison. The confidence with which we can answer this affirmatively will indicate how far the present approach provides a simple solution to the problem of dimensionality. This approach will further be referred to as dimensionality rule a1). Individual results turn out to depend upon reliability (noise levels), n, t and also type of configuration (systematic versus random). For systematic configurations there are no pronounced differences between the two configurations for each t. For each combination of t and error procedure, there are thus 2 x 66 = 132 comparisons to be made. A survey of the per cent violations of rule a1) for systematic configurations is given below:

            t = 1     t = 2
R-Y           0        3.7
W-P          5.4      16.5

The results are generally highly encouraging for such a relatively small value of n. There appears to be a preponderance of violations for the W-P procedure. This, however, is mainly due to the fact that generally the reliability is lower for the W-P procedure (cfr. the NL values in Table 9 and the NL|Mij values in Table 10). When for instance all cases with r(Mi, Mj) < .65 were excluded from t = 2 the W-P % violations dropped to 5.5% while the R-Y % rose to 3.9%.28 For random configurations the results were entirely clear for 1 dimensional configurations where there were no violations whatsoever.
For 2 dimensional configurations the results were quite different for different noise levels. From Table 3, Ch. 3, E1 = .10 corresponds to r(L, M) = .99, similarly E2 = .35 to r(L, M) = .92 and E3 = .50 to r(L, M) = .81. These values were confirmed in the present study. By the equation on p. 93 the various combinations of noise levels then roughly correspond to reliabilities as outlined below:

        E1     E2     E3
E1     .98    .89    .80
E2            .81    .73
E3                   .66

It will be convenient to present the results separately for results generated only by E1 and E2 and the results generated by E3. As will be seen from the outline above this roughly corresponds to a distinction between cases with reliability above .80 vs. cases with reliability .80 and below. Since there were 2 replications per noise level, this gives for each configuration 6 cases produced by E1 and E2 - 2 diagonal and 4 offdiagonal cases - that is a total of 30 such cases for each condition. There will be 9 cases produced by E3 for each configuration (a total of 45 such cases for each condition). Before we can give details on individual results for 2 dimensional configurations it is necessary to distinguish between three types of violations of rule a1):

Type 1) TF|Gij^1 < TF|Gij^2
Type 2) TF|Gij^3 < TF|Gij^2
Type 3) both Type 1) and Type 2) above.

For Type 1) using rule a1) will lead to too low a dimensionality, correspondingly Type 2) will lead to too high a dimensionality.

28 A more systematic comparison between the two error procedures would have been possible if different noise levels had been chosen in such a way that the reliabilities would have been the same for the two procedures.

Table 11. Types of violations of dimensionality rule a1) for 2 dimensional configurations. N: number of comparisons.

n     t    Noise levels    Type 1)   Type 2)   Type 3)   Total nr. of violations      N    Per cent
15    2    E1 and E2          4         0         0                 4                30      13.3
           E3                18         1         5                24                45      53.3
           Total             22         1         5                28                75      37.3
20    2    E1 and E2          1         0         0                 1                30       3.3
           E3                 6         0         0                 6                45      13.3
           Total              7         0         0                 7                75       9.3

We see that rule a1) works quite well for n = 15 when reliability is above .80 29 (noise levels E1 and E2) and for all investigated noise levels when n = 20. There is also a very consistent pattern in the errors such that when the selection rule fails, it is in the direction of giving an underestimate of true dimensionality. No such pattern was observed for systematic configurations. We should expect fewer violations for n = 15 than for n = 12; nevertheless there appear to be more violations for n = 15, t = 2 than for the systematic configurations with n = 12. These differences between systematic and random configurations will be further discussed later. For t = 3 the selection rule completely fails for n = 15, there are 30/75 = 45% violations. For n = 20 the selection rule may have some value for the noise levels E1 and E2, where there were 7/30 = 23.3% violations. Altogether there are 27/75 = 36.0% violations for n = 20 and t = 3.

TF estimates from stress.

We now turn to another proposed criterion for determining dimensionality not previously discussed in this section but stated in Section 2.2, p. 24. The stress for each analyzed dimensionality, m, is converted to a TF estimate from the figure among Figs. 3-5 with dimensionality m; t is then identified as the value where TFm is at a minimum. This is suggested as an alternative to looking for an elbow in the stress curve, cfr. p. 6.
It is a much simpler criterion, since a minimum is easier to identify than an elbow. Another advantage of this criterion is that it does not require retest procedures. The main results are presented in Table 12.

29 It is possible that more refined results might give differences if a given reliability was generated by the same levels of E for both Mi and Mj than if the levels were quite different. From the outline on p. ## it is for instance apparent that both E1E3 and E2E2 generate roughly the same reliability. We suspect that the selection rule will not work so well when a given reliability is generated by widely different noise levels. With just a single reliability (retest) coefficient, one can in practice of course not know with any confidence whether the underlying noise levels are similar or not. Nevertheless a detailed comparison of the success of the selection rule for a given reliability generated by closely similar versus widely different noise levels might be of interest in further studies.

Table 12. TF estimated from AF-stress.

a) Systematic configurations

                     Ramsay-Young error process         Wagenaar-Padmos error procedure
                     t = 1            t = 2             t = 1            t = 2
Configuration        i)      ii)      i)      ii)       i)      ii)      i)      ii)
TF|AF3               2.28    2.43     2.20    2.08      2.68    2.18     3.13    3.15
TF|AF2               2.25    2.20     1.93    1.98      2.50    2.02     2.92    2.90
TF|AF1               1.69    1.63     4.15    3.90      2.10    1.79     4.28    3.88

b) Random configurations

               n=15, t=1   n=15, t=2   n=15, t=3   n=20, t=1   n=20, t=2   n=20, t=3
TF|AF3           2.38        2.21        2.26        2.28        2.07        2.04
TF|AF2           2.19        2.14        2.98        2.01        1.82        2.85
TF|AF1           1.73        3.04        3.67        1.44        3.19        3.32

Table 12 immediately shows that for every condition:

a2) TF|AF^t < TF|AF^(m≠t)

The next question is then to what extent this inequality, dimensionality rule a2), holds good for each separate condition. For systematic configurations there are 12 comparisons for each separate configuration, 24 comparisons for each combination of t and configurations. For random configurations there are 30 comparisons for each condition. (There is one comparison for the AFm values generated by each separate M vector.) Again it is simple to summarize the results for 1 dimensional configurations. For the R-Y error process, n = 15 and n = 20, there are no exceptions to rule a2), while for the W-P error procedure there are 3 violations or 12.5%. For 2 dimensional configurations we can distinguish the same types of violations of dimensionality rule a2) as the previously discussed violations of dimensionality rule a1):

Type 1) TF|AF1 < TF|AF2
Type 2) TF|AF3 < TF|AF2
Type 3) both Type 1) and Type 2) above.

Table 13. Types of violations of dimensionality rule a2) for 2 dimensional configurations. N: number of comparisons.

                             R-Y     W-P     n = 15    n = 20
Type 1)                       0       1        1          0
Type 2)                       3       6        6          1
Type 3)                       0       1        3          1
Total nr. of violations       3       8       10          2
N                            24      24       30         30
Per cent                   12.5    33.5     33.5        6.7

As reflected in Table 11 there was a tendency for concentration of errors at higher noise levels; this aspect is not included in Table 13. Judging from over-all percentages of violations it is not possible to see any clear-cut difference between the two dimensionality rules (for n = 15, 37.3 vs. 33.5; for n = 20, 9.3 vs. 6.7). There is, however, a very interesting difference in the pattern of types of errors which
suggests that the two rules may serve in a complementary way. When rule a1) failed it was in the direction of giving too low dimensionality, that is: Rule a1) serves best in avoiding too high dimensionality. With too many degrees of freedom, solutions from different M vectors may diverge in different directions, thus avoiding Type 2) errors. On the other hand with too low dimensionality, stereotyped, or "too similar", solutions may be found, thus giving Type 1) errors. It should, however, be pointed out that the tendency towards Type 1) errors reflected in Table 11 did not occur for systematic configurations. These configurations may all be characterized by a lack of clustering in the configurations. Perhaps this feature tended to preclude a tendency towards stereotyped solutions, contrary to random configurations where some clustering must be expected. Turning now to dimensionality rule a2) there is a very clear tendency for this rule to avoid Type 1) errors, that is: Rule a2) serves best in avoiding too low dimensionality. This is not difficult to explain. With too low dimensionality, there are far too few degrees of freedom to accommodate the structure of the material, and this will force stress markedly upwards. As will be recalled from Section 4.6, to a given TF there corresponds higher stress in one than in two dimensions, but this is far outweighed by forcing a solution into a too low dimensionality. On the other hand rule a2) is not equally good in avoiding too high dimensionality; stress will then be low, the solutions may capitalize on noise and we can not expect the rule to be highly differentiating. The latter reasoning may explain the perhaps surprisingly low error rates found for 3 dimensional configurations, 10% for n = 15 and 6.7% for n = 20. If these configurations had been analyzed in 4 dimensions as well, we would have expected a much higher error rate.

The main conclusion on dimensionality is that, provided certain conditions hold good, both our two major rules work very well. These conditions are: a sufficiently high value of n, and not too high dimensionality. To some extent low n and high t can be compensated by having highly reliable data. The more detailed discussion suggested that even finer diagnosis of the proper dimensionality may be achieved by using the criteria in a supplementary way. Rule a1), picking the highest value of r(Gi^m, Gj^m), rules out too high values for t, while rule a2), picking the lowest value of TF|AF^m, rules out too low values for t. Working out the fine details of such a combined use of criteria will, however, require extensive investigation of different types of configurations as there may be interactions between this variable and rule a1).

Theoretical and empirical purification.

We now assume that dimensionality has been estimated and turn to discuss the relation between TP and EP, that is the validity of the equations on p. 93. This will be seen to illustrate an approach to the question of the applicability of the model. There are two approaches to checking the validity of these equations, namely separate check and combined check. In the first approach the equation relating empirical and theoretical correlations for NL, the NL equation, is checked separately from and independent of the corresponding TF equation. In contrast to the next approach the separate check is only indirectly testing the equality between TP and EP. In the second approach the relation between the two equations is directly checked, that is the equality between TP and EP.
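Since both checks presuppose that dimensionality has already been settled, a minimal sketch of the combined use of the two selection rules may be useful at this point. The numerical values in the example are invented, and the conversion of stress to TF is assumed to be available from the contours of Figs. 3-5.

```python
def choose_dimensionality(r_G, tf_from_stress):
    """Combined use of the two dimensionality rules discussed above.

    r_G[m]           : empirical correlation r(Gi^m, Gj^m) between two solutions in m dimensions
    tf_from_stress[m]: TF estimated from AF-stress in m dimensions (read off Figs. 3-5)

    Rule a1) picks the m with the highest r(Gi^m, Gj^m)   (guards against too high m);
    rule a2) picks the m with the lowest TF|AF^m          (guards against too low m).
    Agreement of the two rules is taken as the estimate; disagreement is flagged,
    which may itself be a hint that a spatial model is inappropriate.
    """
    t_a1 = max(r_G, key=r_G.get)
    t_a2 = min(tf_from_stress, key=tf_from_stress.get)
    return t_a1 if t_a1 == t_a2 else None   # None = equivocal, inspect further

# Illustrative (made-up) values for m = 1, 2, 3:
r_G = {1: 0.88, 2: 0.96, 3: 0.94}
tf_stress = {1: 3.1, 2: 1.9, 3: 2.1}
print(choose_dimensionality(r_G, tf_stress))   # -> 2
```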
The separate and the combined check correspond to different strategies for testing applicability. This is briefly commented on p. 110.

Separate check.

For both the NL and the TF equations there are again two different approaches to test the validity of the equations. Perhaps the most straightforward approach is to study the relative size of the discrepancies between the middle and right part of the equations. In terms of Fig. 14 the NL equation states that the Mij 30 correlations can be reproduced from the NL column, correspondingly the TF equation states that the Gijt correlations can be reproduced from the TF column, in other words that the Mij and the Gijt matrices both have a perfect one-dimensional Spearman structure. The statistics 31 which will be used to check the structure of these matrices are:

Res(Mij) = Σ_i<j |r(Mi, Mj) - r(L, Mi) · r(L, Mj)| / (Σ_c Σ_i<j r(Mi, Mj) / c)

Res(Gij^t) = Σ_i<j |r(Gi^t, Gj^t) - r(L, Gi^t) · r(L, Gj^t)| / Σ_i<j r(Gi^t, Gj^t)

These are simply indices for the relative size of the absolute deviations between the observed and the expected. To rely exclusively on Res(Mij) and Res(Gijt) may obscure that small differences for very high correlations may be of large practical importance. The second approach checks the validity via the transformation to the TF-categories scale. First theoretical correlations are multiplied, e.g. r(L, Mi) · r(L, Mj), then each of the s(s-1)/2 products is converted to TF-categories. NL|TCij (TC for Theoretical Correlation) denotes one such converted product; these values are then compared with NL|Mij. Correspondingly r(L, Gi^t) · r(L, Gj^t) are converted, the resulting values denoted TF|TCij. A separate comment on this procedure (and the new symbols NL|TCij and TF|TCij) is in order since it will serve to clarify an otherwise puzzling aspect of the results reported in Tables 9 and 10. Instead of first multiplying theoretical correlations, then converting, one might have first converted, then averaged. The latter, averaging procedure would have introduced a systematic bias as illustrated in the following example:

Averaging procedure:
r(L, M1) = .90, converted to TF-categories: 3.20
r(L, M2) = .80, converted to TF-categories: 3.76
mean(NL) = 3.48

TCij procedure:
(r(L, M1) · r(L, M2))^1/2 = .72^1/2 = .8485, converted to TF-categories: NL|TC12 = 3.55

Since the TCij procedure first multiplies theoretical correlations, more weight will be given to the lower correlation in this approach than when averaging. This will give worse (higher) values in the TF-categories scale (3.55 > 3.48 in the example). As a matter of fact if the averaging procedure had been used, mean(NL) for checking the s(s-1)/2 NL|Mij values would have been identical to the NL values reported in Table 9. By comparing Tables 9 and 10 it will be seen that NL is systematically lower than NL|Mij. When NL|TCij is used there is, however, no such bias. The discrepancy between the two procedures will be larger the greater the discrepancies in the correlations. Correspondingly there are systematic differences between TFt in Table 9 and TF|Gijt in Table 10, though less pronounced since the Gijt correlations are more homogenous. The larger bias for NL and the smaller bias for TF combine to produce a systematic bias for TP vs. EP such that TP is generally smaller. The above argument implies that these two sets of values are not comparable.
30 In order not to have the terminology too complicated the sub- and superscripts for rank images will be dropped in the rest of the chapter. It should, however, be understood that all computations for random configurations are still based on rank images.

31 Since these correlations are of the same size for all the conditions the average across c conditions is used as denominator (but of course separately for the R-Y and the W-P procedures). This is in contrast to Res(Gijt) where the average correlation is different between conditions. It is always implied that the sums extend over the different configurations within a condition.

When we later compute TPTC as the difference between NL|TCij and TF|TCij there is no longer any bias. TPTC and EP are closely similar in general size, cfr. Table 16. We are now ready to present the results for the separate check; corresponding to Res(Mij) and Res(Gijt) we have for the transformation approach:

Res(NL|Mij) = Σ |NL|Mij - NL|TCij| / (Σ_c Σ NL|Mij / c)

Res(TF|Gij^t) = Σ |TF|Gij^t - TF|TCij| / Σ TF|Gij^t

Table 14. Separate check. Relative residuals in per cent for the validity of the relation between empirical and theoretical correlations.

a) Systematic configurations

                   Ramsay-Young error process     Wagenaar-Padmos error procedure
                   t = 1          t = 2           t = 1          t = 2
Res(Mij)            1.6            1.7             3.3            5.6
Res(Gijt)            .6            1.0             1.2            2.6
Res(NL|Mij)         2.1            3.2             3.1            2.5
Res(TF|Gijt)        9.1            5.6             6.7            4.3

b) Random configurations

               n=15, t=1   n=15, t=2   n=15, t=3   n=20, t=1   n=20, t=2   n=20, t=3
Res(Mij)         1.4         1.6         1.9         1.3         1.5         1.1
Res(Gijt)        1.0         1.4         1.7          .7          .5          .7
Res(NL|Mij)      1.6         1.6         1.9         1.7         1.7         1.5
Res(TF|Gijt)     6.8         3.6         2.9         6.8         2.9         2.9

The results are very convincing. From both the Spearman point of view (Res(Mij) and Res(Gijt)) and the transformation point of view (Res(NL|Mij) and Res(TF|Gijt)) the relative errors are acceptably small. For the Mij matrices the relative error does not seem to be much influenced by whether correlation residuals are computed or whether residuals in the TF scale are computed. For the Gijt matrices, however, the relative error is more pronounced in the transformation approach. This is probably due to the fact that in the Gijt matrices there will be differences between very high correlations and such differences are magnified by the TF-categories scale. While there is a tendency for relative error to decrease with n there is no very clear dependence on dimensionality. At present penetration to the fine details of these results does not add much to the over all picture. There appears to be some relation between noise level and relative error, higher relative error for higher noise level, but this tendency is itself of a fairly erratic nature. While different ways of presenting deviations from the NL and TF equations might give somewhat different results, we think it is fair to state that the general validity of these equations is excellent. Before turning to the combined (direct) check on the equality between EP and TP some further comments on Spearman structure will be made. Turning to Fig. 14 one might ask whether the square MiGjt matrix should not also have a Spearman structure, that is:

r(L, Mi) · r(L, Gj^t) = r(Mi, Gj^t)

This is indeed the case, the relative residuals are roughly of the same size as Res(Gijt).
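A minimal sketch of the Res index used for these Spearman-structure checks is given below. It computes the relative absolute residual for a single condition (the averaging over c conditions in the Mij denominator is omitted here), and the numbers in the example are invented.

```python
def res_spearman(r_pair, r_latent):
    """Relative residual for a perfect one-dimensional Spearman structure.

    r_pair[(i, j)] : observed correlation r(Mi, Mj) (or r(Gi^t, Gj^t)) for i < j
    r_latent[i]    : theoretical correlation r(L, Mi) (or r(L, Gi^t))

    Returns sum |r_ij - r_i * r_j| / sum r_ij, the form of the Res indices of Table 14.
    """
    num = sum(abs(r - r_latent[i] * r_latent[j]) for (i, j), r in r_pair.items())
    den = sum(r_pair.values())
    return num / den

# Small made-up example with three parallel forms:
r_latent = {1: 0.95, 2: 0.90, 3: 0.85}
r_pair = {(1, 2): 0.86, (1, 3): 0.80, (2, 3): 0.77}
print(round(res_spearman(r_pair, r_latent), 3))
```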
There is, however, one important exception, the bidiagonal r(Mi, Gi^t). This is a correlational index for AF, an alternative to AF-stress. Since a number of studies (not reported in detail in the present work) have shown that AF-stress and r(Mi, Gi^t) are very highly interrelated, comments on r(Mi, Gi^t) will serve to elaborate some comments made previously on AF. There is first the principle of global minimum stated on p. 39, that M should be closer to G than to L. This implies:

r(Mi, Gi^t) > r(L, Mi)

This inequality has been checked for the total of 6 x 30 = 180 comparisons for random configurations; in every single case the inequality was found to be valid. The global minimum inequality implies the weaker inequality:

r(Mi, Gi^t) > r(L, Mi) · r(L, Gi^t)

In terms of partial correlation this latter inequality states that the relation between Mi and Gi^t can not be completely accounted for in terms of L, or that Mi and Gi^t have more in common than can be accounted for by L. This "more" can be called "capitalizing on noise": in addition to L there will be noise components common to Mi and Gi^t. This is another way of stating a conclusion arrived at earlier, cfr. p. 70, that G can not be represented on a straight line between L and M. As might be expected the amount of capitalizing on noise behaves very regularly. The discrepancy r(Mi, Gi^t) - r(L, Mi) · r(L, Gi^t) increases: when n decreases, when E (noise level) increases and finally when t increases. When n = 20, t = 1 and E = .10 the discrepancy is just .0028, while the maximal value observed in the present study was for n = 15, t = 3, E = .35, when the value was .2059.

Combined check.

In the combined check on the TF and NL equations there are again two approaches. The most immediate is first to compute TPTC

TPTC = NL|TCij - TF|TCij

and then the discrepancies

EP - TPTC = (NL|Mij - TF|Gij^t) - (NL|TCij - TF|TCij) = (NL|Mij - NL|TCij) - (TF|Gij^t - TF|TCij)

The latter expression shows that checking the relative discrepancies of (EP - TPTC) is equivalent to checking differences entering the expressions Res(NL|Mij) and Res(TF|Gijt). Since generally differences are far less reliable than the components we should expect substantially worse discrepancies for the direct check. Nevertheless it will be of interest to study the relative discrepancies:

Res(EP) = Σ |EP - TPTC| / Σ EP

The other approach in the combined check will turn out to be a validation of our basic proposal for checking applicability of the model. On p. 23 we proposed that applicability could be evaluated by comparing two values of purification, empirical purification and estimated theoretical purification, Est(TP). A seemingly simpler approach would be to compare TF estimated from NL|Mij - this will be labelled TF|(NL|Mij) - with TF estimated from r(Gi^t, Gj^t), that is TF|Gij^t. Two independent estimates of TF are then compared, one based on retest reliability, the other on results after multidimensional analysis. TF|(NL|Mij) can be read off from Figs. 10 to 12 just as TF|AF is read off from Figs. 3 to 5. If now TF|Gijt is not appreciably higher than TF|(NL|Mij) this substantiates the validity of the model. To take an example suppose that n = 15, t = 1 and r(Mi, Mj) = .935. Two different values of r(Gi^1, Gj^1) for this reliability will be used to illustrate the procedure. As example 1 we set r(Gi^1, Gj^1) = .990; this gives TF|Gij^1 = .93. From Fig.
10 we see that for n = 15, t = 1 and r(Mi, Mj) = .935, then TF|(NL|Mij) = 1.0, so in this case we would have a very good confirmation of applicability. As example 2 we set r(Gi^1, Gj^1) = .960; this gives TF|Gij^1 = 1.85. The latter value is appreciably higher than TF|(NL|Mij), but is it so much higher that we have reason to doubt the validity of the model? Instead of working out procedures to decide when the discrepancy between TF|Gijt and TF|(NL|Mij) is so large as to throw doubt on the validity of the model, we choose to use the slightly more indirect procedure of comparing two points in a usual coordinate system:

the observed point: (NL|Mij, EP)
the theoretical point: (NL|Mij, Est(TP))

where the first (common) coordinate is the abscissa, the second coordinate the ordinate and

Est(TP) = NL|Mij - TF|(NL|Mij)

The discrepancy between the ordinates of the observed and theoretical points is then:

EP - Est(TP) = (NL|Mij - TF|Gij^t) - (NL|Mij - TF|(NL|Mij)) = TF|(NL|Mij) - TF|Gij^t

So the second approach is to study:

ResEst(EP) = Σ |EP - Est(TP)| / Σ EP

that is the discrepancies between the empirical and theoretical estimates of purification. The reason we choose to compare the observed and the theoretical point instead of just TF|Gijt and TF|(NL|Mij) is that generally the latter discrepancy will depend upon the size of NL|Mij. Comparing the observed and the theoretical point takes NL|Mij into account; in essence it is equivalent to comparing TF|(NL|Mij) with TF|Gijt separately for each level of NL|Mij, cfr. the expression for EP - Est(TP). Before suggesting rules for when the discrepancy between the observed and the theoretical points is suspiciously large, we need a convenient way to find the estimated theoretical purification, Est(TP). This information is contained in Figs. 10 to 12, albeit in a fairly indirect way. In our example r(Mi, Mj) = .935 corresponds to NL|Mij = 2.25 and this implies: Est(TP) = 2.25 - 1.0 = 1.25. Systematic information on Est(TP) is represented in Table 15. To fill in Table 15 the first step was to construct (TF, NL) curves - parallel to the (TF, AF) curves in Figs. 7-9. In these curves NL was converted to the TF-categories scale and it was then simple to read off values of NL - TF for different levels of NL.

To take an example: for n = 20 and t = 1, then r(Mi, Mj) = .85 corresponds to NL|Mij = 3.0. In Fig. 10 we see that (n = 20, r(Mi, Mj) = .85) is about halfway between the contours for TF = 1 and TF = 2 but slightly closer to TF = 1. On a (TF, NL) curve we can read off the value TF = 1.48, that is TF|(NL|Mij). This then gives Est(TP) = 3.0 - 1.48 = 1.52; the latter value is an entry in Table 15.

Table 15. Amount of estimated purification, Est(TP), as a function of n, t and NL|Mij. NL|Mij is estimated from r(Mi, Mj) and is, like Est(TP), expressed in the TF-categories scale. Negative values denote distortion.

t = 1
n \ NL|Mij    -.5      0     .5    1.0    1.5    2.0    2.5    3.0    3.5    4.0   4.25
30           -.20    .20    .55    .85   1.15   1.40   1.62   1.70   1.72   1.42   1.00
20           -.50   -.10    .28    .65    .98   1.25   1.42   1.52   1.50   1.15    .75
15           -.70   -.25    .20    .52    .82   1.12   1.32   1.38   1.30    .90    .60
10          -1.00   -.65   -.22    .20    .55    .80    .90    .98    .85    .55    .32
 6          -2.00  -1.57  -1.15   -.75   -.20    .10    .40    .50    .42    .18    .05

t = 2
30           -.30    .02    .35    .70    .98   1.25   1.42   1.45   1.38   1.10    .75
20           -.60   -.22    .17    .45    .78    .98   1.15   1.20   1.10    .80    .55
15           -.80   -.40    .00    .30    .55    .73    .88    .90    .75    .55    .35
10          -1.20   -.80   -.40    .00    .32    .55    .73    .70    .55    .35    .20
 8          -1.50  -1.03   -.55   -.16    .05    .16    .25    .18    .17    .13    .10

t = 3
30           -.60   -.30    .05    .38    .63    .75    .82    .84    .78    .60    .38
20           -.70   -.40   -.02    .25    .48    .60    .68    .65    .62    .40    .25
15           -.90   -.55   -.20    .05    .28    .42    .48    .45    .40    .28    .18
10          -1.40   -.92   -.50   -.17    .00    .12    .15    .15    .12    .02    .00
 8          -2.10  -1.70  -1.32   -.90   -.55   -.35   -.18   -.10   -.05    .00    .00

Notice that the information in the column for NL = -.5 is the same as the information presented in Fig. 6, the error free case. Some representative curves illustrating the information in Table 15 are presented in Fig. 16.

Notice that from the bottom scale in Fig. 16 one may easily find the appropriate value of NL|Mij from r(Mi, Mj).32 The same scale can be used to find TF|Gijt from r(Git, Gjt). By interpolation curves for other values of n than the ones listed in Table 15 can be constructed, or, as often will be the case in practice, single values of Est(TP) can be computed by a double linear interpolation. If now the observed point is closely under (or above) the theoretical point everything is OK. If, however, the observed point is far below the theoretical point the wrong type of model has most likely been used. Ideally one would like to have a confidence belt surrounding each purification curve. This is not possible to construct from the present material, but we can suggest some rules of thumb from the maximum deviations observed in the present study. For n = 15, t = 1 EP should always be greater than .50 x Est(TP). For n = 15, t = 2 and 3 there is a much simpler rule, EP should just be above 0. For n = 20, t = 1 we should have EP greater than .40 x Est(TP) and for n = 20, t = 2 and 3 EP greater than .50 x Est(TP).

32 For convenience the transformations leading from r(Mi, Mj) to NL|Mij are also summarized in FORTRAN notation in the appendix.

The two previously mentioned examples will serve to summarize the procedure for checking applicability.

Finding the theoretical point                    Finding the observed point
                                                                           Ex. 1     Ex. 2
r(Mi, Mj)        .935                            r(Gi^1, Gj^1)              .990      .960
NL|Mij           2.25                            TF|Gij^1                    .93      1.85
Est(TP)          1.25                            EP = NL|Mij - TF|Gij^1     1.32       .40

From the bottom scale in Fig. 16 we see that r(Mi, Mj) = .935 corresponds to NL|Mij = 2.25. Entering the curve for n = 15, t = 1 with 2.25 as abscissa gives the value 1.25 for Est(TP). (Since NL|Mij - Est(TP) = TF|(NL|Mij), TF|(NL|Mij) is here 2.25 - 1.25 = 1.00, the same value as previously read off from Fig. 10. This again illustrates that the information in Table 15 and Fig. 16 is the same as in Figs. 10 to 12.) Using the bottom scale in Fig. 16 also shows that r(Gi^1, Gj^1) = .99 gives the previously mentioned value TF|Gij^1 = .93. In example 1 the model looks slightly "too good", EP is higher than Est(TP). Such a finding should not be surprising since the curve for Est(TP) is an expected curve where there will be random deviations in both directions.
We should, however, have EP greater than .50 x Est(TP). This gives a lower boundary of .63 for EP in the present case, and this boundary is clearly not reached in example 2. Even though there is some empirical purification in example 2 (as may be directly seen by comparing r(Gi^1, Gj^1) and r(Mi, Mj)), the observed amount of EP is not sufficient to warrant faith in the applicability of the model for this example. Provided the model is correct, a reliability of .935 implies more structure in the data than what is implied by r(Gi^1, Gj^1) = .960. It is now convenient to give systematic information on Res(EP) and ResEst(EP). Table 16 presents information on the two approaches to the combined, or direct, check on the equality between EP and TP.

Table 16. Combined check. The relation between EP and TP by two different approaches. Relative residuals in per cent. Random configurations.

               n=15, t=1   n=15, t=2   n=15, t=3   n=20, t=1   n=20, t=2   n=20, t=3
EP               1.12         .61         .38        1.28         .93         .75
TPTC             1.09         .59         .37        1.26         .94         .74
Est(TP)          1.27         .79         .41        1.45        1.10         .63
Res(EP)33         .12         .25         .20         .11         .07         .10
ResEst(EP)        .23         .34         .40         .20         .18         .22

The first row of Table 16 is taken from Table 10. Comparing this to the second row we see that there are no significant biases in the TPTC procedure, the mean results are closely similar. There is, however, a consistent difference favouring Est(TP) over EP. The mean difference Est(TP) - EP across the six conditions is .12. To a large extent, however, this difference is due to the combination of error levels E1E3 - if this combination is excluded, the overall difference drops to .07. Still this is probably a significant difference and it reflects that TF|Gijt > TF|(NL|Mij). Since the latter value is based on the TF contours this inequality implies that the contours are (slightly) "too good" for the present study of random configurations. Notice that there is a similar tendency when comparing TF|AFt in Table 12 with TFt in Table 9, again the results from the TF contours are slightly too good. (On the other hand there is the reverse tendency for systematic configurations, and for both studies the crossvalidation, as later reported in Table 17, appears satisfactory, so there may not be much reason to dwell on these discrepancies.) As expected the relative discrepancies are much higher than those for the separate check in Table 14. The most important relative discrepancy, ResEst(EP), may appear uncomfortably high. Those relative discrepancies do not, however, preclude useful rules of thumb for checking applicability as exemplified on p. 108. From p. 98 it will be recalled that the present study covered reliabilities in the range .66 to .98 (some reliabilities were beyond .99). From the purification curves presented in Fig. 16 it might appear that it is advantageous not to have too high reliability. For n = 20, t = 1 there does for instance seem to be a maximum for r(Mi, Mj) = .80. This apparent advantage of not too high reliability has no real basis, however, as present results seem to indicate that the relative error increases with decreasing reliability. If one has reliabilities far exceeding those observed in the present study, the use of the purification curves is not very meaningful. This is most clearly seen in the extreme case with reliability = 1. The curves then indicate distortion, but this is meaningless from the point of view of r(Git, Gjt) which of course also will be 1 in this case.
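For ordinary ranges of reliability the applicability check can be summarized in a few lines of code. The sketch below implements only the rules of thumb suggested above for n = 15 and n = 20; Est(TP) is assumed to have been read from Table 15 or Fig. 16 (by interpolation if necessary), and the function is a sketch of the procedure, not a general-purpose test.

```python
def applicability_check(nl_mij, tf_gij, est_tp, n, t):
    """Compare the observed point (NL|Mij, EP) with the theoretical point (NL|Mij, Est(TP)).

    nl_mij : r(Mi, Mj) converted to the TF-categories scale
    tf_gij : r(Gi^t, Gj^t) converted to the TF-categories scale
    est_tp : estimated theoretical purification read from Table 15 / Fig. 16
    Returns (EP, verdict) using the rules of thumb for n = 15 and n = 20.
    """
    ep = nl_mij - tf_gij                      # empirical purification
    if n == 15 and t == 1:
        ok = ep > 0.50 * est_tp
    elif n == 15:                             # t = 2 or 3
        ok = ep > 0.0
    elif n == 20 and t == 1:
        ok = ep > 0.40 * est_tp
    else:                                     # n = 20, t = 2 or 3
        ok = ep > 0.50 * est_tp
    return ep, ("no reason to doubt the model" if ok else "doubtful applicability")

# The two examples from the text (n = 15, t = 1, NL|Mij = 2.25, Est(TP) = 1.25):
print(applicability_check(2.25, 0.93, 1.25, 15, 1))   # EP = 1.32 -> ok
print(applicability_check(2.25, 1.85, 1.25, 15, 1))   # EP = 0.40 -> below the .63 boundary
```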
When r(Mi, Mj) is greater than .99 we propose to use stress instead of r(Git, Gjt). If for instance n = 15, t = 2 and r (Mi, Mj) = .998, this corresponds to NL = .5 and NL – TF = 0 and thus TF = .5. From e.g. Fig. 2 it is seen that for TF = .5 we should have stress in the neighbourhood of .01. If then the stress of both Gi2 and Gj2 is far above .01 this indicates as stated on p. 23 that there is more structure in the material than is captured by the method and thus that the wrong model has been applied. Perhaps finer diagnosis of applicability could be developed by systematically combining stress and r(Git, Gjt) in the assessment of applicability. For the regions of reliability we have investigated, however, stress is slightly less related to TF|(NL|Mij) than TF|Gijt is and at present it is hard to see how stress might give better procedures when we have the usual values of reliability. There might appear to be a weakness in the present approach to applicability in that we first assume that dimensionality is estimated, then proceed to ask whether the type of model is correctly chosen. But how can dimensionality meaningfully be estimated if the model is basically inappropriate? First we should notice that probably any approach to applicability would be very difficult (if not impossible) to develop if dimensionality is very high. As true dimensionality increases, then expected purification will 33 In these results the combination of noise levels E1E3 is excluded since this condition severely disfavoured ResEst (EP). Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 110 tend towards 0. r(Git, Gjt) will tend to be more and more equal to r(Mi, Mj). It is, however, not likely that a spatial model really is appropriate if dimensionality is too high. In the limiting case where dimensionality is n-2 (or n-1 for metric models) we have what will later be called an unfilled space. So we assume that for the cases where one wants to check applicability of the model the alternatives are: either a reasonably low dimensionality or that a spatial model is inappropriate. If a spatial model is inappropriate one might expect dimensionality rule a1) and rule a2) to give contradictory or equivocal results. This in itself might bean indication that the model is inappropriate. It is then further possible to use our proposed comparison of the observed point and the corresponding theoretical point on the purification curve for several alternative hypotheses of the value of t. If for all values of t the observed point is far below the theoretical point this will be a strong indication that the model is inappropriate. It would have been valuable to try this procedure for data generated by other types of model than a spatial model, for instance a tree structure model. This has not yet been done, but we would expect that serious distortion would occur if one analyzed the data with the hypothesis of for instance a 2 or 3 dimensional structure. Perhaps many cases where the model is inappropriate will show a strong empirical distortion for the potentially relevant values of t. Provided n is not too low (cfr. Table 15) and reliability is fair (say above .80) a clear empirical distortion will be a very strong indication that the wrong type of model has been used. For the rare cases of very high values of reliability, it is possible from Figs. 3 to 5 and Figs. 10 to 12 to estimate what the corresponding value of stress should be if the model is appropriate. 
The details of such an approach have, however, not yet been worked out. While the design used in the present section includes a set of “parallel” forms, the result for the combined check have been written out from the point of view of the user who has a single test, retest design. As pointed out on p. 90, however, there may be cases where the design used in empirical research has a. similar structure to that outlined in Fig. 14. This will be the case either if there are several replications for a single individual or if we are willing to assume that results from different individuals can be considered as replications generated from the same L. In this case one may go "the other way" from what was done in computing the values in Table 14. Instead of estimating Mij and Gijt from the theoretical correlations, the theoretical correlations can be estimated by standard procedures for dealing with a Spearman structure, cfr. Harmann (1967), Thurstone (1947). The low residuals reported for the separate check in Table 14, compared with the relatively high residuals in Table 16 suggest that this will be a very sensitive procedure to test applicability of the mode. This design also has the advantage that one will get “tailormade” estimates of TF, it is then not necessary to go via the contours in Section 4.6. If we finally take a closer look at the meaning of TF|Gijt this brings forth an incompleteness of the present approach which calls for further research. Each of the configurations Git and Gjt have their corresponding TF, and TF|Gijt is a kind of average of these separate true fits. But provided the problems of dimensionality and applicability have been satisfactorily answered, one will not be primarily interested in such an average. One will want in some way to get at the best configuration and then to know the true fit of this one configuration. Just as generally a mean is more reliable than the separate components one might hope it would be possible to derive one configuration which would have a better true fit than either of the configurations Git and Gjt. A promising approach to this problem might be to use the option in MDSCAL which allows a single solution to be computed from several data vectors, the repeat option described by Kruskal (1967). A special problem is then what to use as start configuration to avoid local minima problems. In some preliminary runs Git was used as start configuration when Mi and Mj were inputted in one run. Altogether 8 different random configurations from various combinations of n and t have been analyzed. For each configuration 3 solutions were computed, the first from M1 and M2 when E1 = .10 (reliability ca. .98), the second from M3 and M4 when E2 = .35 (rel. ca. .81) and the third from M5 and M6 when E3 = .50. (rel. ca. .66). In all but 2 of the 18 runs the solution had better true fit than either of the separate true fits for Git or Gjt. It might be noticed that in the two exceptions the G with the worse true fit was used as start configuration and the program failed to change this configuration - that is we happened to start with what might well have been a local minimum and this might have been avoided with a different start configuration. Leaving out these two cases the improvement in TF compared to TF|Gijt turned out to be Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 111 roughly .20 for E1 and varying between .20 and. 50 for E2 and E3. 
At this point we should not be surprised that though true fit improved by using the repeat option, stress was higher (looked worse). Using the repeat option is equivalent to increasing n, and we remember that this generally increases stress and improves TF. It is, however, too early to attempt to give parametric expression to this relation, likewise it is not known if the repeat option will give an improved solution if the true fits of the separate G’s are widely different. A further possibility is to study "to what extent further improvement can be made by using the repeat option with more than two M vectors. Is it then possible to get equally good true fit as otherwise can only be achieved with high n and highly reliable data? If this turns out to be the case it might be possible to get excellent levels of true fit even with small values of n and unreliable data merely by having enough replications. Further crossvalidation results. Further crossvalidation results from the random and the systematic configurations are given in Table 17. Comparing these results with those presented in Table 8 (p. 95) we would expect the results for systematic configurations to be between those for n ≥ 12 and n (< 12, that is the correlation between .92 and .98 and the rmsq discrepancy to be between .30 and .50. Table 17. Further crossvalidation results r correlation between TFt and TF|AFt rmsq root mean square discrepancy between TFt and TF|AFt. a) Systematic configurations Ramsay-Young error procedure t =1 t=2 configuration i) ii) i) ii) Wagenaar-Padmos error process. t=1 t=2 i) ii) i) ii) r rmsq .951 .364 .858 .481 .941 .284 .895 .443 .919 .402 .831 .484 .878 .946 .599 .363 Summary results R-Y W-P Total b) Random configurations n t n 15 1 15 r rmsq .957 .265 .944 .293 t n 2 15 .972 .301 t 3 n 20 .957 .293 r rmsq .903 .918 .920 .409 .466 .437 t 1 n 20 .980 .201 t 2 n 20 t 3 .969 .266 Summary results (over all conditions) r rmsq .962 .285 We might thus have wished the total correlation to be slightly better but the rmsq is just as expected. We tend to put more faith in the latter index since it is sensitive not only to general linear relation but Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 112 also to differences in average and standard deviation. There is exactly the same pattern for random configurations, the total rmsq here is .285 which fits just beautifully with the values .28 and .30 from the studies in Table 7. The fact that the over all correlation is a bit lower in the present studies is probably due to the fact that more limited ranges of noise levels were used at present than in Section 4.6. These further crossvalidation results should finally settle the point that there is no systematic difference to be expected between repeated and unrepeated designs, cfr. the discussion in Section 4.4, p. 66-67. Furthermore it is very encouraging that the results for systematic configurations do not depart appreciably from those for random configurations and the similar results for the two different error procedures suggest that it may not be of basic importance precisely how the error process is specified. 34 34 It might here be further mentioned that for n = 12 in the crossvalidation results reported in Section 4.6 the WP error procedure was also used. The relation between AF and TF turned out to be the same for both the R-Y and the W-P error procedures. 
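The two summary indices of Table 17 are straightforward to compute; a minimal sketch is given below, with invented values standing in for the TF^t and TF|AF^t series of a condition.

```python
import numpy as np

def crossvalidation_indices(tf_true, tf_from_af):
    """The two indices of Table 17: the correlation between TF^t and TF|AF^t,
    and the root mean square discrepancy between them."""
    tf_true = np.asarray(tf_true, dtype=float)
    tf_from_af = np.asarray(tf_from_af, dtype=float)
    r = np.corrcoef(tf_true, tf_from_af)[0, 1]
    rmsq = np.sqrt(np.mean((tf_true - tf_from_af) ** 2))
    return r, rmsq

# Made-up illustration: TF estimated from stress scattered around the true fit values
tf_true = [1.2, 1.8, 2.4, 3.0, 3.6]
tf_af   = [1.4, 1.7, 2.7, 2.9, 3.9]
print(crossvalidation_indices(tf_true, tf_af))
```

As noted in the text, rmsq is the stricter of the two since it is sensitive to differences in average and standard deviation as well as to departures from a linear relation.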
Part II. ANALYSIS OF A TREE STRUCTURE MODEL AND SOME STEPS TOWARDS A GENERAL MODEL

Chapter 5
Johnson's "hierarchical clustering schemes", HCS.

5.1 A presentation of "Hierarchical clustering schemes", HCS.

While Section 1.42 gave some intuitive background and a concrete example of a tree structure, this section treats tree structures from a formal point of view. The formal definition of a HCS is followed by the definition of distance for a HCS. The latter definition makes it possible to map a HCS into a distance matrix and we then show that it is possible to go "the other way" - that is to reconstruct a HCS from such a distance matrix. The definition of a HCS will be illustrated by reference to the example in Fig. 1.

Fig. 1. An example of a HCS and the corresponding tree representation.

The definition of a HCS has two parts:

a) An ordered sequence of m+1 clusterings: C = (C0, C1 ... Cj-1, Cj ... Cm)
b) A set of corresponding values of clusterings: α = (α0, α1 ... αj-1, αj ... αm) where α0 = 0 and αj-1 ≤ αj

The subscript j (j = 0, 1, ... m) denotes the level of the HCS. A clustering is a partitioning of n objects into a set of non-overlapping clusters. In Fig. 1 n = 7 and the integers 1, 2 ... 7 are used as labels for the 7 objects. Each cluster in a clustering is delineated by a parenthesis. At the lowest level, where j = 0 (see the row for j = 0 in Fig. 1), each cluster consists of just a single object. The corresponding clustering - C0 - then consists of n clusters. This is called the weak clustering; it is really a dummy (trivial) clustering and has no empirical significance. At level 1 there is one non-trivial cluster, (12), the remaining clusters in C1 again consist of single objects, so there are 6 clusters in the clustering C1 (see the row for j = 1).

The most important property of a HCS is that the clusterings are ordered, which is expressed:

Cj-1 < Cj

The relation "<" can be interpreted as an inclusion relation. To say that Cj-1 is included in Cj is a shorthand way of expressing that every cluster in Cj-1 is included in a cluster in Cj. Stated otherwise: every cluster in Cj is either a cluster in Cj-1 or a merging (union) of clusters in Cj-1. As the level j increases the clusters in clustering Cj will also "increase" in the sense that they will contain more objects. The clusterings might then also be said to "increase" or to get "less weak". Finally we get what is called the strong clustering, which consists of just one cluster. This cluster then contains all the objects (C6 in the example). Just as the weak clustering it is a dummy clustering without empirical significance. The clusterings C1 to Cm-1 are non-trivial. Different groupings of the objects (cfr. the use of parentheses) in each clustering will define different trees for a given n. In the example the clustering C5 consists of two clusters (123) and (4567). The three clusters at the preceding level C4 - (123), (45) and (67) - are included in C5. (123) is of course "included" in (123). Both (45) and (67) are included in (4567), the latter being a merging of the two former clusters. This property of inclusion makes it possible to represent a HCS as a tree. At a given level j some clusters from previous levels are merged. These clusters are represented as nodes.
The merging is represented as branches from the “previous” cluster nodes to the node at level j which then represents the “new” cluster being formed. In the example the node at level j = 5 represents the cluster (4567). The two branches from this node to the level 3 and level 4 node signifies that (45) is formed at level 3, (67) at level 4 and that these are merged at level 5 as stated above. If the new clusters were not formed by merging of previous ones, we could not represent the structure as a tree. If we had for instance (45) at one level and (56) at the next level, then of course the structure would not be a HCS since (45) is not included in (56). So far α the values of clusterings - have not been discussed. Some properties of HCS, as in Section 5.2, may be stated independent of these values. The α values are used to define distances between objects in a HCS structure. The distance between objects (x, y), d(x, y), is defined: (1) d(x, y) = αj where j is the least integer such that x and y are in the same cluster in clustering Cj. In the example 4 and 5 are necessarily in the strong clustering C6. The least integer j where they are in the same cluster is 3, and in Fig. 1 we see that α3, =d (4,5) = 14. Let us now see how equation (1) satisfies the distance axioms (cfr. p. 11-12). The definition of distance immediately implies that d(x, x) = 0 ( since α0 by definition is 0). The first distance axiom also states that if d(x, y) = 0 then x = y. This requires that α1> α0 (if α1 = 0 then distinct objects x and y could have distance 0). Unless otherwise stated it will be assumed that α>0. Concerning the second distance axiom it is immediately evident that d(x, y) = d(y, x). The most important of the distance axioms is the triangular inequality. Johnson shows that the distance definition for a HCS satisfies a much stronger version than the usual statement of this inequality, what he calls the ultrametric inequality. This inequality is simple to illustrate. Let: (2) d(x,y) = αj , d(y,z) = αk Where we assume αj ≤αk We then know that x and y are in the same cluster in Cj and y and z in the same cluster in Ck. Since Cj is included in Ck x must be in the same cluster as y and z in Ck. (The cluster containing x and y must Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 115 increase from Cj to Ck, x can then not be "dropped" from the cluster). Both x and y must join z in a cluster at the same level. This implies according to the definition of distance: d(x, z) = αk The usual statement of the triangular inequality is: d(x, z) ≤ d (x ,y) + d(y, z) but we have shown : (3) d(x, z) = d(y, z) and then clearly d x, z) < d(x, y) + d(y, z). Johnson (1967, p. 245) states the ultrametric inequality: (4) d(x, z) ≤ max (d(x, y), d(y, z)) Equations (2) and (3) are an alternative formulation. As Johnson points out the ultrametric inequality is clearly stronger than the triangular inequality in the sense that the ultrametric inequality establishes a much smaller upper bound for d(x, z) than is generally required by the triangular inequality. (In this sense the "weakest" requirement of the triangular inequality would be: d(x, z) = d(x, y) + d(y, z) that is: y between x and z on a straight line, cfr. p. 197). From the HCS definition of distance it is straightforward to map a given HCS into a distance matrix as shown in Table 1 for our example. Table1. The distance matrix for the HCS in Fig. 1. 
     1    2    3    4    5    6    7
1    0    2    6   26   26   26   26
2    2    0    6   26   26   26   26
3    6    6    0   26   26   26   26
4   26   26   26    0   14   22   22
5   26   26   26   14    0   22   22
6   26   26   26   22   22    0   18
7   26   26   26   22   22   18    0

Note the large number of ties in this matrix. This is a characteristic feature of a distance matrix satisfying a HCS, a direct consequence of the definition of distance. When two clusters are merged all the distances between the objects in one cluster and the objects in the other cluster must be equal. In the example all distances between for instance (123) and (4567) must be equal to 26.

This feature makes it simple to reconstruct a HCS from a distance matrix such as the one above. The matrix is successively condensed. At each step the smallest distance in the matrix, αj, is picked and the clusters (which may be single objects) with distance αj are merged. This creates Cj from Cj-1. If d(x, y) = αj is picked at Cj-1 the ultrametric inequality implies that the distance between (x, y) and another object z is uniquely defined.35 Since d(y, z) = d(x, z) it is natural to define d((x y), z) = d(y, z) = d(x, z). This process of condensation is illustrated at two stages in Table 2.

35 x, y and z may be clusters containing more than a single object.

Table 2. Illustration of condensation in finding the HCS from the distance matrix.

C1      (12)    3    4    5    6    7
(12)      0     6   26   26   26   26
3         6     0   26   26   26   26
4        26    26    0   14   22   22
5        26    26   14    0   22   22
6        26    26   22   22    0   18
7        26    26   22   22   18    0

C4      (123)  (45)  (67)
(123)      0    26    26
(45)      26     0    22
(67)      26    22     0

At stage 1 the only non-trivial cluster is (12), giving the matrix labelled C1. In this matrix 6 is the smallest distance, and this gives the cluster (123) in C2. In the matrix corresponding to C2, 14 is the smallest distance, and this gives the clustering (123), (45), (6), (7) in C3. Picking 18 in C3 gives the clustering (123), (45), (67) - the corresponding condensed matrix is labelled C4 in Table 2.

For empirical matrices the ultrametric inequality will probably never be strictly satisfied. In the process of condensation it will not generally be the case that d(x, z) = d(y, z) when x and y are merged to one cluster. There will not be the large number (and pattern) of ties which the ultrametric inequality demands. Methods for constructing a HCS which "approximates" the structure in the data matrix will not be discussed in any detail in the present work. Suffice it to mention that Johnson recommends two extreme strategies: a) always pick the smallest of the dissimilarities to be merged, the minimum method, b) always pick the largest of the dissimilarities to be merged, the maximum method. If the two strategies give "closely similar" results this gives some reassurance that HCS is an adequate model. Goodness of fit and a complete discussion of HCS from the point of view of the metamodel will be discussed at another occasion.

Johnson does not explicitly treat the question of the possible number of clusterings (m+1) in relation to n. The number of clusters in Cj must be at least one less than the number of clusters in Cj-1. Since the process stops when all the objects have been merged to one cluster there can be at most m+1 = n clusterings (the process starts with n clusters in C0). If for a given numerical value αj more than two clusters are merged to a single cluster this implies that m is correspondingly reduced.
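The condensation just described is mechanical enough to be written out in a few lines. The following is a minimal sketch in Python, not Johnson's program; the function name condense and the dictionary representation of the distance matrix are illustrative choices only. For an exact HCS matrix all cross-cluster distances are tied, so the max in dist is immaterial; replacing it by min would correspond to the minimum method mentioned above for empirical data.

    def condense(objects, d):
        """objects: list of labels; d: dict mapping frozenset({x, y}) -> distance."""
        clusters = [frozenset([o]) for o in objects]      # C0, the weak clustering
        levels = [(0, list(clusters))]                    # (alpha_j, clustering C_j)

        def dist(a, b):
            # Distance between two clusters; for an exact HCS matrix all cross
            # pairs are tied, so maximum and minimum method coincide.
            return max(d[frozenset([x, y])] for x in a for y in b)

        while len(clusters) > 1:
            alpha = min(dist(a, b) for i, a in enumerate(clusters)
                                    for b in clusters[i + 1:])
            merged = []
            for c in clusters:
                # join c with every group already formed that it meets at distance alpha
                hits = [g for g in merged if dist(c, g) == alpha]
                new = c.union(*hits) if hits else c
                merged = [g for g in merged if g not in hits] + [new]
            clusters = merged
            levels.append((alpha, list(clusters)))
        return levels

    # Applied to the distance matrix of Table 1 (d[frozenset({1, 2})] = 2, etc.)
    # the recovered levels have alpha = 2, 6, 14, 18, 22, 26, i.e. the HCS of Fig. 1.

When several minima coincide the sketch merges more than two clusters at one level, which is the situation considered next.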
If for instance in the example d(1,2) = d(1,3) = 2 - then d(2,3) must also be 2 - the three clusters (1), (2) and (3) are merged to (123) at C1 and there will be at most m+1 = n-1 or 6 clusterings. In such cases there will be more than two branches from a single node in the tree representation. We always have that the number of clusterings will equal the number of nodes in the tree plus one (since there is no node corresponding to the weak clustering C0). Two cases can then be distinguished:

a) Binary tree. The number of clusterings: m+1 = n. Always two branches from each node in the tree (as in the example here used).
b) General tree. The number of clusterings: m+1 < n. More than two branches from some (or all) of the nodes in the tree. Fig. 3 in Ch. 1 gives an example.

Unless otherwise specified binary trees will be assumed in the following.

Summary

In this section a HCS is defined and illustrated. The central notion in a HCS is the concept of an ordered set of clusterings. Another concept is a set of ordered values - α. These values are the basis for the definition of distance between objects. Since the clusterings are ordered (by an inclusion relation) this distance function satisfies a stronger form of the triangular inequality - the ultrametric inequality.

In the next section we discuss a specific implication of the inclusion relation between clusterings. The set α is then not considered.

5.2. HCS and the Guttmann scale.

In this section it will be shown that a HCS - considered as a sequence of ordered clusterings - is isomorphic to a Guttmann scale. In order to prove this some concepts developed in a recent paper by Johnson on "Metric clustering" (Johnson, 1968) are very helpful, as are also some similar and more general concepts in an important article by Restle (1959). The main concepts are illustrated by the same tree structure as used in the previous section. It is simplest not to include clusters consisting of just single objects in the clusterings.

Table 3. Illustration of "height" and "distance between clusterings".

j    Cj - clusterings            h(Cj) = h(Cj ∩ Cj+1)    d(Cj, Cj+1)
0    (single objects only)               0                    1
1    (1 2)                               1                    2
2    (1 2 3)                             3                    1
3    (1 2 3) (4 5)                       4                    1
4    (1 2 3) (4 5) (6 7)                 5                    4
5    (1 2 3) (4 5 6 7)                   9                   12
6    (1 2 3 4 5 6 7)                    21                    -

h(Cj) is called the height of a clustering by Johnson. This term is probably inspired by "weak" and "strong" clustering as defined in the previous section. A weak clustering has height 0 and height is maximum for a strong clustering. More precisely height is defined in terms of incidence matrices. An incidence matrix is a symmetric n by n matrix with entries 1 for all pairs of objects which are in the same cluster, 0 otherwise. The diagonal consists of 0 entries. Height is then defined: h(Cj) = sum of the incidence matrix for Cj. This is the same as the total number of relations "within clusters". Since the incidence matrix is symmetrical it is sufficient to consider the offdiagonal halfmatrix. Examples of incidence matrices (offdiagonal halfmatrices, for C4 and C5) are given in Table 4.

Table 4. Examples of incidence matrices.

C4       1   2   3   4   5   6          C5       1   2   3   4   5   6
2        1                               2        1
3        1   1                           3        1   1
4        0   0   0                       4        0   0   0
5        0   0   0   1                   5        0   0   0   1
6        0   0   0   0   0               6        0   0   0   1   1
7        0   0   0   0   0   1           7        0   0   0   1   1   1

If there are s clusters in Cj, with ni objects in cluster i, the following formula can be used to compute h(Cj):

(5) h(Cj) = (n1 choose 2) + (n2 choose 2) + .... + (ns choose 2)

In the example we have for instance h(C5) = (3 choose 2) + (4 choose 2) = 3 + 6 = 9.
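Equation (5) is easy to check mechanically. The following minimal Python sketch (the function name height is an illustrative choice, not Johnson's) computes heights directly from cluster sizes; singleton clusters can be omitted, as in Table 3, since they contribute (1 choose 2) = 0.

    from math import comb

    def height(clustering):
        # Equation (5): the height of a clustering is the number of within-cluster
        # pairs, i.e. the sum of (n_i choose 2) over its clusters.
        return sum(comb(len(cluster), 2) for cluster in clustering)

    # Non-trivial clusters of C4 and C5 from Table 3:
    C4 = [{1, 2, 3}, {4, 5}, {6, 7}]
    C5 = [{1, 2, 3}, {4, 5, 6, 7}]
    print(height(C4), height(C5))   # 5 and 9, the values listed in Table 3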
See Table 3 for other values of h(Cj). For the strong clustering h reaches its maximal value:

h(Cm) = (n choose 2)

since all the objects then are in one cluster. This can be used to norm the h values (dividing all of them by (n choose 2) as Johnson does) but this is immaterial in the present context.

Johnson does not explicitly treat a clustering as a set. This is, however, very much implied by his definitions. The elements of a (clustering) set are the ones in the corresponding incidence matrix. His concept of height is then a measure of the set (see Restle, 1959, p. 208), the measure simply being the number of elements. The weak clustering corresponds to the empty set since it has no elements.

Johnson defines the intersection of two clusterings as the largest clustering contained in both. In terms of incidence matrices the intersection of two clusterings corresponds to the ones which are common to both matrices. This corresponds to the standard definition of intersection as the set containing just the common elements. Johnson's definition of distance between clusterings:

(6) d(Ci, Cj) = h(Ci) + h(Cj) - 2h(Ci ∩ Cj)

is exactly the same as Restle's definition of distance between sets, and the specific proof Johnson has that this measure satisfies the axioms for a distance function is implied by a more general proof by Restle (1959, p. 209-210).

Restle has a general definition of betweenness which is useful. Writing S' for the complement of a set S, Sj is between Si and Sk if and only if:

(7) Si ∩ Sj' ∩ Sk = Ø, that is: Si and Sk have no common members which are not also in Sj

(8) Si' ∩ Sj ∩ Sk' = Ø, that is: Sj has no unique members which are in neither Si nor Sk.

Restle (1959, p. 212) then specifically considers "the special case of nested sets - the Guttmann scale" where for S1, S2, .... Sn we have Si ⊂ Si+1 for i = 1, 2, ..., n-1. In this case it is simple to see that for i < j < k (as will be assumed in the following) Sj is between Si and Sk.

Fig. 2. Illustration of a nested sequence of sets.

Equation (7) can be written Si ∩ Sk ⊂ Sj, which is clearly satisfied by a nested sequence since Si ∩ Sk = Si and by definition Si ⊂ Sj. All common members of Si and Sk (simply Si) must also be in Sj. Equation (8) can similarly be written: Sj ⊂ Si ∪ Sk. It is evident that Sj has no unique members since Sj ⊂ Sk by definition.

In the case where all triples of sets satisfy the betweenness relation Restle shows that:

(9) d(Si, Sj) + d(Sj, Sk) = d(Si, Sk)

that is: distances are additive and in an abstract sense the sets can then be mapped as points on a straight line, or the set theoretical definition of betweenness corresponds to betweenness on a straight line.

Since now every cluster in Ci is included in some cluster in Cj, all elements in the set Ci must also be elements of the set Cj (see for instance the incidence matrices for C4 and C5 in Table 4). The sets Cj increase in the simple sense that new elements are added to the "old ones" as j increases. We have then shown: When each clustering is regarded as a set, the sequence of clusterings forms a linear array and can be mapped as points on a straight line. This can also be seen more directly from the definition of distance between clusterings.
Since clustering sets are included in each other (for i < j):

(10) h(Ci) = h(Ci ∩ Cj)

Inserting this in the distance definition, equation (6), gives:

(11) d(Ci, Cj) = h(Ci) + h(Cj) - 2h(Ci) = h(Cj) - h(Ci)

Similarly d(Cj, Ck) = h(Ck) - h(Cj) and d(Ci, Ck) = h(Ck) - h(Ci), which gives:

(12) d(Ci, Cj) + d(Cj, Ck) = d(Ci, Ck) Q.E.D.

Notice that d(Ø, Cj) = h(Cj) - h(Ø) = h(Cj) and d(Ø, C0) = h(C0) = 0.

The measures of the clustering sets - the heights - map the sets on a straight line. If we accept that all elements in a clustering set are weighted equally (simply added) the h values can be regarded as an interval scale representation of the clusterings. The endpoints, h(C0) = 0 and h(Cm) = (n choose 2), which are empty of empirical meaning, correspond to the two "degrees of freedom" in an interval scale.

Structural characteristics of a given HCS might be studied by considering the differences in distance between succeeding clusterings, cfr. the last column in Table 3. In our example we note that d(C5, C6) is very much larger than any of the other intervals. This is because clusters containing roughly the same number of objects are joined in the last clustering. We may note that:

(13) d(Cj, Cj+1) = h(Cj+1) - h(Cj) = nj1 · nj2

where nj1 and nj2 are the numbers of objects in the two clusters merged in Cj+1. The new cluster in Cj+1 contains (nj1+nj2)·(nj1+nj2-1)/2 within-cluster pairs, and since nj1·(nj1-1)/2 and nj2·(nj2-1)/2 of these pairs were already counted in h(Cj), the new cluster adds (nj1+nj2)·(nj1+nj2-1)/2 - nj1·(nj1-1)/2 - nj2·(nj2-1)/2 = nj1·nj2 elements. In Table 3 we have for instance: d(C5, C6) = 3 · 4 = 12. Equation (13) might perhaps be useful in a description of structural characteristics of trees.

While clusterings can be said to form a linear array it is clearly not the case that the objects conform to any linear order. Indeed, it would represent a misunderstanding of the concept of a tree to impose any linear order on the "horizontal" sequence of objects in presenting a tree, since the specific sequence is largely arbitrary. Consider for instance the different presentations of the same tree structure in Fig. 3.

Fig. 3. Different presentations of the same tree structure.

Just from Fig. 3a) it might have been tempting to regard 1, 2, 3, 4, 5 as a linear order, but Fig. 3b) and c) clearly show that this is unwarranted, and there are still further ways of representing the same tree structure. In depicting a tree we do not want branches to cross each other. This of course implies some constraints on the order in which we may list the objects. (In the example above it is for instance not possible to list any of the objects 3, 4 or 5 "between" 1 and 2, since the branch from such an object would then cross the branches joining 1 and 2.)

Johnson's computer program for HCS analysis illustrates the degrees of freedom in an important way. In his program the object labelled n (the last row in a similarity matrix) is always printed at the extreme right of the paper. Since it is completely arbitrary which object the experimenter labels n, this implies that any of the objects may be placed as the last. (Or as the first, since it is evident that a tree for a given HCS might be "reversed".) In the example above the same HCS is represented with 5, 4 and 3 respectively as the last object. Since a given object can always be placed last for a given HCS, no object can then be between two other objects. Stated otherwise: no three objects in a HCS can be represented on a straight line.
This can also be seen directly from the ultrametric inequality. Suppose the contrary: that y is between x and z and that d(x, y) = αj < d(y, z) = αk. d(x, y) + d(y, z) = d(x, z) then implies that d(x, z) = αj + αk, which according to our statement of the ultrametric inequality is impossible, since this inequality requires that d(x, z) = αk, cfr. equations (2) and (3).

Since it is convenient to list clusterings vertically, cfr. Fig. 1, we might say more informally that a tree can be considered as a linear order from a "vertical point of view" when clusterings are considered as units. It has been shown that this is an implication of the fact that "the clusterings increase hierarchically" (Johnson, 1967, p. 243). A similar notion is implied by the familiar Guttmann scale: objects are rank ordered and objects with higher ranks have all properties of objects with lower rank plus some more. From a "horizontal point of view", however, when objects are considered as units, the sequence of objects (or leaves in a tree) is to a large extent arbitrary.

Since no three objects can be mapped on a straight line one might wonder whether a dimensional representation of the objects is at all appropriate. Is the ultrametric inequality so strong (cfr. p. 115) that it precludes meaningful spatial representation of the objects? In the next section we argue that n-1 dimensions are required to represent n objects (conforming to a HCS) as n points in a metric space. This will set the stage for a general discussion of HCS and spatial models in Ch. 6.

5.3 A dimensional representation of the objects in a HCS (for binary trees) - a tree grid matrix

It will be recalled that a HCS (now including the clustering values α) can be mapped into a distance matrix. It is possible to give a dimensional representation of this distance matrix which is simple in the sense that the coordinates are closely related to the α values and the pattern of values in the coordinate matrix clearly reflects the tree structure. It is not "simple", however, from the point of view of the number of dimensions required, since this equals n-1. We first explain the nature of the coordinate matrix (which alternatively will be referred to as a tree grid matrix). The underlying space is not the usual Euclidean one but the l∞ metric (dominance model, cfr. Section 1.41). After showing that the coordinate matrix maps into the proper distance matrix we prove that - at least for the l∞ metric - the dimensionality can not be reduced. The coordinate matrix for the HCS used in Section 5.1 is presented in Table 5.

Table 5. Coordinate matrix (tree grid matrix) for the HCS in Fig. 1, l∞ metric.

                            objects
dimension   node        1     2     3     4     5     6     7
6           1          +1    -1     0     0     0     0     0
5           2          +3    +3    -3     0     0     0     0
4           3           0     0     0    +7    -7     0     0
3           4           0     0     0     0     0    +9    -9
2           5           0     0     0   +11   +11   -11   -11
1           6         +13   +13   +13   -13   -13   -13   -13

The matrix is oriented to bring forth as clearly as possible the similarity to the presentation in Fig. 1. We notice that there is a strict one-to-one correspondence between dimensions and nodes in the tree representation. The dimension corresponding to the root node will be seen to be most important, or to have the largest scope; it is the most general dimension. This will be referred to as the first dimension, cfr. dimension 1 in the bottom row in Table 5. The dimensions are labelled inversely with respect to levels so that dimension j corresponds to level (node) n-j.
Generally "lower" dimensions will have larger scope than "higher" dimensions, though some dimensions are not ordered with respect to scope. (In Table 5, e.g., dimensions 2 and 4 are thus ordered, while this is not the case for dimensions 4 and 5.)

On the first dimension all the points (which represent objects) subsumed under the left branch from the root node have the value +αn-1/2, and the values are -αn-1/2 for points subsumed under the right branch. The first dimension thus partitions the n points in two subsets with n11 and n12 points respectively, where n11 + n12 = n. The next dimension has non-zero values only for one of the subsets formed by the first dimension. All objects subsumed under the left branch corresponding to node n-2 are represented with the value +αn-2/2, and the values are -αn-2/2 for objects under the right branch. There will be n21 points with values +αn-2/2 and n22 points with values -αn-2/2 where n21 + n22 = n11 or n12. The rest of the objects are represented with value 0 on the second dimension.

All the higher dimensions will have non-zero values only for one of the subsets previously formed. For dimension j the objects subsumed under the left branch of node n-j are represented by the value +αn-j/2, and -αn-j/2 for objects under the right branch. There will be nj1 "+ values" and nj2 "- values". Each nja (a = 1, 2) containing more than one object will in turn be the sum of the two parts of a partition formed by some higher dimension. (The coordinates in each dimension might be multiplied by -1; the assignments "left and +", "right and -" are arbitrary.)

We now show that the type of coordinate matrix discussed above maps into the proper distance matrix. First all the distances between the subsets formed by the first dimension are computed. The difference between coordinates is the same for all pairs of objects belonging to different subsets:

αn-1/2 - (-αn-1/2) = αn-1

Since αn-1 is larger than any of the coordinate differences for higher dimensions, all the distances between the first subsets equal αn-1, as they should. (Remember that only the largest difference is relevant in computing a distance according to the definition of the l∞ metric.) In the example all the distances between (123) and (4567) are computed from the first dimension; they are seen to be 13 - (-13) = 26, cfr. Table 1.

Consider next the higher dimensions. For any dimension j the distances between the two subsets formed by this dimension (containing nj1 and nj2 objects) are computed. First note that the non-zero values on lower dimensions will be tied: the union of the subsets for dimension j forms one subset (with the same value for all elements) of a lower dimension. Consequently the lower dimensions can not contribute to the nj1 x nj2 distances which are computed from dimension j. Second, since the non-zero values for dimension j are larger than the non-zero values for higher dimensions, the distances between the nj1 and nj2 objects will be simply αn-j, as they should. In our example all the distances between the subsets formed by e.g. dimension 2 will be 22. When a dimension forms subsets of just one object each, only one distance will be computed from this dimension. This is the case with dimensions 3, 4 and 6 in the example.

We have now shown that a grid matrix with n-1 dimensions gives a dimensional representation of the distance matrix from a given HCS in a l∞ metric.
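This mapping is easy to verify numerically. The following minimal sketch (assuming numpy; the variable names are illustrative only) builds the tree grid matrix of Table 5 and recomputes all pairwise distances under the l∞ (dominance) metric; the result is the distance matrix of Table 1.

    import numpy as np

    # Tree grid matrix of Table 5: rows are dimensions 6, 5, ..., 1 (nodes 1, 2, ..., 6),
    # columns are the objects 1-7; dimension j carries +/- alpha(n-j)/2 on the two
    # branches of node n-j and 0 outside that node's scope.
    G = np.array([
        [  1,  -1,   0,   0,   0,   0,   0],   # dimension 6, node 1, alpha = 2
        [  3,   3,  -3,   0,   0,   0,   0],   # dimension 5, node 2, alpha = 6
        [  0,   0,   0,   7,  -7,   0,   0],   # dimension 4, node 3, alpha = 14
        [  0,   0,   0,   0,   0,   9,  -9],   # dimension 3, node 4, alpha = 18
        [  0,   0,   0,  11,  11, -11, -11],   # dimension 2, node 5, alpha = 22
        [ 13,  13,  13, -13, -13, -13, -13],   # dimension 1, node 6 (root), alpha = 26
    ])

    # l-infinity (dominance) distance: only the largest coordinate difference counts.
    n = G.shape[1]
    D = np.array([[np.abs(G[:, x] - G[:, y]).max() for y in range(n)] for x in range(n)])
    print(D)   # reproduces Table 1, e.g. d(1,2) = 2, d(4,5) = 14, d(1,4) = 26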
The representation is "simple" in the sense that the non-zero values for dimension j correspond to the branches from node n-j. The essence of the argument given above is that quite literally lower dimensions dominate the higher ones. When computing distances the higher dimensions do not contribute anything if the objects have different values on a lower dimension. A basic point is that a large number of zeros for a given dimension signifies a limited scope, or range, for this dimension. There are few distances which are influenced by such dimensions. The zero values should be taken to signify that the dimension is irrelevant for the corresponding domain of objects. The psychological significance of "irrelevance" and "scope" (range) will be prominent in Section 6.2.

While the tree grid matrix is appealing since it is so closely related to the tree representation, it is not simple from the point of view of the number of dimensions involved. Indeed, it is well known that any set of n points (satisfying the triangular inequality) can be represented in n-1 dimensions. (See for instance Torgerson, 1958.) When applying spatial models one usually hopes that the number of dimensions is far less than n. Is it possible to give a dimensional representation in less than n-1 dimensions? This will be the case if we can show that a tree grid matrix has rank less than n-1, or alternatively that the n-1 row vectors in a grid matrix are linearly dependent. If this is the case one (or more) of them can be represented as linear combinations of the others and the dimensionality can be correspondingly reduced. Conversely, if the n-1 row vectors are linearly independent dimensionality can not be reduced. We will give a proof that the n-1 vectors are linearly independent, that is:

Theorem: In a l∞ metric it is not possible to represent the n elements in a binary tree in less than n-1 dimensions.36

36 If this was not the case some of the nodes would be redundant. The reader may skip the proof.

First we give a proof for our example of a HCS and then we suggest a general proof. The proof is a bit tedious but a straightforward application of linear algebra. For dimension j the vector of coordinates will be written Xj. The non-zero values in a row will be written xj and -xj. In the example we then have:

                     object
                1     2     3     4     5     6     7
(14) X1 =     ( x1    x1    x1   -x1   -x1   -x1   -x1 )
     X2 =     ( 0     0     0     x2    x2   -x2   -x2 )
     X3 =     ( 0     0     0     0     0     x3   -x3 )
     X4 =     ( 0     0     0     x4   -x4    0     0  )
     X5 =     ( x5    x5   -x5    0     0     0     0  )
     X6 =     ( x6   -x6    0     0     0     0     0  )

A necessary and sufficient condition for a set of n-1 vectors to be linearly independent is that the only way of satisfying the vector equation:

(15) k1X1 + k2X2 + .... + kn-1Xn-1 = 0

is that all the scalars kj (j = 1, 2, ..., n-1) must be 0. This is what will be proved. If on the other hand equation (15) could have been satisfied with at least one kj different from 0, the vectors would have been linearly dependent; at least one of them could then have been expressed as a linear combination of the others and the dimensionality could then be reduced. Equation (15) implies that the separate equations for each object must be satisfied.
We insert equation (14) in equation (15) for the objects subsumed under the left branch from the root node and get:

(16)  k1x1 + k5x5 + k6x6 = 0      (1)
      k1x1 + k5x5 - k6x6 = 0      (2)
      k1x1 + k5x5        = 0      (12) = (1) + (2)
      k1x1 - k5x5        = 0      (3)
      k1x1               = 0      (123) = (12) + (3)

(1) in equation (16) is the equation for object 1, similarly (2) for object 2. Since (1) and (2) are only differentiated by X6, we see that the last term drops out when we add (1) and (2) in equation (16). This new equation corresponds to the cluster (12) and is labelled accordingly. Similarly (12) and (3) are only differentiated by X5, so by adding (12) and (3) it is evident that k1 = 0. Having first gone up we then go down. By inserting k1 = 0 in (3) in equation (16) we get k5 = 0, and by further inserting k1 = k5 = 0 in (1) in equation (16) we finally get k6 = 0.

In order to better trace the details of this reasoning and thus get a general view of the line of thought it may be instructive to study Fig. 4. The branches are labelled in such a way that it is easy to see which terms go into the equation for any of the objects, e.g. k1, k5 and k6 for object 1. Generally we start from a cluster composed of single objects. Adding equations (going up) the term for the differentiating dimension drops out. When the cluster thus "grows" a new equation is added and again a differentiating dimension drops out. This process is repeated till we get to a cluster separated by a single branch from the root node. It is then evident that k1 = 0, and by then going down it is seen that the k's nested under k1 must also disappear.

Below the same process is shown for the right part of the tree (not in complete detail):

      -k1x1 + k2x2 + k4x4 = 0     (4)
      -k1x1 + k2x2 - k4x4 = 0     (5)
      -k1x1 + k2x2        = 0     (45)
      -k1x1 - k2x2 + k3x3 = 0     (6)
      -k1x1 - k2x2 - k3x3 = 0     (7)
      -k1x1 - k2x2        = 0     (67)
      -k1x1               = 0     (4567)

(Forming (4567) was not strictly necessary since we already knew that k1 = 0; this step was just added for additional clarity.)

In general we start by considering two single objects c0 and c01 which are differentiated just by xim. Adding the equations for these elements gives the equation for cluster c1, where xim drops out:

(17)  ±k1x1 ± ki1xi1 ± .... ± kim-1xim-1 = 0      c0 + c01 = c1
      ±k1x1 ± ki1xi1 ± .... ± kim-2xim-2 = 0      c1 + c11 = c2
      ---------
      ±k1x1 ± ki1xi1                     = 0      cm-2 + cm-2,1 = cm-1
      ±k1x1                              = 0      cm-1 + cm-1,1 = cm

where im > im-1 > im-2 > .... > i1 > 1. When going up, the cluster c1 grows by having added a cluster c11 which then gives a new cluster c2. This process is repeated till we finally reach cm. For each new cluster cj a differentiating term xim-j drops out. cm then implies k1 = 0. We then work down: cm-1 implies ki1 = 0 etc., till finally c1 implies kim-1 = 0 and at last c0 (or c01) implies kim = 0.

If at any stage in the process described by equation (17) cj1 is composed of more than one element we must arrive at one equation for cj1 by a process similar to the first equations in (17). Some dimensions different from ij (j = 1, 2, ..., m) will be involved here. When we get back to cj1 the k's for these dimensions must be separately traced and will similarly be seen to be 0. The complete process is recursive; the main process which leads to cm must be used to arrive at cj1 and perhaps again for subclusters of cj1.
cm will correspond either to the left or right branch from the root node, cfr. the ± notation in equation (17). Exactly the same argument can then be used for the other branch from the root node (starting with an arbitrary cluster of two elements nested under the other branch from the root node). We have then shown that starting from equation (15) and a n - 1 dimensional representation of a tree we must have: k1 = k2 = …. kj =…..kn-1 = 0 According to the definition of linear independence all the dimensions or vectors, Xj are then linearly independent and the theorem stated on p. 122 is thus established. This proof assumes a l∞ metric. Can it be the case that a smaller dimensionality will suffice for another lp metric? I have not seen any way of proving that this can not be the case. It does not, however, appear likely. An indirect approach is to study dimensional representations of tree structures in the most popular Euclidean (l2) metric. A large number of analyses of distance matrices generated from HCS structures have shown that in every case n-1 dimensions are necessary. These runs are part of a more specific comparisons of tree structure models and spatial models, the details of which will be reported elsewhere. Suffice it here to note that the Euclidean representation does not reveal the Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 126 structure as tree grid matrices do. Depending upon the type of tree it will be more or less difficult to “decode” the structure as a tree" This problem will be further commented in Section 6.41. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 127 Chapter 6 FILLED, UNFILLED AND PARTIALLY FILLED SPACES 6.1. A discussion of HCS and spatial models. Any given similarity matrix may be analyzed either by a spatial model or a tree structure model. Are these models exclusive in the sense that if one of them fits the data the other will not? More generally do they represent different theories of underlying cognitive structure? Miller (1967) mentions as one of the problems in need of clarification the relation between factor analysis (a spatial model) and tree structures. In this chapter we will throw some light on the similarities and differences between the two approaches. This will lead to an outline of a general model. When I started to investigate this problem, I first noticed that no element y in a HCS can be placed between two other elements x and z on a straight line (cfr. p. 120) .The next step was then to consider a simple tree with just 4 objects and the corresponding distance matrix, cfr. Fig. 1. Fig.1. A tree with 4 objects and the corresponding 3 dimensional spatial representation. It is fairly simple to see that this tree can not be represented in a two-dimensional Euclidean space. Three dimensions are necessary as illustrated in part c) of Fig. 1. When 3 objects require 2 dimensions, 4 objects 3 dimensions, it was tempting to guess that generally n-1 dimensions would be necessary to represent n objects as points. Intuitively it appeared that the large number of ties in the distance matrix required by the ultrametric inequality would “force” the dimensionality upwards when considering trees with new objects added. When I suggested this argument to Johnson he pointed out (personal communication): " - however, n - 1 scalars (one dimension) suffice if you expand slightly the notion of dimension. 
For example to place n points in one dimension we may pick an order and n-1 interpoint distances. In clustering we pick a tree and n node distances: the information seems to be about the same. An interesting question: is there a twodimensional clustering representation in the above sense.” This comment inspired the work reported in Section 5.2, where it was shown that a HCS could be regarded as a one-dimensional scale. Each clustering was mapped as a point on a line. This is really a Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 128 fairly simple consequence of the cumulative nature of the clusterings (regarded as sets). New elements are added to each new clustering in the ordered sequence and all the previous elements are retained. The distance between succeeding clusterings is simply the number of new elements added. The term "hierarchical" in HCS seems to refer mainly to this cumulative growth in the sequence of clusterings. Notice that the node distances (α values) mentioned by Johnson in the quotation above were not considered when regarding HCS as a onedimensional scale. There may perhaps be other ways of regarding a HCS as a one-dimensional scale. (Perhaps new elements added to a clustering set could be weighted by the corresponding α value.) If, however, not clusterings but objects are to be mapped as points not the Euclidean but the dominance metric was found to be particularly appropriate. Each node in a tree then corresponds to a dimension. Objects subsumed under the left branch from node j were given the value αj /2 and objects under the right branch the value - αj /2 on dimension n-j. (The first dimension which is most important corresponds to the root node, value α n-1). The l∞ metric insures that only one dimension is relevant in computing the distance between any two objects. How are these different representations related: a onedimensional representation of clusterings and a n-1 dimensional l∞ representation of objects? The one-dimensional scale can be conceived of as a series of intervals (cfr. the last column in Table 3, Ch. 5). The last interval corresponds to dimension 1, the next last to dimension 2 and finally the first interval (C0 - C1) to the last dimension. (If the ∞ values were taken into account in defining the length of the intervals it could probably be shown that the length of the intervals were related to the variances of the corresponding dimensions). In Section 5.3 we proved that it is not possible to have less than n - 1 dimensions in representing n objects in l ∞ metric. This immediately implies that if we in a given tree insert a new object (and thus a new node) the dimensionality will increase. A tree with 3 objects can be represented as the corners of a triangle with two equal sides. An added fourth point x can not be "between" the three points (a, b and c) in the sense of being inside the triangle. The space inside the triangle can not be filled, it must be empty. (Indeed the whole plane formed by extending the triangle must be empty except for a, b and c). Generally objects can be regarded as the corners of a regular convex polyhedron in n-1 dimensions and no new point can be located inside this polyhedron. Stated otherwise: It we represent n objects forming a tree structure in a space the space can not be filled. Inherently there does not seem to be anything "wrong" with the spatial representation in the l∞ metric. 
Yet it should be strongly pointed out that the notion of a space which can not be filled does run counter to the general concept of space. This is stated as follows by Torgerson (1965, p. 385): spatial models tend to imply continuity. We tend to interpret a dimension or direction in the space as a continuous variable. Since space itself is nothing but a hole it seems to me that this assumption implies that the hole can be filled, the hole should not have unfillable holes in it. In our case the space is - apart from "skeleton points" - nothing but an "unfillable hole"! In a basic theoretical contribution on "the foundations of multidmensional scaling" Beals et. al., (1968) note in the beginning of the article that "the content and justification of multidimensional scaling have not been explored" and furthermore: "such representation carry strong implications that should not be overlooked" and a. warning is sounded: "if the necessary consequences of such models are rejected on empirical or theoretical grounds the blind application of multidimensional scaling techniques is quite objectionable." (op.cit. p. 28). More specifically they consider "metrics with additive segments (underlined here) where "segmental additivity" implies that for any distinct points x and z there exists a set Y of points such that for y in Y d(x, y) + d(y, z) = d(x, z). Clearly this is an aspect of continuity which is not satisfied by HCS. A main part of their article is devoted to stating ordinal assumptions from which the usual axioms for a distance function and segmental additivity can be derived. Some of these ordinal assumptions clearly imply continuity, one of them is for instance commented: "this condition implies that there are no "holes" in the set of stimulus objects". (op.cit. p. 131). Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 129 The main point in the present section is to bring to attention that spatial models imply continuity, while from a spatial point of view the objects in a HCS can only be mapped in an "unfillable" space. The usual notion of "dimension" - usually implied by spatial models -clearly implies continuity both from a common sense point of view and also in a large bulk of psychological work. Considering white and black shades of grey quickly comes to mind as an example. Also, we usually conceive of people not just as intelligent or stupid, but as varying in ability. In learning theory some notion of "habit strength" is usually prominent - again a dimensional - continuous concept. At this point we could reiterate the point made in Ch. 1, that tree structure and spatial models represent different types of model. Perhaps the metamodel, e.g. the selection rule stated on p. 22, might be developed to provide a simple test as to which (if any) of the two types of models was most appropriate for a given set of data. But a more fascinating inquiry is to ask: if HCS represent unfilled spaces, spatial models filled spaces, might there not be other, more general structures which subsume HCS and spatial models as special cases? Tentatively such structures may be labelled partially filled spaces. Section 6.2 will underscore the importance of partially filled spaces by giving examples of hierarchical models in cognitive psychology and then argue for the inadequacy of such models. After the outline of a general model in Section 6.3, some of the issues raised in Section 6.2 will be further developed in Section 6.42. 6.2. 
The inadequacy of tree structure models. Comments on tree grid matrices, G.A. Miller's semantic matrices and G.A. Kelly's Rep Grid.

There is a quite obvious isomorphy between tree grid matrices and a type of semantic matrices described by Miller (1967). This is best brought out by rearranging a demonstration example he gives.

Table 1. An example of a semantic matrix as a tree grid matrix. Adapted from Miller (1967, Figure 6, p. 47).

Semantic markers (features)
(+ / -)                         mother   cow   tiger   tree   chair   rock   fear   virtue
object - nonobject                 +      +      +       +      +       +      -       -
living - nonliving                 +      +      +       +      -       -      0       0
mental - characterological         0      0      0       0      0       0      +       -
plant - animal                     -      -      -       +      0       0      0       0
artefact - natural                 0      0      0       0      +       -      0       0
human - subhuman                   +      -      -       0      0       0      0       0
feral - domesticated               0      -      +       0      0       0      0       0

It is immediately evident that the general structure is the same in Table 1 and in Fig. 4, Ch. 5. (The only difference is that values of clusterings are excluded from Table 1, but that is unimportant in the present context.) So objects may be words and dimensions may be semantic features. Later Miller (1969), asking how the subjective lexicon is organized, has given some evidence that a collection of English nouns fairly well conforms to a tree structure model. In the present context we will concentrate on Miller's general presentation of hierarchical semantic systems and show that exactly parallel concepts are part of the impressive personal construct theory presented by Kelly (1955).

First we note that corresponding to features Kelly uses the term construct. Kelly's insistence on the dichotomous nature of constructs is conveniently captured in Table 1. What Miller refers to as values of features corresponds to poles of constructs in Kelly's terminology.37 The main methodological tool Kelly has provided is what he calls the Rep Grid. In most applications the objects are what in Lewinian terminology would be called "relevant persons in the subject's life-space" (self, spouse, father etc. etc.). Kelly usually refers to them as figures. Generally Kelly finds it convenient to say that constructs deal with events; figures are thus an example of events. So far the semantic matrix is formally identical to a Rep Grid, the former being flanked by features and items, the latter by constructs and figures. Whether the internal structure of a Rep Grid is similar to that of the type of semantic matrices Miller is especially interested in will be discussed later.

As a central aspect of hierarchical organization Miller (1969, p. 176) underscores that features from one path of a tree are not defined for items subsumed under different paths. As an example he points out that at one node in a taxonomic tree we have the animal-plant feature, F1. Further removed from the root there is the vertebrate-invertebrate feature, F2, which is subsumed under the path from animal. The vertebrate feature is not defined for plants. Generally "a feature F1 is said to dominate a feature F2 just in case F2 is defined only for items having a particular value of F1." (op.cit. p. 176). In Table 1 we should thus not regard features as three-valued functions; the zeros should be taken to imply that for these items the corresponding features are simply undefined.

Exactly the same idea has the status of a corollary in Kelly's theory: Range Corollary: A construct is convenient for the anticipation of a finite range of events only. (Kelly, 1955, p.
68): one may construe tall houses versus short houses, tall people versus short people, tall trees versus short trees. But one does not find it convenient to construct tall weather versus short weather, tall light versus short light, or tall fear versus short fear. Weather, light and fear are, for most of us at least, clearly outside the range of convenience of tall versus short." (op. cit. p. 69). Kelly is further quite explicit in differentiating between a contrast pole and what is outside the range of convenience. When discussing the personal construct approach to understanding what for instance respect may mean to a person Kelly points out that "we cannot understand what he means by “respect” unless we know what he sees as relevantly opposed". "We do not lump together what he excludes as irrelevant with what he excludes as contrasting. (op.cit. p. 71). Miller (1969) explicitly links the limited relevance of features to hierarchical organization. While Kelly is not equally explicit on such linkage he does have an organization corollary which is highly relevant to hierarchical organization: Organization Corollary: Each person evolves, for his convenience in anticipating events, a construction system embracing ordinal relationships between events" (Kelly, 1955, p. 56). Generally "there may be many levels of ordinal relationships with some constructs subsuming others and those in turn, subsuming still others. When one construct subsumes another, its ordinal relationship may be termed superordinal and the ordinal relationship of the other becomes subordinal”. (op.cit. p. 56-57) . We see that what Miller calls a dominating feature Kelly chooses to call a superordinal construct. "Higher concepts" will be used as a common term for “dominating features” and "superordinal constructs", “lower concept” similarly for "dominated features" and “subordinal constructs”. 37 A difference which will not be elaborated here is that features are provided by the linguist. Kelly, however, insists on eliciting the subject's own constructs, and he never claims that the illustrating examples he uses should apply to persons generally. Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 131 Looking closer at Kelly’s examples illustrating his orgnaization corollary, there are, however, some differences compared to Miller's treatment of hierarchical organization. Kelly distinguishes two types of ordinal relations. One construct may subsume another by: a) extending the cleavage intended by the other or b) abstract across the other’s cleavage line. As an example of a) good may subsume intelligent (and other things falling outside the range of convenience of intelligent) while similarly bad may subsume stupid (plus things which are neither intelligent nor stupid). It is not entirely clear how the type of structure a) implies is related to the kinds of trees we have considered so far, this will be further commented in Section 6.42. The examples Kelly gives to illustrate b) raise some interesting issues. As one example he states that the evaluative pole (of the evaluative-descriptive construct) may subsume the construct intelligentstupid (and others) while the descriptive pole may subsume for instance light-dark. While this seems formally similar to the type of hierarchical organization Miller discusses there is an interesting difference. If for instance a dominated feature applies to a specific item the higher or dominating feature will also apply. 
If vertebrate applies to a specific creature, then animal will apply as well. But if intelligent applies to a specific creature, evaluative will not apply to this creature, likewise if dark applies to a specific physical condition, descriptive will not apply to this condition. A construct "abstracting across" in the sense implied by this example is clearly a construct applying to its subordinal constructs but not to the events subsumed by the subordinal constructs. One way to capture the difference between Miller's and Kelly's examples is to say that the former show transitivity (F1 applies to F2, F2 applies to an item and F1 applies to the item) while the latter do not show this transitivity (evaluative applies to intelligent, intelligent applies to person x, but evaluative does not apply to person x). This distinction between transitive and not transitive relations in hierarchical organization may clarify an otherwise puzzling problem. It may be tempting to identify concepts at higher levels as somehow indicating more "abstract" thinking than concepts at lower levels. From a taxonomic point of view, however, fish is at a higher level than herring, yet few would say that pointing at a creature and saying “there is a fish” is more abstract than saying "there is a herring". For not transitive relations, however, the intuitive notion of more abstract thinking seems to apply to the superordinal construct, since it is a metaconstruct, a construct about a construct. According to Kelly such metaconstructs are of profound importance for understanding how persons may change (cfr. Fragmentation and Modulation Corollary). In the present context we just outline two kinds of change, particularly emphasized in the most important of the recent theoretical contributions to personal construct theory, Hinkle (1965). There is first what Hinkle calls slot change, e.g. one may shift from regarding self (or others) as subsumed under one pole to being subsumed under the contrasting pole. The hierarchical nature implied by Kelly's corollaries does, however, imply possibilities of what may well be regarded as "deeper" changes. As an example Kelly (1955, p. 82) asks us to consider a person who once construed people around him under the construct fear – domination, there are those to fear and there are those to dominate. But there may be a metaconstruct childish-mature, where childish subsumes fear-domination and mature subsumes respect-contempt and the rnetaconstruct may permit a change from applying the fear-domination construct to the respect-contempt construct. This is referred to as a shift change by Hinkle. We may note that a shift change for subordinal constructs corresponds to a slot change for a superordinal construct. The person may say: "whereas I formerly had a childish outlook, I have now shifted to a mature outlook concerning other people". These two different kinds of changes will be further commented in Section 6.42. Turning now to the Rep Grid where the person provides constructs to describe figures, Kelly might have chosen to explore the hierarchical implications of his theory. We have pointed to both the formal similarity of a semantic matrix and a Rep Grid, and also important similarities between Miller's description of hierarchical organization and some of Kelly’s corollaries. One necessary condition for exploring hierarchical organization would then be that the range corollary explicitly was included in the instruction to the subject. 
This would imply that he would be permitted to mark certain constructs as Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 132 irrelevant for certain figures. As a matter of fact, however, Kelly chose to elaborate the spatial (dimensional) implications of his system and explicitly assumed that all figures would fall within the range of convenience of all the constructs the subject provided. For a given construct a check mark is taken to imply that one pole applies to the figure, a void implies that the contrasting pole applies. This gives a "filled" grid, there are no intersects corresponding to the zeros in Table 1. Of course he was aware that "this may not be a good assumption in all cases: it may be that the client has left a void at a certain intersect simply because the construct does not seem to apply, one way or the other, to this particular figure" (op. cit. p. 271). This assumption precludes finding a hierarchical structure in Rep Grid. As we have seen, such structure necessitates not only a sizeable number of irrelevances, but also a specific pattern of them (cfr .the zeros in Table 1, p. 129 and Fig. 4., Ch." 5.). Did Kelly violate his own theory in not including implications of the range corollary in the Rep Grid? There has been some discussion of this, cfr. the recent summary of Grid methodology by Bannister and Mair (1968). They do for instance point out that when a subject conscientiously carries out the instructions for the Rep Grid "he may quietly produce what, in terms of his personal system, is nonsense." (op.cit. p. 204). The main point in the present context is not, however, to add to this discussion but rather to point out that there are overriding considerations in Kelly's theory which may justify the choice he made. These considerations bring us to the main point in the present section, the inadequacy of tree structure models. There is one construct used to describe construction systems which gives us a clue to this inadequacy, the construct propositionaliy. This construct in a way runs counter to the notion of range of convenience "a propositional construct is one which does not disturb the other realm memberships of its elements ---Although this is a ball, there is no reason therefore to believe that it could not be lopsided, valuable or have a French accent." (op.cit. p. 157). It is as if Kelly recognizes that for most of us anything with a French accent must fall outside the range of convenience of the construct ball, but he refuses to necessarily accept such "constricted" thinking. Struggling for years with this type of problem Bannister and Mair (1968, p. 129-130) have tried to exclude false teeth from the range of convenience of religious-atheist, only to end up realizing "that false teeth are clearly atheist". Balls with French accents, atheist false teeth - what do they have in common? These constructions share the attempt to break away from strict, pedestrian semantic rules, in other words they point to what makes language come alive, metaphors. Metaphors may recall a charming game, guessing who a person is thinking about by way of metaphorical questions. "If he were a flower, what would he be? Or what would be his emblem as an animal, his symbol among colours, his style among painters. What would he be if he were a dish?." (Gombrich, 1965, p. 36) A fascinating aspect of this game is that it really may work, "the task of the guesser is by no means hopeless" (op.cit. p. 36). 
Perhaps one should not be surprised that this game is one of the bag of tricks used in encounter groups. So it may not simply be the by now proverbial malleability of subjects in psychological experiments which makes them comply by filling out a complete grid. The metaphorical quality of language, which may make just about any construct apply to just about any figure, seems to be a better explanation for the fact that it is usually not difficult to make a person fill out a complete grid. And how to deal with metaphors? We should not forget that metaphors and related phenomena were the point of departure for perhaps the most widely discussed example of a spatial model in psychology, Osgood’s semantic space. Since this is not always recognized among the numerous critics and commentators of semantic space, we quote from the introductory chapter in Osgood et.al, (1057, p. 20-21): "The notion of using polar adjectives to define the termini of semantic dimensions grew out of research on synesthesia with Karwoski …..” Pointing to the general relevance of synesthesia for thinking and language there were the observations: whereas fast exciting music might be pictured by the synesthete as sharply etched, bright red forms, his less imaginative brethren would merely agree that words like “red-hot”, "bright" and "fiery" as verbal metaphors adequately described the music. The relation of this phenomenon to ordinary metaphor is evident. A happy man is said to feel "high", a sad man "low", the pianist travels "up" and "down" the scale from treble to bass: soul Finn Tschudi (1972) The latent, the manifest and the reconstructed in multivariate data reduction methods. 133 travels "up" to the good place and “down" to the bad place: hope is "white" and despair is "black". Reminiscing on the growth of "semantic space” Osgood (1969, p. vii to ix) recalls his childhood infatuation with words and “a vivid and colourful image of words as clusters of starlike points in an immense space." He then expresses his deep gratitude to the inspiration provided by Karwoski’s work on synesthesia and later recalls how "I was swept up into the monumental edifice of learning theory that Clark Hull was building.” Osgood ends up identifying semantic space with a wayward Pinocchio. After stating what he is not, there is the positive assertion: "he is…. primarily reflecting affective meaning by virtue of the metaphorical usage of his scales." One possibility might now be to say that tree structures have a limited range of convenience, as they provide a model for part of the psychological lexical organization and that there is a different range of convenience for spatial models, the latter being relevant for metaphorical and affective aspects of language. But this does not seem satisfactory, there is Kelly clearly straddling both horses. Kelly might be said to deal with phenomena on a macro level - the large issues in personality theory. It is very interesting to note that the dissatisfaction with an either/or approach to these types of models which we have read out of Kelly also finds support in a recent theoretical framework for psycholinguistics. When discussing phenomena at a micro level, Rommetveit (1968, 1972) describes referential, associative and emotional aspects of the experience of words but one of his basic points is that this in some way is an artificial and arbitrary division of one process, since the different components mutually influence each other. 
Representational processes release associative and emotional aspects and are also continuously influenced by associative and emotive impulses (Rommetveit, 1972, p. 75). So it appears that neither a tree structure model nor a spatial one can be adequate for complex human functioning. This motivates the outline of a more general model in the next section.

6.3. Outline of a general model.

The first step towards a general model is to embed classes in a multidimensional space. Struggling with the general problem of geometric representation of complex structures Attneave (1962, p. 638) stated: "if a multidimensional psychophysical space is taken as the fundamental framework then classes (e.g., of objects) may be conceived as regions or hypervolumes in that space." A very significant further step was taken by Torgerson (1965) who suggested a variety of structures all of which shared the characteristic that they violated the assumption of a filled space. In the present terminology he suggested a variety of types of partially filled spaces. It is especially important to note that he reported some experiments where the stimuli were constructed to reflect both qualitative (e.g. sign of asymmetry) and quantitative (aspects of size) dimensions, and the results supported interpretation in terms of a partially filled space. The most important of the structures Torgerson suggested is a mixture of class and dimensional structures. This gave rise to his highly interesting contribution to a symposium on Classification in Psychiatry where he refused to fall prey to the dichotomy between "dimensionalists" and "typologists" (Torgerson, 1968). He (op.cit. p. 219) suggested that similarity between patients may be determined in part by class membership "and also in part by degree of difference on one or more quantitative dimensions that cut across class boundaries. This would occur, for example, if some of the variables were sensitive to overall degree of disturbance, regardless of the type of disturbance involved". The resulting structure is not too difficult to visualize. First we note that e.g. 3 classes will be represented as three points (or tight clusters) in a two dimensional space. In the simplest case the 3 classes will be represented as corners of an equilateral triangle. If we now add a dimension (e.g. overall disturbance) to class membership, "the points would ... be located in a three-space, but only on the three lines corresponding to the edges of a triangular prism" as in Fig. 2.

Fig. 2. Representation of a mixed class and dimensional structure, 3 classes and one continuous dimension. Adapted from Torgerson (1968, p. 219).

With 1 quantitative dimension we get clusters organized as lines, with 2 quantitative dimensions we get clusters organized as planes etc. A purely dimensional interpretation of such structures would be highly misleading, since the basic feature, the mixed structure, would then be lost. This should not, however, deter us from applying multidimensional methods, since as Torgerson (1965) emphasized a multidimensional space in principle can embed any kind of structure, spatial or not. Adherence to a conventional dimensional framework is then of course objectionable but "we can think about - and look at - the shape of the configuration itself" (op.cit. p. 390). But we may recall from Ch. 1 that our capacity to "look" is severely restricted; this is why Shepard (cfr. p. 18) stressed the need for "artificial machinery" - or data reduction models.
Consequently it is a very important step when one of Torgerson's pupils described an algorithm for revealing "mixtures of class and quantitative variation" (Degerman, 1970). Before giving a rough outline of the steps in his algorithm it may be helpful to describe the types of experiments that he reported which conformed to his model. There were first "15 stimuli composed of three classes (triangle, circle, square) varying in five levels of grey (brightness)", second an experiment with "20 stimuli composed of four classes (triangle, circle, square, cross) varying in 5 levels of brightness". The third experiment contained three shapes and two quantitative dimensions, both brightness and shape (op.cit. p. 484). For all experiments the subjects made judgments of dissimilarity for each pair of stimuli. An important feature which these experiments share with the example from Torgerson is that the quantitative dimensions apply to all the objects in the set; we have what will be called global dimensions. The first step in Degerman's algorithm is to perform a multidimensional scaling of the similarities data, which gives an (n, k) coordinate matrix. n is as usual the number of points and k the total number of dimensions. The basic purpose of the procedure is to partition the k space into two orthogonal complements, q dimensions for q quantitative dimensions and k - q dimensions for k - q + 1 classes. In experiment 1 above we would for instance expect a total of 3 dimensions, q = 1 quantitative dimension and 2 dimensions (3 - 1) for 3 (3 - 1 + 1) classes. In the second experiment we would expect k = 4, q = 1 and k - q = 3 for 4 classes. Also in the third experiment we would expect k = 4, q = 2 and k - q = 2 for 3 classes. First values of k and q must be preset, and the next step is then to identify the k - q + 1 clusters. The special case of q = 0 is the basis for previous cluster programs; this corresponds to a simple classification (a nominal scale). The problem faced by Degerman is of a far greater complexity. For q = 0 the clusters are organized around points, but for q = 1 around lines, etc., as already mentioned. The task is performed by first calculating a set of what he calls hyperplanar coefficients. For the case of q = 1 such a coefficient is computed for all 3-tuples (generally q + 2 tuples). This coefficient is simply an index of the extent to which the 3 points fall on a straight line, or more concretely the length of the perpendicular from the largest side to the opposite vertex. Generally a hyperplanar coefficient is the "minimum distance from a vertex to the opposite face of a simplex" (op.cit. p. 480). The hyperplanar coefficients will be small for all points belonging to the same cluster and large for points belonging to different clusters. The next step is then to use an iterative procedure to identify the members of each of the k - q + 1 clusters from the set of hyperplanar coefficients.38 Finally a fairly complex rotational procedure is used to separate the class space from the quantitative dimensions. In terms of Fig. 2 the prism would be tilted to stand squarely on the plane. The first two dimensions would then reveal the class structure; here we would get 3 sets of superimposed points. Plotting for instance the first against the third (the quantitative) dimension we would get the points on three parallel vertical lines.
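To make the notion of a hyperplanar coefficient concrete, the following is a minimal numpy sketch (not Degerman's program; all names are ours). It generates a configuration of the kind described for experiment 1 - three classes at the corners of an equilateral triangle with a brightness dimension added - and shows that 3-tuples drawn from within one class (points on one edge of the prism) get coefficients close to zero, whereas 3-tuples spanning different classes do not.

    import numpy as np
    from itertools import combinations

    def hyperplanar_coefficient(points):
        """Minimum distance from a vertex to the affine hull of the remaining
        points of the simplex; small when the tuple is nearly 'flat'
        (e.g. nearly collinear for a 3-tuple)."""
        points = np.asarray(points, dtype=float)
        dists = []
        for i in range(len(points)):
            p, rest = points[i], np.delete(points, i, axis=0)
            base, span = rest[0], (rest[1:] - rest[0]).T      # span: k x (tuple size - 2)
            coef, *_ = np.linalg.lstsq(span, p - base, rcond=None)
            dists.append(np.linalg.norm(span @ coef - (p - base)))
        return min(dists)

    rng = np.random.default_rng(0)
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.866]])  # 3 classes, equilateral triangle
    levels = np.linspace(0.0, 1.0, 5)                           # 5 "brightness" levels
    stimuli = np.array([[*corners[c], b] for c in range(3) for b in levels])
    stimuli += rng.normal(scale=0.02, size=stimuli.shape)       # a little noise

    within = [hyperplanar_coefficient(stimuli[list(t)])
              for c in range(3) for t in combinations(range(5 * c, 5 * c + 5), 3)]
    between = hyperplanar_coefficient(stimuli[[0, 5, 10]])      # one point from each class
    print(f"median within-class coefficient: {np.median(within):.3f}")
    print(f"a between-class coefficient:     {between:.3f}")

In Degerman's procedure such coefficients are computed for all (q + 2)-tuples and then fed to the iterative cluster identification described above.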
This mixed model clearly subsumes the usual dimensional models as one special case. This special case corresponds to setting q = k in Degerman's model; we then have the space organized in terms of a single cluster. Furthermore a class structure is also clearly subsumed; this corresponds to q = 0. The model is, however, not more general than a tree structure model. These two models are not comparable in generality since neither one of them can subsume the other. We do, however, think that it may be possible to extend Degerman's model to a level of generality where it will subsume a tree structure. As the next section shows, however, we do not underestimate the complexity of this task. One way to generalize Degerman's model is to start by checking whether it is reasonable to set q = 0. Consider now a case where in some way this is found tenable. We note (op.cit. p. 477) that "when small amounts of random error are present, the class members disperse somewhat, and the prototypal clusters for nominal classes resemble (k-q) dimensional spheroids centered at the vertices of a simplex". It is now possible to conceive of each of these k + 1 spheroids (when q = 0) as separate galaxies in the total universe. Technically the key concept is now recursivity. For each of the separate galaxies the generalized procedure can be applied again. For galaxy i there will be the choice as to whether qi should be set equal to 0 or not. If qi = 0 we get a set of subgalaxies. Again each subgalaxy may be treated in exactly the same way as first the total universe and then a galaxy. If now for each galaxy, subgalaxies within galaxies etc. it is reasonable to set the corresponding q = 0, the universe is described as a tree structure. The analysis then produces classes, further classes within classes etc., and this is just what a tree is. This general model will, however, be more interesting when q is not set to 0 for each galaxy and each subgalaxy. Suppose that first it is decided to set q = 0 but that in galaxy i, qi is greater than 0. Degerman's algorithm is then applied just to this galaxy and produces qi quantitative dimensions. Note that these dimensions only apply to galaxy i and not to the whole universe. In contrast to the global dimensions in Degerman's model, we get the possibility of local dimensions in the generalized model. When q is first set to 0 this rules out global dimensions. The presently proposed generalization of Degerman's model can, however, accommodate both global and local dimensions. The previously referred to (k - q) dimensional spheroids may also be analyzed again by Degerman's model, and a single class might then turn out to have a complex substructure, perhaps comprising both further classes and local dimensions. In principle the concept of local dimensions solves a problem which worried Attneave (1962) as to the applicability of multidimensional scaling. As he expressed the difficulty: The concept of a psychological space of many dimensions, in which virtually any object may be represented, runs into several difficulties of a still more fundamental nature [than the previously referred to which deals with the Minkowski constant]. Only one of these need to be discussed here: the problem of relevance.
Consider the kind of dimensions that might be important for the representation of a human face: e.g., height of forehead, distance between eyes, length of nose etc. Now where, on such dimensions, is an object like a chair located? We cannot say that the distance between the chair's eyes is "medium", nor that it is "zero": since the chair has no eyes, any question about the distance between them is completely irrelevant. This is to say, in geometrical terms, that a face and a chair belong to different representative spaces (or to partially overlapping spaces), rather than to different regions of the same space" (op.cit. p. 632).

38 It is interesting to note that one of the procedures used by Johnson (1967) for noisy data is what he calls the "connectedness" (or minimum) method (cfr. p. 191) which will identify "elongated", chainlike clusters. This is somewhat similar to the more refined procedure used by Degerman.

In our proposed general model "distance between eyes" would be a local dimension and a human face would belong to a different galaxy than a chair. Galaxies are not just "different regions of the same space"; there are not necessarily any "bridges" or "dimensions" spanning the galaxies, and the space in our general model is to an even greater extent than in Degerman's model an unfillable - or only partially filled - space. The remainder of this chapter is devoted to difficulties in implementing such a general model as we have sketched. The difficulties are of two sorts. First there are technical problems, some of which are hinted at in Section 6.41. Perhaps a more serious shortcoming is that as yet we have no concrete example where the full generality of the model will be useful. This we think partly reflects a lack of theoretical sophistication in cognitive and social psychology. In Section 6.42 we offer some speculations which perhaps in the future may lead to fruitful research.

6.4. Comments on the general model.

The points taken up in this section do not add up to any coherent overall picture. Specifically the issues raised in Section 6.41 will be seen to be of a quite different kind from those taken up in Section 6.42.

6.41. Some technical problems. The metamodel and the general model.

Three points will be raised in this subsection. First we make more explicit the recursivity required and the problems to be solved in order to implement the general model. Second we comment on the relation between the general model and the metamodel, and third we comment on nonmetric methods in relation to the general model. As a general term encompassing the previous terms "universe" and "galaxies", we use simply "spheroid". A spheroid here denotes a set of points the detailed structure of which is to be decided. "Degerman separation" will refer to the basic feature of Degerman's algorithm, separating a k dimensional space into a q dimensional quantitative subspace and a k - q dimensional class space (k - q + 1 classes). The following steps describe an outline of the general model:
1) SPHEROID ANALYSIS
2) test if structure in the spheroid (versus just noise)
   a) if YES then go to 3)
   b) if NO then STOP
3) estimate q (and k).
4) DEGERMAN SEPARATION
   4.1 print out the quantitative subspace (if q greater than 0)
   4.2 test if q less than k
      a) if YES then RECURSIVE call of 1) for each of the k - q + 1 spheroids
      b) if NO then STOP
For each analysis the recursive aspect will generate a process tree; furthermore the resulting output can be described by a family tree, cfr. Eckblad (1971a, 1971b) for a further discussion of these concepts. Some examples will illustrate the steps a general program will go through for various special cases previously discussed:

Type of example: Main steps gone through
pure tree: [2a), 3) q = 0, 4.2a)] repeat [ ] until 2b).
pure class structure: 2a), 3) q = 0, 4.2a), 2b).
pure dimensional str.: 2a), 3) q = k, 4.2b).
Degerman mixed str. (only global dimensions): 2a), 3) k > q > 0, 4.2a), 2b).
not global, but local dim.: 2a), 3) q = 0, 4.2a), 2a), 3) q > 0, etc.
both global and local dim.: 2a), 3) q > 0, 4.2a), 2a), 3) q > 0, etc.

A small difficulty is that such an outline is difficult to implement in FORTRAN since recursivity is generally not possible in FORTRAN (there are other programming languages, e.g. ALGOL, where recursivity is no problem). A greater difficulty will be to devise appropriate tests of structure (step 2) and of dimensionality (step 3). Concerning step 2) one simple condition would be that if the number of points is less than some specified number, search for structure will be impossible, and this will provide one simple STOP condition (incidentally this may preclude the general model from describing a complete tree structure, as it may well be impossible by this approach to detect further structure in, say, 4 points). Otherwise step 2) will probably prove difficult to implement, since how can the program "know" if there is structure without going through a complete analysis? If a complete analysis should be necessary in step 2) we would seem to be involved in a too messy recursivity, since it is difficult to envisage stop conditions. Consequently it appears likely that it will be necessary to use heuristic devices - inspired by computer work in artificial intelligence - to implement something resembling a general test of structure (perhaps some general measure of amount of information may also be useful). Heuristic devices may also be necessary in order to find appropriate estimates of q. If q must be preset (as in Degerman's algorithm) it will not be possible to implement the recursive loops implied by the general outline, since q necessarily must be reestimated as we move away from the root in the process tree generated by the general program. The same holds good for k. One simple rule here will be that if for instance spheroid i after step 4) contains ni points, then the corresponding k when the program next enters step 3) must be less than ni. It may also be possible to estimate k as part of step 4). The basic function of k is to regulate the number of classes (spheroids) into which the current number of points inputted to step 4) is to be partitioned. It may be of interest to note that for the special case of q = 0 the general formula for a hyperplanar coefficient (Degerman, 1970, p. 480, equation (7)) can be shown to reduce to the distance between two points. For this case there exists a variety of clustering procedures which also estimate the number of classes, cfr. for instance Tryon and Bailey (1970).
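The recursive character of the outline above is easy to express in a language that allows recursion. The following is a minimal sketch, not an implementation: the hard parts - the test of structure in step 2) and the estimation of q and k in step 3) - are replaced by crude placeholders, and the Degerman separation itself is only stubbed in.

    import numpy as np

    MIN_POINTS = 5   # the simple STOP condition mentioned in the text

    def has_structure(config):
        # placeholder for step 2): a real program needs a test of structure vs. noise
        return np.ptp(config, axis=0).max() > 0.1

    def estimate_q_and_k(config):
        # placeholder for step 3): here we simply guess q = k (one cluster, all quantitative)
        k = min(config.shape[1], len(config) - 1)
        return k, k

    def degerman_separation(config, q, k):
        # placeholder for step 4): should rotate so that a q-dimensional quantitative
        # subspace is separated from a (k - q)-dimensional class space (k - q + 1 clusters)
        return config[:, :q], []

    def spheroid_analysis(config):
        """Steps 1)-4) of the outline, with a recursive call of 1) at step 4.2a)."""
        n = len(config)
        if n < MIN_POINTS or not has_structure(config):             # step 2b): STOP
            return {"n": n, "structure": None}
        q, k = estimate_q_and_k(config)                              # step 3)
        quantitative, spheroids = degerman_separation(config, q, k)  # steps 4), 4.1
        node = {"n": n, "q": q, "k": k}
        if q < k:                                                    # step 4.2a): recurse
            node["spheroids"] = [spheroid_analysis(s) for s in spheroids]
        return node

    print(spheroid_analysis(np.random.default_rng(0).normal(size=(20, 3))))

The sketch also makes the bookkeeping of the process tree explicit: each recursive call corresponds to one node, and the returned nested dictionary is the family tree of the resulting output.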
Perhaps some features of "standard" clustering procedures can be incorporated into Degerman's more general clustering procedure. We may also note that there are a variety of partial generalizations of Degerman's algorithm which should be fairly simple to implement but which fall short of the full recursivity described above. A simple example would be to set q = 0 and then again to input each of the ni spheroids (with ki computed as some specified function of ni) to Degerman's algorithm. In this second set of runs one might for instance set qi = 1 for each of the spheroids if one guessed that there would be one local dimension in each "subgalaxy". Suppose now that in one way or the other it proves possible to implement the suggested generalization of Degerman's model; how should we then conceptualize this general model in relation to the metamodel? In Ch. 1 we treated a tree structure as one type of model and a dimensional (spatial) model as another type of model, and partly on this background made applicability of model one of the major problems to be investigated. Type has till now been a more general concept than form (e.g. different dimensionalities), but in the general model a tree structure model and a purely dimensional model must both be regarded as different forms of the same type. There is no longer seen to be such a "fundamental" difference between a spatial and a tree structure model as to warrant a distinction in "types". We can now regard these two models as characterized by different specifications of parameters within the same type of model. In principle then what we in Ch. 1 referred to as deciding which type of model is appropriate now reduces to a question of justifying a specific set of parameters within one type of model (it may of course be possible to conceptualize different "types" from our presently proposed "general model"; again further levels of abstraction will be possible and this will then reduce such types to alternative forms within a still more general type, etc.). From this point of view the strategy we illustrated in Section 4.7 for determining dimensionality may be relevant to step 3) in the outline of the general model, though as indicated step 3) is probably of a far greater complexity. Likewise one of the major concerns in Section 4.6 was to illustrate a methodology to test whether there was structure in a material or not; this may be relevant to step 2). Perhaps the present methodology may be incorporated into - or at least inspire - the heuristic strategies which are necessary to provide a workable general model. There is finally one more comment to make on the relation between metric and nonmetric methods, which adds to the investigation reported in Section 4.33. It will be recalled that we came out strongly in favour of nonmetric models, though we paid cognisance to a finding reported by Torgerson (1965) that nonmetric models may distort data. The structure in the stimulus material reported by Torgerson appears to be somewhat similar to the structure of the stimuli employed by Degerman (1970). Consequently it is somewhat surprising that Degerman uses nonmetric multidimensional scaling to provide the initial (n, k) configuration. Be that as it may, if the distance matrix from a tree structure is used as input to a nonmetric program, most of the structure will be lost. This substantiates the claim made by Torgerson (1965, p. 389) that nonmetric models may throw away information in the data.
To take a simple example, suppose we have n = 8 and that each node forms subsets of equal size; then any nonmetric program will give perfect fit in one dimension. This dimension will consist of two clusters, each with four superimposed points. The perfect one-dimensional fit for the tree implies that there will be the same values of ∆ (cfr. Section 3.1) for different values of the dissimilarities. This finding is not surprising for MDSCAL and TORSCA since it is consistent with weak monotonicity. The finding may, however, be surprising for SSA-1 since it clearly violates strong monotonicity. Evidently the "strong monotonicity" in SSA-1 is not "sufficiently" strong to avoid degenerate solutions. It should not be thought that this result is unique to a particular type of a tree structure. To take a quite different example, if for each node subsuming k points subsets of points are formed which consist of k-1 and 1 points respectively (k = n, n-1, ... 2), there will again be a perfect one-dimensional fit with one point in one cluster and n-1 superimposed points in the other cluster. The former example is constructed by an "even split" principle, the latter by a "maximally uneven split" principle. This finding39 clearly implies that if there really is a tree structure, it will not be revealed if the analysis starts with nonmetric scaling. On the other hand it will be recalled from Section 5.3 that an l∞ metric was required to give a coordinate representation isomorphic to a tree structure (a tree grid matrix). Numerous Euclidean metric analyses of similarities data generated by tree structures (the details of these analyses will be reported elsewhere) have, however, shown that it is not difficult to recognize the tree grid matrix if the underlying tree is based on the even split principle. The more the tree departs from this (approaching maximally uneven split) the more difficult it is to recognize the tree in the Euclidean representation (but perhaps not impossible). The preliminary conclusion we draw from this is that in the general model it may not be possible to use the standard nonmetric models. (A possibility which has not been investigated is that the reported degeneracies may be less pronounced with data slightly infested with noise, though the possibility does not a priori appear reasonable.)

39 That tree structures are not captured by nonmetric methods may be said to elaborate what Shepard (1962b, p. 249) stated, that any nonmetric model would fail to reveal structure within classes characterized by "proximity measures for all pairs of points within the same subset larger than the proximity measures for all pairs divided between two subsets".
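The degenerate fit for the even split example is easy to verify directly. The following sketch is our own illustration (it does not run any of the nonmetric programs): it builds the tree distances for n = 8 with equal splits at every node and checks that the two-cluster, one-dimensional configuration with four superimposed points in each cluster satisfies weak monotonicity perfectly, i.e. has zero stress.

    import numpy as np
    from itertools import combinations

    def lca_level(i, j):
        """Level (counted from the leaves) of the lowest common ancestor of
        leaves i and j in a balanced binary tree on 8 leaves."""
        level = 0
        while i != j:
            i, j, level = i // 2, j // 2, level + 1
        return level

    n = 8
    pairs = list(combinations(range(n), 2))
    tree_d = {p: lca_level(*p) for p in pairs}              # tree (ultrametric) dissimilarities

    # degenerate one-dimensional "solution": two clusters of four superimposed points
    x = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
    delta = {(i, j): abs(x[i] - x[j]) for i, j in pairs}

    # weak monotonicity: d_ij < d_kl must imply delta_ij <= delta_kl
    violations = [(p, r) for p in pairs for r in pairs
                  if tree_d[p] < tree_d[r] and delta[p] > delta[r]]
    print("weak monotonicity violations:", len(violations))  # prints 0: perfect (degenerate) fit

Under such a weakly monotone fit all within-cluster dissimilarities are simply tied at ∆ = 0, which is exactly the degeneracy described above.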
6.42. The general model as a conceptual model: new directions for psychological research.

Degerman (1970, p. 486-487) distinguishes between the value of his model as a tool for determining structure in empirical data and as a "general conceptual model", an "aid in forming a framework for experimentation, thereby allowing relevant questions to be asked". This section raises some problems which - when clarified - may suggest that our proposed general model may be a powerful tool as a conceptual model. First we note that similarities data may be inadequate if we wish to explore the potentialities of the general model. We also consider other types of data. The focus will, however, not be types of data per se, but rather some theoretical problems. Since dimensional models are fairly well understood, this section concentrates on the less explored nature of hierarchical relations. What are the types of relations in hierarchical systems, and how should one go about assessing these types of relations? These problems will have to be clarified before one can attempt to combine tree structures and dimensional structures in the general model. In Section 6.2 we distinguished between two types of relations in hierarchical systems, transitive and not transitive relations. Not transitive relations were related to metaconstructs, which again point to different levels of abstraction. In the signal contribution by Bateson (1955) the notion of different levels of abstraction is basic. Bateson has had a tremendous impact on clinical psychology, and his students have provided the most viable (and probably a superior) alternative to psychoanalysis and behaviour modification techniques, cfr. e.g. Bateson et al. (1956), Haley (1963, 1969), Watzlawick et al. (1967), Weakland et al. (1972). One of the main contributions of Bateson (1955) was to describe a third type of relation in hierarchical systems: intransitive relations in communication which simultaneously takes place at different levels of abstraction. His main example is an analysis of the message "This is play" on the basis of observations of monkeys at the Fleishhacker Zoo in January 1952. Somewhat simplified, his analysis (cfr. especially op.cit. p. 41) goes as follows:
a) the playful nip denotes a bite
b) a bite denotes intention to hurt,40
c) but the playful nip does not denote, but rather denies, intention to hurt.
a) and b) logically should imply (by transitivity) intention to hurt, but the essence of "this is play" is a denial of this logical pattern. An interesting example of communication which is at least partly characterized by the same structure as "this is play" is the enchanting behaviour of flirtation. Briefly, "this is flirtation" appears to have the following structure:
a) the special smile (tone of voice, glance etc.) denotes a sexual approach
b) a sexual approach denotes intercourse,
c) but the special smile does not denote, but rather denies, intercourse,
and again a) and b) "logically" imply the reverse of c), and we all know that "violating" the transitivity has its special charms. The label "intransitivity" for the patterns described above is only incidentally used by Bateson; basically he discusses "this is play" in relation to paradoxes and to Russell's theory of logical types.

40 Bateson chooses not to specify what a "bite" denotes, but using a specific label for this makes it much simpler to illustrate his basic idea.

We share Bateson's scorn for the logician's attempts to rule out messages such as those above as inadmissible since: "without these paradoxes the evolution of communication would be at an end. Life would then be an endless interchange of stylized messages, a game with rigid rules, unrelieved by change or humour." (op.cit. p. 51). On the other hand it is not difficult to see that the intransitivity in the examples above may be precarious indeed: play may turn into dead serious fight, flirtation may lose its special flavour and be replaced by "transitive" behaviour.
When we ask about methods which may be directly useful in assessing types of relations in cognitive structures, we shall see that this raises new questions about perhaps yet other types of relations in hierarchical systems. Particularly significant are Hinkle's innovations in Kelly methodology. As noted in Section 6.2 a usual Rep Grid does not incorporate the hierarchical aspects of Kelly's theory. If one, as Bannister and Mair (1968) advocate, uses a rating form instead of dichotomous marking, a Grid becomes formally identical to a semantic differential, the only difference being that in the former case the subject provides his own constructs, whereas in the semantic differential the scales (corresponding to constructs) are provided. To further emphasize the similarity, we note that in some cases it may be profitable to combine personal constructs and provided constructs, cfr. Fransella and Adams (1966).41 So Rep Grids may, just as semantic differentials, epitomize the dimensional approach. The basic methodological innovation by Hinkle (1965) is the laddering technique. Having provided constructs to characterize triads where the self is always one of the figures, the subject is further asked to state the preferred pole of each construct. Call this A1, the other pole A2. He is then asked to provide superordinate constructs by answering "what is the advantage of A1, versus the disadvantage of A2". This provides a new construct, B1 versus B2. The same procedure is repeated on B1 and B2 till the subject has no further constructs to provide. Neither Hinkle (1965) nor Bannister and Mair (1968) provides concrete examples of results of such ladders for several constructs for a single subject; an example may be useful to appreciate the special quality of the method.42 Only parts of some of the ladders from the preferred ends are outlined:
make exciting food - gives a richer life - self actualization
enjoy a drink while discussing - alcohol liberates - avoids standard norms - liberates my potentials
not having ceramics as a hobby - concentrate on other interests
experience oriented - gives more genuine relations to others - self actualization
wish to be interested in politics - find one's own position - find what is true for me - be independent - liberate potentials - self actualization
The example illustrates how one person from seemingly quite different points of departure arrives at the same "root".

41 Continued use of the dichotomous form of the Rep Grid is not recommended, since dichotomies are not well suited to reveal dimensional structures. It should be noted that in 1966 Kelly would - if having to rewrite his personal construct theory - have deleted the section on the Rep Grid, cfr. Hinkle (1970, p. 91).

42 The example and the general comment on the method are based on pilot data collected in a course in the fall of 1970.
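The laddering procedure itself is simple enough to be written down as a loop. The following interactive sketch is only an illustration of the elicitation sequence described above; the wording of the prompts and all names are our own.

    def ladder(first_construct):
        """Elicit a Hinkle-style ladder: repeat the laddering question on the
        preferred pole until the respondent has no further constructs to offer."""
        chain = [first_construct]                  # list of (preferred pole, contrast pole)
        while True:
            a1, a2 = chain[-1]
            answer = input(f"What is the advantage of '{a1}', versus the "
                           f"disadvantage of '{a2}'? (blank line to stop) ").strip()
            if not answer:
                return [pole for pole, _ in chain]  # the chain of preferred poles
            contrast = input(f"And what would be the contrast to '{answer}'? ").strip()
            chain.append((answer, contrast or f"not {answer}"))

    # example, cf. one of the pilot ladders above:
    # ladder(("make exciting food", "make ordinary food"))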
Hinkle's technique may be extended to provide a valuable clinical tool. Basically one may regard a symptom (or complaint, as Kelly would have said) as a kind of behaviour which has undesirable consequences for the person: the symptom is a construct with a desired contrast (e.g. "anxiety" vs. "freedom from anxiety"). On the other hand there will also be advantages of the symptom, as it provides ways of controlling one's environment (cfr. Haley, 1963), and correspondingly there are disadvantages of being free from the symptom. Hinkle (1965, p. 18, p. 57) mentions - without exploring further - implicative dilemmas, that is situations where both poles of a construct have both desirable and undesirable features. "Implicative dilemma" conveniently captures the essence of the above outline of symptomatic behaviour. It may be possible to provide (more or less) standardized procedures to reveal such implicative dilemmas by systematically eliciting advantages and disadvantages both of the symptom and of the contrast to the symptom. This is further treated by Tschudi and Larsen (1970) and Larsen (1972, Ch. 7), a major point being that revealing implicative dilemmas immediately suggests therapeutic techniques. In working further with a variety of examples of these kinds of techniques it is possible that one might want to use e.g. a tree structure (or perhaps some different kind of model?) not as a descriptive tool, but rather as a normative model. The model will then be regarded as a baseline, and the deviations will be of interest in themselves. (This is the case with for instance the use of expected utility as a model for decision making. The model is not rejected when it does not fit the data; rather the deviations are regarded as giving important information on information processing, cfr. e.g. Edwards et al. (1965). It should be clear that we would not necessarily give opprobrious labels to deviations from logical patterns, as the analysis of flirtation makes evident.) One limitation of both Hinkle's laddering and methods for revealing implicative dilemmas is that these methods mainly seem to "extend cleavages" rather than "abstract across", cfr. p. 131. From the point of view of the Bateson group the latter concept seems far more relevant for an understanding both of psychopathology and therapy. A recurring theme is that paradoxes (e.g. "double binds"), which necessarily involve different levels of abstraction, are involved both in producing and alleviating symptomatic behaviour. Consequently the most fruitful way towards integration of the work of those who draw their main inspiration from Kelly and those belonging to the Bateson group would be to study what we called "metaconstructs". One could for instance ask a person to sort the constructs he provides in a Rep Grid and if possible to further sort his metaconstructs. This in essence would be to ask the person to comment on his construct system. This may well be possible to wed to therapeutic techniques. At one point in such a procedure the person may perhaps come to realize "this is the way I have regarded myself and presented myself to others" (a core construct in Kelly's terminology), "but there may be different ways ..." In Hinkle's terminology this procedure may facilitate "shift changes" (cfr. p. 131), or change in metaconstructs. At this point it seems appropriate to quote the definition of therapy given by Bateson (1955, p. 49): "therapy is an attempt to change the patient's metacommunicative habits". This definition has not been improved or challenged by his students. Before concluding, a general precaution is necessary. All the methods we have mentioned in this section share one basic feature: the results will be highly sensitive to the interpersonal context in which the methods are used. The most striking impression from reading students' reports on the laddering techniques was the wide range of comments on the meaningfulness of the method. This ranged from "mere verbal exercise" to "deep (occasionally shocking) and highly revealing confrontation with one's most personal values."
It does not seem unreasonable to ascribe at least part of this variation to the varying interpersonal relations. Stated otherwise there are ample opportunities for arranging situations where the person from most points of view will just produce nonsense. This touches a basic issue which can not be elaborated in the present context: we do not regard cognitive structure as something just "residing in the mind" but rather as strategies the person may or may not choose to reveal in specific situations. One might even regard various cognitive structures as (partly) generated by specific situations. This point of view may in the future pave the way for experimental manipulations (therapy may be regarded as manipulation of strategies on an intuitive basis). Returning finally to our general model, the strategy we propose is first to experiment with a variety of methods of the sort proposed by Hinkle, and also to consider further innovations. The first goal should be to clarify the nature and occurrence of various types of relations in hierarchical construct systems. Dimensional aspects of construct systems may best be revealed by the Osgood-Kelly type of ratings. Perhaps relations between dimensional and hierarchical aspects will be different according to the prevailing type of hierarchical relations. Will we for instance find global (e.g. evaluative) dimensions in hierarchical systems with mainly transitive relations? Will there only be local dimensions if there is a preponderance of not transitive (meta) constructs? Probably it is premature at present even to suggest such kinds of questions. A more immediate goal is to call for close collaboration between the Bateson and Kelly groups of researchers who, if at all aware of the other group, have done nothing more than pay a passing tribute to each other. Such collaboration, we believe, will increase the likelihood of developing viable general models. And these models will hardly be less complex than the model we tried to sketch in Sections 6.3 and 6.41.

CONCLUDING REMARKS.

There is nothing to add to Part II, so these remarks only concern Part I. First we briefly summarize the main results and point to further extensions of the present approach. Finally critical views on the concept 'latent' - or 'true' - structure are discussed. The present work may be regarded as an extension of the basic contributions of Shepard; furthermore the work of Kruskal and Young has also been of invaluable help. Perhaps Young should be given credit for first explicitly formulating the goal of replacing goodness of fit (stress) with true fit. We believe that making more explicit the theme of latent structure (where indirectly the work of Lazarsfeld has been a valuable source of inspiration) and the related concept purification has made it possible to show that the goal formulated by Young can be reached. Since stress - apparent fit - is heavily influenced by irrelevant parameters such as n and t (dimensionality), any general description of how to evaluate this index (notably Kruskal's widely quoted description) is inadequate. We think that the presently proposed true fit index will prove to be a superior alternative.
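The purification idea summarized here can be illustrated with a small simulation. The sketch below is only a toy stand-in for the simulation system described in the appendix: classical (metric) Torgerson scaling is used instead of the nonmetric programs, and the correlations are computed directly on distance vectors. Empirical purification then corresponds to r(G, L) exceeding r(M, L).

    import numpy as np

    def classical_mds(d, t):
        """Classical (Torgerson) scaling: double-centre the squared distances and
        take the top t eigenvectors.  Used here only as a convenient stand-in."""
        n = d.shape[0]
        j = np.eye(n) - np.ones((n, n)) / n
        b = -0.5 * j @ (d ** 2) @ j
        vals, vecs = np.linalg.eigh(b)
        top = np.argsort(vals)[::-1][:t]
        return vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))

    def dists(x):
        return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))

    def offdiag(d):
        return d[np.triu_indices_from(d, k=1)]

    rng = np.random.default_rng(1)
    n, t = 20, 2
    latent = rng.normal(size=(n, t))                          # L: latent configuration
    d_latent = dists(latent)
    d_data = np.abs(d_latent + rng.normal(scale=0.3, size=d_latent.shape))
    d_data = (d_data + d_data.T) / 2                          # M: noisy, symmetric "data"
    np.fill_diagonal(d_data, 0.0)
    recovered = classical_mds(d_data, t)                      # G: reconstructed configuration

    r_ML = np.corrcoef(offdiag(d_data), offdiag(d_latent))[0, 1]
    r_GL = np.corrcoef(offdiag(dists(recovered)), offdiag(d_latent))[0, 1]
    print(f"r(M, L) = {r_ML:.3f}   r(G, L) = {r_GL:.3f}")     # purification if r(G, L) > r(M, L)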
The value of the conceptual framework we have formulated - the metamodel - is perhaps more clearly revealed in the present approach to the problem of dimensionality. We go beyond the plea for showing reliability of the output configuration, G - cfr. Cliff (1966) and Armstrong and Soelberg (1968) - by showing that computing this reliability, r(Gim, Gjm), for several different dimensionalities quite simply will reveal the correct dimensionality, since for this dimensionality the reliability will be largest. We have further shown that in general the output is more reliable than the data, which has not been done previously. This we have labelled empirical purification. The basic idea in the present approach is that true fit will be optimal when the analysis is guided by correct specifications (e.g. dimensionality). Not only will this lead to maximal empirical purification; the basic idea has also made it possible to replace inspection of stress curves (and e.g. the use of the "elbow criterion") by the simpler criterion of converting each stress value to a true fit value and then finding the correct dimensionality by simply picking the lowest (optimal) of the converted true fit values. The concept purification is made more valuable by the fact that distortion is also possible. In the present work we have not shown conclusively that empirical distortion may occur for a broad class of situations, but this is due to the fact that Part 1 is restricted to exploring dimensional models. It might here be mentioned that in studies to be reported in detail elsewhere we have found that if a tree structure model is used to analyse data generated by a dimensional model, then marked empirical distortion will occur.43 The results reported in Section 4.7 do, however, indicate that if n is not too low and t not too high, then for most realistic levels of retest reliability there should be a pronounced empirical purification. Consequently there may be conditions where, even if there is some empirical purification, this may not be sufficient for placing confidence in the underlying model. The converse of the basic idea is that if the wrong type of model is used we can not expect optimal true fit, but must expect either empirical distortion or pronounced deviations from the estimated theoretical purification. Stated otherwise we believe that one of the main contributions of the present approach is that it provides a new approach to the problem of applicability - the evaluation of the underlying model. Models are not regarded as theoretically neutral - on the contrary they are regarded as carrying implications which may or may not be warranted for a given set of data. So the present approach may also be seen in the broader context of evaluating theories. A point which perhaps should have been more emphasized is that these results are of obvious value not only for the interpretation of results but of equal importance for the planning of experiments. The user may first study Section 3.4 and find out the desired level of true fit. Figs. 10 to 12 will then indicate the set of combinations of n and reliability which will be necessary. If stringent tests of dimensionality and applicability are also desired, the results in Section 4.7, particularly in Table 15 and Fig. 16, may lead to modification of e.g. n.

43 HCS (Johnson, 1967) was used, and in the Oslo version we added an averaging procedure to the max and min methods originally proposed by Johnson.
The empirical distortion was marked for all methods, but slightly less for the averaging procedure.

Once more, however, we must sound a caution concerning all applications of these results since there are obvious limitations. Recall the complexity of the variables to be taken into account in simulation studies, cfr. Section 4.2, and the very limited region of the parameter space that has been explored thus far. It should be stressed, however, that the programming system outlined in the appendix easily can accommodate more complex cases than at present have been explored.44 It is fairly straightforward to provide results for, for instance, a specific value of n and a specified amount of variance for each of t dimensions. It may be a better strategy to perform separate runs of this programming system for specific cases than to grind out large (and probably fairly unmanageable) tables and graphs. A very interesting task will be to extend the present methodology to other kinds of models than multidimensional scaling. A good starting point would be the recent program for Polynomial Conjoint Analysis by Young (1972). This program embodies a large set of measurement models (or combination rules), as for instance multidimensional scaling, factor analysis, unfolding and additive conjoint measurement. For all these models there is a large set of choices of algorithmic strategies. The user can for instance choose not only between rank image and block partition transformations (cfr. Section 3.1) - there are also two other types of transformations. Furthermore there are for each type of transformation several possible minimization strategies, and finally there are several stress formulas to choose from. Even after the user has settled on a specific measurement model, he still has the double problem of first choosing an algorithmic strategy and then estimating appropriate parameters and evaluating the resulting solution. Concerning the first problem the strategy in Section 4.3 should be used. Hopefully this will produce some viable generalizations so that the user will be relieved of the burden of choice in a situation where the consequences at present can be but dimly perceived. Furthermore it is to be hoped that further explorations of the present methodology will suggest improvements in the methodology itself. For one thing the metamodel might be extended to cover more complex cases than those in Section 2.2 and Section 4.6, e.g. the choice between two related models such as factor analysis (a scalar product model) and multidimensional scaling (a distance model). It would further be convenient if the present graphical approach could be supplanted with an analytical approach. Provided correlation is reasonable as a basis for a more general index of true fit, it might be an advantage to have a more elegant transformation than the present TF-categories transformation. An even more important improvement would be to formalize the metamodel so that more precise deductions could be drawn. At present the metamodel is just a heuristic device. To some the present endeavours may perhaps appear to be nothing but idle exercises since the basic concept latent (or true) structure is thought to be irrelevant and misleading in social science. Such a critical view appears to be expressed by Lingoes and Roskam (1971, p. 124) who state:
At least one of us puts no store in what he considers the pseudoproblem of "recovering" known configurations, since for most social science data a "true" set of distances does not exist to be recovered. All that we typically have is a set of similarity/dissimilarity coefficients and our task is to understand the observed patterns. A geometric representation given certain specifications on the elements and properties of that representation is largely a convenience for aiding such comprehension - nothing more (it is no more "true" than the original data)! (Key phrases for subsequent discussion are underlined here.)
It might be noticed that Lingoes and Roskam partly echo remarks made earlier by Guttmann (1967, p. 75) who laments "the unfortunate use of terminology (by Shepard and others) such as “recovering” configurations." At this point we could note a differentiation into different schools, a "Shepard school" (to which the present work would belong) and a "Guttmann school". This, however, would be a deplorable state of affairs. We think there is ground for a conciliatory view and will argue for this. (Note the lack of unanimity in the Lingoes-Roskam quotation and the inclusion of a study of "metricity" (true fit) in their report.)

44 On request further information on the programming system will be given, or in special cases runs tailor-made to a specific problem can be run at the Computer Centre of Oslo.

The quotation raises four problems, of which the first is the central one and the one which will be most extensively discussed:
a) Does the present approach imply belief in the "existence" of a true set of distances?
b) How can one provide help for the researcher who of course wants to "understand the observed patterns"?
c) How are "certain specifications" to be justified?
d) Can a geometric representation be more true than the original data?
The problem of "existence" has ramifications which obviously can not be explored here. We choose to settle first on one structural interpretation which - we think - all parties will agree in refuting. This is what Piaget (1971) calls the preformational (Platonic) view of structures where "they may be viewed as given as such, in the manner of eternal essences." (op.cit. p. 60). Using "latent structure" as a conceptual tool for a given empirical set of data should not be taken to imply anything even faintly resembling an "eternal essence". To bring home this point it is convenient to restate some of the specifications of noise processes in Section 3.2. If the researcher in a given experiment chooses to conceive of the content of L as specified in one way at time t0, this does not commit him to conceive of precisely the same content of L at time t1. One may explore specification 2b), cfr. p. 32, and conceive of some random process which produces a different configuration at time t1 (other noise specifications may of course also be operative). Provided the perturbations are not too pronounced one might still under some conditions expect empirical purification. This might at least be explored. Specification 3b), cfr. p. 33, raises even more interesting possibilities. Generally one may conceive of a fairly large dimensionality for a given domain. What may be "relevant" (or salient) dimensions may, however, largely be dependent upon the context.
What in one context may be regarded as "error dimensions" (noise) may in other contexts be the salient dimensions, and vice versa. Furthermore the relative salience of relevant dimensions may vary with context (an experimental demonstration of this was given by Torgerson, 1965). While one would not expect any purification across such diversity of conditions it might still be of interest to explore consequences of such variation in simulation studies. The results could be valuable for evaluating empirical results following from specific experimental manipulations. The general point is that a given content of L may be regarded as just a convenient conceptual tool; it need not carry any connotation of "existence". From this point of view it is interesting to note that attempts to do without concepts of "latent" or "true" structure in other fields of psychometrics have been none too successful. There is for instance Tryon (1957) lashing out against the "doctrine of true and error scores" and "underlying factors". Yet the following quotation (op.cit. p. 237) where he defines his alternative, the domain score, is highly revealing: "The domain score, usually called a “true score”, is defined as the sum (or average) of scores on a large number of composites." The difference between a "domain score" and a "true score" seems to be mainly semantic in nature. A related example is stochastic models for e.g. intelligence tests where the basic formulations are in terms of probabilities of a given number of correct answers (e.g. Rasch, 1960). Yet this can be reformulated so that an observed number of correct answers can be expressed as an expected value plus a deviation from this value. More generally we do not see any fundamental difference between the conventional statistical machinery of 'universe parameters' and 'stochastic variables' and the present 'latent structure' and 'manifest data'. The basic point is that some structure (schema, image, hypothesis, conceptual tool) is necessary for the scientist in order to assimilate whatever there may be in the data. The scientist's structure is far from "static"; what is required is what Piaget calls a constructional view of structuralism: "there is no structure apart from construction" (Piaget, 1971, p. 140). This is the kind of view we tried to sketch in Ch. 1, cfr. p. 7-8, and in Ch. 2, p. 20, we more explicitly illustrated an example of this kind of view. Turning now to the other questions raised by the Lingoes-Roskam quotation we first note that b) takes for granted that there is an "observed pattern". This we believe can not be taken for granted. Just as one may easily simulate a random pattern, subjects may resort to guessing or some quasirandom process. Some data just are no good! So here we are back to our proposed true fit index - loosely referred to as amount of structure. As for c) no answer is provided by Lingoes and Roskam. Perhaps the present contribution would be less "offensive" if formulated in a language of providing methods for estimating parameters, since this is what our procedure for evaluating dimensionality does. Concerning d) we can of course not accept their flat denial that a configuration can be more true than the observed data. The answer to this question will depend upon how it is further specified.
We have interpreted "more true" as "purification", which we have been at some pains to show will occur when an appropriate model has been used. In his letter Guttmann (1967) repeatedly stresses the importance of looking directly for patterns in the observed data, for instance: "when I first saw Ekman's first colour vision data matrix, it was obvious - without any computing - that it was a circumplex." (op.cit. p. 76). We do not think this in any way is in opposition to Shepard's plea for "artificial machinery" (cfr. p. 9) to supplement the generally limited capacity most of us have for directly observing patterns. Scientific activity is one of the most poorly understood forms of human activity, and any attempts to guide or improve this activity, whether cultivation of "directly looking" or extensive computer simulation, will surely find their place in the joint scientific endeavour of figuring out the human complexity.

Appendix. Main features of the programming system.

The present programming system is written for a CDC 3300 but it should be fairly straightforward to adapt the basic ideas to other computer systems with good disk (tape) facilities. There are three basic features of the programming system:
a) One job calls several programs - it will therefore be convenient to refer to a program as an element. The programming system utilizes a general master system where a single card is sufficient to call an element.
b) All elements communicate via disk (or tape) files which may be temporary (scratch files) or permanent. Input and output for all elements are by means of a standardized system where both configurations and symmetric matrices are strung out as vectors. Goodness of fit indices are also outputted on files. There is no card output nor any necessity for punching printed output from one element as input to another element.
c) In each of the elements one parameter card is sufficient to process all the runs in one condition (a specific combination of n and t). Usually one (set of) parameter card(s) is necessary to process each single data vector. In the present system, however, a special loop has been built into MDSCAL, TORSCA and SSA-1 so that one parameter on the parameter card specifies the number of configurations to be processed. For the simple form of the metamodel (unrepeated designs, cfr. Section 4.2) this parameter, c, will simply be the number of noise levels, ne, times the number of replications for each noise level, rep. c will of course also be a parameter in all the other elements in a specific job.
We will briefly outline the typical sequence of elements in the simplest simulation studies and also indicate some of the more complex possibilities within the present system. One job might make six calls for elements as follows (most elements require one or two parameter cards):
1) DISTANCE. This element will typically generate c different L vectors and c corresponding M vectors.
2) MAIN PROGRAM. This will be either TORSCA, SSA-1 or MDSCAL. In the latter case a special preprocessing program is used first to compute the initial configuration from M, cfr. Section 4.32. The file output will be c G(C) vectors and a different file containing c goodness of fit values.
3) DISTANCE. This time this element will simply convert G(C) to G(V), cfr. Section 2.1.
4) FILE - 1. Typically this element will merge the separate files for L, M and G vectors to one file: (L1 M1 G1) (L2 M2 G2) ... (Lc Mc Gc).
5) RELATE. In the simplest case this program will compute correlations within each set (L, M, G). The program will select the correlations of interest and output them on a special file.
6) FILE - 2. This program handles the set of NL, TF and AF indices computed by the preceding elements. When necessary particular indices are transformed. The program contains a variety of transformations and may easily be extended. Means for each noise level of (transformed) indices are computed.
This system has been growing over some years and will continue to grow (perhaps to unwieldiness). The element DISTANCE will at present generate either distances or scalar products. A separate vector containing information on the number of noise levels desired is read in. Special parameters determine the type of noise process and the type of configuration generation. One initial value to start a random routine will be sufficient both to generate configurations and noise processes. It is, however, also possible to read in systematic configurations (from cards or from a file generated by yet other programs) which will then be perturbed by noise processes. For repeated designs one call of DISTANCE will produce s = ne x rep M vectors for each of con different configurations. In this case FILE - 1 will generate con sets of vectors where each of the sets is of the form (L, M1, M2, ... Ms, G1, G2 ... Gs), and RELATE will pick out the correlations (or subcorrelation matrices) of interest to be further processed by FILE - 2. When each configuration is analyzed in different dimensionalities, m, FILE - 1 will separate out the various sets of Gm vectors according to m, or various subcorrelation matrices may be picked out by RELATE. One job may also include several different MAIN PROGRAMS for each M vector (cfr. Section 4.32). Furthermore RELATE may be supplemented by MATCH (Cliff, 1966). Before FILE - 1 is called one may also call OS - 1 (Lingoes, 1967) which has been adapted so that rank images of M (M*G and/or M*L, cfr. Section 3.2, p. 38) are outputted and then further sorted out by FILE - 1. Again RELATE may pick out any subcorrelation matrix of interest, however complex the structure inputted to RELATE by preceding elements will be. In complex cases FILE - 2 may receive a variety of indices of NL, TF and AF (from MATCH, several MAIN PROGRAMS, several varieties of correlations from RELATE). If desired, (transformed or untransformed) indices may be outputted from FILE - 2 and then the interrelations between all these indices (or any desired subset) may be studied by a new call of RELATE, which now may compute correlations or, if desired, root mean square discrepancies. In Section 3.3 a simple example of this strategy is reported; otherwise reports from such runs have mostly been tucked away in footnotes. As implied by the Concluding remarks there are many features incorporated in the present system which so far have not been explored. One example is to generate configurations with different variance in different dimensions; this is simple to do by reading in a separate vector in DISTANCE. Similarly separate vectors can be read in to specify different noise parameters for different points or dimensions. It might here be mentioned that if it is desired a vector containing stress components for separate points may be outputted on file from MDSCAL.
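To make the data flow concrete, here is a toy sketch of the simplest (unrepeated) job. It is our own illustration, not the CDC 3300 system: the element names follow the appendix, but the bodies are dummy stand-ins and in-memory lists replace the disk files (the second DISTANCE call, converting G(C) configurations to G(V) distance vectors, is folded into the MAIN PROGRAM stub).

    import numpy as np

    ne, rep = 3, 4
    c = ne * rep                      # one job processes c = ne x rep vectors
    rng = np.random.default_rng(2)

    def DISTANCE(c, m=45):            # 1) generate c latent (L) and manifest (M) distance vectors
        L = [rng.normal(size=m) for _ in range(c)]
        M = [l + rng.normal(scale=0.3, size=m) for l in L]
        return L, M

    def MAIN_PROGRAM(M):              # 2) stand-in for TORSCA / SSA-1 / MDSCAL (step 3 folded in)
        return [m.copy() for m in M], [0.0] * len(M)       # G(V) vectors and fit values

    def FILE_1(L, M, G):              # 4) merge the separate files into (L_i, M_i, G_i) records
        return list(zip(L, M, G))

    def RELATE(records):              # 5) pick out the correlations of interest, here r(L, G)
        return [np.corrcoef(l, g)[0, 1] for l, _, g in records]

    def FILE_2(indices, ne, rep):     # 6) (transform and) average the indices per noise level
        return np.array(indices).reshape(ne, rep).mean(axis=1)

    L, M = DISTANCE(c)
    G, fit = MAIN_PROGRAM(M)
    print(FILE_2(RELATE(FILE_1(L, M, G)), ne, rep))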
A simple extension of the present system will be to include fairly detailed tables in FILE - 2 and to incorporate a double linear interpolation process, so that e.g. TF|AF (cfr. Sections 4.6 and 4.7) can be computed with sufficient precision without the more cumbersome graphical procedures now being used. The general Polynomial Conjoint Analysis program (POLYCON, Young, 1972) should (in principle) be fairly simple to incorporate in the present system. Features b) and c), cfr. p. 145, indicate the necessary additions to POLYCON. For each of the new measurement models in POLYCON a corresponding extension of DISTANCE can be plugged in; otherwise the system does not need any major revision (perhaps other true fit indices may be found appropriate, but this will not call for anything but minor revisions).

It finally remains to give the details of the TF-categories transformation (which is part of FILE - 2). For r(Gi, Gj) or r(Mi, Mj) the square root of r is first computed; the succeeding steps are then the same as for r(L, G). Call the starting point R; TF-categories (TFCAT) is then computed as follows:

C     PIECEWISE LINEAR TRANSFORMATION OF THE FIT INDEX TO TF-CATEGORIES
      REAL K
      HELP = 1. - SQRT(1. - R*R)
      K = SQRT(1. - HELP*HELP)
C     THREE-BRANCH (ARITHMETIC) IF: CHOOSE THE SEGMENT ACCORDING TO K
      IF (K - .956) 2,1,1
    1 A = 11.364
      B = -6.864
      GO TO 5
    2 IF (K - .457) 4,3,3
    3 A = 6.012
      B = -1.747
      GO TO 5
    4 A = 3.282
      B = -.5
    5 TFCAT = A*K + B

References:

Abelson, R.P., and Sermat, V. (1962) Multidimensional scaling of facial expressions. Journal of Experimental Psychology, 63, 546-554.
Abelson, R.P., and Tukey, J.W. (1963) Efficient utilization of non-numerical information in quantitative analysis: General theory and the case of simple order. Annals of Mathematical Statistics, 34, 1347-1369.
Armstrong, J.S., and Soelberg, P. (1968) On the interpretation of factor analysis. Psychological Bulletin, 70, 361-364.
Attneave, F. (1950) Dimensions of similarity. American Journal of Psychology, 63, 516-556.
Attneave, F. (1962) Perception and related areas. In S. Koch (Ed.), Psychology: A study of a science, Volume 4. New York: McGraw-Hill, 619-659.
Bakan, D. (1966) The test of significance in psychological research. Psychological Bulletin, 66, 423-436.
Bannister, D., and Mair, J.M.M. (1968) The evaluation of personal constructs. New York: Academic Press.
Bateson, G. (1955) A theory of play and fantasy. Psychiatric Research Reports, 2, 39-51.
Bateson, G., Jackson, D.D., Haley, J., and Weakland, J. (1955) Toward a theory of schizophrenia. Behavioral Science, 1, 251-264.
Beals, R., Krantz, D.H., and Tversky, A. (1968) Foundations of multidimensional scaling. Psychological Review, 75, 127-143.
Campbell, D.T., and Fiske, D.W. (1959) Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cattell, R.B. (1962) The relational simplex theory of equal interval and absolute scaling. Acta Psychologica, 20, 139-153.
Cattell, R.B. (Ed.) (1966) Handbook of multivariate experimental psychology. Chicago: Rand McNally.
Cliff, N. (1966) Orthogonal rotation to congruence. Psychometrika, 31, 33-42.
Coombs, C.H. (1964) A theory of data. New York: Wiley.
Coombs, C.H., Dawes, R.M., and Tversky, A. (1970) Mathematical psychology: An elementary introduction. New Jersey: Prentice-Hall.
Coombs, C.H., and Kao, R.C. (1960) On a connection between factor analysis and multidimensional unfolding. Psychometrika, 25, 219-231.
Deese, J. (1962) On the structure of associative meaning. Psychological Review, 69, 161-176.
Degerman, R. (1970) Multidimensional analysis of complex structure: Mixtures of class and quantitative variation. Psychometrika, 35, 475-490.
Eckart, C., and Young, G. (1936) The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218.
Eckblad, G. (1971a) On hierarchical structures in programs and plans. In G. Eckblad (Ed.), Hierarchical models in the study of cognition. University of Bergen.
Eckblad, G. (1971b) Comments on Rommetveit's "On concepts of hierarchical structures", Part II. In G. Eckblad (Ed.), Hierarchical models in the study of cognition. University of Bergen.
Edwards, W., Lindman, H., and Phillips, L.D. (1965) Emerging technologies for making decisions. In New directions in psychology II. New York: Holt, Rinehart and Winston.
Edwards, W., Lindman, H., and Savage, L.J. (1963) Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.
Ekman, G. (1954) Dimensions of color vision. Journal of Psychology, 38, 467-474.
Fransella, F., and Adams, B. (1966) An illustration of the use of the repertory grid technique in a clinical setting. British Journal of Social and Clinical Psychology, 5, 51-62.
Gibson, E.J. (1970) The ontogeny of reading. American Psychologist, 25, 136-143.
Gombrich, E.H. (1965) The use of art for the study of symbols. American Psychologist, 20, 34-50.
Green, B.F. Jr. (1966) The computer revolution in psychology. Psychometrika, 31, 437-445.
Guttman, L. (1966) Order analysis of correlation matrices. In R.B. Cattell (Ed.), Handbook of multivariate experimental psychology. Chicago: Rand McNally, 439-458.
Guttman, L. (1967) The development of nonmetric space analysis: A letter to Professor John Ross. Multivariate Behavioral Research, 2, 71-82.
Guttman, L. (1968) A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika, 33, 469-506.
Haley, J. (1963) Strategies of psychotherapy. New York: Grune and Stratton.
Haley, J. (Ed.) (1969) Advanced techniques of hypnosis and therapy. New York: Grune and Stratton.
Harman, H.H. (1967) Modern factor analysis. Second edition, revised. Chicago: The University of Chicago Press.
Henrysson, S. (1957) Applicability of factor analysis in the behavioral sciences: A methodological study. Stockholm: Almqvist & Wiksell.
Hinkle, D.N. (1965) The change of personal constructs from the viewpoint of a theory of implications. Unpublished Ph.D. thesis, University of Colorado.
Hinkle, D.N. (1970) The game of personal constructs. In D. Bannister (Ed.), Perspectives in personal construct theory. New York: Academic Press, 91-110.
Indow, T., and Kanazawa, K. (1960) Multidimensional mapping of colors varying in hue, chroma and value. Journal of Experimental Psychology, 59, 330-336.
Johnson, S.C. (1967) Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Johnson, S.C. (1968) Metric clustering. Mimeographed report. Bell Telephone Laboratories, Murray Hill, New Jersey.
Kelly, G.A. (1955) The psychology of personal constructs. Volumes I and II. New York: Norton.
Klahr, D. (1969) A Monte Carlo investigation of the statistical significance of Kruskal's nonmetric scaling procedure. Psychometrika, 34, 319-330.
Krantz, D.H. (1972) Measurement structures and psychological laws. Science, 175, 1427-1435.
Krantz, D.H., and Tversky, A. (1971) Conjoint-measurement analysis of composition rules in psychology. Psychological Review, 78, 151-169.
Kruskal, J.B. (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27.
Kruskal, J.B. (1964b) Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115-129.
Kruskal, J.B. (1967) How to use MDSCAL, a multidimensional scaling program. Mimeographed report. Bell Telephone Laboratories, Murray Hill, New Jersey.
Kruskal, J.B., and Hart, R.E. (1966) A geometric interpretation of diagnostic data for a digital machine: Based on a study of the Morris, Illinois Electronic Central Office. Bell System Technical Journal, 45, 1299-1338.
Larsen, E. (1972) Valget - et strategisk og terapeutisk virkemiddel i psykoterapi. En tentativ syntese av ulike teoretiske tilnærmingsmåter til psykoterapi [The choice - a strategic and therapeutic device in psychotherapy. A tentative synthesis of different theoretical approaches to psychotherapy]. Hovedoppgave, Psykologisk institutt, University of Oslo.
Lashley, K.S. (1942) An examination of the "continuity theory" as applied to discriminative learning. Journal of General Psychology, 26, 241-265.
Lazarsfeld, P.F. (1959) Latent structure analysis. In S. Koch (Ed.), Psychology: A study of a science, Volume 3. New York: McGraw-Hill, 476-543.
Lingoes, J.C. (1965) An IBM 7090 program for Guttman-Lingoes Smallest Space Analysis-I. Behavioral Science, 10, 183-184.
Lingoes, J.C. (1966) Recent computational advances in nonmetric methodology for the behavioral sciences. In Proceedings of the international symposium: Mathematical and computational methods in social sciences. Rome: International Computation Center, 1-38.
Lingoes, J.C. (1967) An IBM 7090 program for Guttman-Lingoes Configurational Similarity-I. Behavioral Science, 12, 502-503.
Lingoes, J.C., and Roskam, E. (1971) A mathematical and empirical study of two multidimensional scaling algorithms. Michigan Mathematical Psychology Program, 1.
Mandelbrot, B. (1965) A class of long-tailed probability distributions and the empirical distribution of city sizes. In Proceedings of the Seminars of Menthon-Saint-Bernard, France (1-27 July 1960) and of Gösing, Austria (3-27 July 1962). Paris and The Hague: Mouton & Co.
Miller, G.A. (1967) Psycholinguistic approaches to the study of communication. In D. Arm (Ed.), Journeys in science: Small steps - great strides. Albuquerque: The University of New Mexico Press, 22-73.
Miller, G.A. (1969) A psychological method to investigate verbal concepts. Journal of Mathematical Psychology, 6, 169-191.
Osgood, C.E. (1969) Introduction. In J.G. Snider and C.E. Osgood (Eds.), Semantic differential technique. Chicago: Aldine Publishing Co., vii-ix.
Osgood, C.E., and Luria, Z. (1954) A blind analysis of a case of multiple personality using the semantic differential. Journal of Abnormal and Social Psychology, 49, 579-591.
Osgood, C.E., Suci, G.J., and Tannenbaum, P.H. (1957) The measurement of meaning. Urbana: University of Illinois Press.
Piaget, J. (1971) Structuralism. London: Routledge and Kegan Paul.
Ramsay, J.O. (1969) Some statistical considerations in multidimensional scaling. Psychometrika, 34, 167-182.
Rasch, G. (1960) Probabilistic models for some intelligence and attainment tests. København: Danmarks pædagogiske Institut.
Restle, F. (1959) A metric and an ordering on sets. Psychometrika, 24, 207-219.
Rommetveit, R. (1968) Words, meanings and messages: Theory and experiments in psycholinguistics. New York: Academic Press, and Oslo: Universitetsforlaget.
Rommetveit, R. (1972) Språk, tanke og kommunikasjon [Language, thought and communication]. Oslo: Universitetsforlaget.
Roskam, E. (1969) A comparison of principles for algorithm construction in nonmetric scaling. Michigan Mathematical Psychology Program, 2.
Shepard, R.N. (1962a) The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27, 125-140.
Shepard, R.N. (1962b) The analysis of proximities: Multidimensional scaling with an unknown distance function. II. Psychometrika, 27, 219-245.
Shepard, R.N. (1963a) Analysis of proximities as a technique for the study of information processing in man. Human Factors, 5, 33-48.
Shepard, R.N. (1963b) Comments on Professor Underwood's paper: Stimulus selection in verbal learning. In C.N. Cofer and B.S. Musgrave (Eds.), Verbal behavior and learning: Problems and processes. New York: McGraw-Hill, 48-70.
Shepard, R.N. (1964) On subjectively optimum selection among multiattribute alternatives. In M.W. Shelly and G.L. Bryan (Eds.), Human judgments and optimality. New York: Wiley, 257-281.
Shepard, R.N. (1966) Metric structures in ordinal data. Journal of Mathematical Psychology, 3, 287-315.
Shepard, R.N., and Carroll, J.D. (1966) Parametric representation of nonlinear data structures. In P.R. Krishnaiah (Ed.), Multivariate analysis. New York: Academic Press, 561-592.
Shepard, R.N., and Chipman, S. (1970) Second-order isomorphism of internal representations: Shapes of states. Cognitive Psychology, 1, 1-17.
Sherman, C.R. (1970) Nonmetric multidimensional scaling: The role of the Minkowski metric. Chapel Hill, North Carolina: L.L. Thurstone Psychometric Laboratory Report No. 82.
Smedslund, J. (1967) Noen refleksjoner om Rorschach-testen [Some reflections on the Rorschach test]. Nordisk Psykologi, 19, 203-209.
Spaeth, H.J., and Guthery, S.B. (1969) The use and utility of the monotone criterion in multidimensional scaling. Multivariate Behavioral Research, 4, 501-515.
Stenson, H.H., and Knoll, R.L. (1969) Goodness of fit for random rankings in Kruskal's nonmetric scaling procedure. Psychological Bulletin, 71, 122-126.
Thurstone, L.L. (1947) Multiple factor analysis. Chicago: The University of Chicago Press.
Torgerson, W.S. (1952) Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401-419.
Torgerson, W.S. (1958) Theory and methods of scaling. New York: Wiley.
Torgerson, W.S. (1965) Multidimensional scaling of similarity. Psychometrika, 30, 379-393.
Torgerson, W.S. (1967) Psychological scaling. In Psychological measurement theory: Proceedings of the NUFFIC international summer session in science. The Netherlands: Psychological Institute of the University of Leyden, 151-180.
Torgerson, W.S. (1968) Multidimensional representation of similarity structures. In M.M. Katz, J.O. Cole and W.E. Barton (Eds.), The role and methodology of classification in psychiatry and psychopathology. Washington, D.C.: U.S. Government Printing Office, 212-220.
Tryon, R.C. (1957) Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin, 54, 229-249.
Tryon, R.C., and Bailey, D.E. (1970) Cluster analysis. New York: McGraw-Hill.
Tschudi, F., and Larsen, E. (1970) Notes on Harold Greenwald's technique: Pointing out the advantage of the symptom. Mimeographed paper. University of Oslo.
Wagenaar, W.A., and Padmos, P. (1971) Quantitative interpretation of stress in Kruskal's multidimensional scaling technique. British Journal of Mathematical and Statistical Psychology, 24, 101-110.
Watzlawick, P., Beavin, J.H., and Jackson, D.D. (1967) Pragmatics of human communication. New York: W.W. Norton.
Weakland, J.H., Fisch, R., Watzlawick, P., and Bodin, A.H. (1972) Brief therapy: Focused problem resolution. Mimeographed report. Mental Research Institute, Palo Alto, California.
Xhignesse, L.V., and Osgood, C.E. (1967) Bibliographic citation characteristics of the psychological journal network in 1950 and 1960. American Psychologist, 22, 778-792.
Young, F.W. (1968a) A FORTRAN IV program for nonmetric multidimensional scaling. Chapel Hill, North Carolina: L.L. Thurstone Psychometric Laboratory Report No. 56.
Young, F.W. (1968b) Nonmetric multidimensional scaling: Development of an index of metric determinacy. Chapel Hill, North Carolina: L.L. Thurstone Psychometric Laboratory Report No. 68.
Young, F.W. (1970) Nonmetric multidimensional scaling: Recovery of metric information. Psychometrika, 35, 455-473.
Young, F.W. (1972) POLYCON users manual: A FORTRAN IV program for polynomial conjoint analysis. Chapel Hill, North Carolina: L.L. Thurstone Psychometric Laboratory Report No. 104.
Young, F.W., and Appelbaum, M.L. (1968) Nonmetric multidimensional scaling: The relationship of several methods. Chapel Hill, North Carolina: L.L. Thurstone Psychometric Laboratory Report No. 71.
Young, F.W., and Torgerson, W.S. (1967) TORSCA, a FORTRAN IV program for Shepard-Kruskal multidimensional scaling analysis. Behavioral Science, 12, 498.
Zinnes, J.L. (1969) Scaling. In P.H. Mussen and M.R. Rosenzweig (Eds.), Annual Review of Psychology, 20, 447-478.