Psychology 678, Seminar in Mathematical Psychology: Structural Modeling with Multivariate Psychometrics

Possible different interpretations of rxy, Pearson's linear bivariate correlation coefficient (reviewed by Petr Blahuš)

Among all the interpretations and consequences of the definition of the coefficient, there are two that are closely related to psychometric and psychological theories. These are placed here at the beginning regardless of their belonging to the mutually different categories listed below. They are:
1. the "geometric N-dimensional" interpretation - it belongs among the other geometric ways of looking at rxy;
2. the "common mental elements" interpretation - it belongs mostly to the category of 'probabilistic-proportional' interpretations.

A) The two versions closest to psychological interpretation

1. "Geometric N-dimensional" interpretation

It is based on the geometric representation of the N-by-2 data matrix as 2 vectors (points) in N dimensions, where r is the cosine of the angle between the two vectors (each representing a variable). (Just an aside: for raw data the length of a variable-vector is the variable's SD.)

The geometry of two variable-vectors in the N-dimensional space of N persons, and the idea of a battery of tests as a cluster of vectors with small angles (high correlations) between them because they lie close to one common dimension, is an important basis for investigating the diagnostic properties of a battery of psychological tests (questionnaires and other composite diagnostic methods). In particular it bears on homogeneity with respect to a "construct" and on the dimensionality of the test battery. This dimensional-geometric interpretation has by now become traditional in multivariate statistics, but historically it was fundamental to the early psychometric and psychological theories. Pearson's 1901 paper 'On lines and planes of closest fit to systems of points in space' de facto laid the geometric foundations for principal-axes factor analysis - the principal axes of the ellipsoid of multidimensional scatter under a multivariate normal distribution. For some time this idea stayed aside when Spearman (1904), with his '"General Intelligence," Objectively Determined and Measured', based his idea of the G-factor on a generic true score and on his formula for the disattenuated correlation. But then Burt and Horst turned to the multidimensional geometric idea, and especially Thurstone based his multiple-factor theory of abilities on the geometry of principal axes rotated to reach an interpretable "simple structure" (the computation of the axes was roughly approximated by his centroid method in that space). Hotelling (1933) related the computation explicitly to the algebraic eigenvalue solution, which he preferred to interpret as a variance decomposition into principal 'components' (without rotation). About this time Burt, with his 'The Factors of the Mind', seemed to base his conception of "mind" on a kind of perhaps abstract multidimensional continuum that had to be described by multidimensional mathematics. In the same direction of theorizing we can trace the somewhat more recent work of a Finn*), Ahmavaara, Y.: 'On the unified factor theory of mind' (Helsinki, Tiedeakatemia 1957), who reportedly tried to put the 'theory of mind' on a background similar to unified field theory and continuum theory in modern physics. I can imagine something like that only very vaguely, as follows: a "mental performance" is a point in the multidimensional continuum whose n axes are mental "abilities".
The location and "trace line" (a "mental 'world-line'"?) of this point in the space (perhaps not just 4-dimensional but an "(n+1)-dimensional 'time-space'"?) is given by the "mental coordinates" and by their changes in time.

------
*) It is not known whether introvert or extrovert.
-------

2. "Common mental elements" interpretation

Among the probabilistic interpretations further below we find (Eq. 1), which specializes Pearson's formula to the case of binary, zero-one variables, alias "items". See there the derivation, which yields

rxy = nxy / √[(nxy + nx)(nxy + ny)],

where nx is the number of 1's occurring only in X, ny by analogy for Y, and nxy is the number of 1's that occur jointly in both X and Y, i.e. the number of "common elements". This formula simplifies to a plain proportion if both items have the same "difficulty"; see further below, and see the numeric sketch at the end of this subsection. So the correlation coefficient appears as an "overlap", a proportion of "common elements". There were two historical psychological interpretations:
1. Pavlov's jointly "excited" cells in the brain cortex responsible for conditioning;
2. Thomson's theory that the theoretical concept or "construct" of an ability consists of (latent) common mental elements that manifest themselves in, and are jointly necessary for, solving two psychological tasks, tests. (Thomson was a psychometrician and factor analyst active around the Second World War.)

So correlation is read as the proportion of overlap of the mental elements working in the two different activities - the overlap = the common ability or "common latent factor" as a "common latent mental agens" (agent) functioning in both activities. N.B.: Just as Spearman did in 1904, Thomson assumes "common elements", while the rest, needed for completing the performance, are "specific elements", i.e. specific to the individual activity or test. This is why Spearman called his theory of intelligence the "theory of one general factor" or also the "two-factor theory" - compare with the decomposition of a test score into true, specific, and unreliable parts, and with the problem of distinguishing between the specific and unreliable parts, which together form the unique part (in the factor model this is sometimes called the discrepancy of the model, or the residual part of the test score as generated by the model).
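To make the "common elements" reading concrete, here is a minimal numeric sketch in Python; the element counts are hypothetical, chosen only for illustration of the overlap formula above:

import math

# Hypothetical element counts for two tasks X and Y:
n_xy = 30   # elements common to both tasks ("common elements")
n_x = 10    # elements engaged by X only
n_y = 10    # elements engaged by Y only

# Thomson-style overlap correlation: common elements divided by the
# geometric mean of the two tasks' total element counts.
r = n_xy / math.sqrt((n_xy + n_x) * (n_xy + n_y))

# With equal "difficulty" (n_x == n_y) this reduces to a plain proportion:
assert abs(r - n_xy / (n_xy + n_x)) < 1e-12
print(r)   # 0.75 -- the proportion of common elements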
B) Review of interpretations

The geometric interpretations:

a) Representation of an N-by-2 data matrix as N points in 2 dimensions
(1) rxy automatically (implicitly) produces the "best fitting" line to the scattergram of the N points in the least-squares sense - actually two lines, one for estimating Y when one knows X, the other for estimating X when one knows Y.
(2) With the X, Y variables standardized into z-score variables ZX, ZY (each having mean = 0 and SD = 1, so that the two regression lines go through the origin), it may be shown that:
(3) r is the cosine of the angle between the two regression lines (it approaches 1 as the lines approach each other, merging into one line through the origin at a 45-degree angle);
(4) r is the tangent (the slope) of the angle between the line for estimating ZY and the x-axis, and reciprocally, the tangent of the angle between the line for estimating ZX and the y-axis (it approaches 1 as the lines rotate to join at a 45-degree angle, and approaches 0 as each line approaches its own axis, becoming orthogonal to the other);
(5) for unstandardized raw data, r is the geometric mean of the two slopes/tangents, which are then unequal (i.e. r is the square root of their product).

b) Representation of the N-by-2 data matrix as 2 vectors (points) in N dimensions
(6) r is the cosine of the angle between the two vectors (each representing a variable). (Just an aside: for raw data the length of a variable-vector is the variable's SD.)

A numeric check of (5), (6), and (11) is sketched after item (11) below.

Variance proportional-percentual interpretations:
(7) r², the determination = the proportion or % of "explained" variance of Y, i.e. r²×100%;
(8) 1 − r², the non-determination or alienation = the proportion of "unexplained" variance of Y;
(9) r itself (i.e. r×100%, not its square) is the proportion of the "explained" part of the original SD of Y, sy, i.e. the proportion of the reduction of the standard deviation to the standard error se of estimating Y. (If we have no information from X about Y, our "best" estimate of a person's score y in Y would be the mean of Y, with a possible maximum error within the range ±3sy. But if we know his/her value x in X and use the regression equation, the regression estimate ŷ has a possible maximum error only within the range ±3se.) By the way: why do people use the proportional interpretation only with r² as the explained part of variance, and not directly with r as the reduction of the error of estimate?
(10) Similarly, not only r² but r itself is a proportion (or %): by the Cauchy-Schwarz inequality, rxy is the ratio of the actual co-variation to the maximum possible co-variation, i.e. r is by definition the ratio (or %) of the actual Cov(X,Y) to the product of the two SDs, the denominator being the maximum possible value of the covariance in the numerator. Again: why is the common habit to apply the percentual interpretation only to r²? (And you can find people who speak of r = .50 as a "fifty percent correlation", but those are dismissed as utter ignorants.)
(11) Also, r², as a ratio of two variances ("around the regression line" to the whole s²y), is of course a part of ANOVA (a well-known fact), and it is related to ANOVA's F-test by the formula F = r² / [(1 − r²)/(N − 2)] (a formula that is not so widely known).
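The numeric check promised above, in Python, assuming numpy is available; the simulated data are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = 0.6 * x + rng.normal(size=N)                # some linearly related data

r = np.corrcoef(x, y)[0, 1]

# (6) r as cosine of the angle between the two centered variable-vectors in N-space:
xc, yc = x - x.mean(), y - y.mean()
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# (5) r as the geometric mean of the two raw-score regression slopes
# (the geometric mean itself recovers |r|; the sign comes from the covariance):
b_yx = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope for estimating Y from X
b_xy = np.cov(x, y)[0, 1] / np.var(y, ddof=1)   # slope for estimating X from Y
geo_mean = np.sqrt(b_yx * b_xy)

# (11) the ANOVA F statistic obtained from r alone:
F = r**2 / ((1 - r**2) / (N - 2))

print(np.allclose(r, cos_angle), np.allclose(abs(r), geo_mean), F)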
(12) If the regression is really linear (and homoskedastic), then r² becomes a special case of the correlation ratio η², which in the general case is a measure of the overall relationship in the given data set; typically the difference η² − r² ≥ 0 is used as an indicator of nonlinearity (and of violations of other assumptions).
(13) If the assumptions for linear correlation are met, then as N increases r² approaches ω², Hays' ω², as a population measure of the relative "size of effect".

Probability interpretation: r as a joint probability and a measure of probabilistic dependence

If X and Y are binary 0/1 variables ("correct/wrong" items), then their means and SDs can be rewritten in a special form:
- the mean of X becomes px, the proportion of 1's in X ... the probability of a correct answer (better: the sample estimate of that probability, the "difficulty" index of the item);
- the mean of Y becomes py, the proportion of 1's in Y;
- the proportions of 0's are then qx = 1 − px, qy = 1 − py;
- the standard deviations then also take a special form: sx = √(px qx), sy = √(py qy);
- the proportion of pairs in which both X and Y have 1's is denoted pxy - the estimate of the joint probability of answering both items correctly.
(14) Then the covariance of X, Y takes the special form sxy = pxy − px py, which by definition is a measure of probabilistic dependence as a "departure" from independence (how much the actual joint probability pxy differs from the joint probability px·py formed as the product of the marginal probabilities under the assumption of independence).
(15) Then the correlation by definition becomes
rxy = sxy / (sx sy) = (pxy − px py) / √(px qx py qy),     (Eq. 1)
and it is "nothing more" than the "departure" from probabilistic independence, standardized so that the probability measure is scaled between 0 and 1 (to be comparable across different studies and the like). This concept of the correlation coefficient as a rescaled joint probability could then be more or less metaphorically transferred and used for its probabilistic interpretation even when the variables X, Y are measured continuously. Perhaps on a logic quite inverted (perverted?) relative to the usual case, in which dichotomized variables are assumed to arise by dissection of continuous scales: say, as if the original values (categories) of a variable were 1 and/or 0, but we had measured them on an unnecessarily detailed quantitative scale - for example "enough light intensity to switch a contact" and/or "not enough", even though we could have measured the light intensity with a photometer on a high-resolution scale in lumens (candelas).

In the case above, which led to the formula (Eq. 1), we now use another notation:
nx ... the absolute number of 1's occurring only (!) in X (pairs that also have a 1 in Y are not included);
ny ... the absolute number of 1's only in Y (the same restriction applies);
nxy ... the absolute number of pairs with 1's in both X and Y.
(16) Using (Eq. 1) we get rxy as the above-mentioned proportion of "common elements",
rxy = nxy / √[(nxy + nx)(nxy + ny)],
a formula that reduces to a plain proportion if both items have the same "difficulty". (??)

A numeric check of (Eq. 1) itself is sketched below.
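A minimal check of (Eq. 1) against the ordinary product-moment formula, in Python, assuming numpy; the binary item data are simulated for illustration only:

import numpy as np

rng = np.random.default_rng(1)
# Illustrative binary items: Y agrees with X about 80% of the time.
x = rng.integers(0, 2, size=500)
y = np.where(rng.random(500) < 0.8, x, 1 - x)

p_x, p_y = x.mean(), y.mean()          # "difficulty" indices
q_x, q_y = 1 - p_x, 1 - p_y
p_xy = (x * y).mean()                  # joint proportion of 1's in both items

# (Eq. 1): the departure from independence, standardized.
phi = (p_xy - p_x * p_y) / np.sqrt(p_x * q_x * p_y * q_y)

# It coincides exactly with Pearson's r computed on the 0/1 scores:
print(np.allclose(phi, np.corrcoef(x, y)[0, 1]))   # True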
The interesting similarity with the probabilistic interpretation of Kendall's nonparametric correlation coefficient τ is only accidental, since his coefficient apparently cannot be derived as a consequence or special case of Pearson's parametric coefficient. For its computation the values of the variables X, Y are ordered, and
- the number Nagr of pairs whose order numbers agree, and
- the number Ndis of pairs whose order numbers disagree
are counted; then
τ = (Nagr − Ndis) / N = Nagr/N − Ndis/N
is a difference of two proportions or, say, of two probabilities (N here being the total number of pairs compared).

(17) Pearson's r versus Spearman's "rho": In a very unrealistic case - if the differences between neighbouring rank numbers were an exact linear transformation of the quantitative values measured on an interval scale - we could meaningfully say that everything that holds here for the parametric Pearson coefficient also holds for the so-called "nonparametric" Spearman coefficient ρ. I call it "pseudo-nonparametric". It is not widely known that Spearman's ρ is computationally identical with the parametric Pearson product-moment coefficient. Its computational formula is the original Pearson formula, just a special case of it, simplified thanks to the circumstance that the rank numbers have minimum 1 and maximum N. See the derivation in the Appendix; a numeric check is also sketched after point (22) below. So it is "pseudo-nonparametric" in that it fundamentally (but on what grounds?) assumes that the differences between ranks are equal - that they form an equal-interval scale!! This assumption has the consequence that its value is higher than the Pearson coefficient computed from the original quantitative values (it would equal it only under the unrealistic assumption above). Its only nonparametric feature is that, for smaller samples, it allows tables of exact probabilities to be computed for the statistical test of H0 that the parametric correlation in the population is exactly zero. (Are there studies on whether ρ is an unbiased and consistent estimator, compared with Pearson's r, also for population values different from 0?) Therefore the recommendation to use Spearman's ρ for ordinal data is quite pointless - one can directly use the classical Pearson parametric correlation formula. The only reason for using Spearman's ρ is the joint occurrence of three circumstances: (i) you do not want to evaluate the strength of the relationship, you just want to test the null hypothesis H0 of exactly zero correlation in the population, (ii) the sample is very small (as is typically recommended), say less than 30, and (iii) you have no access to a computer and need a pencil-and-paper computation using the nice and simple formula.

Information theory and rxy

Close to probability theory is Shannon's theory of information, which defines the entropy H(X) of a variable X (or the entropy of the variable's probability distribution) as an expectation - for a discrete X, as the negative weighted sum of the binary logarithms of the probabilities pi (relative frequencies) distributed over the discrete values (or categories) xi:
H(X) = −Σi pi log2 pi,
a non-negative quantity in bits, whose maximum for a given discretization of the range of X occurs when the pi's are all equal (maximum uncertainty). The entropy H(Y) is given analogously, as is the joint two-dimensional entropy H(X,Y) of the two-dimensional probability distribution. The amount of information in bits shared by X and Y is then I(X,Y) = H(X) + H(Y) − H(X,Y). If all the distributions are continuous normal, then
(18) the correlation coefficient is a function of the information, and vice versa:
r²xy = 1 − 4^(−I(X,Y)),   i.e.   I(X,Y) = −½ log2(1 − r²xy)
(cf. Kullback's book Information Theory and Statistics). A numeric check of (18) is sketched after point (22) below. Blahuš (1969, 1971) used the information measures to create a table of information measures between motor tests, and (1971) suggested using the difference rI(X,Y) − rxy ≥ 0 between the classically computed rxy and the correlation derived from the information, rI(X,Y), as a measure of the departure of linear regression from its assumptions and from the overall probabilistic dependence - an analogy to η² − r² ≥ 0.

Correlation vs. comparing a difference of means

(19) rxy and the t-test are equivalent: The exact function that transforms rxy into t and back supports the opinion that the traditional distinction between statistical methods for comparing groups (their means) and methods for evaluating relationships is not well justified - each of them identifies a relationship; only in the former case the variable that classifies the persons into two groups is qualitative. Say the variable is sex: (a) either we divide the persons into two groups - one, marked "1", contains the females, the other, marked "0", the males - measure the quantitative variable Y in both, and compare the difference between the two means ȳ1 − ȳ0 by computing Student's t-test, (b) or we merge both groups into one sample of overall size N and compute rxy between the binary variable X, "sex", and the quantitative variable Y. The equivalence of the two is given by the formula
t² = r² / [(1 − r²)/(N − 2)],
and it is further equivalent to the ANOVA F-test in this very simplified case with just two groups, where moreover F = t² (compare also the squared correlation, the determination, as a proportion of variances). A numeric check is sketched after point (22) below.

Relation of r to distance and to MD scaling, cluster analysis

(20) 1 − r² or 1 − r, interpreted geometrically as the distance between the points representing the variables, is used as a measure of dissimilarity in some multidimensional scaling (MDS) methods, typically e.g. Guttman's SSA. On the other hand, r itself may be used in the other type of MDS, which works on the principle of similarity/proximity measures. Both conceptions of r are also used in many methods of cluster analysis.

Correlation r as a fuzzy-set measure, relation to fuzzy clustering

(21) r² or r also has a direct interpretation as a measure of belonging to a fuzzy set in Zadeh's fuzzy-set theory and, moreover, as Zadeh's "grade of membership in a fuzzy cluster" in his fuzzy-cluster analysis, which may be part of a system of neural-network algorithms. Further, it is possible to show (Blahuš 2001, still in print?) that exploratory as well as confirmatory orthogonal factor analysis (where the loadings are correlations) becomes a special case of fuzzy-cluster analysis if the loadings (correlations) are taken with respect to latent factors that have been rescaled to have the same communalities. (By the way: if the aim of fuzzy clustering should be more than just the formal identification of certain fuzzy clusters, then a possible way of interpreting the original Zadeh fuzzy clusters remains quite unclear to me - it reminds me of an unrotated factor solution.)

Some (known) details on rxy for standardized z-scores

(22) rxy is a measure of identity - more precisely, of the approach to identity of the z-scores. While for raw scores rxy = 1 does not mean identity of the X and Y scores, for the standardized z-scores it does: the pairs of values in ZX, ZY would then be identical.
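The rank check promised in (17): Spearman's ρ computed by its simple formula coincides with Pearson's r applied to the ranks. A minimal sketch in Python, assuming numpy; ties are avoided by using continuous simulated data:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = x + rng.normal(size=50)            # untied continuous data

# Ranks 1..N (double argsort gives 0-based ranks for distinct values):
rx = x.argsort().argsort() + 1
ry = y.argsort().argsort() + 1

# Classical Spearman formula: rho = 1 - 6*sum(d^2) / (N*(N^2 - 1))
N = len(x)
d = rx - ry
rho_formula = 1 - 6 * (d**2).sum() / (N * (N**2 - 1))

# Pearson's product-moment formula applied to the very same ranks:
rho_pearson = np.corrcoef(rx, ry)[0, 1]

print(np.allclose(rho_formula, rho_pearson))   # True (no ties present)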
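A check of (18), assuming the standard closed-form mutual information of a bivariate normal distribution, I = −½·log2(1 − ρ²) bits (cf. Kullback); the sketch only verifies that inverting it yields r² = 1 − 4^(−I):

import numpy as np

rho = 0.8                                  # an illustrative population correlation
# Mutual information of a bivariate normal, in bits:
I = -0.5 * np.log2(1 - rho**2)             # about 0.737 bits

# Inverting per (18): r^2 = 1 - 4^(-I)
r_squared = 1 - 4.0 ** (-I)
print(np.allclose(r_squared, rho**2))      # True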
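And the check of (19): the pooled-variance two-sample t statistic squared equals r²/[(1 − r²)/(N − 2)], where r is the point-biserial correlation between the 0/1 group code and Y. A sketch in Python, assuming numpy; the two groups are simulated:

import numpy as np

rng = np.random.default_rng(3)
y0 = rng.normal(loc=0.0, size=40)          # group coded 0
y1 = rng.normal(loc=0.7, size=60)          # group coded 1

# (a) classical two-sample t with pooled variances:
n0, n1 = len(y0), len(y1)
sp2 = ((n0 - 1) * y0.var(ddof=1) + (n1 - 1) * y1.var(ddof=1)) / (n0 + n1 - 2)
t = (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / n0 + 1 / n1))

# (b) Pearson r between the binary group code and the merged Y:
x = np.concatenate([np.zeros(n0), np.ones(n1)])
y = np.concatenate([y0, y1])
r = np.corrcoef(x, y)[0, 1]

N = n0 + n1
print(np.allclose(t**2, r**2 / ((1 - r**2) / (N - 2))))   # True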
(23) rxy is a certain kind of mean, namely the mean of the cross-products of z-scores: rxy = (1/N) Σ ZX ZY.
(24) rxy is identical with the standardized regression coefficients, rxy = βyx = βxy: ẐY = β·ZX, ẐX = β·ZY; compare with the geometric interpretation of the β-coefficients as the slopes (tangents) of their corresponding regression lines.

Correlation and the "regression toward the mean"

rxy = β in the regression equation expresses the effect of the "principle" of regression toward the mean or, better, of regression of the estimates toward the mean of the estimated variable. In the standardized regression equation ẐY = β·ZX, i.e. ẐY = r·ZX, the weaker the statistical relationship (the closer r is to 0), the more weight the estimate gives to the mean, namely 0, of the estimated dependent variable ZY. If there is no information in ZX about ZY, i.e. if r = 0, then the estimate of ZY is simply 0, its average. On the other hand, even for quite a high correlation, say r = 0.9, the estimates of the dependent variable will always be closer to the average than the values of the independent variable. Formally this effect is due to the fact that r < 1 in usual empirical data. (A numeric sketch follows below.) Historically, as Galton and Pearson studied the regression of an attribute in sons on the same attribute in their fathers, they interpreted it as a general influence of a natural-selection tendency of a species not to allow an individual to depart from the characteristics of its own species.

APPENDIX on the next page: Proof (derivation) of the "nonparametric" Spearman ρ as a special case of the classical parametric Pearson coefficient (under the assumption of equal intervals between ranks). Copied from Siegel: "Nonparametric Statistics", 1956, pp. 203-204.
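The sketch of (23), (24), and the shrinkage of standardized estimates toward the mean, in Python, assuming numpy; the data are simulated so that r comes out near 0.9:

import numpy as np

rng = np.random.default_rng(4)
N = 2000
x = rng.normal(size=N)
y = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.normal(size=N)   # r close to 0.9

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# (23) r as the mean cross-product of z-scores:
r = (zx * zy).mean()
print(np.allclose(r, np.corrcoef(x, y)[0, 1]))   # True

# (24) standardized regression: the estimate of ZY is r * ZX, so every
# estimate is pulled toward 0 (the mean) by the factor r < 1:
zy_hat = r * zx
print(zy_hat.std() / zx.std())   # equals |r| -- the estimates' spread shrinks by r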