Psychology 678, Seminar in mathematical psychology:
Structural Modeling with Multivariate Psychometrics
Possible different interpretations of rxy,
Pearson's linear bivariate correlation coefficient
(reviewed by Petr Blahuš)
Among all the interpretations and consequences of the definition of the
coefficient, there are two that are closely related to psychometric and psychological
theories. These are placed here at the beginning regardless of their belonging to the mutually
different categories listed later below. They are:
1. “Geometric N-dimensional” interpretation - it belongs among other geometric ways of
looking at rxy
2. “Common mental elements” interpretation - it belongs mostly to the category of
‘probabilistic-proportional’ interpretations
A) The two versions closest to psychological interpretation
1. “Geometric N-dimensional” interpretation
It is based on the geometric representation of the N by 2 data matrix as 2 vectors (points)
in N dimensions, where

r is the cosine of the angle between the two (deviation-score) vectors (each representing a variable)
(just aside: the length of such a variable-vector is the variable's SD times √N)
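A quick numerical check of the cosine claim (my own Python sketch, not part of the original text; the data are invented):

    # Pearson's r as the cosine of the angle between centered variable-vectors
    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 7.0, 10.0])   # invented scores of N = 5 persons on X
    y = np.array([1.0, 3.0, 2.0, 6.0, 8.0])    # ... and on Y

    xc, yc = x - x.mean(), y - y.mean()        # deviation-score vectors in N dimensions
    cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
    print(cos_angle, np.corrcoef(x, y)[0, 1])  # the two numbers coincide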
The geometry of two variable-vectors in the N-dimensional space of N persons, and
the idea of a battery of tests as a cluster of vectors that have small angles (high
correlations) between them because they lie close to one common dimension, is an important
basis for investigating the diagnostic properties of a battery of psychological tests
(questionnaires and other composite diagnostic methods). In particular it concerns
homogeneity with respect to a "construct" and the dimensionality of the test battery.
This dimensional-geometrical interpretation has already become traditional in
multivariate statistics, but historically it was fundamental to the early psychometric
and psychological theories. Pearson's 1901 paper 'On lines and planes of closest fit
to systems of points in space' de facto laid the geometrical foundations for
principal-axes factor analysis, as principal axes of the ellipsoid of multidimensional
scatter under a multivariate normal distribution. For some time this idea stayed aside when
Spearman (1904), with his ''General intelligence', objectively determined and measured', based
his idea of the G-factor on a generic true score and his formula for disattenuated correlation.
But then already Burt and Horst turned to the multidimensional geometric idea, and especially
Thurstone based his multifactor theory of abilities on the geometry of principal axes that were
rotated to reach an interpretable "simple structure" (computation of the axes was roughly
approximated by his centroid method in the space). Hotelling (1933) related the
computation explicitly to the algebraic eigenvalue solution, which he preferred to
interpret as a variance decomposition into principal 'components' (without rotation). About
this time Burt, with his 'The Factors of the Mind', seemed to base his conception of "mind" on a
kind of perhaps abstract multidimensional continuum that had to be described by
multidimensional mathematics. In the same direction of theorizing we can trace the somewhat
more recent work by a Finn*), Ahmavaara, Y.: "On the unified factor theory of mind"
(Helsinki, Tiedeakatemia 1975), who allegedly tried to put the 'theory of mind' on a
background similar to unified field theory and continuum theory in modern physics.
I can imagine something like that only very vaguely, as follows:
a "mental performance" is a point in the multidimensional continuum whose n axes are
"mental 'abilities'". The location and "trace line" (a "mental 'world-line'"?) of the point in
the space (perhaps not just 4-dimensional but an "(n+1)-dimensional 'time-space'"?) is given by the
"mental coordinates" and by their changes in time.
------*) It is not known whether introvert or extrovert.
-------
2. “Common mental elements”
Among the probabilistic interpretations farther below we find (Eq. 1), which specializes
Pearson's formula to the case of binary, zero-one variables alias "items". See there the
derivations, which yield

rxy = nxy / √[(nxy + nx)(nxy + ny)] ,

where nx is the number of 1's only in X, ny by analogy, and nxy is the number of 1's that
occurred jointly in both X and Y, i.e. the number of "common elements". This formula
simplifies to a proportion if both items have the same "difficulty"; see farther below.
So the correlation coefficient appears as an "overlap", a proportion of "common elements".
There were two historical psychological interpretations:
1. Pavlov's jointly "excited" cells in the brain cortex responsible for conditioning;
2. Thomson's theory that the theoretical concept or "construct" of an ability consists of
(latent) common mental elements that manifest themselves in, and are jointly necessary
for, solving two psychological tasks (tests). (Thomson was a psychometrician and factor
analyst around the time of WWII.) So correlation is a proportion, an overlap of the mental
elements working in the two different activities; the overlap = common ability or
"common latent factor" as a "common latent mental agens" functioning in both activities.
N.B.:
Just as Spearman did in 1904, Thomson assumed "common elements", while the rest needed for
completing the performance were "specific elements", i.e. specific to the individual
activity (test). Therefore Spearman called his intelligence theory the "theory of one general
factor" or also the "two-factor theory". Compare with the decomposition of a test score into true,
specific, and unreliable parts, and the problem of distinguishing between the specific and
unreliable parts, which both together create the unique part (in the factor model this is
sometimes called the discrepancy of the model, or the residual part of the test score, as it is
generated by the model).
B) Review of interpretations
The geometric interpretations:
a) Representation of an N by 2 data matrix as N points in 2 dimensions
(1) rxy automatically produces (implicitly) the "best fitting" line to the scattergram of
N points in the sense of least squares; actually two lines, one for estimating Y when one
knows X, the second for estimating X when one knows Y
(2)
Having the X, Y variables standardized into z-score variables ZX , ZY (so, each
having mean = 0 and SD = 1, and their two regression lines go through the origin 0 point)
it may be shown that:
(3) r determines the angle between the two regression lines: the cosine of that angle is
2r/(1 + r²), which approaches 1 as the lines approach each other and join into one line
going through the origin at a 45-degree angle
(4)
r is tangent (slope of the lines ) of the angle between the line for estimating ZY
and the x-axis, and reciprocally, it is tangent of the angle between the line for estimating
ZX and the y-axis (it approaches 1 as the lines are rotating to join under 45 degrees angle, approaches 0
as each line approaches its axis to become orthogonal with the other)
(5) for unstandardized raw data, r is the geometric mean of the two slopes /
tangents, which then are not equal (so r is the square root of their product, taken with
their common sign)
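A small simulation of point (5) (my sketch; the linear relation is invented):

    # r as the (signed) geometric mean of the two raw regression slopes
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2.5 * x + rng.normal(size=200)         # an arbitrary linear relation plus noise

    b_yx = np.polyfit(x, y, 1)[0]              # slope for estimating Y from X
    b_xy = np.polyfit(y, x, 1)[0]              # slope for estimating X from Y
    print(np.sqrt(b_yx * b_xy), np.corrcoef(x, y)[0, 1])  # equal, up to the common sign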
b) Representation of the N by 2 data matrix as 2 vectors (points) in N dimensions
(6)
r is cosine of the angle between the two vectors (each representing a variable)
(just aside: for raw data the length of a variable-vector is the variable’s SD)
Variance proportional-percentual interpretations:
(7) r², determination = proportion or % of "explained" variance of Y, i.e. r² × 100%
(8) 1 − r², non-determination = proportion of "unexplained" variance of Y (its square
root, √(1 − r²) = se/sy, is the coefficient of alienation)
(9) r itself (i.e. r × 100%, not its square) is the proportion of the "explained" part of the
original SD of Y, sy: the SD of the regression estimates equals |r| · sy, while the unexplained
part, the standard error se of estimating Y, is se = sy √(1 − r²).
(In case we have no information from X about Y, our "best" estimate of a person's score y in Y
would be the mean of Y, with a possible maximal error in the range ±3 sy. But if we know his/her
value x in X and use the regression equation, the regression estimate ŷ will have a possible
maximal error just in the range ±3 se.)
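A simulated check of the two quantities in (9) (my sketch, not the handout's):

    # SD of the regression estimates = |r| * sy;  SD of the residuals = sy * sqrt(1 - r^2)
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 0.6 * x + rng.normal(size=1000)

    r = np.corrcoef(x, y)[0, 1]
    b, a = np.polyfit(x, y, 1)                 # raw-score regression of Y on X
    y_hat = a + b * x
    print(np.std(y_hat),     abs(r) * np.std(y))              # "explained" part of sy
    print(np.std(y - y_hat), np.std(y) * np.sqrt(1 - r**2))   # standard error se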
By the way: why do people use the proportional interpretation only with r², as the
explained part of variance, and not directly with r, as the reduction of the error of estimate?
(10)
Similarly, not only r 2 but the r itself is a proportion (or %), according to
Cauchy-Schwarz inequality, namely rxy is a proportion of actual co-variation to the
maximum possible co-variation, i.e. r by definition is ratio (or %) of actual Cov(X,Y) to
the product of the two SD’s, the denominator being maximum possible value of the
covariance in numerator.
Again: why is the overall habit to use the percentual interpretation only with r²?
(And you can find people who speak of r = .50 as a "fifty percent correlation", but those are marked as
absolute ignorants.)
(11)
Also, r 2, as a ratio of two variances (“around regression line” to the whole s2y)
certainly is a part of ANOVA (which is a well known fact), and is related to ANOVA’s
F-test, namely by the formula
F = r 2 / [ (1-r 2) / (N-2) ]
(while this formula is not so much known).
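A check of this formula against the exact F(1, N − 2) distribution; a sketch assuming numpy and scipy, with simulated data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    N = 50
    x = rng.normal(size=N)
    y = 0.4 * x + rng.normal(size=N)

    r, p_r = stats.pearsonr(x, y)              # two-sided test of H0: correlation = 0
    F = r**2 / ((1 - r**2) / (N - 2))
    print(p_r, stats.f.sf(F, 1, N - 2))        # the two p-values are identical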
(12)
If the regression is really linear (and homoskedastic) then r 2 becomes a special
case of the correlation ratio  2, which in general case is a measure of overall relationship
in the given data set, and typically the difference  2 - r 2 > 0 is being in use as an
indicator of nonlinearity (and of other assumptions).
(13)
If the assumptions for linear correlation are met, then as N increases
r 2 approaches  2,
the Hays’  2 as a population measure of relative “Size of Effect”.
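A sketch of point (12), computing η² for a deliberately nonlinear relation (my own invented example):

    # eta^2 = SS_between / SS_total after discretizing X into classes
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-2, 2, size=2000)
    y = x**2 + 0.3 * rng.normal(size=2000)     # strong but nonlinear dependence

    r2 = np.corrcoef(x, y)[0, 1] ** 2
    classes = np.digitize(x, np.linspace(-2, 2, 11))
    labels = np.unique(classes)
    means = np.array([y[classes == c].mean() for c in labels])
    counts = np.array([(classes == c).sum() for c in labels])
    eta2 = (counts * (means - y.mean())**2).sum() / ((y - y.mean())**2).sum()
    print(r2, eta2)                            # eta^2 - r^2 >> 0 flags the nonlinearity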
Probability interpretation:
r as joint probability and a measure of probabilistic dependence
If X and Y are binary 0-1 variables ("correct-wrong" items), then their means and
SDs can be rewritten in a special form:
- the mean of X becomes px, the proportion of 1's in X ... the probability of a correct answer
(better: the sample estimate of that probability, the "difficulty" index of the item)
- the mean of Y becomes py, the proportion of 1's in Y
- the proportions of 0's are then qx = 1 − px, qy = 1 − py
- then the standard deviations also get a special form:
sx = √(px qx) , sy = √(py qy)
- the proportion of pairs where X and Y both have 1's is denoted
pxy, the estimate of the joint probability of answering both items correctly
(14)
then covariance of X,Y gets the special form
sxy = pxy - px py ,
which by definition is a measure of probabilistic dependence as a “departure”
from independence (how the actual joint probability pxy differs from the joint probability
as product px py of marginal probabilities under the assumption of independence.)
(15)
then correlation by definition becomes
rxy = sxy / (sx sy ) = (pxy - px py) / (pxqx  pyqy) ,
(Eq. 1)
and is “nothing more” than the “departure” from probabilistic independence
standardized so that the probability measure be scaled between 0 and 1 (to be comparable
in different sudies or so.).
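A numerical check of (Eq. 1) against the ordinary product-moment computation (my sketch; the item responses are invented):

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])   # item X: 1 = correct, 0 = wrong
    y = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])   # item Y

    px, py = x.mean(), y.mean()                    # "difficulty" indices
    pxy = (x * y).mean()                           # joint proportion of 1-1 pairs
    r_eq1 = (pxy - px*py) / np.sqrt(px*(1-px) * py*(1-py))
    print(r_eq1, np.corrcoef(x, y)[0, 1])          # Eq. 1 reproduces Pearson's r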
This concept of the correlation coefficient as a rescaled joint probability can then
be more or less metaphorically transferred and used for a probabilistic interpretation
even if the variables X, Y are measured continuously. Perhaps by a logic quite inverted
(perverted?) relative to the usual case, in which dichotomized variables are assumed to arise
by dissection of their continuous scales: say, as if the original values (categories) of a
variable were 1 and/or 0, but we had measured them on an unnecessarily detailed
quantitative scale; for example, "enough light intensity to switch a contact" and/or "not
enough", but measured as light intensity with a photometer on a high-resolution scale
in lumens (candles).
In the case above, which has led to formula (Eq. 1), we now use another
notation:
nx ... the absolute number of 1's only(!) in X
(cases where the same pair also has a 1 in Y are not included),
ny ... the absolute number of 1's only in Y (the same limitation applies),
nxy ... the absolute number of pairs with 1's in both X and Y.
(16) Using (Eq. 1) we get rxy as the above-mentioned proportion of "common elements":

rxy = nxy / √[(nxy + nx)(nxy + ny)] ,

which formula reduces to a plain proportion if both items have the same "difficulty".
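A tiny worked example of this formula with invented element counts (mine, for illustration only):

    # "common elements" overlap index; with equal difficulty it is a plain proportion
    n_xy = 30                                  # elements common to both tasks
    n_x, n_y = 10, 10                          # elements specific to X, resp. to Y

    r = n_xy / ((n_xy + n_x) * (n_xy + n_y)) ** 0.5
    print(r)                                   # 30 / 40 = 0.75, the overlap proportion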
(??)
The interesting similarity with the probabilistic interpretation of Kendall's τ
nonparametric correlation coefficient is only accidental, since his τ perhaps cannot be
derived as a consequence or special case of Pearson's parametric coefficient.
For computation, the values of the variables X, Y are ordered, and
- the number Nagr of pairs whose order numbers agree, and
- the number Ndis of pairs whose order numbers disagree, are counted; then

τ = (Nagr − Ndis) / N = Nagr/N − Ndis/N ,

where N is the total number of pairs, is a difference of two proportions, or say two probabilities.
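A sketch of the pair counting behind τ, checked against scipy's implementation (my example data):

    import numpy as np
    from itertools import combinations
    from scipy import stats

    x = np.array([1.2, 2.5, 0.7, 3.1, 2.0])
    y = np.array([0.9, 2.2, 1.1, 2.8, 1.5])

    n_agr = n_dis = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])      # positive = concordant pair
        n_agr += s > 0
        n_dis += s < 0
    n_pairs = len(x) * (len(x) - 1) // 2       # N = total number of pairs
    print((n_agr - n_dis) / n_pairs, stats.kendalltau(x, y)[0])   # both give 0.8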
(17)
Pearson’s r versus Spearman’s “rho”:
In a very unrealistic case - when differences between neighbour rank order
numbers were exact linear transformation of the quantitative values measured on interval
scale - we could meaningfully say that everything here what holds for the parametric
Pearson’s coefficient also holds for the so-called “nonparametric” Spearman’s coefficient
“rho”. I call it “pseudo-nonparametric”. It is not too much known that Spearman’s “rho”
is computationally identical with the Pearson product-moment parametric coefficient. Its
computational formula is the original Pearson’s formula, just its special case, simplified
5
due to the circumstances that ordinal numbers have their minimum 1 and maximum N.
See the derivation in Appendix. So, it is “pseudo-nonparametric” as it fundamentally
(but with what reason ?) assumes that the differences between ranks are equal, that they
create equal-interval scale !!
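A demonstration of this identity, assuming scipy (my sketch with simulated data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.normal(size=30)
    y = x**3 + rng.normal(size=30)             # monotone but nonlinear relation

    rho = stats.spearmanr(x, y)[0]
    r_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
    print(rho, r_ranks)                        # identical: rho is Pearson's r on ranks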
This assumption leads to the consequence that its value typically differs from (and is
often higher than) the Pearson coefficient computed from the original quantitative values
(it would be equal to it just under the unrealistic assumption above).
Its only nonparametric feature is that, for smaller samples, it allows computing tables
of exact probabilities for the statistical test of H0 that the parametric correlation in the
population is exactly zero.
(Are there studies on whether "rho" is an unbiased and consistent estimate, compared
with Pearson's, also for population values different from 0?)
Therefore the recommendation to use Spearman's rho for ordinal data is quite
meaningless; one can directly use the classical Pearson parametric correlation formula.
The only reason for using Spearman's "rho" is the joint occurrence of three
circumstances:
(i) you don't want to evaluate the strength of the relationship, you just want to test the
hypothesis H0 that the correlation in the population is exactly 0, and
(ii) the sample is very small (as typically recommended), say less than 30, and
(iii) you don't have access to a computer and need a pencil-and-paper computation
using the nice and simple formula ρ = 1 − 6 Σ d² / [N(N² − 1)], where d is the rank
difference within a pair.
Information theory and rxy
Close to probability theory is Shannon's theory of information, which defines the
entropy H(X) of a variable X (or the entropy of the variable's probability distribution) as an
expectation, resp. for a discrete X as a weighted sum of binary logarithms of the probabilities pi
(relative frequencies) distributed over the discrete values (or categories) xi:

H(X) = − Σ pi log2 pi , a non-negative quantity in bits,

whose maximum, for a given discretization of the range of X, occurs when the pi's are equal
(maximum uncertainty). The entropy H(Y) is given per analogiam, as is the joint
two-dimensional entropy H(X,Y) of the two-dimensional probability distribution. The
amount of information, in bits, is then I(X,Y) = H(X) + H(Y) − H(X,Y). If all the distributions
are continuous normal, then
(18)
correlation coefficient is a function of the information and vice versa:
[- I(X,Y)]
1- 4
.
(Cf. Kullback's book Information Theory and Statistics.) Blahuš (1969, 1971) used
the information measures to create a table of information measures between motor tests, and
(1971) suggested using the difference

rI(X,Y) − rxy > 0

between the classically computed rxy and that derived from information, namely
rI(X,Y), as a measure of the departure of linear regression from its assumptions and from the
overall probabilistic dependence; an analogy to the correlation-ratio difference η² − r² > 0.
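A one-line numerical check of formula (18) (my sketch):

    import numpy as np

    r = 0.8
    I = -0.5 * np.log2(1 - r**2)               # bits of information for a bivariate normal
    print(1 - 4.0**(-I), r**2)                 # the formula recovers r^2 exactly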
Correlation vs. comparing a difference of means
(19) rxy and the t-test are equivalent:
The exact function that transforms rxy into t and back supports the opinion that the
traditional distinction between statistical methods for comparing groups (their means) and
methods for evaluating a relationship is not well justified; each of them identifies a
relationship. Only, in the former case the variable that serves for classifying the persons
into two groups is qualitative. Say the variable is sex:
(a) either we divide the persons into two groups, say one marked "1" containing females, the
other marked "0" containing males, measure the quantitative variable Y in both, and compare
the difference between the two means ȳ1 − ȳ0 by computing Student's t-test,
(b) or we merge both groups, with overall sample size N, and compute rxy
between the binary X variable "sex" and the quantitative variable Y.
The equivalence of the two is given by the formula

t² = r² / [(1 − r²) / (N − 2)] ,

and is further equivalent to the ANOVA F-test in this very simplified case with just two
groups, where moreover F = t² also holds (compare also the squared correlation, or
determination, as a proportion of variances).
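A simulated check of the equivalence of (a) and (b), assuming scipy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    y1 = rng.normal(0.5, 1.0, size=40)         # "females", coded X = 1
    y0 = rng.normal(0.0, 1.0, size=40)         # "males",   coded X = 0

    t = stats.ttest_ind(y1, y0)[0]             # classical two-sample t (equal variances)
    x = np.concatenate([np.ones(40), np.zeros(40)])
    y = np.concatenate([y1, y0])
    r = np.corrcoef(x, y)[0, 1]                # point-biserial correlation
    N = len(y)
    print(t**2, r**2 / ((1 - r**2) / (N - 2))) # identical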
Relation of r to distance and to MD scaling, cluster analysis
(20) 1 − r², or 1 − |r|, interpreted geometrically as the distance between the points
representing variables, is used as a measure of dissimilarity in some multidimensional
scaling (MDS) methods, e.g. typically Guttman's SSA.
On the other hand, |r| may be used in the other type of MDS scaling that works
on the principle of similarity/proximity measures.
Both conceptions of r are also used in many methods of cluster analysis.
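A minimal sketch of (20) in Python (mine): four "tests" are simulated to share one common factor, their correlations are turned into 1 − |r| distances, and the battery is clustered hierarchically:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(6)
    g = rng.normal(size=100)                   # one common "factor"
    tests = np.array([g + rng.normal(scale=s, size=100) for s in (0.3, 0.4, 2.0, 2.5)])

    D = 1 - np.abs(np.corrcoef(tests))         # dissimilarity 1 - |r| between tests
    D = (D + D.T) / 2                          # enforce exact symmetry for squareform
    np.fill_diagonal(D, 0.0)
    print(linkage(squareform(D), method="average"))  # hierarchical clustering of the battery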
Correlation r as a fuzzy-set measure, relation to fuzzy clustering
(21) r², or |r|, also has a direct interpretation as a measure of belonging to a fuzzy set in
Zadeh's fuzzy-set theory and, moreover, as Zadeh's "grade of membership in a
fuzzy cluster" in his fuzzy-cluster analysis, which may be a part of a system of neural-network
algorithms. Further, it is possible to show (Blahuš 2001, still in print?) that
exploratory as well as confirmatory orthogonal factor analysis (where loadings are
correlations) becomes a special case of fuzzy-cluster analysis if the loadings
(correlations) are taken with respect to latent factors that have been rescaled to have the
same communalities.
(By the way: if the aim of fuzzy clustering should be more than just a formal identification of certain
fuzzy clusters, then a possible way of interpreting the original Zadeh fuzzy clusters remains quite unclear
to me; it reminds me of an unrotated factor solution.)
Some (known) details on rxy for standardized z-scores
(22) rxy is a measure of identity, namely of how closely the z-scores approach identity. While
for raw scores rxy = 1 does not mean identity of the X and Y scores, for the standardized
z-scores it does: the pairs of values in ZX, ZY would then be identical.
(23) rxy is a certain kind of mean, namely the mean of the cross-products of z-scores:

rxy = (1/N) Σ (ZX ZY)
(24) rxy is identical with the standardized regression coefficients,

rxy = βyx = βxy = β : ẐY = β ZX , ẐX = β ZY ;

compare with the geometric interpretation of the β-coefficients as the slopes (tangents) of
their corresponding regression lines.
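A quick check of (23) (my sketch; note that the z-scores must use the population SD, dividing by N):

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.normal(size=500)
    y = 0.5 * x + rng.normal(size=500)

    zx = (x - x.mean()) / x.std()              # np.std divides by N, as (23) requires
    zy = (y - y.mean()) / y.std()
    print((zx * zy).mean(), np.corrcoef(x, y)[0, 1])   # mean cross-product equals r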
Correlation and the "regression toward the mean"
rxy = β in the regression equation expresses the effect of the "principle" of regression
toward the mean, or better: regression of the estimates to the mean of the estimated variable.
In the standardized regression equation ẐY = β ZX, i.e.

ẐY = r ZX ,

the weaker the statistical relationship (the closer r is to 0), the more weight the
estimate gives to the mean, namely 0, of the estimated dependent variable ZY. If there
is no information in ZX about ZY, i.e. if r = 0, then the estimate of ZY will be just 0, its
average. On the other hand, even for quite a high correlation, say r = 0.9, the estimates of the
dependent variable will always be closer to the average than the values of the independent
variable. Formally this effect is due to the fact that r < 1 in usual empirical data.
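A minimal illustration of this shrinkage of the standardized estimates toward the mean (mine):

    import numpy as np

    r = 0.9
    zx = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # values of the independent variable
    print(r * zx)                                # every estimate lies closer to the mean 0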
- Historically, as Galton and Pearson studied the regression of an attribute in sons on the
same attribute in fathers, they interpreted it as a general influence of natural selection: a
tendency of a species not to allow an individual to depart from the characteristics of its
own species.
APPENDIX on the next page:
Derivation of the "nonparametric" Spearman rho as a special case of the
classical parametric Pearson coefficient
(under the assumption of equal intervals between ranks)
Copied from Siegel's book "Nonparametric Statistics", 1956, pp. 203-204.