Consequences of the ergodic theorems for classical test theory, factor analysis, and the analysis of developmental processes

Peter C.M. Molenaar
The Pennsylvania State University

1. Introduction

The currently dominant approach to statistical analysis in psychology and biomedicine is based on analysis of inter-individual variation. Differences between subjects, drawn from a population of subjects, provide the information for making inferences about states of affairs at the population level (e.g., mean and/or covariance structure). This approach underlies all standard statistical analysis techniques such as analysis of variance, regression analysis, path analysis, factor analysis, cluster analysis, and multilevel modeling techniques. Whether the data are obtained in cross-sectional or longitudinal designs (or more elaborate designs such as sequential designs), the statistical analysis is always focused on the structure of inter-individual variation. Parameters and statistics of interest are estimated by pooling across subjects, where these subjects are assumed to be homogeneous in all relevant respects. This is the hallmark of analysis of inter-individual variation: the sums defining the estimators in statistical analysis are taken over different subjects randomly drawn from a population of presumably homogeneous subjects. In mixture modeling the population is considered to be composed of different subpopulations, but within each subpopulation subjects again are assumed to be homogeneous. In the next section definitions will be given of inter-individual variation and homogeneity of a population of subjects, but the intuitive content of these concepts is clear. These intuitions would seem to imply that inferences about states of affairs at the population level obtained by pooling across subjects constitute general findings that apply to each subject in the homogeneous population. Yet in general this is not the case.
That is, in general it is not true that inferences about states of affairs at the population level based on analysis of inter-individual variation apply to any of the individual subjects making up the population. This negative result is a direct implication of a set of mathematical-statistical theorems, the so-called classical ergodic theorems (cf. Molenaar, 2004). A concise heuristic description of the classical ergodic theorems will be given below. The main focus of this chapter, however, will be on some of the implications of these theorems. For instance, it will be shown that classical test theory is based on assumptions that violate the classical ergodic theorems, and hence, in a precise sense to be defined later on, the results of classical test theory do not apply in individual assessments. This, of course, is a serious shortcoming of classical test theory, because many psychological tests have been constructed and standardized according to classical test theory and are applied in the assessment of individual subjects. Special emphasis will be given to the fact that developmental systems constitute prime examples of non-ergodic systems having age-dependent statistical characteristics (mean trends and sequential dependencies). Therefore the statistical analysis of developmental processes has to be based not on inter-individual variation, as is now the standard approach, but on intra-individual variation (where the latter type of variation will be defined in the next section). It will be indicated that the insistence that developmental processes should be studied at the individual level has a long history in theoretical developmental psychology. The classical ergodic theorems provide a definite vindication of this theoretical line of thought.
At the close of this chapter a new statistical modeling technique will be presented with which it is possible to analyze developmental processes with age-dependent statistical characteristics at the required intra-individual level. This modeling technique is based on advanced engineering methods for the analysis of complex dynamic systems. It will be shown that the new modeling technique allows for the optimal guidance of ongoing developmental processes at the intra-individual level. Evidently, this opens up entirely new possibilities for applied developmental psychological science.

2. Preliminaries

In this section definitions will be given of the main concepts used in this chapter. The given definition of (non-)ergodicity is heuristic; selected references will be given to the vast literature on ergodic theory for more formal elaborations.

2.1 Unit of analysis.

Each actually existing human being can be conceived of as a high-dimensional integrated system whose behavior evolves as a function of place and time. In psychology one usually does not consider place, leaving time as the dimension of main interest. The system includes important functional subsystems such as the perceptual, emotional, cognitive and physiological systems, as well as their dynamic interrelationships. The complete set of measurable time-dependent variables characterizing the system’s behavior can be represented as the coordinates of a high-dimensional space (cf. Nayfeh & Balachandran, 1993, Ch. 1), which will be called the behavior space. According to De Groot (1954), the behavior space contains all the scientifically relevant information about a person. The realized values of all measurable variables for a particular individual at consecutive time points constitute a trajectory (life history) in behavior space. This trajectory in behavior space is our basic unit of analysis.
Accordingly, the complete set of life histories of a population of human subjects can be represented as an ensemble of trajectories in the same behavior space.

2.2 Inter- and intra-individual variation.

A standard dictionary definition of variation is: “The degree to which something differs, for example, from a former state or value, from others of the same type, or from a standard”. The degree to which something differs implies a comparison, either between different replicates of the same type of entity (inter-individual variation) or else between consecutive temporal states of the same individual entity (intra-individual variation). Based on this dictionary definition and using the construct of an ensemble of life trajectories defined in the previous section, it is possible to give appropriate definitions of inter- and intra-individual variation. The following definitions are inspired by Cattell’s (1952) notion of the Data Box. With respect to an ensemble of trajectories in behavior space, inter-individual variation is defined as follows: (i) select a fixed subset of variables; (ii) select one or more fixed time points as measurement occasions; (iii) determine the variation of the scores on the selected variables at the selected time points by pooling across subjects. Analysis of inter-individual variation thus defined is called R-technique by Cattell (1952). In contrast, intra-individual variation is defined as follows: (i) select a fixed subset of variables; (ii) select a fixed subject; (iii) determine the variation of the scores of the single subject on the selected variables by pooling across time points. Analysis of intra-individual variation thus defined is called P-technique by Cattell (1952).

2.3 Ergodicity.

We now can present a heuristic definition of ergodicity in terms of the concepts defined in the previous sections.
Ergodicity addresses the following foundational question: Given the same set of selected variables (of Cattell’s Data Box), under which conditions will an analysis of inter-individual variation yield the same results as an analysis of intra-individual variation? To illustrate this question: under which conditions will factor analysis of inter-individual covariation yield a factor solution that is equal to factor analysis of intra-individual covariation? The latter illustration can be rephrased in terms of Cattell’s Data Box in the following way: Under which conditions will R-technique factor analysis of inter-individual covariation yield a solution that equals the analogous P-technique factor solution of intra-individual covariation? The general answer to this question is provided by the classical ergodic theorems (cf. Molenaar, 2004; Molenaar, 2003, chapter 3). The answer is: Only if the ensemble of time-dependent trajectories in behavior space obeys two rigorous conditions will an analysis of inter-individual variation yield the same results as an analysis of intra-individual variation. The two conditions concerned are the following. Firstly, the trajectory of each subject in the ensemble has to obey exactly the same dynamical laws (homogeneity of the ensemble). Secondly, each trajectory should have constant statistical characteristics in time (stationarity, i.e., constant mean level and serial dependencies). If either one (or both) of these two conditions is not met, the psychological process concerned is non-ergodic, i.e., its structure of inter-individual variation will differ from its structure of intra-individual variation. For a non-ergodic process, the results obtained in standard analysis of inter-individual variation do not apply at the individual level of intra-individual variation. The meaning of the homogeneity and stationarity assumptions will be elaborated more fully in later sections, starting with the section on classical test theory below.
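The contrast between pooling across subjects and pooling across time points can be illustrated with a small simulation. The following is a minimal sketch, not part of the original argument: it assumes a single variable, independent Gaussian scores (so every trajectory is stationary by construction), and arbitrary illustrative parameter values; all function names are hypothetical. It compares the variance across persons at one occasion (R-technique style) with the variance across occasions for one person (P-technique style), first in a homogeneous ensemble and then in an ensemble where persons differ in mean level.

```python
import random
import statistics


def simulate_ensemble(person_means, person_sds, n_times, seed):
    """Return scores[i][t]: independent Gaussian scores of person i at occasion t."""
    rng = random.Random(seed)
    return [
        [rng.gauss(m, s) for _ in range(n_times)]
        for m, s in zip(person_means, person_sds)
    ]


def inter_individual_variance(scores, t):
    """Pool across persons at one fixed occasion t (R-technique style)."""
    return statistics.pvariance([series[t] for series in scores])


def intra_individual_variance(scores, i):
    """Pool across occasions for one fixed person i (P-technique style)."""
    return statistics.pvariance(scores[i])


n_persons, n_times = 600, 600

# Homogeneous ensemble: every person follows the same law (mean 100, sd 5).
homogeneous = simulate_ensemble([100.0] * n_persons, [5.0] * n_persons, n_times, seed=1)

# Heterogeneous ensemble: person-specific mean levels (illustrative assumption).
rng = random.Random(2)
means = [rng.gauss(100.0, 10.0) for _ in range(n_persons)]
heterogeneous = simulate_ensemble(means, [5.0] * n_persons, n_times, seed=3)

inter_hom = inter_individual_variance(homogeneous, t=0)
intra_hom = intra_individual_variance(homogeneous, i=0)
inter_het = inter_individual_variance(heterogeneous, t=0)
intra_het = intra_individual_variance(heterogeneous, i=0)

print(f"homogeneous:   inter = {inter_hom:.1f}, intra = {intra_hom:.1f}")
print(f"heterogeneous: inter = {inter_het:.1f}, intra = {intra_het:.1f}")
```

In the homogeneous case the two variances agree up to sampling error; in the heterogeneous case the inter-individual variance also contains the spread of the person-specific means and therefore no longer describes any single person's intra-individual variation.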
The requirement that each subject in the ensemble should obey the same dynamical laws is expressed in the language of ergodic theory, which has its roots in the theoretical foundations of statistical mechanics. Statistical mechanics arose as the attempt by Boltzmann to explain the equilibrium characteristics of a homogeneous gas kept under constant pressure and temperature in a container, where the atoms of the homogeneous gas each obey Newton’s laws of motion. Nowadays ergodic theory is an independent mathematical discipline; standard introductions are Petersen (1983) and Walters (1982). An excellent recent monograph is Choe (2005). The theorem which is the most important one in the set of classical ergodic theorems for the ensuing discussion was proven by Birkhoff (1931).

3. The non-ergodicity of classical test theory.

Many of the psychological tests currently in use have been constructed according to the principles of classical test theory. The basic concept in classical test theory is the concept of true score: each observed score is conceived of as a linear combination of a true score and an error score. In their authoritative book on classical test theory, Lord & Novick (1968) define the concept of true score as follows. They consider a fixed person P, i.e., P is not randomly drawn from some population but is the given person for which the true score is to be defined. The true score of P is defined as the expected value of the propensity distribution of P’s observed scores. The propensity distribution is characterized as a “... distribution function defined over repeated statistically independent measurements on the same person” (Lord & Novick, 1968, p. 30). The concept of error score then follows straightforwardly: the error score is the difference between the observed score and the true score. Several aspects of this definition of true score are noteworthy. The definition is based on the intra-individual variation characterizing a fixed person P.
Repeated administration of the same test to P yields a time series of scores of P, the mean level of which is defined to be P’s true score. Hence this definition of true score does not involve any comparison with other persons and therefore is not at all dependent on inter-individual variation. The single-subject repeated measures design used to obtain P’s time series of observed scores is akin to standard psychophysical measurement designs (e.g., Gescheider, 1997). Lord & Novick (1968) require that the repeated measurements are independent. This implies that the time series of P’s scores should lack any sequential dependencies (autocorrelation). At the close of this section we will further discuss the requirement that repeated measurements have to be independent. Lord & Novick (1968, p. 30) do not further elaborate their original definition of true score in the context of intra-individual variation because: “… it is not possible in psychology to obtain more than a few independent observations”. Instead of considering an arbitrarily large number of replicated measurements of a single fixed person P, Lord & Novick (1968, p. 32) shift attention to an alternative scheme in which an arbitrarily large number of persons is measured at a single fixed time: “Primarily, test theory treats individual differences or, equivalently, the distribution of measurements over people”. Apparently it is expected that using an individual differences approach, valid information can be obtained about the distinct propensity distributions underlying individual true scores. We will see shortly that this expectation is unwarranted.
Before focusing in the remainder of their book solely on the latter definition of true score based on inter-individual variation, Lord & Novick (1968, p. 32) make the following interesting comment about their initial definition of true score based on intra-individual variation: “The true and error scores defined above [based on intra-individual variation; PM] are not those primarily considered in test theory … They are, however, those that would be of interest to a theory that deals with individuals, rather than with groups (counseling rather than selection)”. This is a remarkable, though somewhat oblique statement. What is clear is that Lord & Novick consider a test theory based on their initial concept of true score, defined as the mean of the intra-individual variation characterizing a fixed person P, to be “… of interest to a theory that deals with individuals …”. That is, they consider such a test theory based on intra-individual variation to be important in the context of individual assessment. But what is not clear is whether they also consider the alternative concept of true score based on inter-individual variation (individual differences) to be not of interest to a theory that deals with individuals. That is, do they imply that classical test theory as we know it is only appropriate for the assessment of groups and not for individuals? It will be shown that classical test theory indeed is inappropriate for individual assessment. To summarize the discussion thus far: Lord & Novick (1968) define the concept of true score as the expected value of the propensity distribution of the observed scores of a given individual person P. This definition of true score based on intra-individual variation then is used in an inter-individual context focused on individual differences, i.e., classical test theory as we know it.
This raises the all-important question whether the information provided by individual differences (inter-individual variation) is able to determine the individual propensity distributions to a degree which is sufficient to apply the concept of true score based on intra-individual variation. Note that this is exactly the question concerning the ergodicity of the psychological process concerned: for a given test, will an analysis of inter-individual variation of test scores yield the same results as an analysis of intra-individual variation of test scores? To answer this question it has to be established whether the psychological process presumed by classical test theory to underlie the generation of test scores obeys the two criteria for ergodicity. The psychological process which according to classical test theory underlies the generation of test scores is very simple. It is implicit in the definition of true score given by Lord & Novick (1968). Each individual person P is assumed to generate a time series of independent scores in response to repeated administration of the same test. Each observed score of P’s time series constitutes an independent random sample drawn from P’s propensity distribution. Hence there exists a one-to-one relationship between the time series of P’s observed test scores and P’s propensity distribution. The psychological process underlying P’s time series of observed scores therefore is characterized, according to classical test theory, by P’s propensity distribution. Statistical analysis of P’s intra-individual variation boils down to statistical analysis based on P’s propensity distribution. Classical test theory only considers the first two moments of P’s propensity distribution (its mean and its variance). According to classical test theory the propensity distributions of different persons have different means and different variances.
The true score of person $P_1$ (i.e., the mean of the propensity distribution of $P_1$) will in general differ from the true score of person $P_2$. Also the variance of $P_1$’s observed scores will in general differ from the variance of $P_2$’s observed scores. Hence, given the one-to-one correspondence between individual time series and individual propensity distributions noted above, the ensemble involving persons $P_i$, $i = 1, 2, \ldots$, is populated by time series (propensity distributions) which have different mean levels (means of the propensity distributions) and different variances. Clearly such an ensemble is entirely heterogeneous: the psychological process according to which $P_i$’s time series of observed scores is generated is different from the psychological process according to which $P_k$’s time series of observed scores is generated because, for $i \neq k$, the underlying propensity distribution of $P_i$ has mean and variance different from $P_k$’s propensity distribution. Consequently the ensemble of time series (propensity distributions) violates at least one of the two criteria for ergodicity: the trajectories (time series) in the ensemble do not obey the homogeneity criterion for ergodicity, i.e., trajectories associated with different persons do not obey exactly the same dynamical laws. Stated more specifically, the random motion characterizing time series of observed scores in the ensemble has different mean level and variance for different persons. Consequently, the psychological process which according to classical test theory underlies the generation of test scores is non-ergodic. That is, it follows from the classical ergodic theorems that results obtained in an analysis of inter-individual variation (individual differences) of test scores based on classical test theory do not apply at the individual level of intra-individual variation. In short, the results obtained with classical test theory do not apply in the context of individual assessment.

3.1 Some formal elaborations.
We will now present some simple formal elaborations showing the invalidity of classical test theory for individual assessment. In particular we will focus on the concept of reliability as defined in classical test theory, show how estimation of an individual’s true score in classical test theory depends upon the reliability of the test, and indicate why this leads to invalid inferences. In what follows expressions related to classical test theory are based on Lord & Novick (1968).

Consider first the situation with respect to the definition of true score based on intra-individual variation. A particular test has been selected (it will be understood in the rest of this section that the same test is being considered). Also a particular person P is given. Let $y(P,t)$, $t = 1, 2, \ldots$, denote the time series of P’s scores obtained by repeatedly administering the test. The number of repeated measurements is left undefined: it is understood that this number can be taken to be arbitrarily large. Then the true score of P, $\tau(P)$, is defined as the expected value (mean) of $y(P,t)$ across all repeated measurements $t$. Notice that $\tau(P)$ is a constant. The variance of $y(P,t)$ across all repeated measurements is denoted by $\sigma^2(P)$. The variance $\sigma^2(P)$ is a measure of the reliability of a single score $y(P,t=T)$ which is obtained at the $T$-th repeated measurement ($T$ arbitrary), conceived of as an indicator of P’s true score $\tau(P)$. If $\sigma^2(P)$ is large, $y(P,t=T)$ can be very different from $\tau(P)$, whereas if $\sigma^2(P)$ is small its value will be close to $\tau(P)$.

To reiterate, in classical test theory one does not consider an arbitrarily large number of repeated measurements of a single person P, but instead one considers an arbitrarily large number of persons measured at a single time $T$. This is the shift from an intra-individual variation perspective underlying the concept of true score to an inter-individual variation perspective underlying classical test theory as we know it.
Accordingly we consider an ensemble of time series of test scores associated with different persons $P_i$, $i = 1, 2, \ldots$, where the number of persons can be taken arbitrarily large. Associated with each distinct person $P_i$ is a distinct propensity distribution which has, as explained above, a one-to-one relationship with the psychological process according to which $P_i$ generates his/her time series of observed test scores. The mean (true score) of the propensity distribution of $P_i$ is $\tau(P_i)$ and the observed score of $P_i$ is $y(P_i, t=T)$, where $T$ is arbitrary but fixed. To ease the presentation we will denote $\tau(P_i)$ as $\tau_i$ and $y(P_i, t=T)$ as $y_i$. The error score associated with $y(P_i, t=T) = y_i$ is $\varepsilon(P_i, t=T)$ and will be denoted as $\varepsilon_i$. We now are ready to express the basic relationships of classical test theory:

(1a) $y_i = \tau_i + \varepsilon_i, \quad i = 1, 2, \ldots$

(1b) $\mathrm{var}[y_i] = \mathrm{var}[\tau_i] + \mathrm{var}[\varepsilon_i]$.

According to (1a) the observed score $y_i$ of a randomly selected person $P_i$ is a linear combination of the true score $\tau_i$ and the error score $\varepsilon_i$ of $P_i$. According to (1b) the variance of observed scores across persons consists of a linear combination of the variance of the true scores across persons and the variance of the error scores across persons. The reliability of the test then is defined as:

(1c) $\rho = \mathrm{var}[\tau_i] / \{\mathrm{var}[\tau_i] + \mathrm{var}[\varepsilon_i]\}$.

Hence the reliability is the proportion of true score variance across persons in the total variance of observed scores across persons.

Now suppose that the reliability $\rho$ of our test is given, as is the observed score $y_i$ of person $P_i$. Then the following so-called Kelley estimator of the true score $\tau_i$ of $P_i$ can be defined (cf. Lord & Novick, 1968, p. 65, formula 3.7.2a):

(2a) $\mathrm{est}[\tau_i \mid y_i] = \rho\, y_i + (1 - \rho)\, \mu$

where $\mu$ is the mean of observed scores across persons. The error variance associated with the Kelley estimator (2a) is (Lord & Novick, 1968, p. 68, formula 3.8.4a):

(2b) $\mathrm{var}\{\mathrm{est}[\tau_i \mid y_i]\} = \mathrm{var}[y_i]\,(1 - \rho)\,\rho$.
Expressions (2a) and (2b) show that the estimate and associated standard error of a person’s true score in classical test theory are a direct function of the test reliability $\rho$. The reliability $\rho$ itself is according to (1c) a direct function of the variance of error scores $\mathrm{var}[\varepsilon_i]$ across persons. Hence the Kelley estimate (2a) of a person’s true score is a direct function of the error variance $\mathrm{var}[\varepsilon_i]$ across persons. We have reached the conclusion that in classical test theory based on analysis of inter-individual variation (individual differences), the estimate of a person’s true score as well as the standard error of this estimated true score depend directly upon the reliability $\rho$ of the test. In contrast, it was indicated at the beginning of this section that the variance $\sigma^2(P)$ of the propensity distribution describing P’s intra-individual variation is a measure of the reliability of a single score $y(P,t=T)$ estimating P’s true score $\tau(P)$. Hence we have two different concepts of reliability: an intra-individual definition in which the reliability is given by $\sigma^2(P)$ and an inter-individual definition in which the reliability is a direct function of $\mathrm{var}[\varepsilon_i]$. Given that the definition of true score as the mean of a person P’s propensity distribution is the starting point of both concepts of reliability, the definition of reliability in terms of the intra-individual variance $\sigma^2(P)$ is basic. The question then arises whether the classical test theoretical definition of reliability in terms of the inter-individual error variance $\mathrm{var}[\varepsilon_i]$ is a good approximation of $\sigma^2(P)$. The answer to this question is given by the following expression (Lord & Novick, 1968, p. 35, formula 2.6.4):

(3) $\mathrm{var}[\varepsilon_i] = E_i[\sigma^2(P_i)]$

where $E_i$ denotes the expectation taken over persons $P_i$, $i = 1, 2, \ldots$. Expression (3) states that the inter-individual error variance $\mathrm{var}[\varepsilon_i]$ is the mean of the intra-individual variances of individual propensity distributions across persons $P_i$, $i = 1, 2, \ldots$.
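Expression (3) can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything taken from Lord & Novick: scores are simulated as independent Gaussian draws, the person-specific true scores and standard deviations are drawn from arbitrary distributions, and all variable names are hypothetical. It computes the error variance across persons at one fixed occasion and compares it with the average of the person-specific variances, and it also reports how widely the individual variances are spread around that average.

```python
import random
import statistics

rng = random.Random(42)
n_persons, n_times = 2000, 200

# Hypothetical person-specific parameters: true scores tau_i and
# intra-individual standard deviations sigma_i that differ widely.
tau = [rng.gauss(100.0, 5.0) for _ in range(n_persons)]
sigma = [rng.uniform(0.5, 10.0) for _ in range(n_persons)]

# Time series y(P_i, t) of independent scores for each person.
y = [
    [rng.gauss(tau[i], sigma[i]) for _ in range(n_times)]
    for i in range(n_persons)
]

# Inter-individual side: error scores at one fixed occasion T, pooled over persons.
T = 0
errors_at_T = [y[i][T] - tau[i] for i in range(n_persons)]
inter_error_var = statistics.pvariance(errors_at_T)      # var[epsilon_i]

# Intra-individual side: variance of each person's own time series.
intra_vars = [statistics.pvariance(series) for series in y]  # sigma^2(P_i)
mean_intra_var = statistics.fmean(intra_vars)                 # E_i[sigma^2(P_i)]

print(f"var[epsilon_i] across persons : {inter_error_var:.1f}")
print(f"mean of sigma^2(P_i)          : {mean_intra_var:.1f}")
print(f"smallest / largest sigma^2(P_i): {min(intra_vars):.2f} / {max(intra_vars):.1f}")
```

Under these assumptions the two pooled quantities agree up to sampling error, while the individual variances range from far below to far above their common average.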
So, coming to our final verdict, how good an approximation is (3) for each of the individual variances $\sigma^2(P_i)$, $i = 1, 2, \ldots$? Given that the number of persons in the ensemble is taken to be arbitrarily large, and given that the $\sigma^2(P_i)$, $i = 1, 2, \ldots$, can differ arbitrarily according to classical test theory, it is immediately clear that in general (3) bears no relationship to any of the variances of the individual propensity distributions. Hence (3) is a poor approximation to the variances $\sigma^2(P_i)$ of the individual propensity distributions. Suppose that (3) is small, which implies that the (inter-individual) reliability $\rho$ is high. This leaves entirely open the possibility that the variance $\sigma^2(P)$ of a given person P’s propensity distribution is arbitrarily large (the psychological process generating test scores is heterogeneous, hence non-ergodic). Estimation of P’s true score by means of the Kelley estimator (2a) then will yield a severely biased result. Also the standard error (2b) of this estimate will be severely biased, suggesting an illusory high precision of the Kelley estimate. Only the actual value of $\sigma^2(P)$ will provide the correct precision of taking P’s observed score as an estimate of P’s true score. The true value of $\sigma^2(P)$ only can be estimated in an analysis of P’s intra-individual variation. That is, the test should be repeatedly administered to P, yielding a time series of P’s observed scores. The mean of P’s time series of observed scores constitutes an unbiased estimate of P’s true score, and the standard deviation of P’s time series of observed scores provides an unbiased estimate of the precision of P’s estimated true score.

3.2 Fundamental reasons or contingent circumstances

This section presents a critical discussion of the reasons why Lord & Novick (1968), after having defined the concept of true score in terms of intra-individual variation, do not further pursue an intra-individual foundation for test theory and turn instead to an inter-individual perspective.
It will be argued that their reasons for doing so are not fundamental, but pertain to contingent circumstances that can be dealt with by means of appropriate statistical-methodological techniques. The key remark leading up to the rejection of the possibility of a test theory based on intra-individual variation is the following: Characterizing the propensity distribution associated with the time series of a given person P’s observed test scores, Lord & Novick (1968, p. 30) require that the “... distribution function [is] defined over repeated statistically independent measurements on the same person”. The important qualification is that the repeated measurements should be statistically independent. This implies the requirement that P’s time series of observed test scores should lack sequential dependencies (e.g., autocorrelation). After having postulated the requirement of obtaining statistically independent observed scores, Lord & Novick (1968, p. 30) conclude: “… it is not possible in psychology to obtain more than a few independent observations”. This is the reason why they do not consider the possibility of a test theory based on intra-individual variation to be feasible. In general test scores obtained in a single-subject time series design will be sequentially dependent, i.e., have significant autocorrelation. Moreover, the statistical properties of the psychological process according to which test scores are generated may change in time. For instance, the process concerned may be vulnerable to learning and habituation influences which induce time-dependent changes in the way test scores are being generated. Before scrutinizing the details of Lord & Novick’s (1968) requirement that repeated measurements of the same person P should be statistically independent, we first consider their reason not to pursue a test theory based on intra-individual variation.
Because the basic concept underlying classical test theory, the concept of true score, is defined at the level of intra-individual variation, one would expect that the reason to leave that level and move to a different level of inter-individual variation would have to be a fundamental reason. One would expect to be given an argument involving issues of logical necessity or impossibility. Yet the actual argument given by Lord & Novick (1968) concerns an issue of a more contingent character: repeated measurement of the same person P yields test scores that are in general not statistically independent. Indeed, all psychometricians will agree. But the statistical analysis techniques used to determine P’s propensity distribution can accommodate the presence of sequential dependencies, and then we still can have a test theory which is directly based on the concept of true score as defined by Lord & Novick (1968), that is, a test theory based on intra-individual variation which would be of interest for individual assessment and counseling. The reason which Lord & Novick (1968) give for not further pursuing such a test theory is not fundamental and does not prove the impossibility of such a theory. We now turn to a discussion of the requirement that repeatedly measuring the same person P should yield a time series of statistically independent scores. To reiterate, no psychometrician will expect this to occur: repeated measurement of the same person generally will yield a time series of sequentially dependent scores. But is this problematic? The time series of scores provides the information to determine the propensity distribution characterizing person P. In particular, the mean and variance of P’s propensity distribution have to be determined. This is a standard problem in the statistical analysis of time series that has been completely solved in case the time series is stationary (cf. Anderson, 1971).
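To make the estimation problem concrete, the following minimal sketch simulates a sequentially dependent but stationary series of scores and estimates its mean and variance. It assumes a first-order autoregressive, AR(1), process with arbitrary illustrative parameters; this model choice and all function names are assumptions made for illustration and are not taken from Anderson (1971). The sketch also contrasts the naive standard error of the mean, which ignores autocorrelation, with the usual large-sample AR(1) adjustment factor $\sqrt{(1 + r_1)/(1 - r_1)}$, where $r_1$ is the lag-1 autocorrelation.

```python
import math
import random
import statistics


def simulate_ar1(n, mean, phi, innovation_sd, seed=0):
    """Simulate a stationary AR(1) series around a constant mean level."""
    rng = random.Random(seed)
    # Start in the stationary distribution so there is no initial transient.
    stationary_sd = innovation_sd / math.sqrt(1.0 - phi ** 2)
    deviation = rng.gauss(0.0, stationary_sd)
    series = []
    for _ in range(n):
        series.append(mean + deviation)
        deviation = phi * deviation + rng.gauss(0.0, innovation_sd)
    return series


def lag1_autocorrelation(series):
    """Sample autocorrelation between consecutive scores."""
    m = statistics.fmean(series)
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    den = sum((a - m) ** 2 for a in series)
    return num / den


# Illustrative values: mean 100, autocorrelation 0.6, stationary variance 25.
series = simulate_ar1(n=20000, mean=100.0, phi=0.6, innovation_sd=4.0, seed=7)

mean_hat = statistics.fmean(series)
var_hat = statistics.pvariance(series)
r1 = lag1_autocorrelation(series)

n = len(series)
naive_se = math.sqrt(var_hat / n)                          # assumes independence
adjusted_se = naive_se * math.sqrt((1.0 + r1) / (1.0 - r1))  # AR(1) large-sample form

print(f"mean = {mean_hat:.2f}, variance = {var_hat:.2f}, lag-1 r = {r1:.2f}")
print(f"naive SE = {naive_se:.4f}, autocorrelation-adjusted SE = {adjusted_se:.4f}")
```

In this sketch the sample mean and variance recover the constant mean level and variance of the series despite the sequential dependence; what changes is the precision attached to those estimates, which the adjusted standard error takes into account.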
Hence the important requirement is not that P’s time series should consist of statistically independent scores, but that the time series is stationary. Stationarity of a time series implies that the series has constant mean level and that its autocorrelation only depends upon the relative distance (lag) between measurement occasions. The alternative requirement that a time series has to be stationary can be tested for in several ways (cf. Priestley, 1988). In case such tests indicate that the series is non-stationary, it can be analyzed by means of special techniques such as evolutionary spectrum analysis (Priestley, 1988) or wavelet analysis (e.g., Hogan & Lakey, 2005; Houtveen & Molenaar, 2001). At the close of this chapter a new modeling technique for multivariate non-stationary time series will be presented. Hence from a statistical analytic point of view non-stationary time series can be handled satisfactorily. Yet from the point of view of a test theory based on intra-individual variation, a person P’s time series of test scores should be stationary in order to allow estimation of the constant mean and constant variance of P’s time-invariant propensity distribution. In case P’s time series of test scores is non-stationary, the mean and/or variance of the series will in general be time-varying. Lord & Novick’s (1968) definition of true score, however, does not pertain to time-varying propensity distributions with time-varying means and/or variances. Hence either methodological or statistical techniques have to be invoked in order to guarantee that P’s time series of test scores is stationary. Only then can the (constant) mean and variance of P’s time series be used as estimates of the mean and variance of P’s propensity distribution. Methodological techniques can be used to guarantee that non-stationarity due to learning and habituation is avoided.
For instance, using a common approach in reaction time research, registration of P’s time series of test scores should begin only after P has reached a steady state following an initial transient due to novelty effects. This will require the availability of a pool of many parallel test items in order to avoid learning effects. Statistical techniques can be used a posteriori to remove transient effects due to habituation and learning from P’s time series of test scores (e.g., Molenaar & Roelofs, 1987). Almost certainly new methodological and statistical techniques will have to be developed in order to accommodate the intricacies due to non-stationarity and fully exploit the possibilities of a test theory based on intra-individual variation. Until now these possibilities have not been pursued systematically, for the wrong reasons as has been argued in this section. Given that the psychological process underlying the generation of test scores is non-ergodic according to classical test theory based on analysis of inter-individual variation, psychometricians will have to seriously reconsider their reasons for not pursuing a test theory based on intra-individual variation.

One promising psychological paradigm which allows for straightforward determination of person-specific propensity distributions is mental chronometry. In his excellent monograph on mental chronometry, Jensen (2006, p.96) states: “The main reasons for the usefulness of chronometry are not only the advantages of its absolute scale properties, but also its sensitivity and precision for measuring small changes in cognitive functioning, the unlimited repeatability of measurements under identical procedures, the adaptability of chronometric techniques for measuring a variety of cognitive processes, and the possibility of obtaining the same measurements with consistently identical tasks and procedures over an extremely wide age range” (italics added).
The possibility to obtain unlimited repeated measurements under identical procedures will allow for the determination of person-specific reaction time propensity distributions with arbitrary precision. Jensen presents impressive empirical results showing the importance of not only the intra-individual means of person-specific reaction time distributions, but also their intra-individual variances in assessing cognitive status and development (e.g., in the context of the so-called neural noise hypothesis; Jensen, 2006, p.122 ff.). Consequently, I conjecture that mental chronometry provides a very interesting approach to pursue a test theory based on intra-individual variation.

3.3 Additional thoughts

The impact of the fact that the ensemble of time series underlying classical test theory is non-ergodic is enormous. Psychological tests are applied for individual assessment in all kinds of settings. Using the population average expressed by formula (3) as estimate of the intra-individual variance $\sigma^2(P)$ of a given person P can lead to entirely erroneous conclusions. To give an arbitrary example: suppose that the norm of a test is $\mu = 100$, that the inter-individual reliability of the test is $\rho = 0.9$, and that the between-subjects variance of test scores is $\mathrm{var}[y_i] = 25$. Suppose also that a true score which is larger than $y_C = 120$ is considered reason for special treatment (clinical, educational, or otherwise). Finally, suppose that person P has observed score $y_P = y(P, t = T) = 126$. Then the Kelley estimate (2a) of P’s true score $\tau_P$ is:

$\mathrm{est}[\tau_P \mid y_P] = 0.9 \times 126 + (1 - 0.9) \times 100 = 123.4$.

According to (2b) the error variance of this estimated true score is:

$\mathrm{var}\{\mathrm{est}[\tau_P \mid y_P]\} = 25 \times (1 - 0.9) \times 0.9 = 2.25$.

Hence the standard error is 1.5 and a commonly used confidence interval about the estimated true score is $123.4 \pm 2 \times 1.5$, yielding $120.4 < \mathrm{est}[\tau_P \mid y_P] < 126.4$. This confidence interval is entirely located above the criterion score $y_C = 120$, hence it is concluded that P needs special treatment.
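The arithmetic of this example can be retraced with a few lines of Python. The sketch below only restates the computations attributed above to (2a) and (2b); the function and variable names are hypothetical and introduced for illustration.

```python
import math

def kelley_estimate(observed, reliability, norm_mean):
    """Regress the observed score toward the norm, as in the estimate (2a) above."""
    return reliability * observed + (1.0 - reliability) * norm_mean

def estimation_error_variance(between_var, reliability):
    """Error variance of the estimated true score, as in (2b) above."""
    return between_var * (1.0 - reliability) * reliability

norm_mean = 100.0     # norm of the test
reliability = 0.9     # inter-individual reliability
between_var = 25.0    # between-subjects variance of test scores
criterion = 120.0     # criterion score for special treatment
observed = 126.0      # observed score of person P

estimate = kelley_estimate(observed, reliability, norm_mean)
error_var = estimation_error_variance(between_var, reliability)
std_error = math.sqrt(error_var)
lower, upper = estimate - 2.0 * std_error, estimate + 2.0 * std_error

print(f"estimate = {estimate:.1f}, error variance = {error_var:.2f}")
print(f"interval = ({lower:.1f}, {upper:.1f}), above criterion: {lower > criterion}")
```

Every quantity entering this interval is a population-level value; nothing in the computation uses information about P beyond the single observed score.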
Suppose, however, that the intra-individual variance of P’s propensity distribution is $\sigma^2(P) = 36$. Then the difference between P’s observed score, $y_P = 126$, and the criterion score for special treatment, $y_C = 120$, is only 1 standard deviation ($\sqrt{36} = 6$), which according to standard statistical criteria would not indicate that P needs special treatment.

Numerical exercises such as the one given above can be carried out in a variety of formats, using Monte Carlo simulation techniques and alternative settings. We intend to report the results of one such simulation study in a separate publication. But the overall message should be clear: using the (inter-individual) population value of the error variance (based on the inter-individual reliability) as approximation for the intra-individual variance of a person P’s propensity distribution is liable to lead to erroneous conclusions about P’s true score, and, consequently, to erroneous decisions about the necessity to apply special treatment to P. The fundamental reason for the invalidity of (3) as approximation for $\sigma^2(P)$ is that the ensemble of time series of observed scores is non-ergodic.

4. Hidden heterogeneity

In the previous section we discussed heterogeneity with respect to the means and variances of the propensity distributions underlying classical test theory. That kind of heterogeneity can be considered to be a special instance of a much wider class of heterogeneous phenomena, including also qualitative heterogeneity. An important example of qualitative heterogeneity concerns individual differences in the loadings in a factor model.
The standard factor model of inter-individual covariation is (using bold face lower case letters for vectors and bold face upper case letters for matrices):

(4) $\mathbf{y}_i = \boldsymbol{\Lambda}\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i, \quad i = 1, 2, \ldots$

where: $\mathbf{y}_i = [y_{1i}, y_{2i}, \ldots, y_{pi}]'$ is the p-variate vector of observed scores of a randomly drawn subject i (the apostrophe denotes transposition); $\boldsymbol{\eta}_i = [\eta_{1i}, \eta_{2i}, \ldots, \eta_{qi}]'$ is the q-variate vector of factor scores of subject i; $\boldsymbol{\varepsilon}_i = [\varepsilon_{1i}, \varepsilon_{2i}, \ldots, \varepsilon_{pi}]'$ is the p-variate vector of measurement errors for subject i, and $\boldsymbol{\Lambda}$ is the (p,q)-dimensional matrix of factor loadings. The factor model of inter-individual covariation not only underlies classical test theory, but is of central importance in much of psychology.

The factor model can be heuristically characterized as follows. In the context of the behavior space introduced in section 2.1, choose a fixed time and select a set of p variables $\mathbf{y}$ which are considered to be indicators of a q-variate latent factor $\boldsymbol{\eta}$. Then the factor loadings represent the regression coefficients in the linear relationships between the p indicators and the q-variate latent factor $\boldsymbol{\eta}$. It is an essential assumption underlying the factor model that the factor loadings are invariant across subjects. That is, $\boldsymbol{\Lambda}$ does not depend upon i, where the subscript i stands for subject i in the population; i = 1,2,… . Hence the assumption is that each individual person i in the population has a person-specific q-variate factor score $\boldsymbol{\eta}_i$ and person-specific p-variate error score $\boldsymbol{\varepsilon}_i$, but the factor model for each person in the population has the same (p,q)-dimensional matrix of factor loadings $\boldsymbol{\Lambda}$.

Suppose now that we carry out a simulation experiment in which each person not only has a person-specific q-variate factor score and p-variate error score, but also a person-specific set of values for the factor loadings $\boldsymbol{\Lambda}_i$, i = 1,2,… .
Hence each person has a person-specific factor model:

(5) $\mathbf{y}_i = \boldsymbol{\Lambda}_i\boldsymbol{\eta}_i + \boldsymbol{\varepsilon}_i, \quad i = 1, 2, \ldots$

This heterogeneity of factor loadings $\boldsymbol{\Lambda}_i$, i = 1,2,…, constitutes a severe violation of an important assumption underlying the standard factor model, namely the assumption that the matrix of factor loadings should be invariant (fixed) across subjects. The fact that the matrix of factor loadings in (5) is subject-specific implies that the way in which factors are expressed in the observed scores is qualitatively different for different subjects. These inter-individual differences in the values of factor loadings are called qualitative because the substantive interpretation (semantic labeling) of factors is based on these loading values.

Despite the fact that (5) involves a severe violation of the qualitative homogeneity assumption (invariance of factor loadings across subjects) underlying the standard factor model (4), it was shown in a number of simulation studies that factor analysis of inter-individual covariation appears to be insensitive to this violation. The typical set-up of these simulation studies was to generate data according to the person-specific (qualitatively heterogeneous) factor model (5), and then fit the standard factor model (4) to the simulated data. Although one would expect the fit of model (4) to be poor due to the fact that the simulated data violate the assumption of qualitative homogeneity underlying model (4), it turns out that this is not at all the case. The general finding in these simulation studies is that (variants of) factor model (4) provide(s) satisfactory fits to data generated according to (variants of) factor model (5). Satisfactory fits, that is, according to all usual criteria of goodness-of-fit, such as the chi-squared likelihood ratio test, standardized root mean square residual, and root mean square error of approximation (cf. Brown, 2006, for definitions and discussion of these criteria).
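A small numerical sketch may help to see why the covariance structure can remain silent about such heterogeneity. The Python fragment below is not the design of the simulation studies cited in this section and performs no maximum likelihood fit. It assumes a single factor (q = 1), four indicators (p = 4), and person-specific loadings that deviate from a common mean by independent normal amounts; all numerical values and function names are hypothetical. Under these assumptions the off-diagonal covariances equal the products of the mean loadings, so the tetrad differences implied by a one-factor model vanish even though every simulated subject has a different loading vector.

```python
import random

def simulate_heterogeneous(n_subjects, mean_loadings, loading_sd, error_sd, seed=0):
    """One observation per subject from a one-factor model with subject-specific loadings."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        eta = rng.gauss(0.0, 1.0)  # subject-specific factor score
        row = []
        for lam in mean_loadings:
            lam_i = lam + rng.gauss(0.0, loading_sd)  # subject-specific loading
            row.append(lam_i * eta + rng.gauss(0.0, error_sd))
        data.append(row)
    return data

def covariance_matrix(data):
    """Sample covariance matrix of the columns of data."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for row in data:
        d = [row[j] - means[j] for j in range(p)]
        for j in range(p):
            for k in range(p):
                cov[j][k] += d[j] * d[k]
    return [[c / (n - 1) for c in r] for r in cov]

mean_loadings = [0.8, 0.7, 0.6, 0.5]
data = simulate_heterogeneous(20000, mean_loadings, loading_sd=0.4, error_sd=0.5, seed=3)
S = covariance_matrix(data)

# Tetrad differences: zero in the population if a one-factor structure holds.
tetrad_1 = S[0][1] * S[2][3] - S[0][2] * S[1][3]
tetrad_2 = S[0][1] * S[2][3] - S[0][3] * S[1][2]
print(round(S[0][1], 3), round(tetrad_1, 4), round(tetrad_2, 4))
```

In this set-up the heterogeneity of the loadings only inflates the diagonal of the covariance matrix, where it is indistinguishable from measurement error variance.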
Nowhere in the obtained (maximum likelihood) solutions is there a flag waving to indicate that something is fundamentally wrong. These simulation studies were based on the cross-sectional factor model (Molenaar, 1997), the longitudinal factor model (Molenaar, 1999) and the behavior genetical factor model for multivariate phenotypes of MZ and DZ twins (Molenaar et al., 2003). A mathematical-statistical proof of the insensitivity of the factor model of inter-individual covariation to the qualitative heterogeneity of the factor loadings is given in Kelderman & Molenaar (2006).

Evidently, the finding that the standard factor model of inter-individual covariation is insensitive to the presence of extreme qualitative heterogeneity in the population of subjects, created by the person-specific matrices of factor loadings $\boldsymbol{\Lambda}_i$, i = 1,2,…, in (5), raises serious questions. To reiterate, nothing in the results obtained with the standard factor analyses based on model (4) indicates that the true state of affairs is in severe violation of the assumptions underlying this model. The standard factor models yield satisfactory fits to the data generated according to model (5). Consequently, the presence of substantial qualitative heterogeneity in the simulated data remains entirely hidden in the standard factor analyses based on inter-individual covariation.

Before discussing some of the consequences of this finding, it is noted that there exist a priori reasons to expect that widespread qualitative heterogeneity actually exists in human populations. The reasons have to do with the way in which cortical neural networks grow and adapt during the life span, namely by means of self-organizing epigenetic processes (cf. Molenaar et al., 1993). Self-organizing growth and adaptation give rise to emergent endogenous variation in neural network connections, even between homologous structures located at the left and right sides of the brain within the same subject (cf. Edelman, 1987).
In so far as cognitive information processing is associated with cortical neural activity, one can expect that these endogenously generated differences in neural network architectures will become discernible as qualitative heterogeneity of the structure of observed behavior of different subjects (see Molenaar, 2006, for further elaboration and mathematical-biological modeling of these epigenetic processes).

One direct consequence of the fact that standard factor analysis of inter-individual covariation is insensitive to qualitative heterogeneity is the following. Suppose that the standard q-factor model (4) yields a satisfactory fit to the data obtained with a test composed of p subtests (e.g., items). Let $\mathrm{est}[\boldsymbol{\Lambda}]$ denote the estimated (p,q)-dimensional matrix of factor loadings thus obtained. Suppose also that in reality qualitative heterogeneity is present in the population of subjects, so that the true (p,q)-dimensional matrix of factor loadings $\boldsymbol{\Lambda}_P$ for a given subject P differs substantially from the nominal loading matrix $\mathrm{est}[\boldsymbol{\Lambda}]$. For instance, several of the p subtests have negative or zero loadings in $\boldsymbol{\Lambda}_P$ whereas the analogous loadings in $\mathrm{est}[\boldsymbol{\Lambda}]$ are high and positive. Of course $\boldsymbol{\Lambda}_P$ is unknown in the context of standard factor analysis of inter-individual variation. The estimate of P’s factor score, $\mathrm{est}[\boldsymbol{\eta}_P]$, is based on the nominal loading matrix $\mathrm{est}[\boldsymbol{\Lambda}]$ and, because $\mathrm{est}[\boldsymbol{\Lambda}]$ is a poor approximation of the true $\boldsymbol{\Lambda}_P$, this estimate $\mathrm{est}[\boldsymbol{\eta}_P]$ will be substantially biased. For quantitative details about this bias the reader is referred to the publications mentioned above (Molenaar, 1999; Molenaar et al., 2003; Kelderman & Molenaar, 2006).

Another consequence of the insensitivity of standard factor analysis of inter-individual variation to qualitative heterogeneity concerns the fact that the semantic interpretation of factors thus obtained is inappropriate at the person-specific level.
Suppose that standard factor analysis of personality test scores yields the expected pattern of factor loadings in $\mathrm{est}[\boldsymbol{\Lambda}]$ corresponding to the Big Five theory (cf. Borkenau & Ostendorf, 1998). Then, if qualitative heterogeneity is present, the factor loadings in $\boldsymbol{\Lambda}_P$ for a particular person P may not at all conform to the Big Five pattern and hence the semantic interpretation of the factors for P will be different. Stated more specifically, the nominal semantic interpretation of the five factors obtained in standard factor analysis is inappropriate for P. The reader is referred to Hamaker, Dolan, & Molenaar (2005) for an elaborate illustration based on empirical personality test scores.

5. Heterogeneity in time

To reiterate, a (psychological) process should obey two criteria in order to qualify as an ergodic process. Firstly, the trajectory of each subject in the ensemble should conform to exactly the same dynamical laws (homogeneity of the ensemble). Secondly, each trajectory should have constant statistical characteristics in time (stationarity, i.e., constant mean level and serial dependencies which only depend upon relative time differences). In the previous sections attention has been confined to psychological processes which are non-ergodic because they violate the first criterion, i.e., heterogeneity of different trajectories in the ensemble. Whereas the first criterion involves a comparison between different trajectories, the second stationarity criterion involves comparison of the same trajectory at different times. In this section we will consider psychological processes which are non-ergodic because they violate the second criterion, i.e., they are non-stationary.

In general, non-stationarity implies that parameters of a dynamic system are time-varying. Prime examples of non-stationary systems are developmental systems which typically have time-varying parameters such as waxing and/or waning factor loadings.
For this reason developmental systems are non-ergodic and their analysis should be based on intra-individual variation. There exists a long tradition in theoretical developmental psychology in which it is argued that developmental processes should be analyzed at the level of intra-individual variation (time series data). The general designation for this tradition is Developmental Systems Theory (DST). Important contributions to DST include Wohlwill’s (1973) monograph on the concept of developmental functions describing intra-individual variation, Ford and Lerner’s (1992) integrative approach based on the interplay between intra-individual variation and inter-individual variation and change, and Gottlieb’s (1992, 2003) theoretical work on probabilistic epigenetic development.

Intra-individual analysis of non-stationary multivariate time series requires the availability of sophisticated statistical modeling techniques. We developed such a technique based on a systems model with arbitrarily time-varying parameters (Molenaar, 1994; Molenaar & Newell, 2003). Our model can be conceived of as a suitably generalized factor model for non-stationary p-variate time series $\mathbf{y}(t)$, t = 1,2,...,T. Its schematic form is:

(6a) $\mathbf{y}(t) = \boldsymbol{\Lambda}[\boldsymbol{\theta}(t)]\,\boldsymbol{\eta}(t) + \boldsymbol{\varepsilon}(t)$

(6b) $\boldsymbol{\eta}(t+1) = \mathbf{B}[\boldsymbol{\theta}(t)]\,\boldsymbol{\eta}(t) + \boldsymbol{\zeta}(t+1)$

(6c) $\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) + \boldsymbol{\xi}(t+1)$

In (6a) $\mathbf{y}(t)$ denotes the observed p-variate time series, $\boldsymbol{\eta}(t)$ is the q-variate latent factor series (system state process), and $\boldsymbol{\varepsilon}(t)$ is the p-variate measurement error process. The factor loadings in $\boldsymbol{\Lambda}[\boldsymbol{\theta}(t)]$ depend upon the r-variate time-varying parameter vector $\boldsymbol{\theta}(t)$. (6b) describes the evolution of the latent factor series $\boldsymbol{\eta}(t)$ by means of a q-variate stochastic difference equation (autoregression) relating $\boldsymbol{\eta}(t+1)$ to $\boldsymbol{\eta}(t)$, where $\boldsymbol{\zeta}(t+1)$ denotes the q-variate residual process. The (q,q)-dimensional matrix of regression weights $\mathbf{B}[\boldsymbol{\theta}(t)]$ depends upon the r-variate time-varying parameter vector $\boldsymbol{\theta}(t)$. (6c) describes the time-dependent variation of the unknown parameters.
The r-variate parameter vector process $\boldsymbol{\theta}(t)$ obeys a special stochastic difference equation: a random walk with r-variate innovations process $\boldsymbol{\xi}(t)$.

The system of equations (6a), (6b) and (6c) allows for the modeling of a large class of multivariate non-stationary (non-ergodic) processes. Equations (6a) and (6b) have the same formal structure as the well-known inter-individual longitudinal q-factor model, which helps in their interpretation. Yet the system of equations (6a), (6b) and (6c) is applied to analyze the structure of intra-individual variation underlying the observed p-variate time series $\mathbf{y}(t)$ obtained with a single subject. Generalization of this model to accommodate multivariate time series obtained in a replicated time series design is straightforward. Also extension of the model with arbitrary mean trend functions and covariate processes having time-varying effects is straightforward.

The fit of equations (6a), (6b) and (6c) to an observed p-variate time series $\mathbf{y}(t)$, t = 1,2,...,T, where T is the number of repeated measurements obtained with a single subject P, is based on advanced statistical analysis techniques taken from the engineering sciences (Bar-Shalom et al., 2001; Ristic et al., 2004). It consists of a combination of recursive estimation (filtering), smoothing, and iteration (EKFIS: Extended Kalman Filter with Iteration and Smoothing). The EKFIS yields a time series (trajectory) of estimated values for each of the r parameters in $\boldsymbol{\theta}(t)$: $\theta_k(t)$, t = 1,2,...,T; k = 1,2,...,r.

To illustrate the performance of the EKFIS, the following small simulation study has been carried out. A 4-variate (p = 4) time series $\mathbf{y}(t)$ has been generated by means of the state-space model with time-varying parameters (6a), (6b) and (6c). The model has a univariate (q = 1) latent state process $\eta(t)$.
The autoregressive coefficient $\mathbf{B}[\boldsymbol{\theta}(t)] = b(t)$ in the process model (6b) for the latent state increases linearly from 0.0 to 0.9 over the observation interval comprising T = 100 time points: $b(t) = 9t/1000$, t = 1,2,…,100. Hence the sequential dependence (autocorrelation) of the latent state process (latent factor series) increases from zero to 0.9 across 100 time points and therefore is highly time-varying (non-stationary, hence non-ergodic). Depicted in Figure 1 is the estimate of this autoregressive weight b(t) obtained by means of the EKFIS based on a single subject time series $\mathbf{y}(t)$, t = 1,2,...,100. It is clear that the estimated trajectory closely tracks the true time-varying path of this parameter.

Figure 1: EKFIS estimate of time-varying coefficient b(t) in the autoregressive model for the latent factor scores. [Plot of b(t) against time, t = 1 to 100, showing the true value, the EKFIS estimate, and the upper and lower confidence interval bounds.]

6. Discussion and conclusion

In this chapter some of the implications of the classical ergodic theorems have been considered in the contexts of classical test theory, factor analysis of inter-individual covariation, and the analysis of non-stationary developmental processes. In each of these contexts the classical ergodic theorems imply that instead of using standard statistical approaches based on analysis of inter-individual variation, it is necessary to use single-subject time series analysis of intra-individual variation. This conclusion holds for individual assessment based on classical test theory, for testing the assumption of homogeneity (fixed factor loadings across subjects) in factor analysis of inter-individual covariation, and for the analysis of non-stationary processes such as learning and developmental processes.
The consequences of the classical ergodic theorems in these and many other contexts in psychology imply that time series designs and time series analysis techniques will have to be assigned a much more prominent place than is currently the case in psychological methodology. The overall aim of scientific research in psychology still should be to arrive at general (nomothetic) laws that hold for all subjects in a well-defined population. But the inductive tools to arrive at such general laws have to be fundamentally different from the currently standard approaches for those psychological processes which are non-ergodic. Only if a psychological process is ergodic, i.e., obeys the two criteria of homogeneity and stationarity, can results obtained by means of analysis of inter-individual variation be generalized to the level of intra-individual variation. But the two criteria for ergodicity are very strict and many psychological processes of interest will fail to obey these criteria. Psychologists have to understand that ergodicity is the special case, whereas non-ergodicity is the rule. For non-ergodic psychological processes analysis of inter-individual variation yields results that may not apply to any of the individual subjects in the population of subjects.

In conclusion, the inductive tools which are necessary to arrive at general (nomothetic) laws for non-ergodic processes involve the search for commonalities between single-subject process models fitted to time series data obtained in replicated time series designs. The latter search for commonalities between single-subject process models can be based on standard mixed modeling techniques (see the excellent textbook of Demidenko, 2004).

Having available appropriate time series models for each individual subject opens up possibilities which are entirely new in psychology. These possibilities involve the optimal control of ongoing psychological processes.
For instance, consider the following special instance of the system of equations (6a), (6b):

(7a) $\mathbf{y}(t) = \boldsymbol{\Lambda}\boldsymbol{\eta}(t) + \boldsymbol{\varepsilon}(t)$

(7b) $\boldsymbol{\eta}(t+1) = \mathbf{B}\boldsymbol{\eta}(t) + \boldsymbol{\Gamma}\mathbf{u}(t) + \boldsymbol{\zeta}(t+1)$

Here the same definitions apply as for equations (6a), (6b). Notice that in (7a) and (7b) the (p,q)-dimensional matrix of factor loadings $\boldsymbol{\Lambda}$ and the (q,q)-dimensional matrix of regression weights $\mathbf{B}$ are assumed to be constant in time. This is to ease the presentation; generalization of what follows to the non-stationary model given by (6a), (6b) and (6c) is straightforward. Notice also that (7b) contains a new component: $\mathbf{u}(t)$. The s-variate process $\mathbf{u}(t)$ represents a known process that can be manipulated; for instance dose of medication or environmental stimulation. $\boldsymbol{\Gamma}$ is a (q,s)-dimensional matrix of regression weights.

Suppose that (7a) and (7b) provide a faithful description of the p-variate time series $\mathbf{y}(t)$ for subject P. It then is possible to determine $\mathbf{u}(t)$ in such a way that the state process $\boldsymbol{\eta}(t)$ is steered to its desired level $\boldsymbol{\eta}^{\#}$, where $\boldsymbol{\eta}^{\#}$ is chosen by the controller. The optimal input $\mathbf{u}^{@}(t)$ is determined according to the following schematic feedback function:

(8) $\mathbf{u}^{@}(t) = F[\mathbf{y}(t), t]$

where F[.] denotes an (s,p)-dimensional nonlinear feedback function. Application of $\mathbf{u}^{@}(t)$ at time t guarantees that the state process $\boldsymbol{\eta}(t+1)$ at the next time point t+1 will be as close as possible to the desired level $\boldsymbol{\eta}^{\#}$.

Optimal control is an important field of research in the engineering sciences. There exists a vast literature on many different variants of optimal control (cf. Kwon, 2005, for a thorough explanation of the currently most advanced approaches). These control techniques can be applied straightforwardly in analyses of intra-individual variation in order to steer psychological processes in desired directions (cf. Molenaar, 1987, for an application to the optimal control of a psychotherapeutic process).
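The feedback idea can be made concrete with a deliberately simple sketch. The Python fragment below treats the scalar case (p = q = s = 1) of (7a) and (7b) with a unit loading and known parameters, estimates the latent state with a scalar Kalman filter, and then chooses the input that moves the predicted next state onto the target (one-step-ahead, certainty-equivalent control). This is only one elementary variant and is not the receding horizon approach of Kwon (2005); all parameter values and names are hypothetical.

```python
import random

B, GAMMA = 0.8, 0.5        # autoregressive weight and input weight in (7b), assumed known
Q_SD, R_SD = 0.3, 0.3      # standard deviations of process noise and measurement error
TARGET = 2.0               # desired level of the latent state, chosen by the controller

def simulate(T, controlled, seed=0):
    """Simulate the scalar system (7a)-(7b), with or without feedback control."""
    rng = random.Random(seed)
    eta = 0.0                  # true latent state, unknown to the controller
    eta_hat, P = 0.0, 1.0      # predicted state and its error variance
    states = []
    for _ in range(T):
        y = eta + rng.gauss(0.0, R_SD)           # observation (7a), loading 1
        K = P / (P + R_SD ** 2)                  # Kalman gain
        eta_hat = eta_hat + K * (y - eta_hat)    # filtered state estimate
        P = (1.0 - K) * P
        # Feedback: input that puts the predicted next state on the target.
        u = (TARGET - B * eta_hat) / GAMMA if controlled else 0.0
        eta = B * eta + GAMMA * u + rng.gauss(0.0, Q_SD)   # state transition (7b)
        eta_hat = B * eta_hat + GAMMA * u        # one-step-ahead prediction
        P = B * B * P + Q_SD ** 2
        states.append(eta)
    return states

free = simulate(500, controlled=False, seed=7)
steered = simulate(500, controlled=True, seed=7)
mean_free = sum(free[20:]) / len(free[20:])
mean_steered = sum(steered[20:]) / len(steered[20:])
print(round(mean_free, 2), round(mean_steered, 2))
```

In this sketch the input at time t depends on the observed scores up to and including y(t) through the filtered estimate, which is one simple way to realize a feedback function of the kind written schematically in (8).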
This opens up an entirely new and promising field of applied psychology: person-specific modeling and adaptive control of ongoing psychological processes.

References

Anderson, T.W. (1971). The statistical analysis of time series. New York: Wiley.
Bar-Shalom, Y., Li, X.R., & Kirubarajan, T. (2001). Estimation with applications to tracking and navigation. New York: Wiley.
Birkhoff, G.D. (1931). Proof of the ergodic theorem. Proceedings of the National Academy of Sciences USA, 17, 656-660.
Borkenau, P., & Ostendorf, F. (1998). The Big Five as states: How useful is the five-factor model to describe intra-individual variations over time? Journal of Research in Personality, 32, 202-221.
Brown, T.A. (2006). Confirmatory factor analysis for applied research. New York: Guilford Press.
Cattell, R.B. (1952). The three basic factor-analytic designs – Their interrelations and derivatives. Psychological Bulletin, 49, 499-520.
Choe, G.H. (2005). Computational ergodic theory. Berlin: Springer.
De Groot, A.D. (1954). Scientific personality diagnosis. Acta Psychologica, 10, 220-241.
Demidenko, E. (2004). Mixed models: Theory and applications. Hoboken, NJ: Wiley.
Edelman, G.M. (1987). Neural Darwinism: The theory of neuronal group selection. New York: Basic Books.
Ford, D.H., & Lerner, R.M. (1992). Developmental systems theory. Newbury Park: Sage.
Gescheider, G.A. (1997). Psychophysics: The fundamentals. Mahwah, NJ: Erlbaum.
Gottlieb, G. (1992). Individual development and evolution: The genesis of novel behavior. New York: Oxford University Press.
Gottlieb, G. (2003). On making behavioral genetics truly developmental. Human Development, 46, 337-355.
Hamaker, E.L., Dolan, C.V., & Molenaar, P.C.M. (2005). Statistical modeling of the individual: Rationale and application of multivariate time series analysis. Multivariate Behavioral Research, 40, 207-233.
Hogan, J.A., & Lakey, J.D. (2005). Time-frequency and time-scale methods: Adaptive decompositions, uncertainty principles, and sampling. Boston: Birkhäuser.
Houtveen, J.H., & Molenaar, P.C.M. (2001). Comparison between the Fourier and wavelet methods of spectral analysis applied to stationary and nonstationary heart period data. Psychophysiology, 38, 729-735.
Jensen, A.R. (2006). Clocking the mind: Mental chronometry and individual differences. Amsterdam: Elsevier.
Kelderman, H., & Molenaar, P.C.M. (2006). The effect of individual differences in factor loadings on the standard factor model (to appear in Multivariate Behavioral Research).
Kwon, W.H. (2005). Receding horizon control: Model predictive control for state models. London: Springer.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Molenaar, P.C.M. (1987). Dynamic assessment and adaptive optimization of the therapeutic process. Behavioral Assessment, 9, 389-416.
Molenaar, P.C.M., & Roelofs, J.W. (1987). The analysis of multiple habituation profiles of single trial evoked potentials. Biological Psychology, 24, 1-21.
Molenaar, P.C.M., Boomsma, D.I., & Dolan, C.V. (1993). A third source of developmental differences. Behavior Genetics, 23, 519-524.
Molenaar, P.C.M. (1994). Dynamic latent variable models in developmental psychology. In: A. von Eye & C.C. Clogg (Eds.), Analysis of latent variables in developmental research. Newbury Park: Sage, pp. 155-180.
Molenaar, P.C.M. (1997). Time series analysis and its relationship with longitudinal analysis. International Journal of Sports Medicine, 19, 232-237.
Molenaar, P.C.M. (1999). Longitudinal analysis. In: H.J. Ader & G.J. Mellenbergh (Eds.), Research methodology in the social, behavioral and life sciences. London: Sage, pp. 143-167.
Molenaar, P.C.M., Huizenga, H.M., & Nesselroade, J.R. (2003). The relationship between the structure of interindividual and intraindividual variability: A theoretical and empirical vindication of Developmental Systems Theory. In: U.M. Staudinger & U. Lindenberger (Eds.), Understanding human development: Dialogues with life-span psychology. Dordrecht: Kluwer, pp. 339-360.
Molenaar, P.C.M. (2003). State space techniques in structural equation modeling: Transformation of latent variables in and out of latent variable models. 111 pages. Website: http://www.hhdev.psu.edu/hdfs/faculty/molenaar.html
Molenaar, P.C.M., & Newell, K.M. (2003). Direct fit of a theoretical model of phase transition in oscillatory finger motions. British Journal of Mathematical and Statistical Psychology, 56, 199-214.
Molenaar, P.C.M. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement, 2, 201-218.
Molenaar, P.C.M. (2006). On the implications of the classic ergodic theorems: Analysis of developmental processes has to focus on intra-individual variation (submitted).
Nayfeh, A.H., & Balachandran, B. (1995). Applied nonlinear dynamics: Analytical, computational, and experimental methods. New York: Wiley.
Petersen, K. Ergodic theory. Cambridge: Cambridge University Press.
Priestley, M.B. (1988). Non-linear and non-stationary time series analysis. London: Academic Press.
Ristic, B., Arulampalam, S., & Gordon, N. (2004). Beyond the Kalman filter: Particle filters for tracking applications. London: Artech House.
Walters, P. (1982). An introduction to ergodic theory. 2nd edition. New York: Springer.
Wohlwill, J.F. (1973). The study of behavioral development. New York: Academic Press.