INFORMATION-STATISTICAL APPROACH FOR TEMPORAL-SPATIAL DATA WITH APPLICATION

BON K. SY
Queens College/CUNY, Computer Science Department, Flushing NY 11367 U.S.A.
bon@bunny.cs.qc.edu

ARJUN K. GUPTA
Bowling Green State University, Department of Mathematics and Statistics, Bowling Green OH 43403 U.S.A.
gupta@bgnet.bgsu.edu

Abstract

A treatment for temporal-spatial data such as atmospheric temperature using an information-statistical approach is proposed. Conditioning on the specific spatial nature of the data, the temporal aspect of the data is first modeled parametrically as Gaussian, and the Schwarz information criterion is then applied to detect multiple mean change points --- and thus multiple Gaussian statistical models --- to account for changes of the population mean over time. To examine the spatial characteristics of the data, successive mean change points are qualified by finite categorical values. The distribution of the finite categorical values is then used to estimate a non-parametric probability model through a non-linear SVD-based optimization approach, where the optimization criterion is the Shannon expected entropy. This optimal probability model accounts for the spatial characteristics of the data and is then used to derive spatial association patterns subject to a chi-square hypothesis test. The proposed approach is applied to examine the weather data set obtained from NOAA. Selected temperature data are studied. These data cover different geographical localities in the United States, with some spanning over 200 years. Preliminary results are reported.

Keywords: Temporal-spatial data; Information theory; Schwarz information criterion; Probability model optimization; Statistical association pattern.

1. Introduction

This paper presents a treatment for temporal-spatial data using an information-statistical approach. Let Oi(t1) … Oi(tn) be a sequence of n independent observations made at the ith location, where i = 1..p. The temporal-spatial data analysis to be discussed in this paper can be formulated as a 3-step process:
1. Given a specific location indexed by i, and assuming the observations are Gaussian, the specific task is to detect mean change points in the Gaussian model.
2. Upon detection of mean change points, the specific task is to identify the optimal non-parametric probability model with discrete-valued random variables X1, X2, …, Xp, where each value of Xi accounts for a possible qualitative change between successive mean change points; e.g., {Xi: x1 = increase, x2 = no-change, x3 = decrease}.
3. Upon derivation of the optimal non-parametric probability model, the specific task is to identify statistical association patterns manifested as a p-dimensional vector of {vi,j: i = 1..p, j = 1..3}, where vi,j represents the jth value of random variable Xi.

An example of temporal-spatial data is monthly average temperature data. Consider three distinct city locations in the U.S. --- Boston, Denver, and Houston, indexed by 1, 2, 3 --- and 4 years (or 48 months) of monthly average temperature data. We may represent these temporal-spatial temperature data by Oi(t1) … Oi(tn), where n = 48 and p = 3 (i.e., i = 1..3). Using this temperature data example, the temporal-spatial data analysis to be discussed in this paper attempts to address two questions:
1. Supposing the monthly average temperature data of each city are Gaussian distributed, are there multiple (Gaussian) mean change points in the monthly average temperature data, and if so, where in the time sequence are they located?
2.
If multiple change points exist, are there any significant statistical association patterns that characterize the changes in the mean of the monthly average temperature data of the three cities?

We will now present the problem formulation for each step. The problem formulation for step 1 is focused on the temporal aspect of the temporal-spatial data. The problem formulation for step 2 acts as a "bridge" process to shift the focus of the analysis from the temporal aspect to the spatial aspect. The problem formulation for step 3 is focused on the spatial aspect of the analysis of temporal-spatial data.

Problem formulation 1 (for step 1): Let X1(T), X2(T), …, Xp(T) be p time-varying random variables. For some i ∈ {1..p}, let Oi(t1) … Oi(tn) be a sequence of n independent observations of Xi(T). Supposing each observation Oi(tj) is obtained from a normal distribution model with unknown mean μi,j and common variance σ², we would like to test the hypothesis:

H0: μi,1 = … = μi,n = μi (unknown)

versus the alternative:

H1: μi,1 = … = μi,c1 ≠ μi,c1+1 = … = μi,c2 ≠ … ≠ μi,cq+1 = … = μi,n

where 1 ≤ c1 < c2 < … < cq < cq+1 = n.

For a specific i, if H0 is accepted, this statistical test implies that all the observations of Xi(T) belong to a single normal distribution model with mean μi. In other words, Xi(T) can be modeled by a single Gaussian model. If H0 is rejected, it implies that each observation of Xi(T) belongs to one of q+1 populations; i.e., Xi(T) has to be modeled by q+1 Gaussian models.

Problem formulation 2 (for step 2): Following the problem formulation for step 1, and assuming H0 is rejected in favor of H1, the change from μi,j to μi,j+1 is qualified as either an increase or a decrease, where j ∈ {c1, ..., cq}. For any fixed time unit k′ ∈ {1..n}, we abbreviate Xi(k′) as Xi. Note that Xi can assume one of three discrete values according to the following rules:

Xi = x3 (decrease, ↓) if k′ ∈ {c1, …, cq} and μi,k′ > μi,k′+1
Xi = x2 (no-change, →) if k′ ∉ {c1, …, cq}
Xi = x1 (increase, ↑) if k′ ∈ {c1, …, cq} and μi,k′ < μi,k′+1    (1)

Given the marginal and joint frequency counts of the possible discrete values of {Xi}, we would like to identify an optimal discrete-valued probability model that maximally preserves the known probability information while minimizing the bias introduced for the unknown probability information. The optimization criterion will be the Shannon expected entropy, which captures the principle of minimally biased unknown information. We will later show that this problem formulation is indeed an optimization problem with linear constraints and a non-linear objective function.
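To make rule (1) concrete, a minimal Python sketch of the qualification step follows. The function name and data layout are illustrative assumptions, not part of the authors' software.

    def qualify_change_points(segment_means, change_points, n):
        """Rule (1): map each time index k' in 1..n to a categorical value.

        segment_means -- q+1 estimated Gaussian means, one per segment
        change_points -- sorted indices c1 < ... < cq of the q change points
        n             -- total number of observations
        """
        labels = {}
        for k in range(1, n + 1):
            if k not in change_points:
                labels[k] = "no-change"            # k' not in {c1, ..., cq}
            else:
                seg = change_points.index(k)       # segment ending at k'
                before, after = segment_means[seg], segment_means[seg + 1]
                labels[k] = "decrease" if before > after else "increase"
        return labels

    # Two change points partition five observations into three segments.
    print(qualify_change_points([10.0, 12.5, 11.0], [2, 4], 5))
    # {1: 'no-change', 2: 'increase', 3: 'no-change', 4: 'decrease', 5: 'no-change'}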
Problem formulation 3 (for step 3): Upon identification of the optimal probability model, we would like to investigate the existence of statistically significant spatial patterns characterized by the joint event X = {Xi : xi ∈ {↑, →, ↓}}, where |X| = p. Specifically, we would like to test the hypothesis:

H0: {Xi : xi} in X are independent of each other for i = 1..p

versus the alternative:

H1: {Xi : xi} in X are interdependent for i = 1..p.

2. Information-Statistical Analysis

@ Problem formulation 1 (for step 1): Recall the formulation presented in section 1: for some i ∈ {1..p}, let Oi(t1) … Oi(tn) be a sequence of n independent observations of Xi(T). Supposing each observation Oi(tj) is obtained from a normal distribution model with unknown mean μi,j and common variance σ², we would like to test, for each i ∈ {1..p}, the hypothesis:

H0: μi,1 = … = μi,n = μi (unknown)

versus the alternative:

H1: μi,1 = … = μi,c1 ≠ μi,c1+1 = … = μi,c2 ≠ … ≠ μi,cq+1 = … = μi,n

where 1 ≤ c1 < c2 < … < cq < cq+1 = n.

The statistical hypothesis test shown above compares the null hypothesis, under which there is no change in the mean, against the alternative hypothesis, under which there are q changes at the instants c1, c2, …, cq. To determine whether there are multiple change points for the mean, the Schwarz Information Criterion (SIC), along with the binary segmentation technique (Vostrikova, 1981), is employed. Our choice of the Schwarz Information Criterion for change point detection is not arbitrary. SIC can be considered a variant of the Akaike Information Criterion (AIC) with a different penalty factor. When AIC is used as a measure of model evaluation, the minimum AIC estimate (MAICE) is most appropriate. However, MAICE is not an asymptotically consistent estimate of the model order (Schwarz, 1978). In contrast, SIC gives a consistent estimator of the true model. In addition, the test statistic of SIC (Gupta, 1996) has a chi-square limiting distribution. The estimator of the change point is found to be consistent and has a convergence rate identical to all likelihood-based methods. Furthermore, the method based on the information criterion, as opposed to likelihood methods, has better power in detecting changes occurring in the middle of the process, and a competitive power performance in the other cases.

The Schwarz Information Criterion has the form SIC = −2 log L(θ̂) + r log n, where L(θ̂) is the maximized likelihood function for the model, r is the number of free parameters in the model, and n is the sample size. In this setting we have one and q+1 models corresponding to the null and the alternative hypotheses, respectively. The decision to accept H0 or H1 is made based on the principle of minimum information criterion. That is, we do not reject H0 if SIC(n) ≤ min_{m ≤ k ≤ n−m} SIC(k) (where m = 1 in this case for the univariate model), and we reject H0 if SIC(n) > SIC(k) for some k, estimating the position of the change point k0 by k̂ such that SIC(k̂) = min_{m ≤ k ≤ n−m} SIC(k). For detecting multiple change points, the binary segmentation technique proposed by Vostrikova is employed. Further details can be found elsewhere (Gupta, 1996; Chen, 1997). The binary segmentation is repeated for every value of i ∈ {1..p}.

Under H0 and a given i, SIC(n) = −2 log L(θ̂n) + r log n. For the Gaussian model:

SIC(n) = n log 2π + n + n log σ̂i² + 2 log n    (2)

where σ̂i² = (1/n) ∑_{j=1..n} (Oi(tj) − μ̂i)² and μ̂i = (1/n) ∑_{j=1..n} Oi(tj).

Under H1, SIC(k) = −2 log L(θ̂k) + r log n. For the Gaussian model:

SIC(k) = n log 2π + n + n log σ̂i′² + 3 log n    (3)

where σ̂i′² = (1/n) [∑_{j=1..k} (Oi(tj) − μ̂′k)² + ∑_{j=k+1..n} (Oi(tj) − μ̂′n−k)²], with μ̂′k = (1/k) ∑_{j=1..k} Oi(tj) and μ̂′n−k = (1/(n−k)) ∑_{j=k+1..n} Oi(tj).
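As an illustration of equations (2) and (3) combined with the binary segmentation search, the following is a minimal Python sketch of SIC-based mean change point detection. It assumes numpy is available; the function names are illustrative, and m = 2 is used (rather than m = 1 as in the text) so that each candidate segment has at least two points for the variance estimate.

    import numpy as np

    def sic_no_change(x):
        """SIC(n) under H0: single Gaussian model, r = 2 (mean, variance)."""
        n = len(x)
        var = np.mean((x - x.mean()) ** 2)
        return n * np.log(2 * np.pi) + n + n * np.log(var) + 2 * np.log(n)

    def sic_change_at(x, k):
        """SIC(k) under H1: mean change after the k-th point, r = 3."""
        n = len(x)
        left, right = x[:k], x[k:]
        var = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2)) / n
        return n * np.log(2 * np.pi) + n + n * np.log(var) + 3 * np.log(n)

    def detect_change_points(x, offset=0, m=2):
        """Binary segmentation (Vostrikova, 1981) over SIC comparisons."""
        n = len(x)
        if n < 2 * m:
            return []
        k_hat = min(range(m, n - m + 1), key=lambda k: sic_change_at(x, k))
        if sic_no_change(x) <= sic_change_at(x, k_hat):
            return []                         # do not reject H0 on this segment
        return (detect_change_points(x[:k_hat], offset, m)
                + [offset + k_hat]
                + detect_change_points(x[k_hat:], offset + k_hat, m))

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
    print(detect_change_points(x))            # expect a change point near index 50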
@ Problem formulation 2 (for step 2): When change points are detected in step 1, each change point partitions the temporal-spatial data set into two sub-populations. Each population mean can be estimated following a procedure similar to that described in step 1. The change in the mean between two (time-wise) adjacent sub-populations can be qualified using one of three possible categorical values: increase, same, or decrease. Since each change point has a corresponding time index, not only can the marginal frequency information of the corresponding "spatial-specific" random variable be derived, but the joint frequency information relating multiple variables can also be derived by aligning on the common time index.

Consider the following snapshot of the monthly average temperature data of January and February from year 1988 to 1992 of five cities: Chicago (CH), Washington D.C. (DC), Houston (HO), San Francisco (SF), and Boston (BO):

[Table: change-point indicators for CH, DC, HO, SF, and BO over the years 1988-1992; each cell contains ↑, ↓, or →]

In the above table, "↑" or "↓" marks the location of a change point (with increased or decreased mean, respectively), and "→" means that no change point was detected. For example, two change points are detected in Chicago --- 1989 and 1991. These two change points partition the monthly average temperature data into three sub-populations during the period 1988 to 1992: one sub-population prior to 1989, one between 1989 and 1991, and one after 1991. The "↑" in 1989 indicates that the Gaussian mean of the model for Chicago covering the period prior to 1989 is smaller than the Gaussian mean of the model covering the period between 1989 and 1991. Similarly, the "↓" in 1991 indicates that the Gaussian mean of the model covering the period between 1989 and 1991 is greater than that of the period after 1991.

With the conception just discussed, each city can be perceived as a discrete-valued random variable, and the frequency count information reflecting change points indicated by "↑" and "↓" may be used to derive the corresponding probability distribution. For example:

Pr(BO:↑) = ∑_{CH,DC,HO,SF} Pr(CH, DC, HO, SF, BO:↑) = 1/5
Pr(BO:↓) = ∑_{CH,DC,HO,SF} Pr(CH, DC, HO, SF, BO:↓) = 1/5
Pr(DC:↑) = Pr(CH:↑) = Pr(CH:↓) = Pr(HO:↓) = Pr(SF:↓) = 1/5
Pr(HO:↓, SF:↓ | BO:↓) = 1
Pr(CH, DC, HO, SF, BO) ≥ 0 for every combination of values of CH, DC, HO, SF, BO
∑_{CH,DC,HO,SF,BO} Pr(CH, DC, HO, SF, BO) = 1

In this example, the probability model consists of 3⁵ = 243 joint probability terms Pr(CH, DC, HO, SF, BO). In theory, it is possible to have up to 243 linearly independent constraints for the existence of a valid probability model. But in practice we may care only about the constraints that carry statistically significant information. In the above example, we show a case of nine linear probability constraints. Note that one could also observe the constraint Pr(DC:↑ | BO:↑) = 1 in the table, but it is intentionally omitted from the above example. Given these probability constraints, we would like to derive an optimal probability model subject to

Max [− ∑_{CH,DC,HO,SF,BO} Pr(CH, DC, HO, SF, BO) log Pr(CH, DC, HO, SF, BO)].

As just shown, identifying an optimal probability model based on the marginal and joint frequency information of discrete random variables is an optimization problem. Specifically, the optimization problem consists of a set of linear probability constraints and a non-linear objective function due to the Shannon entropy optimization criterion. In the operations research community, techniques for solving various optimization problems have been discussed extensively. The Simplex and Karmarkar algorithms (Borgwardt, 1987; Karmarkar, 1984) are two commonly used methods that are robust for solving many linear optimization problems. Wright (Wright, 1997) has written an excellent textbook on the primal-dual formulation of the interior point method with different variants of search methods for solving non-linear optimization problems. Wright's book discusses how the primal-dual interior point method is robust in searching for optimal solutions of problems that satisfy the KKT (Karush-Kuhn-Tucker) conditions with a second-order objective function. In this research, we adopt an optimization problem-solving approach that follows the spirit of the primal-dual interior point method.
But this approach deviates from the traditional approach in the search towards an optimal solution, in the sense that it integrates two approaches for solving the algebraic system of linear probability constraints; namely, the Kuenzi, Tzschach, and Zehnder approach (Kuenzi et al, 1971) and the Singular Value Decomposition (SVD) algorithm (Wright, 1997). Further theoretical details can be found elsewhere (Sy, 2001). The solution of the example optimization problem just discussed is shown below:

Pr(CH:↑, DC:↑, HO:→, SF:→, BO:↑) = 0.2
Pr(CH:→, DC:→, HO:→, SF:→, BO:→) = 0.4
Pr(CH:↓, DC:→, HO:→, SF:→, BO:→) = 0.2
Pr(CH:→, DC:→, HO:↓, SF:↓, BO:↓) = 0.2
Pr(CH, DC, HO, SF, BO) = 0 for the remaining joint probability terms.

The entropy of the optimal model is Max [− ∑_{CH,DC,HO,SF,BO} Pr(CH, DC, HO, SF, BO) log₂ Pr(CH, DC, HO, SF, BO)] = 1.9219 bits.

In the above example, one may wonder why we do not simply use the frequency count information of all variables to derive the desired probability model. There are several reasons, owing to the limitations and nature of real-world problems. Using the temperature data example, a weather station of each city is uniquely characterized by factors such as the elevation of the station, operational hours and period (since inception), specific adjacent stations for data cross-validation, and calibration for precision and accuracy correction. In particular, the size of the sample temperature data does not have to be identical across all weather stations. Nonetheless, the location of change points depends on each individual marginal population, and the observation of the conditional occurrence of change points of data with different spatial characteristic values (locations) is still valid. In other words, temporal-spatial data may originate from different sources. Information from different sources does not have to be consistent, and may even at times be contradictory. However, each source may provide some, but not all, information that reaches general consensus, and collectively the sources may reveal additional information not covered by any individual one.
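To make the step-2 optimization concrete, the following is a minimal Python sketch that maximizes Shannon entropy subject to linear probability constraints using a generic off-the-shelf solver (scipy's SLSQP) rather than the SVD-based primal-dual method adopted in this research; a two-variable toy model with illustrative constraints replaces the 243-term city example for brevity.

    import numpy as np
    from itertools import product
    from scipy.optimize import minimize

    values = ["increase", "no-change", "decrease"]
    index = {pair: i for i, pair in enumerate(product(values, repeat=2))}
    n_terms = len(index)                      # 9 joint probability terms

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)            # guard against log(0)
        return float(np.sum(p * np.log2(p)))  # negative Shannon entropy (bits)

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # probabilities sum to 1
        {"type": "eq",                                    # Pr(X1 = increase) = 0.2
         "fun": lambda p: sum(p[index[("increase", b)]] for b in values) - 0.2},
        {"type": "eq",                                    # Pr(X2 = decrease) = 0.2
         "fun": lambda p: sum(p[index[(a, "decrease")]] for a in values) - 0.2},
    ]

    p0 = np.full(n_terms, 1.0 / n_terms)      # start from the uniform model
    res = minimize(neg_entropy, p0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_terms, constraints=constraints)
    print("optimal entropy (bits):", -res.fun)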
@ Problem formulation 3 (for step 3): Different aspects of the concept of patterns have been discussed extensively in a number of publications by Grenander (Grenander, 1993, 1996). One interesting aspect found by the first author of this paper is the possibility of interpreting joint events of discrete random variables that survive a statistical hypothesis test of interdependency as statistically significant association patterns. In doing so, significant previous work (Kullback, 1951, 1959; Good, 1960; Chen, 1998) may be used to provide a unified framework for linking information theory with statistical analysis. The significance of such a linkage is that it not only provides a basis for using statistical approaches to reveal hidden significant association patterns, but also for using information theory (Shannon, 1972) as a measurement instrument to determine the quality of information obtained from statistical analysis. The purpose of deriving an optimal probability model in step 2 is to provide a basis for uncovering statistically significant spatial patterns. Our approach is to identify statistically significant patterns based on event associations. Significant event associations may be determined by statistical hypothesis testing based on the mutual information measure or residual analysis (Kullback, 1959; Fisher, 1924).

Following step 2 and using the formulation discussed earlier, let X1 and X2 be two random variables with {x11 ... x1z} and {x21 ... x2m} as the corresponding sets of possible values. The expected mutual information measure of X1 and X2 is defined as

I(X1; X2) = ∑_{i,j} Pr(x1i, x2j) log₂ [Pr(x1i, x2j) / (Pr(x1i) Pr(x2j))].

Similarly, the expected mutual information measure of the interdependence among multiple variables (X1 … Xp) is

I(X1; …; Xp) = ∑_{i=1..z} … ∑_{j=1..m} Pr(x1i, …, xpj) log₂ [Pr(x1i, …, xpj) / (Pr(x1i) … Pr(xpj))]    (4)

Note that the expected mutual information measure is zero if the variables are independent of each other. Since the mutual information measure is asymptotically distributed as chi-square (Kullback, 1959; Goodman, 1968), statistical inference can be applied to test the null hypothesis --- where the variables are independent of each other --- against the alternative hypothesis --- where the variables are interdependent. Specifically, the null hypothesis is rejected if I(X1; …; Xp) ≥ χ²/2N, where N is the size of the data set and χ² is the chi-square test statistic. The χ² test statistic, due to Pearson, can be expressed as below:

χ² = ∑_{i=1..z} … ∑_{j=1..m} (o_{1i,…,pj} − e_{1i,…,pj})² / e_{1i,…,pj}    (5)

In the above equation, the χ² test statistic has degrees of freedom (|X1| − 1)(|X2| − 1) … (|Xp| − 1), where |Xi| is the number of possible value instantiations of Xi. Here o_{1i,…,pj} represents the observed count of the joint event (X1 = x1i, …, Xp = xpj), and e_{1i,…,pj} represents the expected count, computed from the hypothesized distribution under the assumption that X1, X2, ..., Xp are independent of each other.

The chi-square test statistic and mutual information measure just shown can be further extended to measure the degree of statistical association at the event level. That is, the significance of the statistical association of an event pattern E involving multiple variables can be measured using the test statistic

χ²_E = (o_{1i,…,pj} − e_{1i,…,pj})² / e_{1i,…,pj}    (6)

while the mutual information analysis of an event pattern is represented by log₂ [Pr(x1i, …, xpj) / (Pr(x1i) … Pr(xpj))]. As suggested elsewhere (Wong, 1997), the chi-square test statistic of an event pattern may be normally distributed. In such a case, one can perform a statistical hypothesis test to determine whether an event pattern E bears a significant statistical association. Specifically, the hypothesis test can be formulated as below:

Null hypothesis H0: E is not a significant event pattern when χ²_E < 1.96, where 1.96 corresponds to the 5% significance level of the normal distribution.
Alternative hypothesis H1: E is a significant event pattern otherwise.
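The following is a minimal Python sketch of the event-level test just described, computing the χ²_E statistic of equation (6) and the mutual information term for a single event pattern; the joint model, sample size, and helper names are illustrative assumptions.

    import numpy as np

    def event_statistics(joint, event, n_obs):
        """Return (chi-square statistic, MI term in bits) for one event pattern.

        joint -- dict mapping value tuples to joint probabilities
        event -- one value tuple, e.g. ("up", "up")
        n_obs -- number of observations behind the probability estimates
        """
        p = len(event)
        # Marginal probability of each variable taking its value in the event.
        marginals = [sum(pr for v, pr in joint.items() if v[i] == event[i])
                     for i in range(p)]
        p_joint = joint.get(event, 0.0)
        e_expected = n_obs * np.prod(marginals)   # expected count under independence
        o_observed = n_obs * p_joint              # observed count
        chi2_e = (o_observed - e_expected) ** 2 / e_expected
        mi_term = (np.log2(p_joint / np.prod(marginals))
                   if p_joint > 0 else float("-inf"))
        return chi2_e, mi_term

    # Toy joint model over two 2-valued variables, estimated from 100 samples.
    joint = {("up", "up"): 0.40, ("up", "down"): 0.10,
             ("down", "up"): 0.10, ("down", "down"): 0.40}
    chi2_e, mi = event_statistics(joint, ("up", "up"), 100)
    print(f"chi2_E = {chi2_e:.2f}, MI term = {mi:.2f} bits ->",
          "significant" if chi2_e >= 1.96 else "not significant")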
3. Temperature Analysis Application

The information-statistical approach discussed in this paper has been applied to analyze temperature data. The temperature data source is the GHCN (Global Historical Climatology Network) data set obtained from the National Oceanic and Atmospheric Administration (NOAA) [www http://www.ncdc.noaa.gov/wdcamet.html]. This data set consists of data collected from 30 sources, accounting for approximately 800 weather stations throughout the world. There are two versions of the GHCN data set. This study uses the second version (GHCN v2), which incorporates a comprehensive set of quality assurance procedures. During the process of compiling GHCN v2, homogeneity testing and adjustment techniques were developed (Peterson, 1994; Easterling et al, 1996). This is important because certain temperature readings date back to the pre-20th century; e.g., temperature readings for the city of Boston date back to 1747. The original form of these temperature readings is not electronic, and digitization of these readings is required. As a consequence, data entry during the digitization process may introduce human errors and outliers. Although nowadays temperature data are collected automatically by modern computerized instrumentation, combining the old and new data sets requires quality assurance addressing issues such as outlier identification and noise filtering, homogeneity testing, and data adjustment. Homogeneity adjustment concerns the corrections needed to make historical data, such as those from the 1800s, equivalent to data produced by 20th-century siting practices. Readers interested in further details, such as how, when, where, and by whom the data are collected, are referred to resources elsewhere (Peterson, 1994; Easterling et al, 1996; Grant, 1972), [www http://www.ncdc.noaa.gov/ol/climate/research/ghcn/ghcnqc.html].

The temporal distribution of temperature and its variation throughout the year depend primarily on the amount of radiant energy received from the sun. The spatial distribution of temperature data depends on geographical region in terms of latitude and longitude, as well as possible modification by the locations of continents and oceans, prevailing winds, oceanic circulation, topography, and other factors. Furthermore, spatial characteristics such as elevation also play a role in temperature changes. We note that there have already been extensive studies of global climate based on yearly average temperature (Barnett, 1978; Paltridge, 1981; Mitchell, 1963; UCAR, 1997; WR, 1998). The global temperature analysis conducted by Paltridge and Woodruff (Paltridge, 1981) is one of the most comprehensive studies covering the period 1880 to 1980. A more recent follow-up extending the study from 1980 onward can be found on the web site of the U.S. Environmental Protection Agency (E.P.A.) [www http://www.epa.gov/globalwarming/climate/trends/temperature.html].

In this preliminary study, we are not interested in duplicating previous studies analyzing global climate using yearly average temperature. Rather, our focus is on analyzing monthly average temperature, specifically in ten geographical locations spanning different regions of the United States. Nonetheless, the analytical methods, data sources, and data calibration methods employed by others will be used as a basis for validating the consistency of our preliminary study against other studies. The objectives of this study are (1) to determine the patterns of temporal temperature variation using monthly averages, and to compare these patterns to those of the yearly average using the proposed analytical method described in this paper, and (2) to identify possible spatial association patterns revealed by the monthly average temperature data. For the purpose of validating the consistency of our method against others, a global temperature analysis following the framework proposed by Paltridge and Woodruff (Paltridge, 1981) will also be conducted. In this validation process, our analysis will cover the same period as the studies by the E.P.A. and Paltridge.
In our study of the ten selected cities in the U.S., the period of coverage of each location is shown in Table 1. The specific data set used for this study is the monthly average temperature. The size of the data available for each location varies. The longest period covers the years 1747 to 2000 (Boston), while the shortest covers 1950 to 2000 (DC and Chicago). For each of the ten locations, the change point detection analysis is carried out twelve times, once for each month, using all available data. For example, the size of the Jan monthly average temperature data of Boston is 254 (2000 − 1747 + 1). All 254 Jan monthly average temperatures are used for detecting the change points (indexed by year) in Jan. This is then repeated for every month from Feb to Dec, where a new set of 254 data points is used for change point detection. This is then repeated for each of the ten locations. Altogether, the data size has an upper bound of 254 x 12 x 10 = 30,480.

In general, one may be interested in detecting change points of the variance, or of both the mean and the variance, in a Gaussian model. In this paper, we are only concerned with mean change point detection. This is because weather phenomena tend to exhibit ergodicity, which has the statistical characteristic of a common long-term variance. Detecting variance change points is beyond the scope of this paper; readers interested in this topic are referred to publications elsewhere (Chen, 1997).

The Jan monthly average temperature data of Chicago and DC are used to illustrate the process of change point detection. By applying the technique described in problem formulation 1 in section 2, four and eight change points are detected for the data sets of Chicago and DC respectively. These change points fall in the years 1953, 1957, 1960, 1964, 1967, 1972, 1986, 1988, 1989, 1991, and 1994:

[Table: detected Jan change points (marked ↑ or ↓) for Chicago and DC across the years listed above]

There are two interesting observations. First, the change in the Gaussian mean of the monthly average temperature of the DC data set fluctuates on every other change point, while that of the Chicago data set fluctuates in pairs. Second, there is a change point occurring simultaneously in both cities in 1994.

Following the process just described, change point detection according to the Schwarz information criterion using formulas 2 and 3 is carried out for each of the ten cities. For each city, change point detection is carried out using yearly data twelve times --- once for each month. Detected change points are then grouped by seasonal quarters; i.e., winter quarter (Dec - Feb), spring quarter (Mar - May), summer quarter (Jun - Aug), and fall quarter (Sep - Nov). Interesting seasonal trend patterns are summarized in Table 2. The frequency count of the occurrence of each trend is summarized below:

Location   Decrease   No change   Increase
------------------------------------------
CH         8          4           13
DC         27         18          29
DE         37         19          42
FA         51         27          63
HO         36         22          34
KT         26         14          33
BO         107        51          111
SF         79         81          86
SL         46         15          41
SE         27         21          36
------------------------------------------
Total      444        272         488

Remark: In some cases a change point is detected, but the incremental increase/decrease in the mean temperature value is statistically insignificant. In these cases, the change point is marked as "No change."

After the change points are identified, we ask whether any change points from the various locations align in time.
In other words, if there is a mean change in one location, do any other locations also experience a mean change at the same time (by year)? And in particular, are any change points common to at least three different locations? Using the previous formulation, with Oi(t1) representing the monthly average temperature of location i at year t1, there are three possibilities at year t1 for a location i: no change point, a change point with an increased Gaussian mean, or a change point with a decreased Gaussian mean. Since there are ten locations, the number of possible combinations accounting for the existence and type (increase/decrease) of change points is 3¹⁰ = 59049. Obviously the problem would be unmanageable if we attempted to derive a probability model accounting for the occurrences of all joint change points. Instead, we decided to study the temperature change points in 5 groups of 5 locations:

Group 1: CH (Chicago) | DC (Washington DC) | DE (Delaware) | FA (Fargo) | BO (Boston)
Group 2: CH (Chicago) | BO (Boston) | HO (Houston) | KT (Kentucky) | DC (Washington DC)
Group 3: DE (Delaware) | FA (Fargo) | SF (San Francisco) | KT (Kentucky) | SE (Seattle)
Group 4: CH (Chicago) | DE (Delaware) | FA (Fargo) | KT (Kentucky) | SL (St. Louis)
Group 5: HO (Houston) | SF (San Francisco) | KT (Kentucky) | SL (St. Louis) | DC (Washington DC)

Note that the above 5 groups are chosen in such a way that any two groups have at least one common city. In studying each of the five groups, we are interested in any trend patterns of simultaneous change points involving at least three locations. With these patterns, we proceed to the following three tasks:
1. Based on the frequency count information, estimate the conditional probability of simultaneous change points.
2. Based on the conditional probability information, derive an optimal probability model with respect to Shannon entropy.
3. Based on the optimal probability model, identify statistically significant association patterns that characterize the type of changes (increase/decrease) in the Gaussian mean, based on the chi-square test statistic discussed in section 2 (equation 6).

For each study group, we report the number of probability constraints used for model optimization. The entropy of the optimal probability model and the noticeable significant event association patterns are also reported. Noticeable significant event association patterns are defined as the most probable event patterns (ranked within the top six) that also pass the chi-square test according to equation 6 at the 5% significance level. The noticeable significant event association patterns are presented in decreasing order of statistical association significance according to equation 6. The results of each study group are summarized below.
Group 1: Using the notation in (1), the number of probability constraints for model derivation is 4. Entropy of the optimal probability model satisfying the constraints: 7.894 bits. Noticeable significant association patterns and Pr (based on equation 6):
(CH: DC: DE: FA: BO:) Pr = 0.004625
(CH: DC: DE: FA: BO:) Pr = 0.004625

Group 2: Using the notation in (1), the number of probability constraints for model derivation is 11. Entropy of the optimal probability model satisfying the constraints: 1.505 bits. Noticeable significant association patterns and Pr (based on equation 6):
(CH: DC: HO: KT: BO:) Pr = 0.698
(CH: DC: HO: KT: BO:) Pr = 0.086
(CH: DC: HO: KT: BO:) Pr = 0.03
(CH: DC: HO: KT: BO:) Pr = 0.116
(CH: DC: HO: KT: BO:) Pr = 0.052
(CH: DC: HO: KT: BO:) Pr = 0.017

Group 3: Using the notation in (1), the number of probability constraints for model derivation is 9. Entropy of the optimal probability model satisfying the constraints: 0.2567 bits. Noticeable significant association patterns and Pr (based on equation 6):
(DE: FA: SF: KT: SE:) Pr = 0.096
(DE: FA: SF: KT: SE:) Pr = 0.001925
(DE: FA: SF: KT: SE:) Pr = 3x10⁻⁶
(DE: FA: SF: KT: SE:) Pr = 0.038
(DE: FA: SF: KT: SE:) Pr = 9.6x10⁻⁵
(DE: FA: SF: KT: SE:) Pr = 3x10⁻⁶

Group 4: Using the notation in (1), the number of probability constraints for model derivation is 4. Entropy of the optimal probability model satisfying the constraints: 7.866 bits. Noticeable significant association patterns and Pr (based on equation 6):
(CH: DE: FA: KT: SL:) Pr = 0.008218
(CH: DE: FA: KT: SL:) Pr = 0.008218
(CH: DE: FA: KT: SL:) Pr = 0.008319
(CH: DE: FA: KT: SL:) Pr = 0.008319
(CH: DE: FA: KT: SL:) Pr = 0.008319

Group 5: Using the notation in (1), the number of probability constraints for model derivation is 15. Entropy of the optimal probability model satisfying the above constraints: 0.8422 bits. Noticeable significant association patterns and Pr (based on equation 6):
(DC: HO: KT: SF: SL:) Pr = 0.036
(DC: HO: KT: SF: SL:) Pr = 0.035
(DC: HO: KT: SF: SL:) Pr = 0.028
(DC: HO: KT: SF: SL:) Pr = 0.026
(DC: HO: KT: SF: SL:) Pr = 0.004732

Validation: Since this paper emphasizes a novel information-statistical technique for analyzing temporal-spatial data, a fundamental question is whether this technique can be assessed and validated in, at least, the case of temperature analysis. Our approach to assessing and validating the proposed technique focuses on answering two questions. First, does the proposed technique yield consistent conclusions when applied to the task of temperature analysis using a framework conforming to previous temperature studies? Second, does the proposed technique produce new interesting results that are not covered by the previous studies?

In order to determine whether the proposed technique described in this paper yields consistent conclusions in comparison to previous studies, the framework proposed by Paltridge and Woodruff (Paltridge, 1981) for temperature analysis is used. The motivation behind selecting Paltridge's framework is its conformity with the recent follow-up study conducted by the E.P.A. The temperature study conducted by Paltridge and Woodruff focused on global surface temperature change based on yearly average temperature over the period 1880 to 1980.
Their study of global surface temperature change is based on weighted averages over the yearly mean temperature, grouped into three regions by latitude: 30°N-50°N, 10°N-10°S, and 30°S-50°S. Their study also probed temperature change by season. To follow the framework of Paltridge's study, the GHCN data set is used as the data source for assessment and validation. The GHCN data set may be considered an extension of the data sets used by previous studies, since it covers temperature readings from both land-based and sea-based weather stations, and extends beyond the ending year 1977 of Paltridge's study.

In order to conduct a study similar to that of Paltridge and Woodruff on global temperature change based on yearly average temperature, the monthly average temperature data of a weather station for the period 1880 to 2000 are used to derive the yearly average temperature of the location. The derived yearly average temperature is then calibrated according to the altitude of the weather station (roughly 1°F per 1000 feet) so that the yearly average temperature is referenced to the surface temperature at sea level. This is repeated for all (> 800) weather stations. The calibrated yearly average temperature data are then grouped into the three regions noted previously: 30°N-50°N, 10°N-10°S, and 30°S-50°S. For each of the three regions, the mean temperature of a year is then derived by averaging the calibrated yearly mean temperature data of the region. Furthermore, the overall mean temperature of a year over all regions is also derived.

The change point detection described in step 1 of the proposed technique is applied to the overall mean temperature data. The detected change points are shown in Fig. 1. The change point detection technique is applied again to the mean temperature data of each of the three regions; the results are shown in Figures 2 to 4, covering the regions 30°N-50°N, 10°N-10°S, and 30°S-50°S respectively.

It is interesting to note that the steady increase in global surface temperature discovered in Paltridge's study over the period 1880-1980 is captured by the consistent upward trend patterns in Figure 1. Paltridge's study of global surface temperature change over the period 1880-1980, by region, concluded that there is a steady increase in the global temperature in the northern hemisphere (30°N-50°N) and a random fluctuation in the equatorial region (10°N-10°S). The same conclusions can be reached by referring to Fig. 2 and Fig. 3. Our result for the southern hemisphere (30°S-50°S), however, does not agree with Paltridge's study. According to Fig. 4, the surface temperature change fluctuates only slightly, while Paltridge's study found that the surface temperature increased over the period 1880-1980.

It is noteworthy that in each of Figures 1, 3, and 4 there is a significant downward trend pattern in the early 90s, particularly in Fig. 4. This is counter-intuitive, since we did not experience such a significant worldwide cool-down in the early 1990s. The GHCN data sources for the southern hemisphere are mainly from the same weather stations in Australia; however, new data sources from weather stations on the African continent appear to cause the significant downward trend pattern in the early 90s. Hence, this significant downward trend pattern is likely to be an outlier rather than a significant pattern reflecting an actual phenomenon of temperature change.
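Before moving to the discussion, here is a minimal Python sketch of the preprocessing used in this validation: adjusting a station's yearly average temperature to sea level using the rough 1°F-per-1000-feet figure mentioned above, then averaging stations within a latitude band. The station records are hypothetical, not actual GHCN entries.

    import numpy as np

    def to_sea_level_f(temp_f, elevation_ft):
        """Reference a station temperature to sea level (approx. 1 F / 1000 ft)."""
        return temp_f + elevation_ft / 1000.0

    def region_mean(stations, lat_min, lat_max):
        """Mean calibrated yearly temperature over stations in a latitude band."""
        calibrated = [to_sea_level_f(s["temp_f"], s["elev_ft"])
                      for s in stations if lat_min <= s["lat"] <= lat_max]
        return float(np.mean(calibrated)) if calibrated else float("nan")

    stations = [                               # hypothetical station records
        {"lat": 42.4, "elev_ft": 20,   "temp_f": 51.2},
        {"lat": 39.7, "elev_ft": 5280, "temp_f": 50.1},
        {"lat": -1.3, "elev_ft": 5500, "temp_f": 66.0},
    ]
    print("30N-50N mean (F):", region_mean(stations, 30.0, 50.0))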
Discussion: As temperature is generally cyclical by year, we would expect in an ideal case --- without global warming and without "man-made" environmental disturbance --- that there would be no (Gaussian) mean change in the monthly average temperature over the years. However, this is not what appears in our study. On the other hand, if there is a warming phenomenon in the sense that temperature is rising steadily, we would expect to observe a significantly larger number of upward trends (and magnitudes) in the mean temperature in comparison to downward trends (and magnitudes). However, comparing the trend patterns of temperature using monthly averages over the ten cities to those of the yearly global average, the upward and downward trend patterns fluctuate more frequently when monthly average data are used, except for those reported in Table 2.

The mean net temperature change of each city is also calculated and tabulated in the second column (labeled "Net change") of Table 3. The mean net temperature change is calculated by averaging the monthly average temperature over the 12 months, covering the entire period of data available for each city. For comparison purposes, E.P.A. projections of the temperature change based on 50-year data and 100-year data are also tabulated in Table 3. The net changes reported in the second column of Table 3 show that the temperature has risen in every city. In addition, most of them fall into the interval of the E.P.A. projection based on the last 100 years of data. The three exceptions are St. Louis, Houston, and Boston. A further comparison also reveals that the net change in Boston disagrees with the E.P.A. projection based on the recent 50-year data as well. Yet the data set of Boston has the longest spanning period. A question raised, and subject to further future study, is the consistency of the data adjustment and correction over such a long period of time. The E.P.A. results tabulated for validation were obtained from [http://www.epa.gov/globalwarming/climate/trends/temperature.html].

We now ask a similar question about the existence of any localized spatial trend patterns in the temperature data. Examining the significant event association patterns that also appear among the three most probable joint events in each probability model of the five study groups, each study reveals some interesting conclusions. In the first study group, according to the statistical interdependency test, the mean temperature decreases in Chicago and DC do not occur independently, and likewise the mean temperature increases in Delaware and Boston. The consistent pair-wise mean temperature change in Delaware and Boston agrees with our expectations, since they are in relatively close geographical proximity. In the second study group, a similar decrease in the mean temperature is also observed in both Boston and Chicago. An interesting contrast is the fifth study group: it shows that the change in mean temperature moves in opposite directions between two locations --- San Francisco and St. Louis. The third and fourth study groups are perhaps the most interesting. In the third study group, the association patterns including Delaware and Kentucky reveal a decrease in the mean temperature, while in the fourth study group the association patterns including Delaware and Kentucky reveal an increase in the mean temperature. A further study shows that both locations are in close proximity to an isotherm --- a curve of equal temperature across different locations.
4. Conclusion

This paper discusses a treatment of temporal-spatial data based on information-statistical analysis. The analysis consists of three steps. Under the Gaussian and i.i.d. assumptions, the temporal aspect of the data is examined by determining the possible mean change points of the Gaussian model through a statistical hypothesis test using the Schwarz information criterion. Based on the detected change points, we qualify the changes in the mean at the change points and marginalize such frequency information over the temporal domain. The analytical step that follows formulates an optimization problem based on the available frequency information, in an attempt to derive an optimal discrete-valued probability model that captures possible spatial association characteristics of the data. A chi-square hypothesis test is then applied to detect any statistically significant event association patterns.

The proposed analysis approach is applied to temperature analysis. The application of our proposed information-statistical approach to temperature analysis has been successful. It reaches conclusions consistent with those found by others and, in addition, finds new interesting results about seasonal trend patterns of temperature change in individual cities.

Acknowledgement: The authors are indebted to the anonymous reviewers for their insightful comments. Constructive comments from the audience at MLDM2001, where the idea of this paper was presented, are also acknowledged. The manuscript preparation, temperature data repackaging, and web-based data hosting resources are supported in part by NSF DUE CCLI grant #0088778. Professor David Locke of the Biochemistry Department in Queens College/CUNY provided technical proofreading. Professor Locke and Professor Mankiewicz of the School of Earth and Environmental Sciences in Queens College offered many insightful discussions concerning the quality assurance of the GHCN v2 data set, the interpretation of our temperature analysis results, and comparisons of our results to others such as the E.P.A.'s. This paper is dedicated to the people who helped rebuild the City of New York after the crisis of Sept 11, 2001.

References

Barnett, T. P., 1978. Estimating Variability of Surface Air Temperature in the Northern Hemisphere, Monthly Weather Review, 106, 1353-1367.
Borgwardt, K.H., 1987. The Simplex Method, A Probabilistic Analysis, Springer-Verlag, Berlin.
Chen, J. and Gupta, A.K., 1997. Testing and Locating Variance Change Points with Application to Stock Prices, Journal of the American Statistical Association, 92(438), June 1997, 739-747.
Chen, J. and Gupta, A.K., 1998. Information Criterion and Change Point Problem for Regular Models, Technical Report No. 98-05, Department of Math. and Stat., Bowling Green State U., Ohio.
Easterling, David R., Peterson, Thomas C., and Karl, Thomas R., 1996. On the development and use of homogenized climate data sets. Journal of Climate, 9, 1429-1434.
Fisher, R.A., 1924. The conditions under which χ² measures the discrepancy between observation and hypothesis, Journal of the Royal Statistical Society, 87:442-450.
Grant, Eugene L. and Leavenworth, Richard S., 1972. Statistical Quality Control. McGraw-Hill Book Company, New York.
Good, I.J., 1960. Weight of Evidence, Correlation, Explanatory Power, Information, and the Utility of Experiments, Journal of the Royal Statistical Society, Ser. B, 22:319-331.
Goodman, L.A., 1968. The analysis of cross-classified data: Independence, quasi-independence and interactions in contingency tables with and without missing entries, Journal of the American Statistical Association, 63:1091-1131.
Grenander, U., 1993. General Pattern Theory, Oxford University Press, Oxford.
Grenander, U., 1996. Elements of Pattern Theory, The Johns Hopkins University Press, ISBN 0-8018-5187-4.
Gupta, A.K. and Chen, J., 1996. Detecting Changes of Mean in Multidimensional Normal Sequences with Applications to Literature and Geology, Computational Statistics, 11:211-221, Physica-Verlag, Heidelberg.
Karmarkar, N., 1984. A New Polynomial-time Algorithm for Linear Programming, Combinatorica, 4(4):373-395.
Kuenzi, H.P., Tzschach, H.G., and Zehnder, C.A., 1971. Numerical Methods of Mathematical Optimization, Academic Press, New York.
Kullback, S. and Leibler, R., 1951. On Information and Sufficiency, Ann. Math. Statistics, 22:79-86.
Kullback, S., 1959. Information Theory and Statistics, Wiley and Sons, New York.
Mitchell, J. M., Jr., 1963. On the Worldwide Pattern of Secular Temperature Change, Changes of Climate, Arid Zone Research, Vol. 20, UNESCO, Paris, 161-181.
Paltridge, G. and Woodruff, S., 1981. Changes in Global Surface Temperature From 1880 to 1977 Derived From Historical Records of Sea Surface Temperature, Monthly Weather Review, 109, 2427-2434.
Peterson, Thomas C. and Easterling, David R., 1994. Creation of homogeneous composite climatological reference series. International Journal of Climatology, 14, 671-679.
Shannon, C.E. and Weaver, W., 1972. The Mathematical Theory of Communication, University of Illinois Press, Urbana.
Schwarz, G., 1978. Estimating the dimension of a model, Ann. Statist., 6, 461-464.
Sy, B.K., 2001. Probability Model Selection Using Information-Theoretic Optimization Criterion, Journal of Statistical Computation and Simulation, Gordon & Breach Publishing Group, NJ, 69(3).
Vostrikova, L. Ju., 1981. Detecting disorder in multidimensional random processes, Soviet Math. Dokl., 24, 55-59.
(UCAR 1997) "Reports to the Nation: Our Changing Climate," University Corporation for Atmospheric Research (UCAR) and the National Oceanic and Atmospheric Administration (NOAA), UCAR, Boulder, Colorado, 1997, pp. 20.
Wong, A.K.C. and Wang, Y., 1997. High-Order Pattern Discovery from Discrete-Valued Data, IEEE Trans. on Knowledge and Data Engineering, 9(6):877-893.
Wright, S., 1997. Primal-Dual Interior-Point Methods, SIAM, ISBN 0-89871-382-X.
(WR 1998) "Critical Trends: The Global Commons," World Resources 1998-99: A Guide to the Global Environment (Editor Leslie Roberts), Oxford University Press, 1998, ISBN 0-19-521407-2, pp. 170-184.
[www http://www.epa.gov/globalwarming/climate/trends/temperature.html]
[www http://www.ncdc.noaa.gov/wdcamet.html]
[www http://www.ncdc.noaa.gov/ol/climate/research/ghcn/ghcnqc.html]

Dr. B. K. Sy is a Full Professor in the Computer Science Department of Queens College and the University Graduate Center of the City University of New York. He has published funded research supported by federal and private agencies. His research spans multi-disciplinary areas such as information-statistical data mining for census analysis, pattern theory and pattern-based approaches for science learning, earthquake modeling using multi-population theory, and dysarthric speech evaluation using computer-based voice recognizers.
His research group has engaged in various research projects in intelligent information technology and data mining in database systems, with particular emphasis on science education and industrial applications. Dr. Sy received his Ph.D. and M.Sc. in Electrical and Computer Engineering in 1988 from Northeastern University, Boston, Massachusetts.

Dr. A. K. Gupta has been invited to write papers for 31 national and international conferences, symposia, and publications in the past 25 years. Overall he has been invited to present more than 80 talks at various colloquia, universities, and professional meetings, most notably including advanced lectures on statistical methods for the U.S. Air Force. Gupta is an elected fellow of the American Statistical Association, the Institute of Statisticians, and the Royal Statistical Society of England. He has written more than 100 articles, and he has edited, co-edited, or co-authored six books on statistics. In 1990 he received the Olscamp Research Award. Dr. Gupta, who joined Bowling Green State University in 1976, received his doctoral degree from Purdue University, bachelor's and master's degrees from Poona University in India, and a bachelor's degree in statistics from Banaras Hindu University in India.

Table captions:
Table 1: Spanning Period of Coverage of the Ten Locations
Table 2: Seasonal trend patterns of ten locations in the U.S.
Table 3: Result comparison using E.P.A. studies as a reference

Table 1: Spanning Period of Coverage of the Ten Locations

Location        Symbol   Start year   End year   Spanning period
Chicago         CH       1950         2000       51
Washington DC   DC       1950         2000       51
Delaware        DE       1854         2000       147
Fargo           FA       1883         2000       118
Houston         HO       1948         2000       53
Kentucky        KT       1949         2000       52
Boston          BO       1747         2000       254
San Francisco   SF       1853         2000       148
St. Louis       SL       1893         2000       108
Seattle         SE       1947         2000       54

Table 2: Seasonal trend patterns of ten locations in the U.S.

City                 Seasonal quarter    Trend pattern (upward/downward)
(SE) Seattle         Winter              Upward
(SF) San Francisco   Fall and Winter     Upward
(CH) Chicago         All 4 quarters      Upward
(HO) Houston         Winter and Spring   Upward
(KT) Kentucky        Summer              Downward
(KT) Kentucky        Winter              Upward
(DE) Delaware        Winter              Slightly upward
(DC) Washington DC   Summer and Fall     Downward

Table 3: Result comparison using E.P.A. studies as a reference

City symbol         Net change       E.P.A. projection based on    E.P.A. projection based on
(period coverage)   in Celsius (C)   data from 1950-1999 (in C)    last 100-year data (in C)
FA (1875-2000)      3.237            4                             1 to 4
SE (1947-2000)      2.165            3                             1 to 4
SF (1853-2000)      1.276            3                             1 to 4
SL (1893-2000)      1.252            2                             0 to 1
CH (1950-2000)      0.955            2.5                           0 to 1
HO (1948-2000)      0.794            1                             0 to -1
KT (1949-2000)      0.599            2                             0 to 1
DE (1854-2000)      0.598            2                             0 to 1
BO (1747-2000)      0.499            -3                            1 to 4
DC (1950-2000)      0.031            1                             0 to 1

Figure Captions:
Figure 1: World Mean Temperature Change
Figure 2: Region 1 (Latitude 30N-50N) World Mean Temperature Change
Figure 3: Region 2 (Latitude 10S-10N) World Mean Temperature Change
Figure 4: Region 3 (Latitude 30S-50S) World Mean Temperature Change

[Figures 1-4: plots of the relative change in successive mean temperature versus year over 1880-2000, for the world overall and for each of the three latitude regions]