INFORMATION-STATISTICAL APPROACH
FOR TEMPORAL-SPATIAL DATA
WITH APPLICATION
BON K. SY
Queens College/CUNY
Computer Science Department
Flushing NY 11367
U.S.A.
bon@bunny.cs.qc.edu
ARJUN K. GUPTA
Bowling Green State University
Department of Mathematics and Statistics
Bowling Green OH 43403
U.S.A.
gupta@bgnet.bgsu.edu
Abstract
A treatment for temporal-spatial data such as atmospheric temperature using an information-statistical approach is proposed. Conditioning on the specific spatial nature of the data, the temporal aspect of the data is first modeled parametrically as Gaussian, and the Schwarz information criterion is then applied to detect multiple mean change points, and thus the Gaussian statistical models, to account for changes of the mean population over time. To examine the spatial characteristics of the data, successive mean change points are qualified by finite categorical values. The distribution of the finite categorical values is then used to estimate a non-parametric probability model through a non-linear SVD-based optimization approach, where the optimization criterion is Shannon expected entropy. This optimal probability model accounts for the spatial characteristics of the data and is then used to derive spatial association patterns subject to a chi-square hypothesis test. The proposed approach is applied to examine a weather data set obtained from NOAA. Selected temperature data are studied. These data cover different geographical localities in the United States, with some spanning over 200 years. Preliminary results are reported.
Keywords: Temporal-spatial data; Information theory; Schwarz information criterion; Probability model optimization; Statistical association pattern.
1. Introduction
This paper presents a treatment for temporal-spatial data using an information-statistical approach. Let Oi(t1) … Oi(tn) be a sequence of n independent observations made at the ith location, where i = 1..p. The temporal-spatial data analysis to be discussed in this paper can be formulated as a 3-step process:
1. Given a specific location indexed by i, and assuming the observations are Gaussian, the specific task is to detect mean change points in the Gaussian model.
2. Upon detection of mean change points, the specific task is to identify the optimal non-parametric probability model with discrete-valued random variables X1, X2, …, Xp, where each value of Xi accounts for a possible qualitative change between successive mean change points; e.g., {Xi: x1 = increase, x2 = no-change, x3 = decrease}.
3. Upon derivation of the optimal non-parametric probability model, the specific task is to identify statistical association patterns manifested as a p-dimensional vector {vi,j: i = 1..p, j = 1..3}, where vi,j represents the jth value of the random variable Xi.
An example of temporal-spatial data is monthly average temperature data. Consider three distinct city locations in the U.S. (Boston, Denver, and Houston, indexed by 1, 2, 3) and 4 years (or 48 months) of monthly average temperature data. We may represent these temporal-spatial temperature data by Oi(t1) … Oi(tn), where n = 48 and p = 3 (i.e., i = 1..3). Using this temperature data example, the temporal-spatial data analysis to be discussed in this paper attempts to address two questions:
1. Suppose the monthly average temperature data of each city are Gaussian distributed. Are there multiple (Gaussian) mean change points in the monthly average temperature data, and if so, where are they located in the time sequence?
2. If multiple change points exist, are there any significant statistical association patterns that characterize the changes in the mean of the monthly average temperature data of the three cities?
We will now present the problem formulation for each step. The problem
formulation for step 1 is focused on the temporal aspect of the temporal-spatial data. The
problem formulation for step 2 acts as a “bridge” process to shift the focus of the analysis
from the temporal aspect to the spatial aspect. The problem formulation for step 3 is
focused on the spatial aspect of the analysis of temporal-spatial data.
Problem formulation 1 (for step 1):
Let X1(T), X2(T), …, Xp(T) be p time-varying random variables. For some i ∈ {1..p}, let Oi(t1) … Oi(tn) be a sequence of n independent observations of Xi(T). Suppose each observation Oi(tj) is obtained from a normal distribution model with unknown mean μi,j and common variance σ². We would like to test the hypothesis:

H0: μi,1 = … = μi,n = μi (unknown)

versus the alternative:

H1: μi,1 = … = μi,c1 ≠ μi,c1+1 = … = μi,c2 ≠ … ≠ μi,cq+1 = … = μi,n

where 1 ≤ c1 < c2 < … < cq < n.

For a specific i, acceptance of H0 implies that all the observations of Xi(T) belong to a single normal distribution model with mean μi; in other words, Xi(T) can be modeled by a single Gaussian model. Rejection of H0 implies that each observation of Xi(T) belongs to one of q+1 populations, i.e., Xi(T) has to be modeled by q+1 Gaussian models.
Problem formulation 2 (for step 2):
Following the problem formulation for step 1, and assuming H0 is rejected in favor of H1, the change from μi,j to μi,j+1 will be qualified as either an increase or a decrease, where j ∈ {c1, ..., cq}. For any fixed time unit k′ ∈ {1..n}, we abbreviate Xi(k′) as Xi. Note that Xi can assume one of three discrete values according to the following rules:
Xi = ↓ if k′ ∈ {c1, …, cq} and μi,k′ > μi,k′+1 (i.e., decrease)
Xi = → if k′ ∉ {c1, …, cq} (i.e., no change)
Xi = ↑ if k′ ∈ {c1, …, cq} and μi,k′ < μi,k′+1 (i.e., increase)    (1)
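To make rule (1) concrete, the sketch below shows one possible implementation; the function name and data structures are our own illustrative assumptions, not part of the paper's formulation.

```python
# A minimal sketch of rule (1): qualify a time index k as an increase,
# a decrease, or no change, given the detected change points and the
# estimated segment means on either side of each change point.

def qualify(k, change_points, mean_before, mean_after):
    """Return the categorical value of Xi at time index k.

    change_points : set of detected change-point indices {c1, ..., cq}
    mean_before   : dict mapping change point c -> estimated mean mu_{i,c}
    mean_after    : dict mapping change point c -> estimated mean mu_{i,c+1}
    """
    if k not in change_points:
        return "no-change"            # k not in {c1, ..., cq}
    if mean_before[k] > mean_after[k]:
        return "decrease"             # mu_{i,k} > mu_{i,k+1}
    return "increase"                 # mu_{i,k} < mu_{i,k+1}
```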
Given the marginal and joint frequency counts of the possible discrete values of {Xi}, we would like to identify an optimal discrete-valued probability model that maximally preserves the available probability information while minimizing the bias introduced by unknown probability information. The optimization criterion is the Shannon expected entropy, which captures the principle of minimally biased unknown information. We will later show that this problem formulation is indeed an optimization problem with linear constraints and a non-linear objective function.
Problem formulation 3 (for step 3):
Upon the identification of the optimal probability model, we would like to investigate the existence of statistically significant spatial patterns characterized by the joint event X = {Xi : xi ∈ {↑, →, ↓}}, where |X| = p. Specifically, we would like to test the hypothesis:

H0: the {Xi : xi} in X are independent of each other for i = 1..p

versus the alternative:

H1: the {Xi : xi} in X are interdependent for i = 1..p.
2. Information-Statistical Analysis
Problem formulation 1 (for step 1):
Recall the formulation presented in section 1: for some i ∈ {1..p}, let Oi(t1) … Oi(tn) be a sequence of n independent observations of Xi(T). Suppose each observation Oi(tj) is obtained from a normal distribution model with unknown mean μi,j and common variance σ². We would like to test, for each i ∈ {1..p}, the hypothesis:

H0: μi,1 = … = μi,n = μi (unknown)

versus the alternative:

H1: μi,1 = … = μi,c1 ≠ μi,c1+1 = … = μi,c2 ≠ … ≠ μi,cq+1 = … = μi,n

where 1 ≤ c1 < c2 < … < cq < n.
The statistical hypothesis test shown above compares the null hypothesis, under which there is no change in the mean, against the alternative hypothesis, under which there are q changes at the instants c1, c2, …, cq. To determine whether there are multiple change points for the mean, the Schwarz information criterion (SIC) along with the binary segmentation technique (Vostrikova, 1981) is employed.

Our choice of the Schwarz information criterion for change point detection is not arbitrary. SIC can be considered a variant of the Akaike information criterion (AIC) with a penalty factor. When AIC is used as a measure of model evaluation, the maximum AIC estimate (MAICE) is most appropriate. However, MAICE is not an asymptotically consistent estimate of model order (Schwarz, 1978). In contrast, SIC gives a consistent estimator of the true model. In addition, the test statistic of SIC (Gupta, 1996) has a chi-square limiting distribution. The estimator of the change point is found to be consistent, with a convergence rate identical to that of likelihood-based methods. Furthermore, the method based on the information criterion, as opposed to likelihood methods, has better power in detecting changes occurring in the middle of the process, and competitive power performance in the other cases.
The Schwarz information criterion has the form SIC = −2 log L(θ̂) + r log n, where L(θ̂) is the maximized likelihood function of the model, r is the number of free parameters in the model, and n is the sample size. In this setting we have one and q+1 models corresponding to the null and the alternative hypotheses, respectively. The decision to accept H0 or H1 is made based on the principle of minimum information criterion. That is, we do not reject H0 if

SIC(n) ≤ min{SIC(k): m ≤ k ≤ n − m} (where m = 1 in this case for the univariate model)

and we reject H0 if SIC(n) > SIC(k) for some k, estimating the position of the change point k0 by k̂ such that

SIC(k̂) = min{SIC(k): m ≤ k ≤ n − m}.
For detecting multiple change points, the binary segmentation technique proposed by Vostrikova is employed; further details can be found elsewhere (Gupta, 1996; Chen, 1997). The binary segmentation is repeated for every value of i ∈ {1..p}.

Under H0 and a given i, SIC(n) = −2 log L(θ̂n) + r log n. For the Gaussian model:

SIC(n) = n log 2π + n + n log σ̂i² + 2 log n    (2)

where σ̂i² = (1/n) Σj=1..n (Oi(tj) − μ̂i)² and μ̂i = (1/n) Σj=1..n Oi(tj).

Under H1, SIC(k) = −2 log L(θ̂k) + r log n. For the Gaussian model:

SIC(k) = n log 2π + n + n log σ̂i′² + 3 log n    (3)

where

σ̂i′² = (1/n) [ Σj=1..k (Oi(tj) − μ̂i′)² + Σj=k+1..n (Oi(tj) − μ̂′n−k)² ],
μ̂i′ = (1/k) Σj=1..k Oi(tj), and
μ̂′n−k = (1/(n−k)) Σj=k+1..n Oi(tj).
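To make the detection procedure concrete, the following sketch (our illustration, not the authors' implementation) computes SIC(n) and SIC(k) according to equations (2) and (3) and applies Vostrikova's binary segmentation recursively. The min_seg guard, which keeps each candidate segment at two or more points so the variance estimates stay non-degenerate, is an assumption on our part (the text uses m = 1).

```python
import numpy as np

def sic_no_change(x):
    """SIC under H0, eq. (2): n log 2*pi + n + n log(var) + 2 log n."""
    n = len(x)
    var = np.mean((x - x.mean()) ** 2)
    return n * np.log(2 * np.pi) + n + n * np.log(var) + 2 * np.log(n)

def sic_change_at(x, k):
    """SIC under H1 with a single change at k, eq. (3); penalty 3 log n."""
    n = len(x)
    left, right = x[:k], x[k:]
    var = (np.sum((left - left.mean()) ** 2)
           + np.sum((right - right.mean()) ** 2)) / n
    return n * np.log(2 * np.pi) + n + n * np.log(var) + 3 * np.log(n)

def detect_change_points(x, offset=0, min_seg=2):
    """Binary segmentation (Vostrikova, 1981): locate the best single
    change point by minimum SIC; if it beats SIC(n), accept it and
    recurse on the two sub-series."""
    n = len(x)
    if n < 2 * min_seg:
        return []
    candidates = range(min_seg, n - min_seg + 1)
    k_hat = min(candidates, key=lambda k: sic_change_at(x, k))
    if sic_no_change(x) <= sic_change_at(x, k_hat):
        return []                                  # do not reject H0
    return (detect_change_points(x[:k_hat], offset, min_seg)
            + [offset + k_hat]
            + detect_change_points(x[k_hat:], offset + k_hat, min_seg))
```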
Problem formulation 2 (for step 2):
When change points are detected in step 1, each change point partitions the temporal-spatial data set into two sub-populations. Each population mean can be estimated following a procedure similar to that described in step 1. The change in the mean between two (time-wise) adjacent sub-populations can be qualified using one of three possible categorical values: increase, same, or decrease. Since each change point has a corresponding time index, not only can the marginal frequency information of the corresponding “spatial-specific” random variable be derived, but the joint frequency information relating multiple variables can also be derived by alignment through a common time index.
Consider the following snapshot of the monthly average temperature data of January and February, from 1988 to 1992, for five cities: Chicago (CH), Washington D.C. (DC), Houston (HO), San Francisco (SF), and Boston (BO):

[Table: a 5 × 5 grid of change-point symbols, one of ↑, ↓, or →, for the cities CH, DC, HO, SF, and BO over the years 1988-1992; symbols illegible in the source are written as “·” in what follows. Per the discussion below, Chicago has ↑ in 1989 and ↓ in 1991.]
In the above table, “↑” or “↓” marks the location of a change point, and “→” indicates that no change point was detected. For example, two change points are detected in Chicago: 1989 and 1991. These two change points partition the monthly average temperature data into three sub-populations during the period 1988 to 1992: one sub-population prior to 1989, one between 1989 and 1991, and one after 1991. The “↑” in 1989 indicates that the Gaussian mean of the model for Chicago covering the period prior to 1989 is smaller than the Gaussian mean of the model covering the period between 1989 and 1991. Similarly, the “↓” in 1991 indicates that the Gaussian mean of the model for Chicago covering the period between 1989 and 1991 is greater than that of the period after 1991.
With this conception, each city can be perceived as a discrete-valued random variable, and the frequency count information reflecting the change points indicated by “↑” and “↓” may be used to derive the corresponding probability distribution. For example,

Pr(BO:↑) = Σ_CH,DC,HO,SF Pr(CH, DC, HO, SF, BO:↑) = 1/5
Pr(BO:↓) = Σ_CH,DC,HO,SF Pr(CH, DC, HO, SF, BO:↓) = 1/5
Pr(DC:·) = Pr(CH:↑) = Pr(CH:↓) = Pr(HO:·) = Pr(SF:·) = 1/5
Pr(HO:·, SF:· | BO:·) = 1
Σ_CH,DC,HO,SF,BO Pr(CH, DC, HO, SF, BO) = 1
In this example, the probability model consists of 3⁵ = 243 joint probability terms Pr(CH, DC, HO, SF, BO). In theory, it is possible to have up to 243 linearly independent constraints for the existence of a valid probability model. In practice, however, we may care only about the constraints that carry statistically significant information. The example above shows a case of nine linear probability constraints. Note that one could also observe the constraint Pr(DC:·|BO:·) = 1 in the table, but it is intentionally omitted from the example. Given these probability constraints, we would like to derive an optimal probability model subject to

Max[ −Σ_CH,DC,HO,SF,BO Pr(CH, DC, HO, SF, BO) log Pr(CH, DC, HO, SF, BO) ].
As just shown, identifying an optimal probability model based on the marginal and joint frequency information of discrete random variables is an optimization problem. Specifically, the optimization problem consists of a set of linear probability constraints and a non-linear objective function arising from the Shannon entropy optimization criterion.

In the operations research community, techniques for solving various optimization problems have been discussed extensively. The Simplex and Karmarkar algorithms (Borgwardt, 1987; Karmarkar, 1984) are two commonly used methods that are robust for solving many linear optimization problems. Wright (1997) has written an excellent textbook on the primal-dual formulation of the interior point method, with different variants of search methods for solving non-linear optimization problems. As discussed in Wright's book, the primal-dual interior point method is robust in searching for optimal solutions to problems that satisfy the KKT (Karush-Kuhn-Tucker) conditions with a second-order objective function.
In this research, we adopt an optimization approach that follows the spirit of the primal-dual interior point method. This approach deviates from the traditional search towards an optimal solution in that it integrates two approaches for solving the algebraic system of linear probability constraints; namely, the Kuenzi, Tzschach, and Zehnder approach (Kuenzi et al., 1971) and the singular value decomposition (SVD) algorithm (Wright, 1997). Further theoretical details are given in a report elsewhere (Sy, 2001). The solution of the example optimization problem just discussed is shown below:
Pr(CH:·, DC:·, HO:·, SF:·, BO:·) = 0.2
Pr(CH:·, DC:·, HO:·, SF:·, BO:·) = 0.4
Pr(CH:·, DC:·, HO:·, SF:·, BO:·) = 0.2
Pr(CH:·, DC:·, HO:·, SF:·, BO:·) = 0.2
Pr(CH, DC, HO, SF, BO) = 0 for the remaining joint probability terms.
The entropy of the optimal model is Max[ −Σ_CH,DC,HO,SF,BO Pr(CH, DC, HO, SF, BO) log Pr(CH, DC, HO, SF, BO) ] = 1.9219 bits.
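As an illustration of this optimization step, the sketch below maximizes Shannon entropy subject to linear probability constraints of the form A p = b. The paper solves the problem with an SVD-based primal-dual interior-point scheme (Sy, 2001); the generic SLSQP solver used here is merely a stand-in to make the formulation concrete, and the small four-term model is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_model(A, b):
    """Maximize Shannon entropy (in bits) subject to A p = b, 0 <= p <= 1."""
    m = A.shape[1]                       # number of joint probability terms

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)      # guard against log(0)
        return np.sum(p * np.log2(p))   # minimizing -H(p)

    cons = [{"type": "eq", "fun": lambda p: A @ p - b}]
    res = minimize(neg_entropy, np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m, constraints=cons)
    return res.x, -res.fun

# Hypothetical example: 4 joint terms, normalization plus one marginal.
A = np.array([[1.0, 1.0, 1.0, 1.0],     # sum of all joint terms = 1
              [1.0, 1.0, 0.0, 0.0]])    # Pr(X1 = increase) = 0.4, say
b = np.array([1.0, 0.4])
p, H = max_entropy_model(A, b)
print(np.round(p, 3), round(H, 4))      # -> [0.2 0.2 0.3 0.3] and 1.971 bits
```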
In the above example, one may wonder why we do not simply use the frequency count information of all variables to derive the desired probability model. There are several reasons, owing to the limitations and nature of real-world problems. Using the temperature data example, the weather station of each city is uniquely characterized by factors such as the elevation of the station, its operational hours and period (since inception), the specific adjacent stations used for data cross-validation, and the calibration for precision and accuracy correction. In particular, the size of the sample temperature data need not be identical across all weather stations. Nonetheless, the location of change points depends on each individual marginal population, and the observation of the conditional occurrence of change points for data with different spatial characteristic values (locations) is still valid.

In other words, temporal-spatial data may originate from different sources. Information from different sources does not have to be consistent, and may even at times be contradictory. However, each source may provide some, but not all, information that reaches general consensus, and collectively the sources may reveal additional information not covered by any individual one.
Problem formulation 3 (for step 3):
Different aspects of the concept of patterns have been discussed extensively in a number of publications by Grenander (Grenander, 1993, 1996). One interesting aspect noted by the first author of this paper is the possibility of interpreting joint events of discrete random variables that survive a statistical hypothesis test of interdependency as statistically significant association patterns. In doing so, significant previous work (Kullback, 1951, 1959; Good, 1960; Chen, 1998) may be used to provide a unified framework linking information theory with statistical analysis. The significance of such a linkage is that it provides a basis not only for using statistical approaches to reveal hidden significant association patterns, but also for using information theory (Shannon, 1972) as a measurement instrument to determine the quality of information obtained from statistical analysis.
The purpose of deriving an optimal probability model in step 2 is to provide a
basis for uncovering statistically significant spatial patterns. Our approach is to identify
statistically significant patterns based on event associations. Significant event
associations may be determined by statistical hypothesis testing based on mutual
information measure or residual analysis (Kullback, 1959; Fisher, 1924).
Following step 2 and using the formulation discussed earlier, let X1 and X2 be two random variables with {x11 ... x1z} and {x21 ... x2m} as the corresponding sets of possible values. The expected mutual information measure of X1 and X2 is defined as

I(X1; X2) = Σi,j Pr(x1i, x2j) log2 [ Pr(x1i, x2j) / (Pr(x1i)Pr(x2j)) ].

Similarly, the expected mutual information measure of the interdependence among the multiple variables X1 … Xp is

I(X1; …; Xp) = Σi=1..z … Σj=1..m Pr(x1i … xpj) log2 [ Pr(x1i … xpj) / (Pr(x1i) … Pr(xpj)) ]    (4)
Note that the expected mutual information measure is zero if the variables are independent of each other. Since the mutual information measure is asymptotically distributed as chi-square (Kullback, 1959; Goodman, 1968), statistical inference can be applied to test the null hypothesis, under which the variables are independent of each other, against the alternative hypothesis, under which the variables are interdependent. Specifically, the null hypothesis is rejected if I(X1; …; Xp) ≥ χ²/2N, where N is the size of the data set and χ² is the chi-square test statistic. The χ² test statistic, due to Pearson, can be expressed as below:

χ² = Σi=1..z … Σj=1..m (o1i,…,pj − e1i,…,pj)² / e1i,…,pj    (5)
In the above equation, the χ² test statistic has (|X1| − 1)(|X2| − 1)…(|Xp| − 1) degrees of freedom, where |Xi| is the number of possible value instantiations of Xi. Here o1i,…,pj represents the observed count of the joint event (X1 = x1i, …, Xp = xpj) and e1i,…,pj represents the expected count, computed from the hypothesized distribution under the assumption that X1, X2, ..., Xp are independent of each other.
The chi-square test statistic and mutual information measure just shown can be further extended to measure the degree of statistical association at the event level. That is, the significance of the statistical association of an event pattern E involving multiple variables can be measured using the test statistic

χE² = (o1i,…,pj − e1i,…,pj)² / e1i,…,pj    (6)

while the mutual information analysis of an event pattern is represented by log2 [ Pr(x1i … xpj) / (Pr(x1i) … Pr(xpj)) ].
As suggested elsewhere (Wong, 1997), the chi-square test statistic of an event pattern may be normally distributed. In such a case, one can perform a statistical hypothesis test to determine whether an event pattern E bears a significant statistical association. Specifically, the hypothesis test can be formulated as below:

Null hypothesis H0: E is not a significant event pattern when χE² < 1.96, where 1.96 corresponds to the 5% significance level of the normal distribution.

Alternative hypothesis H1: E is a significant event pattern otherwise.
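To make the pattern tests concrete, the sketch below (our illustration, with hypothetical numbers) computes the expected mutual information of equation (4) and the per-event statistic of equation (6) from a joint probability table.

```python
import numpy as np
from itertools import product

def _marginals(joint):
    """One 1-D marginal distribution per variable (axis) of `joint`."""
    return [joint.sum(axis=tuple(a for a in range(joint.ndim) if a != d))
            for d in range(joint.ndim)]

def mutual_information(joint):
    """I(X1; ...; Xp) in bits, eq. (4); zero iff the variables are independent."""
    marg = _marginals(joint)
    total = 0.0
    for idx in product(*map(range, joint.shape)):
        p = joint[idx]
        if p > 0:
            q = np.prod([marg[d][idx[d]] for d in range(joint.ndim)])
            total += p * np.log2(p / q)
    return total

def event_chi2(joint, idx, N):
    """Per-event statistic, eq. (6): (o - e)^2 / e for the joint event idx,
    with the expected count computed under the independence assumption."""
    marg = _marginals(joint)
    o = N * joint[idx]                                     # observed count
    e = N * np.prod([marg[d][idx[d]] for d in range(joint.ndim)])
    return (o - e) ** 2 / e

# Hypothetical two-variable joint model estimated from N = 100 observations:
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])
print(mutual_information(joint))          # > 0: the variables are interdependent
print(event_chi2(joint, (1, 1), N=100))   # test the joint event (X1=1, X2=1)
```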
3. Temperature Analysis Application
The information-statistical approach discussed in this paper has been applied to
analyze temperature data. The temperature data source is the GHCN (Global Historical
Climatology Network) data set obtained from the National Oceanic and Atmospheric
Administration (NOAA) [www http://www.ncdc.noaa.gov/wdcamet.html]. This data set
consists of data collected from 30 sources, accounting for approximately 800 weather
stations throughout the world.
There are two versions of the GHCN data set. This study uses the second version (GHCN v2), which incorporates a comprehensive set of quality assurance procedures. During the process of compiling GHCN v2, homogeneity testing and adjusting techniques were developed (Peterson, 1994; Easterling et al., 1996). This is important because certain temperature readings date back to the pre-20th century; e.g., temperature readings for the city of Boston date back to 1747. The original form of these temperature readings is not electronic, and digitization of the readings was required. As a consequence, data entry during the digitization process may introduce human errors and outliers. Although temperature data are nowadays collected automatically by modern computerized instrumentation, combining the old and new data sets requires quality assurance addressing issues such as outlier identification and noise filtering, homogeneity testing, and data adjustment. Homogeneity data adjustment concerns the corrections needed to make historical data, such as those from the 1800s, equivalent to data produced by 20th-century siting practices. Readers interested in further details, such as how, when, where, and by whom the data were collected, are referred to resources elsewhere (Peterson, 1994; Easterling et al., 1996; Grant, 1972), [www http://www.ncdc.noaa.gov/ol/climate/research/ghcn/ghcnqc.html].
The temporal distribution of temperature and its variation throughout the year
depends primarily on the amount of radiant energy received from the sun. The spatial
distribution of temperature data depends on geographical regions in terms of latitude and
longitude, as well as possible modification by locations of continents and oceans,
prevailing winds, oceanic circulation, topography, and other factors. Furthermore, spatial
characteristics such as elevation also play a role in temperature changes.
We note that there have already been extensive studies of global climate based on yearly average temperature (Barnett, 1978; Paltridge, 1981; Mitchell, 1963; UCAR, 1997; WR, 1998). The global temperature analysis conducted by Paltridge and Woodruff (Paltridge, 1981) is one of the most comprehensive studies, covering the period 1880 to 1980. A more recent follow-up extending the study from 1980 onwards can be found on the web site of the U.S. Environmental Protection Agency (E.P.A.) [www http://www.epa.gov/globalwarming/climate/trends/temperature.html].
In this preliminary study, we are not interested in duplicating previous studies analyzing global climate using yearly average temperature. Rather, our focus is on analyzing monthly average temperature, specifically at ten geographical locations spanning different regions of the United States. Nonetheless, the analytical methods, data sources, and data calibration methods employed by others will be used as a basis for validating the consistency of our preliminary study against other studies.
Ten geographical locations spanning different regions of the United States were selected for this study. The objectives of this study are (1) to determine the patterns of temporal temperature variation using monthly averages, and then to compare these patterns to those of the yearly average using the analytical method described in this paper, and (2) to identify possible spatial association patterns revealed by the monthly average temperature data. For the purpose of validating the consistency of our method against others, a global temperature analysis following the framework proposed by Paltridge and Woodruff (Paltridge, 1981) will also be conducted. In this validation process, our analysis will cover the same period as the studies by the E.P.A. and Paltridge.
In our study of the ten selected cities in the U.S., the period covered for each location is shown in Table 1. The specific data set used for this study is the monthly average temperature. The size of the data available for each location varies. The longest period covers the years 1747 to 2000 (Boston), while the shortest covers 1950 to 2000 (DC and Chicago).
In each of the ten locations, the change point detection analyses are carried out twelve times, once for each month, using all available data. For example, the size of the January monthly average temperature data of Boston is 254 (2000 − 1747 + 1). All 254 January monthly average temperatures are used for detecting the change points (indexed by year) in January. This is then repeated for every month from February to December, where a new set of 254 data points is used each time. This is then repeated for each of the ten locations. Altogether, the data size has an upper bound of 254 × 12 × 10 = 30480 data points.
In general, one may be interested in detecting change points of the variance, or of both the mean and the variance, in a Gaussian model. In this paper, we are concerned only with mean change point detection, because weather phenomena tend to be ergodic, with the statistical characteristic of a common long-term variance. Detecting variance change points is beyond the scope of this paper; interested readers are referred to publications elsewhere (Chen, 1997).
The January monthly average temperature data of Chicago and DC are used to illustrate the process of change point detection. By applying the technique described in problem formulation 1 of section 2, four and eight change points are detected for the data sets of Chicago and DC, respectively. These change points are shown below:

[Table: January change-point symbols (↑/↓) for Chicago and DC, drawn from the years 1953, 1957, 1960, 1964, 1967, 1972, 1986, 1988, 1989, 1991, and 1994; the individual symbols are illegible in the source.]
There are two interesting observations. First, the change in the Gaussian mean of the monthly average temperature of the DC data set fluctuates at every other change point, while that of the Chicago data set fluctuates in pairs. In addition, a change point occurs simultaneously in both cities in 1994.
Following the process just described, change point detection according to the Schwarz information criterion, using formulas (2) and (3), is carried out for each of the ten cities. For each city, change point detection is carried out twelve times using the yearly data, once for each month. Detected change points are then grouped by seasonal quarter; i.e., the winter quarter (Dec – Feb), spring quarter (Mar – May), summer quarter (Jun – Aug), and fall quarter (Sep – Nov). Interesting seasonal trend patterns are summarized in Table 2. The frequency count of the occurrence of each trend is summarized below:
Location    Decrease    No change    Increase
---------------------------------------------
CH          8           4            13
DC          27          18           29
DE          37          19           42
FA          51          27           63
HO          36          22           34
KT          26          14           33
BO          107         51           111
SF          79          81           86
SL          46          15           41
SE          27          21           36
---------------------------------------------
Total       444         272          488
Remark: There are cases where a change point is detected, but the incremental increase/decrease in the mean temperature value is statistically insignificant. For these cases, the change point is marked as “No change.”
After the change points are identified, we ask whether any change points from the various locations align in time. In other words, if there is a mean change in one location, do any other locations experience a mean change at the same time (by year)? And in particular, are any change points common to at least three different locations?
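As an illustration of this alignment question, the sketch below (with hypothetical change-point data and our own function names) groups detected change points by year and reports the years at which at least three locations change simultaneously.

```python
from collections import defaultdict

def simultaneous_changes(changes_by_location, k=3):
    """changes_by_location: dict location -> {year: 'increase'/'decrease'}.
    Returns the years at which at least k locations have a change point."""
    by_year = defaultdict(dict)
    for loc, changes in changes_by_location.items():
        for year, direction in changes.items():
            by_year[year][loc] = direction
    return {year: locs for year, locs in by_year.items() if len(locs) >= k}

# Hypothetical change points for three of the ten locations:
changes = {
    "CH": {1989: "increase", 1991: "decrease", 1994: "decrease"},
    "DC": {1994: "decrease"},
    "BO": {1994: "increase"},
}
print(simultaneous_changes(changes, k=3))   # -> only 1994 is shared by all three
```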
Using the previous formulation, with Oi(t1) representing the monthly average temperature of location i at year t1, there are three possibilities: at year t1, location i may have no change point, a change point with an increased Gaussian mean, or a change point with a decreased Gaussian mean. Since there are ten locations, the number of possible combinations accounting for the existence and type (increase/decrease) of change points is 3¹⁰ = 59049. Obviously the problem becomes unmanageable if we attempt to derive a probability model accounting for the occurrences of all joint change points. Instead, we decided to study the temperature change points in five groups of five locations:
Group 1: CH (Chicago) | DC (Washington DC) | DE (Delaware) | FA (Fargo) | BO (Boston)
Group 2: CH (Chicago) | BO (Boston) | HO (Houston) | KT (Kentucky) | DC (Washington DC)
Group 3: DE (Delaware) | SF (San Francisco) | FA (Fargo) | KT (Kentucky) | SE (Seattle)
Group 4: CH (Chicago) | DE (Delaware) | FA (Fargo) | KT (Kentucky) | SL (St. Louis)
Group 5: HO (Houston) | SF (San Francisco) | KT (Kentucky) | SL (St. Louis) | DC (Washington DC)
Note that the above five groups are chosen in such a way that any two groups have at least one common city. In studying each of the five groups, we are interested in any trend patterns of simultaneous change points across at least three locations. With these patterns, we proceed to the following three tasks (a sketch combining them appears after this list):
1. Based on the frequency count information, estimate the conditional probability of simultaneous change points.
2. Based on the conditional probability information, derive an optimal probability model with respect to Shannon entropy.
3. Based on the optimal probability model, identify statistically significant association patterns that characterize the type of changes (increase/decrease) in the Gaussian mean, based on the chi-square test statistic discussed in section 2 (equation 6).
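The toy sketch below strings the three tasks together, reusing the illustrative max_entropy_model and event_chi2 helpers from the earlier sketches. For readability, a two-city group with 3 × 3 = 9 joint terms stands in for a five-city group with 3⁵ = 243 terms, and every number is a hypothetical stand-in.

```python
import numpy as np

# Task 1: frequency counts -> linear probability constraints A p = b
# over the 9 joint terms of a hypothetical two-city group.
A = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1],   # normalization: sum p = 1
              [1, 1, 1, 0, 0, 0, 0, 0, 0],   # Pr(city1 = increase) = 0.4
              [1, 0, 0, 1, 0, 0, 1, 0, 0],   # Pr(city2 = increase) = 0.2
              [1, 0, 0, 0, 0, 0, 0, 0, 0]],  # Pr(both increase) = 0.15
             dtype=float)
b = np.array([1.0, 0.4, 0.2, 0.15])

# Task 2: maximum-entropy probability model subject to the constraints.
p, H = max_entropy_model(A, b)
joint = p.reshape(3, 3)                      # 3 categorical values per city
print(f"entropy of the optimal model: {H:.4f} bits")

# Task 3: rank joint events by probability and keep those whose per-event
# chi-square statistic, eq. (6), reaches the 1.96 threshold.
N = 50                                       # hypothetical sample size
patterns = [(idx, joint[idx]) for idx in np.ndindex(3, 3)
            if event_chi2(joint, idx, N) >= 1.96]
print(sorted(patterns, key=lambda t: -t[1])[:6])
```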
For each study group, we report the number of probability constraints used for model optimization. The entropy of the optimal probability model and the noticeable significant event association patterns are also reported. Noticeable significant event association patterns are defined as the most probable event patterns (ranked within the top six) that also pass the chi-square test of equation 6 at the 5% significance level. The noticeable significant event association patterns are presented in decreasing order of statistical association significance according to equation 6. The results for each study group are summarized below.
Group 1:
• Using the notation in (1), number of probability constraints for model derivation: 4
• Entropy of the optimal probability model satisfying the constraints: 7.894 bits
• Noticeable significant association patterns and Pr (based on equation 6):
(CH:· DC:· DE:· FA:· BO:·)    Pr = 0.004625
(CH:· DC:· DE:· FA:· BO:·)    Pr = 0.004625
Group 2:
• Using the notation in (1), number of probability constraints for model derivation: 11
• Entropy of the optimal probability model satisfying the constraints: 1.505 bits
• Noticeable significant association patterns and Pr (based on equation 6):
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.698
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.086
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.03
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.116
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.052
(CH:· DC:· HO:· KT:· BO:·)    Pr = 0.017
Group 3:
• Using the notation in (1), number of probability constraints for model derivation: 9
• Entropy of the optimal probability model satisfying the constraints: 0.2567 bits
• Noticeable significant association patterns and Pr (based on equation 6):
(DE:· FA:· SF:· KT:· SE:·)    Pr = 0.096
(DE:· FA:· SF:· KT:· SE:·)    Pr = 0.001925
(DE:· FA:· SF:· KT:· SE:·)    Pr = 3×10⁻⁶
(DE:· FA:· SF:· KT:· SE:·)    Pr = 0.038
(DE:· FA:· SF:· KT:· SE:·)    Pr = 9.6×10⁻⁵
(DE:· FA:· SF:· KT:· SE:·)    Pr = 3×10⁻⁶
Group 4:
• Using the notation in (1), number of probability constraints for model derivation: 4
• Entropy of the optimal probability model satisfying the constraints: 7.866 bits
• Noticeable significant association patterns and Pr (based on equation 6):
(CH:· DE:· FA:· KT:· SL:·)    Pr = 0.008218
(CH:· DE:· FA:· KT:· SL:·)    Pr = 0.008218
(CH:· DE:· FA:· KT:· SL:·)    Pr = 0.008319
(CH:· DE:· FA:· KT:· SL:·)    Pr = 0.008319
(CH:· DE:· FA:· KT:· SL:·)    Pr = 0.008319
Group 5:
• Using the notation in (1), number of probability constraints for model derivation: 15
• Entropy of the optimal probability model satisfying the constraints: 0.8422 bits
• Noticeable significant association patterns and Pr (based on equation 6):
(DC:· HO:· KT:· SF:· SL:·)    Pr = 0.036
(DC:· HO:· KT:· SF:· SL:·)    Pr = 0.035
(DC:· HO:· KT:· SF:· SL:·)    Pr = 0.028
(DC:· HO:· KT:· SF:· SL:·)    Pr = 0.026
(DC:· HO:· KT:· SF:· SL:·)    Pr = 0.004732
Validation:
Since this paper emphasizes a novel information-statistical technique for analyzing temporal-spatial data, a fundamental question is whether this technique can be assessed and validated, at least in the case of temperature analysis.
Our approach to assessing and validating the proposed technique focuses on answering two questions. First, does the proposed technique yield consistent conclusions when applied to the task of temperature analysis using a framework conforming to previous temperature studies? Second, does the proposed technique produce interesting new results that are not covered by the previous studies?
To determine whether the proposed technique yields conclusions consistent with previous studies, the framework proposed by Paltridge and Woodruff (Paltridge, 1981) for temperature analysis is used. The motivation behind selecting Paltridge's framework is its conformity to the recent follow-up study conducted by the E.P.A.
The temperature study conducted by Paltridge and Woodruff focused on global surface temperature change based on the yearly average temperature over the period 1880 to 1980. Their study of the global surface temperature change is based on weighted averages over the yearly mean temperature, grouped into three regions by latitude: 30°N–50°N, 10°N–10°S, and 30°S–50°S. Their study also probed temperature change by season.
To follow the framework of Paltridge's study, the GHCN data set is used as the data source for assessment and validation. The GHCN data set may be considered an extension of the data sets used by previous studies, since it covers temperature readings from both land-based and sea-based weather stations and extends beyond the ending period (1977) of Paltridge's study.
To conduct a study similar to Paltridge and Woodruff's study of global temperature change based on yearly average temperature, the monthly average temperature data of a weather station for the period 1880 to 2000 are used to derive the yearly average temperature of the location. The derived yearly average temperature is then calibrated according to the altitude of the weather station (roughly 1°F per 1000 feet) so that the yearly average temperature is referenced to the surface temperature at sea level. This is repeated for all (> 800) weather stations. The calibrated yearly average temperature data are then grouped into the same three regions as before: 30°N–50°N, 10°N–10°S, and 30°S–50°S. For each of the three regions, the mean temperature of a year is derived by averaging the calibrated yearly mean temperature data of the region. Furthermore, the overall mean temperature of a year over all regions is also derived.
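A sketch of this preprocessing under the stated assumptions (a yearly average from 12 monthly values, roughly 1°F ≈ 0.56°C per 1000 feet of elevation, and the three latitude bands) is given below; the station record format is our own assumption.

```python
import numpy as np

LAPSE_C_PER_FOOT = (5.0 / 9.0) / 1000.0   # ~1 F per 1000 ft, in Celsius

def band(latitude):
    """Map a station latitude to one of the three regions, else None."""
    if 30.0 <= latitude <= 50.0:
        return "30N-50N"
    if -10.0 <= latitude <= 10.0:
        return "10N-10S"
    if -50.0 <= latitude <= -30.0:
        return "30S-50S"
    return None

def regional_means(stations):
    """stations: iterable of dicts with keys 'lat', 'elev_ft', and
    'monthly' (12 monthly average temperatures in Celsius).
    Returns region -> mean sea-level-calibrated yearly temperature."""
    sums, counts = {}, {}
    for s in stations:
        region = band(s["lat"])
        if region is None:
            continue                      # outside the three bands
        yearly = float(np.mean(s["monthly"]))
        # A station above sea level reads cooler; add the lapse back.
        calibrated = yearly + s["elev_ft"] * LAPSE_C_PER_FOOT
        sums[region] = sums.get(region, 0.0) + calibrated
        counts[region] = counts.get(region, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}
```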
The change point detection (described in step 1 of the proposed technique) is applied to the overall mean temperature data; the detected change points are shown in Fig. 1. The change point detection technique is applied again to the mean temperature data of each of the three regions; the results are shown in Figures 2 to 4, which cover the regions 30°N–50°N, 10°N–10°S, and 30°S–50°S, respectively.
It is interesting to note that the steady increase in global surface temperature discovered in Paltridge's study over the period 1880-1980 is captured in the consistent upward trend patterns in Figure 1. Paltridge's study of the global surface temperature change over the period 1880-1980, by region, concluded that there is a steady increase in the temperature of the northern hemisphere (30°N–50°N) and a random fluctuation in the equatorial region (10°N–10°S). The same conclusions can be reached by referring to Fig. 2 and Fig. 3. Our result for the southern hemisphere (30°S–50°S), however, does not agree with Paltridge's study. According to Fig. 4, the surface temperature fluctuates only slightly, while Paltridge's study found that the surface temperature increased over the period 1880-1980.
It is noteworthy that in each of Figures 1, 3, and 4 there is a significant downward trend pattern in the early 1990s, particularly in Fig. 4. This is counter-intuitive, since no such significant worldwide cool-down was actually experienced in the early 1990s. The GHCN data sources for the southern hemisphere are mainly the same weather stations in Australia, but new data sources from weather stations on the African continent cause the significant downward trend pattern in the early 1990s. Hence, this significant downward trend pattern is likely an outlier rather than a significant pattern reflecting an actual phenomenon of temperature change.
Discussion:
As temperature is generally cyclical by year, we would expect, in an ideal case without global warming and without “man-made” environmental disturbance, that there would be no (Gaussian) mean change in the monthly average temperature over the years. However, this is not what appears in our study.
On the other hand, if there is a warming phenomenon in the sense that the temperature is rising steadily, we would expect to observe a significantly larger number of upward trends (and magnitudes) in the mean temperature in comparison to downward trends (and magnitudes). However, comparing the trend patterns of the temperature using monthly averages over the ten cities to those of the yearly global average, the upward and downward trend patterns fluctuate more frequently when using the monthly average data, except for those reported in Table 2.
The mean net temperature change of each city is also calculated and tabulated in the second column (labeled “Net change”) of Table 3. The mean net temperature change is calculated by averaging the monthly average temperature of the 12 months covering the entire period of the data available for each city. For comparison purposes, E.P.A. projections of the temperature change based on 50-year data and 100-year data are also tabulated in Table 3.
The net changes reported in the second column of Table 3 show that the temperature has risen in every city. In addition, most of them fall into the interval of the E.P.A. projection based on the last 100 years of data. The three exceptions are St. Louis, Houston, and Boston. A further comparison also reveals that the net change in Boston disagrees with the E.P.A. projection based on the recent 50-year data as well. Yet the data set of Boston has the longest spanning period. A question raised, and subject to further future study, is the consistency of the data adjustment and correction over such a long period of time. The E.P.A. results tabulated for validation were obtained from [http://www.epa.gov/globalwarming/climate/trends/temperature.html].
We now ask a similar question about the existence of any localized spatial trend patterns in the temperature data. By examining the significant event association patterns that also appear among the three most probable joint events in the probability model of each of the five study groups, each study reveals some interesting conclusions.
In the first study group, according to the statistical interdependency test, the mean temperature decreases in Chicago and DC do not occur independently, and likewise the mean temperature increases in Delaware and Boston. The consistent pair-wise mean temperature change in Delaware and Boston is consistent with our expectations, since they are in relatively close geographical proximity. In the second study group, a similar decrease in the mean temperature is observed in both Boston and Chicago. An interesting contrast is the fifth study group, which shows the change in mean temperature moving in opposite directions between two locations: San Francisco and St. Louis.
The third and fourth study groups are perhaps the most interesting. In the third study group, the association patterns including Delaware and Kentucky reveal a decrease in the mean temperature, while in the fourth study group the association patterns including Delaware and Kentucky reveal an increase in the mean temperature. A further study shows that both locations are in close proximity to an isotherm, a curve of equal temperature across different locations.
4. Conclusion
This paper discusses a treatment of temporal-spatial data based on information-statistical analysis. The analysis consists of three steps. Under the Gaussian and i.i.d. assumptions, the temporal aspect of the data is examined by determining the possible mean change points of the Gaussian model through a statistical hypothesis test using the Schwarz information criterion. Based on the detected change points, we qualify the changes in the mean at the change points and marginalize such frequency information over the temporal domain. The next analytical step involves formulating an optimization problem based on the available frequency information, in an attempt to derive an optimal discrete-valued probability model that captures possible spatial association characteristics of the data. A chi-square hypothesis test is then applied to detect any statistically significant event association patterns. The proposed analysis approach is applied to temperature analysis. The application of our proposed information-statistical approach to temperature analysis has been successful: it reaches conclusions consistent with those found by others, and in addition, new interesting results about trend patterns of temperature change by season for individual cities are found.
Acknowledgement: The authors are indebted to the anonymous reviewers for their insightful comments. The constructive comments from the audience when the idea of this paper was presented at MLDM2001 are also acknowledged. The manuscript preparation, temperature data repackaging, and web-based data hosting resources were supported in part by NSF DUE CCLI grant #0088778. Professor David Locke of the Biochemistry Department at Queens College/CUNY provided technical proofreading. Professor Locke and Professor Mankiewicz of the School of Earth and Environmental Sciences at Queens College offered many insightful discussions concerning the quality assurance of the GHCN v2 data set, the interpretation of our temperature analysis results, and comparisons of our results to others such as the E.P.A.'s. This paper is dedicated to the people who helped rebuild the City of New York from the crisis of September 11, 2001.
References
Barnett, T. P., 1978. Estimating Variability of Surface Air Temperature in the Northern
Hemisphere, Monthly Weather Review, 106, 1353-1367.
Borgwardt, K.H., 1987. The Simplex Method, A Probabilistic Analysis, Springer-Verlag,
Berlin.
Chen, J. and Gupta, A.K., 1997. Testing and Locating Variance Change Points with Application to Stock Prices, Journal of the American Statistical Association, 92(438), 739-747.
Chen, J. and Gupta, A.K., 1998. Information Criterion and Change Point Problem for
Regular Models, Technical Report No. 98-05, Department of Math. and Stat., Bowling
Green State U., Ohio.
Easterling, David R., Peterson, Thomas C., and Karl, Thomas R., 1996. On the development and use of homogenized climate data sets. Journal of Climate, 9, 1429-1434.
Fisher, R.A., 1924. The conditions under which χ² measures the discrepancy between observation and hypothesis, Journal of the Royal Statistical Society, 87:442-450.
Grant, Eugene L. and Leavenworth, Richard S., 1972. Statistical Quality Control.
McGraw-Hill Book Company, New York.
Good, I.J., 1960. Weight of Evidence, Correlation, Explanatory Power, Information, and
the Utility of Experiments, Journal of Royal Statistics Society, Ser. B, 22:319-331.
Goodman, L.A., 1968. The analysis of cross-classified data: Independence, quasi-independence and interactions in contingency tables with and without missing entries, Journal of the American Statistical Association, 63:1091-1131.
Grenander, U., 1993. General Pattern Theory, Oxford University Press, Oxford.
Grenander, U., 1996. Elements of Pattern Theory, The Johns Hopkins University Press,
ISBN 0-8018-5187-4.
Gupta, A.K. and Chen, J., 1996. Detecting Changes of mean in Multidimensional Normal
Sequences with Applications to Literature and Geology, Computational Statistics,
11:211-221, Physica-Verlag, Heidelberg.
Karmarkar, N., 1984. A New Polynomial-time Algorithm for Linear Programming, Combinatorica, 4(4), 373-395.
Kuenzi, H.P., Tzschach, H.G., and Zehnder, C.A., 1971. Numerical Methods of
Mathematical Optimization, New York, Academic Press.
Kullback, S. and Leibler, R., 1951. On Information and Sufficiency, Ann. Math.
Statistics, 22:79-86.
Kullback, S., 1959. Information and Statistics, Wiley and Sons, New York.
Mitchell, J. M., Jr., 1963. On the Worldwide Pattern of Secular Temperature Change,
Changes of Climate, Arid Zone Research, Vol. 20, UNESCO, Paris, 161 – 181.
Paltridge, G. and Woodruff, S., 1981. Changes in Global Surface Temperature From 1880 to 1977 Derived From Historical Records of Sea Surface Temperature, Monthly Weather Review, 109, 2427-2434.
Peterson, Thomas C. and Easterling, David R., 1994. Creation of homogeneous composite climatological reference series. International Journal of Climatology, 14, 671-679.
Shannon, C.E. and Weaver, W., 1972. The Mathematical Theory of Communication, University of Illinois Press, Urbana.
Schwarz G., 1978. Estimating the dimension of a model, Ann. Statist., 6, 461-464.
Sy, B.K., 2001. Probability Model Selection Using Information-Theoretic Optimization Criterion, Journal of Statistical Computation and Simulation, Gordon & Breach Publishing Group, NJ, 69(3).
Vostrikova, L. Ju., 1981. Detecting disorder in multidimensional random processes, Soviet Math. Dokl., 24, 55-59.
(UCAR 1997) “Reports to the Nation: Our Changing Climate,” University Corporation
for Atmospheric Research (UCAR) and the National Oceanic and Atmospheric
Administration (NOAA), (UCAR, Boulder, Colorado, 1997), pp 20.
Wong, A.K.C. and Wang, Y., 1997. High Order Pattern Discovery from Discrete-valued Data, IEEE Trans. on Knowledge and Data Engineering, 9(6):877-893.
Wright, S., 1997. Primal-Dual Interior-Point Methods, SIAM, ISBN 0-89871-382-X.
(WR 1998) “Critical Trends: The Global Commons,” World Resources 1998-99: A Guide
to the Global Environment (Editor Leslie Roberts), Oxford University Press, 1998, ISBN
0-19-521407-2, pp 170 – 184.
[www http://www.epa.gov/globalwarming/climate/trends/temperature.html]
[www http://www.ncdc.noaa.gov/wdcamet.html]
[www http://www.ncdc.noaa.gov/ol/climate/research/ghcn/ghcnqc.html]
Dr. B. K. Sy is a Full Professor in the Computer Science Department of Queens College and the University Graduate Center of the City University of New York. He has published funded research supported by federal and private agencies. His research spans multi-disciplinary areas such as information-statistical data mining for census analysis, pattern theory and pattern-based approaches for science learning, earthquake modeling using multi-population theory, and dysarthric speech evaluation using computer-based voice recognizers. His research group has engaged in various research projects in intelligent information technology and data mining in database systems, with particular emphasis on science education and industrial applications. Dr. Sy received his Ph.D. and M.Sc. in Electrical and Computer Engineering in 1988 from Northeastern University, Boston, Massachusetts.
Dr. A. K. Gupta has been invited to write papers for 31 national and international
conferences, symposia and publications in the past 25 years. Overall he has been invited
to present more than 80 talks at various colloquia, universities and professional meetings,
most notably including advanced lectures on statistical methods for the U.S. Air Force.
Gupta is an elected fellow of the American Statistical Association, the Institute of
Statisticians and the Royal Statistical Society of England. He has written more than 100
articles and he has edited, co-edited or co-authored six books on statistics. In 1990 he
received the Olscamp Research Award. Dr. Gupta, who joined the University in 1976,
received his doctoral degree from Purdue University, bachelor's and master's degrees
from Poona University in India, and a bachelor's degree in statistics from Banaras Hindu University in India.
Table captions
Table 1: Spanning Period of Coverage of the Ten Locations
Table 2: Seasonal trend patterns of ten locations in the U.S.
Table 3: Result comparison using E.P.A. studies as a reference
Location         Symbol   Start year   End year   Spanning period
Chicago          CH       1950         2000       51
Washington DC    DC       1950         2000       51
Delaware         DE       1854         2000       147
Fargo            FA       1883         2000       118
Houston          HO       1948         2000       53
Kentucky         KT       1949         2000       52
Boston           BO       1747         2000       254
San Francisco    SF       1853         2000       148
St. Louis        SL       1893         2000       108
Seattle          SE       1947         2000       54

Table 1: Spanning Period of Coverage of the Ten Locations
City                  Seasonal quarter     Trend pattern (upward/downward)
(SE) Seattle          Winter               Upward
(SF) San Francisco    Fall and Winter      Upward
(CH) Chicago          All 4 quarters       Upward
(HO) Houston          Winter and Spring    Upward
(KT) Kentucky         Summer               Downward
(KT) Kentucky         Winter               Upward
(DE) Delaware         Winter               Slightly upward
(DC) Washington DC    Summer and Fall      Downward

Table 2: Seasonal trend patterns of ten locations in the U.S.
City symbol         Net change    E.P.A. projection based on     E.P.A. projection based on
(period coverage)   (°C)          data from 1950-1999 (°C)       last 100-year data (°C)
FA (1875-2000)      3.237         4                              1 to 4
SE (1947-2000)      2.165         3                              1 to 4
SF (1853-2000)      1.276         3                              1 to 4
SL (1893-2000)      1.252         2                              0 to 1
CH (1950-2000)      0.955         2.5                            0 to 1
HO (1948-2000)      0.794         1                              0 to -1
KT (1949-2000)      0.599         2                              0 to 1
DE (1854-2000)      0.598         2                              0 to 1
BO (1747-2000)      0.499         -3                             1 to 4
DC (1950-2000)      0.031         1                              0 to 1

Table 3: Result comparison using E.P.A. studies as a reference
Figure Captions:
Figure 1: World Mean Temperature Change
Figure 2: Region 1 (Latitude 30°N – 50°N) World Mean Temperature Change
Figure 3: Region 2 (Latitude 10°S – 10°N) World Mean Temperature Change
Figure 4: Region 3 (Latitude 30°S – 50°S) World Mean Temperature Change
[Figures 1-4 plot the relative change in successive mean temperature against year, 1880 - 2000.]