Statistical Approaches to Testing for Linearity

Shashi M. Kanbur
SUNY Oswego
Workshop on Best Practices in Astro-Statistics
IUCAA January 2016
Collaborators and Funding
• H. P. Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar
• NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego
Website:
http://www.oswego.edu/~kanbur/iucaa2016/
http://www.oswego.edu/~kanbur/DU2014
Linear Regression
• A very common type of model in science.
• Data (xi, yi), i = 1, …, N.
• yi = a + b xi + εi,
• where xi, yi are the independent and dependent variables, respectively, a, b are the intercept and slope, respectively, and εi is the error.
• The error model is usually εi ~ N(0, σ²).
• We are interested in testing hypotheses on the slope b.
Linear Regression
• Least Squares estimates of the intercept and slope are given by

$$\hat{b} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x},$$

with standard errors given by

$$s_{\hat{b}} = \sqrt{\frac{\tfrac{1}{N-2}\sum_{i=1}^{N}\hat{\varepsilon}_i^{\,2}}{\sum_{i=1}^{N}(x_i-\bar{x})^2}}, \qquad s_{\hat{a}} = s_{\hat{b}}\sqrt{\frac{1}{N}\sum_{i=1}^{N}x_i^2}.$$
Linear Regression
• We are interested in testing whether the following model is better:
• H0: b = b0 vs. HA: b = b1 for x ≤ x0, b = b2 for x > x0.
• That is, there is a change of slope at x0, the break point.
• We can fit regression lines to the data on either side of the break point, with slope estimates $\hat{b}_1, \hat{b}_2$.
Linear Regression
• The standard way to “check” this is to look at the intervals

$$(\hat{b}_1 \pm m\,s_{\hat{b}_1}), \qquad (\hat{b}_2 \pm m\,s_{\hat{b}_2})$$

and see whether they are mutually exclusive. This essentially puts confidence intervals around the slope estimates. Depending on the choice of m, this says that the probability that the true slope lies in the interval above is 1 − α, i.e. the probability of an error is α.
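A minimal R sketch of this interval check, on simulated data with a known slope change at x0 = 1 (both the data and the break point are illustrative assumptions; later sketches reuse them):

```r
# Minimal sketch of the interval check, on simulated data with a slope
# change at x0 = 1 (the data and break point are illustrative assumptions).
set.seed(2)
x  <- sort(runif(200, 0, 2))
x0 <- 1
y  <- ifelse(x <= x0, 1.0 + 2.5 * x, 1.0 + 2.5 * x0 + 1.5 * (x - x0)) +
      rnorm(200, sd = 0.2)

ci.short <- confint(lm(y[x <= x0] ~ x[x <= x0]))[2, ]  # CI for b1
ci.long  <- confint(lm(y[x >  x0] ~ x[x >  x0]))[2, ]  # CI for b2
rbind(ci.short, ci.long)   # mutually exclusive intervals suggest a break
```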
Linear Regression
• Then let A = {“short period” slope is wrong} and B = {“long period” slope is wrong}.
• In comparing the long and short period slopes, the probability of at least one mistake is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 2\alpha - \alpha^2,$$

assuming the two tests are independent, each with error probability α.
• If 1 > α > 0, then 2α − α² > α. For example, α = 0.05 gives 2α − α² = 0.0975, nearly double the nominal error rate.
• So if we carry out a single statistical test to significance level α, as the tests outlined in this talk do, it has a smaller chance of making an error than this pairwise interval comparison.
F Test
• Perhaps the simplest way to test for nonlinearity is to use the F test:

$$F = \left(\frac{RSS_R - RSS_F}{\nu_R - \nu_F}\right) \Big/ \left(\frac{RSS_F}{\nu_F}\right),$$

• where the subscripts R, F stand for the reduced and full models respectively, ν stands for the degrees of freedom, and RSS stands for the residual sum of squares.
• Refer this test statistic to the theoretical F(νR − νF, νF) distribution.
• This assumes normality, homoskedasticity, and IID observations.
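In R this F test reduces to a one-line model comparison; a minimal sketch reusing the simulated (x, y, x0) above, with the full model taken here as separate lines on either side of the break:

```r
# Minimal sketch of the F test: single line (reduced) vs. separate
# intercepts and slopes on each side of x0 (full), using (x, y, x0) above.
reduced <- lm(y ~ x)
full    <- lm(y ~ x * (x > x0))   # intercept and slope change at x0

anova(reduced, full)              # F statistic and its p-value
```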
Normality/Heteroskedasticity
• Data (Xi, Yi) with fitted values Yfi and residuals εi, so that Yi = Yfi + εi.
• Permute the residuals without replacement (the bootstrap samples with replacement):
• εni = εj
• Yni = Yfi + εni
• With (Xi, Yni), recompute the F statistic; repeating this gives a set of values Fi.
• Find the proportion of the Fi that are greater than the observed value of F.
• Heteroskedasticity: plot the residuals against the independent variable. Try a transformation, perhaps a log.
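A minimal sketch of this permutation version of the test, continuing the example above:

```r
# Minimal sketch of the permutation F test, continuing the example above.
f.stat <- function(yy) anova(lm(yy ~ x), lm(yy ~ x * (x > x0)))$F[2]

F.obs <- f.stat(y)
y.fit <- fitted(lm(y ~ x))     # fitted values under the single-line null
eps   <- residuals(lm(y ~ x))

F.perm <- replicate(1000, f.stat(y.fit + sample(eps)))  # permute residuals
mean(F.perm > F.obs)           # permutation significance level
```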
Testing for Normality
• Data (Xi, Yi), i = 1, …, N.
• Empirical quantiles: Fn(u) = #{Yi ≤ u}/N; compare these with the quantiles expected from a normal distribution.
• If the data are from a normal distribution, the q-q plot should be close to a straight line.
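A minimal R sketch, applied here to the regression residuals from the running example:

```r
# Minimal sketch: q-q plot of the regression residuals against normal
# quantiles; points near the reference line support normality.
eps <- residuals(lm(y ~ x))
qqnorm(eps)
qqline(eps)
shapiro.test(eps)   # a numerical companion to the visual check
```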
Random Walk Methods
• Order the independent variable: x1 < x2 < … < xN.
• If rk is the kth residual from a linear regression, then

$$C(j) = \sum_{k=1}^{j} r_k, \qquad R = \max_j[C(j)] - \min_j[C(j)].$$

• If the data are consistent with a single linear regression, then the C(j) are a simple random walk.
• Our test statistic, R, is the vertical range of the C(j).
Random Walk Methods
• If the partial sums are a random walk, R will be small.
• Permute the rk so that the residuals are randomized, then recompute R. Repeat this procedure for a large number (~10000) of permutations.
• The fraction of the permuted R statistics that are greater than the observed value of R is the significance level under the null hypothesis of linearity.
• This is a non-parametric test and does not depend on normality of the errors.
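A minimal R sketch of the random walk test, with the residuals taken in order of increasing independent variable (x is already sorted in the earlier sketches):

```r
# Minimal sketch of the random-walk test on the running example's residuals.
r <- residuals(lm(y ~ x))
range.stat <- function(r) { C <- cumsum(r); max(C) - min(C) }

R.obs  <- range.stat(r)
R.perm <- replicate(10000, range.stat(sample(r)))  # randomize the residuals
mean(R.perm > R.obs)   # significance level under the null of linearity
```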
Testimator
• Testimator = test estimator.
• Sort the data in order of increasing independent variable.
• Divide the sample into N1 different non-overlapping, and hence completely independent, subsets. Each subset has n data points, and the remaining data points are included in the last subset.
• We fit a linear regression to the first subset and determine an initial slope estimate, β′.
Testimator
• This initial estimate of the slope becomes β0 in the next subset, under the null hypothesis that the slope of the second subset is equal to the slope of the first subset. We calculate the t-statistic such that

$$t_{obs} = \frac{\beta' - \beta_0}{\sqrt{MSE/S_{xx}}}, \qquad MSE = \frac{1}{N-2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \qquad S_{xx} = \sum_{i=1}^{N}(x_i - \bar{x})^2.$$
Testimator
• Since there will be ng = n − 1 hypothesis tests, the critical t value will be a Bonferroni type, t(α/2ng, ν), where ν is the number of data points in each subset.
• Once we know the observed and critical values of the t-statistic, we determine

$$k = \frac{|t_{obs}|}{t_c},$$

• which is the probability that the initial testimator guess is true. If the value of k < 1, the null hypothesis is accepted and we derive the new testimator slope for the next subset using the previously determined β's such that

$$\hat{\beta}_w = k\,\hat{\beta}' + (1-k)\,\beta_0.$$
Testimator
• This value of the testimator is taken as β0 for the next subset. This process of hypothesis testing is repeated ng times, or until the value of k > 1, which suggests rejection of the null hypothesis; that is, the data are more consistent with a non-linear relation.
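A minimal R sketch of the whole procedure under stated assumptions: equal-size subsets, a Bonferroni-corrected critical value, and the weighted update as reconstructed above; the function name, subset count, and α are illustrative choices.

```r
# Minimal sketch of the testimator, under the stated assumptions; the
# function name, subset count, and alpha are illustrative choices.
testimator <- function(x, y, n.subsets = 5, alpha = 0.05) {
  ord <- order(x); x <- x[ord]; y <- y[ord]          # sort by x
  idx <- split(seq_along(x), cut(seq_along(x), n.subsets, labels = FALSE))
  n.g <- n.subsets - 1                               # number of tests

  beta0 <- coef(lm(y[idx[[1]]] ~ x[idx[[1]]]))[2]    # initial slope

  for (j in 2:n.subsets) {
    fit      <- lm(y[idx[[j]]] ~ x[idx[[j]]])
    beta.hat <- coef(fit)[2]
    se       <- summary(fit)$coefficients[2, 2]      # sqrt(MSE / Sxx)
    t.obs    <- (beta.hat - beta0) / se
    t.c      <- qt(1 - alpha / (2 * n.g), df = length(idx[[j]]) - 2)
    k        <- abs(t.obs) / t.c
    if (k >= 1) return(list(linear = FALSE, rejected.at = j))
    beta0 <- k * beta.hat + (1 - k) * beta0          # weighted update
  }
  list(linear = TRUE, slope = unname(beta0))
}

testimator(x, y)   # applied to the simulated broken-line data from earlier
```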
The Extra-Galactic Distance Scale
• μ = m − M
• μ = m − (a + b·logP)
• Calibrating galaxy: observe Cepheids and determine M = a + b·logP.
• Target galaxy: observe Cepheids mi, i = 1, …, N. So μi = mi − (a + b·logPi).
• y = Lq,
• where y = (m1, m2, …, mN), q = (a, b, μ1, μ2, …, μN) is the vector of unknowns, and L is an N × (N+2) matrix containing 1's and logPi's.
The Extra-Galactic Distance Scale
• This is a vector equation for the q's, easily solvable using the General Linear Model interface in R.
• Minimizing χ² = (y − Lq)ᵀC⁻¹(y − Lq) yields the MLE for q, where C is the covariance matrix of the measurement errors.
• This is the weighted least squares estimate when the errors are normally distributed.
• q′ = (LᵀC⁻¹L)⁻¹LᵀC⁻¹y, and the covariance matrix of the parameters in q′ is (LᵀC⁻¹L)⁻¹, whose diagonal gives their squared standard errors.
• If you formulate your statistical data analysis problems in this General Linear Model formalism, it is very easy to solve them in R along with a full error analysis.
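A minimal R sketch of the weighted solve; the design matrix here contains only the calibrating-galaxy part (a, b), and the simulated periods, magnitudes, and errors are purely illustrative:

```r
# Minimal sketch of q' = (L^T C^-1 L)^-1 L^T C^-1 y for a simple design
# matrix with columns (1, logP); the simulated data are illustrative.
set.seed(3)
logP  <- runif(30, 0.4, 1.6)
sigma <- runif(30, 0.05, 0.15)                  # measurement errors
M     <- -2.5 - 3.0 * logP + rnorm(30, sd = sigma)

L    <- cbind(1, logP)                          # design matrix
Cinv <- diag(1 / sigma^2)                       # C is diagonal here

covq  <- solve(t(L) %*% Cinv %*% L)             # parameter covariance
q.hat <- covq %*% t(L) %*% Cinv %*% M
cbind(estimate = as.vector(q.hat), std.error = sqrt(diag(covq)))

coef(lm(M ~ logP, weights = 1 / sigma^2))       # cross-check with R's lm
```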
The Extra-Galactic Distance Scale
and Bayes
• Bayesian GLM formalism applied to the estimate of H0.
Segmented Lines and the Davies
Test
• The model is
• Y = aS + bS X + ψ(X)·Δb·(X − Xb),
• where Δb = bL − bS and ψ(X) = 0 for X < Xb, ψ(X) = 1 for X ≥ Xb.
• This assumes a continuous transition between the two linear models. A more general situation, with perhaps a discontinuity, is
• Y = aS + bS X + ψ(X)[Δb·(X − Xb) − γ],
• where γ represents the magnitude of the gap.
Segmented Lines
• Choose an initial break point Xb′ and then fit the other parameters in the equation.
• Estimate a new break point, Xb″ = Xb′ + γ/Δb.
• Repeat until γ ≈ 0.
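In R, the segmented package (Muggeo) implements this break-point iteration as well as the Davies test; a minimal sketch on the simulated data from the earlier sketches, where the initial guess psi = 1 is an assumption:

```r
# Minimal sketch using the `segmented` package, which implements this
# break-point iteration and the Davies test; psi is the initial guess.
library(segmented)

fit.lin <- lm(y ~ x)
davies.test(fit.lin, seg.Z = ~x)             # test for a change in slope

fit.seg <- segmented(fit.lin, seg.Z = ~x, psi = 1)
summary(fit.seg)                             # estimated break point
slope(fit.seg)                               # slopes of the two segments
```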
Cepheid PL Relations
Cepheid PC Relations
Multiphase PL Relations
Multiwavelength PL Relations
Galactic PL Relations
ExtraGalactic PL Relations