Shashi M. Kanbur
SUNY Oswego
Workshop on Best Practices in Astro-Statistics
IUCAA, January 2016

Collaborators and Funding
HP Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar
NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego
Website: http://www.oswego.edu/~kanbur/iucaa2016/ http://www.oswego.edu/~kanbur/DU2014

Linear Regression
A very common type of model in science. Given data $(x_i, y_i)$, $i = 1, \dots, N$, the model is
$$y_i = a + b x_i + \varepsilon_i,$$
where $x_i$ and $y_i$ are the independent and dependent variables, respectively, $a$ and $b$ are the intercept and slope, respectively, and $\varepsilon_i$ is the error.
The error model is usually $\varepsilon_i \sim N(0, \sigma^2)$.
We are interested in testing hypotheses on the slope $b$.

Linear Regression
Least squares estimates of the slope and intercept are given by
$$\hat{b} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x},$$
with standard errors given by
$$s_{\hat{b}} = \sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N} \hat{\varepsilon}_i^{\,2}}{\sum_{i=1}^{N} (x_i - \bar{x})^2}}, \qquad s_{\hat{a}} = s_{\hat{b}} \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}.$$

Linear Regression
We are interested in testing whether the following alternative model is better:
$H_0$: $b = b_0$ vs. $H_A$: $b = b_1$ for $x \le x_0$, $b = b_2$ for $x > x_0$.
That is, there is a change of slope at $x_0$, the break point.
We can fit regression lines to the data on both sides of the break point, with slope estimates $\hat{b}_1$ and $\hat{b}_2$.

Linear Regression
The standard way to “check” this is to look at the intervals
$$(\hat{b}_1 \pm m\, s_{\hat{b}_1}), \qquad (\hat{b}_2 \pm m\, s_{\hat{b}_2})$$
and see whether they overlap. This essentially puts confidence intervals around the slope estimates.
Depending on the choice of $m$, the probability that the true slope lies in each interval is $1 - \alpha$, so the probability of an error is $\alpha$.

Linear Regression
Let A = {“short period” slope is wrong} and B = {“long period” slope is wrong}.
In comparing the long and short period slopes, the probability of at least one mistake is
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 2\alpha - \alpha^2,$$
taking A and B to be independent, each with probability $\alpha$.
If $0 < \alpha < 1$, then $2\alpha - \alpha^2 > \alpha$.
A statistical test carried out at significance level $\alpha$, like those outlined in this talk, therefore has a smaller chance of making an error than the interval comparison.
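The inequality $2\alpha - \alpha^2 > \alpha$ can be checked numerically with a minimal sketch. The function name `prob_at_least_one_error` is hypothetical, and the calculation assumes the two interval errors A and B are independent, each with probability $\alpha$, which is what $P(A \cap B) = \alpha^2$ implies.

```python
def prob_at_least_one_error(alpha: float) -> float:
    """P(A or B) for two independent errors, each with probability alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Inclusion-exclusion: P(A) + P(B) - P(A and B), with P(A and B) = alpha**2
    # (assumption: A and B are independent).
    return 2.0 * alpha - alpha ** 2


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        p = prob_at_least_one_error(alpha)
        print(f"alpha = {alpha:.2f}: P(at least one error) = {p:.4f}")
```

For $\alpha = 0.05$ this gives 0.0975, so comparing two 95% intervals carries almost twice the nominal error probability of a single test at the 5% level.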
F Test
Perhaps the simplest way to test for nonlinearity is to use the F test:
$$F = \frac{(RSS_R - RSS_F)/(\nu_R - \nu_F)}{RSS_F/\nu_F}.$$
Refer this statistic to the theoretical $F(\nu_R - \nu_F, \nu_F)$ distribution, where the subscripts R and F stand for the reduced and full models, respectively, $\nu$ stands for the degrees of freedom, and RSS stands for the residual sum of squares.
Things to check: normality, heteroskedasticity and IID observations.

Normality/Heteroskedasticity
Data $(X_i, Y_i)$ with residuals $\varepsilon_i$: $Y_i = Y^{f}_i + \varepsilon_i$, where $Y^{f}_i$ is the fitted value.
Permute the residuals without replacement (the bootstrap resamples with replacement): $\varepsilon^{n}_i = \varepsilon_j$, $Y^{n}_i = Y^{f}_i + \varepsilon^{n}_i$.
With $(X_i, Y^{n}_i)$ compute the F statistic; repeat to obtain a set of values $F_i$.
Find the proportion of the $F_i$ that are greater than the observed value of F.
Heteroskedasticity: plot the residuals against the independent variable. Try a transformation, perhaps log.

Testing for Normality
Data $(X_i, Y_i)$, $i = 1, \dots, N$.
Quantiles: $F_n(u) = \#\{Y_i \le u\}/N$; compare with those expected from a normal distribution.
If the data are from a normal distribution, the q-q plot should be close to a straight line.

Random Walk Methods
Order the independent variable: $x_1 < x_2 < \dots < x_N$.
If $r_k$ is the $k$th residual from a linear regression, then
$$C(j) = \sum_{k=1}^{j} r_k, \qquad R = \max_j [C(j)] - \min_j [C(j)].$$
If the data are consistent with a single linear regression, then the $C(j)$ are a simple random walk.
Our test statistic, $R$, is the vertical range of the $C(j)$.

Random Walk Methods
If the partial sums are a random walk, $R$ will be small.
Permute the $r_k$ so that the residuals are randomized, then recompute $R$. Repeat this procedure for a large number (~10000) of permutations.
The significance statistic is the fraction of the permuted $R$ statistics that are greater than the observed value of $R$: this is the significance level under the null hypothesis of linearity.
This is a non-parametric test and does not depend on normality of the errors.

Testimator
Testimator = Test Estimator.
Sort the data in order of increasing independent variable.
Divide the sample into $N_1$ different non-overlapping, and hence completely independent, subsets. Each subset has $n$ data points, and the remaining data points are included in the last subset.
We fit a linear regression to the first subset and determine an initial slope estimate, $\beta'$.

Testimator
This initial estimate of the slope becomes $\beta_0$ in the next subset, under the null hypothesis that the slope of the second subset is equal to the slope of the first subset.
We calculate the t-statistic
$$t_{obs} = \frac{\hat{\beta} - \beta_0}{\sqrt{MSE/S_{xx}}}, \qquad MSE = \frac{1}{N-2}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad S_{xx} = \sum_{i=1}^{N} (x_i - \bar{x})^2.$$

Testimator
Since there will be $n_g = N_1 - 1$ hypothesis tests (one for each subset after the first), the critical t value will be a Bonferroni-type $t_c = t_{\alpha/2n_g,\,\nu}$, where $\nu$ is the number of data points in each subset.
Once we know the observed and critical values of the t-statistic, we determine
$$k = \frac{|t_{obs}|}{t_c},$$
which is interpreted as the probability that the initial testimator guess is true.
If $k < 1$, the null hypothesis is accepted and we derive the new testimator slope for the next subset using the previously determined $\beta$'s, such that
$$\beta_w = k\,\hat{\beta} + (1 - k)\,\beta_0.$$

Testimator
This value of the testimator is taken as $\beta_0$ for the next subset.
This process of hypothesis testing is repeated $n_g$ times or until $k > 1$, which suggests rejection of the null hypothesis: the data are more consistent with a non-linear relation.

The Extra-Galactic Distance Scale
$\mu = m - M$, so $\mu = m - (a + b \log P)$.
Calibrating galaxy: observe Cepheids and determine $M = a + b \log P$.
Target galaxy: observe Cepheids $m_i$, $i = 1, \dots, N$, so $\mu_i = m_i - (a + b \log P_i)$.
In matrix form, $y = Lq$, where $y = (m_1, m_2, \dots, m_N)$, $q = (a, b, \mu_1, \mu_2, \dots, \mu_N)$ is the vector of unknowns, and $L$ is an $N \times (N+2)$ matrix containing 1's and $\log P_i$'s.

The Extra-Galactic Distance Scale
This is a vector equation for the $q$'s and is easily solvable using the General Linear Model interface in R.
Minimizing $\chi^2 = (y - Lq)^T C^{-1} (y - Lq)$ yields the MLE estimator for $q$, where $C$ is the matrix of measurement errors.
This is the weighted least squares estimate when the errors are normally distributed.
The solution is $q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$, and the covariance matrix of the parameters in $q'$ is $(L^T C^{-1} L)^{-1}$; the standard errors are the square roots of its diagonal elements.
If you formulate your statistical data analysis problems in this General Linear Model formalism, it is very easy to solve them in R along with a full error analysis.

The Extra-Galactic Distance Scale and Bayes
Bayesian GLM formalism applied to the estimate of $H_0$.

Segmented Lines and the Davies Test
The model is
$$Y = a_S + b_S X + \psi(X)\,\Delta a\,(X - X_b),$$
with $\Delta a = a_L - a_S$ and $\psi(X) = 0$ for $X < X_b$, $\psi(X) = 1$ for $X \ge X_b$.
This assumes a continuous transition between the two linear models.
A more general situation, perhaps with a discontinuity, is
$$Y = a_S + b_S X + \psi(X)\,[\Delta a\,(X - X_b) - \gamma],$$
where $\gamma$ represents the magnitude of the gap.

Segmented Lines
Choose an initial break point $X_b'$ and then fit the other parameters in the equation.
Estimate a new break point, $X_b'' = X_b' + \gamma/\Delta a$.
Repeat until $\gamma \approx 0$.

Cepheid PL Relations
Cepheid PC Relations
Multiphase PL Relations
Multiwavelength PL Relations
Galactic PL Relations
ExtraGalactic PL Relations
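As an illustration of the weighted least-squares solution $q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$ from the distance-scale slides, here is a minimal Python sketch of the simplest special case. It assumes only two unknowns $(a, b)$, a design matrix $L$ whose rows are $(1, \log P_i)$, and a diagonal $C$ holding the measurement variances $\sigma_i^2$. The function name `wls_line` and the data values are hypothetical and made up for the example; the talk itself points to the General Linear Model interface in R, which this sketch does not reproduce.

```python
import math


def wls_line(log_p, mag, sigma):
    """Weighted least squares for mag = a + b * log_p with diagonal C.

    Returns (a, b, se_a, se_b); the standard errors are the square roots
    of the diagonal of (L^T C^-1 L)^-1.
    """
    w = [1.0 / s ** 2 for s in sigma]  # diagonal of C^-1
    # Entries of the 2x2 matrix L^T C^-1 L, rows of L being (1, log_p_i)
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, log_p))
    s2 = sum(wi * x * x for wi, x in zip(w, log_p))
    # Entries of the vector L^T C^-1 y
    t0 = sum(wi * y for wi, y in zip(w, mag))
    t1 = sum(wi * x * y for wi, x, y in zip(w, log_p, mag))
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ValueError("need at least two distinct values of log_p")
    # Inverse of [[s0, s1], [s1, s2]] is (1/det) * [[s2, -s1], [-s1, s0]]
    a = (s2 * t0 - s1 * t1) / det
    b = (s0 * t1 - s1 * t0) / det
    return a, b, math.sqrt(s2 / det), math.sqrt(s0 / det)


if __name__ == "__main__":
    # Made-up, noise-free data lying on mag = -1.5 - 3.0 * log_p
    log_p = [0.5, 1.0, 1.5]
    mag = [-3.0, -4.5, -6.0]
    sigma = [0.1, 0.2, 0.1]
    a, b, se_a, se_b = wls_line(log_p, mag, sigma)
    print(f"a = {a:.3f} +/- {se_a:.3f}, b = {b:.3f} +/- {se_b:.3f}")
```

Because the example data lie exactly on a line, the fit recovers $a = -1.5$ and $b = -3.0$ up to rounding. The full problem with the additional unknowns $\mu_i$ needs a general linear solver rather than this closed-form 2x2 inverse.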