Appendix to "The Bayesian Change Point and Variable Selection Algorithm: Application to the δ18O Proxy Record of the Plio-Pleistocene," published in the Journal of Computational and Graphical Statistics

Authors: Eric Ruggieri, Duquesne University; Charles E. Lawrence, Brown University

Appendix A: The Bayesian Change Point and Variable Selection Algorithm

The Bayesian Change Point Algorithm, updated to include the EBIR variable selection procedure, is described below. The dependence on X has been suppressed. Steps that differ from the original algorithm are indicated with a '*'.

1. (*) Calculating the Probability Density of the Data, f(Y_{i:j}): EBIR is used to perform model averaging for each possible sub-string of the data, Y_{i:j}, 1 ≤ i < j ≤ N:

    f(Y_{i:j}) = \sum_{A_k} \frac{ q_{inc}^{m_{inc}} \, q_{exc}^{m_{exc}} \, k_{inc}^{m_{inc}/2} \, k_{exc}^{m_{exc}/2} \, (v_0 \sigma_0^2 / 2)^{v_0/2} \, \Gamma(v_j/2) }{ \Gamma(v_0/2) \, (2\pi)^{n/2} \, |X_{i:j}^T X_{i:j} + I_{A_k}|^{1/2} \, (s_j/2)^{v_j/2} }    (A1)

2. Forward Recursion: As before, let P_k(Y_{1:j}) be the density of the data [Y_1 ... Y_j] containing k change points. Define:

    P_1(Y_{1:j}) = \sum_{v<j} f(Y_{1:v}) \, f(Y_{v+1:j})    (A2)

    P_k(Y_{1:j}) = \sum_{v<j} P_{k-1}(Y_{1:v}) \, f(Y_{v+1:j})    (A3)

3. Stochastic Backtrace: In order to have a completely defined partition function (or normalization constant), P(Y_{1:N}), we again define P(K = 0) = 0.5 and, for k > 0,

    P(K = k, c_1, ..., c_k) = \frac{0.5}{k_{max} \, N_k},

so that

    P(Y_{1:N}) = \sum_{k=0}^{k_{max}} \sum_{c_1 ... c_k} P(Y_{1:N} | K = k, c_1, ..., c_k) \, P(K = k, c_1, ..., c_k)    (A4)

Now, not only can we draw samples to evaluate the uncertainty in the number of change points, their locations, and the parameters of the regression model, but we can also sample to assess the uncertainty of including a given variable in each sub-interval of the data. We obtain direct samples from the posterior distributions of all unknown parameters as follows:

3.1. Sample a Number of Change Points:

    P(K = k | Y_{1:N}) = \frac{ P_k(Y_{1:N}) \, P(c_1, ..., c_k | K = k) \, P(K = k) }{ P(Y_{1:N}) }    (A5)

3.2. Sample the Locations of the Change Points: Let c_{K+1} = N, the last data point. Then, for k = K, K-1, ..., 1, iteratively draw samples according to:

    P(c_k = v | c_{k+1}) = \frac{ P_{k-1}(Y_{1:v}) \, f(Y_{v+1:c_{k+1}}) }{ \sum_{v \in [k-1, \, c_{k+1})} P_{k-1}(Y_{1:v}) \, f(Y_{v+1:c_{k+1}}) }    (A6)

3.3. (*) Sample a Sub-Model for the Interval Between Adjacent Change Points: Samples are drawn from the posterior distribution on the set of possible sub-models given the data according to:

    P(A_k | Y_{(c_k+1):c_{k+1}}) = \frac{ \iint f(Y_{(c_k+1):c_{k+1}} | \beta, \sigma^2, A_k) \, f(\beta | \sigma^2, A_k) \, f(\sigma^2) \, d\beta \, d\sigma^2 \; P(A_k) }{ f(Y_{(c_k+1):c_{k+1}}) },

i.e.,

    P(A_k | Y_{(c_k+1):c_{k+1}}) = \frac{ f(Y_{(c_k+1):c_{k+1}} | A_k) \, P(A_k) }{ f(Y_{(c_k+1):c_{k+1}}) },    (A7)

where f(Y_{(c_k+1):c_{k+1}}) is given by Eq. (A1).

3.4. Sample the Regression Parameters for the Interval Between Adjacent Change Points: Let n = c_{k+1} - c_k, the number of data points in the sub-interval, and let

    \beta^* = \left( X_{(c_k+1):c_{k+1}}^T X_{(c_k+1):c_{k+1}} + I_{A_k} \right)^{-1} X_{(c_k+1):c_{k+1}}^T Y_{(c_k+1):c_{k+1}},

    s_j = \left( Y_{(c_k+1):c_{k+1}} - X_{(c_k+1):c_{k+1}} \beta^* \right)^T \left( Y_{(c_k+1):c_{k+1}} - X_{(c_k+1):c_{k+1}} \beta^* \right) + \beta^{*T} I_{A_k} \beta^* + v_0 \sigma_0^2,

and v_j = v_0 + n. Using Bayes' Rule one final time, we obtain:

    f(\beta | \sigma^2) \sim N\!\left( \beta^*, \, \left( X_{(c_k+1):c_{k+1}}^T X_{(c_k+1):c_{k+1}} + I_{A_k} \right)^{-1} \sigma^2 \right)    (A8)

    f(\sigma^2) \sim \text{Scaled-Inverse-}\chi^2\!\left( v_j, \, s_j / v_j \right)    (A9)
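To make the quantities above concrete, the following is a minimal, illustrative sketch in Python of the per-interval density (one term of Eq. (A1) per sub-model, with model averaging) and the forward recursion of Eqs. (A2)-(A3). All function names, default parameter values, and the brute-force enumeration of sub-models are our own choices for exposition and are not taken from the published implementation; in particular, the exhaustive sum over 2^m sub-models is what EBIR (Ruggieri and Lawrence 2012) avoids.

```python
import numpy as np
from scipy.special import gammaln, logsumexp


def log_density_submodel(Y, X, include, k_inc=0.01, k_exc=100.0,
                         q_inc=0.5, q_exc=0.5, v0=5.0, sigma0_sq=1.0):
    """Log of a single term of Eq. (A1): the marginal density of substring Y,
    given its design matrix X and one sub-model A_k (boolean inclusion
    vector), times that sub-model's prior probability."""
    n, m = X.shape
    m_inc = int(include.sum())
    m_exc = m - m_inc
    # I_{A_k}: diagonal matrix with k_inc for included and k_exc for excluded
    # predictors (the prior on beta is N(0, sigma^2 * I_{A_k}^{-1})).
    I_Ak = np.diag(np.where(include, k_inc, k_exc))
    XtX_IAk = X.T @ X + I_Ak
    beta_star = np.linalg.solve(XtX_IAk, X.T @ Y)
    resid = Y - X @ beta_star
    s_j = resid @ resid + beta_star @ I_Ak @ beta_star + v0 * sigma0_sq
    v_j = v0 + n
    return (m_inc * np.log(q_inc) + m_exc * np.log(q_exc)            # P(A_k)
            + 0.5 * (m_inc * np.log(k_inc) + m_exc * np.log(k_exc))
            + 0.5 * v0 * np.log(v0 * sigma0_sq / 2.0) + gammaln(v_j / 2.0)
            - gammaln(v0 / 2.0) - 0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * np.linalg.slogdet(XtX_IAk)[1]
            - 0.5 * v_j * np.log(s_j / 2.0))


def log_density(Y, X, **priors):
    """Eq. (A1): log f(Y_{i:j}), brute-forced over all 2^m sub-models."""
    m = X.shape[1]
    terms = [log_density_submodel(
                 Y, X, np.array([(mask >> p) & 1 for p in range(m)], bool),
                 **priors)
             for mask in range(2 ** m)]
    return logsumexp(terms)


def forward_recursion(Y, X, k_max, d_min, **priors):
    """Eqs. (A2)-(A3): log P_k(Y_{1:j}) for k = 1..k_max. Arrays are 0-based;
    substring [i, j] is inclusive and must contain at least d_min points."""
    N = len(Y)
    log_f = np.full((N, N), -np.inf)
    for i in range(N):
        for j in range(i + d_min - 1, N):
            log_f[i, j] = log_density(Y[i:j + 1], X[i:j + 1], **priors)
    log_P = np.full((k_max, N), -np.inf)
    for j in range(1, N):                                    # Eq. (A2), k = 1
        log_P[0, j] = logsumexp([log_f[0, v] + log_f[v + 1, j]
                                 for v in range(j)])
    for k in range(1, k_max):                                # Eq. (A3), k > 1
        for j in range(1, N):
            log_P[k, j] = logsumexp([log_P[k - 1, v] + log_f[v + 1, j]
                                     for v in range(j)])
    return log_f, log_P
```

The stochastic backtrace of steps 3.1-3.4 then samples a number of change points, their locations, a sub-model, and (β, σ²) for each interval directly from the stored log_f and log_P tables.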
Appendix B: Parameter Settings for the Bayesian Change Point and Variable Selection Algorithm

The Bayesian Change Point and Variable Selection algorithm has several parameters that can be tuned by the user: the minimum allowable distance between two change points; k_max, the maximum number of change points; k_inc, k_exc, q_inc, and q_exc, the prior parameters for the EBIR algorithm; and v_0 and σ_0², the parameters for the prior distribution on the error variance. The choice of each of these parameters is described below.

We place two types of constraints on the minimum distance between any two change points in order to ensure that enough data is available to estimate the parameters of the model accurately: 1) Change points are required to be a minimum of 50 data points apart from each other. In general, we recommend that each data segment contain at least twice as many data points as there are free parameters in the regression model; 2) Change points are required to be separated by at least half the period of the longest-period sinusoid in the analysis (i.e., a minimum of 100 kyr for the simulation and a minimum of 200 kyr for the analysis of the δ18O proxy record). This prevents the regression model from fitting short intervals with very large regression coefficients that act to cancel one another out, rather than with regression coefficients on the order of the dependent variable. As an example, the most recent 100 kyr of the δ18O record has demanded a change point in several of the analyses that we have conducted on this data set [see also Ruggieri et al. (2009)]. Without this constraint, three of the regression coefficients used to fit this small section of the data are larger than 1, the full range of the data set (Figure 4a). Adding this constraint keeps all regression coefficients in all intervals below 0.5, which is more realistic. Both of these constraints are utilized when modeling the δ18O record, as the data points are not equally spaced.

Two changes are well documented in the geosciences literature: the Mid-Pleistocene Transition ~1 Ma and the intensification of Northern Hemisphere glaciation ~2.7 Ma. Geoscientists therefore expect at least two change points in the δ18O proxy record, but there is serious doubt that there are significantly more. We therefore set k_max equal to six change points. For the simulation, there are no a priori assumptions. However, since the minimum distance between change points is 100 and there are only 1000 data points, there can be no more than 10 change points. Therefore, we set k_max = 10 for the simulation.

Variations in the values of k_inc, k_exc, q_inc, and q_exc affect the probability of a sub-model being selected in a given sub-interval, but otherwise have little impact on the overall number of change points. Here, we set k_inc = 0.01 and k_exc = 100 [a setting suggested by George and McCulloch (1993)], and q_inc = 0.5, q_exc = 0.5, indicating that all sub-models are equally likely a priori. As an example of how altering these parameters affects the results, changing k_exc to 1000 (indicating a tighter normal distribution about zero for the excluded variables) provides greater incentive for regressors whose amplitudes are of moderate size to be placed in the "include" rather than the "exclude" category, as the 'penalty' for an amplitude that differs greatly from zero has now been increased. More specifically, increasing k_exc to 1000 will cause the 53 kyr sinusoid to be selected for much of the Pleistocene (~2.7 Ma to the present) and, in general, increases the number of included regressors in each interval. Thus, these parameters change the set of selected variables, but not the placement of the change points, for the analysis of the δ18O proxy record.
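The following short sketch illustrates this 'penalty'. The residual variance, coefficient value, and sub-model below are hypothetical choices made only for illustration, not values from the paper; the N(0, σ²/k_exc) form of the prior on an excluded coefficient is read off from the posterior covariance in Eq. (A8).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values: sigma2 plays the role of the residual variance of a
# sub-interval, and beta is a moderate-amplitude regression coefficient that
# could be placed in the "excluded" category.
sigma2 = 0.05
beta = 0.2

# Under the EBIR prior, an excluded coefficient is modeled as N(0, sigma^2/k_exc),
# so a larger k_exc means a tighter prior about zero and a larger 'penalty'
# for a coefficient that differs greatly from zero.
for k_exc in (100.0, 1000.0):
    log_prior = norm.logpdf(beta, loc=0.0, scale=np.sqrt(sigma2 / k_exc))
    print(f"k_exc = {k_exc:6.0f}: log prior density at beta = {beta}: {log_prior:9.2f}")

# The diagonal matrix I_{A_k} in Eqs. (A1) and (A8) collects these scales:
# k_inc for included predictors, k_exc for excluded predictors.
include = np.array([True, True, False])        # hypothetical sub-model A_k
k_inc, k_exc = 0.01, 100.0
I_Ak = np.diag(np.where(include, k_inc, k_exc))
print(I_Ak)
```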
However, the posterior distribution on the number of change points for the analysis of the δ18O proxy record (but not the simulation) can be highly sensitive to the choice of parameters for the prior distribution of the error variance, v_0 and σ_0². Specifically, the number of change points, but not the distribution of their positions, can vary with the choice of these parameters, a phenomenon previously noted by Fearnhead (2006). In general, the larger the product of these two parameters, the fewer the number of change points chosen by the algorithm. The prior distribution on the error variance is akin to adding pseudo data points of a given residual variance and helps to bound the likelihood function. Given a data set with a small variance and a model with a large number of free parameters (here there are 21 coefficients: one for each of the ten sine and ten cosine waves, plus one constant), the model may be able to fit small segments of the data nearly perfectly under certain parameter settings. If this happens, the density of the data, f(Y_{i:j}), can become unbounded, and many spurious change points in close proximity to one another will be observed in the final output. To counteract this phenomenon, penalized likelihood techniques are often implemented (Ciuperca et al. 2003). Since we can be confident that the variance of the residuals will not be larger than the overall variance of the data set, we conservatively set the prior variance, σ_0², equal to the variance of the data set being used and set v_0 to be 5% of the size of the minimum allowed sub-interval. For the simulation, v_0 = 5, and for the analysis of the δ18O proxy record, v_0 = 10. With these settings, the results are much less sensitive. Larger numbers of change points result from setting strongly informed prior distributions; a choice along these lines acts as if there were more data, and thus the algorithm is able to pick up subtle changes.

Alternatively, one could choose values independent of the data set, such as v_0 = 4 and σ_0² = 0.25. The key to this choice is to make sure that the product v_0 σ_0² ≥ 1. This ensures that (s_j)^{v_j/2}, where s_j = (Y_{i:j} - X_{i:j} β*)^T (Y_{i:j} - X_{i:j} β*) + β*^T I_{A_k} β* + v_0 σ_0², acts to shrink, rather than enlarge, the density function f(Y_{i:j}), even when the residual variance (which is related to the variance of the substring Y_{i:j}) is small. We note that if MAP estimates are being employed by the Bayesian Change Point and Variable Selection algorithm, the procedure outlined above is equivalent to penalized regression (Ciuperca et al. 2003).
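As a concrete illustration of these two ways of choosing v_0 and σ_0² (a minimal sketch; the synthetic response series and its scale are assumptions for illustration, not data from the paper or the simulation):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = 0.6 * rng.standard_normal(1000)        # stand-in for the response series

# Option 1 (data-based, as above): sigma_0^2 equals the variance of the data
# and v_0 is 5% of the minimum allowed sub-interval length.
min_seg_len = 100
sigma0_sq = np.var(Y)
v0 = 0.05 * min_seg_len
print("data-based choice :", v0, round(float(sigma0_sq), 3))

# Option 2 (data-independent): pick v_0 and sigma_0^2 so that the prior sum of
# squares v_0 * sigma_0^2 is at least 1, which keeps the s_j term from
# inflating f(Y_{i:j}) when the residual variance of a short segment is tiny.
v0_alt, sigma0_sq_alt = 4.0, 0.25
assert v0_alt * sigma0_sq_alt >= 1.0
print("data-independent  :", v0_alt, sigma0_sq_alt)
```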
Appendix C: Glossary of Terms

Definition of Variables:

A_k = Vector of indicator variables for the inclusion or exclusion of each predictor variable from the subset of predictor variables under consideration.
β = [β_1, β_2, ..., β_m] = Vector of regression coefficients. β is piecewise constant; the values of the vector can change only at the change points / regime boundaries.
β* = Posterior mean of the regression coefficients.
{C} = Set of change points whose locations are c_0 = 0, c_1, ..., c_k, c_{k+1} = N.
ε = Random error term, which is assumed independent, mean zero, and normally distributed.
f(Y_{i:j}) = Probability density of a homogeneous (i.e., no change points) substring of the data set. f(Y_{i:j}) is calculated in Step 1 of the Bayesian Change Point and Variable Selection algorithm.
I = Identity matrix.
I_{A_k} = Diagonal matrix with either k_inc or k_exc on the diagonal, corresponding to whether or not a specific variable is included in the sub-model being considered.
k = Number of change points.
k_max = Maximum number of allowed change points.
k_0 = Scale parameter used in the prior distribution on the regression coefficients, β, that relates the variance of the regression coefficients to the residual variance.
k_inc = Similar to k_0, but associated with the set of included predictor variables.
k_exc = Similar to k_0, but associated with the set of excluded predictor variables.
m = Total number of predictor variables.
m_inc = Number of predictor variables included in sub-model A_k.
m_exc = Number of predictor variables excluded from sub-model A_k. Note: m_inc + m_exc = m.
n = Number of data points in a subset of the data set.
N = Total number of data points.
N_k = Number of change point solutions with exactly k change points.
q_inc = Probability of including a predictor variable.
q_exc = Probability of excluding a predictor variable. Note: q_inc + q_exc = 1.
P_k(Y_{1:j}) = P_k(Y_{1:j} | X_{1:j}) = Probability density of the first j observations of the data set containing k change points, given the regression model. P_k(Y_{1:j}) is calculated in the forward recursion step of the Bayesian Change Point and Variable Selection algorithm.
σ² = Residual variance. σ² is piecewise constant; the value of σ² can change only at the change points / regime boundaries.
v_0, σ_0² = Parameters for the prior distribution on the residual variance. v_0 and σ_0² act as pseudo data points: v_0 pseudo data points of variance σ_0² [essentially, unspecified training data gleaned from prior knowledge of the problem].
v_j, s_j = Parameters for the posterior distribution of the error variance, σ².
X = [X_1, X_2, ..., X_m] = Matrix of predictor variables.
X_{i:j} = {X_i, X_{i+1}, ..., X_{j-1}, X_j}, 1 ≤ i < j ≤ N = Sub-matrix of X containing all m predictor variables for a subset of the data set.
Y = [Y_1, Y_2, ..., Y_N] = Vector of response variables.
Y_{i:j} = {Y_i, Y_{i+1}, ..., Y_{j-1}, Y_j}, 1 ≤ i < j ≤ N = Substring (i.e., subset or regime) of the response variables.

Abbreviations and Acronyms:

δ18O = Ratio of 18O to 16O in ocean sediment cores. This ratio is used to quantify the amount of ice on the Earth at any time in the past. Larger values of δ18O indicate more ice volume.
EBIR = Exact Bayesian Inference in Regression. An efficient algorithm for Bayesian variable selection (Ruggieri and Lawrence 2012).
EM = Expectation Maximization algorithm.
HMM = Hidden Markov Model.
ka = Thousands of years ago.
kyr = Thousands of years.
Ma = Millions of years ago.
MAP = Maximum a posteriori estimator, i.e., the mode of the posterior distribution.
MPT = Mid-Pleistocene Transition. A transitional period during which glacial melting events went from occurring every 41 kyr to every 100 kyr.
MCMC = Markov Chain Monte Carlo.