A New Algorithm in Bayesian Model Averaging in Regression Models

TSAI-HUNG FAN,1 GUO-TZAU WANG,2 AND JENN-HWA YU3
1 Institute of Statistics, National Central University, Jhongli, Taiwan
2 National Center for High-performance Computing, Hsinchu, Taiwan
3 Department of Mathematics, National Central University, Jhongli, Taiwan

Communications in Statistics - Simulation and Computation, 43:315-328, 2014. DOI: 10.1080/03610918.2012.700750

We propose a new iterative algorithm, namely the model walking algorithm, to modify the widely used Occam's window method in the Bayesian model averaging procedure. It is verified by simulation that, in regression models, the proposed algorithm is much more efficient in terms of both computing time and the selected candidate models. Moreover, it is not sensitive to the choice of the initial models.

Keywords: Bayesian model averaging; Occam's window; Regression models.

Mathematics Subject Classification: Primary 62F15; 62J05; 65C60; Secondary 65C05.
1. Introduction

One major concern in regression analysis is to determine the "best" subset of predictor variables, where "best" means the model providing the most accurate predictions for new cases. Traditionally, one chooses a single "best" model based on some model selection criterion or procedure such as Mallows' Cp, forward selection, backward selection, or stepwise selection (cf. Montgomery et al. (2006)), and then makes predictions as if the selected model were the true model. However, in many situations there may be no single "best" model, or there may be several comparably appropriate models. Relying on only one model ignores model uncertainty and can lead to inaccurate predictions. (See Draper (1995), Kass and Raftery (1995), Madigan and York (1995), and the references therein.) Intuitively, it is therefore reasonable to consider the averaged result over all possible combinations of predictors.

The idea of model combination was first raised by Barnard (1963). Madigan and Raftery (1994) first proposed the Bayesian viewpoint for taking model uncertainty into account in graphical models using Occam's window. However, when the number of predictors is large, the number of all possible models becomes huge, and many of these models contribute little to explaining the response variable. Alternatively, Raftery et al. (1997) consider the Bayesian model averaging method in which the average is taken over a reduced set of models selected according to their posterior probabilities. They apply the "Occam's window" criterion in regression models to exclude models with too small posterior probabilities and thereby increase the computational efficiency. We refer to their approach as "Occam's window". The algorithm in "Occam's window" includes a "down-algorithm" and an "up-algorithm" to search for appropriate models. The up- and down-algorithms depend essentially on the size of the "window". For larger windows, more models are selected in a longer searching procedure; this is more conservative in the sense that it usually includes unimportant models. A small window, on the other hand, speeds up the search but is likely to exclude influential models. Because of the computational burden, both algorithms are performed only once, which often means that important models are missed while unnecessary models are retained. Moreover, the "Occam's window" algorithm is conducted from a specified initial subset of possible models, which affects both the final result and the cost of computation.

In this article, we modify the up- and down-algorithms in "Occam's window" by an iterative searching procedure that uses relative probabilities among neighborhood models to determine the searching direction. It is verified by simulation that, in the usual regression models, the proposed algorithm, called the model walking algorithm hereafter, is much more efficient in terms of computing time as well as the selected candidate models. Moreover, it is not sensitive to the initial models. In the next section, we briefly outline the Occam's window algorithm of Raftery et al. (1997), and the proposed model walking algorithm is introduced in Section 3.
Then, in Section 4, we compare both algorithms via simulation based on regression models. Finally, Section 5 closes with conclusions.

2. Bayesian Model Averaging in Linear Regression

In this section, we give a brief review of the Bayesian model averaging method proposed by Raftery et al. (1997), whose original algorithm is applied to regression models. Let X_1, ..., X_p be all the possible predictive variables in a multiple regression model. Typically, an important practical concern is to select the variables that have a significant effect on the response. Therefore, each predictor may or may not be included in the model. Let M_1, ..., M_K denote all the K = 2^p possible models, and suppose that the set of predictive variables in model M_k is X^{(k)} = {X_1^{(k)}, ..., X_{p_k}^{(k)}} \subseteq {X_1, ..., X_p}. Then the regression model under M_k is written as

Y = \beta_{k0} + \sum_{j=1}^{p_k} \beta_{kj} X_j^{(k)} + \epsilon_k,

where Y is the response variable, \beta_k = (\beta_{k0}, \beta_{k1}, ..., \beta_{kp_k})^t is the regression coefficient vector under M_k, and \epsilon_k is the random error. Let \theta_k be the set of all unknown parameters in M_k and f(y | \theta_k, M_k) be the associated sampling distribution of Y given X^{(k)}. (We assume the predictors are all given throughout the article.) If \pi(\theta_k | M_k) denotes the prior density of \theta_k under M_k, then the posterior density of \theta_k given Y = y is

\pi(\theta_k | y, M_k) = \frac{f(y | \theta_k, M_k) \pi(\theta_k | M_k)}{f(y | M_k)},

where

f(y | M_k) = \int f(y | \theta_k, M_k) \pi(\theta_k | M_k) \, d\theta_k

is the marginal density of the response Y in model M_k, which is the true distribution of Y if model M_k is the true model. Let P(M_k) be the prior probability that model M_k is the true model; then, given Y = y and all predictors,

P(M_k | y) = \frac{f(y | M_k) P(M_k)}{\sum_{l=1}^{K} f(y | M_l) P(M_l)}

is the posterior probability of M_k. The higher this probability, the more plausible M_k is. Denote the quantity of interest by \Delta, such as one or more future observations, and let f(\Delta | M_k, y, \theta_k) be the conditional density of \Delta given y and \theta_k under M_k. Then the predictive density of \Delta in M_k is

f(\Delta | y, M_k) = \int f(\Delta | y, \theta_k, M_k) \pi(\theta_k | y, M_k) \, d\theta_k.

Under M_k, the Bayesian predictive analysis of \Delta can then be derived from this distribution. Thus, one can obtain the overall marginal predictive density, by combining the results from all possible models, as

f(\Delta | y) = \sum_{k=1}^{K} f(\Delta | y, M_k) P(M_k | y).    (1)

Equation (1) is the weighted average of all the predictive densities, with the model posterior probabilities as weights. Note that the computation of (1) over all possible models is time consuming. Madigan and Raftery (1994) use Occam's window to eliminate models with low posterior probabilities so that the final predictive distribution averages over only a smaller set of models, say A. Hence, (1) is replaced by

f(\Delta | y, A) = \sum_{M_k \in A} f(\Delta | y, M_k) P(M_k | y, A),

where

P(M_k | y, A) = \frac{P(M_k | y)}{\sum_{M_l \in A} P(M_l | y)}.

To determine A, they screen all models based on the following principles:

1. A model with posterior probability far below that of the "best" model should be eliminated.
2. A model containing "better" submodel(s) should be eliminated.

Here, the best model means the model with the highest posterior probability, and a submodel means one whose set of predictors is a subset of those in another model. More precisely, Raftery et al. (1997) propose the down- and up-algorithms to implement Occam's razor in regression models.
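As a numerical illustration of the averaging in (1) and of its restriction to a candidate set A, the following sketch (ours, not code from the paper; the function and argument names are our own) computes posterior model probabilities from log marginal likelihoods under equal prior model probabilities and forms a model-averaged predictive mean over A.

```python
import numpy as np

def posterior_model_probs(log_marglik, log_prior=None):
    """Posterior model probabilities P(M_k | y) from log marginal likelihoods.

    Equal prior model probabilities P(M_k) are assumed when log_prior is None.
    """
    log_marglik = np.asarray(log_marglik, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_marglik)
    logw = log_marglik + log_prior
    logw -= logw.max()                      # guard against numerical underflow
    w = np.exp(logw)
    return w / w.sum()

def bma_predictive_mean(pred_means, post_probs, candidate_idx):
    """Model-averaged prediction over a reduced set A, i.e. Eq. (1) with the
    weights renormalized to P(M_k | y, A)."""
    idx = list(candidate_idx)
    w = np.asarray(post_probs, dtype=float)[idx]
    w = w / w.sum()                         # P(M_k | y, A)
    return float(np.dot(w, np.asarray(pred_means, dtype=float)[idx]))
```

Working on the log scale and subtracting the maximum keeps the weights numerically stable even when the marginal likelihoods f(y | M_k) differ by many orders of magnitude.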
Given a set of initial models I, the algorithm begins with the down-algorithm, which examines all submodels of each model in I based on Occam's razor. Assume M_0 \subset M_1 \in I. Occam's razor chooses the model(s) by comparing the posterior odds P(M_1 | y)/P(M_0 | y) with a predetermined interval [O_L, O_R]. The initial model M_1 is selected and M_0 is deleted if P(M_1 | y)/P(M_0 | y) > O_R, while M_0 is kept and M_1 is eliminated if P(M_1 | y)/P(M_0 | y) < O_L. Both models are retained when the posterior odds fall into [O_L, O_R]. Furthermore, if M_1 has already been deleted in any previous step, M_0 is eliminated automatically. The up-algorithm then proceeds, using Occam's razor again to examine all the super-models (models that contain the original one) of each model retained by the down-algorithm. The resulting models form the set A of candidate models. Since this method performs the down- and up-algorithms only once, it may easily miss some probable models unless O_L and O_R are adjusted to enlarge the window size or more models are included in the initial set, in which case the computational cost increases heavily. Details can be found in Madigan and Raftery (1994) and Raftery et al. (1997).

3. Model Walking Algorithm

To increase the computational efficiency, we propose a new algorithm to decide a better set of models over which the average is taken. The basic idea is to begin with one single model and examine only its neighborhood models with relatively higher posterior probabilities. The neighborhood models of a model M are defined by its sub-neighborhood N^-(M) and super-neighborhood N^+(M), respectively, where N^-(M) contains all sub-models of M with exactly one less predictive variable than M, and N^+(M) contains all super-models of M with exactly one more predictive variable than M. Define N^-(M) = \emptyset if M has only one predictor, and N^+(M) = \emptyset if M includes all predictors.

The search is an iterative procedure. Beginning with a prespecified initial model M, we first examine its sub-neighborhood N^-(M). For each M^- \in N^-(M), compare the posterior odds R^- = P(M | y)/P(M^- | y) with a threshold B^-, and R_A^- = P_M / P(M^- | y) with a threshold B_A^-, where P_M is the maximum posterior probability, updated at each iteration, among all models considered so far. If R^- < B^- and R_A^- < B_A^-, then M^- is a "good" model and we put it into C so that its own neighborhood models will be examined later; otherwise, the search from M^- is paused and M^- is put into D. Repeat this procedure for each model in N^+(M) with thresholds B^+ and B_A^+, and move M into D afterwards. Therefore, C contains the good models whose neighborhood models are still to be examined, and D contains the models that either are not good (and will eventually be eliminated) or are good models whose neighborhoods have already been checked. The procedure continues by checking the neighborhood models of the models in C until no good model is left, i.e., C = \emptyset. The resulting D is the set of final candidate models. The final task is to delete the models in D that are not good. Following the first principle of Madigan and Raftery (1994) above, we exclude

D' = \{ M \in D : \max_{M' \in D} P(M' | y) / P(M | y) > C \}    (2)

from D and obtain A = D - D', where C is a predetermined threshold value. The algorithm is as follows.

Step 1: Specification: specify the initial model M and the positive threshold values B^-, B^+, B_A^-, B_A^+, and C.
Step 2: Initialization: let C = \emptyset and P_M = P(M | y).
Step 3: Searching in N^-(M): for each M^- \in N^-(M), replace P_M by max{P_M, P(M^- | y)}. Move M^- to C if P(M | y)/P(M^- | y) < B^- and P_M / P(M^- | y) < B_A^-; otherwise, move M^- to D.
Step 4: Searching in N^+(M): for each M^+ \in N^+(M), proceed as in Step 3 with M^-, N^-(M), B^-, and B_A^- replaced by M^+, N^+(M), B^+, and B_A^+, respectively. Move M to D after the search.
Step 5: Iterative search: if C \neq \emptyset, take M to be any model in C and repeat Steps 3 and 4.
Step 6: Finalization: define A = D - D', where D' is given by (2).

In Step 1, it is recommended to choose an initial model whose predictive variables are highly correlated with the response variable; for example, one may choose the variables with relatively higher sample correlation coefficients computed from the data. The model walking algorithm begins with only one single model but enlarges the search area by "walking" around its neighborhood models, unlike Occam's window, which begins with several models but searches over their sub- and super-models only once. With the limitation imposed by the threshold values (B^- and B^+ control the relative difference of posterior probabilities, while B_A^- and B_A^+ restrict the absolute difference from the most probable model), the algorithm makes a detailed search in an area that is considerably reduced from the set of all possible models.

4. Comparison and Simulation

We compare the two algorithms via simulation in linear regression models. Each comparison study is based on the same simulated datasets. In what follows, there are 10 possible predictive variables (p = 10), and the data are simulated from

Y = \beta_0 \mathbf{1} + \sum_{j=1}^{p} \beta_j X_j + \epsilon,    (3)

where \mathbf{1} = (1, ..., 1)^t is an n-vector of 1's, Y = (Y_1, ..., Y_n)^t, X_j = (X_{1j}, ..., X_{nj})^t, and the random error vector \epsilon = (\epsilon_1, ..., \epsilon_n)^t follows the n-variate multivariate normal distribution N_n(0, \sigma^2 I). Assume that the prior distribution of \beta = (\beta_0, \beta_1, ..., \beta_{10})^t given \sigma^2 is N_{11}(\mu, \sigma^2 V) and that of \sigma^2 is inverse-gamma, IG(\nu/2, \nu\lambda/2), where \mu, V, \nu, and \lambda are hyperparameters. Following Raftery et al. (1997), let \nu = 2.58, \lambda = 0.28, and \mu = (\hat{\beta}_0, 0, ..., 0)^t, in which \hat{\beta}_0 is the usual least squares estimate of the constant term \beta_0. The matrix V is diagonal, chosen so that \sigma^2 V has diagonal elements (\sigma^2 s_Y^2, \sigma^2 \phi^2 s_1^{-2}, \sigma^2 \phi^2 s_2^{-2}, ..., \sigma^2 \phi^2 s_{10}^{-2}), where s_Y^2 and s_i^2 are the sample variances of the response variable and of the predictive variables, respectively, and \phi = 2.85. Under model M_k, let X_{(k)} = (X_{k1}, ..., X_{kp_k}) be the corresponding design matrix; then, given \sigma^2, X_{(k)}, and all the above hyperparameters, the conditional distribution of Y given \sigma^2 is N_n(X_{(k)} \mu, \sigma^2 \Sigma_k), where \Sigma_k = I_n + X_{(k)} V X_{(k)}^t. Therefore, after integrating out \sigma^2 with respect to the IG(\nu/2, \nu\lambda/2) prior distribution, the marginal density of Y under M_k is

f(y | M_k) = \frac{(\nu\lambda)^{\nu/2} \, \Gamma((\nu+n)/2)}{\pi^{n/2} \, \Gamma(\nu/2) \, |\Sigma_k|^{1/2}} \left[ (y - X_{(k)} \mu)^t \Sigma_k^{-1} (y - X_{(k)} \mu) + \nu\lambda \right]^{-(\nu+n)/2}.    (4)

(See Raftery et al. (1997) for details.) Suppose each model has equal prior probability 2^{-10}. To compare P(M_k | y), k = 1, ..., K, we therefore only need to compute (4). We consider various values of n and \sigma in the simulation study. The true values of \beta are given in Table 1, from which we see that the variables X_1 and X_2 have no effect at all on Y.
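To make the computation of (4) concrete, the following sketch (ours, not the authors' code) evaluates log f(y | M_k) for a single model. It assumes that the design matrix Xk already includes the intercept column and that mu_k and Vk are the sub-vector of \mu and the sub-matrix of V corresponding to the coefficients present in M_k.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, Xk, mu_k, Vk, nu=2.58, lam=0.28):
    """Log of the marginal density (4) under the conjugate normal/inverse-gamma prior."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    Sigma_k = np.eye(n) + Xk @ Vk @ Xk.T         # Sigma_k = I_n + X_(k) V X_(k)^t
    _, logdet = np.linalg.slogdet(Sigma_k)
    r = y - Xk @ mu_k
    quad = r @ np.linalg.solve(Sigma_k, r)       # (y - X_(k) mu)^t Sigma_k^{-1} (y - X_(k) mu)
    return (0.5 * nu * np.log(nu * lam)
            + gammaln(0.5 * (nu + n)) - gammaln(0.5 * nu)
            - 0.5 * n * np.log(np.pi)
            - 0.5 * logdet
            - 0.5 * (nu + n) * np.log(quad + nu * lam))
```

Since only ratios of posterior model probabilities enter either algorithm, the common equal prior 2^{-10} cancels, and the log marginal likelihoods can be compared directly.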
The predictive variables {x_ij}, i = 1, ..., n, j = 1, ..., 10, are generated independently from U(0, 30) for each combination of n and \sigma, and the response variables {y_i}, i = 1, ..., n, are then simulated from (3). We apply both the Occam's window and the model walking algorithms to determine the final candidate sets of models, denoted by A_old and A_new, respectively. Five hundred simulation runs are conducted in each case, and the average total probability of the models in A_old^c and in A_new^c is recorded (denoted by L). These two sets contain the models not selected by Occam's window and by the proposed algorithm, respectively, and we refer to this probability as the missing probability from now on. The average number of models searched by each algorithm is also recorded (denoted by V), which roughly represents the computational cost.

Table 1. True values of the regression coefficients

beta_0   beta_1  beta_2  beta_3   beta_4   beta_5   beta_6   beta_7   beta_8    beta_9   beta_10
5.1      0       0       2.459    1.456    0.734    0.456    0.231    0.1342    0.052    0.00321

Table 2. Average numbers of examined models (V) and missing probabilities (L) using the model walking algorithm with a good initial model

  n    sigma | B+/B- = 1/1 (V, L) | 2/2 (V, L) | 3/3 (V, L) | 5/5 (V, L) | 7.5/7.5 (V, L) | 10/10 (V, L)
  40     3   | 34.244, 0.12978 | 38.632, 0.08470 | 42.852, 0.06431 | 52.786, 0.04051 | 65.912, 0.02529 | 78.592, 0.01695
  40     5   | 31.928, 0.14802 | 36.962, 0.10112 | 42.356, 0.07602 | 54.69, 0.04582 | 70.104, 0.02848 | 85.444, 0.01828
  40    10   | 27.414, 0.20324 | 34.864, 0.13180 | 43.936, 0.09382 | 63.144, 0.05775 | 85.088, 0.03468 | 103.052, 0.02274
  40    15   | 30.492, 0.19570 | 37.824, 0.13441 | 44.936, 0.10043 | 63.328, 0.06216 | 89.226, 0.03773 | 112.56, 0.02377
  40    20   | 29.97, 0.21166 | 37.034, 0.14686 | 45.314, 0.11190 | 65.738, 0.06946 | 91.896, 0.04270 | 117.67, 0.02755
  60     3   | 36.506, 0.07941 | 39.344, 0.05509 | 42.558, 0.04066 | 49.342, 0.02814 | 58.078, 0.01879 | 66.532, 0.01306
  60     5   | 32.56, 0.10580 | 36.658, 0.06761 | 40.648, 0.05134 | 49.238, 0.03219 | 59.564, 0.02090 | 69.46, 0.01466
  60    10   | 26.86, 0.14416 | 31.716, 0.09420 | 36.714, 0.07227 | 48.462, 0.04450 | 60.76, 0.02994 | 72.404, 0.02124
  60    15   | 26.634, 0.15511 | 32.324, 0.10577 | 38.03, 0.07944 | 50.898, 0.05178 | 68.482, 0.03271 | 85.338, 0.02230
  60    20   | 24.666, 0.17049 | 30.764, 0.11376 | 37.1, 0.08593 | 50.534, 0.05703 | 68.418, 0.03781 | 87.072, 0.02548
  80     3   | 38.392, 0.05996 | 41.35, 0.03689 | 43.536, 0.02922 | 48.476, 0.01897 | 55.184, 0.01375 | 61.91, 0.01047
  80     5   | 34.344, 0.08607 | 38.258, 0.05401 | 42.086, 0.03902 | 49.794, 0.02391 | 57.092, 0.01700 | 64.734, 0.01229
  80    10   | 27.782, 0.10889 | 31.262, 0.07632 | 35.38, 0.05428 | 44.326, 0.03516 | 54.036, 0.02431 | 61.434, 0.01854
  80    15   | 23.086, 0.13463 | 27.714, 0.09326 | 32.832, 0.06821 | 44.164, 0.04466 | 58.7426, 0.02910 | 71.726, 0.02044
  80    20   | 20.612, 0.13897 | 24.848, 0.09955 | 29.402, 0.07799 | 39.14, 0.05246 | 54.636, 0.03331 | 67.934, 0.02387
 100    15   | 21.678, 0.11906 | 25.73, 0.07985 | 29.968, 0.06177 | 39.258, 0.03945 | 50.454, 0.02635 | 61.962, 0.01842
 100    20   | 19.324, 0.13135 | 23.494, 0.08902 | 28.358, 0.06382 | 37.2, 0.04361 | 48.082, 0.03178 | 61.55, 0.02203
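For reference, the iterative search of Section 3 (Steps 1-6), whose cost is what V measures, can be sketched as follows. This is our reconstruction rather than the authors' code: it assumes a user-supplied function log_post(model) returning log P(M | y) up to an additive constant (for instance, the logarithm of (4) under equal model priors), represents a model as a frozenset of predictor indices, and guesses the bookkeeping detail that a model already classified into C or D is not examined again.

```python
from math import log

def model_walk(log_post, p, init_model, B_minus, B_plus, BA_minus, BA_plus, C_cut):
    """Sketch of the model walking search (Steps 1-6 of Section 3)."""
    cache = {}                                  # models whose posterior has been evaluated
    def lp(m):
        if m not in cache:
            cache[m] = log_post(m)
        return cache[m]

    M = frozenset(init_model)                   # Step 1: initial model
    good, done = set(), set()                   # the sets C and D of Section 3
    log_PM = lp(M)                              # Step 2: running maximum of log P(M | y)
    frontier = [M]
    while frontier:                             # Step 5: iterate while C is non-empty
        M = frontier.pop()
        subs = [M - {j} for j in M] if len(M) > 1 else []       # N^-(M)
        supers = [M | {j} for j in range(p) if j not in M]      # N^+(M)
        for nbrs, B, BA in ((subs, B_minus, BA_minus), (supers, B_plus, BA_plus)):
            for Mn in nbrs:                     # Steps 3 and 4
                if Mn in good or Mn in done:    # assumed: models are not revisited
                    continue
                log_PM = max(log_PM, lp(Mn))
                if lp(M) - lp(Mn) < log(B) and log_PM - lp(Mn) < log(BA):
                    good.add(Mn)
                    frontier.append(Mn)         # its neighborhood will be examined later
                else:
                    done.add(Mn)
        good.discard(M)
        done.add(M)                             # M's neighborhood has now been checked
    best = max(lp(m) for m in done)             # Step 6: drop the set D' of Eq. (2)
    A = [m for m in done if best - lp(m) <= log(C_cut)]
    return A, len(cache)                        # candidate set and number of models examined (V)
```

With C = 20 and B_A^+ = B_A^- = 5C, as used below, the final filter keeps exactly the models within a factor C of the most probable model found, which is the exclusion rule (2).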
For simplicity, we use C = 20, B+ = B-, and B_A^+ = B_A^- = 5C in the model walking algorithm, and some popular choices of O_L and O_R for Occam's window. We first select the initial model to be the one that includes all the predictive variables whose sample correlation coefficients with the response variable are at least 0.3; the resulting values of L and V for the two algorithms are listed in Tables 2 and 3, respectively.

Table 3. Average numbers of examined models (V) and missing probabilities (L) using the up- and down-algorithm with a good initial model

  n    sigma | O_L^{-1}/O_R = 5/5 (V, L) | 10/5 (V, L) | 10/10 (V, L) | 15/10 (V, L) | 15/15 (V, L) | 20/15 (V, L)
  40     3   | 67.594, 0.28011 | 68.262, 0.26847 | 121.488, 0.17138 | 122.094, 0.16660 | 198.894, 0.07027 | 201.478, 0.06810
  40     5   | 63.066, 0.30386 | 64.162, 0.29090 | 119.972, 0.18235 | 120.83, 0.17705 | 201.774, 0.07291 | 202.476, 0.06998
  40    10   | 58.104, 0.35044 | 59.96, 0.33432 | 125.418, 0.20981 | 127.782, 0.20275 | 229.108, 0.08698 | 230.8, 0.08309
  40    15   | 51.612, 0.37831 | 55.804, 0.36611 | 110.67, 0.24235 | 115.484, 0.23720 | 230.896, 0.11252 | 241.72, 0.10915
  40    20   | 55.6, 0.39800 | 62.304, 0.38259 | 136.148, 0.25404 | 146.446, 0.24705 | 306.692, 0.12059 | 321.204, 0.11663
  60     3   | 67.15, 0.24025 | 67.532, 0.22667 | 105.592, 0.15791 | 106.264, 0.15223 | 157.858, 0.09760 | 157.858, 0.09497
  60     5   | 59.668, 0.26067 | 60.176, 0.24742 | 98.962, 0.16460 | 99.294, 0.16071 | 150.014, 0.10031 | 150.148, 0.09773
  60    10   | 49.964, 0.31177 | 50.562, 0.29659 | 87.398, 0.20517 | 88.294, 0.20024 | 147.306, 0.12902 | 148.242, 0.12663
  60    15   | 41.374, 0.34818 | 43.882, 0.33114 | 77.226, 0.22982 | 79.274, 0.22409 | 135.314, 0.14841 | 137.546, 0.14510
  60    20   | 40.148, 0.36042 | 43.268, 0.35022 | 80.44, 0.24696 | 83.462, 0.24225 | 148.448, 0.16468 | 153.136, 0.16049
  80     3   | 69.268, 0.20147 | 69.334, 0.19267 | 100.738, 0.14164 | 100.932, 0.13712 | 136.582, 0.09930 | 136.832, 0.09697
  80     5   | 63.794, 0.23043 | 63.916, 0.21748 | 93.89, 0.15354 | 94.184, 0.14874 | 132.898, 0.10282 | 133.34, 0.10038
  80    10   | 49.004, 0.27585 | 49.49, 0.26183 | 78.884, 0.19165 | 79.214, 0.18722 | 121.774, 0.13375 | 121.774, 0.13198
  80    15   | 34.748, 0.32236 | 36.456, 0.30685 | 61.482, 0.21879 | 62.92, 0.21410 | 98.848, 0.15713 | 100.446, 0.15362
  80    20   | 31.27, 0.34055 | 33.416, 0.32687 | 58.58, 0.23541 | 61.13, 0.22984 | 100.292, 0.16942 | 1001.762, 0.16712
 100    15   | 33.234, 0.29871 | 34.75, 0.28532 | 53.134, 0.20785 | 54.696, 0.20203 | 79.648, 0.15644 | 80.73, 0.15383
 100    20   | 30.196, 0.31463 | 31.732, 0.30155 | 51.284, 0.22572 | 53.01, 0.21984 | 82.482, 0.16492 | 83.81, 0.16258
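The "good" initial model used for Tables 2 and 3 can be formed as in the following short sketch (ours; whether the correlation is taken in absolute value is our assumption), which keeps every predictor whose sample correlation with the response is at least 0.3.

```python
import numpy as np

def initial_model_by_correlation(X, y, threshold=0.3):
    """Indices of the predictors whose absolute sample correlation with the
    response is at least `threshold`; used as the starting model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return frozenset(np.flatnonzero(np.abs(corr) >= threshold).tolist())
```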
In contrast, we also consider the initial model to be the least appropriate one, containing only the variables X_1 and X_2; the corresponding results are shown in Tables 4 and 5. We see from the tables that, in both algorithms, the missing probabilities increase as sigma increases for each n; conversely, for fixed sigma, they all decrease as the sample size increases.

Table 4. Average numbers of examined models (V) and missing probabilities (L) using the model walking algorithm with an inappropriate initial model

  n    sigma | B+/B- = 1/1 (V, L) | 2/2 (V, L) | 3/3 (V, L) | 5/5 (V, L) | 7.5/7.5 (V, L) | 10/10 (V, L)
  40     3   | 67.562, 0.12988 | 73.092, 0.08274 | 78.196, 0.05988 | 89.736, 0.03875 | 105.482, 0.02480 | 122.022, 0.01597
  40     5   | 64.754, 0.14400 | 70.122, 0.10193 | 76.512, 0.07462 | 90.218, 0.04610 | 107.898, 0.02874 | 125.682, 0.01819
  40    10   | 58.824, 0.19595 | 66.346, 0.12872 | 75.846, 0.09470 | 94.622, 0.05909 | 117.122, 0.03460 | 134.938, 0.02300
  40    15   | 53.348, 0.22335 | 61.364, 0.14962 | 71.734, 0.11041 | 95.33, 0.06610 | 123.656, 0.03861 | 148.946, 0.03409
  40    20   | 50.654, 0.23925 | 57.962, 0.16863 | 69.854, 0.12196 | 94.524, 0.07358 | 125.71, 0.04289 | 151.956, 0.02714
  60     3   | 71.07, 0.07973 | 75.27, 0.05135 | 78.866, 0.03936 | 85.78, 0.02734 | 97.036, 0.01849 | 107.534, 0.01360
  60     5   | 66.874, 0.10301 | 70.794, 0.07424 | 75.622, 0.05437 | 84.914, 0.03408 | 97.32, 0.02179 | 108.76, 0.01524
  60    10   | 60.634, 0.14656 | 66.456, 0.09549 | 72.766, 0.06876 | 85.118, 0.04329 | 99.706, 0.02856 | 112.656, 0.02057
  60    15   | 54.54, 0.17880 | 60.98, 0.11569 | 68.276, 0.08352 | 83.144, 0.05251 | 101.53, 0.03309 | 119.888, 0.02234
  60    20   | 50.954, 0.17862 | 56.358, 0.12415 | 62.596, 0.09372 | 76.62, 0.06068 | 94.906, 0.03929 | 112.65, 0.02707
  80     3   | 73.688, 0.05504 | 76.792, 0.03541 | 79.538, 0.02762 | 85.048, 0.02009 | 93.708, 0.01414 | 101.23, 0.01068
  80     5   | 69.69, 0.08219 | 73.94, 0.05170 | 77.938, 0.03671 | 85.736, 0.02422 | 95.814, 0.01637 | 104.588, 0.01254
  80    10   | 62.318, 0.10697 | 65.988, 0.07685 | 70.364, 0.05581 | 79.718, 0.03611 | 90.972, 0.02410 | 101.486, 0.01693
  80    15   | 56.652, 0.13749 | 62.096, 0.09203 | 67.368, 0.06886 | 78.784, 0.04500 | 94.004, 0.02933 | 107.82, 0.02062
  80    20   | 522.392, 0.15773 | 57.248, 0.10515 | 62.754, 0.7829 | 75.966, 0.04949 | 91.086, 0.03273 | 105.33, 0.02292
 100    15   | 57.834, 0.11592 | 62.29, 0.07490 | 66.582, 0.05615 | 75.438, 0.03823 | 86.912, 0.02603 | 98.214, 0.02200
 100    20   | 53.794, 0.12997 | 58.472, 0.08486 | 63.384, 0.06416 | 72.792, 0.04300 | 84.97, 0.03010 | 96.122, 0.02200
Table 5. Average numbers of examined models (V) and missing probabilities (L) using the up- and down-algorithm with an inappropriate initial model

  n    sigma | O_L^{-1}/O_R = 5/5 (V, L) | 10/5 (V, L) | 10/10 (V, L) | 15/10 (V, L) | 15/15 (V, L) | 20/15 (V, L)
  40     3   | 248.048, 0.23420 | 248.762, 0.22099 | 525.324, 0.13124 | 525.324, 0.12772 | 887.326, 0.06049 | 887.326, 0.05869
  40     5   | 237.278, 0.25844 | 237.878, 0.24112 | 522.872, 0.13363 | 522.872, 0.12868 | 891.446, 0.05733 | 891.446, 0.05422
  40    10   | 213.772, 0.29613 | 213.772, 0.28285 | 497.782, 0.14950 | 497.782, 0.14375 | 885.754, 0.05866 | 885.754, 0.05504
  40    15   | 182.064, 0.32598 | 182.064, 0.31193 | 458.644, 0.17472 | 458.644, 0.16970 | 866.938, 0.06583 | 866.938, 0.06314
  40    20   | 157.974, 0.35110 | 158.604, 0.33775 | 429.696, 0.19222 | 429.696, 0.18749 | 858.76, 0.07325 | 858.76, 0.07075
  60     3   | 224.866, 0.20966 | 225.508, 0.19224 | 427.06, 0.12905 | 427.06, 0.12417 | 681.566, 0.08244 | 681.566, 0.07981
  60     5   | 212.31, 0.22515 | 212.31, 0.21514 | 410.09, 0.13238 | 410.886, 0.12727 | 676.256, 0.07607 | 676.256, 0.07338
  60    10   | 185.55, 0.26895 | 185.636, 0.25314 | 369.498, 0.15731 | 369.498, 0.15226 | 641.302, 0.08633 | 641.302, 0.08391
  60    15   | 157.024, 0.30426 | 157.244, 0.29094 | 338.634, 0.18463 | 339.676, 0.17751 | 613.88, 0.10093 | 614.434, 0.09662
  60    20   | 132.132, 0.32129 | 132.216, 0.31051 | 298.628, 0.20177 | 298.628, 0.19811 | 587.51, 0.11350 | 588.194, 0.11059
  80     3   | 218.548, 0.18899 | 219.276, 0.17222 | 377.378, 0.12042 | 377.378, 0.11526 | 577.768, 0.08295 | 577.768, 0.08045
  80     5   | 213.88, 0.20351 | 214.02, 0.18859 | 371.8, 0.12455 | 371.8, 0.12039 | 566.414, 0.08169 | 566.414, 0.07958
  80    10   | 182.7, 0.24378 | 182.7, 0.23068 | 330.774, 0.14862 | 330.774, 0.14378 | 528.548, 0.09428 | 528.548, 0.09218
  80    15   | 147.086, 0.27914 | 147.392, 0.26514 | 285.134, 0.17571 | 285.134, 0.17181 | 494.258, 0.10821 | 494.258, 0.10585
  80    20   | 129.954, 0.29758 | 130.004, 0.28543 | 256.368, 0.19325 | 256.368, 0.18615 | 455.82, 0.12315 | 437.48, 0.10943
 100    15   | 151.22, 0.26339 | 151.412, 0.24643 | 268.802, 0.16842 | 268.802, 0.16292 | 437.48, 0.11185 | 437.48, 0.10943
 100    20   | 129.192, 0.28102 | 129.192, 0.26847 | 234.342, 0.18811 | 234.758, 0.18231 | 389.15, 0.1270 | 389.15, 0.12465

In the first case, when the initial model is chosen meaningfully, the proposed algorithm yields missing probabilities almost all under 10%, except in small samples or in the more uncertain cases with B+ = 1. More models are examined for larger threshold values, and therefore fewer models are missed.
On the other hand, Occam's window cannot ensure a missing probability under 10% even with O_L = 1/15 and O_R = 15, and it must examine between 121 and 321 models on average when n = 40. The model walking algorithm, in contrast, searches at most 117 models with less than 3% missing probability in this extreme case. It also performs much better than Occam's window when the initial model is poorly selected, as shown in Tables 4 and 5. Again, almost all settings of B- yield missing probabilities under 10% (in fact, only 4 are slightly above 10% when B+/B- >= 2).

Table 6. Average numbers of examined models (V) and missing probabilities (L) using the model walking algorithm with the maximum initial model

  n    sigma | B+/B- = 1/1 (V, L) | 2/2 (V, L) | 3/3 (V, L) | 5/5 (V, L) | 7.5/7.5 (V, L) | 10/10 (V, L)
  40     3   | 59.108, 0.10650 | 64.786, 0.06258 | 67.985, 0.04507 | 74.396, 0.02876 | 84.489, 0.01804 | 94.036, 0.01214
  40     5   | 61.176, 0.13057 | 68.01, 0.07980 | 72.228, 0.05682 | 81.656, 0.03617 | 95.172, 0.02155 | 106.838, 0.01400
  40    10   | 67.338, 0.19582 | 77.33, 0.11773 | 86.076, 0.08076 | 101.784, 0.05043 | 121.064, 0.03020 | 136.144, 0.01991
  40    15   | 75.216, 0.21923 | 86.994, 0.13455 | 97.322, 0.09320 | 117.428, 0.05672 | 144.344, 0.03302 | 168.394, 0.02019
  40    20   | 77.914, 0.23237 | 91.18, 0.14128 | 101.526, 0.10251 | 122.41, 0.06521 | 149.694, 0.03986 | 174.608, 0.02499
  60     3   | 54.572, 0.07535 | 58.904, 0.04216 | 61.832, 0.02968 | 67.114, 0.01774 | 73.356, 0.01190 | 79.08, 0.00861
  60     5   | 57.946, 0.09837 | 63.068, 0.05634 | 66.926, 0.03795 | 74.07, 0.02221 | 82.48, 0.01424 | 90.148, 0.00970
  60    10   | 64.18, 0.14572 | 70.99, 0.08885 | 76.366, 0.06211 | 87.308, 0.03753 | 99.888, 0.02325 | 110.06, 0.01652
  60    15   | 71.028, 0.17477 | 80.216, 0.10386 | 86.51, 0.07535 | 100.406, 0.04635 | 117.318, 0.02883 | 132.622, 0.01983
  60    20   | 74.168, 0.18092 | 82.7, 0.11403 | 89.424, 0.08257 | 102.846, 0.05274 | 120.78, 0.03395 | 136.998, 0.01994
  80     3   | 49.87, 0.05197 | 52.932, 0.03064 | 55.02, 0.02151 | 59.176, 0.01296 | 63.79, 0.00876 | 67.614, 0.00642
  80     5   | 55.774, 0.08195 | 60.6, 0.04160 | 63.748, 0.02804 | 69.022, 0.01787 | 75.672, 0.01139 | 81.234, 0.00805
  80    10   | 62.068, 0.11696 | 68.326, 0.06840 | 71.812, 0.05055 | 80.65, 0.02939 | 90.344, 0.01882 | 98.458, 0.01362
  80    15   | 68.29, 0.15184 | 76.03, 0.08995 | 82.152, 0.06116 | 93.326, 0.03713 | 106.16, 0.02456 | 118.232, 0.01725
  80    20   | 72.022, 0.16170 | 80.026, 0.09551 | 85.902, 0.07064 | 97.71, 0.04298 | 113.002, 0.02780 | 124.756, 0.01994
 100    15   | 66.05, 0.12989 | 72.868, 0.07369 | 77.666, 0.05216 | 86.7, 0.03221 | 97.764, 0.02106 | 106.792, 0.01994
 100    20   | 69.758, 0.14975 | 77.598, 0.08687 | 83.506, 0.05804 | 93.282, 0.03680 | 103.704, 0.02593 | 115.27, 0.01812
The corresponding searching procedures are relatively longer, but still require far fewer models than Occam's window, which searches as many as 891 models. Overall, the computational cost of Occam's window is between 4 and 5 times that of the model walking algorithm, and it misses more models on average. Moreover, the proposed algorithm appears to be less sensitive to the choice of the initial model.

Table 7. Average numbers of examined models (V) and missing probabilities (L) using the up- and down-algorithm with the maximum initial model

  n    sigma | O_L^{-1}/O_R = 5/5 (V, L) | 10/5 (V, L) | 10/10 (V, L) | 15/10 (V, L) | 15/15 (V, L) | 20/15 (V, L)
  40     3   | 293.016, 0.31055 | 313.242, 0.29577 | 313.286, 0.20430 | 322.874, 0.19955 | 322.932, 0.09286 | 328.91, 0.09072
  40     5   | 343.512, 0.32831 | 372.353, 0.31185 | 372.452, 0.20319 | 389.134, 0.19630 | 389.512, 0.08936 | 401.422, 0.08537
  40    10   | 528.434, 0.35868 | 572.438, 0.34236 | 572.916, 0.21803 | 595.132, 0.21088 | 595.056, 0.09496 | 610.508, 0.09009
  40    15   | 664.596, 0.37069 | 707.718, 0.35566 | 708.252, 0.22798 | 731.3, 0.21961 | 732.926, 0.09820 | 748.134, 0.09400
  40    20   | 774.768, 0.39011 | 819.124, 0.37608 | 820.222, 0.24465 | 842.342, 0.23844 | 844.912, 0.11184 | 859.116, 0.10689
  60     3   | 229.32, 0.25990 | 248.666, 0.24404 | 248.666, 0.18507 | 257.56, 0.17897 | 257.56, 0.12729 | 263.092, 0.12488
  60     5   | 278.816, 0.28134 | 297.336, 0.26694 | 297.342, 0.18603 | 305.692, 0.18199 | 305.714, 0.12227 | 312.33, 0.11950
  60    10   | 449.434, 0.32147 | 483.616, 0.30745 | 483.82, 0.21313 | 500.094, 0.20814 | 500.326, 0.13628 | 510.99, 0.13337
  60    15   | 609.868, 0.34335 | 650.204, 0.32732 | 650.316, 0.23190 | 668.392, 0.22594 | 668.612, 0.14753 | 681.082, 0.14456
  60    20   | 705.598, 0.36000 | 742.426, 0.34731 | 742.782, 0.24655 | 761.002, 0.24178 | 761.536, 0.16071 | 774.04, 0.16387
  80     3   | 180.128, 0.23036 | 192.59, 0.21832 | 192.59, 0.16973 | 200.96, 0.16371 | 200.976, 0.12678 | 206.304, 0.12397
  80     5   | 248.984, 0.25331 | 265.116, 0.24256 | 265.126, 0.17713 | 273.614, 0.17212 | 273.63, 0.12736 | 280.222, 0.12471
  80    10   | 387.436, 0.29592 | 416.206, 0.28276 | 416.296, 0.20140 | 432.946, 0.19555 | 433.108, 0.14045 | 444.092, 0.13748
  80    15   | 550.41, 0.31705 | 585.792, 0.30213 | 585.856, 0.21948 | 605.288, 0.21432 | 605.43, 0.15578 | 618.888, 0.52215
  80    20   | 662.914, 0.33015 | 695.08, 0.31711 | 695.158, 0.23029 | 711.986, 0.22560 | 712.094, 0.16675 | 724.09, 0.16387
 100    15   | 498.464, 0.29722 | 535.42, 0.28104 | 535.442, 0.21065 | 551.79, 0.20536 | 551.868, 0.15577 | 563.792, 0.15266
 100    20   | 625.626, 0.30846 | 661.12, 0.29447 | 661.144, 0.22128 | 678.366, 0.21622 | 678.4, 0.16722 | 689.386, 0.16445
Table 8. Average numbers of examined models (V) and missing probabilities (L) using the model walking algorithm with the minimum initial model

  n    sigma | B+/B- = 1/1 (V, L) | 2/2 (V, L) | 3/3 (V, L) | 5/5 (V, L) | 7.5/7.5 (V, L) | 10/10 (V, L)
  40     3   | 51.806, 0.11750 | 56.672, 0.07940 | 61.942, 0.06254 | 73.922, 0.04111 | 90.882, 0.02657 | 109.174, 0.01686
  40     5   | 49.202, 0.15417 | 55.68, 0.09872 | 62.78, 0.06916 | 76.702, 0.04616 | 96.58, 0.02779 | 113.858, 0.01867
  40    10   | 43.406, 0.20128 | 51.554, 0.12835 | 61.218, 0.09225 | 80.412, 0.05814 | 105.462, 0.03405 | 126.066, 0.02220
  40    15   | 38.19, 0.21734 | 46.744, 0.14651 | 56.638, 0.10802 | 79.854, 0.06760 | 109.182, 0.03925 | 136.048, 0.02421
  40    20   | 34.12, 0.24117 | 42.356, 0.16898 | 54.116, 0.12317 | 81.196, 0.07324 | 113.266, 0.04284 | 140.416, 0.02710
  60     3   | 55.95, 0.07823 | 59.866, 0.04970 | 63.7, 0.03776 | 71.532, 0.02604 | 82.434, 0.01819 | 93.946, 0.01325
  60     5   | 51.858, 0.10665 | 56.274, 0.07204 | 61.33, 0.05323 | 72.14, 0.03353 | 86.796, 0.02120 | 99.406, 0.01466
  60    10   | 45.142, 0.14406 | 51.044, 0.09166 | 56.712, 0.07034 | 69.876, 0.04484 | 85.756, 0.02908 | 100.392, 0.02036
  60    15   | 39.416, 0.17280 | 45.696, 0.11601 | 52.558, 0.08526 | 69.936, 0.05177 | 90.77, 0.03307 | 109.504, 0.02241
  60    20   | 35.626, 0.18278 | 41.752, 0.12588 | 49.138, 0.09394 | 65.706, 0.06056 | 86.646, 0.03825 | 95.112, 0.02631
  80     3   | 58.226, 0.05573 | 61.41, 0.03540 | 64.582, 0.02772 | 71.108, 0.01927 | 79.842, 0.01420 | 89.38, 0.01067
  80     5   | 53.544, 0.08181 | 57.718, 0.04994 | 61.768, 0.03598 | 69.71, 0.02413 | 79.46, 0.01654 | 90.01, 0.01222
  80    10   | 47.074, 0.11202 | 51.654, 0.07696 | 56.158, 0.05692 | 66.544, 0.03659 | 78.586, 0.02481 | 89.61, 0.01798
  80    15   | 40.962, 0.14380 | 46.084, 0.09478 | 51.958, 0.06917 | 64.422, 0.04419 | 79.522, 0.02957 | 95.112, 0.02058
  80    20   | 36.992, 0.15426 | 42.204, 0.10549 | 47.764, 0.08122 | 61.538, 0.05012 | 78.078, 0.03304 | 93.142, 0.02330
 100    15   | 42.592, 0.11700 | 47.308, 0.07858 | 52.322, 0.05822 | 63.508, 0.03679 | 75.768, 0.02603 | 87.974, 0.01904
 100    20   | 37.874, 0.13140 | 42.246, 0.08629 | 47.27, 0.06436 | 57.094, 0.04363 | 70.304, 0.0287 | 83.1, 0.02039

To further examine the sensitivity to the initial models, we additionally consider the two most extreme initial models, namely the one with all 10 predictive variables (the maximum model) and the one without any predictors (the minimum model). Tables 6-9 demonstrate the corresponding results. We see once again that the results in Tables 6 and 8 do not differ much.
This confirms that the proposed method is indeed not much influenced by the specified initial model. It is also observed in Tables 7 and 9 that, when the predetermined windows O_L^{-1}/O_R are small, the up- and down-algorithm seems less sensitive to the two extreme initial models, but its computational cost as well as its missing probabilities remain substantially large, while for larger O_L^{-1}/O_R, using the maximum initial model requires scanning considerably more models. One remark is that, when sigma increases for fixed n, the up- and down-algorithm tends to search fewer models with higher missing probabilities, particularly for large n, whereas increasing the sample size with sigma fixed improves neither the number of examined models nor the missing probabilities. The number of models searched by the proposed algorithm may not be reduced as much, but its missing probabilities are all within acceptable ranges.

Table 9. Average numbers of examined models (V) and missing probabilities (L) using the up- and down-algorithm with the minimum initial model

  n    sigma | O_L^{-1}/O_R = 5/5 (V, L) | 10/5 (V, L) | 10/10 (V, L) | 15/10 (V, L) | 15/15 (V, L) | 20/15 (V, L)
  40     3   | 246.04, 0.23243 | 246.04, 0.22267 | 525.806, 0.13172 | 525.806, 0.12769 | 886.788, 0.05628 | 886.788, 0.05411
  40     5   | 236.666, 0.25037 | 236.666, 0.23812 | 522.346, 0.13148 | 522.346, 0.12727 | 886.956, 0.05440 | 886.956, 0.05166
  40    10   | 213.55, 0.29414 | 213.55, 0.27842 | 500.814, 0.14667 | 500.814, 0.14058 | 886.424, 0.05437 | 886.424, 0.05158
  40    15   | 185.952, 0.32357 | 185.952, 0.31147 | 460.214, 0.17002 | 460.214, 0.16552 | 869.598, 0.06414 | 869.598, 0.02421
  40    20   | 162.008, 0.35031 | 162.008, 0.33892 | 436.834, 0.19139 | 436.834, 0.18686 | 864.152, 0.07178 | 864.152, [missing]
  60     3   | 220.686, 0.21143 | 220.686, 0.19254 | 412.27, 0.12626 | 412.27, 0.12158 | 665.526, 0.08150 | 665.526, 0.07913
  60     5   | 210.428, 0.22620 | 210.428, 0.21209 | 412.026, 0.12937 | 412.026, 0.12445 | 675.022, 0.07473 | 675.022, 0.07305
  60    10   | 184.044, 0.26907 | 184.044, 0.25347 | 378.278, 0.15398 | 378.278, 0.14863 | 649.65, 0.08502 | 649.65, 0.08175
  60    15   | 156.49, 0.29563 | 156.49, 0.28152 | 345.606, 0.17538 | 345.606, 0.17092 | 623.14, 0.09767 | 623.14, 0.09558
  60    20   | 143.44, 0.31488 | 143.44, 0.30366 | 315.502, 0.19648 | 315.502, 0.19260 | 588.818, 0.11300 | 588.818, 0.11076
  80     3   | 226.874, 0.18316 | 226.874, 0.17140 | 387.16, 0.11941 | 387.16, 0.11476 | 584.034, 0.08227 | 584.034, 0.07998
  80     5   | 206.234, 0.20389 | 206.234, 0.19039 | 359.69, 0.12670 | 359.69, 0.12372 | 569.504, 0.08370 | 569.504, 0.08182
  80    10   | 179.046, 0.24765 | 179.046, 0.23340 | 337.254, 0.15022 | 337.254, 0.14597 | 543.95, 0.09314 | 543.95, 0.09023
  80    15   | 154.592, 0.27309 | 154.592, 0.25945 | 294.214, 0.17446 | 294.214, 0.16941 | 500.766, 0.10718 | 500.766, 0.10530
  80    20   | 131.624, 0.29611 | 131.624, 0.28350 | 262.42, 0.19008 | 262.42, 0.18615 | 461.122, 0.11977 | 461.122, 0.11771
 100    15   | 152.15, 0.26082 | 152.15, 0.24644 | 270.342, 0.16966 | 270.342, 0.16449 | 433.338, 0.11296 | 433.338, 0.10969
 100    20   | 128.368, 0.27584 | 128.368, 0.26427 | 234.288, 0.18635 | 234.288, 0.18336 | 393.43, 0.12724 | 393.43, 0.12509

5. Concluding Remark

Variable selection is one of the important issues in regression analysis. However, a single best model may not exist or, on the other hand, there may be many plausible models in practice. Using only one model may severely ignore the problem of model uncertainty and cause inaccurate predictions. The Bayesian model averaging method is an alternative from the prediction perspective. However, the computational cost of averaging over all possible models becomes prohibitive when the number of predictors is large.
We propose the model walking algorithm, which excludes models with small posterior probabilities to increase the computational efficiency. Simulation results show that the proposed method is successful in terms of both missing probabilities and computational time compared to the Occam's window method using the down- and up-algorithms, and it is also less sensitive to the choice of the initial model. The proposed algorithm can be extended to more sophisticated models such as longitudinal regression models or time series regression models. More general models, such as regression models with t error distributions or generalized linear models, may also be considered in conjunction with Markov chain Monte Carlo methods.

References

Barnard, G. A. (1963). New methods of quality control. Journal of the Royal Statistical Society, Series A 126:255-258.
Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B 57:45-97.
Kass, R. E., Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90:773-795.
Madigan, D., Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89:1535-1546.
Madigan, D., York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review 63:215-232.
Montgomery, D. C., Peck, E. A., Vining, G. G. (2006). Introduction to Linear Regression Analysis. 4th ed. New York: Wiley.
Raftery, A. E., Madigan, D., Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92:179-191.