Matching and Nonparametric IV Estimation, a Distance-Based Measure of Migration, and the Wages of Young Men ∗ John C. Ham University of Maryland, IZA and IRP (UW-Madison) Xianghong Li York University Patricia B. Reagan Ohio State University First Draft: September 2001 This Draft: July 2009 ∗ Corresponding author: John C. Ham; email: ham@econ.umd.edu. An earlier version of this paper was presented under the title “Matching and Selection Estimates of the Effect of Migration on Wages for Young Men.” We would like to thank Dwayne Benjamin, Stephen Cosslett, Songnian Chen, Xiaohong Chen, William Greene, Cheng Hsiao, Guido Imbens, John Kennan, Lung-fei Lee, Audrey Light, Geert Ridder, Barbara Sianesi, Aloysius Siow, Jeffrey Smith, Petra Todd, Insan Tunali, Bruce Weinberg, Tiemen Woutersen, James Walker and Jeff Yankow for very helpful comments and discussions. We are also grateful to seminar participants at Arizona, CEMFI, the Federal Reserve Bank of New York, McGill, University of Montreal, McMaster, Minnesota (Industrial Relations), Ohio State, Pompeu Fabra, Rutgers, Toronto, UC Berkeley, UC Davis, UC San Diego, USC, Vanderbilt, Western Ontario, Wisconsin and the Upjohn Institute, as well as at the Econometric Society and Society of Labor Economics meetings. We especially thank two anonymous referees and a Co-Editor for very helpful comments which lead to a much improved paper. Eileen Kopchik and Yong Yu provided outstanding programming help. This research was partially supported by the NSF and NICHD. We are responsible for all errors. This paper reflects the views of the authors and in no way represents the views of the NSF or NICHD. ABSTRACT Our paper estimates the effect of U.S. internal migration on wage growth for young men between their first and second job. Our analysis of migration extends previous research by: i) exploiting the distance-based measures of migration in the National Longitudinal Surveys of Youth 1979 (NLSY79); ii) allowing the effect of migration to differ by schooling level and iii) using propensity score matching to estimate the average treatment effect on the treated (ATET) for movers and iv) using local average treatment effect (LATE) estimators with covariates to estimate the average treatment effect (ATE) and ATET for compliers. We believe the Conditional Independence Assumption (CIA) is reasonable for our matching estimators since the NLSY79 provides a rich array of variables on which to match. Our matching methods are based on local linear, local cubic, and local linear ridge regressions. However, because it is unknown whether the standard n -bootstrap is appropriate for these matching estimators, we use an algorithm from Bickel and Sakov (2008) to implement the ‘m out of n’ bootstrap, which is valid under much weaker conditions. We find that the use of the standard n -bootstrap is appropriate in our application. Local linear and local ridge regression matching produce relatively similar point estimates and standard errors, while local cubic regression matching badly over-fits the data and provides very noisy estimates. From the matching estimators we find a significant positive effect of migration on the wage growth of college graduates, and a marginally significant negative effect for high school dropouts. We do not find any significant effects for other educational groups or for the overall sample. Our results are generally robust to changes in the model specification and changes in our distance-based measure of migration. We find that better data matters; if we use a measure of migration based on moving across county lines, we overstate the number of moves, while if we use a measure based on moving across state lines, we understate the number of moves. Further, using either the county or state measures leads to much less precise estimates. We also consider non-parametric LATE estimators with covariates (Frölich 2007), using two sets of instrumental variables. We precisely estimate the proportion of compliers in our data, but because we have small number of compliers (given the estimated proportion of compliers), we cannot obtain precise LATE estimates. 2 1. Introduction Internal migration is an important economic phenomenon in the United States. Between 2002 and 2003, about 40.1 million Americans moved, and about 60 percent of these movers were 20 to 29 years old.1 Labor economists typically model migration as an investment in human capital, but evidence on whether moving increases wages is mixed. By using data on young men from the 19791996 waves of the National Longitudinal Surveys of Youth 1979 (NLSY79), we attempt in this paper to identify the average initial, or contemporaneous, wage gain from U.S. internal migration for both those who move, and those who are at the margin of deciding to move and whose moving decision would react to an exogenous change in moving costs (referred to as compliers in LATE literature). We contribute to the migration literature in several ways. First, we allow migration effects to differ across educational groups and find that this distinction is important.2 Previous studies pool different educational groups to estimate average returns for all migrants. If returns on migration are positive for some education group(s), such as college graduates, and negative or zero for other groups, then the overall sample average may be statistically indistinguishable from zero even though individual components are nonzero. Second, we use a distance-based measure of migration in the NLSY79, instead of a measure based on moving across a state or county line. We define migration as having occurred if the respondent moved at least 50 miles, or changed Metropolitan Statistical Area (MSA) and moved at least 20 miles. Compared to a measure based on moving across a state or county line, the measures commonly used in the literature, the distance-based measure of migration corresponds more closely to the theoretical notion of changing local labor markets described by Hanushek (1973). We find that measuring migration by changing state underestimates migration by about 36%, and measuring migration by changing county overestimates migration by about 43%. Further, we obtain substantially less precise estimates using either of these alternative measures of migration. We also address the fact that those who move are a nonrandomly selected sample in two ways. First, we use propensity score matching to estimate an ATET of migration, i.e. we investigate the effect of migration on those who move. Second, we adopt a nonparametric instrumental variable approach to estimate local average treatment effects, which represent the effect of migration for the subpopulation who are at the margin of deciding to move (and who are compliers in the sense that they would react to an exogenous change in moving costs). One may prefer Frölich’s non- parametric IV estimator since it only requires an exclusion restriction (and we have one that seems plausible in our application), while matching requires invoking a conditional independence See http://www.census.gov/prod/2004pubs/p20-549.pdf To the best of our knowledge, Yankow (1999) is the only other author who allows migration effects to differ by education. 1 2 3 assumption (CIA). However, we argue in some detail below that the CIA is appropriate in our application since we use difference-in-difference matching and condition on a rich set of variables from the NLSY79. We consider propensity score matching estimators based on local linear, local cubic, and local linear ridge regressions. Since there are no theoretical results justifying the use of the standard bootstrap to estimate the standard errors for these matching estimators, we use an algorithm from Bickel and Saklov (2008) to implement the ‘m out of n’ bootstrap3, which is valid under much weaker assumptions than the standard bootstrap (Politis and Romano 1994 and Bickel et al. 1997). Interestingly, the Bickel-Saklov algorithm indicates that it is appropriate to use the standard bootstrap in our application. We use the Andrews-Buchinsky algorithm (Andrews and Buchinsky 2000, 2001) to choose the number of replications in the standard bootstrap, and in two cases below, the necessary number of replications is much larger than that usually used by applied researchers. We obtain relatively precise estimates of the ATET for each educational group from matching estimators, and our results are not sensitive to whether we use local linear versus local linear ridge matching; local cubic matching badly over-fits the data resulting in large standard errors. Further, our results are robust to minor changes in the propensity score specification, bandwidth choice, trimming levels, and reasonable changes in our distance-based measure of migration. We find a statistically significant and positive ATET effect among college graduates of around 10 percent, and a marginally significant negative effect for high school dropouts of about –12 percent. We do not find a statistically significant migration effect for the overall sample or the other educational groups (high school graduates and those with some college). Since we estimate a contemporaneous effect of migration on wage growth, insignificant or negative contemporaneous effects do not necessarily imply that migration is an irrational decision from a human capital perspective. Instead, it may simply indicate that, at least for certain individuals, there is an assimilation process where a short-term wage loss might be dominated by a greater long-term wage gain later. Our estimates and standard errors are somewhat sensitive to switching to a measure of migration based on crossing county or state lines, and under either of these alternative measures our ATET estimates become smaller in absolute value and statistically insignificant for college graduates and high school dropouts. Finally, we use Frolich’s nonparametric instrumental variable estimator to estimate the treatment effects for the compliers, and this estimator also provides an estimate of the proportion of compliers in our data. We estimate (relatively precisely) the proportion of compliers in our data, but Bickel and Sakov note that their algorithm is originally proposed by Bickel, Götze, and van Zwet (personal communication). 3 4 can estimate neither the ATE nor the ATET for compliers accurately since we have too few compliers in our data. Our paper is organized as follows: We review the migration literature in Section 2. In Section 3 we present our econometric approach. In Section 4 we describe our data, and our empirical results are presented in Section 5. Section 6 concludes the paper. 2. A Brief Review of the Migration Literature The most common theoretical model of migration treats the decision to migrate as an investment in human capital: individuals migrate if the present value of real income in a destination minus the cost of moving exceeds what could be earned at the place of origin (Sjaastad 1962).4 For our purposes, the empirical studies based on this model can be classified into two broad areas: those examining the determinants of migration and those focused on the consequences of migration for wages and earnings.5 While the determinants of migration are not the focus of our paper, they play an important role in our specification of the propensity score in matching and in our non-parametric LATE estimators. Polachek and Horvath (1977) and Plane (1993) find that geographic mobility peaks during the early to mid-twenties and declines with age thereafter, perhaps because the time horizon over which gains from migration can be realized grows shorter. These studies also find that the propensity to migrate increases with education. Highly educated workers operate in labor markets that cover broad geographic areas, whereas workers with low levels of education operate in more geographically isolated labor markets. Workers with more education also may be better informed about opportunities outside their local labor market and better able to evaluate that information. In addition, the migration decision is affected by migration cost and the non-wage benefits of different locations. Goss and Schoening (1984) provide some indirect evidence that households with fewer assets are less mobile, since they find that the probability of migration declines with the duration of unemployment. Further, Lansing and Mueller (1967) report that many moves are attributable to family-related issues, such as proximity to family members. Presumably being close to one’s family is a non-wage advantage of a given location. Of course, in the human capital model of migration, expected wage gains, local demand shocks, and inter-regional differences in returns to skill play an important role in the migration decision. Shaw (1991), Borjas, Bronars and Trejo (1992b), Dahl (2002), and Kennan and Walker (2003) use a Roy (1951) model of comparative advantage to study migration. Although the human 4 See also McCall and McCall (1987). They develop a “multi-armed bandit” approach to the migration decision. Workers rank locations by their pecuniary and nonpecuniary attributes, and then sample locations sequentially until a suitable match is found. Search costs limit the number of markets sampled. 5 Greenwood (1997) provides an excellent review of the literature. 5 capital model of migration clearly predicts a higher present value of lifetime earnings for those who migrate, the literature on the consequences of migration reaches no consensus on the contemporaneous returns to migration. Estimates of the average contemporaneous returns can be negative, zero, or positive. Positive contemporaneous returns are found by Bartel (1979) for younger workers, Hunt and Kau (1985) for repeat migrants, and Gabriel and Schmitz (1995) and Yankow (2003) for less-educated workers. Negative contemporaneous returns are found by Polachek and Horvath (1977), Borjas, Bronars and Trejo (1992a), and Tunali (2000).6 Studies that find statistically insignificant contemporaneous returns include Bartel (1979) for older workers, Hunt and Kau (1985) for one-time migrants, and Yankow (2003) for workers with more than a high school degree. The sign and significance of the migration effect depend on the sample chosen and on how researchers address three critical questions. First, what definition of migration is used? Although all authors view migration as a change of labor market, most define migration as occurring if a geographic boundary is traversed. The majority of authors, including most of those cited above, focus on interstate migration. A few, such as Hunt and Kau (1985) and Gabriel and Schmitz (1995), define migration as a change of Metropolitan Statistical Area (MSA). Falaris (1987) defines it as a change of Census region, while some authors, such as Linneman and Graves (1983), study intercounty migration. We find below that migration counts are sensitive to the definition used, and that the estimated returns to migration and their standard errors can be quite sensitive to the definition of migration used. The second question affecting the estimated effect of migration concerns the choice of comparison group. Most authors use all workers who do not migrate as the comparison group. But it is well known that there is wage growth associated with voluntary job turnover (Topel and Ward 1992). Since most migrants change jobs, the “return to migration” may confound returns to job changing with a return to geographic mobility. Bartel (1979) was the first to focus on the relationship between the types of job separation and migration. Others, such as Yankow (1999), condition on job changing but do not differentiate between types of job turnover (e.g. quits versus layoffs). Finally, Raphael and Riker (1999) consider only workers who were laid off. The third important question is how do researchers address the problem that migrants are likely to be a select sample in terms of observables and unobservables because migration is a choice variable. Nakosteen and Zimmer (1980, 1982) were among the first to provide evidence of positive self-selection into migration, and Robinson and Tomes (1982) and Gabriel and Schmitz (1995) also 6Alternatively, Tunali (2000) views migration as a lottery and finds that while a substantial portion of migrants experience wage reductions after moving, a minority realize very high returns. Individuals are willing to invest in an activity that has a high probability of yielding negative returns because of the potential for a very large payoff. 6 find this. On the other hand, Hunt and Kau (1985) and Borjas, Bronars and Trejo (1992a) find no evidence of self-selection. Note that all of these studies attempt to estimate an unconditional effect of moving or an average treatment effect, i.e. what is the effect of migration on wages for a randomly chosen individual. We see two problems with this. First, it may not be interesting to ask what the effect on wages of moving to a new location for a randomly chosen individual is. Many individuals will currently be in a location that gives them relatively high wages, and thus we could easily expect this treatment effect to be negative. Second, the choice of a new location is ambiguous here – is it the individual’s best alternative or a randomly chosen location? In this paper we first use matching to look at a less ambitious, but arguably better-specified question: the effect on wage growth for those who move. Note that there is no ambiguity in our setting as to the choice of new location.7 However we also use Frolich’s non-parametric IV estimator to estimate the ATE and ATET for compliers. 3. Econometric Approach 3.1 Propensity Score Matching Estimators We consider a group of young men who have quit their first job. They face a choice between accepting another job locally or moving to another labor market and accepting a job there. Our goal when using matching is to estimate the effect of internal migration on between-job wage growth for those who quit their first job and move.8 We use difference-in-difference matching, since we want to diminish the role of the individual fixed effects to make the assumptions underlying matching more tenable. Following the notation in the evaluation literature, let Di = 1 if an individual moves and Di = 0 otherwise. The outcome variable is defined as the logarithm of the starting wage on the second job minus the logarithm of the ending wage on the first job. For each individual i , we define two potential outcomes by treatment status, Yi 0 for Di = 0 and Yi1 for Di = 1 . After dropping the i subscript for ease of exposition, our goal is to identify the ATET = E (Y 1 − Y 0 D = 1) = E (Y 1 D = 1) − E (Y 0 D = 1). (3.1) We can estimate the first term on the right-hand side of (3.1) since we observe the between-job wage growth for the movers. However, we do not observe the between-job wage growth the movers 7 See Heckman, LaLonde and Smith (1999) for a discussion of when estimating the effect of “treatment on the treated” may be more useful than estimating an average treatment effect. For reasons discussed in Section 3.1.1, we cannot estimate an average treatment effect of migration given our data. 8 See Section 4 for our reasons for focusing on migration occurring between the first two jobs. 7 would have received had they not moved – the (counterfactual) second term on the right-hand side of (3.1). Instead we use propensity score matching9 to estimate this counterfactual. 3.1.1 Propensity Score Matching For matching to be valid, certain assumptions must hold. The fundamental assumption underlying matching estimators is known as the Ignorable Treatment Assignment Assumption (see Rosenbaum and Rubin 1983) or the Conditional Independence Assumption (CIA – see Lechner 2000). In our case the CIA requires Y0 ⊥ D | X, (3.2) where X is an appropriate set of observable variables unaffected by the treatment. This assumption states that, conditional on X , potential wage growth Y 0 is independent of treatment status. For this identification assumption to be credible, there must be some variable (with exogenous variation) that affects whether people move but is unrelated to potential wage growth; however it is not necessary to observe this type of variable. We will discuss this in greater detail in the next section. Note that since we are estimating the ATET for those who move, we do not need Y 1 to be independent of D after conditioning on X . The variable vector X contains pretreatment variables and time-invariant individual characteristics. To identify the ATET, we also need Pr( D = 1 | X ) < 1 . (3.3) Equation (3.3) is a common support condition that requires a positive probability of observing nonparticipants at each level of X . As we will see below, the common support constraint is not a problem in our data, unlike some other applications – especially those drawn from the job training literature. Matching on all variables in X becomes impractical as the number of variables increases; this is the well known curse of dimensionality. Rosenbaum and Rubin (1983) show that given (3.2) and (3.3), if Y 0 is independent of treatment status given X , then it is also independent of treatment status given p( X ) = Pr( D = 1 | X ) . Consequently, matching can be performed on a single index p( X ) instead of on all of the variables in X . 9 The seminal paper in the matching literature is Rosenbaum and Rubin (1983). For recent contributions in economics literature see Abadie and Imbens (2002), Hahn (1998), Heckman, Ichimura and Todd (1997, 1998), Heckman, Ichimura, Smith and Todd (1998), and Hirano, Imbens and Ridder (2003). See Imbens (2004) and Heckman, LaLonde and Smith (1999) for a discussion of previous work in the statistics literature and empirical applications of matching. See Frölich (2004) and Zhao (2004) for Monte Carlo evaluation of various matching approaches. 8 We should note that it is also possible to estimate an unconditional effect of moving if we are willing to make the stronger assumptions that (Y 0 ,Y 1 ) ⊥ D | X and 0 < Pr( D = 1 | X ) < 1 . However, we do not do this for two reasons. First, as we mentioned above, we believe the conditional effect for those who move is more interesting. Second, calculating an unconditional effect also involves using matching to estimate the wage growth that stayers would have experienced had they moved. In our case this would require using a relatively small number of movers to construct a counterfactual for a relatively large number of stayers. Frolich’s (2004) Monte Carlo experiments indicate that matching estimators for such a counterfactual have high mean squared errors. 3.1.2 Plausibility of the Conditional Independence Assumption and the Choice of Conditioning Variables The CIA in equation (3.2) is a rather strong condition, but note that we allow Y 1 , thus the migration gain ∆ = Y 1 − Y 0 , to depend on D . However, if Y 0 depends on Y 1 , we would have to rule out the dependence of Y 1 on D . An example of this dependence of Y 0 on Y 1 is an individual being able to use Y 1 to bargain for a better Y 0 . Given our empirical setting, we feel it is reasonable to rule out this type of bargaining. Individuals in our sample were an average age of 26 years old when they quit their first job, and it seems unlikely that the young men in our sample would have this kind of bargaining power. The advantage of allowing Y 1 to depend on D is that we do not need to include in X the factors that affect both Y 1 and D . Examples of such variables are the pull factors from the new destination (such as labor market conditions in the new location), which are unobserved for the stayers, and thus cannot be included in our conditioning variables. We select our conditioning variables to control for factors (or proxy for unobservables) expected to affect both the migration decision and the potential wage gain at the home location ( Y 0 ). Our propensity score specification includes age, education, race, occupation, starting wage, ending wage, and job tenure on the first job, living in an MSA10, and home ownership11. For example, we chose home ownership as a conditioning variable since we expect that it would affect moving costs and also would be correlated with the unobservables in wages. We would expect the wealth of the individual’s parents to affect the discount rate and thus the migration decision. Whether this variable also enters the wage equation, or is correlated with the unobservables in the wage equation, is an Workers in cities earn 33% more than their nonurban counterparts. Glaeser and Mare (2001) show that a portion of the urban wage premium is a wage growth, not a wage level, effect. They conclude that cities speed the accumulation of human capital. 11 All variables take the pre-migration value. 10 9 open question.12 Thus we estimate the propensity score without parents’ education in our baseline model, but include both parents’ education in our expanded model. In the expanded model we also consider a push factor for migration, the county unemployment rate at the end of the first job, since it may affect the migration decision and local wages. Assume that for a mover and a stayer, we have matched on all observed characteristics that affect both migration decision and Y 0 . It is reasonable to ask why some of them chose to move and the others do not, if we have controlled for all relevant factors. The answer is that there are variables that affect D but not Y 0 . In our case, individuals make migration decisions to maximize their expected utility, which is a function of moving costs and expected wage gains. Moving cost variables will affect D but not Y 0 , and they should not be included in the propensity score model. Thus our empirical model will match movers and stayers with similar estimated propensity scores who have similar Y 0 , but bear different moving costs. For example, the NLSY79 data contain a dummy variable indicating if an individual was living at age 14 in the same county where he was born. We would expect that this variable is a proxy for psychological cost of moving (because of current proximity to friends and families), but we would not expect this variable to affect wages. In Section 4 we will show that movers are much less likely to live in their county of birth at age 14. 3.1.3 Local Linear, Local Cubic and Local Linear Ridge Regression Matching Estimators We use a probit model to estimate the propensity score, pˆ ( x) , for each mover and stayer, and then ( consider different matching methods to construct the counterfactual ) E Y 0 D = 1, p( X ) = pˆ ( x ) . Let N1 be the number of movers and N 0 be the number of stayers in our data. The conditioning and outcome variables, {xi ,Yi1 } N1 i =1 two groups respectively. The propensity scores { pˆ ( x )} N0 j j =1 { pˆ ( xi )}i =1 N1 and {x j ,Y j0 } N0 j =1 , are observed for the are estimated for the mover group and are estimated for the stayer group. Applied economists have used many different matching estimators. For example, nearest neighbor matching uses only the closest (in terms of the ( ) estimated propensity score) stayer to estimate E Y 0 D = 1, p ( X ) = pˆ ( x ) for a mover. Using only one observation in the non-treated group is likely to be inefficient, and alternative matching 12 Willis and Rosen (1979) assume that father’s education affects the discount rate but does not enter the wage equation, nor is it correlated with the error in the wage equation. However, others may find their assumption too strong. Given Willis and Rosen assumption, father’s education makes a good IV but shall not be included in the propensity score. If, as others believe, it affects wages through the father’s connections, it should be in the propensity score, and also it will not be a valid IV. 10 estimators have been suggested. For example Heckman, Ichimura, and Todd (1997, 1998) propose local polynomial matching estimators, including kernel matching and local linear matching. Hirano, Imbens and Ridder (2003) suggest a weighted estimator (with the inverse of a nonparametric estimate of the propensity score as weights). Frolich (2004) suggests using a local linear ridge regression matching estimator, and finds that it performs well in his Monte Carlo experiments. (He also finds that local linear regression matching performs well when the control-treated ratio is relatively high, as in our application.) We also use local cubic matching since it offers the possibility of reducing the bias of the estimates – albeit at the cost of increasing the variance – and it has not been used previously in matching literature.13 For each observation i (i = 1,..., N1 ) in the mover group, local regression matching opens a window around pˆ ( xi ) = pi and uses all observations in the stayer group with estimated propensity scores in that window to construct a weighted mean, mˆ ( pi ) , to approximate the counterfactual E Y 0 | D = 1, p ( X ) = pi . Within the window, the closer pˆ( x j ) is to pi , the greater the weight the observation j gets in estimating mˆ ( pi ) . Local polynomial regression matching methods construct the counterfactuals by solving the following minimization problem for each mover i ( i = 1 to N1 ) 2 pˆ ( x j ) − pi l 0 L . Y j − β l pˆ ( x j ) − pi K h j =1 l =0 N0 min β 0 , β1 ,...,β L ∑ ∑ ( ) In (3.4) K (⋅) is a kernel weighting function, h is the bandwidth, and L polynomial chosen by the researcher. (3.4) is the order of the This minimization problem yields mˆ ( pi ) = βˆ0 as the counterfactual estimate for mover i . Economists have considered L = 0 (Kernel matching) and L = 1 (local linear matching) and, as noted above, we were interested in investigating how a higher order polynomial, e.g. L = 2 or L = 3 , would perform. Fan and Gijbels (1996) show that for local polynomial fitting, L − ν , where L is the order of polynomial and ν is the order of derivative, should be odd (for our case ν = 0 since we estimate the function). They demonstrate that in large samples, odd order polynomial L − ν = 2m + 1 fits are preferable to the corresponding even order polynomial L − ν = 2m fits; e.g. L − ν = 1 results in a bias reduction, but no increase in asymptotic variance, compared to a choice of L − ν = 0 . Given this large sample property, we consider both local linear ( L − ν = 1 ) and local cubic ( L − ν = 3 ) matching, but not Kernel or local quadratic matching. 13 We thank Stephen Cosslett for suggesting that we consider this possibility. 11 However in small samples, local linear regression could lead to a very rugged curve in regions of sparse or clustered data (Seifert and Gasser 1996). A local linear ridge regression matching estimator is likely to perform well in such a situation, since it is a local linear estimator with a penalty on large slopes of the local regression line. Frölich (2004) finds that matching based on local linear ridge regression very often outperforms other matching estimators, including nearest neighbor, kernel and local linear; moreover it is robust to simulation designs. Thus we also use the following local linear ridge regression estimator of Seifert and Gasser (1996, 2000): For each mover i with pˆ ( xi ) = pi , the estimated counterfactual E Y 0 | D = 1, p ( X ) = pi from local linear ridge regression matching is mˆ R ( pi ) = T1 ⋅ ( pi − pi ) T0 + , S0 S2 + r ⋅ h pi − pi (3.5) where N0 ( Sa ( pi ) = ∑ pˆ ( x j ) − pi j =1 N0 ( ) a Tb ( pi ) = ∑Y j0 pˆ ( x j ) − pi j =1 pˆ ( x j ) − pi ⋅K for a = 0,2, h ) b pˆ ( x j ) − pi ⋅K for b = 0,1 and h N0 pˆ ( x j ) − pi pi = ∑ pˆ ( x j ) ⋅ K h j =1 N0 pˆ ( x j ) − pi . h ∑ K j =1 Again, K (⋅) is a kernel weighting function, h is a bandwidth and r is the ridge parameter. We follow the rule of thumb from Seifert and Gasser (2000) and use a Gaussian kernel with r set −1 equal to 4 2π ∫ φ 2 ( u ) du ≈ 0.35 . To implement all of these matching estimators, we must choose a bandwidth or smoothing parameter, often the most important decision a researcher makes in nonparametric regression. Basically, there are two types of bandwidths: global (fixed) bandwidths and local (variable) bandwidths. The global bandwidth approach uses the same window width at each point where we run a local regression, while the variable bandwidth approach changes the bandwidth according to the data density around the particular point. In other words, the variable bandwidth approach allows us to use a small bandwidth where the probability mass is dense and a larger bandwidth where the probability mass is sparse. As Fan and Gijbels (1992, p. 2013) put it, “A different amount of smoothing is used at different data locations.” Given that local linear estimators have a well-known 12 problem over regions of sparse or clustered data, for both local linear and cubic regressions,14 we use the local adaptive bandwidth proposed by Fan and Gijbels (1996)15. Here the size of the window h( pi ) varies for each mover i with a different propensity score pˆ ( xi ) = pi ; h( pi ) is chosen to include the same number kn of stayers closest to pi to fit the local regression. The term kn is determined by the sample size n and will grow (slowly) as the sample size grows. Local linear ridge regression should be less sensitive to distribution of the data. When implementing this estimator we follow Frölich (2004) and use a global bandwidth chosen by leave-one-out cross-validation. Therefore, use of the local linear ridge regression estimator allows us to see the sensitivity of our estimates to changing estimation methods, and changing from a local to a global bandwidth. 3.1.4. Standard Error Estimation In spite of the widespread use of the bootstrap to obtain standard errors for matching estimators, no formal justification for its use has been established for any matching estimators. Moreover, Abadie and Imbens (2008) show that the bootstrap is, in general, not valid for nearest neighbor matching, even when the estimator is root-N consistent and asymptotically normally distributed with zero asymptotic bias; this problem occurs because of the extreme non-smoothness of nearest neighbor matching. While our matching estimators differ from nearest neighbor matching because the number of matches increases with the number of observations (determined by the bandwidth choice) and there is certainly additional smoothness in it compared to nearest neighbor matching estimators, we still have no support for the use of the standard bootstrap with our estimators. Politis and Romano (1994) and Bickel et al. (1997) show that when the bootstrap fails, using the ‘m out of n’ bootstrap rectifies this problem under minimal conditions. Here a smaller bootstrap sample size m is used, where m → ∞ and m / n → 0 . The problem for applied researchers is how to choose m . Fortunately Bickel and Sakov (2008, p. 971) present an adaptive procedure to determine: i) whether the standard n -bootstrap is valid and ii) if it is not valid, how to choose m . Here we give a step-by-step heuristic version of how we use their procedure: 1. We consider a sequence of bootstrap sample sizes ( m ' s ) of the form m j = p j n for j = 0,1,..., J , where n is the size of our original sample, p < 1 and Local cubic regression shall suffer from the same problem, probably at a more severe level. Ruppert, Sheather and Wand (1995) derive three optimal fixed (global) bandwidth selectors for local linear regression. We considered their preferred selector, the direct plug-in bandwidth selector (p. 1262). However it produced matching estimates with large standard errors. 14 15 13 p j n denotes the smallest integer greater than or equal to p j n . We choose p = 0.85 and J = 3 . Note that the case j = 0 corresponds to the standard bootstrap. 2. For each j = 0,...,3 we generate 1000 bootstrap samples of size m j from our original sample. This results in bootstrap samples of size m0 = 2078 to m3 = 1277 . 3. For each j we re-estimate the propensity score matching model for all bootstrap samples of size m j . 4. For each j we then calculate the empirical cumulative distribution function of the propensity score matching estimates (of the treatment effect) Fˆm j from the 1000 bootstrap samples. 5. For each j we calculate d j , the maximum Kolmogorov distance between Fˆm j and Fˆm j +1 . 6. We choose the value of j that minimizes d j . (If d j is minimized for two or more values of j , then among these we pick the j that is the smallest.) If the minimum occurs at j = 0 (or m = n ), the standard n -bootstrap is valid. We generally find that the maximum Kolmogorov distances d j , j = 0,...,3 tend to be minimized at j = 0 . For example given our local linear regression matching estimator with trimming q = 5 (Panel A of Table 4), we find d j , j = 0,..., 3 to be ( 0.032, 0.048, 0.032, 0.048 ) for college graduates and ( 0.040, 0.049, 0.049, 0.045 ) for high school dropouts. As a result we choose j = 0 and conclude that the standard bootstrap is valid in our case. In implementing the ordinary bootstrap, an important decision is choosing the number of bootstrap repetitions, and here we follow the procedure developed in Andrews and Buchinsky (2000, 2001). They propose a three-step method for choosing the number of bootstrap repetitions. We describe the Andrews and Buchinsky procedure for calculating standard errors in the Appendix. 3.1.5 Common Support Constraint and Balancing Tests As noted above, matching can only be used to estimate the ATET for values of X where the probability of observing nonparticipants is positive. In other words, the ATET is identified only over the portion of X ' s support where each mover can find a reasonable number of stayers in its neighborhood. To satisfy this condition, we add a common support constraint, following the 14 procedure proposed by Heckman, Ichimura and Todd (1997).16 Using their notation, we set the trimming level q = 5 .17 To test the sensitivity of our matching estimators to the trimming level, we also consider q = 3 and q = 7 for our baseline model, and find that this does not affect our results. Further, we do not conduct any trimming when we use the local linear ridge regression matching estimator, and find that this does not have a qualitative effect on our results. One explanation for this lack of sensitivity of our matching estimator is that our distributions of propensity scores for treatment and controls are much more similar than those for matching applications aimed at estimating the ATET for training programs. Again following much of the literature, we choose the functional form of the variables in the index function to ensure, via formal tests, that the conditioning variables X are distributed identically across the treatment group and the matching sample. Specifically, for a given functional form of the index function, we test whether our empirical model for propensity score balances the sample via two types of tests; paired t-tests and joint F tests.18 Paired t-tests examine whether the mean of each element of X for the treatment group is equal to that of the matched sample. However, paired t-tests are not able to detect differences between two distributions beyond the sample means. Since all matching methods require that the two distributions mimic each other at each quantile, instead of just exhibiting similar means, we also conduct a joint F test, as proposed originally by Rosenbaum and Rubin (1985). The treatment group and matched sample are broken down into quartiles according to the estimated propensity scores.19 At each quartile, we test whether the mean of all elements of X are jointly different across the two groups. If a model fails to pass either the t-tests or the F tests, we add higher order terms or interaction terms until the variables are balanced across the two groups. 3.1.6 Finer Balancing and Allowing the Treatment Effect to Differ by Educational Group We find that education plays a major role in the migration decision and wage rate determination, thus suggesting that it may be inappropriate to match across education groups, e.g. use a high school dropout who stays to estimate the counterfactual for a college graduate who See also Smith and Todd (2005) for details. Their procedure is designed to trim q percent to 2q percent of movers. Because this is a data-driven approach, the exact trimming level depends on the data structure. Fewer movers will be eliminated as the modes and shapes of the mover and stayer distributions become more similar. 18 Our tests are based on results from nearest neighbor matching. Although Abadie and Imbens (2008) show that bootstrap cannot be used to estimate standard errors for nearest neighbor matching, in the balancing tests, we do not need to calculate standard errors for the nearest neighbor estimates. 19 The number of intervals used in joint F tests depends on sample size. We have 378 movers, so we can only afford to break them down into quartiles. If larger samples are available, finer intervals, such as deciles, should be used. 16 17 15 moves. To avoid this problem, in calculating the ATET we only match stayers to mover i if they are in mover i ’s educational group. Rosenbaum and Rubin (1983) define such a procedure as finer balancing. Following Rosenbaum and Rubin (1985), we first estimate the propensity score using the entire sample20, as we do not have enough data to estimate a different index function for the four educational groups (high school dropouts, high school graduates, those with some college, and college graduates). We then match movers with stayers in the same educational group based on the estimated propensity score to get an estimate of the ATET for the whole sample. We also estimate ATET for each educational group, since the migration effect is likely to differ by level of schooling. For example, we would expect that it is much easier for college graduates to search and find a higher wage job in a new location without first moving there than it is for high school dropouts. Letting S denote schooling class and s denote a particular schooling level, the ATET for a schooling level is ATETs = E (Y 1 − Y 0 | D = 1, S = s ) = E (Y 1 | D = 1, S = s ) − E (Y 0 | D = 1, S = s ) . (3.6) To obtain the first term in equation 3.6, we take the mean increase in wages for those in schooling class s who move. To obtain the second term, we use propensity score matching estimators presented in Section 3.1.3 and only match stayers to mover i if they are in mover i ' s educational group. In our empirical work below we find that it is indeed important to allow the treatment effect to vary by educational group. 3.2 Nonparametric IV Estimation of Local Average Treatment Effects with Covariates21 As discussed in Section 3.1.2, while moving cost variables are very important for achieving our common support assumption ( Pr( D = 1 | X ) < 1 ), we do not include them in propensity score specification and thus do not even need to observe them. However, moving cost variables play a more direct role in our instrumental variable analysis, which estimates local average treatment effects (LATE) of migration.22 Given that we can observe moving cost variables, we estimate the LATE for migration using Frölich (2007)’s nonparametric IV estimator. We consider two versions of LATE estimators where the instrumental variable(s) defined as i) living in the same county as age 14 where he was born and ii) living in the same county as age 14 20 Our educational dummy variables (equivalent to Rosenbaum and Rubin’s sex dummy variable) are also included in the propensity score model. 21 We thank two anonymous referees for suggesting that we explore LATE estimators given that we have suitable instrumental variables in our data. 22 LATE was introduced by Imbens and Angrist (1994) and further developed by Angrist and Imbens (1995), Angrist, Imbens, and Rubin (1996), Imbens and Rubin (1997), Heckman and Vytlacil (2001) and Imbens (2001), among others. 16 plus a dummy variable coded 1 if his father has a college degree.23 As discussed in Section 3.1.2, living in the same county is an obvious candidate for an instrumental variable. Father’s education is a proxy for the ability to finance a move (especially because our respondents are relatively young). It will be a valid IV if, as Willis and Rosen (1979) assume, it does not affect wages directly and it is not correlated with the unobservables affecting wages.24 In our case, the instrument is not randomly assigned. To make them proper instruments, it is important to condition on some covariates. Consider the case where we only use a binary instrument, where Z equals 0 if an individual was still living in the same county at age 14 (bearing high moving cost) and 1 otherwise. If we do not use conditioning variables in our analysis, there could be common factors that affect both the instrument and wages. For example, suppose African Americans were less likely to move when they were young (i.e. more likely to have Z = 0 ). Given that race generally affects wages, our instrument would be invalid if we do not condition on race. We denote our conditioning variables by X , and use a rich set of variables such as past wages, work history, education, race, and marital status. Again we provide a heuristic description of Frölich (2007) approach. For individual i let Di denote the endogenous migration decision as defined in Section 3.1 and Z i denote the binary instrument as defined above. The outcome variable of interest, Yi , is defined in Section 3.1 as the between job wage growth. Let Di , z denote the potential participation (migration) status of an individual i if the level of the instrument were externally set to z . Note that Di ,Zi = Di is the observed value of D for individual i . Di , z defines four different types of individuals denoted by T ∈ {a , n, c, d } . Following the standard LATE literature, we call these four types: always-takers (a), never-takers (n), compliers (c), and defiers (d). In our case, the always-takers are individuals who will move regardless of the moving costs ( Di ,0 = 1 and Di ,1 = 1 ). The never-takers are those who will not move regardless of moving costs ( Di ,0 = 0 and Di ,1 = 0 ). The compliers are those who will move if their moving costs are low and will not move if their moving costs are high ( Di ,0 = 0 and Di ,1 = 1 ). The defiers are those who move when moving costs are high and vice-versa ( Di ,0 = 1 and Di ,1 = 0 ). Let Yi ,dz represent the potential outcome for individual i when Di and Z i were fixed externally to d ( d = 0,1 ) and z ( z = 0,1 ) respectively. The potential outcomes of interest are Yi ,dZi where d is We will discuss in Section 5.5 how we construct a single instrument based on both variables. Our results below indicate that including this variable in the propensity score model does not change the estimates, while we would expect the estimates to change if the father’s education also affects the wage. This appears to lend support to the Willis and Rosen assumption in our case. 23 24 17 fixed externally without a change in Z . Again note that the observed outcome for individual i is Yi ,DZii ≡ Yi . Given the following assumptions, Frölich (2007) shows that his conditional LATE estimator is identified. [Assumption 1: Exogenous covariates]: The conditioning variables X are exogenous in the sense that X iD,Zi i = X id, z ∀d , z , where X id, z is the potential value of X that would be observed for individual i if Di and Z i were set by an external intervention. This assumption will be violated if X i is affected by Z i or Di . Among the conditioning variables we use for the matching estimators, we suspect the home ownership variable might violate this condition because individuals bearing higher moving costs might be more likely to buy a house at younger ages. Thus we exclude this variable in our LATE analysis. [Assumption 2: No defiers]: P (T = d ) = 0 . This assumption says that reducing moving costs will not induce any mover to stay and increasing moving costs will not induce any stayer to move. [Assumption 3: Existence of compliers]: P (T = c ) > 0 . To identify the LATE, there have to be individuals in the population whose probabilities of migration, given other conditions, are affected by moving costs. [Assumption 4: Unconfounded type]: For all x ∈ Supp ( X ) P (Ti = t X i = x, Z i = 0 ) = P (Ti = t X i = x, Z i = 1) for t ∈ {a, n, c} . This assumption requires that at each level of X , the fractions of compliers, always-takers and never-takers are the same for both the Z = 1 and the Z = 0 groups. This condition will be violated if, for example, given the same X , the individuals bearing high moving costs ( Z = 0 ) are more likely to be compliers than those bearing low moving costs. We would expect this assumption to be plausible in our case because we compare two groups of individuals who share many characteristics (included in X ) such as past wages, education, race and occupation. [Assumption 5: Mean exclusion restriction]: For all x ∈ Supp ( X ) ( E (Y ) ( = 0, T = t ) = E (Y ) = 1, T = t ) for t ∈ {a , c} E Yi ,0Zi X i = x, Z i = 0, Ti = t = E Yi ,0Zi X i = x, Z i = 1, Ti = t for t ∈ {n, c} 1 i , Zi X i = x, Z i i 1 i , Zi X i = x, Z i i This assumption rules out a direct effect of Z on Y . It carries two conceptually distinct assumptions: an exclusion restriction on the individual level and an unconfoundedness assumption 18 on the population level. For example, for the potential outcome Yi1,Zi it requires on the individual level the potential outcome will not be affected by an exogenous change in Z i . One scenario that violates this assumption occurs when conditional on X , individuals with more experience moving ( Z = 1 ) are more adaptable to new situations than those with less experience ( Z = 0 ), since now the outcome would depend directly on Z .25 On the population level, it requires that the potential outcome Yi1,Zi is identically distributed in the subpopulations of Z = 1 and Z = 0, and thus it rules out selection effects that are related the potential outcome. To guarantee the unconfoundedness on the population level, it is important to include in X all variables that affect Z and potential outcomes. The final assumption requires [Assumption 6: Common support]: Supp ( X Z = 0 ) = Supp ( X Z = 1) . This assumption requires that the support of X is identical in both Z = 1 and Z = 0 population. We verify this assumption in Section 5.5. ( ) Let γ ATE = E Y 1 − Y 0 T = c denote the average treatment effect for the compliers. Given the above assumptions, Frolich’s non-parametric LATE estimator is given by γˆ ATE = ∑ (Y − mˆ ( X ) ) − ∑ ∑ ( D − µˆ ( X ) ) − ∑ i:Z i =1 i 0 i i: Z i = 0 i:Z i =1 i 0 i i: Z i = 0 (Y − mˆ ( X ) ) , ( D − µˆ ( X ) ) i 1 i 1 i (3.7) i where mˆ z ( x ) and µˆ z ( x ) are nonparametric regression estimators of mz ( x ) = E (Y X = x, Z = z ) and µ z ( x ) = E ( D X = x, Z = z ) . Of course, when X consists of a rich set of conditioning variables, we can encounter a curse of dimensionality problem because we have to carry out highdimensional nonparametric regression. Frölich (2007) shows that propensity score based estimators can be constructed for the estimation of γ to avoid high-dimensional nonparametric regression. Define π ( x ) = P ( Z = 1 X = x ) as the propensity score and two conditional mean functions mπ z ( ρ ) = E (Y π ( X ) = ρ , Z = z ) and µπ z ( ρ ) = E ( D π ( X ) = ρ , Z = z ) for z = 0,1 . Now the conditional mean functions depend only on the one-dimensional propensity score, and the IV estimator of the LATE is 25 Given this possibility, we may feel more confident about the condition involving the potential outcome Yi ,0Zi than Yi1,Zi . However, to use an IV estimator, we will always have to make an identification assumption, so we rule out this possibility given the rich independent variables we condition on such as past wages and work history. 19 γˆπATE m = ∑ (Y − mˆ π (πˆ ) ) − ∑ ∑ ( D − µˆπ (πˆ ) ) − ∑ i:Z i =1 i 0 i i:Z i =1 i 0 i i:Z i = 0 i:Z i = 0 (Y − mˆ π (πˆ ) ) , ( D − µˆπ (πˆ ) ) i 1 i 1 i (3.8) i where πˆi is a consistent estimator of π i = π ( X i ) . Similarly, Frölich shows that the average treatment effect for the treated compliers ( ) γ ATET = E Y 1 − Y 0 T = c, D = 1 can be estimated using = γˆπATET m (Yi − mˆ π 0 (πˆi ) ) ⋅ πˆi − ∑ i:Z =0 (Yi − mˆ π 1 (πˆi ) ) ⋅ πˆi i . ˆ ˆ ˆ − ⋅ − µ π π D ( ) ( ) ∑ i:Z =0 ( Di − µˆπ 1 (πˆi ) ) ⋅ πˆi π0 i i i i:Z =1 ∑ ∑ i:Z i =1 i (3.9) i ( ) We estimate the average treatment effect for the compliers, γ ATE = E Y 1 − Y 0 T = c , and the ( ) average treatment effect for the treated compliers, γ ATET = E Y 1 − Y 0 T = c, D = 1 , using (3.8) and (3.9) respectively. ( ) ( ) Besides estimating γ ATE = E Y 1 − Y 0 T = c and γ ATET = E Y 1 − Y 0 T = c, D = 1 , we can also estimate the fractions of compliers, always-takers and never-takers, even though we cannot identify them in our sample. The fractions of compliers, always-takers and never-takers are estimated respectively as {∑ 1 n P (T = a ) = {∑ N 1 n P (T = n ) = {∑ N 1 n P (T = c ) = N i:Zi =1 ( Di − µˆπ 0 (πˆi ) ) + ∑ i:Z =0 ( µˆπ 1 (πˆi ) − Di )} , (3.10) } (3.11) i i:Zi =1 µˆπ 0 (πˆi ) + ∑ i:Z =0 Di , and i:Zi =1 (1 − Di ) + ∑ i:Z =0 (1 − µˆπ 1 (πˆi ) ) i }, i (3.12) where N = N 0 + N1 (the number of movers plus the number of stayers) is the total number of observations in our sample. Finally it is worth emphasizing that the average treatment effect for the treated compliers, ( ) γ ATET = E Y 1 − Y 0 T = c, D = 1 , is a very different parameter than the average treatment effect on the treated ATET = E (Y 1 − Y 0 D = 1) in (3.1) (as identified by propensity score matching models). The latter represents the average treatment effect over a well-identified subpopulation, i.e. individuals who participated in the treatment, while the former involves a subpopulation that cannot be identified in the sense that we cannot distinguish compliers from non-compliers in our data. In our IV application, compliers are the subpopulation that would react (in terms of moving) to an exogenous change in moving costs. 20 4. Data and Summary Statistics Our primary data source is the 1979-1996 waves of the NLSY79. The survey began in 1979 with a sample of 12,686 men and women born between 1957 and 1964. Annual interviews were conducted from 1979 to 1994, with biennial interviews thereafter. The NLSY79 provides a comprehensive data set ideally suited for studying migration and job mobility. First, the longitudinal aspects of the data make it possible to track the same individuals over time as they move across jobs and labor markets. Furthermore, the NLSY79 data files include detailed longitudinal records of the employment history of each respondent. Second, the NLSY79 provides categorical information on the distance of a move when a change of address occurs.26 This, in turn, allows us to calculate a distance-based measure of migration and compare our results with more orthodox measures based on change of county or change of state. Our distance-based measure of migration corresponds more closely to the theoretical notion of changing local labor markets than do the alternative measures. We define migration as occurring if either i) the respondent moved at least 50 miles or ii) changed MSA and moved at least 20 miles.27 We also consider the sensitivity of our results to changes in the definition of migration that are feasible given the information available in the NLSY79. In addition, we compare the estimates based on our migration definition to those based on a county-to-county or a state-to-state definition. We note that a change-of-county definition of migration will misclassify as migrants individuals who move short distances across county lines but do not change labor markets. A change-of-state definition of migration will misclassify individuals who move hundreds of miles and change labor markets but remain in the same state as stayers. We focus on young men at the outset of their work careers, and restrict the analysis to the wage gain from migration for voluntary transitions between their first and second jobs. We call the first and second jobs after leaving school “job 1” and “job 2” respectively. We focus on moves from job 1 to job 2 for two reasons. First, individuals exhibit the majority of moving and job changing at the outset of work careers. Second, for individuals with similar backgrounds, the wage profile and work experience are more comparable at the beginning of their careers than later. Therefore matching has a better chance to succeed at the beginning of their careers. In order to construct a sample suitable for empirical analysis, we introduce several selection criteria. The sample is limited to young men since we felt that the moving decisions of women can be The researchers who produce the NLSY79 data observe the exact latitude and longitude of the respondent’s residence at the time of each interview. To insure confidentiality, the geocode data providing the latitude and longitude of a respondent’s residence are not available for research purposes. Instead the NLSY79 provides to researchers interval information on distance moved. 27 Adjacent county centroids are typically about 25 miles apart, so a move of 50 miles roughly corresponds to a move to a location two counties away. 26 21 more complicated and not comparable to men. Because our interest lies in post-schooling labor market activity, we follow individuals from the time they leave school. The longitudinal structure of the NLSY79 allows us to determine precisely when most workers make a permanent transition into the labor force. To avoid counting summer breaks or other inter-term vacations as leaving school, we define a schooling exit as the beginning of the first non-enrollment spell lasting at least 12 consecutive months. Accordingly, respondents are excluded from the sample if the date of schooling exit cannot be clearly ascertained from the data. For example, respondents who are continuously enrolled throughout the observation period or who have incomplete or inconsistent schooling information are excluded from the sample. Of the 6,403 male respondents in the initial sample, 262 were deleted because they never left school, or because an exact date of leaving school could not be determined. Further, 576 individuals were deleted because they did not hold at least two civilian jobs. Another 116 were lost because they did not report hours and wages on at least two civilian jobs. We eliminated 50 respondents because they reported being fired from their jobs. We imposed this restriction because we wanted to concentrate on voluntary job transitions. Nine respondents were deleted because they were not interviewed during the duration of at least two civilian jobs. We required that the respondent had had at least two jobs that had lasted at least 26 weeks. This resulted in the loss of 32 respondents.28 We excluded from the sample 120 observations with reported hourly wages of less than $1 or greater than $50, as most of these reports appeared to reflect extreme measurement error rather than true wages. Moreover, since we use a distance-based measure of migration, we required respondents to have valid residential location data at the times that they reported holding each job. This requirement resulted in the loss of 1479 respondents; thus the unavailability of location data is by far the largest single reason for sample deletion. We also deleted jobs that overlapped for more than 8 weeks. In this case we considered the respondent to be holding two jobs simultaneously and did not treat that as a job transition. For this reason we lost 132 respondents. We lost another 465 respondents who held two jobs satisfying all of the above criteria, but who reported an intervening job without location data. In this case we did not observe location data for the two consecutive jobs. Further, we lost 855 respondents who satisfied all of the above criteria but who experienced an interval of more than 13 weeks between two jobs. The between-job interval consisted of either a spell of unemployment, non-employment or employment in part time jobs (defined as those with average hours less than 25 or lasting less than 26 weeks). We excluded these individuals because we wanted to measure the return to migration conditional on a job change. Of course this raises the issue of whether we may be overstating the We also required that the respondent hold at least two jobs with average hours of at least 25 per week, but this restriction did not eliminate any respondents. 28 22 return to migration by excluding those who move and cannot find a job. We attempted to investigate this but could not determine from the data whether a migrant had a long spell of non-employment before moving or after moving.29 Finally we lost 229 respondents because data on the variables used in the analysis were lacking. After imposing these selection criteria, we had a sample of 2078 men. To recap, we explore individuals’ migration conditional on their voluntarily quitting their first job. Our movers consist of young men who quit their first job and moved to a new location, while our stayers consist of those who quit their first job but did not move. In this study, migration was defined to have occurred if the respondent moved at least 50 miles or changed MSA and moved at least 20 miles. Table 1 presents descriptive statistics. The first two columns contain variable names and definitions. The third, fourth and fifth columns provide means for the whole sample, the movers’ sample, and the stayers’ sample respectively. The last column presents the difference in means between the movers and stayers. Standard errors for the sample means are in parentheses. The first row of Table 1 shows that 18 percent of all voluntary job changes involved migration. [Insert Table 1 here] Panel A of Table 1 summarizes individual characteristics as of the end of job 1. The men in the sample are, on average, 26 years old at the time of the job change, with the movers being slightly younger than the stayers. The percentage of African Americans is about 10 points higher in the stayers’ sample than in the movers’ sample, the percentage of Hispanics is about the same in both samples and the percentage of Whites is 12 points higher in the movers’ sample than in the stayers’ sample. On average, the movers have a higher education level, are more likely to be married, and prior to migration, are less likely to own a house or live in an MSA. The NLSY79 also provides a rich set of conditioning variables including detailed information on each respondent’s job history and on characteristics of each job – a feature that makes matching an appealing strategy for estimating the migration effect. Panel B of Table 1 presents job 1 related variables. On job 1, movers on average have higher starting and ending wages, have slightly longer job tenure, and are more likely to have held a professional job on job 1. At the end of job 1, county unemployment rates are about the same for movers and stayers. We use all the variables in Panels A and B as conditioning variables for matching, but exclude the home ownership variable in the LATE estimators because we suspect this variable might violate the exogeneity 29 The problem is that we can determine the dates of employment changes but not the dates of location changes, as we only observe location at the interview dates. Thus someone who is in a new location at the time of the interview and has been unemployed for 6 months could have been unemployed for 6 months in the previous location and have just moved, or he could have been unemployed for 6 months in the new location. 23 condition (Assumption 1 in Section 3.2). Panel C of Table 1 shows the means of the logarithm of the starting wage on job 2. Again movers have higher average starting wages on job 2. Note that our outcome variable is defined as the logarithm of the starting wage on job 2 minus the logarithm of the ending wage on job 1. Between the end of job 1 and beginning of job 2, on average the mover and stayer sample experience roughly the same wage growth (around 9 to 10 percent). Panel D of Table 1 presents family background variables. Two dummy variables indicate whether the father or mother has a college degree. Movers are about twice as likely to have a father or mother with a college degree as stayers. Finally, as expected, stayers are much more likely than movers to have lived in their birth county at age 14. There are important differences between movers and stayers in the variables listed in Table 1, and on average movers have more favorable labor market characteristics than stayers. Thus, there is reason to suspect, a priori, that selection will be a serious problem that must be addressed when estimating the effect of migration on the real wage growth for those who move. 5. Empirical Results 5.1 Propensity Score Specification and Balancing Tests Table 2 reports two propensity score models for the migration decision. Model I, our baseline model, contains the individual characteristics and job 1 variables, except for the local unemployment rates at the end of job 1, as presented in Table 1. Model II is an expanded version of Model I, where we have added to the propensity score two dummy variables indicating whether father or mother has a college degree as well as the local unemployment rate at the end of job 1 (before moving for the movers). By comparing the matching estimates of returns to migration in Models I and II, we test the robustness of our propensity score matching estimators to the inclusion of additional variables that might affect estimated returns to migration. [Insert Table 2 here] The probit coefficients for the demographic variables have the expected signs in both models; all variables are individually significant unless otherwise noted. Consistent with most migration studies, Model I shows that, holding other variables constant, the probability of migration starts to decline at about age 25; Hispanics are more likely, and African Americans are less likely, to move than are non-Hispanic Whites, although the coefficient on the Hispanic dummy is statistically insignificant; individuals with less than a college degree are less likely to move than those with a college degree; married men are more likely to move than are unmarried men, while individuals 24 residing in an MSA when they quit their first job are less likely to migrate than are those living in non-metropolitan areas; and men in professional occupations on job 1 are more likely to migrate, while homeownership has a negative effect on migration. The three work history variables (starting wage, ending wage and job tenure on job 1) are not individually significant, but a likelihood ratio test shows that these variables are jointly significant. On average, movers have higher hourly wages prior to migration than do stayers. The additional variables in Model II indicate that – holding other variables constant – individuals whose father has a college degree are more likely to move, while we find that having a mother with a college degree, as well as the county unemployment rate, do not have a significant effect (either individually or jointly) on the probability of moving.30 We show the distributions of the estimated propensity score for movers and stayers from Model I in Figure 1: we use a solid line for the estimated density function for the movers and a dashed line for the estimated density function for the stayers. The empirical support of the two distributions is very similar and the modes are quite close, although, as expected, the movers have a higher average probability of moving than the stayers. [Insert Figure 1 here] Table 3 shows the balancing tests for both models; all the tests are conducted using the mover sample and the matched sample from nearest neighbor matching. Panel A shows the paired tstatistics for the difference in the variable means between movers and the matched sample of stayers. Panel B presents the joint F statistics for the difference in the means between the two groups for all variables at each quartile of the estimated propensity score. In no case is a test statistic near statistical significance at standard test levels.31 [Insert Table 3 here] 5.2 Propensity Score Matching Estimates of the ATET of Migration Table 4 presents the matching estimates of the effect of migration for movers on wage growth from the baseline propensity score model (Model I) for local linear, local cubic, and local linear ridge regression matching. The results from local linear matching estimators with the trimming 30A likelihood ratio test of the null hypothesis that the three additional variables in Model II all have zero coefficients does not reject this null. 31 As has been emphasized in the literature, balancing tests are used to choose the functional form for the propensity score model and do not shed any light on whether the CIA is valid in our application. 25 level q = 5 and two alternative trimming levels, q = 3 and q = 7 , are presented in Panels A, B, and C respectively.32 We calculate the minimum required number of bootstrap repetitions based on the three-step method of Andrews and Buchinsky (2000, 2001). The minimum numbers of repetitions are 241, 251 and 265 for the local linear estimators with q = 3, q = 5, q = 7 trimming respectively.33 For q = 5 trimming, we present the standard errors from 200 and 300 repetitions in Panel A. The two sets of standard errors are relatively close because 200 repetitions are not significantly less than the required minimum of 251 replications. For q = 3 and q = 7 trimming, we only present standard errors calculated from 300 bootstrap repetitions – more than the number of replications required by the Andrews-Buchinsky procedure. [Insert Table 4 here] For all three local linear estimators we choose a 25% bandwidth, which gives us a wide enough window when we disaggregate the data by educational class.34 Point estimates are quite close across three trimming levels. The results in Column 1 of the top three panels show a quite small, and statistically insignificant, effect of migration for the overall sample. When we disaggregate by education, the effect of migration for high school dropouts is estimated to be about −12% , and this estimate is significant at the 10 percent level. College graduates who migrate experience approximately 10.5% greater wage growth, and the estimates are statistically significant at the 5 percent level for q = 3 and q = 5 trimming and significant at the 10 percent level for q = 7 trimming. There are no statistically significant migration effects on wage growth for job changers who have only a high school education or some college. Panel D contains the results when we use local cubic matching and q = 5 trimming. Interestingly, the Andrews-Buchinsky algorithm requires 1074 replications, which of course is higher than the number generally used in empirical work. We find it intriguing that the standard errors based 32 Recall that in the Heckman, Ichimura and Todd (1997) algorithm, the actual amount of trimming depends on the similarity in the distributions of the estimated propensity score for movers and stayers. A level of q trimming will result in deletions between q and 2q percent of the movers. 33 For each estimator, we calculate the minimum repetitions required for the overall sample. We then calculate the minimum repetitions required for each educational group separately. Finally, we take the maximum of the five numbers as our required number of repetitions. For example if the required bootstrap repetitions for overall sample, high school dropouts, high school graduates, individuals with some college, and college graduates are 200, 289, 256, 235, 350 respectively, then we use 350 repetitions for that model. 34 To implement finer balancing matching, we first choose a variable bandwidth to give us a comparison group equal to 25% of the stayers. We then use only those in the group who are in the same educational category as the mover in question. Each mover gets far less than 25% of stayers in the local regression. We find that our results are not sensitive to a 1%-2% bandwidth change. 26 on only 300 repetitions are much too small, indicating the importance of letting the data determine the number of repetitions. Further, the standard errors based on the appropriate number of replications are very large, indicating that we are over-fitting the data by using the local cubic regression matching estimator. Panel E presents estimates from the local linear ridge regression matching estimator. As discussed in Section 3.1.3, we follow Frölich (2004) and use a global bandwidth chosen by leave-oneout cross-validation. Because local linear ridge regression is more stable in regions of sparse or clustered data, we apply this estimator to the original data without trimming. Again standard errors are calculated from the bootstrap with 300 repetitions; here the Andrews-Buchinsky method requires only 198 repetitions. Both point estimates and standard errors are close to those from local linear matching, except that the estimated effect for college graduates is about 4 percentage points higher and the estimated effect for high school dropouts is no longer significant. Some may be concerned that the estimated negative effect for high school dropouts is inconsistent with an economic model of migration, but we note the following. First, we estimate a contemporaneous effect of migration on wage growth. Insignificant or negative contemporaneous effects do not necessarily imply that migration is an irrational decision from the perspective of the human capital approach. As noted in Section 2, Borjas, Bronars and Trajo (1992a) have found that positive returns to migration often are not realized until five or six years after the original migration, and that the initial returns are negative. It is interesting to note that some of the previous studies (e.g. Polachek and Horvath, 1977; and Tunali, 2000) found negative returns for the entire sample, while we find them only for high school dropouts. Migration may involve an assimilation process. A short-term loss in wage need not, and probably does not, imply a drop in life-time utility from moving if individuals catch up later. Second, unlike most migration studies, our study estimates a migration effect that has netted out the effect of job changing, and thus our results do not imply that any group experiences a negative return from job changing. Third, it is possible that return and repeat migration are driving the negative returns for high school dropouts, and we do observe more repeat and return migration for high school dropouts than for other education categories.35 To explore this possibility, we excluded those with repeat or return migration from the mover sample. This modification, however, did not change the negative migration effect for high school dropouts or the positive effect for college graduates. Return migration within two years is 36% for dropouts and 24% for the overall sample. Repeat migration within two years is 36% for dropouts and 22% for the whole sample. 35 27 5.3 Robustness of the Matching Estimates Comparing Table 5 Panels A (repeated from Table 4 Panel A for comparison purposes) and B illustrates how the estimates of the ATET change when we add the county unemployment rate and two dummy variables indicating whether father or mother has a college degree to the propensity score; here we again use local linear regression (with q = 5 trimming). The estimated effects are quite close except that returns for high school dropouts change from −12.2% to −10.27% when we use the additional variables. Not surprisingly, the standard errors are generally larger under the expanded model because we use the additional conditioning variables. Due to these larger standard errors, the migration effects for high school dropouts and college graduates are no longer statistically significant. [Insert Table 5 here] Next we investigate how sensitive our matching estimates are to the (distance-based) threshold used to define migration. As noted above, to this point we have defined migration as having occurred if the respondent moved at least 50 miles or changed MSA and moved at least 20 miles. In Panels C and D of Table 5 we use two alternative definitions of migration that require moving at least 50 miles or 20 miles respectively regardless of whether someone was changing an MSA; again the results are based on local linear regression (with q = 5 trimming). When we use the 50 miles criterion, we find that the absolute values of the point estimates for high school dropouts and college graduates fall somewhat while their standard errors rise, and all estimates become insignificant, even at the 10% level. However, our inference is not affected as much by using a migration definition based on moving 20 miles for everyone, except that the negative migration effect for high school graduates becomes significant at the 10% level (see Panel D). We would argue that within the U.S. a move between MSAs of 20 miles or greater involves changing labor markets, and thus our original definition of migration makes the most sense. 5.4 Using State-to-State or County-to-County Definitions of Migration Our distance-based measure of migration data allows us to use a measure of migration which we believe is superior to two alternative definitions of migration commonly used in the literature: changing state of residence or changing county of residence. For example, MSAs are often spread over counties and even states. Table 6 presents numbers of moves using the three definitions. In Panel A we investigate how a state-to-state criterion misclassifies migration given our definition. We see first that a state-to-state measure misses 136 moves that our definition catches; moreover included in these 136 moves are 56 that involve moves between 50 to 100 miles and 59 that involve 28 moves of more than 100 miles. Further, the state-to-state measure counts 16 short-distance moves that we would not consider to represent migration using our definition. Next, in Panel B we examine the differences between our definition of migration and one based on moves across county borders. Not surprisingly, the county-to-county measure does not miss any moves defined by our measure. However, the county-to-county measure would count an extra 164 moves as migration, with 105 of them less than 20 miles. (Such a short distance does not seem to imply changing local labor markets). Note the differences in the number of moves between our definition and those using state-to-state or county-to-county definitions are quite large given that we only have 378 movers using our definition. [Insert Table 6 here] In Table 7 we compare the results from our baseline specification for local linear regression with q = 5 trimming (repeated from Table 4 panel A for comparison purposes), to results based on a state-to-state measure (presented in Panel B) and a county-to-county measure (presented in Panel C). When we use the state-to-state measure, all of our treatment effects become quite insignificant. However, when we use a county-to-county measure, the migration effect becomes insignificant for high school dropouts and college graduates, but the estimated negative effect for high school graduates becomes statistically significant. Clearly a researcher’s inference about these treatment effects would be affected by which definition of migration is used. [Insert Table 7 here] 5.5 Empirical Results from LATE Estimators36 As discussed in Section 3.2, we use the same county variable (a dummy variable coded 1 for a respondent living at age 14 in the same county where he was born), and a dummy variable coded 1 if the individual’s father has a college degree, as IV in our LATE estimation. Appendix Table A.1 presents the estimated coefficients for two probit models that estimate the probability of migration using these instruments and other conditioning variables. In column 1 we only include the same county variable with the other conditioning variables in LATE estimators, and the same county variable is significant at the 1% level. We denote results based on this specification as Model A. In column 2 we also include a variable indicating whether the father went to college, and a likelihood ratio test indicates that the two instruments are jointly significant at the 1% level. We denote results based on this specification as Model B. 36 We use Markus Frölich’s gauss program for LATE estimators with covariates. 29 Column 1 of Table 8 presents the estimated coefficients for the conditioning variables in a probit equation where dependent variable is equal to one if the respondent was not living at age 14 in the same county where he was born (indicating lower moving costs), and equal to zero otherwise (indicating higher moving costs). In column 2 (Model B) we show the probit coefficients for the conditioning variables where we construct our dependent variable as equal to one if the respondent was not living at age 14 in the same county where he was born and his father has a college degree (indicating a better financial condition) and equal to zero if neither of these conditions holds. (We drop 847 observations where only one of the conditions holds.)37 In Model A, the African American dummy variable is highly significant. In Model B, the coefficients on dummy variables for African Americans, high school dropouts, high school graduates and for those married are highly significant while the coefficients for Hispanics dummy variable and those with some college are significant at the 10% level. The upshot is that it is important to include conditioning variables in our LATE estimators. [Insert Table 8 here] Based on estimates from both Models A and B in Table 8, Figure 2 presents the density plots of the estimated “instrument propensity scores”, π ( x ) = P ( Z = 1 X = x ) , for Z = 1 and Z = 0 groups. In the same spirit of Figure 1, Figure 2 summarizes the similarity of the conditioning variables between two groups, and here the two groups are defined by instruments. The two densities exhibit substantial overlap under each model (although the degree of similarity is greater for Model A than for Model B), and we expect that assumption 6 of Section 3.2 is satisfied for both models. [Insert Figure 2 here] As noted above we consider two LATE parameters: i) the average treatment effect (ATE) for all compliers and ii) the average treatment effect on the treated (ATET) for the compliers who moved. Using Model A, where only the same county variable is an IV, we estimate the fractions of the sample who are compliers, always-takers and never-takers to be 4.3% (1.64%), 16.3% (1.08%) and 79.4% (1.30%) respectively where standard errors are in parentheses.38 Further, we estimate the ATE and ATET (on wage growth) for compliers to be −7.6% (50.8%) and 11.8% (253.1%) respectively with standard errors in parentheses. With Model B, which uses both the same county 37 We thank Markus Frölich for suggesting that we construct the instrument in this way. Standard errors are calculated using the number of repetitions indicated by the Andrews-Buchinsky algorithm. 38 30 and father’s education variables as IV, we estimate the fractions of compliers, always-takers and never-takers to be 10.1% (5.69%), 16.1% (1.27%) and 73.8% (5.76%) respectively. The estimated ATE and ATET for compliers are −22.0% (256.4%) and −1.7% (338.6%) respectively. Thus in each model we estimate the fractions of compliers, always-takers and never-takers relatively precisely, but there are far too few compliers to estimate the ATE and ATET with any precision.39 6. Conclusion Our paper estimates the effect of U.S. internal migration on wage growth for young men between their first and second job. Our analysis of migration differs from previous research in four important ways. First, we exploit the distance-based measures of migration in the NLSY79. Second, we let the effect of migration on wage growth differ by schooling level. Third, we use propensity score matching to address selection issues and to estimate the effect of migration on the wage growth of young men who move and argue that this treatment effect is at least as interesting for economists as the effect of migration on wages for a randomly chosen individual. Since matching is a “data hungry” approach, we are fortunate that the NLSY79 provides a rich array of variables on which to match, making our use of the Conditional Independence Assumption reasonable. Specifically, we use variables on home ownership, previous starting wage, ending wage and job tenure, as well as variables on family background and demographics. Finally, we use non-parametric LATE estimators with conditioning variables to estimate the ATE and ATET for compliers. We consider a number of matching estimators: local linear, local cubic, and local linear ridge regression. We find that local linear and local linear ridge regression matching produce relatively similar point estimates and standard errors, while our use of local cubic regression matching leads to over fitting the data and large standard errors. Since it is unknown whether the standard n -bootstrap is appropriate for the matching estimators that we consider, we use an algorithm from Bickel and Sakov (2008) to implement the ‘m out of n’ bootstrap procedure. Interestingly, we find that the use of the standard n -bootstrap is appropriate for calculating standard errors in our application. We use the Andrews-Buchinsky method to estimate the number of bootstrap replications when we use the standard n -bootstrap, and find that for the local cubic regression estimator and LATE estimators (not for other matching estimators), the required number of repetition is much larger than those commonly used in empirical research. From our matching estimators, we find a significant positive effect of migration on the wage growth of college graduates, and a marginally significant negative effect for high school dropouts. 39 Both LATE estimators require large data. Frölich and Lechner (2006) also find that LATE estimates are highly variable for individual labor markets. Fortunately in their case they are able to aggregate individual markets weighted by estimated number of compliers to get precise point estimates. 31 We do not find any significant effects for other educational groups or for the overall sample. Our results are generally robust to changes in the model specification, changes in our distance-based measure of migration, level of trimming, and changing from a local bandwidth to a global bandwidth. We find that better data matters; if we use a measure of migration based on moving across county lines, we overstate the number of moves, while if we use a measure based on moving across state lines, we understate the number of moves. Further, moving to either migration definition has an important effect on our estimates of the ATET for movers. We also consider LATE estimators of the ATE and ATET of moving for compliers. Here we consider using i) whether the individual was living in his birth county at age 14 and ii) the same county variable and a variable indicating whether the respondents’ father went to college as instrumental variables. However, we have far too few compliers in our data to obtain precise LATE estimates of the ATE and ATET for compliers using either set of instruments. 32 References Abadie, A. and Imbens, G. (2008) “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 76, 1537-1557 Andrews, D. W. K. and Buchinsky, M. (2000), “A Three-Step Method for Choosing the Number of Bootstrap Repetitions,” Econometrica, 67, 23-51. Andrews, D. W. K. and Buchinsky, M. (2001), “Evaluation of a Three-Step Method for Choosing the Number of Bootstrap Repetitions,” Journal of Econometrics, 103, 345-386. Angrist, J. and Imbens, G. (1995), “Two-stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity,” Journal of American Statistical Association, 90, 431– 442. Angrist, J., Imbens, G., and Rubin, D. (1996), “Identification of Causal Effects Using Instrumental Variables,” Journal of American Statistical Association, 91, 444–472 (with discussion) Bartel, A. P. (1979), “The Migration Decision: What Role Does Job Mobility Play?” American Economic Review, 69, 775-786. Bickel, P., Götze, F. and van Zwet, W. (1997), “Resampling fewer than n observations: gains, losses and remedies for losses,” Statistica Sinica, 7, 1-31. Bickel, P. J. and Sakov, A. (2008), “On the Choice of m in the m Out of n Bootstrap and Confidence Bounds for Extrema,” Statistica Sinica, 18, 967-985. Borjas, G. J., Bronars, S. G. and Trejo, S. J. (1992a), “Assimilation and the Earnings of Young Internal Migrants,” The Review of Economics and Statistics, 74, 170-175. Borjas, G. J., Bronars, S. G. and Trejo, S. J. (1992b), “Self-Selection and Internal Migration in the United States,” Journal of Urban Economics, 32, 159-185. Dahl, G. B. (2002), “Mobility and the Returns to Education: Testing a Roy Model with Multiple Markets,” Econometrica, 70, 2367-2420. Falaris, E. M. (1987), “A Nested Logit Migration Model with Selectivity,” International Economic Review, 28, 429-444. Fan, J. and Gijbels, I. (1992), “Variable Bandwidth and Local Regression Smoothers,” Annals of Statistics, 20, 2008-2036. Fan, J. and Gijbels, I. (1996), Local Polynomial Modeling and Its Applications, Monographs on Statistics and Applied Probability 66 (London: Chapman & Hall). Frölich, M. (2004), “Finite Sample Properties of Propensity-Score Matching and Weighting Estimators,” Review of Economics and Statistics, 86, 77-90. Frölich, M. and Lechner, M. (2006), “Exploiting Regional Treatment Intensity for the Evaluation of Labour Market Policies,” IZA working paper No. 2144. 33 Frölich, M. (2007), “Nonparametric IV Estimation of Local Average Treatment Effects with Covariates,” Journal of Econometrics, 139, 35-75 Gabriel, P. E. and Schmitz, S. (1995), “Favorable Self-Selection and the Internal Migration of Young White Males in the United States,” Journal of Human Resources, 30, Summer, 460-471. Glaeser, L. E. and Mare, D. C. (2001), “Cities and Skills,” Journal of Labor Economics, 19, 316-342. Goss, E. P. and Schoening, N. C. (1984), “Search Time, Unemployment and the Migration Decision,” Journal of Human Resources, 19, 570-579. Greenwood, M. J. (1997), “Internal Migration in Developed Countries,” Handbook of Population and Family Economics, vol. 1B, 648-720 (Amsterdam, New York: Elsevier). Hahn, J. (1998), “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects,” Econometrica, 66, 315-331. Hanushek, E. A. (1973), “Regional Differences in the Structure of Earnings,” Review of Economics and Statistics, 55, 204-213. Heckman, J., Ichimura, H. and Todd, P. (1997), ”Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program,” Review of Economic Studies, 64, 605-654. Heckman, J., Ichimura, H. and Todd, P. (1998), “Matching as an Econometric Evaluation Estimator,” Review of Economic Studies, 65, 261-294. Heckman, J., Ichimura, H., Smith, J. and Todd, P. (1998), ”Characterizing Selection Bias Using Experimental Data,” Econometrica, 66, 1017-1098. Heckman, J., LaLonde R. and Smith, J. (1999), “The Economics and Econometrics of Active Labor Market Programs,” in O. Ashenfelter and D. Card (eds.), Handbook of the Labor Economics (Amsterdam, New York: Elsevier). Heckman, J. and Vytlacil, E. (2001), “Local Instrumental Variables,” in Hsiao, C., Morimune, K., Powell, J. (ed), Nonlinear Statistical Inference: Essays in Honor of Takeshi Amemiya, Cambridge University Press, Cambridge. Hirano, K., Imbens, G. and Ridder, G. (2003), “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161-1189. Hunt, J. C. and Kau, J. B. (1985), “Migration and Wage Growth: A Human Capital Approach,” Southern Economic Journal, 51, 697-710. Imbens, G. (2001), “Some Remarks on Instrumental Variables,” in Lechner, M. and Pfeiffer, F. (ed), Econometric Evaluation of Labour Market Policies, Physica/Springer, Heidelberg, pp. 17–42. Imbens, G. (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” Review of Economics and Statistics, 86, 4-29. Imbens, G. and Angrist, J. (1994), “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62, 467–475. 34 Imbens, G. and Rubin, D. (1997), “Estimating Outcome Distributions for Compliers in Instrumental Variables Models,” Review of Economic Studies, 64, 555–574. Kennan, J. and Walker, J. (2003), “The Effect of Expected Income on Individual Migration Decisions,” (Mimeo, University of Wisconsin). Lansing, J. B., and Mueller, E. (1967), “The Geographic Mobility of Labor,” Survey Research Center, Ann Arbor, MI. Lechner, M. (2000), “An Evaluation of Public-Sector-Sponsored Continuous Vocational Training Programs in East Germany,” Journal of Human Resources, 35, 347-375. Linneman, P. and Graves, P. E. (1983), “Migration and Job Change: A Multinomial Logit Approach,” Journal of Urban Economics, 14, 263-279. McCall, B. P. and McCall, J. J. (1987), “A Sequential Study of Migration and Job Search,” Journal of Labor Economics, 5, 452-476. Nakosteen, R. A. and Zimmer, M. A. (1980), “Migration and Income: The Question of SelfSelection,” Southern Economic Journal, 46, 840-851. Nakosteen, R. A. and Zimmer, M. A. (1982), “The Effects on Earnings of Interregional and Interindustry Migration,” Journal of Regional Science, 22, 325-341. Plane, D. A. (1993), “Demographic Influences on Migration,” Regional Studies, 27, 375-383. Polachek, S. W. and Horvath, F. W. (1977), “A Life Cycle Approach to Migration: Analysis of the Perspicacious Peregrinator,” in Ehrenberg, R. G. (ed.), Research in Labor Economics, 103-149. Politis, D. and Romano, J. (1994), “Large sample confidence regions on subsamples under minimal Assumptions,” Annals of Statistics, 22, 2031-2050. Raphael, S. and Riker, D. A. (1999), “Geographic Mobility, Race, and Wage Differentials,” Journal of Urban Economics, 45, 17-46. Robinson, C. and Tomes, N. (1982), “Self-Selection and Interprovincial Migration in Canada,” Canadian Journal of Economics, 15, 474-502. Rosenbaum, P. R. and Rubin, D. B. (1983), “The Central Role of the Propensity Score in Observational Studies for Casual Effects,” Biometrika, 70, 41-55. Rosenbaum, P. R. and Rubin, D. B. (1985), “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score,” The American Statistician, 39, 33-38. Ruppert, D., Sheather, S. J. and Wand, M. P. (1995), “An Effective Bandwidth Selector for Local Least Squares Regression,” Journal of the American Statistical Association, 90, 1257-1270. Seifert, B., and Gasser, T. (1996), “Finite-Sample Variance of Local Polynomials: Analysis and Solutions,” Journal of American Statistical Association, 91, 267–275. 35 _____, (2000), “Data Adaptive Ridging in Local Polynomial Regression,” Journal of Computational and Graphical Statistics, 9, 338–360. Shaw, K. (1991), “The Influence of Human Capital Investment on Migration and Industry Change,” Journal of Regional Science, 31, 397-416. Sjaastad, L. A. (1962), “The Costs and Returns of Human Migration,” Journal of Political Economy, 70, 80-93. Smith, J. and Todd, P. (2005), “Does Matching Address LaLonde's Critique of Nonexperimental Estimators?” Journal of Econometrics, 125, 355-364. Topel, R. H. and Ward, M. P. (1992), “Job Mobility and the Careers of Young Men,” Quarterly Journal of Economics, 107, 439-479. Tunali, I. (2000), “Rationality of Migration,” International Economic Review, 41, 893-920. Willis, R. and Rosen, S. (1979), “Education and Self-Selection,” Journal of Political Economy, 87, s7-s36. Yankow, J. J. (1999), “The Wage Dynamics of Internal Migration,” Eastern Economic Journal, 25, 265278. Yankow, J. J. (2003), “Migration, Job Change, and Wage Growth: A New Perspective on the Pecuniary Return to Geographic Mobility,” Journal of Regional Science, 43, 483-516. Zhao, Z. (2004), “Using Matching to Estimate Treatment Effects: Data Requirements, Matching Metrics, and Monte Carlo Evidence,” Review of Economics and Statistics, 86, 91-107. 36 Table 1. Variable Definitions and Descriptive Statistics Variable Name Variable Definition Means Whole Sample Migrate =1 if respondent moved at least 50 miles or changed MSA and moved at least 20 miles Movers Stayers Difference 0.18 (0.009) Panel A: Individual Characteristics Age Age in years Black =1 if African American Hispanic =1 if Hispanic White =1 if non-Hispanic White Dropout High_school Some_college College =1 if highest grade completed is less than 12 =1 if highest grade completed is equal to 12 =1 if highest grade completed is greater than 12 and less than 16 =1 if highest grade completed is greater than or equal to 16 Married =1 if married, spouse present Home_Owner =1 if own home on job 1 MSA =1 if reside in MSA at time of job 1 26.08 (0.096) 0.23 (0.009) 0.13 (0.007) 0.63 (0.011) 0.17 (0.008) 0.47 (0.011) 25.99 (0.185) 0.15 (0.018) 0.13 (0.017) 0.73 (0.023) 0.12 (0.017) 0.33 (0.024) 26.10 (0.110) 0.25 (0.011) 0.14 (0.008) 0.61 (0.012) 0.18 (0.009) 0.50 (0.012) 0.18 (0.008) 0.18 (0.020) 0.18 (0.009) 0.18 (0.008) 0.40 (0.011) 0.20 (0.009) 0.86 (0.008) 0.36 (0.025) 0.44 (0.026) 0.14 (0.018) 0.84 (0.019) 0.14 (0.008) 0.39 (0.012) 0.21 (0.010) 0.87 (0.008) 0.227 (0.026) 2.13 (0.023) 2.21 (0.024) 2.66 (0.121) 1.99 (0.010) 2.07 (0.010) 2.59 (0.058) 0.141 (0.025) 0.142 (0.026) 0.070 (0.134) -0.106 (0.215) -0.108 (0.021) -0.009 (0.019) 0.12 (0.026) -0.06 (0.019) -0.166 (0.027) 0.002 (0.021) 0.057 (0.028) -0.068 (0.020) -0.029 (0.020) Panel B: Job 1 Variables Tenure Tenure of job 1 (year) 2.02 (0.009) 2.09 (0.010) 2.60 (0.052) Professional1 =1 if professional/managerial occupation on job 1 0.22 (0.009) 0.39 (0.025) 0.18 (0.009) 0.210 (0.027) Unemployment Rate County unemployment Rate at the end of job 1 6.995 (0.069) 6.96 (0.165) 7.00 (0.076) -0.005 (0.181) Logarithm of starting wage on job 2 ($1990) 2.19 (0.010) 2.30 (0.026) 2.16 (0.011) 0.142 (0.029) 0.07 (0.006) 0.14 (0.008) 0.12 (0.017) 0.25 (0.022) 0.06 (0.006) 0.12 (0.008) 0.06 (0.018) 0.57 (0.011) 0.49 (0.026) 0.59 (0.012) -0.096 (0.028) log(startwage1) log(endwage1) Logarithm of starting wage on job 1 ($1990) Logarithm of ending wage on job 1 ($1990) Panel C: Job 2 Variable log(startwage2) Panel D: Family background =1 if mother was college Mother_college graduate Father_college =1 if father was college graduate Same_county =1 if respondent resides at age 14 in the same county as county of birth Notes: 1. Sample size equals 2,078, and the sample consists of 378 movers and 1,700 stayers. 2. Standard errors of the sample mean are in parentheses. 37 0.132 (0.024) Table 2. Propensity Score of Migration Coefficient Estimates Model I Model II Intercept -5.82** (1.30) -5.97** (1.34) Age/10.0 4.13** (0.98) 4.15** (1.00) Age**2 /100.0 -0.83** (0.19) -0.83** (0.19) 0.03 (0.10) 0.06 (0.10) Black -0.27** (0.09) -0.25** (0.09) Dropout -0.58** (0.13) -0.53** (0.14) High_school -0.60** (0.11) -0.56** (0.11) Some_college -0.42** (0.11) -0.39** (0.12) Married 0.16** (0.08) 0.17** (0.08) MSA1 -0.30** (0.10) -0.30** (0.10) Professional1 0.34** (0.09) 0.35** (0.09) Home_Owner1 -0.53** (0.11) -0.55** (0.11) log(startwage1) 0.18 (0.11) 0.19 (0.12) Tenure 0.02 (0.02) 0.02 (0.02) log(endwage1) 0.07 (0.11) 0.05 (0.12) Hispanic Father_college 0.21** (0.11) Mother_college -0.04 (0.13) County unemployment rate 0.01 (0.01) Chi-square statistic 8.72 2.30 Notes: ** and * indicate significance at the 5% and 10% level, respectively. 1. Probit models are used for estimation of the propensity score. 2. Values in the parentheses are standard errors. 3. The Chi-square statistic of Model I is from the likelihood ratio test against the model without the three job 1 variables: starting wage, ending wage and tenure. The Chi-square statistic of Model II is from the likelihood ratio test against Model I. The critical value for both tests at the 5% significance level is 7.81. 38 Table 3. Balancing Tests Panel A: Paired t- tests Model I Age Hispanic Black Married MSA1 Professional1 Home Owner1 log(startwage1) Tenure log(endwage1) Father college Mother_college Unemployment rate Model II Difference Paired t Statistics 0.018 0.015 0.003 -0.023 -0.009 -0.023 -0.021 -0.022 -0.006 -0.020 - 0.480 0.585 0.112 -0.617 -0.341 -0.853 -0.777 -0.717 -0.038 -0.605 - - - Difference Paired t Statistics 0.298 0.009 0.018 -0.018 0.009 -0.006 -0.033 0.008 0.134 0.020 0.030 -0.030 -0.369 1.174 0.351 0.662 -0.465 0.321 -0.184 -1.239 0.242 0.797 0.604 1.054 -1.271 -1.383 Panel B: F Test Statistics Model I Model II 1st quartile 0.84 0.78 2nd quartile 0.65 1.07 3rd quartile 0.81 1.13 4th quartile 0.90 1.38 F (10, 75) = 1.96 F (13, 69) = 1.86 Critical value at the 5% level Notes: 1. All tests are based on nearest neighbor matching with q = 5 trimming, and the differences in the variable means are taken between movers and the matched sample of stayers. 2. Each F test is based on the variables included in the respective model. 3. No test statistic is statistically significant at standard confidence levels. 39 Table 4. Matching Estimates of the Average Effect of Migration on Wage Growth from Model I (Baseline Model) Matching Estimator Overall Sample Dropouts High_school Some_college College Grads Panel A: Local Linear Matching with Trimming level q = 5 25% bandwidth (200 repetitions) [300 repetitions] -0.32% (2.49%) (2.46%) -12.20%* (6.90%) (7.10%) -4.79% (3.87%) (3.96%) 0.03% (5.55%) (5.61%) 10.42%** (5.83%) (5.35%) Panel B: Local Linear Matching with Trimming level q = 3 25% bandwidth (300 repetitions) 0.06% (2.40%) -12.20%* (7.40%) -4.79% (3.70%) -0.10% (5.58%) 10.56%** (5.34%) Panel C: Local Linear Matching with Trimming level q = 7 25% bandwidth (300 repetitions) -0.88% (2.51%) -12.46%* (7.36%) -4.73% (3.71%) -0.75% (5.61%) 10.86%* (6.06%) Panel D: Local Cubic Matching with Trimming level q = 5 25% bandwidth (300 repetitions) {1100 repetitions} -1.64% (3.62%) {6.28%} -12.51% (12.52%) {29.26%} -4.90% (5.89%) {15.05%} -4.48% (9.93%) {8.74%} 9.28% (5.62%) {6.22%} Panel E: Local Linear Ridge Matching without Trimming bandwidth chosen by cross-validation (300 repetitions) 1.90% (2.63%) -12.60% (7.77%) -4.40% (3.89%) -2.20% (5.41%) 14.60%** (4.77%) Notes: ** and * indicate significance at the 5% and 10% level, respectively. 1. The individual migration effect is estimated by each mover's wage growth minus its counterfactual estimated from the corresponding matching estimator. We then average the individual effects over corresponding samples to obtain the estimates in this table. 2. The trimming procedure is designed to trim q percent to 2q percent of movers. Because this is a data-driven approach, the exact trimming level depends on the data structure. 3. Standard errors calculated from corresponding bootstrap repetitions are in parentheses. 4. To implement finer balancing matching with 25% bandwidth, we first choose a variable bandwidth to give each mover a comparison group equal to 25% of the stayers. We then use only those in the group who are in the same educational category as the mover in question. Each mover gets far less than 25% of stayers in the local regression. 40 Table 5. Matching Estimates of the Average Effect of Migration on Wage Growth Based on an Expanded Model and Two Alternative Migration Definitions Local Linear Estimator Overall Dropouts High_school Some_college College Grads 0.03% (5.61%) 10.42%** (5.35%) -0.50% (5.73%) 9.75% (6.05%) Panel A: Model I (Baseline Model) with 25% bandwidth (300 repetitions) -0.32% (2.46%) -12.20%* (7.10%) -4.79% (3.96%) Panel B: Model II (Expanded Model) with 25% bandwidth (300 repetitions) -0.47% (2.49%) -10.27% (7.22%) -4.83% (3.90%) Panel C: Longer Distance (50 miles) Migration Definition for Everyone - Model I with 25% bandwidth (300 repetitions) -0.33% (2.77%) -11.31% (8.20%) -3.74% (4.25%) -0.05% (5.90%) 8.71% (5.93%) Panel D: Shorter Distance (20 miles) Migration Definition for Everyone - Model I with 25% bandwidth (300 repetitions) -0.90% (2.41%) -10.44%* (6.30%) Note: See notes to Table 4. 41 -6.35%* (3.63%) 1.20% (5.15%) 11.55%* (5.94%) Table 6. Misclassification of Movers and Stayers When Move is Defined as Change-of-State or Change-of-County Number of individuals Panel A. Change-of-State Measure Long distance moves not counted when defining mover 20 miles < Moving distance < 50 miles (changing MSA) 50 miles < Moving distance < 100 miles Moving distance > 100 miles Moving distance > 500 miles 136 21 56 58 1 Short distance moves counted when defining mover Moving distance < 5 miles 5 miles < Moving distance < 20 miles 16 4 8 20 miles < Moving distance < 50 miles (not changing MSA) 4 Panel B. Change-of-County Measure Long distance moves not counted when defining mover 0 Short distance moves counted when defining mover Moving distance < 5 miles 5 miles < Moving distance < 20 miles 20 miles < Moving distance < 50 miles (not changing MSA) 42 164 17 88 59 Table 7. Matching Estimates of the Average Effect of Migration on Wage Growth Based on Another Two Alternative Definitions of Migration Local Linear Estimator with 25% bandwidth (300 repetitions) with 25% bandwidth (300 repetitions) with 25% bandwidth (300 repetitions) Overall Dropouts Some_college College Grads Panel A: Base Distance Measure, Model I -0.32% -12.20%* -4.79% (2.32%) (7.25%) (3.65%) 0.03% (5.66%) 10.42%** (5.06%) Panel B: Change-of-State Measure, Model I 0.03% -7.13% -4.71% (2.91%) (8.49%) (5.41%) -1.67% (6.86%) 8.59% (6.70%) Panel C: Change-of-County Measure, Model I -1.98% -7.35% -6.23%** (2.05%) (5.70%) (2.98%) -0.48% (5.20%) 7.27% (4.81%) Note: See notes to Table 4 43 High_school Table 8. Probit Estimates for Propensity Score of Intrumental Variables Model A Model B Intercept -0.262 (0.962) -1.230 (1.961) Age/10.0 -0.126 (0.720) 0.009 (1.446) Age**2 /100.0 0.020 (0.134) -0.014 (0.266) Hispanic 0.118 (0.085) -0.308* (0.173) -0.485** (0.073) -0.910** (0.156) Dropout 0.046 (0.114) -1.591** (0.254) High_school 0.002 (0.095) -1.077** (0.153) Some_college 0.149 (0.101) -0.278* (0.150) Married -0.063 (0.060) -0.257** (0.112) MSA1 0.067 (0.082) 0.268 (0.169) Professional1 0.017 (0.081) 0.058 (0.129) log(startwage1) 0.158 (0.098) 0.221 (0.173) Tenure -0.004 (0.014) -0.023 (0.028) log(endwage1) -0.007 (0.098) 0.228 (0.173) Black Note: For Model A, the dependent variable is one if the respondent was not living at age 14 in the same county where he was born and is zero otherwise. For Model B, the dependent variable is equal to one if both of the following hold: i) the respondent was not living at age 14 in the same county where he was born and ii) the father has a college degree, and is equal to zero if neither condition holds. For this model, we drop observations where only one of the conditions holds. 44 Table A.1 Reduced Form Probit Estimates for the Migration Decision and Tests for Significance of Excluded Instruments Model A Model B Intercept -6.134** (1.296) -6.182** (1.298) Age/10.0 4.589** (0.976) 4.582** (0.977) Age**2 /100.0 -0.925** (0.183) -0.924** (0.183) 0.059 (0.103) 0.077 (0.103) Black -0.201** (0.091) -0.180** (0.092) Dropout -0.571** (0.131) -0.508** (0.136) High_school -0.606** (0.105) -0.550** (0.110) Some_college -0.438** (0.112) -0.405** (0.114) 0.038 (0.072) 0.046 (0.072) MSA1 -0.281** (0.096) -0.292** (0.096) Professional1 0.328** (0.089) 0.329** (0.089) log(startwage1) 0.175 (0.114) 0.172 (0.114) Tenure 0.005 (0.018) 0.005 (0.018) log(endwage1) 0.013 (0.114) 0.006 (0.114) Same County -0.177** (0.068) -0.162** (0.069) Hispanic Married 0.177* (0.098) Father_college 9.912 Chi-square statistic Note: For Model B the null hypothesis of the likelihood ratio test is that the coefficients of Same County and Father_college variables are both equal to zero. The null hypothesis is rejected at the 1% level. 45 2.0 2.5 3.0 Figure 1: Distributions of the Estimated Propensity Scores (Table 2 Model I) 0.0 0.5 1.0 1.5 movers stayers 0.0 0.2 0.4 46 0.6 0.8 1.0 Figure 2: Distributions of the Estimated Instrument Propensity Scores (Table 8 Model A and Model B) 3.0 Model A 0.0 1.0 2.0 z=1 z=0 0.0 0.2 0.4 0.6 0.8 3.0 Model B 0.0 1.0 2.0 z=1 z=0 0.0 0.2 0.4 0.6 47 0.8 1.0 Appendix: Three-Step Method for Choosing the Number of Bootstrap Repetitions Andrews and Buchinsky (2000, 2001) propose a three-step method for choosing the number of bootstrap repetitions. We follow their procedure to set the proper number of bootstrap repetitions to calculate the standard errors for each parameter we estimate. The following is a special case in Andrews and Buchinsky (2001). We first define the notations related to our problem following Andrews and Buchinsky (2001). θ is a scalar parameter, and λ is an unknown parameter of interest. In our case θ is the ATET identified by the matching estimator or one of the two parameters identified by the LATE estimators (ATE and ATET for compliers), and λ is the standard error of θ . B is the number of repetitions, and pdb denotes the measure of accuracy, which is the percentage deviation of the bootstrap quantity of interest based on bootstrap repetitions from the ideal bootstrap quantity for which B = ∞ . The magnitude of B depends on both the accuracy required and the data. If we required the actual percentage deviation to be less than pdb with a specified probability 1 − τ , then the three-step method takes pdb and τ as given and provides a minimum number of repetitions B* to obtain the desired level of accuracy. We use a conventional accuracy level, ( pdb,τ ) = (10,0.05) . Step 1. Calculate initial number of repetitions B1 The three-step method depends on a preliminary estimate ω1 of the asymptotic variance ω of ( ) B1 2 λˆB − λˆ∞ λˆ∞ , where λˆB and λ̂∞ are the bootstrap estimates from B and infinite repetitions respectively. Following Andrews and Buchinsky (2000, 2001), we set a starting value of ω1 = 0.5 in equation A.1 below. (The three-step method is not too sensitive to this starting value because it uses a finite sample estimate of ω in the last step.) Given this starting value, we then calculate 10,000 ∗ z12−τ 2 ∗ ω1 B1 = int 2, pdb where z1−τ 2 (A.1) is 1 − τ 2 quantile of standard normal distribution. In our case B1 = 193 . { } Step 2. Use the bootstrap results θˆ : θˆ1 ,θˆ2 ,...,θˆB1 to update ω1 to ωB : µB = 1 B1 ˆ ∑θ r , B1 r =1 ( 1 B1 ˆ γB = ∑ θ r − µB B1 − 1 r =1 (A.2) ) 4 seB4 − 3 , (A.3) 48 { } where seB is the standard deviation of θˆ : θˆ1 ,θˆ2 ,...,θˆB1 . Then ωB = ( 2+γB) 4 . (A.4) Step 3. Calculate B2 from 10,000 ∗ z12−τ 2 ∗ ωB B2 = int 2. pdb (A.5) Then the minimum number of repetitions is B* = max ( B1 , B2 ) . 49