Quantile House Price Indexes

N. Edward Coulson
Department of Economics
Pennsylvania State University
University Park, PA 16802-3306
fyj@psu.edu
Phone: 814-863-0625
Fax: 814-863-4775

Daniel P. McMillen
Department of Economics (MC 144)
University of Illinois at Chicago
601 S. Morgan St.
Chicago, IL 60607
mcmillen@uic.edu
Phone: 312-413-2100
Fax: 312-996-3344

March 28, 2005

Abstract

Unobserved remodeling and missing “quality” variables, which are endemic to existing hedonic housing data sets, tend to produce an upward bias in estimated appreciation rates. To reduce the effect of missing variables that tend to occur at certain points in the error distribution, we propose the use of quantile regression procedures to estimate house price indexes. We find evidence of significant quantile effects in a sample of home sales drawn from Chicago for 1983-2001. Prices drawn from the upper tail of the error distribution increased most rapidly during this period. A time series analysis of the estimated price indexes suggests that quantile effects are similar across five regions in Chicago. Price changes in high-quantile houses lead to price changes in low-quantile houses.

1. Introduction

The purpose of a house price index is to track the rate of price appreciation over time for a standard or representative house. Using sample averages to construct the index is inappropriate because the distribution of house prices is typically right-skewed: most homes sell at lower prices, and a small number of sales of high-priced homes can significantly affect the index. Non-academic estimates of price indexes, such as those reported by the National Association of Realtors or by local newspapers, frequently use the sample median as the basis for constructing an index. Although the median is an improvement over the mean, it does not control for house characteristics.
If large, new houses dominate sales during later periods, both the mean and the median may imply an artificially high rate of price appreciation.

Academic researchers have most often used one of two methods for constructing quality-controlled price indexes. The first method is a straightforward hedonic price function, in which the natural logarithm of sales price is regressed on a vector of house characteristics and variables indicating the time of sale. The coefficients for the time of sale variables produce the house price index. As missing variables may bias the hedonic price function estimates, a repeat sales estimator is often used instead. A repeat sales price index is estimated by regressing the percentage change in sales prices on a vector of discrete variables representing the time of sale. By focusing on price changes rather than levels, the repeat sales estimator avoids missing variable bias associated with house characteristics that remain unchanged over time. However, it may be subject to more severe sample selection bias than the hedonic approach because the relatively small sample of properties that sell at least twice may not be representative of the overall housing market. The repeat sales model may also be more prone to bias associated with missing information on various home improvements taking place since a home’s first sale.

As regression-based models, the hedonic and repeat sales approaches are mean-based procedures. As such, they are sensitive to outliers, and they invoke the assumption that all estimated coefficients, including those for the critical time of sale variables, do not depend on whether a home sale is drawn from the tails or the middle of the house price error distribution. However, the rate of appreciation may, in fact, depend on the home’s position in the error distribution. For example, homes may appreciate especially rapidly if they have recently been remodeled.
These observations may appear as outliers since recently remodeled homes are likely to comprise a small portion of the overall sample. Alternatively, appreciation rates may be especially high for unusually high-quality homes or those drawn from premium locations. Variables representing remodeling, high quality, and premium locations are likely to be unobserved, relegating their effects to the error term. In situations such as these, a median-based estimator may imply lower rates of appreciation than a standard, mean-based regression procedure.

In this paper, I propose the use of a quantile regression procedure to estimate house price indexes. Using a Monte Carlo procedure, I show that a quantile approach is less sensitive to missing variables than standard estimators and accurately identifies appreciation rates that vary across the error distribution. I illustrate the practical benefits of the quantile approach using data on home sales in Chicago for 1983-1999. I find evidence of significant quantile effects. Prices drawn from the upper tail of the error distribution increased most rapidly during this period. Observations from the upper tail of the distribution are likely to be recently remodeled or unusually high-quality homes.

The quantile approach may help to sidestep one of the vexing problems encountered when estimating price indexes. Unobserved remodeling and missing “quality” variables are endemic to existing data sets. These problems tend to produce an upward bias in estimated home price appreciation rates. A median-based quantile procedure is less vulnerable to this upward bias. Further, varying the target quantile leads to a richer characterization of the dynamics of appreciation rates across the full distribution of home prices.

2. Price Indexes

Academic researchers use two primary approaches for estimating house price indexes. The first is the hedonic approach as typified by the following equation:

$$y_{it} = x_i \beta + \delta_2 D_{2,it} + \dots + \delta_T D_{T,it} + u_{it} \qquad (1)$$

In equation (1), $y_{it}$ is the natural logarithm of the price of home i at time t, $x_i$ is a vector of housing characteristics such as square footage and the number of bedrooms, and $u_{it}$ is an error term. Sales dates range from 1 to T. The dummy variables $D_{2,it}, \dots, D_{T,it}$ indicate that the home sold during the period represented by the first subscript. Among many possible sources of bias, missing variables are probably the most important. The estimated price index will be biased if the missing variables are correlated with the time dummy variables. For example, suppose that the missing variable is a measure of house quality. If homes selling at later dates tend to be of higher quality than those from early sales, the δ’s from later periods will be biased upward and will overstate the rate of price appreciation. Examples of the hedonic price index approach include Kiel and Zabel (1997), Mark and Goldberg (1984), Palmquist (1980), and Thibodeau (1989).

The second common approach is the repeat sales method, which was originally proposed by Bailey, Muth, and Nourse (1963). Examples include Case and Quigley (1991), Case and Shiller (1987, 1989), Follain and Calhoun (1997), and Kiel and Zabel (1997). For the subset of homes in the sample that sold at least twice, we can calculate the difference in sales prices between time s and t, where s < t. The estimating equation for the standard repeat sales estimator is

$$y_{it} - y_{is} = \delta_2 (D_{2,it} - D_{2,is}) + \dots + \delta_T (D_{T,it} - D_{T,is}) + u_{it} - u_{is} \qquad (2)$$

The vector of housing characteristics, $x_i$, does not appear in this equation because we have assumed that the characteristics and the coefficient vector β do not change over time. If these assumptions are correct, the repeat sales estimator provides unbiased estimates of the price index without requiring data on all relevant housing characteristics. Thus, a missing variable such as house quality will not bias the estimates unless the variable or its coefficient changes over time.
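The cancellation behind equation (2) can be illustrated with a small sketch. The function name `repeat_sales_row`, the index values in `delta`, and the house effect of 11.3 are hypothetical and chosen only for illustration; none of them come from the paper's data.

```python
# Minimal sketch of the repeat sales regressors in equation (2).
# All numbers are hypothetical illustration values, not estimates from the paper.

def repeat_sales_row(s, t, n_periods):
    """Return D_{k,it} - D_{k,is} for k = 2..T for a home sold at s and again at t.

    Period 1 is the base period, so it has no column.
    """
    if not 1 <= s < t <= n_periods:
        raise ValueError("need 1 <= s < t <= n_periods")
    row = [0] * (n_periods - 1)
    row[t - 2] = 1          # second sale: +1 in the column for period t
    if s >= 2:
        row[s - 2] = -1     # first sale: -1 in the column for period s
    return row


# Hypothetical log price index: 0 in the base period, then 0.05 and 0.12.
delta = [0.05, 0.12]


def log_price(period, house_effect):
    """Log price = time-invariant house effect + index value for the period."""
    index = [0.0] + delta
    return house_effect + index[period - 1]


# A home with a time-invariant (possibly unobserved) effect sells in periods 2 and 3.
row = repeat_sales_row(2, 3, n_periods=3)
change = log_price(3, 11.3) - log_price(2, 11.3)
predicted = sum(r * d for r, d in zip(row, delta))
print(row, round(change, 4), round(predicted, 4))
```

The house effect drops out of `change`, so the differenced dummies alone reproduce the change in log price. This holds only while the effect and its coefficient stay fixed between the two sales, which is the assumption stated above.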
The following specification accounts for missing variables and time-varying coefficients by adding a new variable, z, with values that change over time:

$$y_{it} = x_i \beta + \delta_2 D_{2,it} + \dots + \delta_T D_{T,it} + z_{it} + u_{it} \qquad (3)$$

It is irrelevant whether the source of the variation in the new term is a time-varying coefficient or changes in the variable itself (as would be the case with remodeling): we can simply rewrite the model by writing $z_{it}$ as the product of the appropriate time dummy variable and a time-varying coefficient. Equation (3) becomes:

$$y_{it} = x_i \beta + \delta_2 D_{2,it} + \dots + \delta_T D_{T,it} + \left[ z_i + \lambda_2 z_i D_{2,it} + \dots + \lambda_T z_i D_{T,it} + u_{it} \right] \qquad (4)$$

and the repeat sales version of the equation is

$$y_{it} - y_{is} = \delta_2 (D_{2,it} - D_{2,is}) + \dots + \delta_T (D_{T,it} - D_{T,is}) + \left[ \lambda_2 z_i (D_{2,it} - D_{2,is}) + \dots + \lambda_T z_i (D_{T,it} - D_{T,is}) + u_{it} - u_{is} \right] \qquad (5)$$

The new variables measure changes in z between time t and the base period. The bracketed terms in equations (4) and (5) are the error terms when z is unobserved. The missing variables are correlated with the time variables, which leads to biased estimates of the price index.

3. Quantile Regression

As with any mean-based procedure, the ordinary regression model is sensitive to outliers. Although outliers are occasionally simply miscoded data, at other times missing variables lead to extreme values for the error terms. An obvious example in the case of house price models is remodeling, which is likely to produce an extremely high value for the error terms when it is not observed in the data set. The “quality” variable may also be the source of outliers: given observed housing characteristics, unusually high-quality homes will tend to have high prices and large values for the error term.

Unlike ordinary least squares, the target for quantile regression estimates is a parameter that is specified before estimation. Let q represent the target quantile. Also, let $e_{it}$ be the residual implied by the econometric model.
Quantile parameter estimates are the coefficients that minimize the following objective function:

$$\sum_{e_{it} > 0} 2q \, |e_{it}| + \sum_{e_{it} \le 0} 2(1-q) \, |e_{it}| \qquad (6)$$

At the median, q = 0.5, which implies that equal weight is given to positive and negative residuals. At the 90th percentile, 2q = 1.8 and 2(1-q) = 0.2, which implies that more weight is given to positive residuals, that is, to observations with high values for the dependent variable given the values of the explanatory variables. Because positive residuals are penalized more heavily at high quantiles, equation (6) will be minimized at a set of parameter values where approximately 100q% of the residuals are negative and 100(1-q)% are positive. This result differs from ordinary least squares, in which the sum of the residuals equals zero and otherwise there is no constraint on the number of positive residuals.

Koenker and Bassett (1978) originally proposed the quantile regression approach. Examples of applications include Albrecht (2003); Bassett and Chen (2001); Buchinsky (1994, 1998a, 2001); Dimelis and Louri (2002); Garcia, Hernandez, and Lopez-Nicholas (2001); Hartog, Pereira, and Jose (2001); Levin (2001); Martins and Pereira (2004); and Thorsen (1994). Buchinsky (1998b) and Koenker and Hallock (2001) present useful surveys. Each of these studies presents estimated equations with the general form $y_i = x_i \beta_q + u_{qi}$. The form of this equation implies that the coefficients differ by quantile. For example, Martins and Pereira (2004) find that returns to schooling are higher for more-skilled individuals. Their evidence for this conclusion comes from a regression of the natural logarithm of wages on a set of human capital characteristics, one of which is years of schooling. The coefficient for years of education is higher at higher quantiles.

Quantile effects have a straightforward missing variables interpretation that follows directly from the hedonic and repeat sales price index estimators. For example, the contribution of a sale at time t = 2 to the price index can be found by taking the derivative of equation (4) or (5) with respect to $D_{2,it}$.
The result, $\delta_2^* = \delta_2 + \lambda_2 z_i$, varies with the missing variable z. If $\lambda_2 > 0$, then higher values of z lead to higher values for $\delta_2^*$. But z is part of the error term. Thus, high values of the error term imply high values for $\delta_2^*$, which is a quantile effect.

The intuition behind the quantile effect is the same as the motivation typically offered for selection bias in the repeat sales estimator: the repeat sales sample is not representative of the rest of the housing market. For instance, the repeat sales sample may draw more heavily from neighborhoods with amenities that attract wealthy, mobile homebuyers, and the prices of these homes may appreciate more rapidly than homes in other neighborhoods. If the full set of neighborhood amenity variables were observed, there would be neither a quantile effect nor a sample selection issue.

Similarly, homes that have been remodeled can be represented by a missing variable that adds to the vector of housing characteristics beginning at the time the remodeling is completed. The remodeling variable produces a quantile effect because it is correlated with the time dummy variables. The case for the quantile effect is particularly strong for the remodeling example because only a minority of homes are remodeled over time. Remodeling shows up as an outlier in a standard regression model. Such outliers are drawn from the upper tails of the error distribution. The effects of this unobserved variable would not contaminate other points in the distribution. A median-based estimate (q = 0.5) will be far less vulnerable to the effects of omitted variables that affect only a portion of the sample.

4. A Monte Carlo Analysis

In this section, I report the results of a set of Monte Carlo experiments that illustrate the benefits of the quantile approach to estimating house price indexes.
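Because the experiments rest on the objective function in equation (6), a small numerical sketch of that function may be a useful preliminary. It fits a constant-only model by brute-force search; the function names and the data (the integers 1 to 99) are hypothetical, and the search is for illustration only, not the algorithm used by any statistical package.

```python
# Sketch of the objective function in equation (6) for a constant-only model.
# The data and function names are hypothetical.

def quantile_objective(residuals, q):
    """Weight positive residuals by 2q and all other residuals by 2(1 - q)."""
    total = 0.0
    for e in residuals:
        weight = 2 * q if e > 0 else 2 * (1 - q)
        total += weight * abs(e)
    return total


def best_constant(values, q):
    """Search the observed values for the constant that minimizes the objective."""
    return min(values, key=lambda c: quantile_objective([y - c for y in values], q))


values = list(range(1, 100))          # hypothetical data: 1, 2, ..., 99
fit = best_constant(values, q=0.9)
share_positive = sum(y > fit for y in values) / len(values)
print(fit, round(share_positive, 3))
```

With q = 0.9 the fitted constant sits near the top of the data, so only about a tenth of the residuals are positive. With q = 0.5 the same search returns the sample median.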
The basis for the experiments is a straightforward two-period version of equation (4):

$$y_i = 5 + x_i + 0.2 D_i + \lambda z_i D_i + u_i \qquad (7)$$

The time subscript is suppressed from equation (7) because it unnecessarily complicates the notation of the hedonic model, which is sufficient for illustrating the benefits of the quantile approach. I draw values of x from a unit normal distribution. I generate the time variable D by making draws from a U(0,1) distribution and setting D = 1 when the randomly drawn value is greater than 0.5. The “missing” variable, z, is drawn from a U(-0.5, 0.5) distribution. Finally, I draw values for the error term, u, from a normal distribution with a mean of zero and a variance that assures that the $R^2$ from a regression of y on x, D, and zD will be approximately 0.9 on average. I let the values of $\lambda$ vary from 0 to 1 while maintaining each of the other parameters at the values shown in equation (7). Thus, observations with higher values of z have higher appreciation rates on average. Each experiment has 1000 observations.

When a sale occurs during the base time period, D = 0. The log price of an identical home is higher by $0.2 + \lambda z$ when the sale takes place during the second period. If z represents quality, then the appreciation rate is higher for high-quality homes. This variable would be missing in a typical econometric study. If z is not observed, appreciation rates are higher for observations drawn from the upper tails of the error distribution. Thus, the Monte Carlo setup generates quantile effects, in which the implied marginal effect of D, $0.2 + \lambda z$, varies across the error distribution.

Table 1 reports means and standard deviations for quantile regression estimates of 1000 replications of each experiment. I estimate each regression at target quantiles of 0.25, 0.50, and 0.75. The explanatory variables for the regressions are simply x and D; zD is not included. The missing variable, zD, is not correlated with x, but its distribution depends on D: it is zero whenever D = 0.
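The design just described can be sketched in a few lines. This is a simplified illustration, not the paper's code: x is dropped so that a quantile regression of y on a constant and D reduces to a difference of within-group sample quantiles, the error standard deviation `sigma = 0.3` is an assumed value rather than one calibrated to an R² of 0.9, a single large sample replaces the 1000 replications, and the function names are hypothetical.

```python
import random

# Simplified sketch of the Monte Carlo design in equation (7), omitting x.
# sigma, n, and the seed are assumptions made for illustration.

def sample_quantile(values, q):
    """Return an order statistic near the q-th sample quantile."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def simulate_d_coefficients(lam, n=20000, sigma=0.3, seed=0):
    """Simulate y = 5 + 0.2 D + lam * z * D + u and estimate the D effect by quantile.

    With only a constant and a dummy as regressors, the quantile regression
    coefficient on D equals the difference between the two groups' quantiles.
    """
    rng = random.Random(seed)
    groups = {0: [], 1: []}
    for _ in range(n):
        d = 1 if rng.random() > 0.5 else 0
        z = rng.uniform(-0.5, 0.5)          # the "missing" variable
        u = rng.gauss(0.0, sigma)
        groups[d].append(5 + 0.2 * d + lam * z * d + u)
    return {q: sample_quantile(groups[1], q) - sample_quantile(groups[0], q)
            for q in (0.25, 0.50, 0.75)}


no_effect = simulate_d_coefficients(lam=0.0)
with_effect = simulate_d_coefficients(lam=1.0)
for q in (0.25, 0.50, 0.75):
    print(q, round(no_effect[q], 3), round(with_effect[q], 3))
```

Under these assumptions the estimated effect of D should sit near 0.2 at every quantile when `lam` is zero, and should rise from the 0.25 quantile to the 0.75 quantile when `lam` is one, while staying near 0.2 at the median.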
Thus, omitting zD does not bias the estimated coefficient for x but does lead to biased estimates for the D coefficient. The true coefficient for D rises with $\lambda$, and when $\lambda > 0$ it is higher at higher quantiles. Therefore, the question in the Monte Carlo analysis is whether the quantile approach indicates higher appreciation rates (i.e., higher coefficients for D) at higher quantiles. Given the structure of the Monte Carlo setup, the true coefficient is 1.0 for x at all values of $\lambda$. The true intercept is lower at lower regression quantiles because errors are negative on average at q = 0.25 and positive at q = 0.75. All calculations are performed using the QREG command in STATA.

The results are precisely as expected. The average estimated coefficient for x is close to 1.0 across the three target quantiles and across the five alternative values of $\lambda$. Since the error term, u, and the omitted variable, zD, both have means of zero, the estimated intercepts are approximately equal to their true value of 5.0 at the residual median (q = 0.5). The average intercepts are lower than 5.0 at q = 0.25 and are higher than 5.0 at q = 0.75. As expected, estimated appreciation rates (the coefficients for D) are approximately equal to the correct value of 0.20 when quantile effects are absent ($\lambda = 0$). Importantly, estimated appreciation rates are lower than 0.20 when $\lambda > 0$ and q = 0.25, and they are higher than 0.20 when $\lambda > 0$ and q = 0.75. The estimated appreciation rates average just under 0.20 at the median of the distribution of residuals.

The last three rows of the table show the percentage of rejections for the null hypothesis of equal coefficients for the 25% and 75% quantiles. The tests are based on 20 replications of a bootstrap algorithm. As quantile effects are absent for x, we should expect the null hypothesis to be rejected no more than 5% of the time (the nominal size of the test) for this variable.
Rejection rates are somewhat lower than 5% for this variable, and they do not vary systematically by $\lambda$. Since quantile effects always exist for the intercept, the tests always reject the null hypothesis of equal intercepts at the 25% and 75% quantiles. The most important finding is that the rejection rate for equal coefficients for D rises with $\lambda$. This result means that, as expected, the statistical test is more likely to indicate quantile effects as the magnitude of the missing variable ($\lambda D z$) increases.

To put these results in perspective, assume that z represents a trait such as remodeling or simply the change in quality between the two periods. Prices of homes with positive values for z increase over time, and prices fall when z is negative. If z is unobserved, standard estimates will typically be biased. If most homes have increased in quality, then standard estimates of the appreciation rate are biased upward. By allowing for differences in coefficients across target quantiles, the quantile estimator can detect differences in appreciation rates. In a conventional case of remodeling, most values of z equal zero while a small percentage are positive. Standard appreciation rate estimates will again be biased upward in this case. In contrast, a median-based estimator will provide accurate estimates, and the estimates at high target quantiles will detect the higher rates of appreciation associated with remodels.

5. Data and Model Specification

The data set for the empirical application of the quantile regression estimator was drawn from two sources, the Illinois Department of Revenue (IDOR) and the Cook County Assessor’s Office. IDOR conducts reviews of assessment practices for all counties in Illinois, including Cook County. Through a Freedom of Information Act request, IDOR provided data on all sales of single-family homes in the City of Chicago for 1983-1999 with the exception of 1992.
Important variables include the sales price, date of sale, and the parcel identification number (or “PIN”). The PIN allows me to merge the IDOR data with the 1997 Cook County file of assessments. The assessment file includes the address and standard housing characteristics.[1] However, the housing characteristics are available only for 1997, and there is no way of identifying changes in the characteristics over time. If most homes are not remodeled during this period, then a median-based estimator will provide accurate estimates of constant-quality appreciation rates.

[1] As described in McMillen (2004), the only address that is available is for the building owner rather than the actual property. The PIN identifies the location of the property down to the quarter section level (a quarter square mile). I used a GIS program to geocode the building owner addresses. The final sample includes only those homes with owners whose addresses are located in the same quarter section. Since a quarter section is ½ x ½ mile, the location of a home may be misidentified by as much as $\sqrt{2 \times 0.5^2} \approx 0.71$ miles, which would happen if the home and its owner were at opposite corners of the quarter section.

Table 2 provides descriptive statistics for sales prices and the housing characteristics. There are 129,251 sales during this period, and 32,959 pairs of repeat sales. In 1997, the average home had 1244 square feet of living area, was on a 4131 square foot lot, had 2.878 bedrooms, and was just under nine miles from the traditional Chicago city center (the intersection of State and Madison Streets). House age naturally varies over time; the average across all sales at the time of sale is 63.572 years. The modal house is built of brick, has a basement and attic, does not have central air conditioning, and has a garage. The average nominal sales price across the sample period is $107,591.
The range in sales prices is large: as low as $250 and as high as $4.2 million.[2] The repeat sales sample is nearly identical to the overall sample. Average sales price is higher because, by construction, the repeat sales sample is dominated by sales from later dates: whereas any home selling in 1983 is almost certainly a first sale, a 1999 observation may be either a first or second sale. Similarly, the average repeat sales home is about two years older than the average observation from the full sample because homes are older later in the time interval. All variables without a time dimension (building area, lot size, number of bedrooms, distance from the city center, and the dummy variables for brick construction, lack of a basement, an attic, central air conditioning, and a garage) are nearly identical on average across the two samples.

With data covering 16 years and sales dates identified by the month of sale, the basic hedonic and repeat sales specification could include as many as 192 time dummy variables. Price indexes estimated with monthly dummy variables are sensitive to extreme values from months with few sales and have misleadingly sharp discontinuities. Quantile regression is slow and cumbersome with thousands of observations and more than 200 explanatory variables. Although aggregating up to the quarter or year of sale reduces the estimation burden, it still produces an index with unrealistic discontinuities over time.

[2] Although the IDOR attempts to screen the data for non-arm’s length sales, the small number of sales with extremely low prices should be viewed with skepticism. There is no obvious cutoff point for discarding these observations. An advantage of the quantile regression approach is that it is not sensitive to these outlier observations.
McMillen and Dombrow (2001) propose a simple procedure for estimating either a hedonic or a repeat sales price index that produces a smooth, continuous function with a small number of coefficients to be estimated. The basis for the estimator is the Fourier approach of Gallant (1981, 1982). The general form for the effect of the time of sale variables in equation (3) is simply g(T), where T represents the month of sale. Since sales dates range from January 1983 to December 1999, the range for T is 1 to 204. The Fourier approach begins by transforming the time variable to lie between 0 and 2π: $z_i \equiv 2\pi T_i / 204$. The Fourier expansion is

$$g(T_i) = \alpha_0 + \alpha_1 z_i + \alpha_2 z_i^2 + \sum_q \left( \lambda_q \sin(q z_i) + \gamma_q \cos(q z_i) \right)$$

A small number of sine and cosine terms turn out to be sufficient to model price indexes. In the empirical section of the paper, I set the maximum order of the expansion at two. Thus, the Fourier expansion is simply

$$g(T_i) = \alpha_0 + \alpha_1 z_i + \alpha_2 z_i^2 + \lambda_1 \sin(z_i) + \gamma_1 \cos(z_i) + \lambda_2 \sin(2 z_i) + \gamma_2 \cos(2 z_i)$$

In the general form of the repeat sales model, we have $y_{it} - y_{is} = g(T_i) - g(T_s) + u_{it} - u_{is}$, where $T_s$ represents the earlier date of sale. The Fourier expansion is particularly useful for the repeat sales estimator because it uses a parametric function to approximate g(T), which makes it possible to impose that $g(T_i)$ and $g(T_s)$ are simply two values of the same function. McMillen and Dombrow (2001) show that the Fourier version of the repeat sales estimator is

$$\Delta y_i = \alpha_1 \Delta z_i + \alpha_2 \Delta z_i^2 + \lambda_1 \Delta \sin(z_i) + \gamma_1 \Delta \cos(z_i) + \lambda_2 \Delta \sin(2 z_i) + \gamma_2 \Delta \cos(2 z_i) \qquad (8)$$

where I again impose that the maximum order of the expansion is two, and $\Delta$ indicates the change between sales dates (e.g., $\Delta y_i = y_{it} - y_{is}$). This approach has been used to estimate price indexes by McMillen (2003) and Ihlanfeldt (2004).

Note that equation (8) does not include an intercept. A positive value for the intercept would imply an increase in price even within a single time period.
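The absence of an intercept can be seen directly in the regressors of equation (8): every differenced term is zero when the two sale dates coincide. The following is a minimal sketch of how those regressors might be built; the function names and the example sale dates are hypothetical, and the month numbering assumes January 1983 is T = 1.

```python
import math

N_MONTHS = 204  # January 1983 through December 1999


def fourier_terms(month, max_order=2):
    """Return z, z^2, then sin(qz), cos(qz) for q = 1..max_order, with z = 2*pi*T/204."""
    z = 2 * math.pi * month / N_MONTHS
    terms = [z, z ** 2]
    for q in range(1, max_order + 1):
        terms.extend([math.sin(q * z), math.cos(q * z)])
    return terms


def repeat_sales_terms(first_month, second_month):
    """Differenced regressors: terms at the later sale minus terms at the earlier sale."""
    later = fourier_terms(second_month)
    earlier = fourier_terms(first_month)
    return [a - b for a, b in zip(later, earlier)]


# Hypothetical pair of sales: March 1985 (T = 27) and June 1994 (T = 138).
print([round(v, 3) for v in repeat_sales_terms(27, 138)])

# Two sales in the same month produce all-zero regressors.
print(repeat_sales_terms(50, 50))
```

Only six regressors are needed regardless of the number of months, which is the practical appeal of the approach relative to a full set of monthly dummy variables.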
Although a within-period price increase is possible, particularly for a time interval as long as a year, most authors impose that the price index equals zero during the base period.[3] Even if the true intercept is zero in the middle of the error distribution, the intercept will be negative at lower quantiles and positive at higher points in the error distribution. The question is how to normalize the implied path of time coefficients so that we can compare rates of price appreciation across quantiles.[4]

[3] Goetzmann and Siegel (1995) suggest including an intercept because properties are often upgraded around the time of a transaction, and these upgrades are seldom observed in standard data sets.

[4] Normalization is not an issue with the standard hedonic estimator because it always has an intercept. The hedonic estimates directly compare prices in one time to prices in a base year, which is the date whose dummy variable is omitted from the regression.

In a standard repeat sales model, we can impose that the intercept equals zero in two ways. The obvious one, omitting the constant term from the regression, is not an option in quantile regression because intercepts cannot equal zero across all quantiles. The second alternative is to estimate the regression with an intercept, and then solve for the restricted least squares estimates that are implied by a zero intercept. Let X be the matrix of explanatory variables for the unrestricted regression, and let R be a vector with a one in the position corresponding to the intercept in X and zeros elsewhere. The formula for the restricted coefficients is

$$\hat{\beta}_r = \hat{\beta} - (X'X)^{-1} R \left[ R'(X'X)^{-1} R \right]^{-1} R' \hat{\beta}$$

Let $a_{ij}$ represent the entry in row i and column j of $(X'X)^{-1}$, and assume that a vector of ones is the first column of X. Then the formula for $\hat{\beta}_{ri}$, the coefficient in row i of $\hat{\beta}_r$, is $\hat{\beta}_{ri} = \hat{\beta}_i - \hat{\beta}_1 a_{1i} / a_{11}$. Calculating the restricted price index by imposing a zero intercept is not equivalent to obtaining an unrestricted estimate with a non-zero value in the base period and subtracting the intercept from all dates, which would be a parallel shift in the price index. The formula for restricted coefficients rotates the price index so that the restricted intercept is zero.

This transformation also is a logical basis for the quantile repeat sales estimates. The quantile estimator is not a simple regression, and does not have a direct counterpart to the $(X'X)^{-1}$ matrix. However, the transformation $\hat{\beta}_i - \hat{\beta}_1 a_{1i} / a_{11}$ takes any set of coefficients and produces a zero intercept (i = 1) while rotating the price index. This transformation rotates the price index equally at each regression quantile, which makes it possible to directly compare the price indexes across quantiles.

6. Estimated Price Indexes

Estimated standard and quantile hedonic regression results are shown in Table 3. The standard estimates show that each additional 10% of building area increases sales prices by 6.95%, and the elasticity of sales price with respect to lot size is 0.253. Sales prices decline with age at an implied depreciation rate of 0.5% per year. Controlling for square footage, an additional bedroom reduces sales prices by 1.1%. A home with frame construction, a basement, an attic, central air conditioning, and a garage sells for more than a home without these characteristics. Each additional mile from the city center lowers sales prices by 3.4%.

The last column of the table shows the differences in the estimated coefficients for the 25% and 75% quantiles. Nearly all of the housing characteristics exhibit quantile effects. Additional square footage adds more to sales prices at higher quantiles, while larger lot sizes have larger effects at lower points in the error distribution. The effect of age is much more pronounced at lower quantiles.
Garages add more to sales prices at lower points in the error distribution. The distance from the city center gradient is larger at higher quantiles. Although the coefficients for the Fourier expansion terms are difficult to interpret directly, the significant differences across quantiles imply that there are differences in the implied price indexes.

The differences in coefficients for age support the interpretation of quantile effects as a means of controlling for unobserved quality changes. Although the coefficient for age is significantly negative at each quantile, estimated depreciation rates are much higher at lower quantiles. The positive errors for observations in the upper quantiles may arise because of upgrades. Age becomes far less important a determinant of house prices when a home is remodeled. In contrast, the negative errors for observations at the lower quantiles may come about at least in part because the homes have not been well maintained, and poor maintenance implies a high depreciation rate.

The regression results are not particularly informative for the repeat sales estimator because the only explanatory variables are sine and cosine terms used to construct the implied price index. The results can be observed directly in Figure 1, which also shows the implied price index from the base hedonic regression. The two price indexes are nearly identical. The end points in December 1999 both imply an approximate doubling of prices since January 1983: the final values are 1.018 for the hedonic index and 1.066 for the repeat sales price index.

Figure 2 shows the hedonic price indexes by quantile. The indexes show a nearly uniform progression in the rate of appreciation over time: low quantiles have low rates of appreciation and high quantiles have high rates. The end values for the 10%, 25%, 50%, 75%, and 90% quantiles are 0.834, 0.875, 0.942, 1.010, and 1.072.
The paths differ somewhat over time, particularly during the late 1980s when the Chicago housing market first began to rebound from years of slow growth. Prices at the higher quantiles rose first. The prices of homes at the 10% and 25% quantiles never enjoyed the rapid increase in the late 1980s, and did not make up the lost ground in the 1990s.

Figure 3 shows the repeat sales quantile price indexes. The estimated price paths are somewhat different from the hedonic indexes. Again, the prices at the lowest quantile appreciate most slowly; the final value for the 10% quantile is 0.908. However, the price paths for the 25% and 90% quantiles form a pair of similar indexes while the 50% and 75% indexes form another pair. The endpoints at the 25%, 50%, 75%, and 90% quantiles are 1.035, 1.132, 1.158, and 1.080. Quantile effects do not change monotonically when progressing from the lowest to highest quantiles. In this case, prices rose most rapidly for homes nearer the middle of the distribution.

Another way to look at the quantile effects is to compare the base indexes with the quantile indexes with which they are most highly correlated. The base hedonic index is most highly correlated with the price index for the 50% quantile, while the base and 25%-quantile repeat sales indexes are most similar.[5] These price indexes are shown in Figure 4. The indexes are very similar. Figure 4 illustrates two important points. First, the apparent sample selection problems associated with the repeat sales estimator are exaggerated. Repeat sales and hedonic price index estimates are often quite similar.[6] The second point is that quantile estimates for targets from the middle of the distribution will generally be similar to standard price index estimates. However, the median-based quantile estimate is not necessarily the one that is closest to the standard estimator. The fact that the 25%-quantile estimate is most similar to the base repeat sales estimate suggests that observations with unexpectedly low appreciation rates greatly influence the standard estimator in this application.

[5] The correlations between the base hedonic and 10%, 25%, 50%, 75%, and 90% hedonic quantile indexes are 0.99707, 0.99790, 0.99890, 0.99790, and 0.99700. Counterpart correlations for the repeat sales indexes are 0.99974, 0.99994, 0.99986, 0.99989, and 0.99956.

[6] The corollary is that the advantages of the hedonic approach are exaggerated. Hedonic price functions suffer from missing variables that are likely to be correlated with the time of sale explanatory variables.

In comparing the hedonic and repeat sales quantile results, it is important to bear in mind that the form of the error term is different. Whereas the hedonic approach targets points in the underlying error distribution, the repeat sales model targets points in the distribution of the changes in errors over time. In my interpretation of the error term as an unobserved variable representing housing quality, the hedonic approach targets levels of quality and the repeat sales approach targets changes in quality. However, quality is far from the only unobserved housing characteristic. More missing explanatory variables can generate a rich error process. Thus, it is not surprising that the quantile effects differ for the hedonic and repeat sales models; nor is it surprising that the effects are not always uniform across quantiles.

7. Conclusion

The quantile approach has several advantages over conventional approaches to estimating house price indexes. Targeting quantiles from the middle of the error distribution reduces the effects of outliers. The problem of outliers is particularly important for the repeat sales estimator, which is vulnerable to an upward bias when the sample includes remodeled houses and there is no way to identify which homes have been upgraded.
In this situation, a more realistic view of the housing market may be gained by constructing indexes using lower quantiles as the target point. The quantile approach can also provide a richer view of the overall housing market by revealing patterns that vary across quantiles.

Data for Chicago from 1983-1999 illustrate some of the advantages of the quantile approach. Hedonic estimates reveal significant differences in coefficients across quantiles. Depreciation rates are lower at higher quantiles, suggesting that these quantiles include homes that have been remodeled. Square footage adds more to sales prices at upper quantiles, while lot size has a greater effect on prices at lower quantiles. Hedonic price index estimates reveal a uniform pattern across quantiles: over the full 1983-1999 period, appreciation rates are higher for homes in higher quantiles. The pattern is not uniform for the repeat sales indexes. Although there is again a tendency toward higher appreciation rates at higher quantiles, the appreciation rate is lower for the 90% quantile than for the 50% or 75% quantile.

References

Albrecht, James, Anders Bjorklund, and Susan Vroman, “Is There a Glass Ceiling in Sweden?,” Journal of Labor Economics 21 (2003), 145-177.
Bailey, M.J., R.F. Muth, and H.O. Nourse, “A Regression Method for Real Estate Price Index Construction,” Journal of the American Statistical Association 58 (1963), 933-942.
Bassett, Gilbert W., Jr., and Hsiu-Lang Chen, “Portfolio Style: Return-Based Attribution using Quantile Regression,” Empirical Economics 26 (2001), 293-305.
Buchinsky, Moshe, “Changes in the U.S. Wage Structure 1963-1987: Application of Quantile Regression,” Econometrica 62 (1994), 405-458.
Buchinsky, Moshe, “The Dynamics of Changes in the Female Wage Distribution in the USA: A Quantile Regression Approach,” Journal of Applied Econometrics 13 (1998a), 1-30.
Buchinsky, Moshe, “Recent Advances in Quantile Regression Models: A Practical Guideline for Empirical Research,” Journal of Human Resources 33 (1998b), 88-126.
Buchinsky, Moshe, “Quantile Regression with Sample Selection: Estimating Women’s Return to Education in the U.S.,” Empirical Economics 26 (2001), 87-113.
Case, Bradford and John M. Quigley, “The Dynamics of Real Estate Prices,” Review of Economics and Statistics 73 (1991), 50-58.
Case, Karl E. and Robert J. Shiller, “Prices of Single-Family Homes since 1970: New Indexes for Four Cities,” New England Economic Review (1987), 45-56.
Case, Karl E. and Robert J. Shiller, “The Efficiency of the Market for Single-Family Homes,” American Economic Review 79 (1989), 125-137.
Dimelis, Sophia and Helen Louri, “Foreign Ownership and Production Efficiency: A Quantile Regression Analysis,” Oxford Economic Papers 54 (2002), 449-469.
Follain, James R. and Charles A. Calhoun, “Constructing Indices of the Price of Multifamily Properties using the 1991 Residential Finance Survey,” Journal of Real Estate Finance and Economics 14 (1997), 235-255.
Gallant, A. Ronald, “On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form,” Journal of Econometrics 15 (1981), 211-245.
Gallant, A. Ronald, “Unbiased Determination of Production Technologies,” Journal of Econometrics 20 (1982), 285-323.
Garcia, Jaume, Pedro J. Hernandez, and Angel Lopez-Nicolas, “How Wide is the Gap? An Investigation of Gender Wage Differences Using Quantile Regression,” Empirical Economics 26 (2001), 149-167.
Goetzmann, William N. and Matthew Spiegel, “Non-Temporal Components of Residential Real Estate Appreciation,” Review of Economics and Statistics 77 (1995), 199-206.
Hartog, Joop, Pedro T. Pereira, and A. C. Jose, “Changing Returns to Education in Portugal during the 1980s and Early 1990s: OLS and Quantile Regression Estimators,” Applied Economics 33 (2001), 1027-1037.
Ihlanfeldt, Keith R., “The Use of an Econometric Model for Estimating Aggregate Levels of Property Tax Assessment within Local Jurisdictions,” National Tax Journal 57 (2004), 7-23.
Kiel, Katherine A. and Jeffrey E. Zabel, “Evaluating the Usefulness of the American Housing Survey for Creating Housing Price Indices,” Journal of Real Estate Finance and Economics 14 (1997), 189-202.
Koenker, Roger and Gilbert W. Bassett, Jr., “Regression Quantiles,” Econometrica 46 (1978), 33-50.
Koenker, Roger and Kevin F. Hallock, “Quantile Regression,” Journal of Economic Perspectives 15 (2001), 143-156.
Levin, Jesse, “For Whom the Reductions Count: A Quantile Regression Analysis of Class Size and Peer Effects on Scholastic Achievement,” Empirical Economics 26 (2001), 221-246.
Mark, J.H. and M.A. Goldberg, “Alternative Housing Price Indices: An Evaluation,” AREUEA Journal 12 (1984), 30-49.
Martins, Pedro S. and Pedro T. Pereira, “Does Education Reduce Wage Inequality? Quantile Regression Evidence from 16 Countries,” Labour Economics 11 (2004), 355-371.
McMillen, Daniel P., “The Return of Centralization to Chicago: Using Repeat Sales to Identify Changes in House Price Distance Gradients,” Regional Science and Urban Economics 33 (2003), 287-304.
McMillen, Daniel P., “Airport Expansions and Property Values: The Case of Chicago O’Hare Airport,” Journal of Urban Economics 55 (2004), 627-640.
McMillen, Daniel P. and Jonathan Dombrow, “A Flexible Fourier Approach to Repeat Sales Price Indexes,” Real Estate Economics 29 (2001), 207-225.
Palmquist, R.B., “Alternative Techniques for Developing Real Estate Price Indexes,” Review of Economics and Statistics 66 (1980), 394-404.
Thibodeau, Thomas G., “Housing Price Indexes from the 1973-83 SMSA Annual Housing Survey,” AREUEA Journal 17 (1989), 110-117.
Thorsen, James A., “The Use of Least Median of Squares in the Estimation of Land Value Equations,” Journal of Real Estate Finance and Economics 8 (1994), 183-190.
Table 1
Monte Carlo Results

Variable, Percentile                                              λ = 0           λ = .25         λ = .50         λ = .75         λ = 1
x, 25%                                                            1.001 (0.015)   1.003 (0.015)   1.006 (0.016)   1.009 (0.015)   1.011 (0.016)
x, 50%                                                            1.000 (0.014)   1.003 (0.014)   1.006 (0.014)   1.009 (0.014)   1.012 (0.015)
x, 75%                                                            1.000 (0.015)   1.003 (0.014)   1.007 (0.016)   1.010 (0.016)   1.013 (0.017)
D, 25%                                                            0.199 (0.029)   0.194 (0.030)   0.180 (0.030)   0.153 (0.030)   0.119 (0.031)
D, 50%                                                            0.199 (0.027)   0.199 (0.026)   0.199 (0.027)   0.198 (0.028)   0.198 (0.030)
D, 75%                                                            0.199 (0.028)   0.204 (0.028)   0.220 (0.030)   0.244 (0.030)   0.278 (0.032)
Intercept, 25%                                                    4.772 (0.020)   4.771 (0.022)   4.770 (0.021)   4.768 (0.021)   4.765 (0.021)
Intercept, 50%                                                    5.000 (0.019)   5.000 (0.019)   5.000 (0.018)   5.001 (0.019)   5.001 (0.020)
Intercept, 75%                                                    5.228 (0.020)   5.229 (0.021)   5.231 (0.021)   5.233 (0.021)   5.236 (0.021)
Rejections of Equal Coefficients for x at 25% and 75%             3.0%            2.3%            3.9%            2.7%            3.5%
Rejections of Equal Coefficients for D at 25% and 75%             3.1%            5.1%            19.4%           65.7%           96.3%
Rejections of Equal Coefficients for Intercepts at 25% and 75%    100%            100%            100%            100%            100%

Note. Means and standard deviations (in parentheses) are reported for 1000 simulations. The base model is y = 5 + x + .2D + λzD + u, where z ~ U(-.5,.5).
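The mechanism behind the D rows of Table 1 can be illustrated with a small simulation. The sketch below is not the authors' Monte Carlo design. It assumes a normal error u with standard deviation 0.34 (chosen only so that the quartiles of u sit near the reported intercepts), a hypothetical sample size, and it removes x by subtracting its true contribution, so that quantile regression on an intercept and the dummy D reduces to differences in group quantiles. The function name and all parameter defaults are illustrative.

```python
import random
import statistics


def simulate_dummy_coefs(lam, n=20000, sigma_u=0.34, seed=0):
    """Quantile 'coefficients' on D at the 25%, 50%, and 75% quantiles.

    Simplified version of y = 5 + x + .2D + lam*z*D + u with the x term
    subtracted out. With only an intercept and a dummy, the quantile
    regression coefficient on D equals the difference between the group
    quantiles. sigma_u, n, and the normal error are assumptions.
    """
    rng = random.Random(seed)
    # D = 0 group: y - x = 5 + u
    base = [5.0 + rng.gauss(0.0, sigma_u) for _ in range(n)]
    # D = 1 group: y - x = 5 + 0.2 + lam * z + u, with z ~ U(-0.5, 0.5)
    treated = [
        5.2 + lam * rng.uniform(-0.5, 0.5) + rng.gauss(0.0, sigma_u)
        for _ in range(n)
    ]
    q_base = statistics.quantiles(base, n=4)      # [Q25, Q50, Q75]
    q_treat = statistics.quantiles(treated, n=4)  # [Q25, Q50, Q75]
    taus = (0.25, 0.50, 0.75)
    return {tau: q1 - q0 for tau, q0, q1 in zip(taus, q_base, q_treat)}


if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0):
        coefs = simulate_dummy_coefs(lam)
        print(lam, {tau: round(b, 3) for tau, b in coefs.items()})
```

Under these assumptions the estimated effect of D stays near 0.2 at the median for every λ, while the 25% and 75% estimates move apart as λ grows, which is the qualitative pattern shown in the D rows of Table 1.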
Table 2
Descriptive Statistics

Variable                         Mean          Standard Deviation   Minimum    Maximum

Hedonic Sample (n = 129,251)
Price                            107591.300    83409.050            250        4200000
Building Area (square feet)      1244.148      458.901              297        11512
Lot Size (square feet)           4131.254      3136.628             247        703500
Age                              63.572        24.945               1          130
Number of Bedrooms               2.878         0.800                1          9
Brick                            0.623         0.485                0          1
No Basement                      0.217         0.413                0          1
Attic                            0.464         0.499                0          1
Central Air Conditioning         0.200         0.400                0          1
1 Car Garage                     0.312         0.463                0          1
2+ Car Garage                    0.456         0.498                0          1
Distance from City Center        8.995         2.722                0.877      16.808

Repeat Sales Sample (n = 32,959)
Price                            134073.000    98128.150            500        2270000
Building Area (square feet)      1246.750      454.007              400        7269
Lot Size (square feet)           4059.298      4103.080             416        703500
Age                              65.274        24.736               1          129
Number of Bedrooms               2.871         0.809                1          9
Brick                            0.611         0.488                0          1
No Basement                      0.216         0.411                0          1
Attic                            0.491         0.500                0          1
Central Air Conditioning         0.201         0.401                0          1
1 Car Garage                     0.313         0.464                0          1
2+ Car Garage                    0.458         0.498                0          1
Distance from City Center        8.698         2.661                1.211      16.722

Table 3
Hedonic Regression Results

Variable                      OLS               10%               25%               50%               75%               90%               β75 - β25
z = 2πT/204                   0.029 (1.323)     0.046 (1.356)     0.032 (1.382)     0.028 (1.266)     -0.045 (2.039)    -0.072 (2.688)    -0.076 (2.804)
z^2                           0.021 (6.276)     0.014 (2.570)     0.017 (4.762)     0.019 (5.554)     0.033 (9.566)     0.039 (9.238)     0.016 (3.566)
sin(z)                        0.009 (2.477)     0.000 (0.055)     -0.014 (3.792)    -0.017 (4.830)    0.002 (0.470)     0.033 (7.697)     0.016 (4.646)
cos(z)                        -0.134 (9.923)    -0.073 (3.426)    -0.100 (6.987)    -0.135 (9.724)    -0.203 (14.915)   -0.221 (13.191)   -0.103 (6.094)
sin(2z)                       -0.001 (0.529)    0.010 (2.525)     0.009 (3.710)     -0.002 (0.783)    -0.014 (5.634)    -0.018 (6.028)    -0.023 (6.962)
cos(2z)                       0.001 (0.140)     -0.001 (0.182)    -0.000 (0.021)    0.005 (1.363)     -0.000 (0.122)    -0.013 (2.715)    -0.000 (0.064)
Natural Log of Building Area  0.695 (117.043)   0.492 (51.331)    0.599 (97.313)    0.718 (117.398)   0.756 (117.731)   0.798 (91.462)    0.157 (19.551)
Natural Log of Lot Size       0.253 (54.659)    0.370 (48.939)    0.327 (67.848)    0.266 (55.709)    0.196 (39.123)    0.153 (22.607)    -0.131 (21.949)
Age                           -0.005 (74.031)   -0.009 (66.471)   -0.007 (92.496)   -0.005 (74.675)   -0.003 (44.014)   -0.001 (12.955)   0.004 (43.187)
Number of Bedrooms            -0.011 (4.938)    -0.018 (4.784)    -0.017 (6.870)    -0.012 (5.075)    -0.011 (4.719)    -0.003 (1.021)    0.005 (1.837)
Brick                         -0.015 (4.333)    0.008 (1.408)     -0.049 (13.415)   -0.059 (16.817)   -0.028 (8.048)    -0.016 (3.577)    0.021 (3.606)
No Basement                   -0.043 (11.049)   -0.040 (5.804)    -0.056 (13.034)   -0.063 (15.605)   -0.046 (11.524)   -0.060 (11.635)   0.010 (2.161)
Attic                         0.017 (5.889)     0.058 (12.282)    0.034 (10.945)    0.014 (4.698)     -0.012 (4.172)    -0.045 (12.050)   -0.047 (10.862)
Central Air Conditioning      0.172 (48.601)    0.126 (22.941)    0.147 (39.426)    0.148 (40.735)    0.152 (42.581)    0.164 (37.321)    0.005 (1.497)
1 Car Garage                  0.100 (27.783)    0.181 (32.098)    0.102 (26.771)    0.082 (22.099)    0.082 (22.452)    0.054 (12.069)    -0.019 (5.156)
2+ Car Garage                 0.094 (27.106)    0.170 (31.126)    0.096 (26.157)    0.069 (19.166)    0.060 (17.131)    0.037 (8.651)     -0.035 (9.788)
Distance from City Center     -0.034 (57.170)   -0.022 (19.146)   -0.030 (43.125)   -0.034 (55.181)   -0.035 (64.484)   -0.041 (59.882)   -0.005 (6.371)
Intercept                     4.552 (84.540)    4.643 (53.794)    4.576 (81.981)    4.412 (79.499)    4.886 (84.569)    5.105 (66.021)    0.310 (7.269)
R^2 or Pseudo-R^2             0.446             0.235             0.271             0.302             0.321             0.368

Note. The regressions have 129,521 observations. Absolute z-values are in parentheses. Twenty bootstrap replications are used to estimate standard errors for the quantile regression estimates.
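Because the Fourier coefficients in Table 3 are hard to read directly, it may help to see how they map into an implied price index. The sketch below evaluates the six time terms at a given month, using the rounded coefficients from Table 3 and z = 2πT/204 from the table header. Two conventions are assumptions rather than something stated in this section: T is taken to run from 0 to 204, and the index is normalized to zero at T = 0. The names `FOURIER_COEFS`, `time_terms`, and `log_index` are illustrative. Since the dependent variable is the natural logarithm of price, the result is in log units.

```python
import math

# Coefficients on z, z^2, sin(z), cos(z), sin(2z), cos(2z), copied from Table 3.
FOURIER_COEFS = {
    "OLS": (0.029, 0.021, 0.009, -0.134, -0.001, 0.001),
    "10%": (0.046, 0.014, 0.000, -0.073, 0.010, -0.001),
    "25%": (0.032, 0.017, -0.014, -0.100, 0.009, -0.000),
    "50%": (0.028, 0.019, -0.017, -0.135, -0.002, 0.005),
    "75%": (-0.045, 0.033, 0.002, -0.203, -0.014, -0.000),
    "90%": (-0.072, 0.039, 0.033, -0.221, -0.018, -0.013),
}
N_MONTHS = 204  # denominator in z = 2*pi*T/204


def time_terms(month):
    """Return (z, z^2, sin z, cos z, sin 2z, cos 2z) for a given month T."""
    z = 2.0 * math.pi * month / N_MONTHS
    return (z, z * z, math.sin(z), math.cos(z), math.sin(2 * z), math.cos(2 * z))


def log_index(coefs, month):
    """Fitted time effect at `month` minus the fitted effect at month 0.

    Normalizing to zero at month 0 is an assumed convention.
    """
    now = time_terms(month)
    start = time_terms(0)
    return sum(b * (t - t0) for b, t, t0 in zip(coefs, now, start))


if __name__ == "__main__":
    for label, coefs in FOURIER_COEFS.items():
        print(label, round(log_index(coefs, N_MONTHS), 3))
```

Under these assumptions the end-of-sample values rise monotonically from the 10% to the 90% quantile and land within about 0.02 of the end values reported in the text. The remaining gaps are what one would expect from coefficients rounded to three decimals and from any difference in the timing convention for T.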
Figure 1
Hedonic and Repeat Sales Price Indexes
[Line chart, 1983-1999. Series: Hedonic, Repeat Sales.]

Figure 2
Hedonic Quantile Indexes
[Line chart, 1983-1999. Series: 10%, 25%, 50%, 75%, and 90% quantiles.]

Figure 3
Repeat Sales Quantile Indexes
[Line chart, 1983-1999. Series: 10%, 25%, 50%, 75%, and 90% quantiles.]

Figure 4
Comparison of Hedonic and Repeat Sales Base and Quantile Indexes
[Line chart, 1983-1999. Series: Base Hedonic, 50% Hedonic, Base Repeat, 25% Repeat.]