Figure 1 - American Real Estate and Urban Economics Association

advertisement
Quantile House Price Indexes
N. Edward Coulson
Department of Economics
Pennsylvania State University
University Park, PA 16802-3306
fyj@psu.edu
Phone: 814-863-0625
Fax: 814-863-4775
Daniel P. McMillen
Department of Economics (MC 144)
University of Illinois at Chicago
601 S. Morgan St.
Chicago, IL 60607
mcmillen@uic.edu
Phone: 312-413-2100
Fax: 312-996-3344
March 28, 2005
Abstract
Unobserved remodeling and missing “quality” variables, which are endemic to existing
hedonic housing data sets, tend to produce an upward bias to existing data sets. To
reduce the effect of missing variables that tend to occur at certain points in the error
distribution, we propose the use of quantile regression procedures to estimate house price
indexes. We find evidence of significant quantile effects in a sample of home sales
drawn from Chicago for 1983-2001. Prices drawn from the upper tail of the error
distribution increased most rapidly during this period. A time series analysis of the
estimated price indexes suggests that quantile effects are similar across for five regions in
Chicago. Price changes in high-quantile houses lead to price changes in low-quantile
houses.
1
1. Introduction
The purpose of a house price index is to track the rate of price appreciation over
time for a standard or representative house. Using sample averages to construct the index
is inappropriate because the distribution of house prices is frequently skewed toward
lower-priced homes. A small number of sales of high-priced homes can significantly
affect a house price index. Non-academic estimates of price indexes, such as those
reported by the National Association of Realtors or by local newspapers, frequently use
the sample median as the basis for constructing an index. Although the median is an
improvement over the mean, it does not control for house characteristics. If large, new
houses dominate sales during later periods, both the mean and the median may imply an
artificially high rate of price appreciation.
Academic researchers have most often used one of two methods for constructing
quality-controlled price indexes. The first method is a straightforward hedonic price
function, in which the natural logarithm of sales price is regressed on a vector of house
characteristics and variables indicating the time of sale. The coefficients for the time of
sale variables produce the house price index. As missing variables may bias the hedonic
price function estimates, a repeat sales estimator is often used instead. A repeat sales
price index is estimated by regressing the percentage change in sales prices on a vector of
discrete variables representing the time of sale. By focusing on price changes rather than
levels, the repeat sales estimator avoids missing variable bias associated with house
characteristics that remain unchanged over time. However, it may be subject to more
severe sample selection bias than the hedonic approach because the relatively small
sample of properties that sell at least twice may not be representative of the overall
2
housing market. The repeat sales model may also be more prone to bias associated with
missing information on various home improvements taking place since a home’s first
sale.
As regression-based models, the hedonic and repeat sales approaches are meanbased procedures. As such, they are sensitive to outliers, and they invoke the assumption
that all estimated coefficients – including the critical time of sale variables – do not
depend on whether a home sale is drawn from the tails or the middle of the house price
error distribution. However, the rate of appreciation may, in fact, depend on the home’s
position in the error distribution. For example, homes may appreciate especially rapidly
if they have recently been remodeled. These observations may appear as outliers since
recently remodeled homes are likely to comprise a small portion of the overall sample.
Alternatively, appreciation rates may be especially high for unusually high-quality homes
or those drawn from premium locations. Variables representing remodeling, high quality,
and premium locations are likely to be unobserved, relegating their effects to the error
term. In situations such as these, a median-based estimator may imply lower rates of
appreciation that a standard, mean-based regression procedure.
In this paper, I propose the use of a quantile regression procedure to estimate
house price indexes. Using a Monte Carlo procedure, I show that a quantile approach is
less sensitive to missing variables than standard estimators and accurately identifies
appreciation rates that vary across the error distribution. I illustrate the practical benefits
of the quantile approach using data on homes sales in Chicago for 1983-1999. I find
evidence of significant quantile effects. Prices drawn from the upper tail of the error
3
distribution increased most rapidly during this period. Observations from the upper tail
of the distribution are likely to be recently remodeled or unusually high-quality homes.
The quantile approach may help to sidestep one of the vexing problems
encountered when estimating price indexes.
Unobserved remodeling and missing
“quality” variables are endemic to existing data sets. These problems tend to produce an
upward bias to estimated home price appreciation rates.
A median-based quantile
procedure is less vulnerable to this upward bias. Further, varying the target quantile leads
to a richer characterization of the dynamics of appreciation rates across the full
distribution of home prices.
2. Price Indexes
Academic researchers use two primary approaches for estimating house price
indexes. The first is the hedonic approach as typified by the following equation:
yit     xi   2 D2,it  ...   T DT ,it  uit
(1)
In equation (1), yit is the natural logarithm of the price of home i at time t, xi is a vector of
housing characteristics such as square footage and the number of bedrooms, and uit is an
error term. Sales dates range from 1 to T. The dummy variables D2,it … DT,it indicate
that the home sold during the period represented by the first subscript. Among many
possible sources of bias, missing variables are probably the most important.
The
estimated price index will be biased if the missing variables are correlated with the time
dummy variables. For example, suppose that the missing variable is a measure of house
quality. If homes selling at later dates tend to be of higher quality than those from early
sales, the δ’s from later periods will be biased upward and will overstate the rate of price
4
appreciation. Examples of the hedonic price index approach include Kiel and Zabel
(1997), Mark and Goldberg (1984), Palmquist (1980), and Thibodeau (1989).
The second common approach is the repeat sales method, which was originally
proposed by Bailey, Muth, and Nourse (1963). Examples include Case and Quigley
(1991), Case and Shiller (1987, 1989), Follain and Calhoun (1997), and Kiel and Zabel
(1997). For the subset of homes in the sample that sold at least twice, we can calculate
the difference in sales prices between time s and t, where s<t. The estimating equation
for the standard repeat sales estimator is
yit  yis   2 D2,it  D2,is      T DT ,it  DT ,is   uit  uis
(2)
The vector of housing characteristics, xi, does not appear in this equation because we
have assumed that the characteristics and the coefficient vector β do not change over
time. If these assumptions are correct, the repeat sales estimator provides unbiased
estimates of the price index without requiring data on all relevant housing characteristics.
Thus, a missing variable such as house quality will not bias the estimates unless it
changes over time or if its coefficient changes.
The following specification accounts for missing variables and time-varying
coefficients by adding a new variable, z, with values that change over time:
yit     xi   2 D2,it  ...   T DT ,it  zit  uit
(3)
It is irrelevant whether the source of the variation in the new term is a time-varying
coefficient or changes in the variable itself (as would be the case with remodeling): we
can simply rewrite the model by writing zit as the product of the appropriate time dummy
variable and a time-varying coefficient. Equation (3) becomes:
yit     xi   2 D2,it  ...   T DT ,it  zi  2 zi D2,it    T zi DT ,it  uit  (4)
5
and the repeat sales version of the equation is
y it  y is   2 D2,it  D2,is      T DT ,iT  DT ,is  
 z D
2 i
2 ,it
 D2,is     T z i DT ,iT  DT ,is   u it  u is 
(5)
The new variables measure changes in z between time t and the base period. The
bracketed terms in equations (4) and (5) are the error terms when z is unobserved. The
missing variables are correlated with the time variables, which leads to biased estimates
of the price index.
3. Quantile Regression
As with any mean-based procedure, the ordinary regression model is sensitive to
outliers. Although outliers are occasionally simply miscoded data, at other times missing
variables lead to extreme values for the error terms. An obvious example in the case of
house price models is remodeling, which is likely to produce an extremely high value for
the error terms when it is not observed in the data set. The “quality” variable may also be
the source of outliers: given observed housing characteristics, unusually high-quality
homes will tend to have high prices and large values for the error term.
Unlike ordinary least squares, the target for quantile regression estimates is a
parameter that is specified before estimation. Let q represent the target quantile. Also,
let eit be the residual implied by the econometric model. Quantile parameter estimates are
the coefficients that minimize the following objective function:
 2q e
eit 0
it
  2(1  q) eit
(6)
eit 0
At the median, q = 0.5, which implies that equal weight is given to positive and negative
residuals. At the 90th percentile, 2q = 1.8 and 2(1-q) = .2, which implies that more weight
6
is given to positive residuals – observations with high values for the dependent variable,
given the values of the explanatory variables. Equation (6) will be minimized at a set of
parameter values where 100q% of the residuals are positive. This result differs from
ordinary least squares, in which the sum of the residuals equals zero and otherwise there
is no constraint on the number of positive residuals.
Koenker and Bassett (1978) originally proposed the quantile regression approach.
Examples of applications include Albrecht (2003); Bassett and Chen (2001); Buchinsky
(1994, 1998a, 2001); Dimelis and Louri (2002); Garcia, Hernandez, and Lopez-Nicholas
(2001); Hartog, Pereira, and Jose (2001); Levin (2001); Martins and Pereira (2004); and
Thorsen (1994). Buchinsky (1998b) and Koenker and Hallock (2001) present useful
surveys.
Each of these studies presents estimated equations with the general from
y i   q xu  u qi . The form of this equation implies that the coefficients differ by quantile.
For example, Martins and Pereira (2004) find that returns to schooling are higher for
more-skilled individuals. Their evidence for this conclusion comes from a regression of
the natural logarithm of wages on a set of human capital characteristics, one of which is
years of schooling. The coefficient for years of education is higher at higher quantiles.
Quantile effects have a straightforward missing variables interpretation that
follows directly from the hedonic and repeat sales price index estimators. For example,
the contribution of a sale at time t=2 to the price index can be found by taking the
derivative of equation (4) or (5) with respect to D2,it. The result,  2*   2   2 z i , varies
with the missing variable z. If λ2 > 0, then higher values of z lead to higher values for  2* .
But z is part of the error term. Thus, high values of the error term imply high values
for  2* – a quantile effect.
7
The intuition behind the quantile effect is the same as the motivation typically
offered for selection bias in the repeat sales estimator – that the repeat sales sample is not
representative of the rest of the housing market. For instance, the repeat sales sample
may draw more heavily from neighborhoods with amenities that attract wealthy, mobile
homebuyers, and the prices of these homes may appreciate more rapidly than homes in
other neighborhoods. If the full set of neighborhood amenity variables were observed,
there would be neither a quantile effect nor a sample selection issue. Similarly, homes
that have been remodeled can be represented by a missing variable that adds to the vector
of housing characteristics beginning at the time the remodeling is completed.
The
remodeling variable produces a quantile effect because it is correlated with the time
dummy variables.
The case for the quantile effect is particularly strong for the remodeling example
because only a minority of homes is remodeled over time. Remodeling shows up as an
outlier in a standard regression model. Such outliers are drawn from the upper tails of the
error distribution. The effects of this unobserved variable would not contaminate other
points in the distribution. A median-based estimate (q = .5) will be far less vulnerable to
the effects of omitted variables that affect only a portion of the sample.
8
4. A Monte Carlo Analysis
In this section, I report the results of a set of Monte Carlo experiments that
illustrate the benefits of the quantile approach to estimating house price indexes. The
basis for the experiments is a straightforward two-period version of equation (4):
yi  5  xi  .2 Di  zi Di  ui
(7)
The time subscript is suppressed from equation (7) because it unnecessarily complicates
the notation of the hedonic model, which is sufficient for illustrating the benefits of the
quantile approach. I draw values of x from a unit normal distribution. I generate the time
variable D, by making draws from a U(0,1) distribution and setting D = 1 when the
randomly drawn value is greater than 0.5. The “missing” variable, z, is drawn from a
U(-.5,.5) distribution.
Finally, I draw values for the error term, u, from a normal
distribution with a mean of zero and a variance that assures that the R2 from a regression
of y on x, D, and zD will be approximately 0.9 on average. I let the values of  vary from
0 to 1 while maintaining each of the other parameters at the values shown in equation (7).
Thus, observations with higher values of z have higher appreciation rates on average.
Each experiment has 1000 observations.
When a sale occurs during the base time period, D = 0. The price of an identical
home is 0.2+z when the sales takes place during the second period. If z represents
quality, then the appreciation rate is higher for high-quality homes. This variable would
be missing in a typical econometric study. If z is not observed, appreciation rates are
higher for observations drawn from the upper tails of the error distribution. Thus, the
Monte Carlo setup generate quantile effects, in which the implied marginal effect of D –
0.2+z – varies across the error distribution.
9
Table 1 reports means and standard deviations for quantile regression estimates of
1000 replications of each experiment. I estimate each regression at target quantiles of
0.25, 0.50, and 0.75. The explanatory variables for the regressions are simply x and D;
zD is not included. The missing variable, zD, is not correlated with x but it is correlated
with D. Thus, omitting zD does not bias the estimated coefficient for x but does lead to
biased estimates for the D coefficient. The true coefficient for D rises with , and when
>0 it is higher at higher quantiles. Therefore, the question in the Monte Carlo analysis
is whether the quantile approach indicates higher appreciation rates – i.e., higher
coefficients for D – at higher quantiles. Given the structure of the Monte Carlo setup, the
true coefficient is 1.0 for x at all values of . The true intercept is lower at lower
regression quantiles because errors are negative on average at q = 0.25 and positive at q =
0.75. All calculations are performed using the QREG command in STATA.
The results are precisely as expected. The average estimated coefficient for x is
close to 1.0 across the three target quantiles and across the five alternative values of .
Since the error term, u, and the omitted variable, zD, both have means of zero, the
estimated intercepts are approximately equal to their true value of 5.0 at the residual
median (q = 0.5). The average intercepts are lower than 5.0 at q = 0.25 and are higher
than 5.0 at q = 0.75. As expect, estimated appreciation rates – the coefficient for D – are
approximately equal to the correct value of 0.20 when quantile effects are absent ( = 0).
Importantly, estimated appreciation rates are lower than 0.20 when >0 and q = 0.25, and
they are higher than 0.20 when >0 and q = 0.75. The estimate appreciation rates
average just under 0.20 at the median of the distribution of residuals.
10
The last three rows of the table show the percentage of rejections for the null
hypothesis of equal coefficients for the 25% and 75% quantiles. The tests are based on
20 replications of a bootstrap algorithm. As quantile effects are absent for x, we should
expect the null hypothesis to be rejected no more than 5% of the time (the nominal size of
the test) for this variable. Rejection rates are somewhat lower than 5% for this variable,
and they do not vary systematically by . Since quantile effects always exist for the
intercept, the tests always reject the null hypothesis of equal intercepts at the 25% and
75% quantiles. The most important finding is that the rejection rate for equal coefficients
for D rises with . This result means that, as expected, the statistical test is more likely
to indicated quantile effects as the magnitude of the missing variable (Dz) increases.
To put these results in perspective, assume that z represents a trait such as
remodeling or simply the change in quality between the two periods. Prices of homes
with positive values for z increase over time, and prices fall when z is negative. If z is
unobserved, standard estimates will typically be biased. If most homes have increased in
quality, then standard estimates of the appreciation rate are biased upward. By allowing
for differences in coefficients across target quantiles, the quantile estimator can detect
differences in appreciation rates. In a conventional case of remodeling, most values of z
equal zero while a small percentage are positive. Standard appreciation rate estimates
will again be biased upward in this case. In contrast, a median-based estimator will
provide accurate estimates, and the estimates at high target quantiles will detect the
higher rates of appreciation associated with remodels.
11
5. Data and Model Specification
The data set for the empirical application of the quantile regression estimator was
drawn from two sources, the Illinois Department of Revenue (IDOR) and the Cook
County Assessor’s Office.
IDOR conducts reviews of assessment practices for all
counties in Illinois, including Cook County. Through a Freedom of Information Act
request, IDOR provided data on all sales of single-family homes in the City of Chicago
for 1983-1999 with the exception of 1992. Important variables include the sales price,
date of sale, and the parcel identification number (or “PIN”). The PIN allows me to
merge the IDOR data with the 1997 Cook County file of assessments. The assessment
file includes the address and standard housing characteristics.1 However, the housing
characteristics are available only for 1997, and there is no way of identifying changes in
the characteristics over time. If most homes are not remodeled during this period, then a
median-based estimator will provide accurate estimates of constant-quality appreciation
rates.
Table 2 provides descriptive statistics for sales prices and the housing
characteristics. There are 129,251 sales during this period, and 32,959 pairs of repeat
sales. In 1997, the average home had 1244 square feet of living area, was on a 4131
square foot lot, had 2.878 bedrooms, and was just under nine miles form the traditional
Chicago city center (the intersection of State and Madison Streets). House age naturally
varies over time; the average across all sales at the time of sale is 63.572 years. The
1
As described in McMillen (2004), the only address that is available is for the building owner rather than
the actual property. The PIN identifies the location of the property down to the quarter section level – a
quarter square mile. I used a GIS program to geocode the building owner addresses. The final sample
includes only those homes with owners whose addresses are located in the same quarter section. Since a
quarter section is ½ x ½ mile, the location of a home may be misidentified by as much as 2x0.5 2 = 0.71
miles, which would happen if the home and its owner were at opposite corners of the quarter section.
12
mode house is built of brick, has a basement and attic, does not have central air
conditioning, and has a garage. The average nominal sales price across the sample period
is 107,591. The range in sales prices is large – as low as $250 and as high as $4.2
million.2
The repeat sales sample is nearly identical to the overall sample. Average sales
price is higher because, by construction, the repeat sales sample is dominated by sales
from later dates: whereas any home selling in 1983 is almost certainly a first sale, a 1999
observation may be either a first or second sale. Similarly, the average repeat sales home
is about two years older than the average observation from the full sample because homes
are older later in the time interval. All variables without a time dimension – building
area, lot size, number of bedrooms, distance from the city center, and the dummy
variables for brick construction, lack of a basement, an attic, central air conditioning, and
a garage – are nearly identical on average across the two samples.
With data covering 16 years and sales dates identified by the month of sale, the
basic hedonic and repeat sales specification could include as many as 192 time dummy
variables.
Price indexes estimated with monthly dummy variables are sensitive to
extreme values from months with few sales and have misleadingly sharp discontinuities.
Quantile regression is slow and cumbersome with thousands of observations and more
than 200 explanatory variables. Although aggregating up to the quarter or year of sale
reduces the estimation burden, it still produces an index with unrealistic discontinuities
over time.
Although the IDOR attempts to screen the data for non-arm’s length sales, the small number of sales with
extremely low prices should be viewed with skepticism. There is no obvious cutoff point for discarding
these observations. An advantage of the quantile regression approach is that it is not sensitive to these
outlier observations.
2
13
McMillen and Dombrow (2001) propose a simple procedure for estimating either
hedonic or repeat sales price index that produces a smooth, continuous function with a
small number of coefficients to be estimated. The basis for the estimator is the Fourier
approach of Gallant (1981, 1982). The general form for the effect of the time of sale
variables in equation (3) is simply g(T), where T represents the month of sale. Since sales
dates range from January 1983 to December 1999, the range for T is 1 to 204. The
Fourier approach begins by transforming the time variable to lie between 0 and 2π: zi ≡
2πTi/204. The Fourier expansion is g(Ti) = 0 + 1zi + 2zi2 + ∑q(qsin(qzi) + qcos(qzi)).
A small number of sine and cosine terms turn out to be sufficient to model price indexes.
In the empirical section of the paper, I set the maximum order of the expansion at two.
Thus, the Fourier expansion is simply g(Ti) = 0 + 1zi + 2zi2 + 1sin(zi) + 1cos(zi) +
2sin(2zi) + 2cos(2zi).
In the general form of the repeat sales model, we have yit – yis = g(Ti) – g(Ts) + uit
– uis, where Ts represents the earlier date of sale. The Fourier expansion is particularly
useful for the repeat sales estimator because it uses a parametric function to
approximation g(T), which makes it possible to impose that the g(Ti) and g(Ts) are simply
two values of the same function. McMillen and Dombrow (2001) show that the Fourier
version of the repeat sales estimator is
y i  1 z i   2 z i2  1  sin z i    1  cosz i    2  sin 2 z i    2  cos2 z i  (8)
where I again impose that the maximum order of the expansion is two, and  indicates
the change between sales dates (e.g., yi = yit – yis). This approach has been used to
estimate price indexes by McMillen (2003) and Ihlanfeldt (2004).
14
Note that equation (8) does not include an intercept. A positive value for the
intercept would imply an increase in price even within a single time period. Although a
within-period price increase is possible, particularly for a time interval as long as a year,
most authors impose that the price index equals zero during the base period.3 Even if the
true intercept is zero in the middle of the error distribution, the intercept will be negative
at lower quantiles and positive at higher points in the error distribution. The question is
how to normalize the implied path of time coefficients so that we can compare rates of
price appreciation across quantiles.4
In a standard repeat sales model, we can impose that the intercept equals zero in
two ways. The obvious one – omitting the constant term from the regression – is not an
option in quantile regression because intercepts cannot equal zero across all quantiles.
The second alternative is to estimate the regression with an intercept, and then solve for
the restricted least squares estimates that are implied by a zero intercept. Let X be the
matrix of explanatory variables for the unrestricted regression, and let R be a vector with
a one in the position corresponding to the intercept in X and zeros elsewhere. The


1
formula for the restricted coefficients is ˆ r  ˆ  X X 1 R RX X 1 R Rˆ .
Let aij
represent the entry in row i and column j of  X X 1 , and assume that a vector of one’s is
the first column of X. Then the formula for ̂ ri – the coefficient in row i of ̂ ri –
is ˆ i  ˆ1 a1i / a11  . Calculating the restricted price index by imposing a zero intercept is
not equivalent to obtaining an unrestricted estimate with a non-zero value in the base
3
Goetzmann and Siegel (1995) suggest including an intercept because properties are often upgraded around
the time of a transaction, and these upgrades are seldom observed in standard data sets.
4
Normalization is not an issue with the standard hedonic estimator because it always has an intercept. The
hedonic estimates directly compare prices in one time to prices in a base year, which is the date whose
dummy variable is omitted from the regression.
15
period and subtracting the intercept from all dates – a parallel shift in the price index.
The formula for restricted coefficients rotates the price index so that the restricted
intercept is zero.
This transformation also is a logical basis for the quantile repeat sales estimates.
The quantile estimator is not a simple regression, and does not have a direct counterpart
to the  X X 1 matrix. However, the transformation ˆ i  ˆ1 a1i / a11  takes any of set of
coefficients and produces a zero intercept (i = 1) while rotating the price index. This
transformation rotates the price index equally at each regression quantile, which makes it
possible to directly compare the price indexes across quantiles.
6. Estimated Price Indexes
Estimated standard and quantile hedonic regression results are shown in Table 3.
The standard estimates show that each additional 10% of building area increases sales
prices by 69.5%, and the elasticity of sales price with respect to lot size is 0.253. The
implied deprecation rate of sales prices with respect to age is -0.5% per year. Controlling
for square footage, an additional bedroom reduces sales prices by 1.1%. A home with
frame construction, a basement, an attic, central air conditioning, and a garage sells for
more than a home without these characteristics. Each additional mile from the city center
lowers sales prices by 3.4%.
The last column of the table shows the differences in the estimated coefficients
for the 25% and 75% quantiles. Nearly all of the housing characteristics exhibit quantile
effects. Additional square footage adds more to sales prices at higher quantiles, while
larger lot sizes have larger effects at lower points in the error distribution. The effect of
16
age is much more pronounced at lower quantiles. Garages add more to sales prices at
lower points in the error distribution. The distance from the city center gradient is larger
at higher quantiles.
Although the coefficients for the Fourier expansion terms are
difficult to interpret directly, the significant differences across quantiles imply that there
are differences in the implied price indexes.
The differences in coefficients for age support the interpretation of quantile
effects as a means of controlling for unobserved quality changes.
Although the
coefficient for age is significantly negative at each quantile, estimated depreciation rates
are much higher at lower quantiles. The positive errors for observations in the upper
quantiles may arise because of upgrades. Age becomes far less important a determinant
of house prices when a home is remodeled.
In contrast, the negative errors for
observations at the lower quantiles may come about at least in part because the homes
have not been well maintained, and poor maintenance implies a high depreciation rate.
The regression results are not particularly informative for the repeat sales
estimator because the only explanatory variables are sine and cosine terms used to
construct the implied price index. The results can be observed directly in Figure 1, which
also shows the implied price index from the base hedonic regression. The two price
indexes are nearly identical.
The end points in December 1999 both imply an
approximate doubling of prices since January 1983: the final values are 1.018 for the
hedonic index and 1.066 for the repeat sales price index.
Figure 2 shows the hedonic price indexes by quantile. The indexes show a nearly
uniform progression in the rate of appreciation over time: low quantiles have low rates of
appreciation and high quantiles have high rates. The end values for the 10%, 25%, 50%,
17
75%, and 90% quantiles are 0.834, 0.875, 0.942, 1.010, and 1.072. The paths differ
somewhat over time, particularly during the late 1980s when the Chicago housing market
first began to rebound from years of slow growth. Prices at the higher quantiles rose first.
The price of homes at the 10% and 25% quantiles never enjoyed the rapid increase in the
late 1980s, and did not make up the lost ground in the 1990s.
Figure 3 shows the repeat sales quantile price indexes. The estimated price paths
are somewhat different from the hedonic indexes. Again, the prices at the lowest quantile
appreciate most slowly; the final value for the 10% quantile is 0.908. However, the price
path for the 25% and 90% quantiles form a pair of similar indexes while the 50% and
75% indexes form another pair. The endpoints at the 25%, 50%, 75%, and 90% quantiles
are 1.035, 1.132, 1.158, and 1.080. Quantile effects do not change monotonically when
progressing from the lowest to highest quantiles. In this case, prices rose most rapidly for
homes nearer the middle of the distribution.
Another way to look at the quantile effects is to compare the base indexes with
the quantile indexes with which they are most highly correlated. The base hedonic index
is most highly correlated with the price index for the 50% quantile, while the base and
25%-quantile repeat sales indexes are most similar.5 These price indexes are shown in
Figure 4. The indexes are very similar. Figure 4 illustrates two important points. First,
the apparent sample selection problems associated with the repeat sales estimator are
exaggerated. Repeat sales and hedonic price index estimates are often quite similar.6
The second point is that quantile estimates for targets from the middle of the distribution
5
The correlations between the base hedonic and 10%, 25%, 50%, 75%, and 90% hedonic quantile indexes
are 0.99707, 0.99790, 0.99890, 0.99790, and 0.99700. Counterpart correlations for the repeat sales indexes
are 0.99974, 0.99994, 0.99986, 0.99989, and 0.99956.
6
The corollary is that the advantages of the hedonic approach are exaggerated. Hedonic price functions
suffer from missing variables that are likely to be correlated with the time of sale explanatory variables.
18
will generally be similar to standard price index estimates. However, the median-based
quantile estimate is not necessarily the one that is closest to the standard estimator. The
fact that the 25%-quantile estimate is most similar to the base repeat sales estimate
suggests that observations with unexpectedly low appreciation rates greatly influence the
standard estimator in this application.
In comparing the hedonic and repeat sales quantile results, it is important to bear
in mind that the form of the error term is different. Whereas the hedonic approach targets
points in the underlying error distribution, the repeat sales model targets point in the
distribution of the changes in errors over time. In my interpretation of the error term as
an unobserved variable representing housing quality, the hedonic approach targets levels
of quality and the repeat sales approach targets changes in quality. However, quality is
far from the only unobserved housing characteristic. More missing explanatory variables
can generate a rich error process. Thus, it is not surprising that the quantile effects differ
for the hedonic and repeat sales models; nor is it surprising that the effects are not always
uniform across quantiles.
19
7. Conclusion
The quantile approach has several advantages over conventional approaches to
estimating house price indexes.
Targeting quantiles from the middle of the error
distribution reduces the effects of outliers.
The problem of outliers is particularly
important for the repeat sales estimator, which is vulnerable to an upward bias when the
sample includes remodeled houses and there is no way to identify which homes have
been upgraded. In this situation, a more realistic view of the housing market may be
gained by constructing indexes using lower quantiles as the target point. The quantile
approach can also provide a richer view of the overall housing market by revealing
patterns that vary across quantiles.
Data for Chicago from 1983-1999 illustrate some of the advantages of the
quantile approach. Hedonic estimates reveal significant differences in coefficients across
quantiles. Depreciation rates are lower at higher quantiles, suggesting that these quantiles
include homes that have been remodeled. Square footage adds more to sales prices at
upper quantiles, while lot size has a great effect on prices at lower quantiles. Hedonic
price index estimates reveal a uniform pattern across quantiles: over the full 1983-1999
period, appreciation rates are higher for homes in higher quantiles. The pattern is not
uniform for the repeat sales indexes. Although there is again a tendency toward higher
appreciation rates at higher quantiles, the appreciation rate is lower for the 90% quantile
than for the 50% or 75% quantile.
20
References
Albrecht, James, Anders Bjorklund, and Susan Vroman, “Is There a Glass Ceiling in
Sweden?,” Journal of Labor Economics 21 (2003), 145-177.
Bailey, M.J., R.F. Muth, and H.O. Nourse, “A Regression Method for Real Estate Price
Index Construction,” Journal of the American Statistical Association 58 (1963)
933-942.
Bassett, Gilbert W., Jr., and Hsiu-Lang Chen, “Portfolio Style: Return-Based Attribution
using Quantile Regression,” Empirical Economics 26 (2001), 293-305.
Buchinsky, Moshe, “Changes in the U.S. Wage Structure 1963-1987: Application of
Quantile Regression,” Econometrica 62 (1994), 405-58.
Buchinsky, Moshe, “The Dynamics of Changes in the Female Wage Distribution in the
USA: A Quantile Regression Approach,” Journal of Applied Econometrics 13
(1998a), 1-30.
Buchinsky, Moshe, “Recent Advances in Quantile Regression Models: A Practical
Guideline for Empirical Research,” Journal of Human Resources 33 (1998b),
88-126.
Buchinsky, Moshe, “Quantile Regression with Sample Selection: Estimating Women’s
Return to Education in the U.S.,” Empirical Economics 26 (2001), 87-113.
Case, Bradford and John M. Quigley, “The Dynamics of Real Estate Prices,” Review of
Economics and Statistics 73 (1991), 50-58.
Case, Karl E. and Robert J. Shiller, “Prices of Single-Family Homes since 1970: New
Indexes for Four Cities,” New England Economic Review (1987), 45-56.
Case, Karl E. and Robert J. Shiller, “The Efficiency of the Market for Single-Family
Homes,” American Economic Review 79 (1989), 125-137.
Dimelis, Sophia and Helen Louri, “Foreign Ownership and Production Efficiency: A
Quantile Regression Analysis,” Oxford Economic Papers 54 (2002), 449-469.
Follain, James R. and Charles A. Calhoun, “Constructing Indices of the Price of
Multifamily Properties using the 1991 Residential Finance Survey,” Journal of
Real Estate Finance and Economics 14 (1997), 235-255.
Gallant, A. Ronald, “On the Bias in Flexible Functional Forms and an Essentially Unbiased
Form: The Fourier Flexible Form,” Journal of Econometrics 15 (1981), 211-245.
21
Gallant, A. Ronald, “Unbiased Determination of Production Technologies,” Journal of
Econometrics 20 (1982), 285-323.
Garcia, Jaume, Pedro J. Hernandez, and Angel Lopez-Nicolas, “How Wide is the Gap?
An Investigation of Gender Wage Differences Using Quantile Regression,”
Empirical Economics 26 (2001), 149-167.
Goetzmann, William N. and Matthew Spiegel, “Non-Temporal Components of
Residential Real Estate Appreciation,” Review of Economics and Statistics 77
(1995), 199-206.
Hartog, Joop, Pedro T. Pereira, and A. C. Jose, “Changing Returns to Education in
Portugal during the 1980s and Early 1990s: OLS and Quantile Regression
Estimators,” Applied Economics 33 (2001), 1027-1037.
Ihlanfeldt, Keith R., “The Use of an Econometric Model for Estimating Aggregate Levels
of Property Tax Assessment within Local Jurisdictions,” National Tax Journal 57
(2004), 7-23.
Kiel, Katherine A. and Jeffrey E. Zabel, “Evaluating the Usefulness of the American
Housing Survey for Creating Housing Price Indices,” Journal of Real Estate
Finance and Economics 14 (1997), 189-202.
Koenker, Roger and Gilbert W. Bassett, Jr., “Regression Quantiles,” Econometrica 46
(1978), 33-50.
Koenker, Roger and Kevin F. Hallock, “Quantile Regression,” Journal of Economic
Perspectives 15 (2001), 143-156.
Levin, Jesse, “For Whom the Reductions Count: A Quantile Regression Analysis of
Class Size and Peer Effects on Scholastic Achievement,” Empirical Economics 26
(2001), 221-246.
Mark, J.H. and M.A. Goldberg, “Alternative Housing Price Indices: An Evaluation,”
AREUEA Journal 12 (1984), 30-49.
Martins, Pedro S. and Pedro T. Pereira, “Does Education Reduce Wage Inequality?
Quantile Regression Evidence from 16 Countries,” Labour Economics 11 (2004),
355-371.
McMillen, Daniel P., “The Return of Centralization to Chicago: Using Repeat Sales to
Identify Changes in House Price Distance Gradients,” Regional Science and
Urban Economics 33 (2003), 287-304.
McMillen, Daniel P., “Airport Expansions and Property Values: The Case of Chicago
O’Hare Airport,” Journal of Urban Economics 55 (2004), 627-640.
22
McMillen, Daniel P. and Jonathan Dombrow, “A Flexible Fourier Approach to Repeat
Sales Price Indexes,” Real Estate Economics 29 (2001), 207-225.
Palmquist, R.B., “Alternative Techniques for Developing Real Estate Price Indexes,”
Review of Economics and Statistics 66 (1980), 394-404.
Thibodeau, Thomas G., “Housing Price Indexes from the 1973-83 SMSA Annual
Housing Survey,” AREUEA Journal 17 (1989), 110-117.
Thorsen, James A., “The Use of Least Median of Squares in the Estimation of Land
Value Equations,” Journal of Real Estate Finance and Economics 8 (1994), 183190.
23
Table 1
Monte Carlo Results
Variable, Percentile
x, 25%
x, 50%
x, 75%
D, 25%
D, 50%
D, 75%
Intercept, 25%
Intercept, 50%
Intercept, 75%
Rejections of Equal Coefficients for x at
25% and 75%
Rejections of Equal Coefficients for
D at 25% and 75%
Rejections of Equal Coefficients for
Intercepts at 25% and 75%
λ=0
1.001
(0.015)
1.000
(0.014)
1.000
(0.015)
0.199
(0.029)
0.199
(0.027)
0.199
(0.028)
4.772
(0.020)
5.000
(0.019)
5.228
(0.020)
3.0%
λ = .25
1.003
(0.015)
1.003
(0.014)
1.003
(0.014)
0.194
(0.030)
0.199
(0.026)
0.204
(0.028)
4.771
(0.022)
5.000
(0.019)
5.229
(0.021)
2.3%
λ = .50
1.006
(0.016)
1.006
(0.014)
1.007
(0.016)
0.180
(0.030)
0.199
(0.027)
0.220
(0.030)
4.770
(0.021)
5.000
(0.018)
5.231
(0.021)
3.9%
λ = .75
1.009
(0.015)
1.009
(0.014)
1.010
(0.016)
0.153
(0.030)
0.198
(0.028)
0.244
(0.030)
4.768
(0.021)
5.001
(0.019)
5.233
(0.021)
2.7%
λ=1
1.011
(0.016)
1.012
(0.015)
1.013
(0.017)
0.119
(0.031)
0.198
(0.030)
0.278
(0.032)
4.765
(0.021)
5.001
(0.020)
5.236
(0.021)
3.5%
3.1%
5.1%
19.4%
65.7%
96.3%
100%
100%
100%
100%
100%
Note. Means and standard deviations (in parentheses) are reported for 1000 simulations.
The base model is y = 5 + x + .2D + λzD + u, where z ~ U(-.5,.5).
24
Table 2
Descriptive Statistics
Variable
Mean
Standard Minimum Maximum
Deviation
Hedonic Sample
(n = 129,251)
Price
107591.300 83409.050
250
4200000
Building Area (square feet)
1244.148
458.901
297
11512
Lot Size (square feet)
4131.254 3136.628
247
703500
Age
63.572
24.945
1
130
Number of Bedrooms
2.878
0.800
1
9
Brick
0.623
0.485
0
1
No Basement
0.217
0.413
0
1
Attic
0.464
0.499
0
1
Central Air Conditioning
0.200
0.400
0
1
1 Car Garage
0.312
0.463
0
1
2+ Car Garage
0.456
0.498
0
1
Distance from City Center
8.995
2.722
0.877
16.808
Repeat Sales Sample
(n = 32,959)
Price
134073.000 98128.150
500
2270000
Building Area (square feet)
1246.750
454.007
400
7269
Lot Size (square feet)
4059.298 4103.080
416
703500
Age
65.274
24.736
1
129
Number of Bedrooms
2.871
0.809
1
9
Brick
0.611
0.488
0
1
No Basement
0.216
0.411
0
1
Attic
0.491
0.500
0
1
Central Air Conditioning
0.201
0.401
0
1
1 Car Garage
0.313
0.464
0
1
2+ Car Garage
0.458
0.498
0
1
Distance from City Center
8.698
2.661
1.211
16.722
25
Table 3
Hedonic Regression Results
Variable
z = 2πT/204
z2
sin(z)
cos(z)
sin(2z)
cos(2z)
Natural Log of Building
Area
Natural Log of Lot Size
Age
Number of Bedrooms
Brick
No Basement
Attic
Central Air Conditioning
1 Car Garage
2+ Car Garage
Distance from City
Center
Intercept
R2 or Pseudo-R2
OLS
10%
25%
50%
75%
90%
0.029
(1.323)
0.021
(6.276)
0.009
(2.477)
-0.134
(9.923)
-0.001
(0.529)
0.001
(0.140)
0.695
(117.043)
0.253
(54.659)
-0.005
(74.031)
-0.011
(4.938)
-0.015
(4.333)
-0.043
(11.049)
0.017
(5.889)
0.172
(48.601)
0.100
(27.783)
0.094
(27.106)
-0.034
(57.170)
4.552
(84.540)
0.446
0.046
(1.356)
0.014
(2.570)
0.000
(0.055)
-0.073
(3.426)
0.010
(2.525)
-0.001
(0.182)
0.492
(51.331)
0.370
(48.939)
-0.009
(66.471)
-0.018
(4.784)
0.008
(1.408)
-0.040
(5.804)
0.058
(12.282)
0.126
(22.941)
0.181
(32.098)
0.170
(31.126)
-0.022
(19.146)
4.643
(53.794)
0.235
0.032
(1.382)
0.017
(4.762)
-0.014
(3.792)
-0.100
(6.987)
0.009
(3.710)
-0.000
(0.021)
0.599
(97.313)
0.327
(67.848)
-0.007
(92.496)
-0.017
(6.870)
-0.049
(13.415)
-0.056
(13.034)
0.034
(10.945)
0.147
(39.426)
0.102
(26.771)
0.096
(26.157)
-0.030
(43.125)
4.576
(81.981)
0.271
0.028
(1.266)
0.019
(5.554)
-0.017
(4.830)
-0.135
(9.724)
-0.002
(0.783)
0.005
(1.363)
0.718
(117.398)
0.266
(55.709)
-0.005
(74.675)
-0.012
(5.075)
-0.059
(16.817)
-0.063
(15.605)
0.014
(4.698)
0.148
(40.735)
0.082
(22.099)
0.069
(19.166)
-0.034
(55.181)
4.412
(79.499)
0.302
-0.045
(2.039)
0.033
(9.566)
0.002
(0.470)
-0.203
(14.915)
-0.014
(5.634)
-0.000
(0.122)
0.756
(117.731)
0.196
(39.123)
-0.003
(44.014)
-0.011
(4.719)
-0.028
(8.048)
-0.046
(11.524)
-0.012
(4.172)
0.152
(42.581)
0.082
(22.452)
0.060
(17.131)
-0.035
(64.484)
4.886
(84.569)
0.321
-0.072
(2.688)
0.039
(9.238)
0.033
(7.697)
-0.221
(13.191)
-0.018
(6.028)
-0.013
(2.715)
0.798
(91.462)
0.153
(22.607)
-0.001
(12.955)
-0.003
(1.021)
-0.016
(3.577)
-0.060
(11.635)
-0.045
(12.050)
0.164
(37.321)
0.054
(12.069)
0.037
(8.651)
-0.041
(59.882)
5.105
(66.021)
0.368
Β75 –
β25
-.076
(2.804)
0.016
(3.566)
0.016
(4.646)
-0.103
(6.094)
-0.023
(6.962)
-0.000
(0.064)
0.157
(19.551)
-0.131
(21.949)
0.004
(43.187)
0.005
(1.837)
0.021
(3.606)
0.010
(2.161)
-0.047
(10.862)
0.005
(1.497)
-0.019
(5.156)
-0.035
(9.788)
-0.005
(6.371)
0.310
(7.269)
Note. The regressions have 129,521 observations. Absolute z-values are in parentheses.
Twenty bootstrap replications are used to estimate standard errors for the quantile
regression estimates.
26
Figure 1
Hedonic and Repeat Sales Price Indexes
1.25
1.00
0.75
0.50
0.25
0.00
1983
1985
1987
1989
Hedonic
1991
1993
Repeat Sales
1995
1997
1999
27
Figure 2
Hedonic Quantile Indexes
1.12
0.96
0.80
0.64
0.48
0.32
0.16
0.00
-0.16
1983
1985
10%
1987
25%
1989
1991
50%
1993
75%
1995
90%
1997
1999
28
Figure 3
Repeat Sales Quantile Indexes
1.2
1.0
0.8
0.6
0.4
0.2
0.0
1983
1985
10%
1987
25%
1989
1991
50%
1993
75%
1995
90%
1997
1999
29
Figure 4
Comparison of Hedonic and Repeat Sales Base and Quantile Indexes
1.25
1.00
0.75
0.50
0.25
0.00
1983
1985
Base Hedonic
1987
1989
50% Hedonic
1991
Base Repeat
1993
1995
25% Repeat
1997
1999
Download