Oregon_Rent_Estimation

1. Background
This work was undertaken to produce two sets of values: a list of zone-specific base
rent estimates, and a set of local effect factors that adjust this base rent to reflect the
individual structure and local influences that are expected to act at a geographic level
smaller than the zone system.
The theory and mathematics behind this estimation are discussed in full in
the PECAS Model Development Rent Modifier Equations Technical Note, and the
discussion of specific equations and functional forms below refers to that document.
2. Data preparation
To prepare the data set for rent estimation, a full synthetic population of housing units
was created for the entire model region. In brief, the census SF3 data was used at the
block group level to produce targets, which were then matched with the PUMS data using a
simulated annealing process. A floorspace area was then imputed for each dwelling in
this synthetic population. Following this, the synthetic population was assigned to
blocks based on household size and tenure. The census blocks were assigned
geographic attributes by buffering shapefile data, and were then matched to the zone
boundaries.
2.1. Population targets
The targets for developing the synthetic population were taken from the 2000 census
Summary File 3 (SF3). The targets were used to establish, at the census block group
level, an appropriate number of housing units fulfilling various key categories. The
targets considered only owned and rented housing units, setting aside vacant and group
homes. The estimation of the models described in section 3 used only the rented
housing units, to avoid the difficulties of converting a home value into a rent figure.
The targets were selected to ensure the synthetic population matched the most
important characteristics related to housing:

- Household size by tenure (H017) – household size would be used in the
  assignment of synthetic households to individual census blocks (as this was the
  most detailed information released at the block level);

- Number of rooms by tenure (H026), and number of bedrooms by tenure (H042) –
  these are the two census fields describing the size of the dwelling, and were key
  in the floorspace imputation step;

- Size of building by tenure (H032) – the field describing the structure type (single
  family detached, attached, mobile home and multifamily), for the dwelling type
  adjustment factor; and

- Age of construction by tenure (H036) – needed for the age adjustment factor.
2.2. Population microdata
The population for the synthesis was taken from the 5% PUMS data. Each PUMA (the
5% area, not the larger 1% sample-based Super PUMA) was synthesized individually,
with a different synthetic population, comprising primarily the PUMS records associated
with the specific PUMA, duplicated to reflect each record’s individual weight.
A typical PUMA has around 2000 housing units available, which is a reasonable
number, but may lack some of the less common configurations. To provide additional
variability in the microdata while still using local data, three to five neighbouring PUMAs
were selected as a supplemental sample, with the microdata from these PUMAs added
to the microdata list for the specific PUMA being synthesized. These supplemental
PUMS records were duplicated to reflect their weight, multiplied by a scale factor of 0.2.
Thus, 50-65% of the population microdata records came from the PUMA being
synthesized, and the remainder from adjacent PUMAs.
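The weighting described above can be sketched in a few lines. This is a minimal illustration rather than the original script: the function name, the record identifiers and the rounding of scaled weights to whole copies are all assumptions.

```python
def build_microdata(local_records, neighbour_records, neighbour_scale=0.2):
    """Return a list of record ids, each repeated according to its weight.

    Each input is a list of (record_id, weight) pairs. Neighbour weights are
    multiplied by neighbour_scale before duplication. Rounding the scaled
    weight to a whole number of copies is an assumption of this sketch.
    """
    pool = []
    for record_id, weight in local_records:
        pool.extend([record_id] * round(weight))
    for record_id, weight in neighbour_records:
        pool.extend([record_id] * round(weight * neighbour_scale))
    return pool


# Hypothetical records: two from the PUMA itself, two from neighbouring PUMAs.
local = [("L1", 20), ("L2", 10)]
neighbours = [("N1", 50), ("N2", 50)]
pool = build_microdata(local, neighbours)
local_share = sum(1 for record_id in pool if record_id.startswith("L")) / len(pool)
```

In this toy case the local records make up 60% of the pool, inside the 50-65% range quoted above.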
2.3. Population synthesis
The synthesis was performed using a simulated annealing process, the detailed
documentation of which is available separately. The core of the process is to populate
each block group with an appropriate number of records, and then to iteratively try a
change to a random record in the block group; to add, remove or swap it with another
record from the population microdata. If the resulting block group better matches the
targets, the change is kept. If the fit is worse, the change may be kept at early stages in
the process, with a probability that declines as the synthesis progresses until only
improvements are kept.
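The loop described above can be sketched as follows. This is a simplified illustration, not the documented synthesizer: each record is reduced to a single category label, the misfit measure is a plain sum of absolute differences, and the linear cooling schedule and all names are assumptions.

```python
import math
import random
from collections import Counter


def misfit(records, targets):
    """Sum of absolute differences between category counts and targets."""
    counts = Counter(records)
    return sum(abs(counts[c] - targets.get(c, 0)) for c in set(counts) | set(targets))


def anneal(records, targets, pool, n_changes, start_temp=1.0, seed=0):
    """Try n_changes random add/remove/swap moves on one block group."""
    rng = random.Random(seed)
    records = list(records)
    current = misfit(records, targets)
    for step in range(n_changes):
        # Temperature falls linearly towards zero, so late steps keep only improvements.
        temp = start_temp * (1 - step / n_changes)
        trial = list(records)
        move = rng.choice(["add", "remove", "swap"])
        if move == "add" or not trial:
            trial.append(rng.choice(pool))
        elif move == "remove":
            trial.pop(rng.randrange(len(trial)))
        else:
            trial[rng.randrange(len(trial))] = rng.choice(pool)
        candidate = misfit(trial, targets)
        worse_by = candidate - current
        # Keep improvements always; keep other changes with a declining probability.
        if worse_by < 0 or (temp > 0 and rng.random() < math.exp(-worse_by / temp)):
            records, current = trial, candidate
    return records, current


final_records, final_fit = anneal(["a"] * 5, {"a": 3, "b": 2}, pool=["a", "b"], n_changes=2000)
```

Setting `start_temp=0` turns the sketch into a purely greedy search, which corresponds to the final stage of the process where only improvements are kept.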
As mentioned, this process was run once independently for each PUMA, with all of the
block groups within that PUMA as individual targets. The number of changes to try was
100,000 * the number of block groups in the PUMA. For PUMAs in Oregon proper, or in
Washington’s Clark County, the number of changes was doubled. The number of
changes ranged from a floor of 5 million for three sparsely populated PUMAs in the
halo, to over 58 million for Oregon PUMA 701 (Eugene).
2.4. Floorspace and property imputation
The goal of the rent estimations was to develop estimates of rent by square foot, but the
census data does not provide this; it reports only the number of rooms. To determine the
floorspace, each dwelling was assigned a floorspace based on a series of equations
previously developed from a linear regression of American Community Survey data,
based on the number of rooms, number of bedrooms, building size and unit age. Since
these values were all used as targets in the synthesis, the inputs to the floorspace
equation are felt to be reasonable. The results were also reasonable, although lacking
the diversity that actual floorspace data would have, were it available.
2.5. Block assignment and record adjustment
Each synthetic housing unit was located at this point within a block group, but the beta
zones cross block group boundaries in some cases. Further, the geoprocessing step
below would work on the block level, and provide a sense of the variety in the spatial
properties. For these reasons, it was necessary to assign each record to an individual
census block.
Complicating this was the paucity of data available at this level; the most useful values
reported by the census that include tenure are counts by number of persons. While this is
obviously not the best metric for housing properties, it has at least some relation; larger
families, ceteris paribus, occupy larger houses.
A second complication was small differences in the numbers of records. The census
SF3 data used for targets does not entirely agree with the SF1 data at the block level, and
the synthetic population does not match either of these entirely. The synthesis is a
stochastic process matching multiple sets of targets (of which household size is only
one); while it can get close within a reasonable time, the last small adjustments
could take days of computing time to match exactly.
The totals of the synthetic population at the block group level were used to adjust the
SF1 block level targets proportionally to match the total synthetic population. The
synthetic records were then sorted by household size, and assigned to blocks from the
smallest size to the largest; this minimized the spatial shift between blocks and also
kept the household size as close as possible for each block. For instance, if a synthetic
block group had more 2 person and fewer 5 person households than the census
targets, the additional 2 person records would be assigned to blocks needing 3 person
households, causing some 3 person records to be assigned to blocks needing 4 person
households, and then some 4 person records would make up the difference in the 5 person
household category. This is more complicated, but provides more fidelity than the simpler
solution of filling some blocks needing 5 person households with the extra 2 person households.
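The cascade in this example falls out of sorting both the records and the block-level slots by household size and pairing them in order. The sketch below is illustrative only: the function name and data layout are assumptions, and the block targets are assumed to have already been scaled to the synthetic total.

```python
def assign_to_blocks(record_sizes, block_targets):
    """Pair records with block slots after sorting both by household size.

    record_sizes: household size of each synthetic record in a block group.
    block_targets: {block_id: {household_size: number_of_units}}, assumed to
    be already scaled so the slots total the number of records.
    Returns a list of (record_size, block_id, slot_size) tuples.
    """
    slots = sorted(
        (size, block_id)
        for block_id, by_size in block_targets.items()
        for size, count in by_size.items()
        for _ in range(count)
    )
    if len(slots) != len(record_sizes):
        raise ValueError("block targets must be scaled to the record count")
    return [
        (record, block_id, slot_size)
        for record, (slot_size, block_id) in zip(sorted(record_sizes), slots)
    ]


# Hypothetical block group: one extra 2 person record, no 5 person record.
targets = {"A": {2: 2}, "B": {3: 1}, "C": {4: 1}, "D": {5: 1}}
assignment = assign_to_blocks([2, 2, 2, 3, 4], targets)
```

Here the extra 2 person record lands in the 3 person slot, the 3 person record in the 4 person slot, and the 4 person record in the 5 person slot, so no record is more than one size category away from its slot.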
While this processing was being done, two more individual record adjustments were
also performed. The age of the unit, which is recorded by the census in ranges, was
synthesized to produce specific age values (by randomly drawing an integer in the range of
ages, with units in the 61+ year category drawn from a range of 61-80 years).
Finally, the rent was adjusted. The SF3 data file provides a gross rent value for the
entire block group. This was first scaled to match the synthetic population size to the
SF3 size (this was a very small adjustment of a few percent or less in most cases). The
rents reported by the PUMS households in the synthetic data were then scaled so that
the sum of these synthetic rents matched the adjusted block group gross rent value.
This kept the variety in rents from the observations in the PUMS data, while maintaining
the important total rents from the SF3.
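A minimal sketch of this two-step scaling is given below. It reads the first step as multiplying the SF3 aggregate by the ratio of synthetic to SF3 unit counts, which is an interpretation of the text; the function name and the sample figures are hypothetical.

```python
def scale_rents(pums_rents, sf3_aggregate_rent, sf3_units):
    """Scale record rents so they sum to the adjusted block group total."""
    synthetic_units = len(pums_rents)
    # Step 1: adjust the SF3 total for the difference in unit counts.
    adjusted_total = sf3_aggregate_rent * synthetic_units / sf3_units
    # Step 2: apply one common factor to every PUMS rent.
    factor = adjusted_total / sum(pums_rents)
    return [rent * factor for rent in pums_rents]


# Hypothetical block group: 3 synthetic renters against 4 renters in SF3.
scaled = scale_rents([500.0, 700.0, 800.0], sf3_aggregate_rent=2200.0, sf3_units=4)
```

Because a single factor is applied to all records, the ratios between individual rents are unchanged while the total matches the adjusted SF3 figure.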
The final data set produced from this work included all of the rented housing units, with
discrete age, floorspace area, rent and a specific census block assigned to them.
2.6. Block-level geoprocessing
The final step was to determine the geographic properties that were needed for local
rent effects. The three that were selected were the distance to major roads, the distance
to water, and the local density. These were all calculated for each individual census
block in the model region. The first two are fairly self-explanatory; shapefiles provided
by ODOT were used to define major roads and water, and the buffering process was
capped at 10 miles, to keep as much fidelity as possible in the data. These values
would later be capped at a lower value in estimation, but the large distance permitted
trials with different cap distances.
The density was defined as the number of housing units within ¼ mile of the block.
Each block was assumed to have a constant density of housing units, and the portion of
each block within the buffer was used to determine this value. This density value can be
multiplied by 5.09296 to get the density per square mile for housing units.
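The conversion factor is the reciprocal of the area of a circle of ¼ mile radius, as the short check below shows. The names are illustrative.

```python
import math

BUFFER_RADIUS_MILES = 0.25
buffer_area_sq_miles = math.pi * BUFFER_RADIUS_MILES ** 2
units_to_per_sq_mile = 1 / buffer_area_sq_miles


def density_per_square_mile(units_within_quarter_mile):
    """Convert a count of housing units within 1/4 mile to units per square mile."""
    return units_within_quarter_mile * units_to_per_sq_mile
```

For example, `density_per_square_mile(200)` gives a little over 1,000 housing units per square mile.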
3. Model estimation
To estimate the model, a linear ordinary least squares procedure was used, implemented in
Python. The linear model included three structure-specific parameters, with the
Constant (eq. 1) functional form. These reflect the structure types in the TLUMIP work:
mobile home, single family attached and multifamily, with single family detached
representing the default alternative.
The TLUMIP separation of SFD and mobile home into “urban” and “rural” (small and
large lots) was not feasible to produce from the data; the census SF3 did not provide
targets by lot size. Further, the zone system mostly divides the study area into urban
and rural areas using zone boundaries; trial estimations using an “urban” versus “rural”
housing type produced unreasonable results reflecting the ecological correlation in the
data: rural SFD were deemed to have a price of roughly 20% of that of urban housing
per square foot, but the base price in rural areas like Eastern Oregon was 4 or 5 times
that of the urban areas.
The second set of parameters reflected the age and neighbourhood density of the
structure. These both used the Reversed Power (eq. 6) functional form. The age was
the age of the housing unit, in years. The density was the number of housing units
within a ¼ mile radius of the block. Both functions used a RefDValue_g of 1; in other words,
the reference house was newly built, and was the only housing unit within ¼ mile.
The third set of parameters reflected the proximity to roads and water. These both
used the Shifted Exponential (eq. 3) functional form, with ½ mile used as both the
RefDValue_g value and the maximum distance. It should be noted that the roads
shapefile used for the buffering was a major roads file, not just a limited access facility
(freeway) shapefile.
Finally, each zone was assigned a parameter using the Constant (eq. 1) functional form.
These represent the base price for housing in the zone (specifically, what the rent is for
a new single family detached house with no neighbours within ¼ mile, and ½ mile or
more from any water bodies or major roads).
As mentioned, the estimation used an ordinary least squares script written in
Python. The model described above specifies 7 local rent modifier parameters, but also
includes 517 zone-specific dummy parameters, for a total of 524 parameters. The
synthetic dataset contained 724,084 records, producing a matrix with just under 380
million individual cells (724,084 × 524). This was far more than the memory addressable
with the computer and software configuration used for the work.
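The cell count is the product of the record and parameter counts. A quick check follows; the 8-byte storage per cell is an assumption, as the source does not state the storage format.

```python
records = 724_084
parameters = 7 + 517            # local rent modifiers plus zone dummies
cells = records * parameters
gigabytes = cells * 8 / 1e9     # assuming 8-byte floats (not stated in the source)
```

The product is about 379 million cells, or roughly 3 GB at 8 bytes per cell, before any working copies are made during the regression.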
As a consequence, the synthetic dataset was sampled, and the regression run on this
sample. The largest sample that could be handled in memory was 75,000 records. To
produce the best possible estimate, the estimation was repeated 40 times with different
samples of the population, making some discussion of the variance in estimates
possible.
One zone, 4031 (a rural portion of Skamania County, Washington, in the halo), had no
observations in the data set. This zone is smaller than a single census block group, and
is mostly land in Gifford Pinchot National Forest. Base rent estimates from adjoining
zones 4030, 4035, 4036, 4040, 4041, 4044 and 4058 were averaged to provide a base
rent estimate for it. Furthermore, a set of 30 zones had fewer than 80 observations in the
entire synthetic population. For these zones, all of the synthetic records were appended
to the data sample and used in every estimation run (preventing the missing values and
matrix singularity problems that could occur if the sampling did not select any records
from the zone). As a result, the base rent estimates for these zones have very low
variance across runs, but the lack of sufficient underlying data may reduce the reliability
of the estimates.
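The sampling rule for sparse zones can be sketched as below. The function and argument names are illustrative, and drawing the random sample only from the remaining (non-sparse) records is an assumption; the source says only that the sparse-zone records were appended to every sample.

```python
import random
from collections import Counter


def draw_sample(zone_of_record, sample_size, min_obs=80, seed=None):
    """Return record indices for one estimation run.

    zone_of_record: list giving the zone id of each synthetic record.
    Records from zones with fewer than min_obs records are always included;
    the random sample is drawn from the remaining records (an assumption).
    """
    rng = random.Random(seed)
    counts = Counter(zone_of_record)
    sparse = [i for i, z in enumerate(zone_of_record) if counts[z] < min_obs]
    dense = [i for i, z in enumerate(zone_of_record) if counts[z] >= min_obs]
    return rng.sample(dense, min(sample_size, len(dense))) + sparse


# Toy data: zone 1 has 500 records, zone 2 has only 5.
zones = [1] * 500 + [2] * 5
runs = [draw_sample(zones, sample_size=100, seed=s) for s in range(40)]
```

Every one of the 40 toy runs contains all five records from the sparse zone, so its dummy variable is never all zeros.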
Figure 1 shows the model area categorized by the number of available synthetic
observations; a scattered set of zones (16 in total) have 50 or fewer observations, while
a larger number, comprising much of eastern Oregon, have 200 observations or fewer.
The Willamette Valley area, including the major cities in the model region, has plenty of
observations in each zone.
Figure 1: Number of observations in synthetic data set per zone
4. Estimation results
The estimation results are discussed in three sections: the first addresses the local
rent modifier parameters, the second the zonal base rent estimates, and the third the
variability exhibited in these estimations. The first two use the averages
of the 40 estimation runs, while the third deals explicitly with the individual
estimations.
4.1. Local rent modifiers
Seven rent modifier parameters were estimated: three for housing types, and one
each for age, density, proximity to water and proximity to major roads. All parameter
estimates were statistically significant. The estimates are shown in Table 1 below:
Table 1: Rent modifier parameters

| Parameter | Form | RefDValue | Estimated value for a_g | T-ratio for estimated a_g value | Corresponding value for θ_g |
|---|---|---|---|---|---|
| Single family attached dummy | Constant | n/a | -0.0973 | 8.50 | 0.907 |
| Multifamily dummy | Constant | n/a | 0.3255 | 54.53 | 1.384 |
| Mobile home dummy | Constant | n/a | 0.0927 | 8.67 | 1.097 |
| Age | Reversed Power | 1 | -0.00244 | 18.93 | 0.00244 |
| Density | Reversed Power | 1 | 0.000103 | 10.47 | -0.000103 |
| Dist. to water (feet) | Shifted Exponential | 2640 | 0.0395 | 5.10 | 0.0395 |
| Dist. to road (feet) | Shifted Exponential | 2640 | 0.0872 | 5.47 | 0.0872 |
The housing type parameters estimated were, as shown above, 0.907 for attached,
1.384 for multifamily, and 1.097 for mobile home, with single family detached having the
reference value of 1.
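For the three housing type dummies, the θ values in Table 1 agree with the exponential of the estimated values to within about 0.001. The text does not state the transform, so treating it as an exponential is an inference from the printed numbers, checked below.

```python
import math

# (estimated value, corresponding theta) pairs as printed in Table 1.
table_1 = {
    "attached": (-0.0973, 0.907),
    "multifamily": (0.3255, 1.384),
    "mobile home": (0.0927, 1.097),
}
differences = {
    name: abs(math.exp(estimate) - theta)
    for name, (estimate, theta) in table_1.items()
}
```

All three differences are below 0.001, which is consistent with rounding of the printed estimates.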
The higher values for mobile home and multifamily (in particular for multifamily) can be
attributed to the smaller size per unit for these dwelling types. While the rent for a 600
square foot apartment is less than the rent for an 1800 square foot house, it won’t
necessarily be one third as much (which is what the same rent per square foot would
imply); in this example, with the constant of 1.384, the apartment would rent for 46% of
the price of the house.
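The 46% figure follows directly from the floor areas and the multifamily constant, assuming both dwellings share the same zone, age and local conditions so that every other factor cancels:

```python
MULTIFAMILY_FACTOR = 1.384
apartment_sqft, house_sqft = 600, 1800

# Rent is base rent per square foot x floor area x structure factor;
# the base rent cancels in the ratio.
rent_ratio = MULTIFAMILY_FACTOR * apartment_sqft / house_sqft
```

The ratio is about 0.46, compared with one third if both dwellings had the same rent per square foot.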
The same is true for mobile homes, although this is likely tempered by both the larger
size and lower perceived value of mobile homes. Attached dwellings are much closer to
the size of detached dwellings, so the constant below 1 reflects more closely the
reduced value of living in an attached structure.
The constant for age is negative, and is broadly consistent with estimations from other
locations; for instance, -0.0014 for medium density residential space in Baltimore. This
reflects the expected reduction in the value of buildings with age. Figure 2 below charts
the resulting adjustment factor for dwellings up to 80 years old (the oldest in the data set).
Figure 2: Rent adjustment factor for age of building
[Chart: rent adjustment factor (axis 0.75 to 1.05) against structure age (0 to 80 years)]
The constant for density is positive, which may reflect multiple trends. In larger cities,
density is commonly thought to be unpleasant, but there may be an ecological
correlation effect with other, unobserved, attributes. Further, in smaller towns, the large
zones may include entire towns. In a large city, accessibility can be thought of as the
distance to the shopping centre, the hospital, the downtown and so on; this varies from
zone to zone. However, in smaller towns, accessibility is as much the distance to Main
Street as the distance to other towns and cities. This Main Street distance is reflected in
higher densities at the town level. Figure 3 below shows a chart of the effect of density
up to 750 HU within ¼ mile; the average for urbanized areas ranges from around 150 to
250 HU, although the densest areas (central Portland) have densities of up to 2000 HU
within ¼ mile.
Figure 3: Rent adjustment factor for density
[Chart: rent adjustment factor (axis 0.95 to 1.15) against density (housing units within 0.25 mi, axis 0 to 700)]
The effect of both roads and water is positive, although with a lower (but still significant)
T statistic as compared with the other local rent modifiers. The effect of roads has been
seen to be both positive and negative in other areas; in Atlanta, distance to local road
was seen as negative for low-density and positive for high-density housing. The
parameter for high-density housing in Atlanta was 0.056, slightly smaller than the
Oregon value of 0.0872. One reason for the higher value in Oregon may be the
combination of higher density in urban areas (in part due to growth boundaries)
with a larger amount of very rural area, where distance to a major road
represents access to towns and the goods and services available there.
Water has a smaller parameter than proximity to major roads in these estimations.
It also has a much smaller value than the only comparable data available; San
Diego, where beach proximity was seen to be strongly positive, with a parameter of
0.195 (versus 0.0395 in Oregon). This is not unexpected; the warmer climate of San
Diego makes water much more attractive as an amenity. Furthermore, the rocky coasts
of Oregon in particular are difficult to build upon, as compared with the relatively flat
slopes in San Diego. Both the road and water parameters are shown in Figure 4.
Figure 4: Rent adjustment factors for distance to road and water
[Chart: rent adjustment factor (axis 0.95 to 1.15) against distance from feature (0 to 3000 feet), with separate series for major road and water]
These parameters are multiplicative in nature, which means that the largest possible
range of rents within a single zone is from 1.93x base to 0.746x base. The upper figure
is for a new multifamily unit in the densest spot in the model that is also immediately
adjacent to both water and a major road; the lower figure is for an 80 year old attached
dwelling located a long way from water, roads and other houses. This is a reasonable
range, particularly since there are compensating effects: being close to a major road or
body of water means that there are fewer sites for potential dwelling units within ¼ mile,
and thus a lower density.
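The two extremes can be roughly reproduced from the Table 1 values. The sketch below does not implement the eq. 3 and eq. 6 forms, which are defined in the technical note and not reproduced in this memo. It assumes that, at these extremes, the age and density factors behave like exp(a × value) and that the water and road factors at zero distance equal exp(a); these assumptions were chosen because they match the quoted figures, not because the source states them.

```python
import math

# Values from Table 1.
THETA_MULTIFAMILY, THETA_ATTACHED = 1.384, 0.907
A_AGE, A_DENSITY, A_WATER, A_ROAD = -0.00244, 0.000103, 0.0395, 0.0872

# Assumed extreme-case factors (not the full eq. 3 and eq. 6 forms).
density_at_2000 = math.exp(A_DENSITY * 2000)  # densest spot, about 2000 HU
beside_water = math.exp(A_WATER)              # zero distance to water
beside_road = math.exp(A_ROAD)                # zero distance to a major road
age_at_80 = math.exp(A_AGE * 80)              # oldest dwellings in the data

highest = THETA_MULTIFAMILY * density_at_2000 * beside_water * beside_road
lowest = THETA_ATTACHED * age_at_80
```

Under these assumptions the product for the best-case unit rounds to 1.93 and the product for the worst-case unit rounds to 0.746.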
4.2. Zonal base rent estimates
Since there are 517 base rent estimates, a detailed table will be provided in .XLS
format. All rents are in the same money units used by the census, dollars per month,
calculated on a per square foot basis. The estimated a_g values need to be transformed
to produce θ_g, the actual dollar values of base rent; this has been done for all rent values
in this report to facilitate the discussion. The averages across zones for the various model
regions are shown in Table 2 below:
Table 2: Average base rent by region

| Model region | Average base rent per square foot |
|---|---|
| Portland | 52.25¢ |
| Vancouver | 48.23¢ |
| Salem | 41.86¢ |
| Eugene | 46.11¢ |
| Medford | 45.56¢ |
| Other Oregon | 32.72¢ |
| Halo Washington | 34.19¢ |
| Halo Idaho | 29.48¢ |
| Halo Nevada | 27.29¢ |
| Halo California | 35.10¢ |
These base prices are shown in the following two maps. Figure 5 shows the rents in the
entire model region, with Figure 6 providing a more detailed look at base rent estimates
in northwest Oregon.
Figure 5: Base rent ($/sq ft) for model region
Figure 6: Base rent ($/sq ft) for Willamette Valley
These prices are consistent with expectations; the rural areas of eastern Oregon and
the halo have the lowest rent values, and the major urban areas have the highest, most
notably the Portland region, and especially the southwestern suburbs. Smaller centres,
such as Medford, Boise and Kennewick, have higher rents than the surrounding
countryside. The most unexpected result is the high base rents for the rural areas to the
west of Bend. The two highest rent zones in the area also have a very low number of
observations, which is likely playing a role in these values. The rents in these areas may
need to be revisited with a more manual intervention.
These rent estimates are also statistically significant; the only value with an insignificant
T-statistic is for zone 2879, with a T-statistic of 0.56. This zone, in Deschutes County,
also has the highest estimate for rent, 88.51¢ per square foot, which is likely an artefact
of the availability of only 9 data points in the zone. Another 9 zones have T-ratios
between 3 and 5, and the remainder are much more significant.
4.3. Variability
Because of the difficulties in storing the entire matrix needed to calculate the
parameters in memory, the estimation was done using samples of 75,000 records at a
time, a bit over 10% of the data set. The 40 runs that were used to estimate the
parameters described in this memo were analyzed with respect to the variance seen in
this estimation.
Figure 7 shows a box plot of the model estimates of all 517 zonal specific constants in
the estimation results; the dark red band is the 25th to 75th percentile for each zone,
while the light grey lines show the extent of the minimum and maximum price estimated
from the 40 runs. The zones have been sorted by average rent for clarity.
Figure 7: Deviation of model estimates for zone prices
[Box plot: estimated rent per square foot (axis $0.00 to $0.90) for each zone, sorted by average rent]
The first thing evident from the chart is the roughly normal distribution of prices across
zones. In terms of variance, it is evident that most estimates are fairly consistent across
runs, and that the true value of base rents per zone is likely within the fairly tight set of
estimated parameters from across multiple runs.
The source of the variance in these parameter estimates is assumed to be primarily the
limited sample size for many zones. Figure 8 below shows the variance, measured as
the standard deviation of the 40 estimation runs, as a function of the number of
observations.
Figure 8: Number of observations vs. Base rent estimate variance
[Scatter plot: number of data points available (log scale, 10 to 100,000) against standard deviation of base rent estimates ($0.00 to $0.10)]
The handful of points in the lower left hand side of the chart, with very low standard
deviation despite small sample size, are the 30 zones that had all of their data included in
the estimation every time. The remaining points show a clear trend, with the standard
deviation of the base rent estimate decreasing as the number of observations increases
(note the log scale on the y axis). Above 1000 observations, the standard deviation is
typically below 3 cents, which is a reasonable margin considering the broad nature of
the question being addressed.
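The per-zone summary used in this section (the average and the standard deviation of the estimates across runs) can be computed as sketched below. The zone labels and values are made up for illustration, and the use of the sample standard deviation rather than the population form is an assumption.

```python
import statistics


def summarize_runs(estimates_by_zone):
    """Map each zone to (mean, sample standard deviation) across runs."""
    return {
        zone: (statistics.mean(values), statistics.stdev(values))
        for zone, values in estimates_by_zone.items()
    }


# Hypothetical base rent estimates ($ per square foot) from four runs.
runs = {
    "zone A": [0.50, 0.52, 0.48, 0.50],  # many observations, tight estimates
    "zone B": [0.30, 0.40, 0.20, 0.30],  # few observations, wide spread
}
summary = summarize_runs(runs)
```

In this toy data the standard deviation for zone B is about 8 cents against under 2 cents for zone A, mirroring the pattern of wider spread for zones with fewer observations.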