1. Background

This work was undertaken to produce two sets of values: a list of zone-specific base rent estimates, and a set of local effect factors that adjust this base rent to reflect the individual structure and the local influences that are expected to act at a geographic level smaller than the zone system. The theory and mathematics behind this estimation are discussed in full in the PECAS Model Development Rent Modifier Equations Technical Note, and the discussion of specific equations and functional forms here refers to that document.

2. Data preparation

To prepare the data set for rent estimation, a full synthetic population of housing units was created for the entire model region. In brief, the census SF3 data was used at the block group level to produce targets, which were then matched with the PUMS data using a simulated annealing process. Each dwelling in this synthetic population then had a floorspace area imputed. Following this, the synthetic population was assigned to blocks based on household size and tenure. The census blocks were assigned geographic attributes by buffering shapefile data, and were then matched to the zone boundaries.

2.1. Population targets

The targets for developing the synthetic population were taken from the 2000 census Summary File 3 (SF3). The targets were used to establish, at the census block group level, an appropriate number of housing units in each of several key categories. The targets considered only owned and rented housing units, setting aside vacant units and group homes. The estimation of the models described in section 3 used only the rented housing units, to avoid the difficulties of converting a home value into a rent figure.
The targets were selected to ensure the synthetic population matched the most important characteristics related to housing:

- Household size by tenure (H017): household size would be used in the assignment of synthetic households to individual census blocks, as this was the most detailed information released at the block level.
- Number of rooms by tenure (H026) and number of bedrooms by tenure (H042): these are the two census fields describing the size of the dwelling, and were key in the floorspace imputation step.
- Size of building by tenure (H032): the field describing the structure type (single family detached, attached, mobile home and multifamily), needed for the dwelling type adjustment factor.
- Age of construction by tenure (H036): needed for the age adjustment factor.

2.2. Population microdata

The population for the synthesis was taken from the 5% PUMS data. Each PUMA (the 5% area, not the larger 1% sample-based Super PUMA) was synthesized individually, each with its own list of microdata records, comprising primarily the PUMS records associated with the specific PUMA, duplicated to reflect each record's individual weight. A typical PUMA has around 2000 housing unit records available, which is a reasonable number, but may lack some of the less common configurations. To provide additional variability in the microdata while still using local data, three to five neighbouring PUMAs were selected as a supplemental sample, with the microdata from these PUMAs added to the microdata list for the specific PUMA being synthesized. These supplemental PUMS records were duplicated to reflect their weight, multiplied by a scale factor of 0.2. Thus, 50-65% of the population microdata came from the PUMA being synthesized, and the remainder from adjacent PUMAs.

2.3. Population synthesis

The synthesis was performed using a simulated annealing process, the detailed documentation of which is available separately.
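As a rough orientation, a simulated annealing loop of this general kind can be sketched in a few lines of Python. This is a minimal illustration, not the documented synthesis code: the error measure (sum of absolute differences), the linear cooling schedule, the representation of a record as a tuple of category labels, and all function names are assumptions made for the example.

```python
import math
import random


def count_categories(records):
    """Count how many records carry each category label."""
    counts = {}
    for record in records:
        for label in record:
            counts[label] = counts.get(label, 0) + 1
    return counts


def fit_error(counts, targets):
    """Assumed fit measure: sum of absolute differences from the targets."""
    return sum(abs(counts.get(label, 0) - target) for label, target in targets.items())


def anneal_block_group(microdata, targets, n_units, n_changes, t_start=1.0, seed=0):
    """Fill one block group with records, then try random add/remove/swap changes."""
    rng = random.Random(seed)
    current = [rng.choice(microdata) for _ in range(n_units)]
    current_err = fit_error(count_categories(current), targets)
    for step in range(n_changes):
        # Hypothetical linear cooling: the temperature falls towards zero.
        temperature = t_start * (1.0 - step / n_changes)
        candidate = list(current)
        move = rng.choice(("add", "remove", "swap"))
        if move == "add":
            candidate.append(rng.choice(microdata))
        elif move == "remove" and len(candidate) > 1:
            candidate.pop(rng.randrange(len(candidate)))
        else:
            # Swap (also used when a removal would empty the block group).
            candidate[rng.randrange(len(candidate))] = rng.choice(microdata)
        candidate_err = fit_error(count_categories(candidate), targets)
        delta = candidate_err - current_err
        # Keep improvements (and ties); keep a worse fit only with a probability
        # that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_err = candidate, candidate_err
    return current, current_err
```

For example, calling `anneal_block_group([("own", "small"), ("rent", "small"), ("rent", "large")], {"own": 3, "rent": 7, "small": 6, "large": 4}, n_units=10, n_changes=2000)` returns a list of records and its remaining error against the four targets.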
The core of the process is to populate each block group with an appropriate number of records, and then to iteratively try a change to a random record in the block group: adding it, removing it, or swapping it with another record from the population microdata. If the resulting block group better matches the targets, the change is kept. If the fit is worse, the change may still be kept at early stages in the process, with a probability that declines as the synthesis progresses until only improvements are kept. As mentioned, this process was run once, independently, for each PUMA, with all of the block groups within that PUMA as individual targets. The number of changes to try was 100,000 times the number of block groups in the PUMA. For PUMAs in Oregon proper, or in Washington's Clark County, the number of changes was doubled. The number of changes ranged from a floor of 5 million for three sparsely populated PUMAs in the halo, to over 58 million for Oregon PUMA 701 (Eugene).

2.4. Floorspace and property imputation

The goal of the rent estimations was to develop estimates of rent per square foot, but the census data does not provide floorspace; it reports only the number of rooms. To determine the floorspace, each dwelling was assigned a floorspace based on a series of equations previously developed from a linear regression of American Community Survey data, using the number of rooms, number of bedrooms, building size and unit age. Since these values were all used as targets in the synthesis, the inputs to the floorspace equation are felt to be reasonable. The results were also reasonable, although they lack the diversity that actual floorspace data would have, were it available.

2.5. Block assignment and record adjustment

At this point each synthetic housing unit was located within a block group, but the beta zones cross block group boundaries in some cases. Further, the geoprocessing step below would work at the block level, and provide a sense of the variety in the spatial properties.
For these reasons, it was necessary to assign each record to an individual census block. Complicating this was the paucity of data available at this level; the most useful values reported by the census by tenure were counts by number of persons. While this is obviously not the best metric for housing properties, it has at least some relation to them: larger families, ceteris paribus, occupy larger houses.

A second complication was small differences in the numbers of records. The census SF3 data used for targets does not entirely agree with the SF1 data at the block level, and the synthetic population does not match either of these entirely, because the synthesis is a stochastic process matching multiple sets of targets (of which household size is only one). While it can get close within a reasonable time, the last small adjustments could take days of computing time to match exactly. The totals of the synthetic population at the block group level were therefore used to adjust the SF1 block level targets proportionally to match the total synthetic population.

The synthetic records were then sorted by household size, and assigned to blocks from the smallest size to the largest; this minimized the spatial shift between blocks and also kept the household size as close as possible for each block. For instance, if a synthetic block group had more 2 person and fewer 5 person households than the census targets, the additional 2 person records would be assigned to blocks needing 3 person households, causing some 3 person records to be assigned to blocks needing 4 person households, and then 4 person households making up the difference in the 5 person household category. This is more complicated, but provides more fidelity than the simpler solution of filling in some blocks needing 5 person households with the extra 2 person households.

While this processing was being done, two more individual record adjustments were also performed.
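The size-ordered assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the block targets are taken to be already scaled so that their total equals the number of synthetic records, and the function name and data layout are hypothetical rather than taken from the original processing scripts.

```python
def assign_to_blocks(record_sizes, block_targets):
    """Pair records with block slots in order of household size.

    record_sizes: household size of each synthetic record in the block group.
    block_targets: {block_id: {household_size: number_of_slots}}, assumed to be
        already scaled so the slots sum to len(record_sizes).
    Returns a list of (record_index, block_id) pairs.
    """
    # One slot per targeted household, ordered from smallest size to largest.
    slots = sorted(
        (size, block_id)
        for block_id, by_size in block_targets.items()
        for size, n_slots in by_size.items()
        for _ in range(n_slots)
    )
    if len(slots) != len(record_sizes):
        raise ValueError("block targets must sum to the number of records")
    # Records in the same order, so the k-th smallest record fills the k-th
    # smallest slot and any surplus spills only into the next size up.
    order = sorted(range(len(record_sizes)), key=lambda i: record_sizes[i])
    return [(record, block_id) for record, (_, block_id) in zip(order, slots)]


# One extra 2 person record and no 5 person record, as in the example in the text.
targets = {"A": {2: 1, 3: 1}, "B": {4: 1, 5: 1}}
assignment = assign_to_blocks([2, 2, 3, 4], targets)
```

In this toy case the second 2 person record fills the 3 person slot in block A, the 3 person record moves to the 4 person slot in block B, and the 4 person record fills the 5 person slot, so no record is shifted by more than one size category.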
The age, which is recorded by the census in ranges, was converted to a specific value by randomly drawing an integer within the reported range, with housing units in the 61+ year category drawn from a range of 61-80 years.

Finally, the rent was adjusted. The SF3 data file provides a gross rent value for the entire block group. This was first scaled to match the synthetic population size to the SF3 size (a very small adjustment, of a few percent or less in most cases). The rents reported by the PUMS households in the synthetic data were then scaled so that the sum of these synthetic rents matched the adjusted block group gross rent value. This kept the variety in rents from the observations in the PUMS data, while maintaining the important total rents from the SF3.

The final data set produced from this work included all of the rented housing units, each with a discrete age, a floorspace area, a rent and a specific census block assigned to it.

2.6. Block-level geoprocessing

The final step was to determine the geographic properties that were needed for local rent effects. The three that were selected were the distance to major roads, the distance to water, and the local density. These were all calculated for each individual census block in the model region. The first two are fairly self-explanatory; shapefiles provided by ODOT were used to define major roads and water, and the buffering process was capped at 10 miles, to keep as much fidelity as possible in the data. These values would later be capped at a lower value in estimation, but the large distance permitted trials with different cap distances.

The density was defined as the number of housing units within ¼ mile of the block. Each block was assumed to have a constant density of housing units, and the portion of each block within the buffer was used to determine this value. This density value can be multiplied by 5.09296 to get the density of housing units per square mile; the factor is 1/(π × 0.25²), the reciprocal of the area in square miles of a circle with a ¼ mile radius.

3. Model estimation

To estimate the model, a linear ordinary least squares procedure was used, implemented in Python. The linear model included three structure-specific parameters, with the Constant (eq. 1) functional form. These reflect the structure types in the TLUMIP work: mobile home, single family attached and multifamily, with single family detached representing the default alternative. The TLUMIP separation of SFD and mobile home into "urban" and "rural" (small and large lots) was not feasible to produce from the data; the census SF3 did not provide targets by lot size. Further, the zone system mostly divides the study area into urban and rural areas using zone boundaries; trial estimations using an "urban" versus "rural" housing type produced unreasonable results reflecting the ecological correlation in the data: rural SFD were deemed to have a price of roughly 20% of that of urban housing per square foot, but the base price in rural areas like Eastern Oregon was 4 or 5 times that of the urban areas.

The second set of parameters reflected the age and neighbourhood density of the structure. These both used the Reversed Power (eq. 6) functional form. The age was the age of the housing unit, in years. The density was the number of housing units within a ¼ mile radius of the block. Both functions used a RefDValue_g of 1; in other words, the reference house was newly built, and was the only housing unit within ¼ mile.

The third set of parameters reflected the proximity to roads and water. These both used the Shifted Exponential (eq. 3) functional form, with ½ mile used both as the RefDValue_g value and as the maximum distance. It should be noted that the roads shapefile used for the buffering was a major roads file, not just a limited access facility (freeway) shapefile.

Finally, each zone was assigned a parameter using the Constant (eq. 1) functional form.
These represent the base price for housing in the zone (specifically, the rent for a new single family detached house with no neighbours within ¼ mile, and ½ mile or more from any water bodies or major roads).

As mentioned, the estimation used an ordinary least squares estimation script written in Python. The model described above specifies 7 local rent modifier parameters, but also includes 517 zone specific dummy parameters, for a total of 524 parameters. The synthetic dataset contained 724,084 records, producing a matrix with just under 380 million individual cells (724,084 × 524). This was far more than the memory addressable with the computer and software configuration used for the work. As a consequence, the synthetic dataset was sampled, and the regression run on this sample. The largest sample that could be handled in memory was 75,000 records. To produce the best possible estimate, the estimation was repeated 40 times with different samples of the population, which also makes some discussion of the variance in estimates possible.

One zone, 4031 (a rural portion of Skamania County, Washington, in the halo), had no observations in the data set. This zone is smaller than a single census block group, and is mostly land in Gifford Pinchot National Forest. Base rent estimates from adjoining zones 4030, 4035, 4036, 4040, 4041, 4044 and 4058 were averaged to provide a base rent estimate.

Furthermore, a set of 30 zones had fewer than 80 observations in the entire synthetic population. For these zones, all of the synthetic records were appended onto the data sample and used in every estimation run (preventing the missing values and matrix singularity problems that could occur if the sampling did not select any records from the zone). As a result, the base rent estimates for these zones have very low variance, but the lack of sufficient underlying data may reduce the reliability of the estimates.
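The repeated-sample estimation described above can be sketched as follows. This is a minimal illustration using NumPy, not the original estimation script: the function name, the argument names, the use of `numpy.linalg.lstsq`, and the choice to draw each sample only from zones above the threshold before appending the small-zone records are all assumptions, and the construction of the design matrix `X` and dependent variable `y` from the functional forms in the Technical Note is not shown.

```python
import numpy as np


def repeated_sample_ols(X, y, zone_ids, sample_size, n_runs,
                        small_zone_threshold=80, seed=0):
    """Fit OLS on repeated random samples; return one coefficient row per run.

    X: (n_records, n_params) design matrix, including the zone dummy columns.
    y: (n_records,) dependent variable.
    zone_ids: (n_records,) zone label of each record.
    """
    rng = np.random.default_rng(seed)
    zones, counts = np.unique(zone_ids, return_counts=True)
    small_zones = zones[counts < small_zone_threshold]
    in_small_zone = np.isin(zone_ids, small_zones)
    always_rows = np.flatnonzero(in_small_zone)   # included in every run
    pool_rows = np.flatnonzero(~in_small_zone)    # sampled afresh in each run

    coefficients = []
    for _ in range(n_runs):
        drawn = rng.choice(pool_rows, size=min(sample_size, pool_rows.size),
                           replace=False)
        rows = np.concatenate([drawn, always_rows])
        beta, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        coefficients.append(beta)
    return np.vstack(coefficients)
```

With the figures quoted in this section, the call would use `sample_size=75000` and `n_runs=40`; averaging the returned array over its first axis gives one estimate per parameter.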
Figure 1 shows the model area categorized by the number of available synthetic observations; a sporadic set of zones (16 in total) have 50 or fewer observations, while a larger number, comprising much of eastern Oregon, have 200 observations or fewer. The Willamette Valley area, including the major cities in the model region, has plenty of observations in each zone.

Figure 1: Number of observations in synthetic data set per zone

4. Estimation results

The estimation results are discussed in three sections: the first addresses the local rent modifier parameters, the second the zonal base rent estimates, and the third the variability exhibited in these estimations. The first two use the averages of the 40 estimation runs, while the third deals explicitly with the individual estimations.

4.1. Local rent modifiers

There were seven rent modifier parameters estimated: three for housing types, and one each for the age, density, proximity to water and proximity to major roads. All parameter estimates were statistically significant. The estimates are shown in Table 1 below:

Table 1: Rent modifier parameters

| Parameter | Form | RefDValue | Estimated value for a_g | T-ratio for estimated a_g value | Corresponding value for θ_g |
| --- | --- | --- | --- | --- | --- |
| Attached single family dummy | Constant | n/a | -0.0973 | 8.50 | 0.907 |
| Multifamily dummy | Constant | n/a | 0.3255 | 54.53 | 1.384 |
| Mobile home dummy | Constant | n/a | 0.0927 | 8.67 | 1.097 |
| Age | Reversed Power | 1 | -0.00244 | 18.93 | 0.00244 |
| Density | Reversed Power | 1 | 0.000103 | 10.47 | -0.000103 |
| Dist. to water (feet) | Shifted Exponential | 2640 | 0.0395 | 5.10 | 0.0395 |
| Dist. to road (feet) | Shifted Exponential | 2640 | 0.0872 | 5.47 | 0.0872 |

The housing type parameters estimated were, as shown above, 0.907 for attached, 1.384 for multifamily, and 1.097 for mobile home, with single family detached having the reference value of 1.
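The three housing type multipliers agree, to within rounding, with the exponential of the corresponding estimated a_g values. This relationship is inferred here from the reported numbers only; the authoritative transformation is the one given in the Technical Note. A short check, with hypothetical variable names:

```python
import math

# Estimated a_g values and reported theta_g values for the three dummies (Table 1).
estimated_a = {"attached": -0.0973, "multifamily": 0.3255, "mobile home": 0.0927}
reported_theta = {"attached": 0.907, "multifamily": 1.384, "mobile home": 1.097}

# Assumed transformation for the Constant form: theta_g = exp(a_g).
implied_theta = {name: math.exp(a) for name, a in estimated_a.items()}

for name in estimated_a:
    gap = implied_theta[name] - reported_theta[name]
    print(f"{name}: implied {implied_theta[name]:.4f}, "
          f"reported {reported_theta[name]:.3f}, difference {gap:+.4f}")
```

Each difference is smaller than 0.001, which is consistent with the published coefficients having been rounded to four decimal places.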
The higher values for mobile home and multifamily (in particular for multifamily) can be attributed to the smaller size per unit for these dwelling types. While the rent for a 600 square foot apartment is less than the rent for an 1800 square foot house, it will not necessarily be one third as much (and thus the same on a per square foot basis); in this example, with the constant of 1.384, the apartment would rent for 46% of the price of the house. The same is true for mobile homes, although this is likely tempered by both the larger size and lower perceived value of mobile homes. Attached dwellings are much closer in size to detached dwellings, so the constant below 1 reflects more closely the reduced value of living in an attached structure.

The constant for age is negative, and is broadly consistent with estimations from other locations; for instance, -0.0014 for medium density residential space in Baltimore. This reflects the expected reduction in the value of buildings with age. Figure 2 below shows a chart of this factor out to 80 year old dwellings (the oldest in the data set).

Figure 2: Rent adjustment factor for age of building (horizontal axis: structure age in years, 0 to 80; vertical axis: rent adjustment factor)

The constant for density is positive, which may reflect multiple trends. In larger cities, density is commonly thought to be unpleasant, but there may be an ecological correlation effect with other, unobserved, attributes. Further, in smaller towns, the large zones may include entire towns. In a large city, accessibility can be thought of as the distance to the shopping centre, the hospital, the downtown and so on; this varies from zone to zone. However, in smaller towns, accessibility is as much the distance to Main Street as the distance to other towns and cities. This Main Street distance is reflected in higher densities at the town level.
Figure 3 below shows a chart of the effect of density up to 750 HU within ¼ mile; the average for urbanized areas ranges from around 150 to 250, although the densest areas (central Portland) have densities of up to 2000 HU within ¼ mile.

Figure 3: Rent adjustment factor for density (horizontal axis: density, housing units within 0.25 mi; vertical axis: rent adjustment factor)

The effect of both roads and water is positive, although with a lower (but still significant) T statistic than the other local rent modifiers. The effect of roads has been seen to be both positive and negative in other areas; in Atlanta, distance to local road was seen as negative for low-density and positive for high-density housing. The parameter for high-density housing in Atlanta was 0.056, somewhat smaller than the Oregon value of 0.0872. One reason for the higher values in Oregon may be a combination of higher density in urban areas (in part due to growth boundaries) and a larger amount of very rural area, where distance to a major road represents access to towns and the goods and services available there.

Water has a smaller parameter than proximity to major road in these estimations (0.0395 versus 0.0872). It also has a much smaller value than the only comparable data available: San Diego, where beach proximity was seen to be strongly positive, with a parameter of 0.195 (versus 0.0395 in Oregon). This is not unexpected; the warmer climate of San Diego makes water much more attractive as an amenity. Furthermore, the rocky coasts of Oregon in particular are difficult to build upon, as compared with the relatively flat slopes in San Diego. Both the road and water parameters are shown in Figure 4.
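Because each modifier acts as a multiplier on the zonal base rent, the combined adjustment for a dwelling is the product of its individual factors. The sketch below computes that product for two extreme dwellings. It rests on loudly stated assumptions that are not taken from the Technical Note: each factor is treated as the exponential of its Table 1 coefficient (times age in years or housing units within ¼ mile for the age and density terms), and distance is modelled only at its two end states, immediately adjacent to the feature or at least ½ mile away. The function and argument names are hypothetical.

```python
import math

# Estimated coefficients from Table 1.
COEF = {
    "attached": -0.0973, "multifamily": 0.3255, "mobile_home": 0.0927,
    "age": -0.00244, "density": 0.000103, "water": 0.0395, "road": 0.0872,
}


def combined_factor(dwelling_type, age_years, density_hu,
                    adjacent_water=False, adjacent_road=False):
    """Product of the modifiers, accumulated in log space (assumed forms)."""
    log_factor = COEF.get(dwelling_type, 0.0)   # single family detached -> 0
    log_factor += COEF["age"] * age_years
    log_factor += COEF["density"] * density_hu
    if adjacent_water:                           # full water effect at distance 0
        log_factor += COEF["water"]
    if adjacent_road:                            # full road effect at distance 0
        log_factor += COEF["road"]
    return math.exp(log_factor)


# New multifamily unit at 2000 HU within 1/4 mile, beside both water and a road.
high = combined_factor("multifamily", 0, 2000, adjacent_water=True, adjacent_road=True)
# 80 year old attached dwelling far from water, roads and other houses.
low = combined_factor("attached", 80, 0)
print(f"high: {high:.3f}x base, low: {low:.3f}x base")
```

Under these assumptions the two products come out at roughly 1.93 and 0.746 times the base rent. The intermediate shape of the distance curves is not reproduced here.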
Figure 4: Rent adjustment factors for distance to road and water (series: major road, water; horizontal axis: distance from feature in feet, 0 to 3000; vertical axis: rent adjustment factor)

These parameters are multiplicative in nature, which means that the largest possible range of rents within a single zone runs from 1.93x base, for a new multifamily unit in the densest spot in the model that is also immediately adjacent to both water and a major road, down to 0.746x base, for an 80 year old attached dwelling located a long way from water, roads and other houses. This is a reasonable range, particularly since there are compensating effects: being close to a major road or body of water means that there are fewer sites for potential dwelling units within ¼ mile, and thus a lower density.

4.2. Zonal base rent estimates

Since there are 517 base rent estimates, a detailed table will be provided in .XLS format. All rents are in the same money units used by the census, dollars per month, calculated on a per square foot basis. The a_g values estimated need to be transformed to produce θ_g, the actual dollar values of base rent; this has been done for all rent values in this report to facilitate the discussion. The averages across zones for the various model regions are shown in Table 2 below:

Table 2: Average base rent by region

| Model region | Average base rent per square foot |
| --- | --- |
| Portland | 52.25¢ |
| Vancouver | 48.23¢ |
| Salem | 41.86¢ |
| Eugene | 46.11¢ |
| Medford | 45.56¢ |
| Other Oregon | 32.72¢ |
| Halo Washington | 34.19¢ |
| Halo Idaho | 29.48¢ |
| Halo Nevada | 27.29¢ |
| Halo California | 35.10¢ |

These base prices are shown in the following two maps. Figure 5 shows the rents in the entire model region, with Figure 6 providing a more detailed look at base rent estimates in northwest Oregon.
Figure 5: Base rent ($/sq ft) for model region

Figure 6: Base rent ($/sq ft) for Willamette Valley

These prices are consistent with expectations; the rural areas of eastern Oregon and the halo have the lowest rent values, and the major urban areas have the highest, most notably the Portland region, and especially the southwestern suburbs. Smaller centres, such as Medford, Boise and Kennewick, have higher rents than the surrounding countryside. The most unexpected result is the high base rents for the rural areas to the west of Bend. The two highest rent zones in the area also have a very low number of observations, which is likely playing a role in these values. It may be necessary to revisit the rent prices in these areas with a more manual intervention.

These rent estimates are also statistically significant; the only value with an insignificant T-statistic is for zone 2879, with a T-statistic of 0.56. This zone, in Deschutes County, also has the highest estimate for rent, 88.51¢ per square foot, which is likely an artefact of the availability of only 9 data points in the zone. There are another 9 zones with T-ratios between 3 and 5, and the remainder are much more significant.

4.3. Variability

Because of the difficulties in storing in memory the entire matrix needed to calculate the parameters, the estimation was done using samples of 75,000 records at a time, a bit over 10% of the data set. The 40 runs that were used to estimate the parameters described in this memo were analyzed with respect to the variance seen in this estimation. Figure 7 shows a box plot of the model estimates of all 517 zone specific constants in the estimation results; the dark red band is the 25th to 75th percentile for each zone, while the light grey lines show the extent of the minimum and maximum price estimated from the 40 runs. The zones have been sorted by average rent for clarity.
Figure 7: Deviation of model estimates for zone prices (vertical axis: estimated rent per square foot, $0.00 to $0.90; zones sorted by average rent)

The first thing evident from the chart is the roughly normal distribution of prices across zones. In terms of variance, it is evident that most estimates are fairly consistent across runs, and that the true value of base rent per zone is likely within the fairly tight set of estimated parameters from across multiple runs.

The source of the variance in these parameter estimates is assumed to be primarily the limited sample size for many zones. Figure 8 below shows the variance, measured as the standard deviation of the 40 estimation runs, as a function of the number of observations.

Figure 8: Number of observations vs. base rent estimate variance (horizontal axis: standard deviation of base rent estimates, $0.00 to $0.10; vertical axis: number of data points available, log scale from 10 to 100,000)

The handful of points in the lower left hand side of the chart, with very low standard deviation despite small sample size, are the 30 zones that had all of their data included in the estimation every time. The remaining points show a clear trend, with the standard deviation of the base rent estimate decreasing as the number of observations increases (note the log scale on the y axis). Above 1000 observations, the standard deviation is typically below 3 cents, which is a reasonable margin considering the broad nature of the question being addressed.
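The per-zone summaries behind Figures 7 and 8 can be computed from a table of run-by-run estimates along the following lines. This is a minimal sketch with hypothetical names and made-up illustrative numbers; the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which form was used.

```python
import numpy as np


def summarize_runs(estimates):
    """Summarize base rent estimates stored as an (n_runs, n_zones) array."""
    estimates = np.asarray(estimates, dtype=float)
    return {
        "mean": estimates.mean(axis=0),
        "std": estimates.std(axis=0, ddof=1),        # assumed: sample std dev
        "q25": np.percentile(estimates, 25, axis=0),  # lower edge of the box
        "q75": np.percentile(estimates, 75, axis=0),  # upper edge of the box
        "min": estimates.min(axis=0),
        "max": estimates.max(axis=0),
    }


# Illustrative values only: three runs for two zones, in dollars per square foot.
summary = summarize_runs([[0.50, 0.30],
                          [0.52, 0.30],
                          [0.54, 0.30]])
```

Plotting `summary["std"]` against the number of observations in each zone, with a log scale on the observation axis, gives a chart in the style of Figure 8.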