Encuentro de Economía Aplicada
Using Data-Driven Prediction Methods in a Hedonic
Regression Problem
MARCOS ÁLVAREZ-DÍAZ1, MANUEL GONZÁLEZ GÓMEZ
Department of Applied Economics, University of Vigo
Lagoas- Marcosende Vigo, Spain
and
ALBERTO ÁLVAREZ
ISME-DSEA Department of Electrical Engineering, University of Pisa
Via Diotisalve 2, 56100 Pisa, Italy
(PRELIMINARY VERSION)
Abstract
Traditional studies of hedonic prices apply simple functional forms such as linear or linearly transformable structures. It is now well recognized in the literature that introducing non-linearity improves the explanatory capacity of these models. In this work we apply data-driven methods to the hedonic regression problem. These methods impose no a priori assumption about the functional form. We use the nearest-neighbours technique as a non-parametric method, and neural networks and genetic algorithms as semi-parametric methods. Neural networks have already been applied to the specific hedonic regression problem but, to the authors' knowledge, this is the first time that a genetic algorithm has been used. The empirical results we obtain demonstrate the usefulness of applying both non-parametric and semi-parametric data-driven models to the estimation of hedonic price functions: they improve on the traditional parametric models in terms of out-of-sample R2.
1 Corresponding author: Departamento de Economía Aplicada, Universidad de Vigo, Lagoas-Marcosende s/n, 36200 Vigo, Spain. Fax: 986812401; e-mail: mad@uvigo.es.
I-. Introduction
The theory of hedonic prices tries to determine how the individual characteristics of a commodity affect its price. The hedonic perspective has been applied to many goods such as automobiles, personal computers, televisions, irrigated land and, above all, housing. A linear relation between a commodity's price and its characteristics is the most widely used approach in this sort of study, the principal reason being that linear models are easy to estimate and interpret. Nevertheless, the literature also recognizes the importance of non-linearities in the hedonic price function as a way of increasing its explanatory capacity (Rasmussen and Zuehlke (1990)). Many works address this issue by employing techniques that, by means of transformations, allow flexible parametric functional forms (for example, Box-Cox transformations). However, these flexible forms impose a structure that introduces forced and unnecessary non-linearity. The result is an "over-parameterized" model with a loss of out-of-sample performance. To avoid this problem, we can use data-driven methods that obtain a model without imposing any a priori assumption about the functional form (nearest neighbours, neural networks and genetic algorithms).
The objective of this paper is to determine whether both non-parametric and semi-parametric data-driven models can improve forecast accuracy with respect to the parametric models. Our empirical study focuses on the real-estate market in the city of Vigo (Spain). To carry out our research, we collected information about rental prices and house characteristics such as structural attributes and neighbourhood conditions. The work is structured as follows. In Section 2 we briefly present the different prediction methods employed (linear regression as the parametric method, nearest neighbours as the non-parametric method, and neural networks and genetic algorithms as semi-parametric methods). In Section 3 we describe the data and show our empirical results on out-of-sample predictions. Finally, we finish with a section dedicated to conclusions.
II-. Methodology
a) Parametric Techniques
The simplest approach to the hedonic regression problem is to postulate that the functional form is simply linear. To achieve more flexibility, it is very common to apply some non-linear transformation to the data. Most studies of hedonic regression employ simple functional relations such as the linear, semi-logarithmic, double-log or quadratic semi-log forms, among others. The justification for such forms rests on the use of traditional estimation methods such as ordinary least squares, on their success in previous studies, on empirical tests, and on the ease with which statistical inference and hypothesis testing can be carried out.
The procedure employed in this work is to select the linear functional form that achieves the highest accuracy in terms of R2 while keeping all of the model's variables statistically significant. Variable selection is carried out with the backward method. That is, we specify a functional form characterized by being linear in the parameters (linear, semi-log, double-log or quadratic semi-log) and consider all the available independent variables. We estimate the model by ordinary least squares and, at each step, delete the least significant variable. We then re-estimate the model without the deleted variable and continue this process until all surviving variables are statistically significant. We employ a significance cut-off level of 5%.
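As an illustration, the backward-elimination loop just described can be sketched as follows (a minimal NumPy version with simulated data and hypothetical variable names, not the paper's actual dataset or code):

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """OLS with an intercept; returns two-sided p-values for the slope
    coefficients (normal approximation to the t-distribution)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (n - Xc.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xc.T @ Xc)))
    t = beta / se
    return np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])[1:]

def backward_elimination(X, names, y, alpha=0.05):
    """Drop the least significant regressor until every survivor is
    significant at level alpha (the paper's 5% cut-off)."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] < alpha:               # every survivor is significant
            return [names[i] for i in keep]
        del keep[worst]                    # delete the least significant one
    return []

# Simulated example: two informative regressors and one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
kept = backward_elimination(X, ["M2", "Ind", "noise"], y)
print(kept)   # "M2" and "Ind" survive; "noise" is usually eliminated
```

The same loop applies unchanged to the semi-log, double-log and quadratic semi-log forms after transforming the variables.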
b) Nonparametric Approach
In this subsection we briefly explain a generalization of the nearest-neighbour method called local linear regression. The method is based on the idea that observations with similar characteristics should have similar outcomes. Suppose that we have a sample of observations for which we know the inputs and their respective outputs, and that we want to predict the unknown output value (y_i*) for a new input vector (x_i*). We calculate the Euclidean distance between x_i* and the other sample input vectors, select the K closest points and their respective outputs, and perform a linear regression on them. Once the parameters are estimated, we can infer the predicted value y_i*.
Of great interest is the choice of the number of nearest neighbours (K). Consistency of local linear regression demands that the number of nearest neighbours goes to infinity as the sample size increases, but at a slower rate. In the literature we can find rules such as K = T^α, where T is the sample size and 0 < α < 1 (it is usual to assume α = 0.5). We prefer to adopt an empirical perspective and try many values of K, selecting the value that achieves the highest out-of-sample performance.
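A minimal sketch of the local linear regression predictor described above (illustrative NumPy code with simulated data; the paper's implementation is not reproduced here):

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_star, k=30):
    """Predict y at x_star by fitting OLS on the k nearest neighbours
    (Euclidean distance), i.e. local linear regression."""
    dist = np.linalg.norm(X_train - x_star, axis=1)
    nn = np.argsort(dist)[:k]                  # indices of the k closest points
    Xk = np.column_stack([np.ones(k), X_train[nn]])
    beta, *_ = np.linalg.lstsq(Xk, y_train[nn], rcond=None)
    return float(np.concatenate([[1.0], x_star]) @ beta)

# Simulated smooth non-linear surface (illustrative, not the housing data)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
pred = local_linear_predict(X, y, np.array([0.2, 0.3]), k=30)
print(pred)   # close to sin(0.6) + 0.09 ≈ 0.65
```

Setting k equal to the full sample size recovers the global (parametric) linear regression, which is why comparing accuracy across k values reveals non-linearity.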
c) Semi-parametric Approach
c.1) Neural Networks
Neural networks (NN) are a class of semi-parametric models inspired by studies of how the brain and nervous system work. They have been employed to solve a wide range of economic problems such as financial time-series forecasting and bankruptcy prediction, among others. Some works have already applied NNs successfully to the specific case of hedonic regression (Curry, Morgan and Silver (2001)). A good introduction to NNs can be found in Smith (1995), and economic applications in Gately (1996) and Deboeck (1994). NNs are composed of interconnected elements, called neurons, linked by weights and grouped in layers. The first layer is called the input layer and the last the output layer; the middle layers are called hidden layers. Each neuron in the input layer brings the value of one independent variable into the network and propagates it towards the neurons of the next layer. In turn, each neuron of the next layer forms a weighted linear combination of the received input signals, processes this weighted information through a transfer function and sends an output signal. The signals from all neurons are propagated across the NN in the same way up to the final layer, where the NN's output is produced. The difference between the NN's output and the known value of the dependent variable is then calculated, and the NN tries to minimize this error by modifying the weights of the links. This process continues iteratively in search of the optimal weight values, and it finishes when a given error level is achieved or, failing that, after a given number of iterations.
The construction of a good NN for a particular application is not a trivial task. To avoid a lack of generalization, we must choose an appropriate architecture (for example, the number of hidden layers, the number of units in each layer, the connections between units and the transfer functions). A common practice is to select the architecture by a process of trial and error, searching for the highest performance. In this work we use the simplest and most widely employed NN in economics: a feed-forward network trained by back-propagation. In its statistical expression, this NN can be written as
Y_i = Φ( β0 + Σ_{h=1}^{q} βh · Ψ( γh0 + Σ_{j=1}^{n} γhj · X_ij ) ) + ε_i

where Y_i is the dependent variable, X_i the input vector, the parameters β and γ are the weights to be adjusted, n is the number of inputs, q the number of hidden units, Φ and Ψ are the transfer functions, and ε_i is a disturbance term. It is known and accepted that a three-layer feed-forward NN with a linear transfer function in the output unit (Φ(a) = a) and a logistic transfer function in the hidden-layer neurons (Ψ(a) = 1/(1 + e^(−a))) is able to approximate any non-linear function to an arbitrary degree of accuracy (Qi (1999)). We employ this architecture, and the number of neurons in the hidden layer is determined by trial and error, searching for the highest value of the out-of-sample R2.
One important question is how to select the NN inputs; in other words, we have to determine which independent variables the NN will employ as inputs. Medeiros and Teräsvirta (2001) suggest carrying out the variable selection by linearizing the model and applying some linear variable-selection method. In our case, we employ the backward method to select the relevant variables.
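The architecture and training loop described above can be sketched as follows (a minimal NumPy implementation of a single-hidden-layer feed-forward network with logistic hidden units and a linear output; the data and hyper-parameters are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_ffnn(X, y, q=3, lr=0.2, epochs=20000, seed=0):
    """Single-hidden-layer feed-forward NN,
    Y = beta_0 + sum_h beta_h * sigmoid(gamma_h0 + gamma_h . x),
    trained by back-propagation (batch gradient descent on squared error)."""
    rng = np.random.default_rng(seed)
    G = rng.normal(scale=0.5, size=(q, X.shape[1] + 1))  # hidden weights (+bias)
    b = rng.normal(scale=0.5, size=q + 1)                # output weights (+bias)
    Xb = np.column_stack([np.ones(len(X)), X])
    for _ in range(epochs):
        H = sigmoid(Xb @ G.T)                 # hidden-layer activations
        err = (b[0] + H @ b[1:]) - y          # linear output minus target
        gb = np.concatenate([[err.mean()], err @ H / len(y)])
        gG = ((err[:, None] * H * (1 - H) * b[1:]).T @ Xb) / len(y)
        b -= lr * gb                          # update output weights
        G -= lr * gG                          # update hidden weights
    def predict(Xnew):
        Hn = sigmoid(np.column_stack([np.ones(len(Xnew)), Xnew]) @ G.T)
        return b[0] + Hn @ b[1:]
    return predict

# Illustrative 1-D non-linear target (not the paper's housing data)
X = np.linspace(-2, 2, 200)[:, None]
y = np.tanh(2 * X[:, 0])
predict = train_ffnn(X, y)
r2 = 1 - np.sum((y - predict(X)) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))   # in-sample fit of the trained network
```

In practice one would repeat this for several values of q and keep the one with the highest out-of-sample R2, as the paper does by trial and error.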
c.2) Genetic Algorithms
Genetic algorithms (GA) are a functional search procedure based on the Darwinian theories of natural selection and survival. The procedure was developed by Holland (1975) and popularized by Goldberg (1989) and Koza (1992). In general, applications to economic problems are still very scarce and, to date, the technique has not been used for the hedonic regression problem (at least, to the authors' knowledge).
GAs present advantages with respect to the neural-network and nearest-neighbour methods. First of all, the procedure yields an explicit mathematical equation that we can analyse. Moreover, unlike neural networks, GAs are more flexible because they do not require the prior specification of a complex architecture.
We use a specific GA called DARWIN (Álvarez et al. (2001)). DARWIN carries out an optimization process that finds an optimal functional form by evolving an initial population of alternative equations. The algorithm simulates on a computer the process of selection and survival observed in nature. Briefly, DARWIN works in the following way. First, a set of candidate equations (the initial population) for representing the relation between variables is randomly generated. These equations are initially of the form S = A ∘ B ∘ C ∘ D, where the arguments A, B, C and D are the explanatory variables or real-number constants (the coefficients of the equations), and the symbol ∘ stands for one of the four basic arithmetic operators +, −, × and ÷. Other mathematical operators are conceivable, but increasing the number of available operators complicates the functional optimization process. Each equation of the initial population is evaluated and classified according to its R2. The equations with the highest values of R2 are selected to exchange parts of their character strings (reproduction and crossover), while the individuals less fitted to the data are discarded. As a result of this crossover, offspring more complicated than their parents are generated. The total number of characters in an equation is bounded above to avoid the generation of offspring of excessive length. Finally, a small percentage of the equations' most basic elements, single operators and variables, are mutated at random. The process is repeated a large number of times to improve the fitness of the evolving population. At the end of the evolutionary process, DARWIN offers as its result the equation that it considers optimal for representing the true functional relation between the variables.
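To make the evolutionary loop concrete, here is a toy symbolic-regression GA in the spirit of the description above. It is not DARWIN itself: the tree representation, operator set, selection scheme and length bound are simplified assumptions made for illustration.

```python
import operator
import random
import numpy as np

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Division is left out to keep this toy numerically safe; DARWIN itself
# uses all four basic arithmetic operators.

def rand_expr(depth=2):
    """Random equation tree; leaves are a variable index or a constant."""
    if depth == 0 or random.random() < 0.3:
        return random.choice([("x", random.randrange(2)),
                              ("c", random.uniform(-2.0, 2.0))])
    return (random.choice(list(OPS)), rand_expr(depth - 1), rand_expr(depth - 1))

def evaluate(node, X):
    if node[0] == "x":
        return X[:, node[1]]
    if node[0] == "c":
        return np.full(len(X), node[1])
    return OPS[node[0]](evaluate(node[1], X), evaluate(node[2], X))

def fitness(node, X, y):
    """R^2 of the candidate equation: the selection criterion."""
    resid = y - evaluate(node, X)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def size(node):
    """Number of nodes; equation length is bounded, as in DARWIN."""
    return 1 if node[0] not in OPS else 1 + size(node[1]) + size(node[2])

def mutate(node):
    if random.random() < 0.1:                 # replace a random subtree
        return rand_expr(depth=1)
    if node[0] in OPS:
        return (node[0], mutate(node[1]), mutate(node[2]))
    return node

def crossover(a, b):
    """Graft equation b into a random branch of a (a crude crossover)."""
    if a[0] in OPS and random.random() < 0.7:
        kids = [a[1], a[2]]
        kids[random.randrange(2)] = b
        return (a[0], kids[0], kids[1])
    return b

def darwin_like(X, y, pop=200, gens=40, seed=0):
    random.seed(seed)
    population = [rand_expr() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda e: fitness(e, X, y), reverse=True)
        elite = population[: pop // 4]         # the fittest quarter survives
        children = []
        while len(children) < pop - len(elite):
            child = mutate(crossover(random.choice(elite), random.choice(elite)))
            if size(child) <= 15:              # discard over-long offspring
                children.append(child)
        population = elite + children
    return max(population, key=lambda e: fitness(e, X, y))

# Simulated target with a known non-linear term (not the paper's data)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 1]
best = darwin_like(X, y)
print(fitness(best, X, y))   # the evolved equation's R^2
```

The returned `best` tree can be printed and read as an explicit equation, which is the key advantage of the GA over nearest neighbours and neural networks noted above.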
III-. Data and Results
The sample used consists of 110 observations obtained through interviews that we carried out with estate agencies in the city of Vigo (Spain) from March to May 1998. For each house, we collected information about its rental price and characteristics such as structural attributes and neighbourhood conditions. The housing characteristics are presented in Table 1. Our focus is to show that both non-parametric and semi-parametric models can improve forecast accuracy with respect to the parametric models. The same data are used for all techniques. The general characteristics of the estimation and forecasting exercise are as follows. The models were estimated on the first 85 observations (training set). The remaining observations were used to test the models and to obtain the out-of-sample predictions (validation set). As mentioned before, the variable-selection technique employed was the backward method. Finally, the measure used to compare the forecasting accuracy of the models is the out-of-sample R-squared.
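The evaluation design can be sketched as follows (simulated data; the out-of-sample R2 formula shown is the usual convention, since the paper does not state it explicitly):

```python
import numpy as np

def out_of_sample_r2(y_true, y_pred):
    """R^2 on the validation set: 1 - SSE/SST, with SST taken around
    the validation-set mean (the usual convention)."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - sse / sst

# Split mimicking the paper's design: observations 1-85 for estimation,
# 86-110 for validation (data simulated here, not the Vigo sample).
rng = np.random.default_rng(2)
X = rng.normal(size=(110, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=110)
Xtr, Xva, ytr, yva = X[:85], X[85:], y[:85], y[85:]
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(85), Xtr]), ytr, rcond=None)
pred = np.column_stack([np.ones(25), Xva]) @ beta
print(round(out_of_sample_r2(yva, pred), 3))
```

The same score is computed for every model considered (parametric, nearest neighbours, NN and GA), which makes the comparison in Table 2 direct.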
Table 1. Description of the Independent Variables

Rp — Rental price in pesetas. For the NN case this variable was normalized and, for the GA, it was divided by 1000.
M2 — Square metres.
Cond — Dichotomous variable that takes the value 1 if the house is catalogued by the agent as excellent to occupy.
Ind — Index built as the sum of 5 structural characteristics: existence of a lumber-room, grocery-store, central heating, elevator, and whether the kitchen is furnished.
Actecon — Variable that captures the economic activity of the street where the house is located. It is calculated as the ratio between the number of businesses in the street and the number of houses.
Npg — Number of garage places.
Ncb — Number of bathrooms.
As Table 2 shows, the results are classified into three divisions. The first presents the parametric models: linear, semi-log, double-log and quadratic semi-log. This last form takes into account the squared and cross-product effects between variables and was already employed successfully by Rasmussen and Zuehlke (1990). The best parametric result, an out-of-sample R2 of 0.7689, is obtained by the quadratic semi-log model. However, the sign shown by the variable IndCond is not consistent with the a priori expectation. For the non-parametric case, Graphic 1 shows the sensitivity of the R2 to the number of nearest neighbours. The highest accuracy is achieved for K = 30, with R2 = 0.8575. We can also use the graphic to verify the existence of important non-linearity in the hedonic regression: the accuracy worsens as more neighbours are considered. Therefore, the local regression achieves higher accuracy than when all the points are considered (this case corresponds to the parametric linear regression).
Graphic 1. Local Regression. K Nearest Neighbours Determination
[Figure: out-of-sample accuracy (vertical axis, from 0.70 to 0.90) plotted against the number of nearest neighbours K (horizontal axis, from 20 to 80); the curve peaks at K = 30.]
Among the semi-parametric methods, we start by analysing the neural-network model. The number of hidden neurons finally selected was 3. As we can observe, the NN obtains a better result than the parametric methods and a slight improvement over the local regression. On the other hand, the GA presents an accuracy better than the parametric models but worse than the non-parametric method and the NN. However, unlike those methods, the GA yields an explicit non-linear expression that represents the relation between the variables. Two important aspects can be emphasized. First, the expression is composed of a non-linear component (affecting the variables M2 and Cond) and a linear component (the variables Actecon and Ind). Second, the effects of the variables on the rental price are those expected a priori.
Table 2. Out-of-Sample Accuracy and Comparison between Models

PARAMETRIC METHODS
Linear regression (out-of-sample R2 = 0.7232):
  R̂p_i = 366.28·M2_i + 6205.5·Ind_i + 5035.1·Actecon_i + 18924·Cond_i
Semi-log (out-of-sample R2 = 0.7375):
  log R̂p_i = 10.074 + 0.0047391·M2_i + 0.10679·Ind_i + 0.084365·Actecon_i + 0.2749·Cond_i
Double-log (out-of-sample R2 = 0.6996):
  log R̂p_i = 8.8911 + 0.38529·log M2_i + 0.10743·log Ind_i + 0.10166·log Actecon_i + 0.28185·Cond_i
Quadratic semi-log (out-of-sample R2 = 0.7689):
  log R̂p_i = 10.393 + 0.081046·Actecon_i + 0.70081·Cond_i + 0.5·(0.00341·M2Ind_i − 0.30031·IndCond_i)
(t-statistics reported beneath the coefficients, in the order given: 10.12, 107.7, 4.541, 6.123, 4.475, 3.553, 4.89, 3.208, 4.329, 29.99, 5.685, 4.39, 2.959, 4.342, 8.640, 166.1, 3.170, 3.131, −2.177.)

NON-PARAMETRIC METHOD
Local regression (out-of-sample R2 = 0.8575): the number of nearest neighbours considered was 30.

SEMI-PARAMETRIC METHODS
Neural network (out-of-sample R2 = 0.8621): feed-forward back-propagation with 3 layers and 1 neuron in the hidden layer.
Genetic algorithm (out-of-sample R2 = 0.8220):
  R̂p_i = M2_i · (1 + Cond_i / 1.76) + 4.52·Actecon_i + 4.23·Ind_i
In summary, we have shown how data-driven methods such as neural networks and genetic algorithms (semi-parametric methods) and nearest neighbours (a non-parametric method) capture substantial non-linearity that cannot be fully captured by linearly transformed models, as measured by out-of-sample performance.
IV-. Conclusions
The empirical results presented in this paper demonstrate the usefulness of applying non-parametric and semi-parametric data-driven models to the estimation of hedonic price functions. In all cases, the data-driven models outperform the parametric models in terms of out-of-sample R2. Despite this improvement, one problem with nearest neighbours and neural networks is the difficulty of interpreting the results: they do not offer an explicit equation in which we can analyse the effect of each independent variable on the rental price. On the other hand, the GA yields an analytical expression that is easy to interpret and highly accurate (less so than the other data-driven techniques, but better than the parametric methods). Its drawback is that we cannot carry out statistical inference and hypothesis testing.
References
Álvarez A., Orfila A. and Tintore J. (2001) "DARWIN: an evolutionary program for nonlinear modeling of chaotic time series", Computer Physics Communications, in press.
Curry B., Morgan P. and Silver M. (2001) "Hedonic regressions: mis-specification and neural networks", Applied Economics, 33, 659-671.
Deboeck G. (1994) "Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets", John Wiley & Sons, New York.
Gately E. (1996) "Neural Networks for Financial Forecasting", John Wiley & Sons, New York.
Goldberg D. E. (1989) "Genetic Algorithms in Search, Optimization and Machine Learning", Addison-Wesley, Reading, MA.
Holland J. H. (1975) "Adaptation in Natural and Artificial Systems", The University of Michigan Press, Ann Arbor.
Koza J. R. (1992) "Genetic Programming: On the Programming of Computers by Means of Natural Selection", The MIT Press, Cambridge, MA.
Medeiros M. and Teräsvirta T. (2001) "Statistical methods for modelling neural networks", working paper.
Qi M. (1999) "Nonlinear predictability of stock returns using financial and economic variables", Journal of Business & Economic Statistics, 17, 4, 419-429.
Rasmussen D. W. and Zuehlke T. W. (1990) "On the choice of functional form for hedonic price functions", Applied Economics, 22, 431-438.