Using Data-Driven Prediction Methods in a Hedonic Regression Problem

MARCOS ÁLVAREZ-DÍAZ¹, MANUEL GONZÁLEZ GÓMEZ
Department of Applied Economics, University of Vigo, Lagoas-Marcosende, Vigo, Spain

and

ALBERTO ÁLVAREZ
ISME-DSEA, Department of Electrical Engineering, University of Pisa, Via Diotisalve 2, 56100 Pisa, Italy

(PRELIMINARY VERSION)

Abstract

Traditional studies of hedonic prices apply simple functional forms, such as linear or linearly transformable structures. However, the literature now recognizes the importance of introducing non-linearity to improve the explanatory capacity of the models. In this work we apply data-driven methods to carry out the hedonic regression. These methods impose no a priori assumption about the functional form. We use the nearest-neighbours technique as a non-parametric method, and neural networks and genetic algorithms as semi-parametric methods. Neural networks have already been applied to the specific hedonic regression problem but, to the authors' knowledge, this is the first time that a genetic algorithm is employed. The empirical results we have obtained demonstrate the usefulness of applying both non-parametric and semi-parametric data-driven models in the estimation of hedonic price functions: they improve on the traditional parametric models in terms of out-of-sample R².

¹ Corresponding author: Departamento de Economía Aplicada, Universidad de Vigo, Lagoas-Marcosende s/n, 36200 Vigo, Spain. Fax: 986812401; e-mail: mad@uvigo.es.

I. Introduction

Hedonic price theory tries to determine how the individual characteristics of a commodity affect its price. The hedonic perspective has been applied to many goods, such as automobiles, personal computers, televisions, irrigated land and, above all, housing. A linear relation between the commodity's price and its characteristics is the most widely used approach in this sort of study, the main reason being that linear models are easy to estimate and interpret. Nevertheless, the literature also recognizes the importance of non-linearities in the hedonic price function for increasing its explanatory capacity (Rasmussen and Zuehlke (1990)). Many works address this issue with techniques that, by means of transformations, allow flexible parametric functional forms (for example, Box-Cox transformations). However, these flexible forms impose a structure that introduces forced and unnecessary non-linearity. The result is an "over-parameterized" model and a loss of out-of-sample performance. To avoid this problem, we can use data-driven methods that obtain a model without imposing any a priori assumption about the functional form (nearest neighbours, neural networks and genetic algorithms).

The objective of this paper is to determine whether non-parametric and semi-parametric data-driven models can improve forecast accuracy with respect to the parametric models. Our empirical study centres on the real estate market in the city of Vigo (Spain). To carry out our research, we collected information on renting prices and on the houses themselves, such as their structural attributes and neighbourhood conditions.

The work is structured as follows. In Section 2, we briefly present the different prediction methods employed: linear regression as a parametric method, nearest neighbours as a non-parametric method, and neural networks and genetic algorithms as semi-parametric methods. In Section 3, we describe the data and show our empirical results on out-of-sample predictions. Finally, we close with a section dedicated to conclusions.

II. Methodology

a) Parametric Techniques

The simplest approach to the hedonic regression problem is to postulate that the functional form is linear. To achieve more flexibility, it is very common to apply some non-linear transformation to the data. Most studies on hedonic regression employ simple functional relations such as the linear, semi-logarithmic, double-log or quadratic semi-log forms, among others. The justification for these forms rests on the use of traditional estimation methods such as ordinary least squares, on their success in previous studies, on empirical tests, and on the ease with which they allow statistical inference and hypothesis testing.

The procedure employed in this work is to select the functional form that achieves the highest accuracy in terms of R² while keeping all the model's variables statistically significant. Variable selection is carried out by the backward method: we specify a functional form characterized by being linear in the parameters (linear, semi-log, double-log or quadratic semi-log), consider all the available independent variables, estimate the model by ordinary least squares and delete the least significant variable. We then re-estimate the model without the deleted variable and repeat this process until all the surviving variables are statistically significant. We employ a significance cut-off level of 5%.
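As an illustration of this backward method, the sketch below implements it with the statsmodels library. This is our own illustration, not the software used in the study; the data file name is hypothetical and the column labels follow Table 1 below.

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Repeatedly drop the least significant regressor until every
    surviving variable has a p-value below alpha."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")    # the intercept is always kept
        if pvalues.empty or pvalues.max() <= alpha:
            return model
        X = X.drop(columns=pvalues.idxmax())     # delete the worst variable and re-estimate

# Hypothetical data set with the variables described later in Table 1
data = pd.read_csv("vigo_rents.csv")
y = data["Rp"]                                   # renting price (log(Rp) for the semi-log forms)
X = data[["M2", "Cond", "Ind", "Actecon", "Npg", "Ncb"]]
print(backward_elimination(X, y).summary())
```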
b) Non-parametric Approach

In this subsection we briefly explain a generalization of the nearest-neighbour method called local linear regression. The method is based on the idea that observations with similar characteristics should have similar outcomes. Suppose that we have a sample of observations for which we know the inputs and their respective outputs, and that we want to predict the unknown output y_{i*} associated with a new input vector x_{i*}. We calculate the Euclidean distance between x_{i*} and the other sample input vectors, select the K closest points together with their respective outputs, and perform a linear regression on them. Once the parameters have been estimated, we can infer the predicted value ŷ_{i*}.

Of great interest is the choice of the number of nearest neighbours (K). Consistency of local linear regression demands that the number of nearest neighbours goes to infinity as the sample size increases, but at a slower rate. In the literature we can find rules such as K = T^α, where T is the sample size and 0 < α < 1 (it is usual to assume α = 0.5). We prefer to adopt an empirical perspective and try many values of K, selecting the value that achieves the highest out-of-sample performance.
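A minimal sketch of this local linear predictor in plain NumPy follows; the function name and the use of least squares via numpy.linalg.lstsq are our own choices, not part of the original study.

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_new, k=30):
    """Fit a linear regression on the k nearest neighbours of x_new
    (Euclidean distance) and return the predicted output."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    idx = np.argsort(dists)[:k]                        # the k closest sample points
    Xk = np.column_stack([np.ones(k), X_train[idx]])   # local design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xk, y_train[idx], rcond=None)
    return np.concatenate(([1.0], x_new)) @ beta
```

Sweeping k over a grid and keeping the value with the best out-of-sample R² reproduces the empirical selection strategy just described.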
c) Semi-parametric Approach

c.1) Neural Networks

Neural networks (NNs) are a class of semi-parametric models inspired by studies of how the brain and nervous system work. They have been employed to solve a huge range of economic problems, such as financial time series forecasting and bankruptcy prediction, among others. Some works have already applied NNs successfully to the specific case of hedonic regression (Curry, Morgan and Silver (2001)). A good introduction to NNs can be found in Smith (1995), and economic applications in Gately (1996) and Deboeck (1994).

NNs are composed of interconnected elements, called neurons, linked through weights and grouped in layers. The first layer is called the input layer and the last the output layer; the middle layers are called hidden layers. Each neuron in the input layer brings the value of one independent variable into the network and propagates it towards the neurons of the next layer. In turn, each neuron of the next layer forms a weighted linear combination of the input signals it receives, processes this weighted information through a transfer function and sends out an output signal. The signals from all neurons are propagated across the NN in the same way up to the final layer, where the NN's output is produced. The difference between the NN's output and the known value of the dependent variable is then calculated, and the NN tries to minimize this error by modifying the weights of the links. The process continues iteratively in search of the optimal weight values and finishes when a given error level is achieved or, failing that, when a given number of iterations has been completed.

The construction of a good NN for a particular application is not a trivial task. To avoid a lack of generalization, we must choose an appropriate architecture (for example, the number of hidden layers, the number of units in each layer, the connections between units and the transfer functions). A common practice is to select the architecture by trial and error, searching for the highest performance. In this work, we use the simplest and most widely employed NN in economics, a feed-forward network trained by back-propagation. In statistical notation, this NN can be expressed as

Y_i = φ( β_0 + Σ_{h=1}^{q} β_h ψ( γ_{h0} + Σ_{j=1}^{n} γ_{hj} X_{ij} ) ) + ε_i,

where Y_i is the dependent variable, X_i is the input vector, the parameters β and γ are the weights to be adjusted, n is the number of inputs, q is the number of hidden units, φ and ψ are the transfer functions and ε_i is a disturbance term. It is known and accepted that a three-layer feed-forward NN with a linear transfer function in the output unit and a logistic transfer function in the hidden-layer neurons is able to approximate any non-linear function to an arbitrary degree of accuracy (Qi (1999)). We employ this architecture, and the number of neurons in the hidden layer is determined by trial and error, searching for the highest value of the out-of-sample R².

One important question is how to select the NN inputs; in other words, we have to determine the independent variables that the NN will employ as inputs. Medeiros and Teräsvirta (2001) suggest carrying out the variable selection by linearizing the model and applying some linear variable selection method. In our case, we employ the backward method to select the relevant variables.
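The sketch below shows how this architecture could be set up with scikit-learn's MLPRegressor: a logistic hidden layer and an identity output unit, matching the description above. The library, the solver, the iteration limit and the synthetic stand-in data are all our own assumptions; the paper does not specify an implementation. The 85/25 split mirrors the one described in Section III.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(size=(110, 6))                  # stand-in for the 110 observations
y = X @ rng.uniform(size=6) + 0.1 * rng.standard_normal(110)

X_train, y_train = X[:85], y[:85]               # training set: first 85 observations
X_valid, y_valid = X[85:], y[85:]               # validation set: the remaining 25

scaler = StandardScaler().fit(X_train)          # the NN inputs are normalized
nn = MLPRegressor(hidden_layer_sizes=(3,),      # one hidden layer with 3 units
                  activation="logistic",        # logistic hidden units; the output unit is identity
                  solver="lbfgs", max_iter=5000, random_state=0)
nn.fit(scaler.transform(X_train), y_train)
print("out-of-sample R2:", nn.score(scaler.transform(X_valid), y_valid))
```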
c.2) Genetic Algorithms

Genetic algorithms (GAs) are a functional search procedure based on the Darwinian theories of natural selection and survival. The procedure was developed by Holland (1975) and popularized by Goldberg (1989) and Koza (1992). In general, applications to economic problems are scarce and, up to now, GAs have not been used for the hedonic regression problem (at least, to the authors' knowledge). GAs present advantages with respect to the neural network and nearest-neighbour methods. First of all, the procedure yields an explicit mathematical equation that we can analyse. Moreover, unlike neural networks, GAs are more flexible because they do not require the prior specification of a complex architecture. We use a specific GA called DARWIN (Álvarez et al. (2001)).

DARWIN carries out an optimization process that finds an optimal functional form by evolving an initial population of alternative equations. The algorithm simulates on a computer the process of selection and survival observed in nature. Briefly, DARWIN works in the following way. First, a set of candidate equations (the initial population) for representing the relation between the variables is randomly generated. These equations are initially of the form

S = A ∘ B ∘ C ∘ D,

where the arguments A, B, C and D are explanatory variables or real-number constants (the coefficients of the equations), and each symbol ∘ stands for one of the four basic arithmetic operators +, −, × and ÷. Other mathematical operators are conceivable, but increasing the number of available operators complicates the functional optimization process. Each equation of the initial population is evaluated and classified according to its R². The equations with the highest values of R² are selected to exchange parts of their character strings (reproduction and crossover), while the individuals less fitted to the data are discarded. As a result of this crossover, offspring more complicated than their parents are generated; the total number of characters in an equation is bounded from above to avoid the generation of offspring of excessive length. Finally, a small percentage of the equations' most basic elements, single operators and variables, are mutated at random. The process is repeated a large number of times to improve the fitness of the evolving population. At the end of the evolutionary process, DARWIN offers as its result the equation that it considers optimal to represent the true functional relation between the variables.
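The following self-contained sketch is our own toy reconstruction of the mechanism just described, not the authors' DARWIN code (which is described in Álvarez et al. (2001)). It evolves equations of the quaternary initial form, ranks them by R², keeps the best-fitted half and applies crossover and mutation.

```python
import random
import numpy as np

random.seed(1)
OPS = {"+": np.add, "-": np.subtract, "*": np.multiply, "/": np.divide}

def random_arg(n_vars):
    # An argument is either an explanatory variable or a real-number constant
    if random.random() < 0.5:
        return ("var", random.randrange(n_vars))
    return ("const", random.uniform(-5.0, 5.0))

def random_equation(n_vars):
    # S = A op B op C op D, evaluated left to right
    eq = [random_arg(n_vars)]
    for _ in range(3):
        eq += [random.choice("+-*/"), random_arg(n_vars)]
    return eq

def evaluate(eq, X):
    def value(arg):
        kind, v = arg
        return X[:, v] if kind == "var" else np.full(len(X), v)
    out = value(eq[0])
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        for i in range(1, len(eq), 2):
            out = OPS[eq[i]](out, value(eq[i + 1]))
    return out

def fitness(eq, X, y):
    pred = evaluate(eq, X)
    if not np.all(np.isfinite(pred)):
        return -np.inf                              # discard equations that blow up
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)  # R2

def crossover(a, b):
    cut = random.choice([1, 3, 5])                  # cut at an operator position
    return a[:cut] + b[cut:]

def mutate(eq, n_vars):
    eq = list(eq)
    i = random.randrange(len(eq))
    eq[i] = random.choice("+-*/") if i % 2 else random_arg(n_vars)
    return eq

def darwin_like_search(X, y, pop_size=200, generations=100):
    """Evolve a population of candidate equations, keeping the half
    best fitted to the data and breeding the rest from it."""
    n_vars = X.shape[1]
    pop = [random_equation(n_vars) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda e: fitness(e, X, y), reverse=True)
        survivors = pop[: pop_size // 2]
        children = [mutate(crossover(*random.sample(survivors, 2)), n_vars)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=lambda e: fitness(e, X, y))
```

Unlike DARWIN, this sketch keeps every equation at the initial quaternary length and mutates every child; the real algorithm lets crossover produce longer offspring up to a fixed character bound and mutates only a small percentage of elements.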
III. Data and Results

The sample used consists of 110 observations obtained through interviews that we carried out with estate agencies in the city of Vigo (Spain) from March to May 1998. For each house, we collected information about its renting price and its characteristics, such as structural attributes and neighbourhood conditions. The housing characteristics are presented in Table 1.

Our aim is to show that both non-parametric and semi-parametric models can improve forecast accuracy with respect to the parametric models. The same data are used for all the techniques. The general characteristics of the estimation and forecasting exercise are as follows. The models were estimated on the first 85 observations (the training set); the remaining observations were used to test the models and to obtain the out-of-sample predictions (the validation set). As mentioned before, the variable selection technique employed was the backward method. Finally, the measure used to compare the forecasting accuracy of the models is the out-of-sample R².
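For concreteness, a minimal sketch of this accuracy measure; we centre on the validation-set mean, which is one common convention, since the paper does not state which centring it uses.

```python
import numpy as np

def out_of_sample_r2(y_valid, y_pred):
    """R-squared computed on the validation set only."""
    ss_res = np.sum((y_valid - y_pred) ** 2)
    ss_tot = np.sum((y_valid - np.mean(y_valid)) ** 2)
    return 1 - ss_res / ss_tot
```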
Table 1. Description of the Independent Variables

Rp       Renting price in pesetas. For the NN this variable was normalized and, for the GA, it was divided by 1000.
M2       Square metres.
Cond     Dichotomous variable that takes the value 1 if the house is catalogued by the agent as excellent to occupy.
Ind      Index built as the sum of five structural characteristics: existence of a lumber-room, a grocery store, central heating, an elevator, and a furnished kitchen.
Actecon  Variable that captures the economic activity of the street where the house is located. It is calculated as the ratio between the number of businesses in the street and the number of houses.
Npg      Number of garage places.
Ncb      Number of bathrooms.

As can be seen in Table 2, the results are classified into three divisions. The first presents the parametric models: linear, semi-log, double-log and quadratic semi-log. This last form takes into account squared and cross-product effects between variables and was already employed successfully by Rasmussen and Zuehlke (1990). The best parametric result, an out-of-sample R² of 0.7689, is obtained by the quadratic semi-log model. However, the sign of the variable Ind·Cond is not consistent with a priori expectations.

For the non-parametric case, Graphic 1 shows the sensitivity of the out-of-sample R² to the number of nearest neighbours. The highest accuracy is achieved for K = 30, with R² = 0.8575. The graphic can also be used to verify the existence of important non-linearity in the hedonic regression: accuracy worsens as more neighbours are considered, so the local regression achieves higher accuracy than when all the points are used (a case that corresponds to the parametric linear regression).

Graphic 1. Local regression: determination of the number K of nearest neighbours. [The figure plots out-of-sample accuracy (vertical axis, approximately 0.70 to 0.90) against K (horizontal axis, 20 to 80); it is not reproduced here.]

Among the semi-parametric methods, we start by analysing the neural network model. The number of hidden neurons finally selected was 3. As can be observed, the NN achieves a better result than the parametric methods and a slight improvement over the local regression. On the other hand, the GA presents an accuracy better than the parametric models but worse than the local regression and the NN. However, in contrast to those methods, the GA yields an explicit non-linear expression that represents the relation between the variables. Two important aspects can be emphasized. First, the expression is formed by a non-linear component (affecting the variables M2 and Cond) and a linear component (the variables Actecon and Ind). Second, the effects of the variables on the renting price are those expected a priori.

Table 2. Out-of-Sample Accuracy and Comparison between Models

PARAMETRIC METHODS

Linear regression (out-of-sample R² = 0.7232):
R̂p_i = 366.28 M2_i (t = 10.12) + 6205.5 Ind_i (t = 4.541) + 5035.1 Actecon_i (t = 4.475) + 18924 Cond_i (t = 4.89)

Semi-log (R² = 0.7375):
log R̂p_i = 10.074 (t = 107.7) + 0.0047391 M2_i (t = 6.123) + 0.10679 Ind_i (t = 3.553) + 0.084365 Actecon_i (t = 3.208) + 0.2749 Cond_i (t = 4.329)

Double-log (R² = 0.6996):
log R̂p_i = 8.8911 (t = 29.99) + 0.38529 log M2_i (t = 5.685) + 0.10743 log Ind_i (t = 4.39) + 0.10166 log Actecon_i (t = 2.959) + 0.28185 Cond_i (t = 4.342)

Quadratic semi-log (R² = 0.7689):
log R̂p_i = 10.393 (t = 166.1) + 0.081046 Actecon_i (t = 8.640) + 0.70081 Cond_i (t = 3.170) + 0.00341 M2_i^0.5 Ind_i (t = 3.131) − 0.30031 Ind_i Cond_i (t = 2.177)

NON-PARAMETRIC METHOD

Local regression (R² = 0.8575): the number of nearest neighbours considered was K = 30.

SEMI-PARAMETRIC METHODS

Neural network (R² = 0.8621): feed-forward back-propagation with 3 layers and 3 neurons in the hidden layer.

Genetic algorithm (R² = 0.8220):
R̂p_i = (M2_i + 1)(Cond_i + 1)/1.76 + 4.52 Actecon_i + 4.23 Ind_i

In summary, we have been able to show how data-driven methods such as neural networks and genetic algorithms (semi-parametric methods) and nearest neighbours (a non-parametric method) capture substantial non-linearity that linearly transformed parametric models cannot fully capture, in terms of out-of-sample performance.

IV. Conclusions

The empirical results presented in this paper demonstrate the usefulness of applying non-parametric and semi-parametric data-driven models in the estimation of hedonic price functions. In all cases, the data-driven models outperform the parametric models in terms of out-of-sample R². Despite this improvement, one problem with nearest neighbours and neural networks is the difficulty of interpreting their results: they do not offer an explicit equation in which the effect of each independent variable on the renting price can be analysed. The GA, on the other hand, yields an analytical expression that is easy to interpret and highly accurate (less so than the other data-driven techniques, but better than the parametric methods). Its drawback is that it does not allow statistical inference and hypothesis testing.

References

Álvarez, A., Orfila, A. and Tintore, J. (2001) "DARWIN: an evolutionary program for nonlinear modeling of chaotic time series", Computer Physics Communications, in press.

Curry, B., Morgan, P. and Silver, M. (2001) "Hedonic regressions: mis-specification and neural networks", Applied Economics, 33, 659-671.

Deboeck, G. (1994) Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets, John Wiley & Sons, New York.

Gately, E. (1996) Neural Networks for Financial Forecasting, John Wiley & Sons, New York.

Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA.

Holland, J. H. (1975) Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor.

Koza, J. R. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, MA.

Medeiros, M. and Teräsvirta, T. (2001) "Statistical methods for modelling neural networks", working paper.

Qi, M. (1999) "Nonlinear predictability of stock returns using financial and economic variables", Journal of Business & Economic Statistics, 17(4), 419-429.

Rasmussen, D. W. and Zuehlke, T. W. (1990) "On the choice of functional form for hedonic price functions", Applied Economics, 22, 431-438.

Smith, M. (1995) Neural Networks for Statistical Modeling, Van Nostrand Reinhold, New York.