CITY AND REGIONAL PLANNING 776 Getting Started with Limdep for Windows Philip A. Viton January 11, 2006 Contents 1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 2 2 Data preparation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 3 3 Limdep's opening screen : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 4 4 Open the spreadsheet data le : : : : : : : : : : : : : : : : : : : : : : : : : : 5 5 Data examination : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 6 6 Data transformations : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 7 7 Projects : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 8 8 Setting up the model — dependent variable : : : : : : : : : : : : : : : : : : : 8 9 Setting up the model — independent variables : : : : : : : : : : : : : : : : : 10 10 Run the model : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 11 11 Examine the results : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 11 12 Command les : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 13 13 Choice-based sampling : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 16 14 The mixed-logit model : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 17 15 References : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 21 1 1 Introduction This note is an illustrated guide to using Limdep/NLogit to estimate a discrete-choice (logit) model. The program is available on the computers of the KSOA network. It should also be available on Civil Engineering computers. A full set of manuals is available in the KSOA library. There are actually two versions of the program: Limdep and NLogit. For most purposes — except some advanced discrete choice models, like the nested logit or mixed logit models, see section 14— they work identically, and datasets are transferable between the two versions. Unless otherwise stated, when I refer to Limdep, I mean Limdep or NLogit. There is also a free student edition available on the KSOA Faculty (typically X;) drive at viton_philip/homework/general/handouts/stat.exe This will only be accessible if you're enrolled in the School of Architecture: others can stop by my of ce with a Zip disk (total capacity 250M or less; the program needs 4M free space, plus another 8M if you want documentation), or a writeable CD, or jump drive. To install the student edition, just run the exe le. There's no point in not accepting the default location, since that will always be created on your hard disk, no matter what you say. The student edition is limited to datasets of less than 50,000 values, with no individual variable having more than 1000 observations: in other words, it is suitable for all but very large analyses. If you have a large dataset, you can develop your analysis at home using a subset of the data (to get all the procedures right, etc); and then save the commands into a le, bring them into one of the labs and run the analysis on the large dataset. I concentrate here on the discrete-choice model; but Limdep is an extremely full-featured general-purpose statistics package: it will allow you to estimate just about any model you nd discussed in the econometrics literature. In this connection, note that even for simple things like regression models, Excel's numerical accuracy is suspect: one recent analysis by a statistician recommended that it should not be used if accurate results were desired. On the other hand, most “real” statistics packages (eg, SAS or SPSS) can estimate a many of these models, though typically not as many as Limdep; and SAS in particular is good at database management, which one of Limdep's weaknesses. One recent PhD student here did all his data manipulation in SAS and then imported the nal massaged data into Limdep to do the actual estimation. In this note I emphasize Limdep's point-and-click GUI interface. My own view is that this is a bit klunky, and that for any serious work you may want to consider the batch-like command-based interface — see section 12. The dataset I'll use is clogit.dat; this contains data on 210 individuals' mode choices for intercity travel in Australia. You can get the data by opening Limdep's online Help 2 and selecting Datasets in the Contents. The actual data is under “Data on Mode Choice for Discrete Choice Models”. You can copy it to the clipboard and then paste it into a NotePad window and save it to disk; or you can paste it directly into one of Limdep's command windows and then Run the window. Since most of your data will not be available in this form, for purposes of illustration I imported the data set into an Excel spreadsheet, and I'll begin with that. 2 Data preparation Limdep can read in data in Excel (.xls) , Lotus-1-2-3 (.wks or .wk1), or plain text formats. Spreadsheet data should consist of values only — formulas should be converted to values before saving. Limdep will consider any non-numeric data (except in Row 1 of a spreadsheet) as missing data, and will assign it value 999. The rst row in your spreadsheet can be used to name the variables. If you don't do this, the variables are named automatically as X1; X2, X3, etc, which is pretty unhelpful. Names must be 8 characters or less; case is not signi cant. Don't try anything fancy like multicolumn titles or Limdep will have trouble understanding them. 3 3 Limdep's opening screen You see an “Untitled project” containing no data variables (you can tell that there are none because there's no “+” beside the “Variables” folder). Note that there are entries in the Matrices and Scalars folders; but if you open the folders and look, they are all empty elements. 4 4 Open the spreadsheet data le Our rst task is to read in some data to analyze. Do Project -> Import -> Variable Find your spreadsheet le via the standard Windows interface. Click Open to read in the data. Note that a little “+” now appears before the Variables folder, indicating that some variables are available to work with. 5 5 Data examination Click on the Project Window under Variables to see which data series have been read in. You can use the Data Editor to see the actual data, though the editor is limited in the number of rows (observations) can be displayed. To start the editor, double-click on the series you want to view. You will also see a few of the series surrounding that one. You can also edit the data from the Cell box at the top of the editor. Note that if you do so, you are not prompted to save any changed data when you close the data editor. To save changed data, do Project -> Export -> Variables and choose your export format. 6 6 Data transformations Limdep includes many ways to create new variables by transforming old ones. Personally I nd it easiest to do this “by hand” in a command window — see section 12 — using the create command. But you can also do it using the point and click interface. First enter the Data Editor (double-click on any variable or do Project -> Data Editor Right-click anywhere in the Data Editor, and then select New variable You get a dialog-based way to create a new variable: see the picture below. – Fill in the name you wish to give your new variable in the Name eld. Remember that names must be 8 characters or less; case is not signi cant. – You can then choose one or more transformations from the list on the right. Unfortunately, some of these are rather cryptically named; and the online help provides no guidance at all. – Choosing a transformation inserts a function name into the Expression box; this function name typically contains one or more place-holders (eg x) where you must ll in the name of an existing variable: unfortunately, there is no dialog-based way to do this. – Click OK to have the new variable created. You can examine it in the Data Editor to ensure that the command you constructed actually did what you intended it to do. 7 7 Projects Reading in the data is probably the slowest part of Limdep. To get round this, you can read in the data once, then save the current Limdep workspace as a Project. Once this has been done, you can read in the Project instead of re-reading the data: this is almost instantaneous. To save a workspace as a project do File -> Save Project As. Provide a name only: the extension .lpj will be automatically appended. At the end of your Limdep session, you'll be asked if you want to save the current project. There's probably no reason to do this unless you've changed the data (for example, added new variables). But you should know that a Limdep project includes all matrices and scalars shown in the Project Window, so if you do want to save the latest versions of these, then (re-)save the project when asked. 8 Setting up the model — dependent variable We will now set up to estimate a logit model of discrete choice for our data. Do Model -> Discrete Choice -> Discrete Choice to start. Note that, despite the model's being known as the Logit model, you do not choose Logit. The name “Logit Model” is used in some of the econometrics literature for a slightly different model, and Limdep respects that usage. Now set up the dependent variable (the one describing the choices the individuals in the sample actually made) in the Main tab. Click on the drop-down menu under Choice Variable to select the choice indicator. In our case, the dependent variable is called MODE 8 You must also provide names for the choices represented in the dependent variable (this is how Limdep keeps things straight internally). The names can be anything you like, but obviously it's a good idea to make them re ect the alternatives actually represented in the data, in the proper order. Here we choose Air,Train,Bus,Car, but we could just as well use (say) A,B,C,D Limdep allows each individual in the sample to have a different-sized choice set: in this case you need to provide a variable describing, for each individual, which modes are available. 9 9 Setting up the model — independent variables Click on the Options tab. In the middle of the page, you select variables from the list on the right. Then click on << in the Attributes frame to add them to the list. If you change your mind, then selecting a variable in that frame and clicking >> removes it from the list of independent variables. The variable ONE is built in to the program: it represents a constant (vector of 1's). But for the discrete choice model is has a special usage: if you include it in your list of independent variables you will automatically get the full set of estimable alternativespeci c constants. (The omitted constant will correspond to the last alternative). However, for this particular dataset, the alternative speci c constants are included as variables (AASC, BASC, CASC, and TASC). If you click on the << in the Interact with ASC frame, you create a version of the variable which differs by choice (that is, the product of a variable with the set of alternative-speci c constants). So, for example, if you wanted the coef cients of In-Vehicle Travel Time to vary by mode, you'd enter it using the Interact with ASC buttons. 10 10 Run the model You can experiment with the other options, but most of the time, the defaults will suf ce. However, the Display frame on the Output tab allows you to request that certain additional results of the estimation (for example a full variance-covariance matrix, or a set of descriptive statistics) be printed. When you're ready to run the model, click Run. Here's the output: Note the little box at the bottom, marked Matrix LastOutp: this is a little spreadsheetlike object containing the estimation results. You can open it, select the entire array by clicking on the top-left cell, copy the contents to the clipboard, and then paste it into an Excel spreadsheet or Word document for further manipulation. Note that this object is not saved when save the results to a text le. 11 Examine the results The most important result is given at the end of the Trace window, where you see the Exit Status for the model you have just estimated. This should always be 0: if it is not, 11 then something has gone wrong, and the results are unreliable. Only after you have checked this should you look at the actual estimation results, in the lower part of the window. You can save the results into a text le by doing File ->Save and then providing a le name. The default extension (.lim) is optional. Here is the contents of the saved le: The variables here are: – INVC : in-vehicle trip cost – INVT : in-vehicle trip time – AASC, TASC, BASC :mode-speci c dummys for car, train and bus (respectively). The entry “Log likelihood function” is what the SFBA handout refers to as the loglikelihood at convergence. The columns marked R-sqrd give the likelihood ratio index (McFadden's / for the models. This statistic represents the gain in information provided by the model, versus the no-information case. It is roughly analogous to the R 2 in linear regression. Note that there are two concepts of “no information” used here: a model in which all the coef cients are zero (“No coef cients”) and a model in which we are assumed to know only mode-speci c constants (“Constants only”). This latter makes sense as a no-information model because we do not need to gather any information (data) in order to run a model with dummys. Note that before the coef cient-estimates results there is a useful box of explanations of how these are computed. 12 12 Command les For many analyses, I prefer to create Limdep commands “by hand”: this makes it easy to modify a command (you just want to add or remove an independent variable, for example) without going through the whole point-and-click menu system. Limdep has a built-in Text/Command Document window, which you can use for this purpose. However, this requires a knowledge of the syntax of Limdep's commands; see the online help. You can get a leg up on syntax from your previous output, which shows commands Limdep built from your point-and-click instructions. They appear in the lower frame of your Trace window (also in any saved output) preceded by -->. For example, suppose you want to re-run the model we've just estimated without the mode-speci c dummys AASC,TASC and BASC. Here's the Output window, showing the command you just ran: Open a Command Document: click File -> New (or click on the New Document icon at the far left) then select Text/Command Document 13 Copy-and-Paste the DiscreteChoice lines (without the --->). Do not remove the $ at the end of the line: this is how Limdep knows that the line has ended. (This also means that you can break up a command onto multiple lines, which may make it easier to read). The lhs= (“left-hand side” of the regression equation) modi er provides the dependent (observed choice) variable. The independent variables are entered using the rhs= (“right-hand side”) modi er, and choices= supplies the names you chose for the alternatives. Both rhs and choices are comma-separated lists of names; and note that each modi er ends with a semi-colon (;). Edit out the alternative speci c constants. You can add comment lines to remind yourself of what you're doing. Comment lines begin with a question-mark (?). They do not appear in your output. Now highlight all the lines (models) you want to run. (You can highlight comments: they will be ignored). 14 Click the green Go button to run the models, or do Run -> Run Selection You can also construct your command le using an external text editor (like NotePad) and import it and then run it from within Limdep. After a while you'll be able to write Limdep commands without having to examine a previously run command to see what they should look like. 15 13 Choice-based sampling As it happens, the sample in this dataset is not random: rather it is choice-based, and as we've seen, ignoring this fact leads to incorrect estimation results. In order to correct the problem, we need to know the population selection proportions, which are as follows1 : Mode Sample Population Air 0:2667 0:1400 Train 0:5200 0:1300 Bus 0:0267 0:0900 Car 0:1867 0:6400 Note what is happening here: we are over-sampling the non-road modes, Air and Train. To have Limdep correct for this, you need to tell it what the true (population) proportions are: you do this in the model setup dialog: Then click Run to run the model 1 There is further discussion of this dataset and its applications in Louvière, Hensher and Swait, Stated Choice Methods; however note that the population proportions given on p. 157 of the book are obviously wrong, since they don't add up to 1.0. I got the proportions in the table above from an old Limdep manual. 16 And here is the output. Note that you no longer have Likelihood Ratio statistics; and that the output reminds you that we are correcting for choice-based sampling. 14 The mixed-logit model The mixed logit model handles the case of unobserved heterogeneity by assuming that (some of) the weighting coef cients vary in the population according to some distribution, and estimating the parameters of those distributions. This model may only be estimated with NLogit, not Limdep (and hence not with the student version of Limdep). In the KSOA there is limited availability of NLogit: it is restricted to 3 simultaneous users. Note that the only advantage of NLogit over Limdep is in a few of these specialized models: for other models, the two programs accept precisely the same commands and give precisely the same results. To estimate a mixed logit model you need to make a few decisions: Which coef cients will you assume to be random, and which distributions will you use? NLogit gives you a choice of normal, uniform, triangular and lognormal distributions. There are some tricky issues here: for example, you'd usually want a price coef cient to be always negative, since increasing prices reduces utility for anyone. But if it had a normal distribution, then there is some probability that it could be negative, since the domain of the normal distribution is the entire real line. To handle this in NLogit, one creates a new variable, the negative of prices, and then forces the coef cient to be from a non-negative distribution. That's why the non-negative lognormal distribution is available. You can also force the triangular distribution to be non-negative. Another issue concerns the values of times (or, in general, substitution between a modal characteristic and cost). As we've seen, this is the ratio of the characteristicscoef cient and the coef cient of cost. If you specify both of these to be random, you are forcing the values to have a very complicated distribution (the ratio of two normals, for example, is not normal). In order to make the interpretation of the results easier, one often speci es that cost (the quantity in the denominator) be non-random. How will you generate the random draws needed for the simulation? You have two choices: via uniform random numbers or by a special procedure known as Halton quasi-random numbers. Most people believe that Halton numbers are substantially better: you get more precise results with fewer repetitions, ie computational effort. How many repetitions will you use? There is no exact answer here: one guideline is that if you use Halton numbers then somewhere between 100 and 500 is reasonable: if you use uniform numbers than you may need between 4 and 10 times as many. That can be seriously time consuming. Do you want robust standard errors? There are two ways of computing the estimated standard errors: the ordinary (fast) way, which however is sensitive to misspeci cation, and the robust way which is not. Since we will not in general know that 17 our speci cation of the model is correct, we will almost always want to request robust standard errors. But these take a bit longer to compute. Warning: there are identi cation (uniqueness of the estimates) issues connected with specifying randomly varying choice-speci c dummies: you can in general estimate only J .J 1/=2 1 of them (where J is the size of the choice set). There may be identi cation issues connected with individual characteristics attached to a single alternative: the situation here is not quite clear. There are however no identi cation issues connected with allowing characteristics of the alternatives (cost, time etc) to vary randomly. See the References for more information on this. To specify a mixed-logit model you will ordinarily use the command editor, and then submit your command to NLogit for estimation: there's no graphical command builder for this model. Here's an example. Note that this is to be considered only an example: as we've seen, this dataset is choice-based, and NLogit contains no way to estimate the mixed logit model on a choice-based sample. In this model we shall allow the coef cient of generalized cost (GC) to have a normal distribution, we shall use Halton quasi-random numbers and R D 100 repetitions for the simulation and request robust standard errors. The picture below shows the commands: note that the main command is nlogit not DiscreteChoice. The lhs, rhs, and choices are as before. rpl is the part of the command that requests the mixed-logit model: “rpl” stands for “random parameters logit” halton requests Halton quasi-random numbers. If you omit this, you get uniform numbers. 18 pts is the number of repetitions. The default is 100. The fcn part of the command is where you specify the distribution of the random parameters. This consists of a variable name (here, gc) followed in parentheses by in identi er for the distribution you want: your choices here are n (normal), u (uniform), l (lognormal) or t (triangular). More complicated setups are possible: you could for example specify that the distributions are non-independent by including the keyword ;correlated. Here's the output from this model: A quick guide to understanding the output: We begin with a standard logit model, in order to get reasonable starting points. That's the only use for the rst block; otherwise you don't need to worry about it. This is followed by the mixed-logit results. Most of the quantities in the top box should be familiar. It tells us how many repetitions we requested, and con rms that Halton numbers were used. It also tells us that a Robust VC (variance-covariance matrix) was used. This is followed by the estimates themselves. The important thing to note here is that in the case of the random coef cients we are estimating the parameters of the underlying distribution, not the individual-speci c weights themselves. The estimates are broken into three parts – The rst, in the section random parameters in utility functions gives the means of the distributions of the random parameters. 19 – The second, nonrandom parameters in utility functions gives the coef cient estimates for parameters you have decided are to be non-random. These have precisely the same interpretation as with “standard” logit. – The third, Derived standard deviations of parameter distributions, gives the estimated standard deviations of the random distributions. (The notation NsGC reminds you that the chosen distribution was Normal; if you'd chosen, say, a triangular distribution, it would have said TsGC; the “s” stands for “standard deviation”). In this model, for example, we have estimated that the generalized cost has a normal distribution with mean :01578 and standard deviation :0001879: Note the low tstatistic on the standard deviation: in fact, we cannot reject the null hypothesis that the standard deviation is 0, implying that in fact GC isn't random at all. 20 15 References There are two extremely good books on discrete choice, both available free over the internet: Kenneth E. Train, Qualitative Choice Analysis: Theory, Econometrics and Applications to Automobile Demand. MIT Press, 1986, 1993. See http://elsa.berkeley.edu/ books/choice.html. Kenneth E. Train, Discrete Choice Methods with Simulation. Cambridge University Press 2003. See http://elsa.berkeley.edu/books/choice2.html. This book focuses on the more dif cult mixed logit model. The issue of identi cation in these models is covered in two papers, both available on the internet at http://web.mit.edu/jwalker/www/home.htm: Joan Walker, “The Mixed Logit (or Kernel Logit) Model: Dispelling Misconceptions of Identi cation”. Also in Transportation Research Record 1805:86–98. Joan Walker, Moshe Ben-Akiva and Denis Bolduc, “Identi cation of the Logit Kernel (or Mixed Logit) Model”. MIT working paper. 21