Stata Tutorial


(Windows version SE 10.1)

Spring 2009

Data and Statistical Services

Social Science Reference Center

Firestone Library

Princeton University data@princeton.edu http://dss.princeton.edu

This guide gives a basic overview of the commands for reading in, examining, analyzing, and graphing data using Stata. The sections on reading in data use, as examples, the typical data formats you may encounter when downloading data files. The sections on examining, analyzing, and graphing list some of the most commonly used commands. In this new edition, I added an equivalent way to submit commands by using menus.

Menus may make it easier for you to explore the commands – feel free to play around beyond what is described here.

Once you gain familiarity, you can learn more of Stata’s operations by using Stata’s help and search functions, trying on-line tutorials, or searching the archive of Stata’s mailing list (Statalist) at http://www.stata.com/statalist/archive. The document is easiest to follow in sequence, but you may skip ahead and try only the sections that interest you.

If you have any questions about or suggestions for this guide, please email furuichi@princeton.edu.

Table of Contents

1. Introduction
1.1. Getting help in Stata
1.2. Interpreting Stata’s Help Pages
1.3. About Data for Stata
2. Read in data
2.1. Reading in an ASCII data file with a Stata set up file
2.2. Creating a Stata set up file
2.3. Reading in an Excel file
2.4. Reading in Stata data
3. Data management
3.1. Append
3.2. Merge
3.3. Transpose
3.4. Reshape
3.5. Organizing variables
4. Explore data
5. Obtain descriptive statistics
6. Transform variables and records
7. Graph
8. Obtain difference of means statistics
9. Obtain linear regression estimates
10. Do files
11. Shortcut menus
12. Exporting results
13. Other helpful commands
14. On-line tutorials
15. References

* The period (.) in front of Stata commands indicates the Stata prompt. Do not type the period as a part of the command.

* Stata commands mentioned in the text are set off by dashes, as in –help-.

* The words to be replaced are written in italics.

* The variable names are in capital letters (unless they are in the command lines). Stata is case sensitive (upper case and lower case are seen as different characters).


1.

Introduction

Stata is a statistical analysis package, used for manipulating, examining, summarizing, and graphing data. Stata contains statistical commands that are built into the program and allows the users to do statistical analyses such as cross-tabulations and regression analyses. Stata stores data in its own format. Once a data set is in its memory, Stata will output the results responding to your commands. Commands can be executed one at a time interactively, or in groups in a command file.

Princeton University has a Unix version of Stata on the server tombstone, and Windows and Macintosh versions on the Office of Information Technology (OIT) cluster computers. If you have Princeton computer accounts, you have access to either version.

To use the Unix version, your tombstone account has to be activated. For more information about activating your account, check OIT’s help sites on the web (see http://helpdesk.princeton.edu/kb/display.plx?id=9682). In this guide, you will learn how to submit individual commands using a Windows version of Stata.

A Windows version of Stata shows four windows as in the picture below*. The rectangular window on the bottom is the Command window, where you enter your commands. The largest dark window is the Results window. The commands you enter in the Command window and Stata’s response to the commands appear in the Results window. The window on the top left corner, the Review window, is where Stata puts the commands you have submitted. If you want to reuse a command, you can click on the command shown in the Review window to input it in the Command window. The Variables window, shown below the Review window in the picture, shows a list of the variables in the data set in memory. You may click on the variable names in the Variables window to input them into your commands in the Command window. You can resize the windows by dragging the borders. If you widen the Variables window, you can also see the variable labels (if the labels exist in your data).

You can also submit commands using the menus on the top bar. In this guide, I show how to do the same thing using either commands or menus, where applicable. In the text, CMD marks what you enter in the Command window, and MNU marks what you click from the menu bar.

* The position of the windows may not be exactly the same as in this picture.


1.1.

Getting help in Stata

If you are not sure what the command is, you can search for the command in Stata using the search function.

CMD : Search commands by typing –search- and a keyword in the command window. Stata will display a list of commands and other resources associated with that keyword, if there are any. For example, type in the Command window,

. search regression

MNU : Alternatively, you can click Help => Search from the menu on the top, and input keywords in the dialog box. Click on the name of one of the commands or resources to display the help screen.

If you know what the command is and want to know the detail about the command, use Stata’s help function.

CMD : Type –help- and a command name.

. help regress

MNU : You may click Help => Stata command, then input command name in the dialog box. Stata will then display a help window.

1.2.

Interpreting Stata’s Help Pages

The help page shows the command you just typed, -help regress- in this example, at the top. Next to it are blue links labeled “dialogs:” and “also see:”. Any blue text in a help window is clickable. The commands next to “dialogs:” open the dialog box associated with the command, and the commands next to “also see:” open help pages related to the command. Title, which names the command the help is about, shows a letter in brackets. In addition to the on-line help, Stata has printed manuals. The letter in the brackets indicates the volume of the printed manual, R, that contains the information on regress.

Syntax is Stata’s command language structure. Underlined letters are the minimum number of characters Stata recognizes as the command. So, if you type reg in the Command window, Stata understands it as regress. After the word regress, Stata expects depvar, a dependent variable. The depvar is required. indepvars, if, in, weight, and , options are all in brackets. The bracketed words are optional, and the command works without them.

What goes into , options is explained underneath.

Description describes what the command is for. More details of the options follow. If you scroll further down, there are examples. Examples are helpful in seeing how to type the commands. Often, examples contain a series of commands you can try out with Stata’s example data set, which comes with the installation.
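As a rough illustration of how the syntax diagram maps onto what you type, here is a short sketch using the auto example data that ships with Stata. The particular variables and the 90% level are arbitrary choices for illustration, not something the help page prescribes.

```stata
* Load the example data that comes with Stata.
sysuse auto, clear

* Only the command and depvar are required; this fits a constant-only model.
regress price

* "reg" is the minimum abbreviation; mpg and weight are the optional indepvars.
reg price mpg weight

* An optional "if" qualifier, plus an option after the comma.
reg price mpg weight if foreign == 1, level(90)
```

Each line is a complete command. The bracketed parts of the syntax diagram are simply added or left out.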


1.3.

About Data for Stata

To put a data set into Stata’s memory, the data set has to be in a format Stata understands. The following is a list of the extensions of files Stata can read directly.

Data Format                             File Extension    Command to read the data

Stata                                   .dta              . use
Text (ASCII), free or fixed columns     .raw, .txt        . infile using
Text (ASCII), comma separated values    .csv              . insheet using
Text (ASCII), fixed columns             .dat              . infix using
SAS export                              .xport, .xpt      . fdause
MS Access                               .mdb              . odbc

Many data download sites provide you with data already formatted for a common statistical program such as Stata, SPSS, or SAS. Formatted data often contain variable labels and value labels, which make it easier for you to understand the contents of the data.

If Stata data are not available, and you can choose a data format between SPSS and SAS, then I would recommend selecting SPSS. You can use SPSS to open SPSS data, then save the data as Stata data. SPSS versions 12 and up can save the data as Stata 8 data (see note 1). A Windows version of SPSS is available in the McCosh 59 cluster or the DSS computer lab. If data are only available in SAS format, you may use SAS to open the SAS data, then create a SAS export file, as Stata can read a SAS export file. A Windows version of SAS is available in the DSS computer lab. Unix versions of SPSS and SAS are available on tombstone.

Also, if you acquire data that are in a format other than Stata, you may use DBMS/Copy to convert them into Stata format. A Windows version of DBMS/Copy is available in the DSS computer lab. A Unix version of DBMS/Copy is available on tombstone. If you have SAS data, we recommend converting them into a SAS transport file in SAS instead of using DBMS/Copy. DBMS/Copy has a known issue in converting value labels from SAS to Stata.

If formatted data are not available, data distributors may provide set up files in Stata, SPSS, or SAS along with ASCII data. An ASCII data set is a text file with rows (or columns) of numbers. If a set up file is available in Stata, you can attach the variable information using Stata. If a set up file is available in SPSS, it will be easier to use SPSS to attach the definition, then save the data as Stata data. If a set up file is available in SAS, you may use SAS to attach the file definition, then create a SAS export file in SAS. You may also modify the set up files in a text editor to use them in Stata. The commands that define data are different in all three programs. If no set up files are available and only PDF codebooks are available, you will need to select the variables you want to use and create your own set up file for Stata.

If you need help in defining or converting data, please come by the Data and Statistical Services computer lab at A-16-H-3 in Firestone Library during walk-in hours or email data@princeton.edu. The hours and directions are available at http://dss.princeton.edu. If you are emailing questions, please use your Princeton email. Our resources and assistance are available to Princeton University community members.

The Stata data format changed from version 9 to version 10. Stata 10 can read data saved for Stata 9, but Stata 9 cannot read data saved for Stata 10, even though both have the same extension, .dta. If you plan to use Stata 9 after using Stata 10, you may save the data as Stata 9 data in Stata 10. The following commands allow you to save data as Stata 9 data in Stata 10.

CMD: . saveold filename

MNU: File=> Save As. Then select “Stata 9 Data” from the drop down list for box “Save As Type:”
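A minimal sketch of the –saveold- step, using the auto example data so that it can be tried without any file of your own. The file name auto_old is made up for illustration. Which older format –saveold- writes depends on the Stata version you run, so check –help saveold- on your installation.

```stata
* Put some data in memory; here, the example data shipped with Stata.
sysuse auto, clear

* Save a copy in the older .dta format. auto_old is a hypothetical file name.
saveold auto_old, replace
```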

1 Stata 8 and Stata 9 data are interchangeable.


2.

Read in data

2.1.

Reading in an ASCII data file using a Stata set up file.

Often, you will obtain a command file and a dictionary file as a set of Stata set up files along with a data file. I suggest that you save all three files in the same directory. The command file has the extension .do, the dictionary file .dct, and the data file .txt (or .dat). The command files in Stata are also called do files. Sometimes the do file contains the dictionary, and you have two files, the do file and the data file. The procedure is similar to having three files.

As an example, I downloaded a Stata set up file and data file for the National Health Interview Survey from the Inter-university Consortium for Political and Social Research (ICPSR) web site, http://www.icpsr.umich.edu. The files are usually zipped when you download them. I extracted the zipped files using WinZip, and put them in the C:\StataHandsOn\SampleData directory. WinZip is available on DSS lab computers. OIT computers do not have WinZip, but the extraction software that comes with Windows can unzip files.

Then I opened the Stata command file using NotePad (any text editor will do, but not a word processor like MS Word). Instructions are given at the beginning of the command file, sandwiched between lines of asterisks as in the picture below. Text placed between a forward slash and asterisk pair (/* text */) is treated as a comment. Follow the instructions and specify the name and the path of the data, dictionary, and output data files in the do file. Here is a copy of the beginning of the do file for the National Health Interview Survey data.


/**************************************************************************

| STATA SETUP FILE FOR ICPSR 04349

| NATIONAL HEALTH INTERVIEW SURVEY, 2004

| (DATASET 0004: SAMPLE ADULT)

|

| Please edit this file as instructed below.

| To execute, start Stata, change to the directory containing:

| - this do file

| - the ASCII data file

| - the dictionary file

|

| Then execute the do file (e.g., do 04349-0004-statasetup.do)

**************************************************************************/

set mem 40m /* Allocating 40 megabyte(s) of RAM for Stata SE to read the
data file into memory. */

set more off /* This prevents the Stata output viewer from pausing the
process*/

/****************************************************

Section 1: File Specifications

This section assigns local macros to the necessary files.

Please edit:

"data-filename" ==> The name of data file downloaded from ICPSR

"dictionary-filename" ==> The name of the dictionary file downloaded.

"stata-datafile" ==> The name you wish to call your Stata data file.

Note: We assume that the raw data, dictionary, and setup (this do file) all

reside in the same directory (or folder). If that is not the case

you will need to include paths as well as filenames in the macros.

********************************************************/

Replace the file names here:

local raw_data "C:\StataHandsOn\SampleData\04349-0004-Data.txt"
local dict "C:\StataHandsOn\SampleData\04349-0004-Stata_dictionary.dct"
local outfile "C:\StataHandsOn\SampleData\health.dta"

/********************************************************

Section 2: Infile Command

This section reads the raw data into Stata format. If Section 1 was defined properly, there should be no reason to modify this section. These macros should inflate automatically.

**********************************************************/

infile using `dict', using (`raw_data') clear

Once you have the file paths and names inserted into the do file, execute the do file (in this example named 04349-0004-Setup.do) in Stata by giving a command:

. do 04349-0004-Setup

In this case, you do not need to modify the dictionary file. In some cases, you may need to specify the data file path and name in the dictionary file.

I specified in the do file to name output Stata data as health.dta (see the third line that starts with “local outfile”), and you see the file listed in the directory in the picture on the previous page.

You may obtain a data definition file for SAS or SPSS. The idea of attaching the data definition in SAS or in SPSS is the same as in Stata, except that their data definition is in only one file, and it needs to be executed in the respective program.

Please refer to separate handouts for details on running data definition files using SAS or SPSS.
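The set up file above relies on local macros, which is why its later sections need no editing once the three file names are filled in. The sketch below only shows how a local macro is defined and expanded. It reuses two of the paths from the example and does not read any data.

```stata
* Paths copied from the example set up file; adjust them to your own folders.
local raw_data "C:\StataHandsOn\SampleData\04349-0004-Data.txt"
local outfile "C:\StataHandsOn\SampleData\health.dta"

* Wrapping a macro name in a left quote (`) and an apostrophe (')
* substitutes the macro's contents into the command.
display "`raw_data'"
display "`outfile'"
```

A local macro defined in a do file exists only while that do file runs, so run the whole file at once rather than a few lines at a time.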


2.2.

Creating a Stata set up file.

When you have an ASCII data file but not a set up file, you will need to create one to define variables. An ASCII data file contains many rows of numbers and Stata will not know which numbers belong to which variables. You also need to define the type of variables, whether they are numeric (numbers) or string (texts or characters). ASCII data may be in free format, comma separated, or fixed columns. Example files used in this exercise are 2008 Democratic and Republican Presidential Primaries/Iraq (United States), downloaded from the Roper Center, http://www.ropercenter.uconn.edu.

Here is a portion of a fixed column ASCII data, called lat544.dat:

.25 1 2391 1 1 1 4 2 1 2 3 3 2 & 338 & & & & & & & 131 5 6 7 7 3 1 4 4 4 4 4 4 & & & & 0 & & & 2 0 2 2 1 2 2 0 1 2 1 1 2 2 2 1 4 2 & 5 1 2 3 2 65 & 6 112 5 1 2

1.66 2 9041 1 1 1 4 1 2 1 2 3 2 & 8 13& & & & & & & 5 2 1 3 3 6 3 1 4 & & & & & 4 4 4 4 4 3 5 & 1 2 1 2 4 2 2 0 1 2 1 4 1 1 1 1 5 2 & 1 3 4 1 1 55 & 6 6 1 5 1 2

.47 3 2122 4 4 4 2 3 2 4 2 1 1 & 1 2 112 5 3 6 6 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 2 2 2 1 1 1 1 4 2 1 2 1 2 1 1 1 1 1 & 1 7 & & & 65 & 7 8 1 2 1 1

.41 4 4122 4 4 4 1 2 4 1 1 1 1 & 1 8 7 2 3 1 2 5 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 3 3 3 3 1 1 1 4 0 1 2 0 2 1 2 1 2 1 & 1 2 3 1 1 & 5 5 5 2 2 1 1

A portion of the corresponding codebook says ...

Data Locations

Variable    Rec   Start   End   Format

WTVAR       1     1       7     F7.2
CASE        1     8       14    F7.0
AREACODE    1     15      21    F7.0
TRACK       1     22      23    A2
GWBUSHJO    1     24      25    A2
GWBECON     1     26      27    A2
GWBIRAQ     1     28      29    A2

This means that the variable WTVAR is at the first record, starts from column 1, and ends in column 7. The data format is F7.2, meaning that it’s a numeric variable with width 7, and includes two decimal places. In the data above, .25, 1.66, .47, and .41 correspond to this variable.

To define the variables in Stata, you can create a “dictionary” file that contains the variable information as shown below. You can use any text editor, but here we use Stata’s “do-file editor.” Open a do-file editor:

CMD: . doedit

MNU: Window => Do File Editor => New Do File.

There, type

infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}

and save the file as a dictionary (.dct) file in the same directory as the data file. For example, I saved the file as H:\lat544.dct, as I have the data file, lat544.dat, at the root of the H drive. The str in front of variable names indicates that they are string variables.

Carriage return is Stata’s default signal to end commands. So it is important to type the dictionary as it appears here. The first line ends after the opening squiggly-brace ({), each variable name and its column locations are on one line, and the closing squiggly-brace (}) is on its own line.

I have omitted record number, as there is only one record in this data. If your data file has more than one record, you need to define which record you are referring to for each of the variables. Please see help infix to see the syntax for multiple records.

Once you save the dictionary file,

CMD: . infix using H:\lat544.dct

MNU: File => Import => ASCII data in fixed format. Then Browse to find the dictionary file name and path.

Stata will show the following in the output window.

. infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}
(1373 observations read)

Check with the codebook and see if the total number of records is 1373.
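To experiment with column locations before writing a full dictionary, the same layout can also be given directly on the –infix- command line. The following do-file sketch is an illustration only. It writes two made-up records to a scratch file, demo_fixed.dat (a hypothetical name), laid out in the columns listed in the codebook excerpt, and then reads them back.

```stata
* Write two fixed-column records to a scratch file (values are made up).
* Columns: 1-7 WTVAR, 8-14 CASE, 15-21 AREACODE, then four 2-column strings.
file open fh using demo_fixed.dat, write text replace
file write fh "    .25      1   2391 1 1 1 4" _n
file write fh "   1.66      2   9041 1 1 1 4" _n
file close fh

* Same column layout as the dictionary, given inline instead of in a .dct file.
* The /// at the end of a line continues the command (do files only).
infix WTVAR 1-7 CASE 8-14 AREACODE 15-21 str TRACK 22-23 ///
      str GWBUSHJO 24-25 str GWBECON 26-27 str GWBIRAQ 28-29 ///
      using demo_fixed.dat, clear

* Check storage types and the values that were read.
describe
list
```

For a real file with many variables, the dictionary file remains easier to maintain than one long command.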


2.3.

Reading in an Excel file

If the data file is “clean,” all you need to do is to save the file as a .csv file in Excel and import it into Stata. However, if the data file is “not clean,” editing it first will make it easier to import into Stata. Here is an example of a “not clean” Excel file. The picture marks these problems: header lines; variable names that include special characters, start with a number, or have spaces between words; values that each include a special character and a comma; and a blank line and a blank column.

Stata reads the values in the first line as the variable names. Header lines prevent the program from reading the variable names. Also, the program expects data from the second line, so in this example, Stata will convert all the variables to string.

Variable names in Stata cannot contain special symbols or start with a number. A name may technically start with an underscore (_), but avoid it, because Stata uses such names for its own variables (for example, _merge).

The following is an example of a “clean” Excel sheet. It has the following characteristics:

• The first line has Stata variable names: 32 characters or less, no special characters, and not starting with an underscore or number. Data begin from the second line.

• No blank rows or columns. (Blank cells are ok. Stata automatically adds a period (.) if numeric. Do not manually add . in blank cells.)

• Missing numeric data should be an empty cell or values defined as missing, such as 0, 9, or 99. A space (stored space, different from empty), dot, or any other non-numeric character such as n/a will cause the variable to become string.

• Commas in numbers or texts are particularly problematic because Stata may see them as a delimiter and will not read the data properly. You should remove the commas from numeric values before saving the file.


Once you have examined the file and made sure that it is clean, follow these steps to save a worksheet as a comma separated values file in Excel. As practice, let’s read in a sample Excel data file.

1. Open Internet Explorer, download auto.xls from https://webshare.princeton.edu/users/furuichi/auto.xls, and select Save to Disk.

2. Save the file in your H:\ directory.

3. Start Excel and read the file by selecting File => Open.

4. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).

5. Open Stata.

6. Change the directory in Stata.

Note: Renaming the file with a .csv extension in Windows Explorer is not the same as saving the file as a .csv file.

If the spreadsheet is small, you may copy the data and paste them into Stata’s data editor. Highlight all data in Excel, and select Edit => Copy. Open Stata, then select Data Editor. Right click and select Paste, or press the Control and v keys at the same time, to paste the Excel data contents into Stata’s Data editor.

Stata may mistakenly read numeric variables as strings. Check that original numeric values are numeric in converted Stata data by issuing the command –describe- in Stata and examining the storage type. If the variable has storage type that starts with "str," then Stata has made it a string variable.
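The import and the storage-type check can be tried without the Excel file. The sketch below first writes a comma separated file from the auto example data (auto_demo.csv is a made-up name standing in for the file saved from Excel), then reads it back with –insheet- and inspects it with –describe-.

```stata
* Create a small .csv file to practice on; auto_demo.csv is a hypothetical name.
sysuse auto, clear
outsheet make price mpg using auto_demo.csv, comma replace

* Read the comma separated file, replacing whatever is in memory.
insheet using auto_demo.csv, comma clear

* Look at the "storage type" column: str18 for make, numeric types for the rest.
describe
```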

If you see that a numeric variable in the original data file has string storage type in Stata, go back to Excel, and change the variable’s format into numeric, and re-save the file as .csv file.

Here is how:

1. Highlight the column with the numeric variable name.

2. Click Format => Cells.

3. In the Format Cells window, select the Number tab.

4. Under the Category drop down list, select Number.

5. Click OK.

6. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).

You can also change the variable type from string to numeric in Stata. At the command prompt, type

. destring stringvariablename , replace

For this command to work, the stringvariablename can not have any non-numeric characters as its value. If it fails, check the values of the variable to find non-numeric characters.
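When the offending characters are known, –destring- has options to deal with them. The sketch below is one possible approach on made-up values: ignore() strips the commas, and force turns anything still non-numeric into missing. Use force with care, since it discards those values without complaint.

```stata
* A tiny made-up string variable with a comma and a non-numeric entry.
clear
input str8 sales
"1,250"
"980"
"n/a"
end

* Strip commas; any value that is still non-numeric becomes missing (.).
destring sales, generate(sales_num) ignore(",") force

list
```

Using generate() instead of replace keeps the original string variable, so you can compare the two columns before dropping it.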


2.4.

Reading in Stata data

Now, let’s start using Stata. From an OIT computer, a link to Stata may be found in Start => All Programs => Stata10 => StataSE10. A shortcut to Stata may be available from the Special Applications folder on the desktop. Double click the Stata icon.

Each step below shows the command to type in the Command window (CMD), the equivalent menu selection (MNU), and notes and tips.

Stata starts in its default working folder, typically C:\Program Files\Stata\Stata10. Let’s change the directory to H:\StataHandsOn.

CMD: . cd H:\

Let’s create a StataHandsOn directory.

CMD: . mkdir StataHandsOn

CMD: . cd StataHandsOn

MNU: File => Change Working Directory... (navigate to H:), then select Make New Folder

Note: If you don't change the directory, Stata will assume that the file name you type is in the default directory. The log and data files you save will also be in this directory unless you change it.

Before reading in a data file, let’s open a log file. A log file stores the output that appears in the Results window.

CMD: . log using stata1.log

MNU: File => Log => Begin...

Note: If you do not open a log file at the beginning of the session, the output will only be available in temporary memory. Once you exit from the program, the output will be lost.

Note: A log file with the extension ".log" is a plain text file. This means you can open and read it in almost any text editor or word processor. If you issued the –log- command without the file extension,

. log using stata1

Stata would create "stata1.smcl". smcl is a log file type specific to Stata.

Note: Notice that “log on (text)” appears in the rectangular space between the Results window and the Command window once you begin a log.

Now let’s read in the 1978 Auto data. It is a data file that comes with the Stata installation, and is available in Stata format.

CMD: . sysuse auto

MNU: File => Example datasets => Example datasets installed with Stata => (auto.dta) use

Suppose you want to add a label to the data, so that you can remember what the dataset is about. This is convenient if you make many subsets of data files from the original file. As an exercise, let’s label the data to say that it is 1978 auto data for hands on training.

CMD: . label data "1978 auto data for hands on training"

MNU: Data => Labels => Label dataset

We will work more on this data, but let’s save the data at this time. We’ll give it a name, testauto.

CMD: . save testauto

MNU: File => Save as

Note: If you issue the –save, replace- command without specifying a file name, what is currently in memory will overwrite the original input file. To avoid losing the original data file by mistake, always remember to make a master copy before starting to work on data.

Let’s close the log file at this time and look at the file. Issue a command:

CMD: . log close

MNU: File => Log => Close

Note: Notice that “log on (text)” disappeared from the rectangular space after closing the log.

Use a text editor or MS Word to open the log file. You can also view a log file in Stata. Remember to include the extension with the file name when typing the –view- command.

CMD: . view stata1.log

MNU: File => Log => View

Let’s clear the data in memory at this time and exit from Stata.

CMD: . clear

CMD: . exit

MNU: File => Exit

Note: The Results window cannot be cleared while in session.
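The steps of this first session can also be collected in a do file and run in one go. The sketch below is one way to write it, not the only one. It assumes you have already changed to the directory where the files should go, and it adds the replace option so that the file can be rerun without "file already exists" errors, which the interactive commands above do not do.

```stata
* Show the current working directory; files below are written here.
pwd

* Close any log left open by an earlier run, then start a plain text log.
capture log close
log using stata1.log, replace

* Read the example data, label it, and save it under a new name.
sysuse auto, clear
label data "1978 auto data for hands on training"
save testauto, replace

log close
```

Because of the replace options, running this file overwrites stata1.log and testauto.dta in the working directory.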



Restart Stata, and check if we are in the directory we first specified.

. pwd then list the files:

. ls

It should show that we are at H:\StataHandsOn directory. Can you find the testauto.dta and auto1.log? What is the size of the data file?

It is a small data file, of only 5.4 kilobytes. As you may have seen at the very first screen, Stata’s memory may be initially set to 10 megabytes. Because this data file is smaller than what Stata allows in the memory at this time, you will not have problems reading in the data.

Let’s open the log file back on to continue to save the output on the same log file. Issue a command:

. log using stata1.log, append

To read testauto data:

. use testauto

If the data file is larger than Stata’s current memory, it will issue an error message. Check the file size and set memory to give Stata more space. For example, if the data file is 36 megabytes, type

. set memory 40m

It gives 40 megabytes worth of data memory in Stata to read the data.

Review Questions:

1.

How can I start Stata?

2.

Which directory is this program pointing at?

3.

How can I change the directory to H:\?

4.

How large is the Auto data?

5.

How do I read the data into Stata?

6.

How do I label the data?

7.

How do I save the data?

8.

I don’t know what commands to use. How do I get a help in Stata?

9.

How do I record the output?

File=>

Change working directory...

File=> Log=>

Begin, then select stata1.log, and Append to existing file

File=> Open

(no equivalent menu)

Notes and Tips pwd stands for Present Working Directory.

You can see the directory Stata is pointing at by looking at the bottom bar of Stata’s window. If you are not at H:\StataHandsOn, change the directoyr by typing in the Command window:

. cd H:\StataHandsOn

You can, of course, start a new log file instead of appending the new results to the existing log file.

To start a new log file, give a new file name as in:

. log using stata2.log

You could clear the data in memory and read in a new data file in one step, by issuing a command:

. use testauto, clear

Only one data file can be read into the Stata’s memory at a time. You need to clear the memory before reading in another set of data. (You can, however, open many instances of Stata in one computer.)

To see the maximum limits in Stata, type in the

Command window:

. help limits

Hints:

. cd C:\mydata

. dir

. ls

. set memory 20m

.

. label data "descriptions"

.

. help

.


3. Data management

Goal: combine, transpose, and reshape data files. Search, order, and rename variables.

You may have multiple files you want to combine. Some files are so large that they are split into multiple files for downloading, each having the same variables but different cases. If you want to combine datasets that have the same variables but different cases, the command you use is –append-. Some longitudinal studies follow the same individuals across time and ask the same or different questions. If you want to combine datasets that have the same cases or people but different variables, the command you use is –merge-.

3.1. Append

Let’s create sample data first, then we will try appending.

. sysuse auto

. keep make price mpg

. keep in 1/5

. save temp

. keep in 4/5

. save data1

data1 now contains observations 4 and 5 of temp.

. use temp, clear

. keep in 1/3

Let’s combine the two datasets. What I want to do is to stack data1 under the second dataset. While the second dataset is still in memory,

. append using data1

The resulting dataset contains all five cars: observations 1 to 3, followed by the two observations appended from data1.

3.2. Merge

This time, we will try merging. We will do a match merge, meaning we want to merge two files matched by the key variable, in this case MAKE. Because we are adding a variable, we want to make sure that the variable values are assigned to the right observations. For a match merge, it’s very important that BOTH files are sorted by the key variable.

. use temp, clear

. keep make price

. sort make (IMPORTANT!)

. save data2

data2 now contains MAKE and PRICE for the five cars, sorted by MAKE.

. use temp, clear

. keep make mpg

. sort make (IMPORTANT!)

. merge make using data2


Notice there is a new variable called _merge. Stata automatically creates this variable whenever files are merged. _merge can have values 1, 2, or 3. 1 means that the record is in the data that were in memory before merging, but not in data2. 2 means that the record is in data2 but not in the data that were in memory at the time of the merge. 3 means that the record was in both files. In this example, all records matched up in the two files.
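A short sketch of how _merge can be checked after a match merge. It assumes temp.dta and data2.dta were saved as described above and uses the Stata 10 form of the command; in Stata 11 and later the same match merge is written merge 1:1 make using data2.

```stata
* Assumes temp.dta and data2.dta exist in the working directory
use temp, clear
keep make mpg
sort make
merge make using data2

* Frequency of each _merge code: 1 = memory only, 2 = data2 only, 3 = both
tabulate _merge

* List any records that did not match
list make _merge if _merge != 3

* Drop _merge before the next merge; Stata 10 stops if it already exists
drop _merge
```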

3.3. Transpose

Transposing is switching observations and variables. In Stata, string values cannot be transposed, except for variable names. If you have string values, you will need to encode them and make them numeric before transposing.

The command to transpose is –xpose-.

This time, we will use another of Stata’s example files. To use this file, your computer needs to be connected to the internet.
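As a minimal sketch of how –xpose- behaves, the commands below use a small numeric piece of the auto data shipped with Stata rather than the web example file. That substitution is an assumption made for illustration only.

```stata
* Keep three numeric variables and three observations so the result is easy to read
sysuse auto, clear
keep price mpg weight
keep in 1/3
list

* clear is required because xpose replaces the data in memory;
* varname stores the original variable names in a variable called _varname
xpose, clear varname
list
```

After transposing, each original variable becomes an observation and each original observation becomes a variable (v1, v2, v3).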

3.4. Reshape

Reshaping a dataset is useful when you have, for example, time series data in which the same question asked at different times is stored as separate variables.

Here again, we will use one of Stata’s example files. Your computer needs to be connected to the internet to use this file.

. webuse reshape1

. drop ue*

I dropped ue variables to save space. If you keep them, you need to include ue in the reshape command after inc.

In this file, you see three persons’ incomes for 1980, 1981, and 1982. Suppose you want to have a variable called year, and have the income values listed for each year for each person. Then what you want to do is to reshape it into a long format.

. reshape long inc, i(id) j(year)

Here, the command tells Stata that inc is the stem of the variables (inc80, inc81, inc82) whose values need to be stacked for each person, identified by id, across years. The variable year does not exist in the pre-reshaped data; it is created to hold the suffixes that were attached to inc, and inc itself becomes a single variable. It may be easier to understand this by comparing post-reshaped data with pre-reshaped data.

Now you see that there is a new variable called year, and that id and sex are repeated for the different years.

Notes and Tips (for –xpose- in section 3.3): varname is an option, but clear is required in the xpose command. With the option varname, the transposed dataset will contain the original variable names as a variable.

To go from the long format to the wide format (from the post-reshaped file back to the pre-reshaped file in the above example), the command is –reshape wide-:

. reshape wide inc, i(id) j(year)
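The round trip between wide and long can be sketched in one sequence, listing the data at each stage so the two layouts can be compared. This assumes an internet connection for –webuse-; the sepby() option on –list- is only there to separate persons visually.

```stata
webuse reshape1, clear
drop ue*

* Wide format: one row per person, one income variable per year
list

* Long format: one row per person per year
reshape long inc, i(id) j(year)
list, sepby(id)

* Back to wide format
reshape wide inc, i(id) j(year)
list
```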


3.5. Organizing variables

If you have a large data file with many variables, it may be difficult to find variables by using –describe- or –codebook-. Here are some commands that may be helpful if you have a large data file. The command –lookfor- searches for variables whose name or label contains the keyword. Let’s use another example dataset to try the commands.

. sysuse nlsw88

Let’s see if there are variables that contain the words married, age, or education.

. lookfor married

. lookfor age

. lookfor education

You can order the variables alphabetically by the variable names.

. aorder

If you want to place variables at the front of the dataset, -order- var1 var2 moves var1 and var2 to the beginning, with var1 before var2. For example,

. order smsa age

moves smsa and age to the front of the dataset, with smsa immediately before age.

If you want to rename variables, the command is –rename-.

. rename idcode id

renames the variable idcode to the new name, id.

Review Questions:

1. How do you search for a variable in a dataset?

2. How do you order variables in an alphabetical order?

3. How can I move a variable next to another one?

4. How can I rename variables?

using Menus

File=> Example datasets...=> Example datasets installed with Stata, click use next to nlsw88.dta

(no equivalent in menu)

Data=> Variable utilities=> Alphabetize variables

Change order of variables in dataset

Rename variable

Notes and Tips

If Stata returns blank, it means there is no variable that contains the word. In this example, it does not mean there is no education variables. You can see that there are variables called grade and collgrad, as it is a small dataset. It did not find those variables because the keyword education was not a part of the variable name or label. So, -lookfor- helps you find the variable if you know what to look for. You would still need to read the codebook to know the appropriate keywords you can use to search for variables.
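Because –lookfor- matches only variable names and labels, it can help to try a few related keywords at once. A small sketch, assuming the nlsw88 example data; the keywords grad and college are illustrative guesses, not part of the original example.

```stata
sysuse nlsw88, clear

* -lookfor- accepts several keywords and lists variables whose
* name or label contains any of them
lookfor grad college

* Compare with the single keyword used above
lookfor education
```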

If you have many variables with the same stem, such as education1, education2, ..., you can rename the stem education to edu by using the command renpfix

. renpfix education edu

This command will rename education1, education2, ... to edu1, edu2, ....

Hints:

. lookfor

. order

.

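4. Explore data

Goal: find out what information is in the data – how many variables are in the data, what variables are in the data, and what they mean.

Now let’s see what the auto data file contains. If you worked through the previous section, another dataset is in memory, so read the auto data in again first (for example, with –use testauto, clear-).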

. describe

The -describe- command shows you the path, label, date, and the size of the data file, the number of observations and variables, and the name, type, format, and label of the variables in the dataset.

You will also notice that it says “_dta has notes.” Let’s see what the notes say.

. notes

You can also add your own notes to the data.

. note: I used this data set in a hands-on training course during the fall of 2008.

See it by typing –notes-.

We also see that the variable FOREIGN has a value label called origin.

You can see the details about the label by typing:

. labelbook origin

Suppose we want to know what REP78 is about. The –codebook- command gives you the details of a variable. Type:

. codebook rep78

If you want to get a quick summary of numeric variables, -inspect- reports the number of negative, zero, and positive values; the number of integers and non-integers; the number of unique values; and the number of missing; and it produces a small histogram. Try:

. inspect mpg

The –list- command lists values of the variables in your dataset in the Results window. Similarly, -browse- opens the Data Browser. You can restrict either command to the variables you want to see. For example,

. list make

. browse make

In using –list-, you may see –more- at the bottom of the screen. To scroll down, hit the space bar or click –more-.

If you only want to see the first five observations of the variable MAKE, type

. list make in 1/5

. browse make in 1/5

using Menus

Data=> Describe data=> Describe data in memory

Data=> Notes=> List notes

Data=> Notes=> Add notes

Data=> Notes=> List notes

Data=> Labels=> Label values=> Produce codebook of value labels

Data=> Describe data=> Describe data contents (codebook)

Data=> Describe data=> Inspect variables

Data=> Describe data=> List data

Data=> Data Browser

click by/if/in tab in list dialog box, and select Use a range of observations

Notes and Tips

There are two types of variables: numeric and string. Numeric variables are numbers. String variables contain text, which can include any characters on the keyboard: letters, numbers, and special characters.

The storage type refers to the size used in storing the variables. Numeric variables’ storage types include byte, int, long, float, and double. String variables have storage types that begin with "str", followed by a number indicating the maximum length of the string: e.g., str18.

We can do numeric calculations and statistical analysis on numeric variables, but not on string variables. A variable that looks like a number, for example, “20025” could be either a string (a set of five characters that happen to be numbers, like a zip code) or a numeric value (the integer that’s after 20024). It's important to check the variable types to know how you can analyze those variables.
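To make the difference between the two types concrete, here is a small sketch using the auto data. The variable mpg_text is a hypothetical name introduced only for this illustration.

```stata
sysuse auto, clear

* The storage type column shows which variables are strings (str#) and which are numeric
describe make price mpg

* mpg_text (hypothetical) holds the same digits as mpg, but stored as text
generate mpg_text = string(mpg)
describe mpg mpg_text

* summarize reports 0 observations for the string copy
summarize mpg mpg_text
```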

You can also click on Data Browser to see the data file. While you have the browser open, you can not enter commands. Closing the browser does not delete the data file.



The –list- command is particularly helpful to use after sorting data, or combining with if. For example, you can obtain five minimum values of MPG by listing the first five records after sorting.

. sort mpg

. list mpg in 1/5

Suppose you want to see the make of the cars whose price is less than $5000. Try:

. list make price if price<5000

The –if- qualifier

The –if- qualifier is used to isolate a set of observations with variables meeting some particular criteria. Values on variables in a dataset are compared to values on other variables or to numbers or strings using logical comparison operators.

Operator     Meaning
==           equal to
!= or ~=     not equal to
>            greater than
>=           greater than or equal to
<            less than
<=           less than or equal to

You can put spaces around these operators (e.g., either a >= b or a>=b), but you cannot put spaces within them (e.g., it must be ‘>=’, not ‘> =’).

Combining tests: -and- and –or-

-if- can be combined with and (&) to test more than one condition. Let's say you want to find out the MAKE of the cars whose MPG is greater than 30 and PRICE is less than $5000.

. list make if mpg>30 & price<5000

-if- can also be combined with or (|) to look at cases where at least one of two or more conditions is met. For example:

. list make if mpg>30 | price<5000

It is possible to combine the & and | operators. If you have both in one command, & takes precedence over |. Use parentheses to help you organize them and avoid errors, as combining & and | can make the conditions complicated.
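One quick way to see the precedence rule at work is to count the matching observations instead of listing them. This is only a sketch using the auto data; the particular cutoffs are arbitrary.

```stata
sysuse auto, clear

* Parentheses force the | to be evaluated first
count if (mpg>=30 | price>2000) & rep78<4

* Explicit grouping that mirrors Stata's default precedence (& before |)
count if mpg>=30 | (price>2000 & rep78<4)

* No parentheses: evaluated the same way as the previous command
count if mpg>=30 | price>2000 & rep78<4
```

The last two commands describe the same condition, so they should report the same count, while the first can differ.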

. list make if (30<= mpg | 2000<price ) & rep78<4

returns different results from:

. list make if 30<=mpg | 2000<price & rep78<4

using Menus

Data=> Sort=> Ascending sort

Data=> Describe data=> List data, click by/if/in tab in list dialog box, and select Create, type in price<5000

Notes and Tips

To list last five records (maximum values):

. list mpg in -5/-1

To sort in reverse order, use –gsort-.

. gsort -mpg

sorts MPG in descending order.

. gsort +mpg

is the same as . sort mpg.

Pay special attention to the double equals sign! If you are testing for equality, use a double equals sign (==). A single equals sign (=) is used for assignment, to set something equal to something else.

For example, if you want to list all information in the dataset about cars whose MAKE is “Subaru”, you would type:

. list if make=="Subaru"

String values need to be put in quotes, and the comparison is case sensitive.

Whereas if you want to create a new variable called POWERSTEER for cars whose make is Subaru, you would type:

. generate powersteer=1 if make=="Subaru"

Refer to the section on “Transform variables and records” for more information on creating new variables.

Note that the –if- statement is included only once.

. list make if mpg>=30 & mpg<=40 (OK)

. list make if mpg>=30 & if mpg<=40 (won’t work)

. list make if 30<= mpg <=40 (won’t work)

“|” can be typed by pressing Shift and \. The \ key is between the Backspace and Enter keys on most keyboards.

Among missing values, the values after a plain period increase with the letter that follows the period. So, .a is larger than ., .b is larger than .a, and .z is the largest missing value.



About Missing Values

Stata indicates a missing numerical value as a period (.), and a missing string value as an empty string, “”. Missing numerical values are treated as larger than any number.

We know from the previous examination (. codebook rep78) that five out of 74 records of REP78 are missing. We can use the period to indicate missing records in the command and see which MAKEs of cars are missing REP78 in the data.

. list make if rep78 >= .

Check that a period is the largest value in rep78 by sorting by rep78 and listing the last six values.

. sort rep78

. list make rep78 in -6/-1
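An alternative way to write the same checks is the missing() function, which covers the plain period as well as .a through .z. A brief sketch with the auto data:

```stata
sysuse auto, clear

* missing() is true for . and for the extended missing values .a to .z
list make rep78 if missing(rep78)

* rep78 < . keeps only the non-missing values
count if rep78 < .

* Without a guard, missing values pass a "greater than" test
count if rep78 > 4
count if rep78 > 4 & !missing(rep78)
```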

Review Questions:

5. How many variables and records are in the data?

6. What does the note say?

7. How can I add notes or comments to the data?

8. What variables are in the data?

9. How do I sort?

10. Which variables have missing values?

11. List the cars for which data is missing.

12. List the cars whose repair record is less than 3 and the price is less than $5,000.

using Menus

Data=> Describe data=> List data, click by/if/in tab in list dialog box

Data=> Sort=> Ascending sort

Hints:

. describe

. codebook

. labelbook

. inspect

. sort

. gsort

. list [if] [in]


5. Obtain descriptive statistics

Goal: find out number of missing records, minimum and maximum values, means, and medians, view frequency tables, and cross tabulations.

The commands that are useful for getting basic descriptive statistics include tabulate, summarize, tabstat, and table.

The –tabulate- command gives you a frequency distribution if only one variable is specified, and a cross-tabulation if two variables are specified. If two variables are specified, the first variable will be shown in rows, and the second in columns.

. tabulate rep78

. tabulate rep78 foreign

-summarize- gives the number of valid observations, mean, standard deviation, minimum, and maximum values.

. summarize price mpg

What if you wanted to see the average MPG for foreign and domestic cars? The –tabulate- command can be combined with –summarize- to produce a summary of one variable for each category of the variable specified in –tabulate-. For example, if you want to see the average MPG by car type, type:

. tabulate foreign, sum(mpg)

If you want to see more statistics such as total, range, or median, you may use tabstat.

. tabstat price mpg, stat(sum range median)

There are more statistics you can see using tabstat. See . help tabstat for a list of statistics.

The –table- command lets you create three-way (or four-way if combined with –by-) cross-tabulations. We can try that after we create more categorical variables in the next section.

Review Questions:

1. Which five cars yield the lowest gas mileage? Which five cars yield the highest gas mileage?

2. What is the average price and average miles per gallon (MPG) of a car in the 1978 auto data?

3. What is the average price of cars that are below and above the mean MPG?

4. What is the median MPG?

5. How are price and MPG different for domestic and foreign cars?

6. How can I see the number of cars by the car type?

7. How are the cars distributed by the repair records?

8. Compare frequency-of-repair records for domestic and foreign cars.

using Menus

Statistics=> Summaries, tables, and tests=> Tables=> One-way tables, or All possible twoway tabulations

Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Summary statistics

Statistics=> Summaries, tables, and tests=> Tables=> One/two way table of summary statistics

Notes and Tips

–tabulate- cannot cross-tabulate more than two variables. If you have more than two categorical variables to crosstab, use –table- (see below).

Because –tabulate- gives you frequency counts, it makes more sense to use it for categorical variables than for continuous variables.

It would make sense to summarize continuous variables rather than categorical variables.

You can also see the average MPG by FOREIGN by using –by- and –summarize-.

. bysort foreign: summarize mpg

To use –by-, the data have to be sorted by FOREIGN. You could do . sort foreign, then . by foreign: summarize mpg. –bysort- does the sorting and the by in one step.

Stata allows shorthand in some commands. –sum- is the shorthand for –summarize-. The shorthand is shown as underlined letters in the help page.
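As a further illustration of the commands in this section, –tabstat- can also break statistics down by group, and –summarize- has a detail option that adds percentiles, including the median. A sketch with the auto data; the choice of statistics is arbitrary.

```stata
sysuse auto, clear

* Statistics for price and mpg, separately for domestic and foreign cars
tabstat price mpg, by(foreign) statistics(n mean median min max)

* detail adds percentiles, variance, skewness, and kurtosis
bysort foreign: summarize mpg, detail
```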

Hints:

. sort

. list

. summarize

. tabulate

. table

. by groupingvarname: summarize varnames


6. Transform variables and records

Goal: create and label new variables, modify existing variables, keep or delete variables and records from the file, recode values, create dummy variables from existing variables.

The basic commands for creating new variables and modifying old ones are –generate- and –replace-.

The command

. generate newvar = something

creates a new variable named newvar and sets it equal to something. Something can be a number, a string, a mathematical expression, or a function of other variables. You can combine –if-, -&-, and -|- in generating new variables.

. generate two = 1+1

. generate mycars = 1 if (rep78==1 & price<5000) | (rep78==2 & price<5000)

The –replace- command is used to make changes to existing variables:

. generate domestic=0

. replace domestic = 1 if foreign==0

Remember that missing values are larger than numbers. So, if you use the –if- qualifier to indicate values larger than a specified value, it could include missing values. For example, say you want to group cars into two categories by the repair rating, high-repair-rating cars and low-repair-rating cars.

. generate hirep =1 if rep78>=3

. replace hirep = 0 if rep78<3

HIREP is also set to 1 for the cars whose repair records are missing. Check it by listing the value of HIREP when REP78 is missing.

. list hirep if rep78==.

Now that we know HIREP wrongly includes the cars with missing repair records, let’s delete the variable.

. drop hirep

To exclude the missing values, you need to specify:

. generate hirep =1 if rep78>=3 & rep78!=.

. replace hirep = 0 if rep78<3

Say you want to group repair ratings into 3 groups. The easiest way to re-group existing variables would be to use –recode-. Given the option , gen(newrep78), the command -recode- recodes REP78 into a new variable, NEWREP78, instead of overwriting the existing variable, REP78.

. recode rep78 (1/2=1) (3/4=2) (5=3), gen(newrep78)

using Menus

Data=> Create or change variables=> Create new variable, then click if/in tab, select Create..., type in criteria in the window, click OK

Data=> Create or change variables=> Change contents of variable

Data=> Variable utilities=> Keep or drop variables=> select Drop variables

Other variable transformation commands=> Recode categorical variable

Notes and Tips

You normally want to use replace for second and later steps in multi-step variable creations. When you modify existing variables, make sure you will still have a way to recreate the original variable or have a back-up copy of the variable. Once you write over an existing variable, there is no way to get the original data back.

Notice that hirep disappears from the Variables window after –drop-. Once you delete a variable, you cannot undo the deletion.

If you issue the command –preserve- before removing a variable, you may restore the deleted variable by issuing the command –restore-. This is a temporary measure and only works as a set. Once you issue the –restore- command, you need to issue another –preserve- command before you can restore again.

If you do not specify a new variable name with the generate option, you will overwrite the original variable. Let’s try that with the –preserve- and –restore- commands.

. tab rep78

. preserve

. recode rep78 (1/2=1) (3/4=2) (5=3)

. tab rep78

. restore

. tab rep78

gen is short for generate.
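Returning to the HIREP example, the two-step generate/replace can also be written in one line that leaves the indicator missing wherever REP78 is missing. This is a sketch rather than the guide's own method; hirep2 is a hypothetical name chosen so that it does not clash with HIREP.

```stata
sysuse auto, clear

* (rep78 >= 3) evaluates to 1 or 0; the if qualifier leaves hirep2
* missing for cars whose rep78 is missing
generate hirep2 = (rep78 >= 3) if !missing(rep78)

* Cross-check against the original variable, showing missing values too
tabulate rep78 hirep2, missing
```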



We have already seen how to create a dummy variable (whose outcome is either 0 or 1) using –generate- and –replace-.

Another easy way to create dummy variables is to use the –tabulate- command. The –tabulate- command, when used with a generate option, produces a dummy variable for each value.

For example, suppose we want to create a dummy variable for each of the outcomes of the categorical variable REP78.

. tabulate rep78, gen(dumrep78)

Suppose you want to group a continuous variable, PRICE, into five equal ranges. First find out the minimum and maximum value that you want to use to group the PRICE by using –summarize-. Then,

. generate ivprice = autocode(price,5,3291,15906)

If you want to group PRICE into five groups of equal frequencies, first sort PRICE, then issue the following command:

. sort price

. generate fqprice = group(5)
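Another way to form groups of roughly equal frequency is –xtile-, which does not require sorting first. This is offered as an alternative sketch, not as the method used in this guide; qprice is a hypothetical variable name.

```stata
sysuse auto, clear

* Assign each car to one of five quantile groups of price
xtile qprice = price, nquantiles(5)

* Check the group sizes and the price range within each group
tabulate qprice
tabstat price, by(qprice) statistics(n min max)
```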

Now, we have several more categorical variables to make a four-way table. Let’s create a table of repair records by HIREP by IVPRICE by FOREIGN. Here is how:

. table rep78 hirep ivprice, by(foreign)

You can label the variables so that you know what they are later on. Let’s add a label to HIREP as an example.

. label variable hirep "repair record is 3 or higher"

. label define yesno 1 "yes" 0 "no"

. label values hirep yesno

Review Questions:

What is the command to

1. create new variables?

2. delete variables?

3. regroup variables?

4. group continuous variables?

5. create dummy variables?

using Menus

Other variable creation commands=> Create indicator variables

Create new variable, then enter autocode function in the box

Data=> Sort=> Ascending sort

Statistics=> Summaries, tables, and tests=> Tables=> Table of summary statistics (table)

Data=> Labels=> Label variable

Data=> Labels=> Label values=> Define or modify value labels

Data=> Labels=> Label variable

Data=> Labels=> Label values=> Assign value labels to variable

Notes and Tips

Scroll down the Variables Window to see what Stata created. Alternatively, view the list of variables by:

. describe

You can also add notes to the variables.

. note hirep: "temporary variable created on October 1, 2006"

When you describe data, (-describe-) you will see an asterisk (*) by the variable label indicating that the variable hirep has notes.

See the notes by typing

. notes

The maximum number of variables you can list in –table- is three.

-label variable- adds a label to the variable.

-label define- defines the values of a label.

The label name can be different from the variable name, and can be used for other variables.

-label values- attaches the label to the variable.

Hints:

. generate newvar =

. drop varnames

. recode oldvar (1/2=1) (3/4=2) (5=3), gen(newvar)

. generate varname = group(5)

. generate newvar = autocode(oldvar,5,min,max)

. generate newvar = 0

. replace newvar = 1 if oldvar > 6165


7. Graph

Goal: view the relationships of the variables by graphing and save graphs.

Stata has several graphs for graphing distributions of individual variables and the relationships among variables, as well as many more specialized graphs. Shown here are commands for some basic graphs. You may explore graphs using the menus as well. In Stata, graphs appear in separate windows that pop up. The graphs do not appear in the Results window, and will not be stored in the log file. If you want to save the graphs, you will need to save each graph as a file.


Here's a simple histogram of PRICE.

. histogram price

You can see the histogram separately for different groups.

For example, you can see a histogram of price for foreign and domestic cars separately and have Y values in frequency.

. histogram price, by(foreign) freq

Another popular graph is box plot. Let’s see box plots of price by foreign.

. graph box price, by(foreign)

The basic command for drawing a bivariate graph is twoway .

The command twoway is followed by a keyword indicating the type of graph. To obtain a scatter plot showing the relationship between MPG and WEIGHT, type

. graph twoway scatter mpg weight

We can obtain the scatter plot by the car type, FOREIGN .

. graph twoway scatter mpg weight, by(foreign)

Twoway graphs can be overlaid: you can draw two twoway graphs on the same set of axes. A common use of this is to draw a scatterplot with a regression line laid over it to show how the regression line fits the data.

We will overlay a scatter plot with a fitted regression line for MPG and WEIGHT.

. graph twoway (scatter mpg weight) (lfit mpg weight)

Let’s save the graph. On the Command Window, type:

. graph save OverlaidMpgWeight

Once it’s saved, close the graph window, and bring it up again.

. graph use OverlaidMpgWeight
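If you rerun these commands, –graph save- stops when the file already exists unless the replace option is given. The sketch below also exports the graph to an image file for use outside Stata; the PNG format and the title text are assumptions made for illustration, and the available export formats depend on your platform.

```stata
sysuse auto, clear

* Draw the overlaid graph with a title
twoway (scatter mpg weight) (lfit mpg weight), title("MPG versus weight")

* Save in Stata's own graph format, overwriting an earlier copy
graph save OverlaidMpgWeight, replace

* Export a copy as an image file
graph export OverlaidMpgWeight.png, replace
```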

Review Questions:

How can I …

1. make a histogram of MPG?

2. see a scatter plot of MPG against WEIGHT?

3. fit a regression line over the previous scatter plot?

4. bring the graph up again after I close the graph window?

using Menus

Graphics=> Histogram, insert variable name PRICE in the Variable: box and check the box next to Bins, change the number to 5

Graphics=> Box plot

Graphics=> Twoway Graph, click Create, select Scatter in the Basic plots: box, Y variable: mpg, X variable: weight, click Accept, then in the “By” tab, select Draw subgraphs..., input foreign in Variables: box

Graphics=> Twoway Graph, click Create, select Fit plots under plot category, and Linear prediction under Fit plots:, Y variable: mpg, X variable: weight

File=> Save Graph... or, in the Stata Graph window, File=> Save

File=> Open Graph...

Notes and Tips

For an introduction to Stata graphs, type

. help graph intro

Default Y value of histogram is density.

To see the histogram in frequency or percentage, type freq or percent after a comma:

. histogram price, freq

To see more options, see

. help histogram

Typing scatter y x draws a graph of y against x.

Here, scatter and lfit are plot types within the twoway family. Alternatively, you can use || to separate the plot types.

. graph twoway || scatter mpg weight || lfit mpg weight

You do need to separate the plot types by the parentheses or the pipes.

Hints:

. histogram

. graph twoway scatter

. graph twoway (scatter y x) (lfit y x)

. graph save

. graph use


8. Obtain difference of means statistics

Goal: obtain Pearson’s chi-square, t-test, and analysis of variance statistics.

Once we reviewed the variables in the dataset, we may want to see the relationship among the variables. In the cross-tabulation of repair records obtained above, domestic cars appeared to have poorer frequency-of-repair records. Is the difference statistically significant? Let’s obtain a chi-square statistic to test the hypothesis that the frequency-of-repair records are different by the car type.


. tabulate rep78 foreign, chi2

Suppose we reviewed literature on the automobiles made in 1978, and hypothesize that the average MPG of 1978 cars is 20. To test this hypothesis, do a one sample t-test.

. ttest mpg==20

Comparing domestic and foreign cars, it appears that the average MPG differs by the car type. To test a hypothesis that the MPG is the same for foreign and domestic cars, let’s do a two-sample t-test.

. ttest mpg, by(foreign)

We suspect that MPG is really influenced by the cars’ repair records. I want to examine if the mean MPG is significantly different among cars that have different repair records.

. oneway mpg rep78

Suppose that we then decided to keep the impact of foreign in the model in addition to the repair-record in examining miles per gallon. To run two-way analysis of variance,

. anova mpg rep78 foreign

What if I also wanted to see the impact of weight, which is a continuous variable? Analysis of covariance can be done in Stata using the anova command with the continuous option.

. anova mpg rep78 foreign weight, continuous(weight)
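One check on how these tests relate: with a two-category grouping variable such as FOREIGN, the two-sample t-test (with equal variances, the default) and one-way ANOVA test the same hypothesis, so the F statistic equals the square of the t statistic. A sketch using the results that the two commands leave behind; the scalar name t2 is hypothetical.

```stata
sysuse auto, clear

* Two-sample t-test; the t statistic is left behind in r(t)
ttest mpg, by(foreign)
scalar t2 = r(t)^2

* One-way ANOVA on the same grouping; the F statistic is left behind in r(F)
oneway mpg foreign

* The two numbers should agree
display "t squared = " t2 "   F = " r(F)
```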

Review Questions:

How do I obtain…

1. a chi-square statistic?

2. t-test statistics?

3. one-way ANOVA statistics?

4. two-way ANOVA statistics?

using Menus

Statistics=> Exact statistics=> Two-way tables with measures of association=> select Likelihood-ratio chi-squared

Statistics=> Summaries, tables, and tests=> Classical tests of hypotheses=> One-sample mean-comparison test

Statistics=> Summaries, tables, and tests=> Classical tests of hypotheses=> Two-sample mean-comparison test, in “by/if/in” tab, select Repeat command by groups, then input foreign in Variables that define groups:

Statistics=> Linear models and related=> ANOVA/MANOVA=> One-way ANOVA

Statistics=> Linear models and related=> ANOVA/MANOVA=> Analysis of variance and covariance

Notes and Tips

One-way analysis of variance tests whether the means of mpg differ across categories of repair record.

If instead I wanted to see the mean difference by foreign, the one-way result is the same as the ttest result, as the variable foreign has only two categories.

To learn more about ttest, oneway, or anova, use help.

Hints:

. ttest

. anova


9. Obtain linear regression estimates

Goal: run a multiple linear regression model.


In estimating relationships among variables, you may first want to examine how the variables are correlated.

We suspect that MPG and WEIGHT are correlated.

Let’s see the correlation:

. correlate mpg weight

In addition, we suspect that the correlation may be different between foreign and domestic cars. We can combine the –correlate- command with a by statement.

Before using a by statement, the data need to be sorted by the by-variable.

. sort foreign

. by foreign: correlate mpg weight

It seems that mpg and weight have a relatively high correlation. The correlation is different for foreign and domestic cars, so FOREIGN may also be related to MPG.

From the scatterplots we saw earlier, we also discovered that the relationship between WEIGHT and MPG is not exactly linear. We’ll include the square of WEIGHT to improve the model. Let’s run a regression estimating MPG by WEIGHT, WEIGHT2, and FOREIGN. (WEIGHT2 must be created first; see Notes and Tips.)

. regress mpg weight weight2 foreign

After estimating a regression model, we can use the values estimated by the model, called post-estimation values. Using the estimated MPG, we can see how the estimated line fits the original distribution by viewing an overlaid graph. To do so, we first need to create a variable for the predicted MPG. We’ll call this MPGHAT.

. predict mpghat

. graph twoway (scatter mpg weight) (line mpghat weight), by(foreign)
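Put together, the steps in this section can be run as one sequence. The sketch below adds two things that are my own assumptions about what makes the example run cleanly: the squared term is generated before the regression, and the sort option is added to the line plot so that the fitted values are connected in order of WEIGHT.

```stata
sysuse auto, clear

* Create the squared term before it is used in the model
generate weight2 = weight^2

regress mpg weight weight2 foreign

* Predicted values (the default for predict is the linear prediction, xb)
predict mpghat

* Observed points with the fitted curve, separately for domestic and foreign cars
graph twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)
```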

Review Questions:

1. What is the correlation between MPG and WEIGHT?

2. Is the correlation different between domestic and foreign cars?

3. How do I obtain regression estimates?

4. How can I compare observed and predicted values on a graph?

using Menus

Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances

Data=> Sort=> Ascending sort

Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances, in “by/if/in” tab click Repeat command by groups, insert foreign in Variables that define groups:

Statistics=> Linear models and related=> Linear regression

Statistics=> Postestimation=> Predictions, residuals, etc.

Graphics=> Two-way graph (if there are already defined plots in “Plot definitions:”, either Disable or Edit them to create new combinations)

Notes and Tips

. pwcorr mpg weight, star(.05)

adds an asterisk (*) next to the correlation coefficients that are statistically significant at the 5% level.

You can also sort and use “by statement” in one step:

. bysort foreign: correlate mpg weight

There is a series of regression diagnostics you can do using graphs. See UCLA’s Stata tutorial site for more information.

To compute the square of WEIGHT, WEIGHT2, you can multiply WEIGHT by itself or raise it to the power of 2. The following two commands do the same thing, so run only one of them:

. generate weight2 = weight*weight

. generate weight2 = weight^2

Stata has a series of “post estimation commands.” After estimating a regression, for example, you can test whether a coefficient is statistically significantly different from 0 or from the coefficient on another independent variable (Wald test), or test for heteroscedasticity. For details, see

. help regress postestimation
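
As a sketch of what these commands look like after the regression used in this section (the particular hypotheses are chosen only for illustration, and WEIGHT2 is assumed to exist already):

. regress mpg weight weight2 foreign

. test foreign

. test weight weight2

. test weight = foreign

. estat hettest

Here -test foreign- tests whether the coefficient on FOREIGN is 0, -test weight weight2- tests the two weight coefficients jointly, -test weight = foreign- tests whether two coefficients are equal, and -estat hettest- runs the Breusch-Pagan / Cook-Weisberg test for heteroscedasticity.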

The option , xb that appears when the menu is used is the default for -predict-, so it does not need to be typed in the Command window. It will not appear in the Results window when the command is entered in the Command window without it.

Hints:

. by

. predict yhat

. graph twoway (scatter y x) (line yhat x)


10. Do files

When you have long computations, or need to repeat or modify existing computations, it may be helpful to create a file that contains a set of Stata commands. Such files are called “do files” in Stata. Do files can be created by manually entering commands in any text editor, or by using Stata’s do-file editor. In Stata, the do-file editor can be opened by:

CMD : .doedit

MNU : Window=> Do-file editor=> New do-file
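
As an illustration, a small do-file might contain the lines below. This is only a sketch: the file name example.do and the log name example.log are hypothetical, and the auto data is assumed to be loaded with -sysuse auto-.

    * example.do (hypothetical file name)
    version 10
    capture log close
    log using example.log, text replace
    sysuse auto, clear
    generate weight2 = weight^2
    regress mpg weight weight2 foreign
    log close

If the file is saved in the current working directory, it can be run from the Command window with:

. do example.do

Lines that begin with an asterisk (*) are comments. The -capture log close- line closes any log that is still open and keeps the do-file from stopping with an error when no log is open.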

You may also create do-files by saving the commands you submit interactively. When you start a Stata session, start a “command log,” which is a log file that contains only the commands. By default it attaches the .txt file extension if you do not specify one, so either give the .do extension when you start the command log or rename the file afterwards in Windows Explorer. For this command, I have not found a menu version.

CMD : .cmdlog using filename.do
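
A short session with a command log might look like the following sketch. The file name mysession.do is hypothetical, and the commands between opening and closing the log are only examples.

. cmdlog using mysession.do

. sysuse auto, clear

. summarize mpg weight

. cmdlog close

. doedit mysession.do

The last command opens the saved commands in the do-file editor, where you can review and clean them up before running them again.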

If you forget to start a command log, you may save the commands in the Review window. First, right click in the Review window, then select “Select All”. Right click in the Review window again, then select “Send to Do-file Editor”. You can eliminate commands that produced errors by clicking _rc at the top of the Review window, which sorts the commands by error code; then select the error commands, right click, and choose “Delete”. You can re-sort the commands into the original order by clicking the top of the numbered column on the far left. By the same token, you can sort the commands by clicking the top bar where it says “Command” and delete commands like –browse- and –help-.

By the way, if you use the menus for help and search, they do not appear in the Review or Results window.

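11. Shortcut menus

The toolbar buttons are shortcuts for the following actions. Numbers in parentheses refer to the notes below.

Open do-file editor
Open data editor
Open data browser
Open data
Save data
Quit (4)
Print results
Log (1)
Open/close viewers
Graph window (2)
Scroll Results window (3)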

1. Begins a log if no log file is open. If a log file is open, it lets you view, close, or suspend the log. You may append to a previous log by selecting an existing log file. The dialog box changes accordingly.

2. Brings the Graph window to the front. It only becomes active when a graph window is open.

3. Scrolls the Results window one screen at a time when you have –more- at the bottom of the Results window. It is equivalent to hitting the space bar or clicking –more-.

4. Quits processing. Useful when a process is taking a long time and you want to stop it, or when you have –more- but do not want to see more. It is equivalent to pressing q in the Command window, or pressing Ctrl and c at the same time.


12. Exporting results

You can copy what appears in the Results window by highlighting and right clicking. There are several options: Copy Text, Copy Table, Copy Table as HTML, and Copy as Picture. Here are pasted tables for each.

Copy Text

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

Copy Table as HTML and Copy Table

Repair Record 1978    Freq.    Percent    Cum.
1                     2        2.90       2.90
2                     8        11.59      14.49
3                     30       43.48      57.97
4                     18       26.09      84.06
5                     11       15.94      100.00
Total                 69       100.00

Copy as Picture

Repair Record 1978    Freq.    Percent    Cum.
1                     2        2.90       2.90
2                     8        11.59      14.49
3                     30       43.48      57.97
4                     18       26.09      84.06
5                     11       15.94      100.00
Total                 69       100.00

If you are pasting tables into Excel, copying either as table or HTML will work well.

If you are pasting tables into Word, copying as picture seems to produce the best appearance. If you paste them as pictures, though, the contents can only be modified using graphics software.

Log files with the extension .log can be opened in Word. Log files with the extension .smcl will show Stata’s formatting tags when opened in Word. See the command in the next section to convert .smcl files into .log files.

Graphs saved as a picture (see section 7. Graph) can be imported into a document. There are several options for the format. Use the drop down list in the Save As box to select one. Graphs can also be copied and pasted into another application like MS Word. Right click the graph you want to copy, then select Copy Graph. Paste the graph in Word using Edit=> Paste, right click and Paste, or press Control and v at the same time. Graphs copied into Word 2003 may not appear correctly when the file is converted into Word 2007.

There are also user created commands to output results. You may check out commands such as outreg, outreg2, estout, tabout, est2tex, mktab, and xml_tab. To read about the commands, use search. For example, type in Stata’s command window,

. search outreg, all

Note about user created commands : Stata, being a programmer friendly program, makes it easy to install and use user made commands. If you see a user made command that you want to use, you can install it by first finding the command by searching for it (you can also type -findit- commandname in Stata’s Command window) and clicking the blue letters “click here to install.”

The help pages on the commands become available after installing the program.
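
Many user created commands, including outreg2 and estout, are distributed through the SSC archive and can also be installed directly from the Command window. A minimal sketch, assuming your computer has an Internet connection and you are allowed to install files into Stata’s ado directories:

. ssc install outreg2

. help outreg2

The second command opens the help page that becomes available once the installation has finished.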


13. Other helpful commands

If working with a large file:

You can describe data without loading the data by specifying the location and the name of data file.

. describe using datafilename

You can load only the variables you need by specifying the variable names.

. use var1 var2 var3 using datafilename
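
For example, with a hypothetical file C:\data\bigfile.dta that contains the variables id, income, and age (the path and the variable names are made up for illustration), the two commands could be used like this:

. describe using "C:\data\bigfile.dta"

. use id income age using "C:\data\bigfile.dta", clear

. use id income age if age >= 18 using "C:\data\bigfile.dta", clear

The last line also restricts the observations that are loaded, which can reduce the memory needed even further. The clear option drops whatever data is currently in memory before loading.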

Some commands produce a log that is more than a page long (-compress-, for example). To save yourself from pressing a key to scroll each page, you may use

. set more off

If you are seeing –more- at the end of the screen after typing search, and want to quit seeing more screens, press q or control and c keys at the same time. Clicking red X button does the same thing.

You can save some memory by compressing the data.

. compress

Shortcuts

Stata can fill in a variable name with the tab key after enough characters to recognize the name are entered. For example, while you have the auto data open, try:

. describe h [hit tab key] Stata fills in the rest of the variable name as headroom

You can bring up previously used commands in the Command window by hitting the Page Up key.

You can refer to a set of variables with the same stem using an asterisk (*), as in:

. describe weight* if you had created weight2, it will show both weight and weight2

Miscellaneous

If you forget to start a log file at the beginning of a Stata session, but want to save what you have in the output window, use

. translate @Results outputfilename.txt

The file can be viewed using a text editor or a word processor.

Note : -translate- only saves what is in the buffer (what you see in the Results window). Depending on the length of the output you had produced, earlier results may have been lost. It is a good habit to start a log file each time you start a Stata session.

If you created a Stata log file that has the file extension .smcl, you can reformat it into a text file by giving the command:

. translate filename.smcl filename.log

If you want to perform a mathematical operation on the spot, you can use the –display- command.

. display 1+1 => will return 2
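
The same command works with other operators and with Stata’s built-in functions. Two more small examples:

. display 2^10

. display sqrt(16)

These return 1024 and 4, respectively.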


14. On-line tutorials

UCLA http://www.ats.ucla.edu/stat/stata/

UNC http://www.cpc.unc.edu/services/computer/presentations/statatutorial/

Princeton http://data.princeton.edu/stata/
http://www.princeton.edu/~eszter/stata.html
http://www.princeton.edu/~otorres/Stata/
http://opr.princeton.edu/computing/software/stata/intro/default.asp


