Educational Psychology 626 Statistical Data Processing Course Guide An Overview of Statistical Data Processing Statistical data processing is made fairly easily with the use of modern computer software. While general purpose programs such as Excel and Word have some statistical capabilities, they are generally limited. However, there are a number of popular programs that are dedicated to statistical analysis. These programs vary in their ease of use, comprehensiveness and cost. Two of the most popular programs are SPSS and SAS. These are the programs that you will find in most universities and social science research facilities. Essentially all statistical programs work the same. First, data must be entered on a data file. These data are then read and analyzed by a series of statements written in the statistical program “language” and contained in a syntax file. Finally, the results of the analysis are written to an output file for viewing or printing. Thus, a typical statistical analysis requires the management of three files: data, syntax and output. These three files work interactively to help you accomplish your analysis. In most cases, these files are written as “text” files with no special formatting. (NOTE: In many applications, the Windows/Mac “pull down menus” have replaced the need for a syntax file.) Program or Syntax file Data file Results or Output file SPSS or SAS may be run on a PC or on the campus Alpha (UNIX) system. On the PC, both SAS and SPSS have Windows based versions. Running these programs on the campus UNIX based system from a PC requires a “terminal” program. There are a number of “freeware” terminal programs available for use with campus networked computers and home PCs. Consult the IM&T web at http://www.uwm.edu/IMT/purchase/soft-dept.html for more information. One word of warning. Statistical programming is very unforgiving of errors both in data entry and in writing program syntax. Prepare to spend considerable time “debugging” your data and program syntax. It may be a bit frustrating at first, but you’ll 2 get the hang of it soon. Initially, it takes great care to make sure everything will work as you intend. 3 Introduction to UNIX and creating text files (in pico) I. UNIX A. Accessing your UNIX (alpha) account: If working from home, you’ll need a terminal program (many are freeware, like “CRT” – see the UWM website for IM&T services) and inquiries should be directed to IM&T on campus if you run into difficulties. The examples shown below assume you are using a PC with a terminal program (“CRT” is the program shown in the figures). When you are working in the UNIX environment, even though you are using a PC with a terminal “window” the mouse functions will not work. All commands are entered as text and functions like saving files and editing documents require keystrokes. This takes a while to get used to this if your only exposure has been to a Windows/Mac environment, but it is not difficult to learn. If you are working from a campus computer you should find a yellow folder, called Alpha, on your computer desktop. Double click on the folder to see a bunch of little computer icons with the word Alpha and then a letter after the word Alpha. This letter is important because it represents the first letter of your login name. This letter (i.e., p) and the word Alpha represent the location of your alpha account. So, to choose your alpha account, use the first letter of your login name. For example, if your login name is “psmith” making the first letter of your login name “p” , your alpha account is alphap. When you double click on your alpha account, a UNIX window will open. You will be asked to type in your login and password . Type those in and press Enter. You are now logged in. You will see a screen similar to this: 4 You are told when your last login was and a bunch of other unimportant information. The important information is in BOLD text. First, the following will be displayed: You have mail. Please don't type anything, trying to determine your terminal type This tells you whether or not you have mail and also not to type anything. Do exactly that. Do not type anything. After a few seconds the second part will be displayed: Terminal recognized as vt100 (Base vt100) TERM = (vt100) Once you see this message, press the Enter key and the following will be displayed: 5 This is the alpha “ez” menu. Options on this menu are selected by typing the number to left of the option and pressing the ‘enter’ key. Notice that you can access UWM student SASI information, class lists and other useful information. B. Getting to the UNIX prompt: From the ‘ez’ menu, type 62 and then press Enter. You will see text like this in the terminal window: Suspend this menu and to go Unix ================================= You are about to set aside this menu system so that you can type in your own Unix commands. When you are through typing your own Unix commands issue the Unix command exit to return to this menu system. If you only want to issue one or two Unix commands you might find it more convenient to issue them at the menu system prompt. You can do so by proceeding the name of the command with an exclamation point, e.g., !ls Note: if you go ahead with this option, and then decide that you want 6 to logout from ALPHA, you will need to first type exit to get back to this menu system. Then you can choose the menu option "Logout! Disconnect! Quit!" Do you wish to continue? (y/n)[y]: Again, the BOLD text is what we are interested in. It is simply asking you if you want to continue, so type ‘y’ and press Enter. You are now in the UNIX shell. In this shell you’ll see the UNIX prompt that looks like that below. You might get additional text before you get to the prompt below. Simply type clear to get rid of this text: (alpha) 1: The number after the (alpha) indicates how many commands you’ve entered. The number will increment by one each time a new command is entered. The ‘clear’ command is one example This command will clear the screen of all the text on the screen that we are not interested in. Type in “clear” at the prompt and press Enter. Your screen should now look like this: (alpha) 2: Notice that we have issued one command, so now we are on command 2, which is indicate by the number behind the alpha. Other UNIX commands work like the ‘clear’ command. You enter text at the UNIX prompt and press enter to execute the command. When the UNIX prompt reappears, the command has been executed and you can issue another command. If you ever want to get back to the initial “ez” menu screen (the one with all the numbers and options), type exit at the UNIX prompt and then press Enter. C. Naming files in UNIX The names that you come up with for your UNIX files need to work with the UNIX file system. The way you name files in UNIX is pretty flexible, but there are specific rules you have to follow. In practice, there are also some good habits to get into when making your own files. 7 Although some UNIX systems might vary, the general rules for file naming are as follows. You can name a file in UNIX using up to fourteen characters in any combination from the following sets: 1. {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z} 2. {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} 3. {period (.), underscore(_), comma(,)} You can not name a file period (.) or period period (..). Directories are files in UNIX, so directory names follow these same rules also. You need to follow the official rules for file naming in UNIX, but here are some good practices: 1. Never name a file with the same name as a UNIX command. 2. Use file extensions that are widely recognized when possible. For example, if you create a text file, use the character string .txt as the last part of the file name. 3. Make all the letters of all your files (and directory names) lower case. Unix cares about case. That is, the file Alpha.txt is different from alpha.txt. In most cases, the capital letter is only going to cause you problems with trying to remember if the filename or directory name has a capital letter or not. 4. Start or end a file name with a letter or number. If you start it with a dot (.), it will be a hidden file. 5. Make your names short, but not cryptic. Use correctly-spelled nouns when possible. For example, store your inventory in inventory.dat and not invtry.dat II. Pico (a text editior in UNIX similar to Word in Windows) A. Creating a text file using Pico: Pico is a text editor similar to Word except all editing and all functions are done with the keyboard, not the mouse. We will use it to create and edit text files (like the data and syntax files described earlier). To get into Pico, type pico at the UNIX prompt. The PICO editor will then be displayed: 8 On the bottom of the window you should see the commands you can use in Pico. Each command consists of holding down the Ctrl key and a letter. For example, ^X tells the user to hold down the Ctrl key and the X key. ^X is the command for EXIT. Now type something (anything). When you’re done, use the EXIT command (hold down the Ctrl key and the X key). Pico will ask you if you want to save the file: Save modified buffer? Press y (for yes). Pico will then ask you for a filename. Type in a filename (any name you want to give to your file) and press Enter. The file you just created is a plain text file, which is also known as an ASCII file. You should now be back to the UNIX prompt. Type ls at the prompt. The ‘ls’ command in UNIX lists all of the files that you have saved on your UNIX account. The file you just created should be listed. To look at or edit the file you just created, type pico filename where filename stands for whatever you named your file. This will open your file in pico. Make some changes to your file and exit (^X). If you’ve made any changes, pico will ask you if you want to save the file. If you answer yes, it will ask you again for a filename. If you want it to keep the same filename, and replace the old file with 9 this edited file, simply press Enter. If you want it to save the edited file under a new filename, backspace and enter the new filename. D. Common UNIX Commands There are hundreds of UNIX commands that can be entered at the UNIX prompt to accomplish various tasks. You should be able to get by with just a few: ls = list all files in your account cat filename = show contents of the file filename (use the actual file name – not “filename”) mv oldfile newfile = changes file name from oldfile to newfile rm filename = removes (deletes) the file filename pico = involks pico editor logout = logs you out of UNIX exit = exits the UNIX shell to “ez” (if that’s where you started). Of course, when you are doing statistics on the UNIX system, there are a few other commands associated with SPSS or SAS that you’ll need to know. These are detailed in the Doing Statistics on UNIX handout. To exit your account, go back to the ez menu screen by typing exit (then Enter) at the UNIX prompt. When you get back to the ez menu screen, type bye and Enter. Moving files from PC to Alpha (and Back) In general, it’s possible to create files on the PC and send them electronically to the UNIX machine and to retrieve files from UNIX to your PC. Most of these files will be in ASCII format with no formatting. To do this, you’ll need an “FTP” program installed on your PC. 10 This figure is typical of FTP programs. There are two windows: one shows the file contents in your UNIX account (the right window above) and the other shows the files on your PC. Moving files from one computer to another is usually as simple as “dragging and dropping” the file from one window to another. 11 Data Structures: Planning Data Entry for Analysis Before data can be analyzed, it must be “entered” in a form that is readable by various statistical programs. In planning data entry, there are several issues to be considered. These are reviewed below. A data set can be thought of as a subject by variable “array” of data. This two dimensional data “matrix” generally has rows associated with “subjects” and columns associated with “variables.” This general format is followed regardless of the method of data entry (see below). Each variable should have its own column and each subject should have its own row. Variables are often given names. For example, a variable that describes a subjects SAT score might be labeled “sat.” The mechanics of naming variables differs depending on the data entry method chosen (see below). Subjects are often given ID numbers, frequently running from 1 to N (although any unique number may be used, like student ID or even the subject’s name). The first variable in a data matrix is often the subject ID. (NOTE: “Subjects” are not always people. They may be rats or pigeons or even organizations). Variable Types Variables can be coded as numeric (with or without decimals) or text (alpha). While most programs can deal with alpha/text variables for some analyses, it’s a good idea to code data as numeric whenever possible. For example, a gender variable might be coded “1” and “2” instead of entering “male” and “female” in the data. Other categorical variables can be similarly coded. Keep in mind that these numeric labels generally have no quantitative meaning, but are simply numbers used to abbreviate the category. A variable from a questionnaire item with 5 responses ranging from “strongly agree” to “strongly disagree” might be coded a 1=strongly agree; 2=agree; etc. Here, the number has some quantitative meaning. Obviously, data like “age” or some test score can be numerically entered as it stands. In some cases, especially on questionnaires, the subject may choose multiple answers to a particular question. In this case, variables should be created for each option presented to the subject. For example if a question asks “ethnicity” and indicates that multiple responses are allowed (e.g., African-American and Asian), each option should be associated with a particular variable. In this case there would be a variable associated with African-American identity (possibly coded “1” if checked and “0” if not checked), another variable for Asian, etc. Missing Data Almost all data sets have some missing data (the subject left the “age” variable blank; the subject didn’t have a particular test score; etc.). The data entry process should plan for missing data. In general, some unique numeric (or text in the case of alpha variables) should be used as a “missing data indicator” as opposed to leaving the field blank for that subject. While some programs handle blank fields well, others do not. It is good data 12 entry practice to assign missing data code(s) to each variable. For example, the SAT score missing indicator might be “9999”; for age it might be “999”; for gender it might be “3”. Code Book Once you have determined the codes for the variables, the missing data indicators and the variable names (see specific methods below), you should put together a “code book” that details all of this information. Essentially, this is just a list of variable names, valid entry characteristics and missing data codes. This is done so that you have a clear record of the variables, their order and their meaning. Data Entry Methods In general, data may be entered using a word processor (like Word or Notepad), a spreadsheet (Like Excel or Lotus) or a dedicated data entry program. Each method has its relative advantages and disadvantages, but once you gain experience with each your choice will probably be dictated by personal preference. The method you choose to enter data should have the capability of converting your data to “ASCII” text. ASCII text refers to a particular type of file structure that has no hidden formatting codes or other information that can cause errors when the data are eventually read by a statistical program. It’s also called “plain text” or “unicode” by many word processors/spreadsheets and often has a .txt or .dat file extension automatically assigned. It should be noted that some statistical programs can read data that have been entered into a spreadsheet or word processor directly, but some (especially specialized programs) cannot. If you are working in a UNIX environment (like the UWM “alpha” computers), you’ll need to use a UNIX editor (like PICO) to enter your data. Data entry should proceed like that described below for word processor. UNIX editors save data in unformatted ASCII text so no conversion is necessary. Word Processor (e.g., Word): When using a word processor, data may be entered using either a “freefield” or “fixed” format. In freefield format data are entered for each subject in a row with variables separated by a “delimiter” such as a space or comma. Here’s an example: 13 Here data are delimited by a space. There are seven variables and 10 subjects in the data set. Note especially that each subject has the same number of variables (seven) and the variables are in the same order. (NOTE: One reason why using blanks as a missing data indicator is a bad idea is that if subject “1” above had a missing variable, the line would contain only six variables, possibly causing an error when the data is read by the statistical program). Fixed format differs only slightly from freefield. In fixed format, each variable maintains “column integrity” in the data. For example, the ID variable is always in the first two character positions, the second variable is always in the fourth character spot (after the ID and a space), etc. The same data in fixed format would look like: 14 Note in fixed format, all numeric variables are “right justified,” which means that the numbers all “line up” on the right side. In addition, each variable is always in the exact same character position for each subject. While most statistical programs will read freefield data, fixed format is more universal and makes the data easier to visually inspect. Regardless of the format used when entering data using a word processor, a “nonproportional” font should be used. With proportional fonts, different numbers and characters take up different amounts of space which makes keeping column integrity during data entry a more tedious task. Courier type face is the most common of the nonproportional fonts used in word processors. Once data are entered, it needs to be saved as an ASCII (plain text) file. On most word processors, this is a matter of choosing File>Save As and then selecting the ASCII or plain text option. If you have saved your data in a word processing file, you should choose a different name when saving the ASCII data. This will prevent you from overwriting the word processing data and provide you with a convenient backup of your data. If you are entering data on a UNIX text editor (like PICO), data files are automatically saved as ASCII text and nonproportional fonts are the default. Here’s what the data would look like in PICO using “CRT” as the terminal program: 15 Spreadsheets: Spreadsheets (like Excel) are set up naturally with the row and column array required for data entry. This makes data entry very convenient from a formatting standpoint, but the data entry often requires a few more keystrokes as you move around the array to enter data. An Excel sheet with data entered would look like: 16 Once data are entered, it can be saved as ASCII data by selecting File>Save As (choose tab delimited or Unicode as the format). Excel files are saved with an .xls file extension. Dedicated Programs: Some statistical programs (like SPSS) provide for data entry directly. In SPSS, the data entry “window” looks very much like a spreadsheet and data are entered in the same way. If you plan to use SPSS for your data analysis, there is no reason to convert to ASCII: SPSS will read the data directly. The data in SPSS would look like: 17 Note here the “tab” at the bottom indicating “variable view.” When selected, the user is switched to a window that details variable information: 18 Note that this window allows the user to provide variable names, missing data codes and specify the type of data being entered, among other things. Here, the assigned variable names (e.g., v1, v2) are replaced: 19 The “data view” now looks like this with the variable names indicated in the column headings: 20 Data entered in SPSS may be saved as ASCII text for use with other programs using the methods described above for word processors and spreadsheets (File>Save As, then select “fixed ASCII” or “tab delimited” as the file type). SPSS data files are saved with a .sav file extension. Importing Data ASCII data may also be “imported” into the Excel or SPSS spreadsheet form once entered by word processor or other method. If you are obtaining data from another source, it is often safest to ask for the data in ASCII format. If you then need to add to the data, or edit the data you may import the data into Excel or SPSS. In Excel, to import the data you need to select “text” as the file type on clicking File>Open. In SPSS select File>Read Text Data. Both programs will step you through reading the data in the proper format. 21 Basic SPSS & SAS Syntax I. STATISTICAL PROGRAMMING SYNTAX By now, most everyone is used to working in a “windows” or graphical user interface (GUI) environment to accomplish most tasks on the computer. Most of the time this involves using “pull-down” menus, mouse clicks and so forth to accomplish the desired result. Because of the complexity of statistical analysis and the variety of options associated with a particular analysis, GUI based statistical programs can be somewhat limiting: It’s often not possible to include all options and procedures in a series of simple pull-down menus and dialog boxes. Consequently, it is useful in statistical programming to know how to write “syntax.” Syntax implies a series of “statements” written in a text file that “tells” the program what analysis you want done. Each statement is sentence like (although often more cryptic with many abbreviations) that must follow a predefined format. Syntax statements can be divided conceptually into three categories: Those that tell the program where to find the data and define the data characteristics; statements that can be used to transform or manipulate the data; and statements that tell the program to do specific analyses with the data. Together, these statements form a “program file” that, when executed, perform the desired analysis. These statements are generally written in a text editor (either in the statistical program itself or by using a separate text editor – e.g., Word, PICO) and saved as a text file. The general structure of SPSS and SAS syntax are described below. Keep in mind that the syntax structure of both SPSS and SAS are much more complex than what is presented here. While you will be shown only the basics, it should provide a good foundation for self-study or further instruction in advanced statistical coursework. II. SPSS PROGRAM SYNTAX STRUCTURE There are several program commands that you may use for each statistical run. Some of these are required. Many of the optional commands are omitted from this introduction. All commands should begin in column 1 (left justified). If any of the commands extend beyond 72 columns (the limit in some applications), the command may be continued on the next line after a one space indent. ALL SPSS STATEMENTS END WITH A PERIOD ‘.’ – THIS IS THE MOST COMMON SOURCE OF SYNTAX ERRORS IN SPSS. A. Reading the Data: To read ASCII data, you need to tell the program where to find the data (i.e., give the file name) and indicate the variable names (and columns where the variables reside on the file). 22 This process can begin with the TITLE command. The TITLE command is optional and gives your analysis a title that is printed on each output page. It should be used to identify the analysis for future reference. Col. 1 | TITLE 'give your run a title here'. note: The title you use should be given between apostrophes and may extend up to column 72. This title will appear at the top of each page of output. The next command (or first command if you choose not to use the TITLE command) is used to identify the data file you are using and provide the variable names and location of variables in the file. This command has two basic forms depending on the type of data file that you are using. The first form reads data from an ASCII file, while the second form reads data from an SPSS system (.SAV) file. The first form follows this general format: Col. 1 | DATA LIST FILE='{drive}:{file name}'/{var} {cols}. where {drive}:{file name} is the disk drive and the name of the data file that contains your data. Be sure to enclose the drive and file name in apostrophes. (Note: if your data are not on a separate file but included later in the program, omit the FILE='{drive}:{file name}' portion of this command.) {var} {cols} implies a list of variable names each followed by the columns that the variable occupies in the data matrix. For example: DATA LIST FILE='c:mydata.prn'/var1 1-2 var2 4-6. This command indicates that the data can be found on a file named "mydata.prn" on a disk in drive "C" and that the variable "var1" will be in columns 1 and 2 on each data record, and the variable "var2" will be in columns 4, 5, and 6 on each data record. Of course, if your working in UNIX, you won’t have the drive letter included in the file name, but otherwise, the syntax is identical. If the data for each case are formatted on more than one record (i.e., more than one line per case), variable names and formats for each record are separated by a slash. For example, if data are on two records: DATA LIST FILE='c:mydata.prn'/1 var1 1-2 var2 4-6/2 var3 2-4 var4 5-8. In this case, "var3" and "var4" are found on the second line of data for each case in columns 2-4 and 5-8 respectively. 23 If you’re reading data from an SPSS system file (.SAV) the command has a different form: Col. 1 | GET FILE='{drive}:{file name}'. In this case, the drive and file name imply an SPSS System file which already contains information on variable names and locations (see the third example below for how to convert an ASCII file to an SPSS system file). SPSS DATA INPUT EXAMPLES: 1. From ASCII file, your first command might read: DATA LIST FILE='C:DUMBDATA.TXT'/SEX 1 AGE 4-5 SCR1 8-10 SCR2 13-15 SCR3 18-19. If you’re working on UNIX, this might look like this: DATA LIST FILE='UNIXDATA.DAT'/SEX 1 AGE 4-5 SCR1 8-10 SCR2 13-15 SCR3 18-19. If data are entered in “freefield” format (see “data structure” handout), the variable list does not need to contain the column numbers. SPSS will expect to find the variables one after another on each subject’s record. If using freefield format it is extremely important that all data elements are present for each subject and that missing value indicators, not blanks, are used to specify missing data. The example above would look like this if done in freefield format: DATA LIST FILE='C:DUMBDATA.TXT'/SEX AGE SCR1 SCR2 SCR3. 2. From SPSS/PC system file, your first command might read: GET FILE='C:DUMBDATA.SYS'. Or GET FILE='UNIXDATA.SYS'. 3. To convert an ASCII file to an SPSS system file use the following format: 24 DATA LIST FILE='C:DUMBDATA.TXT'/SEX 1 AGE 4-5 SCR1 8-10 SCR2 13-15 SCR3 18-19. SAVE OUTFILE='C:DUMBDATA.SAV'. The "SAVE OUTFILE" command saves your data in SPSS system file format. By default, SPSS uses the file extension ‘.SAV’ for system files, but you don’t have to use this extension. This file also contains the variable names and other information you provide regarding the format and labels to be applied to the variables. If you save your data as an SPSS system file, you can then use the "GET FILE" procedure for accessing data (on system file DUMBDATA.SAV) and avoid all of the formatting information required when using ASCII data. The "SAVE OUTFILE" command will also save into the SPSS system file any computed or transformed variables that precede the "SAVE OUTFILE" command (see below). Note that SPSS system files are readable only by SPSS and not other programs. While it is possible to convert these files to other formats, all of the variable information will not be converted. B. Manipulating the data The above statements are used to read the data scheduled for analysis. In addition, there are several commands that can be used to manipulate or modify the data for analysis. Below is a brief description of some of these statements and how they are useful. Users should consult the SPSS User's Guide for the specific form and syntax of each statement. Col. 1 | MISSING VALUES var1(x) var2(y). In some cases, some of the data in the records may be missing. Missing data codes for each variable can be identified on this command so that they can be excluded from the analysis. The missing value indicator is included between parentheses after each variable name. If all variables have the same missing value indicator (say, 0), you may substitute ALL(0) after the MISSING VALUES specification. When data are set up, considerable thought should be given to what values to assign to missing data. One alternative is to use blanks - SPSS/PC always interprets blank fields as missing. RECODE -- the recode statement recodes data values. For example, if you have a continuous variable in the range 0-100 and wish to recode the variable into four categories, this command allows you to do so. 25 RECODE var01 (0 to 25 = 1)(26 to 50 = 2)(51 to 75 = 3)(76 to 100 = 4). Or if you simply want to change the way a variable is coded (say letter grades to numbers): RECODE var01 (‘D’ = 1)(‘C’= 2)(‘B’ = 3)(‘A’ = 4). Note that “alpha” variables (those expressed by text instead of numbers) are placed in single quotes. COMPUTE -- the compute command allows you to create new variables from mathematical expressions involving old variables. For example, if you had several subscores that you wished to combine into a total score, the COMPUTE statement could be used to sum the subscores. COMPUTE total=var01+var02+var03. SELECT IF -- the select if statement may be used to do analysis on an identified subset of your data (e.g., males only, those with test scores over 50, etc.) SELECT IF (gender eq 1). SELECT IF (test gt 50). C. Specific SPSS Analytical Procedures SPSS analyses are specified through separate commands for each procedure you want to perform (e.g., t-test, correlation, etc.). The command is specific to each statistical procedure. It defines the precise statistical analysis that you desire. Examples of some of these procedures are presented later. D. The final SPSS command Col. 1 | FINISH. This command is the last command in your syntax and tells the computer that you are done with your SPSS analysis. Some Specific Statistical Procedures 26 This section details a few specific procedures that can be specified in the command syntax. 1. CORRELATION The CORRELATION procedure computes correlations between variables. The format of the CORRELATION procedure is as follows: Col. 1 | CORRELATION {list of variable names} /STATISTICS {list} /OPTIONS {list}. In this format, the procedure will produce correlations between all possible combinations of variables listed on the CORRELATION command. The STATISTICS and OPTIONS for these procedures are listed by number and may be found in the SPSS/PC manual. An alternative format of the procedure is: CORRELATION {variable list 1} WITH {variable list 2} In this format, correlations will be produced between variables to the left of 'with' with those on the right (i.e., var1 with var3, var1 with var4, var2 with var3, var2 with var4). 2. FREQUENCIES The FREQUENCIES procedure requests frequency distributions for variables. It has the following format: Col. 1 | FREQUENCIES variables={variable list} /STATISTICS list. This procedure produces frequency distributions for all variables listed to the right of the 'variables=' in the procedure command. Various formats (not shown, consult manual) may be used to produce the tables. Available options and statistics are listed in the SPSS/PC manual. 3. DESCRIPTIVES Program DESCRIPTIVES computes descriptive statistics for variables and is specified in the following format: 27 Col. 1 | DESCRIPTIVES {variable list} /OPTIONS list /STATISTICS list. Descriptive statistics will be produced for all variables listed to the right of the DESCRIPTIVES command. Options and statistics may be found in the manual. 4. GENERAL All SPSS procedures (there are many) follow the general format of the procedures listed above. They are readily found in the corresponding section of the SPSS/PC manual. Complete Syntax – Example Programs A. An SPSS program that contains the data on an UNIX file. TITLE 'SPSS program with data on UNIX'. DATA LIST file=’mydata.txt’/id 1-2 score1 4-5 score2 7-8 sex 10 (a). COMPUTE total=score1+score2. DESCRIPTIVES score1 score2 total/statistics=all. FINISH. B. An SPSS program using data from another file. TITLE 'SPSS program that reads data from a file on a PC'. DATA LIST FILE='c:study.txt'/id 1-2 score1 4-5 score2 7-8 sex 10 (a)/iq 2-4 group 7. COMPUTE total=score1+score2. DESCRIPTIVES score1 score2 total iq. FINISH. C. Saving a system file. TITLE 'The same SPSS program that saves a system file'. DATA LIST FILE='c:study.txt'/id 1-2 score1 4-5 score2 7-8 sex 10 (a)/iq 2-4 group 7. COMPUTE total=score1+score2. DESCRIPTIVES score1 score2 total iq. SAVE OUTFILE='c:study.sav'. FINISH. 28 D. An SPSS/PC program using a system file. TITLE 'SPSS program using a system file'. GET FILE='c:study.sav'. DESCRIPTIVES score1 score2. FINISH. III. SAS Programming Syntax SAS syntax works similarly to SPSS, but there are important differences. SAS uses a series of “proc” (short for procedure) statements to accomplish the statistical tasks. Each syntax line in SAS must end with a semi-colon “;” character. A. Reading the Data In SAS, when data are read the data file is given an “internal name.” This name is used when doing analyses to refer to a particular data set. This way, you can have more than one active data set for your analysis – say “data1” and “data2.” The general form SAS uses to read data includes four basic lines: data internalname; infile ‘drive:filename’; input var1 #-# var2 #-# var3 #-#; run; Following the ‘data’ command you need to supply an internal file name. Next is the ‘infile’ command, followed by the internal file name that you need to supply. The ‘infile’ command tell SAS where to find the (ASCII) data. You need to give the drive letter (PC only) and any subdirectory information here. Here, of course, the drive letter implies the data are on a PC. If you’re using UNIX, there would be no drive letter. Following the ‘input’ command is the ‘input’ command followed by a list of variable names and the column numbers in which the variables can be found. The ‘run’ command causes the syntax to be executed. Here’s an example: data allcases; infile 'c:/sas examples/hsb.dat'; input SEX 5 CONCPT 20-25 MOT 26-30; run; Here, the ‘data’ statement names the active SAS data file to be read. In this example, the internal name given the data is “allcases.” The data are on drive c in the subdirectory “sas examples” and the file name is “hsb.dat.” There are three variables to be read (sex, 29 concpt and mot) in the columns indicated on the input command line. Data may also be read in freefield format by omitting the column numbers: data allcases; infile 'c:/sas examples/hsb.dat'; input SEX CONCPT MOT; run; It’s also possible to save data into another file. This is accomplished as follows: data allcases; {new data file called ‘allcases’ to be input} infile 'c:/sas examples/hsb.dat'; input SEX CONCPT MOT; run; data male; {defines a new data file} set allcases; {inputs data from ‘allcases’} output; {writes cases to ‘male’} run; From here, the user could reference the data set ‘allcases’ or ‘males’ in further procedures and analysis. There are really two new commands here: ‘set’ inputs data from an active SAS data file and ‘output’ writes the data to the file specified on the most recent ‘data’ statement (in this case, the file ‘male’). A. Manipulating the Data Unlike SPSS which has specific programming statements to handle missing values, data transformations, case selections and other ‘data manipulation’ tasks, these must be programmed by the user in SAS. This is generally done through a series of algebraic and logical statements once the data have been read. First, missing data in SAS is handled differently than in SPSS. A common indicator for missing data in SAS is the ‘.’ (period) character. Anytime SAS finds a period where it expects numeric data, it assumes the data is missing. When reading alpha variables, the period is read as a true period, thus blanks may be substituted (as long as freefield formatting is not used). Other transformations (e.g., recoding data, creating new variables, selecting cases) are done through algebraic and logical statements. Transformation possibilities are endless with statistical data analysis, but an example of three common procedures using SAS is presented below. a. Computing new variables Computing new variables in SAS involves creating algebraic expressions. Here are some examples: 30 data male; set male; newvar=sex +1; {creates the variable ‘newvar’ which is the value of sex plus 1} newmot = mot/2; {creates the variable ‘newmot’ which is half the value of ‘mot’} sumvar=concpt + mot; {creates ‘sumvar’ which is ‘concpt’ and ‘mot’ added together} run; The transformation possibilities with syntax are quite extensive. Consult one of the references for more information. b. Recoding Data Assuming the data have been read, recoding a variable would take this form (using the above example): data male; set male; if mot < 30 then mot=1; if mot => 30 and mot < 60 then mot =2; if mot => 60 then mot=3; run; Here, the object of the program is to take a continuous variable (mot) and form three categories based upon the values of mot. In the data file, the new variable values (1, 2 or 3) replace the original numeric values. Note the logical operators here (‘<’ = less than; ‘>’ = greater than; ‘=’ = equal to). In this case, the variable name remains the same. If you wanted to keep the variable ‘mot’ as originally read and create a new variable ‘newmot’ that contains the three categories you’d like to create, the syntax would look like this: data male; set male; newmot=0; if mot < 30 then newmot=1; if mot => 30 and mot < 60 then newmot =2; if mot => 60 then newmot=3; run; It is usually preferable to create a new variable rather than recode an old variable with the original coding since you may want to use the old variable for some other purpose later. 31 c. Selecting Cases If you want to select a subset of cases for analysis in SAS, you essentially need to create a new data file that contains only those cases that you wish to analyze. For example: data allcases; {read data file ‘allcases} infile 'c:/sas examples/hsb.dat'; input id sex concpt mot; run; data males; {specify new data file ‘males’} set allcases; {use existing data file ‘allcases’} if sex=1 then output; {write cases to ‘males’ if the sex variable equals 1} data females; {specify new data file ‘females’} set allcases; {use existing data file ‘allcases’} if sex=2 then output; {write cases to ‘females’ if the sex variable equals 2} run; Once the files ‘males’ and ‘females have been created, analyses can proceed with these data sets. d. Specific SAS Statistical Procedures Specific statistical operations in SAS are accomplished through the use of “proc” (short for procedure) statements. Each statistical analysis requires a specific “proc.” A few examples are provided below. A. Correlation proc corr data=group1; var mot math; run; B. Frequencies proc freq data=one; tables group sex; run; C. Descriptive Statistics proc means data=one; class sex; var test1 test2 test3; run; 32 Complete Syntax – Example Programs A. A SAS program that contains the data on an UNIX file. data example; infile=’mydata.txt’; input id 1-2 score1 4-5 score2 7-8 sex 10 (a); run; data example; total=score1+score2; run; proc freq data=example; tables sex; run; B. A SAS program using data from a PC file. data example; infile=’mydata.txt’; input id 1-2 score1 4-5 score2 7-8 sex 10 (a); run; data example; total=score1+score2; run; proc means data=example; var score1 score2 total; run; C. Saving a modified file. data example; infile=’mydata.txt’; input id 1-2 score1 4-5 score2 7-8 sex 10 (a); run; data example; total=score1+score2; run; data=highscores; set example; if score1>100 then output; run; 33 Running Statistical Programs (SPSS/SAS) on UNIX Running a statistical program with SPSS or SAS on the UNIX system requires that you create (or FTP) your data file and your syntax file on the UNIX system (see Using the Campus UNIX System). Once you have created your files, the UNIX commands for running an SPSS or SAS program are similar. Both programs are most easily run in “batch” mode. Batch mode processing means that you program will be run with no visible output shown on the screen. All output is written to a text file which you can view or edit after the program runs. The SPSS command looks like this: spss –m <inputfile > outputfile where ‘inputfile’ is the name of the file that contains your program syntax and ‘outputfile’ is the file name you specify to receive your output for later viewing. The program listing and output are all contained on this file. The SAS command looks like this: sas inputfile where ‘inputfile’ is the name of the file that contains your program syntax. The output for SAS is written to a file with the same name as your ‘inputfile’ but has a ‘.lst’ file extension. For example, if your syntax file is named ‘program1’ the output will be written to a file called ‘program1.lst’. SAS also creates a program “log” file with an ‘.log’ extension. This file is useful for debugging programs. In this case, this file would be named ‘program1.log’. Both the SPSS and SAS commands are entered at the UNIX prompt. SPSS EXAMPLE Data may be entered on a PC and “FTPed” to the UNIX system or entered in PICO. Regardless, you should examine the data in the PICO editor to make sure everything looks like it should. In this case, the data are named ‘example1.dat’. The data for this example are shown in the Figure below. 34 Likewise, the SPSS program syntax may be “FTPed” or entered directly in the PICO editor. The SPSS program syntax for these dat might look like that shown in the figure below. In this case, the program syntax file is named ‘spssexample1.prg’. 35 To run this program, the following would be entered at the UNIX command prompt: Alpha 14:spss –m <spssexample.prg > spssexmple1.out In this case, the output for the program will be written to a file called ‘spssexample1.out’. This file can be viewed and edited in PICO and, in the case, would look like that in the figure below. 36 SAS EXAMPLE Our SAS example will use the same data as the SPSS example above. Remember that the file is named ‘example1.dat’. The SAS program file, in this case named ‘sasexample1.prg’ is shown in the PICO editor below. Again, this file can be FTPed or entered directly into the editor. 37 The syntax for running the SAS program at the UNIX prompt would look like this: alpha 12: sas sasexample1.prg When the program executes, two additional files will be created: ‘sasexample1.prg.log’ will include the program log which can be inspected for errors and general program information; and ‘sasexample1.prg.lst’ which will contain the program output. This latter file is shown in the figure below. 38 39 Running Statistical Programs in a Windows Environment Both SPSS and SAS are available in a Windows version that can be run on a PC. The Windows versions of these programs make running a program a little less awkward since there are no UNIX commands to worry about. The programs are costly, however which makes their use outside of a business or academic environment impractical for many people. SPSS offers a “Grad Pack” version of SPSS for Windows that is available at the UWM Bookstore. The cost is around $200 and it is not upgradeable. SPSS for Windows is available on many campus computers for use by students. SAS for Windows is also available in some computer labs on campus. SPSS and SAS for Windows involve the same file structure that is used in UNIX. You will still be using three files: data file, syntax file and an output file. (NOTE: SAS has a fourth log file.) SPSS for Windows In general, SPSS for Windows uses a Graphical User Interface (GUI) to accomplish many tasks. It has a “pull down menu” structure that allows the user to accomplish many common analyses. However, the menu structure makes some tasks awkward and others impossible to do without writing some syntax. Consequently, this presentation will involve the use of SPSS syntax in the Windows environment. Students may explore the pull down menu structure on their own. When SPSS is started, the user is presented with a blank data window that looks like that in the figure below. 40 Data can be enter directly into this window (see Data Entry and Structure) or it can be populated by reading and ASCII file from SPSS Syntax. In order to use SPSS syntax in this environment, you must open a syntax window (File>New>Syntax if you’re writing a new file or File>Open>Syntax for an existing syntax file). The figure below illustrates what the GUI will look like with this additional syntax window open. The syntax window works like a word processor that can be used to enter syntax. Note the arrow in this figure that points to the SPSS “run” button. To run SPSS syntax, you need to first highlight the syntax you wish to run, and then click the run button. 41 The syntax above is designed to read ASCII data as we did in the UNIX environment. When this syntax is run, the data window is populated with data you have read and looks like the figure below (NOTE: you may have to save the data before it is visible in this data window – File>Save on the data window menu). 42 Note that the data have been read and are now in “spreadsheet” form. Note also the two “tabs” at the bottom of the data window: Data View and Variable View. If you click on variable view, you’ll see (and can edit) the characteristics of each variable. The variable view for these data are shown in the figure below. 43 Anytime a statistical procedure is run, the output is written to an output file which is contained in a third “window” in the SPSS GUI. For example, the figure below shows some simple syntax for these data added to the syntax window. In this case, the additional syntax specifies some descriptive statistics for a few of the variables. When highlighted and run, this procedure writes output to the output window which is also show in the figure below. The output window also works like a word processor, thus the output can be edited, printed, saved or cut and pasted into another document. SPSS data files can also be saved for future use with SPSS. Once data are entered or read into the data window, they can be saved (File>Save or File> Save As) as an SPSS system file. These files are saved with a ‘.sav’ file extension by default. Subsequent use of these files requires the user to open these files in the data window (File>Open>Data) which makes running the syntax to read the ASCII data unnecessary. SPSS syntax files are save by default with an ‘.sps’ file extension and output files are saved with a ‘.spo’ file extension. 44 SAS for Windows SAS for Windows works like SPSS for Windows but the GUI looks somewhat different. When started, the SAS for Windows displays the GUI shown below. By default, the program log (upper window in the figure) and syntax editor (lower window in the figure) are shown. SAS syntax may be written or opened in the program editor. Note also the two arrows in the figure below. Once a program is run, the output is written to an output file, just like SPSS. The buttons at the bottom of the GUI are a convenient way to view the contents of each window. Clicking on the buttons will display the output, log and program editor windows respectively. 45 The Figure below shows the SAS GUI with syntax contained in the program editor. Note the SAS “run” button indicated by the arrow at the top of the window. Click on this button and the syntax in the editor will be run. If you highlight certain lines of syntax in the editor, only this syntax will be run. 46 Once the syntax has been run, the output will be written to the output editor. The figure below shows the output window visible are the above syntax has been run. Again, the buttons at the bottom of the GUI can be used to display the log and syntax editor respectively. 47 48 Appendix A SPSS Statistical procedures - Syntax examples varx = variable name) (FREQUENCY DISTRIBUTIONS) FREQUENCIES [VARIABLES=]varlist [/FORMAT=[{DVALUE}] [{NOTABLE }]] {AFREQ } {LIMIT(n)} {DFREQ } [/MISSING=INCLUDE] [/BARCHART=[MIN(n)][MAX(n)][{FREQ(n) }]] {PERCENT(n)} [/PIECHART=[MIN(n)][MAX(n)][{FREQ }] {PERCENT} [{MISSING }]] {NONMISSING} [/HISTOGRAM=[MIN(n)][MAX(n)][{FREQ(n) }] {PERCENT(n)} [{NONORMAL}][INCREMENT(n)]] {NORMAL } [/NTILES=n] [/PERCENTILES=value list] [/STATISTICS=[DEFAULT][MEAN][STDDEV][SUM] [MINIMUM][MAXIMUM][RANGE] [SEMEAN][VARIANCE][SKEWNESS][SESKEW] [MODE][KURTOSIS][SEKURT][MEDIAN] [ALL][NONE]] [/GROUPED=varlist [{(width) }]] {(boundary list)} [/ORDER = [{ANALYSIS}] [{VARIABLE}] Simple Example: Frequencies Variables=var1 var2. 49 (DESCRIPTIVE STATISTICS) DESCRIPTIVES [VARIABLES=] varname[(zname)] [varname...] [/MISSING={VARIABLE**} [INCLUDE]] {LISTWISE } [/SAVE] [/STATISTICS=[DEFAULT**][MEAN**][MIN**][SKEWNESS]] [STDDEV** ][SEMEAN][MAX**][KURTOSIS] [VARIANCE ][SUM ][RANGE][ALL] [/SORT=[{MEAN }] [{(A)}]] {SMEAN } {(D)} {STDDEV } {VARIANCE} {KURTOSIS} {SKEWNESS} {RANGE } {MIN } {MAX } {SUM } {NAME } Simple example: Descriptives Variables=var1 var2 var3/ Statistics=mean sddev min max. 50 (T TEST – INDEPENDENT GROUPS) T-TEST GROUPS=varname({1,2** }) {value } {value,value} /VARIABLES=varlist [/MISSING={ANALYSIS**}[INCLUDE]] {LISTWISE } [/CRITERIA=CI[{(0.95 )}]] {(value)} Simple Example: t-test groups=var1(1 2)/ variables=var2 var3. (T-TEST – DEPEDNENT GROUP) T-TEST PAIRS=varlist[WITH varlist [(PAIRED)]] [/varlist...] [/MISSING={ANALYSIS**}[INCLUDE]] {LISTWISE } [/CRITERIA=CI[{(0.95 )}]] {(value)} Simple Example: t-test pairs=var1 with var2. 51 (CHI SQUARE TEST OF ASSOCIATION) CROSSTABS [TABLES=]varlist BY varlist [BY...] [/varlist...] [/MISSING={TABLE**}] {INCLUDE} [/FORMAT={TABLES**}{AVALUE**}] {NOTABLES}{DVALUE } [/CELLS=[COUNT**][ROW ][EXPECTED][SRESID ]] [COLUMN][RESID ][ASRESID] [TOTAL ][ALL ][NONE ] [/WRITE[={NONE** }]] {CELLS } [/STATISTICS=[CHISQ][LAMBDA][BTAU][GAMMA][ETA ]] [PHI ][UC ][CTAU][D ][CORR] [CC ][KAPPA ][RISK][MCNEMAR] [CMH [(value)]] [ALL ][NONE] [/METHOD={MC [CIN({99.0 })] [SAMPLES({10000})]}]†† {value} {value} {EXACT [TIMER({5 })] } {value} [/BARCHART] **Default if the subcommand is omitted. ††The METHOD subcommand is available only if the Exact Tests Option is installed. Simple Example: Crosstabs Tables=var1 var2 by var3/ Statistic=chisq/ Cells=count row column total. 52 (FACTORIAL ANOVA) UNIANOVA dependent varlist [BY factor list [WITH covariate list]] [/RANDOM=factor factor...] [/REGWGT=varname] [/METHOD=SSTYPE({1 })] {2 } {3**} {4 } [/INTERCEPT=[INCLUDE**] [EXCLUDE] ] [/MISSING=[INCLUDE] [EXCLUDE**]] [/CRITERIA=[EPS({1E-8**})][ALPHA({0.05**}) ] {a } {a } [/PRINT = [DESCRIPTIVE] [HOMOGENEITY] [PARAMETER][ETASQ] [RSSCP] [GEF] [LOF] [OPOWER] [TEST [([SSCP] [LMATRIX]] [/PLOT=[SPREADLEVEL] [RESIDUALS] [PROFILE (factor factor*factor factor*factor*factor ...) ] [/TEST=effect VS {linear combination [DF(df)]}] {value DF (df) } [/LMATRIX={["label"] effect list effect list ...;...}] {["label"] effect list effect list ... } {["label"] ALL list; ALL... } {["label"] ALL list } [/CONTRAST (factor name)={DEVIATION[(refcat)]** ‡ } ] {SIMPLE [(refcat)] } {DIFFERENCE } {HELMERT } {REPEATED } {POLYNOMIAL [({1,2,3...})]} { {metric } } {SPECIAL (matrix) } [/KMATRIX= {list of numbers } ] {list of numbers;...} [/POSTHOC =effect effect...([SNK] [TUKEY] [BTUKEY][DUNCAN] [SCHEFFE] [DUNNETT(refcat)] [DUNNETTL(refcat)] [DUNNETTR(refcat)] [BONFERRONI] [LSD] [SIDAK] [GT2] [GABRIEL] [FREGW] [QREGW] [T2] [T3] [GH][C] [WALLER ({100** })]] {kratio} [VS effect] [/EMMEANS=TABLES({OVERALL }) {factor } {factor*factor... } WITH (covariate=MEAN covariate=MEAN) COMPARE ADJ (LSD) 53 (BONFERRONI) (SIDAK) [/SAVE=[tempvar [(name)]] [tempvar [(name)]]...] [/OUTFILE=[{COVB (file)}] {CORB (file)} [/DESIGN={[INTERCEPT...] }] {[effect effect...]} ** Default if subcommand or keyword is omitted. Simple Example: Unianova Var1 by var2 var3/ Design=var2 var3 var2*var3. 54 (REPEATED MEASURES ANOVA) GLM dependent varlist [BY factor list [WITH covariate list]] [/WSFACTOR=name levels[{DEVIATION[(refcat)] }] name... {SIMPLE [(refcat)] } {DIFFERENCE } {HELMERT } {REPEATED } {POLYNOMIAL [({1,2,3...})]**} { {metric } } {SPECIAL (matrix) } [/MEASURE=newname newname...] [/WSDESIGN=effect effect...]† [/RANDOM=factor factor...] [/REGWGT=varname] [/METHOD=SSTYPE({1 })] {2 } {3**} {4 } [/INTERCEPT=[INCLUDE**] [EXCLUDE] ] [/MISSING=[INCLUDE] [EXCLUDE**]] [/CRITERIA=[EPS({1E-8**})][ALPHA({0.05**}) ] {a } {a } [/PRINT = [DESCRIPTIVE] [HOMOGENEITY] [PARAMETER][ETASQ] [RSSCP] [GEF] [LOF] [OPOWER] [TEST [([SSCP] [LMATRIX] [MMATRIX])] ] [/PLOT=[SPREADLEVEL] [RESIDUALS] [PROFILE (factor factor*factor factor*factor*factor ...) ] [/TEST=effect VS {linear combination [DF(df)]}] {value DF (df) } [/LMATRIX={["label"] effect list effect list ...;...}] {["label"] effect list effect list ... } {["label"] ALL list; ALL... } {["label"] ALL list } [/CONTRAST (factor name)={DEVIATION[(refcat)]** ‡ } ] {SIMPLE [(refcat)] } {DIFFERENCE } {HELMERT } {REPEATED } {POLYNOMIAL [({1,2,3...})]} { {metric } } {SPECIAL (matrix) } [/MMATRIX= {["label"] depvar value depvar value ...;["label"]...} ] {["label"] depvar value depvar value ... } {["label"] ALL list; ["label"] ... } {["label"] ALL list } 55 [/KMATRIX= {list of numbers } ] {list of numbers;...} [/POSTHOC =effect effect...([SNK] [TUKEY] [BTUKEY][DUNCAN] [SCHEFFE] [DUNNETT(refcat)] [DUNNETTL(refcat)] [DUNNETTR(refcat)] [BONFERRONI] [LSD] [SIDAK] [GT2] [GABRIEL] [FREGW] [QREGW] [T2] [T3] [GH][C] [WALLER ({100** })]] {kratio} [VS effect] [/EMMEANS=TABLES({OVERALL }) ] {factor } {factor*factor... } {wsfactor } {wsfactor*wsfactor ... } {factor*...wsfactor*...} WITH (covariate=MEAN covariate=MEAN) COMPARE ADJ (LSD) (BONFERRONI) (SIDAK) [/SAVE=[tempvar [(list of names)]] [tempvar [(list of names)]]...] [/OUTFILE=[{COVB (file)}] {CORB (file)} [/DESIGN={[INTERCEPT...] }] {[effect effect...]} † WSDESIGN uses the same specification as DESIGN, with only within-subjects factors. ‡ DEVIATION is the default for between-subjects factors while POLYNOMIAL is the default for within-subjects factors. ** Default if subcommand or keyword is omitted. Simple Example: Glm Var1 var2 var3 by var4 var5/ Wsfactor=factor1 3/ Wsdesign=factor1/ Design=var4 var5 var4*var5. 56 (CORRELATIONS) CORRELATIONS [VARIABLES=] varlist [WITH varlist] [/varlist...] [/MISSING={PAIRWISE**} [INCLUDE]] {LISTWISE } [/PRINT={TWOTAIL**} {SIG**}] {ONETAIL } {NOSIG} [/MATRIX=OUT(file)] [/STATISTICS=[DESCRIPTIVES][XPROD][ALL]] **Default if the subcommand is omitted. Default for /MATRIX OUT is the working data file. Simple Example: correlations variables= var1 var2 var3. 57 (REGRESSION) REGRESSION [MATRIX=[IN(file)] [OUT(file)]] [/VARIABLES={varlist }] {(COLLECT)**} Command Syntax {ALL } [/DESCRIPTIVES=[DEFAULTS][MEAN][STDDEV][CORR][COV] [VARIANCE][XPROD][SIG][N][BADCORR] [ALL][NONE**]] [/SELECT={varname relation value} [/MISSING=[{LISTWISE** }] [INCLUDE]] {PAIRWISE } {MEANSUBSTITUTION} [/REGWGT=varname] [/STATISTICS=[DEFAULTS**][R**][COEFF**] [ANOVA**][OUTS**] [ZPP][CHA][CI][F][BCOV] [SES][XTX][COLLIN] [TOL][SELECTION][ALL]] [/CRITERIA=[DEFAULTS**][TOLERANCE({0.0001**})] {value } [PIN({0.05**})][POUT({0.10**})] {value } {value } [FIN({3.84 })][FOUT({2.71 })] {value} {value} [CIN({ 95**})]][MAXSTEPS(n)] {value} [/{NOORIGIN**}] {ORIGIN } /DEPENDENT=varlist [/METHOD=]{STEPWISE [varlist] } {FORWARD [varlist] } {BACKWARD [varlist] } {ENTER [varlist] } {REMOVE varlist } {TEST(varlist)(varlist)...} [/RESIDUALS=[DEFAULTS][ID(varname)] [DURBIN][{SEPARATE}] {POOLED } [HISTOGRAM({ZRESID })] {tempvarlist} [OUTLIERS({ZRESID })] {tempvarlist} [NORMPROB({ZRESID })] {tempvarlist} [/CASEWISE=[DEFAULTS][{OUTLIERS({ 3 })}] 58 { {value} } {ALL } [/SCATTERPLOT=(varname,varname)] [/PARTIALPLOT={ALL }] {varlist} [OUTFILE={COVB (filename)}] [MODEL (filename)] {CORB (filename)} [/SAVE=tempvar[(name)][tempvar[(name)]...][FITS]] Temporary variables for residuals analysis are: PRED,ADJPRED,SRESID,MAHAL,RESID,ZPRED,SDRESID,COOK, DRESID,ZRESID,SEPRED,LEVER,DFBETA,SDBETA,DFFIT,SDFFIT, COVRATIO,MCIN,ICIN. SAVE FITS saves DFFIT,SDFIT,DFBETA,SDBETA,COVRATIO. **Default if the subcommand is omitted. Simple Example: regression /dependent var1 /method=enter var2 var3. 59 Appendix B SAS Statistical procedures - Syntax examples varx = variable name) (FREQUENCY DISTRIBUTIONS) proc freq data=info page; by x; tables y*z /chisq expected exact sparse nocum nopercent nocol norow; The backslash MUST be used (once) to select any of the options of the tables statement. The data= line is required and specifies the data set to be analyzed. The optional by statement produces separate analyses for each value of the by variable(s). The tables statement produces a table of values for analysis. The form y*z produces a 2 way table with the values of y defining table rows and the values of z defining table columns. A 1 way table is obtained by specifying a single variable name. Multiple tables may be requested within a single tables statement. The short form (q1--q3)*a requests 3 tables and is equivalent to q1*a q2*a q3*a. Note the double hyphen required here. Options for the tables statement follow the /. The chisq option to the tables statement performs the standard Pearson chi-square test on the table(s) requested. The expected option prints the expected number of observations in each cell under the null hypothesis. The exact option requests Fisher's exact test for the table(s). This is automatically computed for 2 x 2 tables. The sparse option produces a full table, even if the table has many cells containing no observations. The nocum option suppresses printing of cumulative frequencies and percentages in the table. 60 The nopercent option suppresses printing of the cell percentages and cumulative percentages. The nocol and norow options suppress the printing of column and row percentages. Simple Example: proc freq data=example; tables=var1 var2; run; 61 (DESCRIPTIVE STATISTICS) proc means data=first alpha=0.10 vardef=n mean std t prt n lclm uclm; var x y; by x; The required data=first statement selects the data set used in the analysis. The alpha= option sets the significance level for any confidence intervals computed (defaults to alpha=0.05 if not specified). The vardef=n option sets the divisor for the sample variance to be the sample size n (default is to use the degrees of freedom n-1 and is the same as vardef=df). NOTE: do not use the vardef option if you also use the t and/or prt options. THE T AND PRT VALUES GENERATED WHEN VARDEF=N ARE INCORRECT. The optional last line gives a list of the statistics to be computed. The mean and std options give the sample mean and sample standard deviation for each variable in the analysis. The t option computes the value of the t-statitistic for a test of zero mean for each variable. The prt option computes the p-value for a test of the null hypothesis of zero mean for each variable. The n option prints the sample size. The lclm and uclm options produce the lower and upper endpoints of a 100(1-alpha)% confidence interval for the mean. If no statistics are specified, n mean std min and max are printed for each variable. Min and max are the sample minimum value and maximum value. The var option can be used to limit the output to a specific list of variables in the data set. If it is not used, the requested statistics will be computed for all variables in the data set. The optional by statement generates separate analyses for each level of the by variable(s). Simple Example: proc means data=example mean; var var1 var2 var3; run; 62 (T TEST – INDEPENDENT GROUPS) proc ttest data=second; class type; var x y; THE CLASS SPECIFICATION IS REQUIRED. The class statement specifies a variable (here type) in the data set which takes exactly two values. The values of this class variable define the two populations to be compared. The optional var statement restricts comparisons to the variables listed; if not used, all variables in the data set (execpt the class variable) are compared pairwise. Simple Example: proc ttest data=example; class var1; var var2 var3; run; (T-TEST – DEPEDNENT GROUP) In SAS, the dependent group t-test is performed by PROC MEANS after the creation of a “gain” variable from the two dependent measures. Simple Example: gain=var1-var2; proc means data=example mean stderr t prt; var gain; run; 63 (CHI SQUARE TEST OF ASSOCIATION) See PROC FREQ above. Simple Example: proc freq data=example; tables=var1*var2 / expected chisq; run; (FACTORIAL ANOVA) PROC ANOVA handles only balanced ANOVA designs. PROC ANOVA CLASS MODEL MEANS DATA=datasetname; factorvars; responsevar = factorvars; factorvars / BON /* See below /* Bonferroni t-tests, g=r(r-1)/2 T or LSD /* Unprotected t-tests TUKEY /* Tukey studentized range SCHEFFE /* Scheffe contrasts ALPHA=pvalue /* default: 5% CLDIFF /* Confidence limits LINES /* Non-significant subsets ; /* use LINES or CLDIFF Simple Example: proc anova data=example; class var1 var2; model var3=var1 var2 var1*var2; (or var1|var2;) run; PROC GLM handles both balanced and unbalanced designs but uses more computing resources. PROC GLM options; CLASS variables; MODEL dependents = independents / options; ABSORB variables; BY variables; FREQ variables; 64 */ */ */ */ */ */ */ */ */ ID variables; WEIGHT variables; CONTRAST 'label' effect values ... / options; ESTIMATE 'label' effect values ... / options; LSMEANS effects / options; MANOVA H = effects E = effect M = equations MNAMES = names PREFIX = name / options; MEANS effects / options; OUTPUT OUT=SASdataset keyword=names ...; RANDOM factorname levels(levelvalues) transformation[, ... ] / options; REPEATED factorname levels(levelvalues) transformation[,...] / options; TEST H = effects E = effect / options; Simple Example: proc glm data=example; class var1 var2; model var3=var1 var2 var1*var2; (or var1|var2;) run; (REPEATED MEASURES ANOVA) PROC GLM ; MODEL dv1 dv2 dv3 dv4 dv5 = /NOUNI; REPEATED reptdfac /PRINTE ; RUN ; Simple Example: proc anova data=example; model var1 var2 var3=var4|var5 / nouni; repeated factor1 3 (1 2 3); run; 65 (CORRELATIONS) PROC CORR options; VAR variables; WITH variables; PARTIAL variables; WEIGHT variables; FREQ variables; BY variables; PROC CORR Options: DATA=SASdaataset -- Names SAS dataset to be used OUTH=SASdaataset -- Names SAS dataset containing Hoeffding statistics OUTK=SASdaataset -- Names SAS dataset containing Kendall correlations OUTP=SASdaataset -- Names SAS dataset containing Pearson correlations OUTS=SASdaataset -- Names SAS dataset containing Spearman correlations. HOEFFDING -- requests Hoeffding's D statistic KENDALL -- requests Kendall's tau-b coefficients PEARSON -- requests Pearson correlation coefficients SPEARMAN -- requests Spearman coefficients NOMISS -- missing values to any variable to be dropped from all calculations VARDEF=divisor -- specifies the divisor used in calculation of variances SINGULAR=p -- criterion for determining singularity ALPHA -- Cronbach's alpha COV -- covariances CSSCP -- corrected sums of squares and cross products NOCORR -- Pearson correlations not calculated or printed SSCP -- sums of squares and cross products BEST=n -- prints n correlations with largest absolute values NOPRINT -- Suppresses printed output NOPROB -- suppresses printing of significance probabilities NOSIMPLE -- suppresses printing of simple descriptive statistics rank -- print correlations in order from highest to lowest Simple Example: proc corr data=example; var var1 var2 var3; run; 66 (REGRESSION) PROC REG options; MODEL dependents=regressors / options; VAR variables; FREQ variable; WEIGHT variable; ID variable; OUTPUT OUT=SASdataset keyword=names...; PLOT yvariable*xvariable = symbol ...; RESTRICT linear_equation,...; TEST linear_equation,...; MTEST linear_equation,...; BY variables; The PROC REG statement is always accompanied by one or more MODEL statements to specify regression models. One OUTPUT statement may follow each MODEL statement. Several RESTRICT, TEST, and MTEST statements may follow each MODEL. WEIGHT, FREQ, and ID statements are optionally specified once for the entire PROC step. The purposes of the statements are: The MODEL statement specifies the dependent and independent variables in the regression model. The OUTPUT statement requests an output data set and names the variables to contain predicted values, residuals, and other output values. The ID statement names a variable to identify observations in the printout. The WEIGHT and FREQ statements declare variables to weight observations. The BY statement specifies variables to define subgroups for the analysis. The analysis is repeated for each value of the BY variable. Simple Example: proc reg data=example; model var1=var2 var3; run; 67