CROP 590 Experimental Design in Agriculture
Lab Week 1
Introduction to SAS

Recommended reading: Cody and Smith – Pages 1-18

Part I. SAS Online Documentation

Go to the SAS online documentation website (for version 9.3):
http://support.sas.com/documentation/93/index.html

1) Explore the contents of the SAS System Help directory. Base SAS and SAS/STAT software are most relevant for this course. Note that you can also access SAS online documentation by using the Help function when you are running SAS.

2) Select a SAS/STAT procedure (PROC) and explore the background information and syntax documentation.

3) Review the syntax conventions used in the SAS Help documentation. For example:

   PROC GLM <options> ;
   CLASS variable <(REF= option)> …<variable <(REF=FIRST | LAST)>> </ global-options> ;
   MODEL dependent-variables = independent-effects </ options> ;

- SAS keywords, such as statement or procedure names, appear as links in all caps.
- Optional arguments appear inside angle brackets (< >).
- Values that you must spell as they are given in the syntax appear in uppercase type.
- Values that you must supply appear in normal text or in italics.
- Argument groups that you can repeat are indicated by an ellipsis (…).
- Mutually exclusive choices are joined with a vertical bar (|).

Part II. The SAS Display Manager for Windows

Explore the windows that are displayed and click the help icon for further information about them.

Editor – this is where you enter programs. Version 9.3 of SAS uses an ‘enhanced editor’ with many new features. The older ‘program editor’ is also available, but is not recommended.
Log – shows the SAS statements that have been submitted, reports system messages, and identifies errors in your program
Explorer – provides easy access to data sets and SAS files
Results – provides easy access to SAS output files

In SAS 9.3, the default window for results of SAS procedures and analyses is the Results Viewer. Results are presented in HTML format.
If you prefer the older Listing (text) output format, you can choose that option by selecting Tools → Options → Preferences; you then select the Results tab and check the box to create a listing.

Note that you can also navigate SAS windows using the ‘Window’ and ‘View’ drop-down menus. An active window can be cleared by using the blank page icon on the toolbar. This is a good housekeeping practice because it avoids appending new log and output content to obsolete content.

Part III. SAS Basics

All SAS statements end with a semicolon. Case and spacing generally are not important. Statements can extend to more than one line.

Variable names:
- begin with a letter or an underscore
- can be up to 32 characters in length (note that character values read with list input are stored with a default length of 8 characters)
- cannot contain special characters (e.g., , ; - /); underscores (_) are OK
- should contain no spaces
- example of a valid SAS variable name: YLD03_KG
- variables may be designated as numeric or character. Character values are case sensitive and sensitive to leading blanks

SAS programs are divided into sections called:
- The DATA Step – creates a data set and modifies it as needed
- The PROC Step – specifies SAS procedures to perform (e.g., data analyses)

SAS language may be specific to the DATA Step or the PROC Step, but some statements are universal to both. Global statements apply to all subsequent steps in the program.

Some useful SAS Procedures:

PROC FREQ – Produces frequency and contingency tables for categorical variables and performs Chi-square tests for goodness of fit
PROC GLIMMIX – Fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed. These models are known as generalized linear mixed models (GLMM).
(Available in version 9.2 and later)
PROC GLM – Performs Analysis of Variance for balanced and unbalanced data; can accommodate independent variables that are categorical (class variables) as well as continuous (as in regression)
PROC GPLOT – Creates graphs of data
PROC IML – SAS/IML is a programming language that operates on matrices
PROC LATTICE – Computes Analysis of Variance for lattice designs
PROC MEANS – Computes descriptive statistics for variables across all observations and within groups of observations
PROC MIXED – Performs analysis of mixed models
PROC PRINT – Prints the observations in a SAS data set
PROC REG – A general-purpose procedure for linear regression
PROC SORT – Sorts observations in a SAS data set by one or more character or numeric variables
PROC UNIVARIATE – Provides data summarization methods that produce univariate statistics and information on the distribution of numeric variables

Example of a SAS program:

   DATA EXAMPLE;
   INPUT VARIETY $ PLOTM2 PLOTWT;
   YLD03_KG = (PLOTWT/PLOTM2)*10;
   IF VARIETY = 'MOREX' THEN DELETE;
   DATALINES;
   STEPTOE 4.32 2355
   BARONESSE 4.28 2825
   HARRINGTON 4.89 2236
   MOREX 4.77 1980
   STEPTOE 4.61 .
   BARONESSE 4.66 2691
   HARRINGTON 4.50 2100
   MOREX 4.35 2206
   ;
   /*DATA SUMMARY*/
   PROC SORT DATA=EXAMPLE;
   BY VARIETY;
   PROC PRINT;
   PROC MEANS;
   TITLE 'SUMMARY OF BARLEY DATA';
   VAR YLD03_KG;
   BY VARIETY;
   RUN;
   QUIT;

Notes on the program:
- DATA – Beginning of DATA step
- INPUT – $ indicates that variety is a character variable
- YLD03_KG = … – Creates a new variable
- IF … THEN DELETE – Drops specified observations from the data set; single quotes indicate a value for a character variable
- DATALINES – In older versions of SAS, a ‘CARDS’ statement was used rather than ‘DATALINES’
- Data lines – Each line represents one observation. Each column represents a different variable. The period ‘.’ designates a missing value. Missing values that are input from Excel should be left blank.
- ; (after the data lines) – A semicolon is needed after the datalines
- /*DATA SUMMARY*/ – Comment statement (a note to yourself that does not affect the program)
- PROC SORT – Beginning of PROC Step
- BY VARIETY (in PROC SORT) – Sorts data by variety
- PROC PRINT – Prints the most recently created data set
- PROC MEANS – Another PROC Step
- TITLE – A title will appear as a heading on each page of output until it is reset
- VAR – Indicates the variables that you want to analyze
- BY VARIETY (in PROC MEANS) – Requests means for each variety
- RUN – Always end the program with a RUN statement

1) Copy and paste this program into the editor in SAS. Note how the program is automatically color coded to signify different types of input. What does each color represent? Try removing some of the SAS statements in the program – what happens to the color coding? Horizontal lines indicate the beginning and end of PROC and DATA steps. It is a good idea to explicitly place a ‘RUN’ statement at the end of each step.

2) We have used the simplest form of data input, known as ‘list’ input. Each value in a line is separated by one or more spaces. SAS reads each value and assigns it to the corresponding variable in the INPUT statement. Many other formats can be specified, such as column input and comma-separated input.

3) Remove some of the blank lines and edit the comments and title statements as you wish. Click on the ‘+’ and ‘-’ icons on the left side of the editor window to compress and expand parts of your program. Save your program.

4) Run the program using the ‘Run’ drop-down menu or the running man icon on the toolbar. View the information in the log and output windows. If your program statements are cleared automatically from the editor when you submit them, you might want to consider changing the options for the enhanced editor on the Tools menu. Usually it is more convenient to retain your program in the editor window for further use.

Part IV. Working with large data sets

1) Open Lab1.xlsx and save the data set on your hard drive.
Rearrange the data so that ‘locations’ can be used as a SAS variable. Ensure that the format of the file meets requirements for use as a SAS data set. Variable names will be read by SAS from the first row on the spreadsheet.

2) Open SAS and import the data using the file import wizard. Choose a member name such as ‘barley’ to create a data set called ‘work.barley’.

3) Write a program to summarize this data set using PROC MEANS. Rather than entering the data lines directly in the program, instruct SAS to use the data set you have imported by naming it in the PROC statement:

   PROC MEANS DATA=BARLEY;

Alternatively, you could duplicate the original data set in another DATA step:

   DATA NEW;
   SET BARLEY;

A copy of work.barley is made and assigned the name ‘NEW’. No INPUT statement is required in this case because the variable names are already defined in the data set.

4) Summarize the data by locations and by varieties.
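The summaries in step 4 could be approached along the following lines. This is only a sketch, not the required solution: it assumes the rearranged spreadsheet was imported as work.barley and that it contains variables named LOCATION, VARIETY, and YIELD. Those three names are hypothetical; substitute the names that appear in the first row of your own file.

```sas
/* Sketch only: LOCATION, VARIETY, and YIELD are assumed (hypothetical)
   variable names; replace them with the names in your own data set. */

/* Means by location: a BY statement requires sorted data */
PROC SORT DATA=BARLEY;
  BY LOCATION;
RUN;

PROC MEANS DATA=BARLEY;
  TITLE 'BARLEY SUMMARY BY LOCATION';
  VAR YIELD;
  BY LOCATION;
RUN;

/* Means by variety: a CLASS statement does not require sorting */
PROC MEANS DATA=BARLEY;
  TITLE 'BARLEY SUMMARY BY VARIETY';
  VAR YIELD;
  CLASS VARIETY;
RUN;
```

The two forms differ mainly in layout: BY produces a separate table for each group and needs a preceding PROC SORT, while CLASS reports all groups in a single table without sorting.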