
Using SAS PROCS Part 4 - 64

CHAPTER 4 USING SAS PROCS

© Alan Elliott, 2000. All rights reserved. For classroom use only.

In the research environment, one of the primary uses for SAS is to perform calculations for statistical data analysis. SAS provides a number of procedures, called PROCS, that allow the user to apply a number of statistical procedures to data. This chapter introduces some of the most commonly used SAS procedures for data analysis in a research environment. The list of procedures covered here is only a sampling of the capabilities of SAS, but they are specifically chosen because they are commonly used in scientific data analysis. Once you learn the syntax of the procedures covered here, you should be able to apply this knowledge to the use of other specialized procedures in SAS not covered in this workbook.

Procedures covered in this chapter include:

• Sorting data
• Printing data reports and mean summaries
• Calculating one and two dimensional frequency tables
• Performing paired and independent grouped t-tests
• Performing one-way and two-way analysis of variance
• Performing simple and multiple regression using GLM

CHAPTER OBJECTIVE: On completion of this chapter you should know how to use SAS procedures to perform common statistical analysis techniques.


Understanding SAS Support Statements

Before covering specific SAS Procedures for data analysis, you need to understand some supporting SAS statements and procedures that are often used in conjunction with SAS data analysis procedures. The next few sections introduce these support statements.

Using the OPTIONS Statement

The OPTIONS statement temporarily changes one or more SAS system options from the default value. For example:

    OPTIONS PS=60;

specifies the “pagesize,” or lines per page, that SAS will print. Some of the more commonly used options for the OPTIONS statement are listed below. (DATE and NUMBER are the defaults.)

    NODATE (or DATE)       Specifies whether the date will be included in the printout.
    LS = #                 (LINESIZE) Specifies the maximum width of the printout (usually 80).
    NONUMBER (or NUMBER)   Specifies whether the page number will be included in the printout.
    PS = #                 (PAGESIZE) Specifies how many lines to print per page.
    PAGENO = #             Specifies the beginning page number for output.

TIP

If you run several SAS jobs, SAS will remember the number of pages output. Therefore on subsequent jobs, the beginning page number will be one greater than the last page of the previous job. To cause SAS to always begin with page one, you should include the PAGENO= option in an OPTIONS statement at the beginning of your job. For example:

    OPTIONS PS=60 PAGENO=1;

Using the SAS TITLE Statement

The TITLE statement specifies text to be printed at the top of the output pages. Up to 10 title lines can be specified. For example:

    TITLE 'title text';

or

    TITLEn 'title text';

Or cancel title line n (and any higher-numbered title lines) by using the statement

    TITLEn;

The first example specifies the first title line. Subsequent title lines may be defined using TITLE2, TITLE3, and so on up to TITLE10. For example:

    TITLE 'The first line of the title';
    TITLE2 'The second line of the title';
    TITLE5 'Several lines skipped, then this title on the fifth line';

Once you have specified a title, it is used until you redefine the title lines. You can cancel all title lines with the statement

    TITLE;

Including Comments in your SAS Code

It is a good programming practice to include explanatory comments in your code. This allows you to understand what you have done when you go back into your code the next day (or next year) to figure out what you did and why.

Comment statements can be used anywhere in a SAS job to document the job. You can use any number of comment statements in a job. A comment statement that begins with an asterisk MUST end in a semicolon. The message can be any length, although it cannot contain semicolons. Here are some ways to include comments in your code:

Using the Asterisk “*” or “/*” Comment Specifiers

    * This is a message
      It can be several lines long
      But it always ends with a ;

    *************************************************************
    * Boxed messages stand out more                             *
    *************************************************************;

or

    /*This is a SAS comment*/;

In the first two examples, the message begins with an “*” and ends with a ; (semicolon). In the last example, the message begins with a “/*” and ends with a “*/”; the semicolon that follows it is optional.

Using RUN to Execute Statements

The RUN statement causes previously entered SAS statements to be executed. For example:

    PROC PRINT;
    PROC MEANS;
    RUN;

Include a RUN statement after major sections in your SAS code. A RUN statement must be the last statement in your SAS program.

Note

If you fail to include a RUN; statement at the end of your SAS job (or if SAS runs into an error and never sees the RUN; statement), the SAS processor may continue to run in the background. This may cause unpredictable problems. If this occurs, you may want to end the SAS program and restart it to make sure its memory is cleared.

Understanding the SAS PROC Statement Format

Although there are scores of SAS PROC statements you may choose to use, the format is fairly consistent across all PROCS. The general syntax of the SAS PROC statement is:

    PROC name options DATA=datasetname;

Examples:

    PROC PRINT;

    PROC TTEST;
        CLASS GROUP;
        VAR AGE;

    PROC MEANS MEAN MAXDEC=2 DATA=CLASS;

Notice the DATA= option used in the last example above. The DATA= option tells SAS which data set to use in the analysis. If only one data set has been created, there is no need for this option. However, if you have created more than one data set during the run, it is a good idea to include a DATA= option to specify which data set SAS is to use. By default, SAS will use the most recently created data set, unless you specify otherwise.
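The effect of DATA= can be seen in a small sketch. The data set names FIRST and SECOND and their values are made up for illustration and are not part of the workbook data.

```sas
* Hypothetical data sets FIRST and SECOND, made up for illustration *;
DATA FIRST;
  INPUT AGE;
  CARDS;
21
34
45
;
DATA SECOND;
  INPUT AGE;
  CARDS;
52
61
;
* No DATA= option, so SAS uses SECOND, the most recently created data set *;
PROC MEANS;
  VAR AGE;
RUN;
* With DATA=FIRST, SAS uses FIRST regardless of creation order *;
PROC MEANS DATA=FIRST;
  VAR AGE;
RUN;
```

The first PROC MEANS should report on two observations and the second on three, which makes it easy to confirm which data set each step used.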

Adding General Information Statements to PROC Statements

Several types of general statements are associated with SAS procedures and give the procedure additional information about how to process the data: which variables to use (VAR), how to break down the analysis by group (BY), and other special information statements.

Specifying Variables with the VAR Statement

The VAR statement specifies the variables in the selected SAS data set to be processed by the procedure. The syntax of the statement is:

    VARIABLES variable list;

or

    VAR variable list;

Example:

    PROC MEANS;
        VAR HEIGHT WEIGHT AGE;

This statement tells SAS to perform PROC MEANS only on the three listed variables. If you have many variables in your data set, this is a way to limit your analysis only to those of interest.

Note

To save time and typing, you can use a list such as Q1-Q50 to indicate 50 different variable names (Q1, Q2, Q3, etc.) in a VAR statement.

Special Information Statements Are Required for Some PROCS

Certain PROCS require parameters to be passed via special information statements. Below are some examples. In the PROC PLOT procedure, use the PLOT statement to specify the variables to plot.

    PROC PLOT;
        PLOT X*Y;

In the PROC FREQ procedure, use the TABLES statement to specify the variable(s) to analyze.

    PROC FREQ;
        TABLES SEX*RACE;

These are just two common examples of special information statements required in some PROCs. Other examples will be introduced later in these notes.
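As a small illustration of the numbered variable list (such as Q1-Q50) that can be used in a VAR statement, the following sketch uses a hypothetical five-item survey. The data set name SURVEY and its values are invented for this example.

```sas
* Hypothetical survey data with five items, Q1 through Q5 *;
DATA SURVEY;
  INPUT ID Q1-Q5;
  CARDS;
1 3 4 2 5 1
2 4 4 3 5 2
3 2 5 1 4 3
;
* Q1-Q5 stands for Q1 Q2 Q3 Q4 Q5 *;
PROC MEANS DATA=SURVEY;
  VAR Q1-Q5;
RUN;
```

The same shorthand works in the INPUT statement, as shown, so the list only has to be typed once in each place.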

Specifying Subgroup Analysis Using the BY Statement

The “BY” Statement is a very powerful and handy statement that allows you to quickly get subset results from your data. To use it, you must have some sort of grouping variable such as sex, group, race, etc. Then you must sort your data by the variable on which you wish to subset. Finally, you add the BY statement onto an analysis, and SAS will perform the analysis by specified group. The syntax for the BY statement is:

    BY variable list;

You must sort a SAS data set using PROC SORT before you perform an analysis using the BY statement. (See PROC SORT below.) Example:

    PROC SORT DATA=CLASS;
        BY SEX;
    PROC MEANS DATA=CLASS;
        BY SEX;

This results in output where means are reported first for females (F), then males (M) in the CLASS data set. Further uses of the BY statement appear in the sections that follow.


Using PROC SORT

The SORT procedure can rearrange the observations in a SAS data set or create a new SAS data set containing the rearranged observations. With PROC SORT, you can sort on multiple sort fields and sort in ascending or descending order. The sorting sequence for character values (the ASCII sequence, as used on systems such as Windows) is as follows:

    blank !"#$%&'()*+,-./0123456789:;<=>?@
    ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
    abcdefghijklmnopqrstuvwxyz{|}~

Tip

Notice that upper case characters (such as “Z”) come before all lower case characters (such as “a”). Thus, when sorting, Z comes before a. If you have a variable such as Sex entered as M and F, and some values are entered as m and f, then your sort will not properly group the records into Male and Female.

The sorting sequence for numeric variables is: missing values first, then numeric values.

The syntax for PROC SORT is:

    PROC SORT DATA=datasetname OUT=datasetname;

General statements used with PROC SORT include:

    BY variable list;
    BY DESCENDING variable list;

Note

The default sorting order is ASCENDING. The DESCENDING keyword in the BY statement tells SORT to arrange the values of the variable that follows it from highest to lowest instead of the default lowest to highest order.

Use the OUT= option to create a new sorted SAS data set. For example:

    PROC SORT DATA=MYDATA OUT=MYSORT;
        BY RECTIME;

creates a new SAS data set named MYSORT, which you could then use in a procedure. For example:

    PROC PRINT DATA=MYSORT;
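One way to guard against the mixed-case problem described in the Tip above is to convert the grouping variable to upper case before sorting. The following is a minimal sketch; the data set PEOPLE and its values are hypothetical, and UPCASE is the standard SAS function for converting character values to upper case.

```sas
* Hypothetical data set with SEX entered in mixed case *;
DATA PEOPLE;
  INPUT SEX $ AGE;
  SEX = UPCASE(SEX);   * Convert m and f to M and F before sorting *;
  CARDS;
M 34
f 28
m 45
F 39
;
* OUT= leaves PEOPLE unchanged and writes the sorted copy to BYSEX *;
PROC SORT DATA=PEOPLE OUT=BYSEX;
  BY SEX;
PROC PRINT DATA=BYSEX;
RUN;
```

After the conversion only two distinct values of SEX remain, so a later BY SEX analysis should produce two groups rather than four.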


Workbook Exercise 4.1 – PROC SORT

The following example illustrates how PROC SORT works.

    **************************************************
    * PROC SORT EXAMPLE                              *
    **************************************************;
    DATA MYDATA;
        INPUT GROUP RECTIME;
        CARDS;
    1 3.1
    2 3.6
    2 4.2
    1 2.1
    1 2.8
    2 3.8
    1 1.8
    ;
    PROC SORT;
        BY RECTIME;
    PROC PRINT;
        TITLE 'Sorting Example';
    PROC SORT;
        BY DESCENDING RECTIME;
    PROC PRINT;
    RUN;

Note

You can sort on several variables at a time. For example:

    PROC SORT;
        BY GROUP SEX STATUS;
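When several sort variables are listed, DESCENDING applies only to the variable that immediately follows it. The sketch below illustrates this with a hypothetical data set SCORES whose values are made up for the example.

```sas
* Hypothetical data, made up for illustration *;
DATA SCORES;
  INPUT GROUP SEX $ SCORE;
  CARDS;
2 M 71
1 F 88
1 M 64
2 F 93
1 F 79
;
* GROUP ascending, then SCORE from highest to lowest within each GROUP *;
PROC SORT DATA=SCORES;
  BY GROUP DESCENDING SCORE;
PROC PRINT DATA=SCORES;
RUN;
```

To sort both variables from highest to lowest, the keyword would have to be repeated, as in BY DESCENDING GROUP DESCENDING SCORE.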


Using PROC Statements for DATA Analysis

There are many different PROC statements available to you to use within SAS. The following sections describe some of the most commonly used PROCS for data analysis. The SAS procedures that are described in these notes are computer programs that:

• Create, read and manipulate SAS data sets.
• Print listings and reports.
• Compute statistics, perform analyses.

Some procedures that tend to be used to prepare data for analysis, such as PROC SORT, have already been described. Other selected data analysis procedures described in these notes include:

• PROC PRINT - Print a listing of the data.
• PROC MEANS - Calculate univariate statistics on a variable.
• PROC FREQ - Calculate frequencies and two-way tables.
• PROC TTEST - Analyze an independent group t-test.
• PROC ANOVA - Analyze a one-way analysis of variance.
• PROC GLM - General Linear Models Procedure for ANOVA and Regression.

Some additional procedures that you may want to use include regression (REG), multivariate (FACTOR), scoring (SCORE), nonparametrics (NPAR1WAY) and discriminant analysis (DISCRIM). These and other procedures are described in the SAS/STAT user’s guides. Once you learn how to use the procedures in these notes, you will probably be able to quickly understand the syntax required for other SAS procedures.


Using PROC PRINT

The PROC PRINT procedure prints a listing of the data values in a SAS data set. Features include:

• Automatic formatting.
• Columns labeled with variable names (or variable labels).
• Special handling of control breaks.
• Printing summaries.
• Optimized page space.
• Special BY formatting.

The syntax for PROC PRINT is:

    PROC PRINT options DATA=datasetname;

Options you may include with the PROC PRINT statement include:

    N         Include the number of observations in the listing
    DOUBLE    Double space the output
    UNIFORM   Use a uniform appearance of pages
    ROUND     Round data before totaling

General Information statements used with PROC PRINT include:

    BY variable list;
    VAR variable list;
    SUM variable list;

The BY and VAR statements have been previously described. The SUM statement tells SAS to include totals for the variables specified in the variable list in the data listing.
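The following sketch puts several of these PROC PRINT pieces together. The data set EXPENSES and its values are hypothetical and are used only to show the N and DOUBLE options with the VAR and SUM statements.

```sas
* Hypothetical expense data, made up for illustration *;
DATA EXPENSES;
  INPUT DEPT $ ITEM $ COST;
  CARDS;
A PAPER 12.50
A TONER 48.00
B PAPER 10.25
B CABLE 7.75
;
* N adds the observation count and DOUBLE double spaces the listing *;
PROC PRINT N DOUBLE DATA=EXPENSES;
  VAR DEPT ITEM COST;
  SUM COST;
  TITLE 'Listing with a Total for COST';
RUN;
```

The listing should show the four records followed by a total for COST and the number of observations.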


Using PROC MEANS

The PROC MEANS procedure produces simple univariate (one variable) descriptive statistics (means, standard deviation, minimum, maximum, etc.) for numeric variables. The syntax of the PROC MEANS statement is:

    PROC MEANS options DATA=datasetname;

General Information statements used with PROC MEANS include:

    BY variable list;
    VAR variable list;
    OUTPUT OUT=datasetname;

The BY-group specification causes MEANS to calculate descriptive statistics separately for groups of observations (i.e., treatment means). The OUTPUT OUT= statement allows you to output the means to a new data set.

Options available in the PROC MEANS statement:

Statistics options that may be requested are (default statistics are marked with *):

    N *      Number of observations
    NMISS    Number of missing observations
    MEAN *   Mean (arithmetic average)
    STD *    Standard deviation
    MIN *    Minimum (smallest) observation
    MAX *    Maximum (largest) observation
    RANGE    Range
    SUM      Sum of the observations
    VAR      Variance
    USS      Uncorrected sum of squares
    CSS      Corrected sum of squares
    STDERR   Standard error
    T        Student’s t value for testing Ho: μd = 0
    PRT      P-value associated with the t-test above
    SUMWGT   Sum of the WEIGHT variable values

Other options available in PROC MEANS include:

    NOPRINT    Do not print output
    MAXDEC=n   Use n decimal places to print output

Examples:

    PROC MEANS;                 * Simplest invocation - uses all defaults *;

    PROC MEANS N MEAN STD;      * Specified statistics and variables *;
        VAR SODIUM CARBO;

    PROC SORT;
        BY SEX;                 * Subgroup descriptive statistics *;
    PROC MEANS MAXDEC=2;
        BY SEX;
        VAR FAT PROTEIN SODIUM;
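None of the examples above uses the OUTPUT OUT= statement, so here is a minimal sketch of it. The data set FOOD, its values, and the output names FOODMEAN, MFAT and MPROTEIN are all hypothetical.

```sas
* Hypothetical nutrition data, made up for illustration *;
DATA FOOD;
  INPUT SEX $ FAT PROTEIN;
  CARDS;
F 10 22
F 14 25
M 18 30
M 21 28
;
PROC SORT DATA=FOOD;
  BY SEX;
* NOPRINT suppresses the listing because the results go to a data set *;
PROC MEANS NOPRINT DATA=FOOD;
  BY SEX;
  VAR FAT PROTEIN;
  OUTPUT OUT=FOODMEAN MEAN=MFAT MPROTEIN;
PROC PRINT DATA=FOODMEAN;
RUN;
```

FOODMEAN should contain one record per value of SEX, with the group means stored in MFAT and MPROTEIN so that they can be used in later DATA steps or procedures.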

Calculating Means Using PROC MEANS

A common step in describing your data is reporting means. PROC MEANS allows you to easily calculate means for an entire data set or by some grouping variables. For example, using the DIXON data set, you can calculate means for the entire database on selected variables using the following code:

    libname mydata 'a:\';
    data cardiac;
        set mydata.dixon;
    PROC MEANS data=CARDIAC;
        VAR AGE_1952 BLDCHL52 HEIGHT52;
        TITLE 'Means from Dixon Data Set';
    run;

To calculate these means for those who had a coronary event and for those who have not, you could use this procedure:

    libname mydata 'a:\';
    data cardiac;
        set mydata.dixon;
    PROC SORT;
        BY CORONARY;
    PROC MEANS data=CARDIAC;
        VAR AGE_1952 BLDCHL52 HEIGHT52;
        BY CORONARY;
        TITLE 'Means from Dixon Data Set';
    run;

Note that variable names in a VAR statement are separated by blanks, not commas, and that the RUN statement comes after all of the statements that belong to the procedure. In the second case, the data is sorted on the 0,1 variable CORONARY, then means are calculated and reported for each group.


Using PROC MEANS to Perform a Paired t-test

To compare two paired groups (such as in a before-after situation) where both observations are taken from the same or matched subjects, you can perform a paired t-test using PROC MEANS. For example, suppose your data contained the variables WBEFORE and WAFTER (before and after weight on a diet), with 8 subjects. To perform a paired t-test using PROC MEANS, follow these steps:

1. Read in your data.
2. Calculate the difference between the two observations (WLOSS is the change in weight; a negative value means weight was lost), and
3. Report the mean loss, t-statistic and p-value using PROC MEANS.

The hypotheses for this test are:

    Ho: μLoss = 0   (The average weight loss was 0)
    Ha: μLoss ≠ 0   (The weight loss was different than 0)

For example, the following code performs a paired t-test for weight loss data:

    OPTIONS PS=60;
    DATA WEIGHT;
        INPUT WBEFORE WAFTER;
        WLOSS=WAFTER-WBEFORE;   * Calculate WLOSS in the DATA step *;
        CARDS;
    200 190
    175 154
    188 176
    198 193
    197 198
    310 240
    245 204
    202 178
    ;
    PROC MEANS N MEAN T PRT;
        VAR WLOSS;
        TITLE 'Paired t-test example using PROC MEANS';
    RUN;

Notice that the actual test is performed on the new variable called WLOSS, and that is why it is the only variable requested in the PROC MEANS statement. The statistics of interest are the mean of WLOSS, its t-statistic associated with the null hypothesis, and the p-value. The SAS output for this job would include the following information:

    N           Mean              T    Prob>|T|
    ---------------------------------------
    8    -22.7500000     -2.7884739      0.0270

The mean of the variable WLOSS is -22.75. The t-statistic associated with the null hypothesis is -2.788, and the p-value for this paired t-test is p = 0.027, which provides evidence to reject the null hypothesis.
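As a check, the t-statistic can be reproduced by hand from the eight differences. The standard deviation of the differences (about 23.08) does not appear in the SAS output; it was computed here from the data listed in the code, so treat it as a derived value rather than part of the original output.

```latex
\[
t \;=\; \frac{\bar{d}}{s_d/\sqrt{n}}
  \;=\; \frac{-22.75}{23.08/\sqrt{8}}
  \;\approx\; \frac{-22.75}{8.16}
  \;\approx\; -2.79,
\qquad \text{df} = n - 1 = 7 .
\]
```

Here \(\bar{d}\) is the mean of WLOSS, \(s_d\) is its standard deviation, and \(n\) is the number of subjects.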

Comment

Interpreting Statistical p-values

Articles in scientific journals routinely report the results of studies. Based on these results, important decisions are made. For example, new drugs are often tested against standard drugs to determine if the new drug is more effective. Or, several methods of manufacturing may be compared to select the best technique for fabricating a component. Or, evidence may be examined to determine if there is a possible link between one activity and a result (e.g., asbestos and lung cancer). In most of these kinds of studies, results are commonly summarized by a statistical test, and a decision about the significance of the result is based on a p-value.

The reader of an article must decide, like a juror on a criminal case, if the evidence is strong enough to believe. Assuming the study was designed according to good scientific practice, the strength of the evidence is contained in the p-value. Therefore, it is important for the reader to know what the p-value means.

To describe how the p-value works, we'll use a common statistical test as an example, the Student's t-test for independent groups. For this test, subjects are randomly assigned to one of two groups. Some treatment is performed on the subjects in one group, and the other group acts as a control, where no treatment or a standard treatment is given. For this example, suppose group one is given a new headache medicine and group two is given the standard headache medicine. Time to relief is measured for both groups. The outcome measurement is assumed to be a continuous variable which is normally distributed, and it is assumed that the population variance for the measure is the same for both groups. For this example the sample mean for group one is 10 and the sample mean for group two is 12. The sample standard deviation for group one is 1.8 and the sample standard deviation for group two is 1.9. The sample size for both groups is 12.

Entering this data into a statistical program will produce the results t = -2.65 with 22 degrees of freedom, and a p-value of 0.0147. This means that you have evidence that the mean time to relief for group one was significantly different than for group two. To interpret this p-value, you must first know how the test was structured. In the case of this two-sided t-test, the hypotheses are:

    Ho: μ1 = μ2   (Null hypothesis: the means of the two groups are equal.)
    Ha: μ1 ≠ μ2   (Alternative hypothesis: the means of the two groups are not equal.)

A low p-value for the statistical test points to rejection of the null hypothesis because it indicates how unlikely it is that a test statistic as extreme as or more extreme than the one given by this data will be observed from this population if the null hypothesis is true. Since p=0.015, this means that if the population means were equal as hypothesized (under the null), there is a 15 in 1000 chance that a test statistic this extreme or more extreme would be obtained using data from this population. If you agree that there is enough evidence to reject the null hypothesis, you conclude that there is significant evidence to support the alternative hypothesis.

The researcher decides what significance level to use, that is, what cutoff point will decide significance. The most commonly used level of significance is 0.05. When the significance level is set at 0.05, any test resulting in a p-value under 0.05 would be significant. Therefore, you would reject the null hypothesis in favor of the alternative hypothesis. Since you are comparing only two groups, you can look at the sample means to see which is larger. The sample mean of group one is smaller, so you conclude that medicine one acted significantly faster, on average, than medicine two.

Using standard APA journal format, this would be reported in an article using a phrase like this: "The mean time to relief for group one was significantly smaller than for group two (two-sided t-test, t(22) = -2.65, p = 0.015)."

P-values do not simply provide you with a Yes or No answer; they provide a sense of the strength of the evidence against the null hypothesis. The lower the p-value, the stronger the evidence. Once you know how to read p-values, you can more critically interpret journal articles, and decide for yourself if you agree with the conclusions of the author.
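The t-statistic and p-value quoted in the headache example can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming the pooled (equal variance) form of the test; the data set name PVALUE and the variable names are made up for this illustration, and PROBT is the SAS function for the cumulative t distribution.

```sas
* Summary statistics quoted in the headache example *;
DATA PVALUE;
  N1 = 12;  MEAN1 = 10;  SD1 = 1.8;
  N2 = 12;  MEAN2 = 12;  SD2 = 1.9;
  DF = N1 + N2 - 2;
  * Pooled variance, assuming equal population variances *;
  SP2 = ((N1-1)*SD1**2 + (N2-1)*SD2**2) / DF;
  T = (MEAN1 - MEAN2) / SQRT(SP2*(1/N1 + 1/N2));
  * Two-sided p-value from the t distribution *;
  P = 2*(1 - PROBT(ABS(T), DF));
PROC PRINT DATA=PVALUE;
  VAR T DF P;
RUN;
```

The printed values should agree with those reported in the text (t of about -2.65 with 22 degrees of freedom and p of about 0.0147).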


Using PROC FREQ - Frequencies and Crosstabulations

The PROC FREQ procedure is used to obtain frequency counts for one or more variables. One-way and two-way tables may be created. The syntax for PROC FREQ is:

    PROC FREQ options;
        TABLES specification;

For example, to obtain counts of the number of subjects falling into each of the SOCIO categories, use the procedure:

    PROC FREQ;
        TABLES SOCIO;

To produce a count of CORONARY events by treatment GROUP, use the statement:

    PROC FREQ;
        TABLES CORONARY*GROUP;

Frequency Table Example

When only one variable is used in the TABLES statement, PROC FREQ produces a frequency table. For example, using the data from the DIXON data set, the following code will produce a frequency table for the SOCIO variable:

    libname mydata 'a:\';
    data dixon;
        set mydata.dixon;
    PROC FREQ;
        TABLES SOCIO;
        TITLE 'Frequencies on Social Status variable';
    run;

The output for this job is (in part):

    Frequencies on Social Status variable

                                     Cumulative   Cumulative
    SOCIO    Frequency    Percent     Frequency      Percent
    --------------------------------------------------
        1           24       12.0            24         12.0
        2           37       18.5            61         30.5
        3           92       46.0           153         76.5
        4           22       11.0           175         87.5
        5           25       12.5           200        100.0


Analyzing a Two-Way Table using PROC FREQ

When two crossed variables are given in the TABLES statement (i.e., A * B), PROC FREQ will produce a crosstabulation table. Often, you want to know the statistics associated with this table, so you may use the /CHISQ option in the TABLES statement to request that statistics be reported. For example:

    PROC FREQ;
        TABLES CORONARY*GROUP/CHISQ;

will create a two-way crosstabulation table and will also cause SAS to report a battery of statistics associated with the table.

Test Assumptions: For the Chi-square statistic, the observed data are assumed to be counts of qualitative/categorical data such as hair color, presence or absence of a condition (e.g., a disease), etc. A crosstabulation table (contingency table) is formed by counting the number of occurrences in a sample across two grouping variables (specified in TABLES). The number of columns in a table is usually denoted by c and the number of rows by r. Thus, a table is said to have r x c "cells." For example, a dominant hand (left, right) by hair color table, with 5 hair colors used, would be referred to as a 2 x 5 table.

Two types of tests are commonly associated with an r x c table. They are the test of independence and the test of homogeneity. The hypotheses for the test of independence are:

    Ho: The variables are independent (no association between the two variables)
    Ha: The variables are not independent

Thus, in the “hair” example, the null hypothesis would mean that there is no association between dominant hand and hair color (left and right handed people have the same distribution of hair color). The alternative hypothesis would mean that left and right handed people have different distributions of hair color; perhaps left-handed people have more blondes.

Another test that can be performed for a contingency table is a test of homogeneity. In this case, the table is built of data from two populations, and the test examines whether the populations come from the same distribution. In this case the hypotheses are:

    Ho: The populations are homogeneous.
    Ha: The populations are not homogeneous.

In this case the rows (or columns) represent data from different populations, and the columns (or rows) represent data observed on the population.

In the output, the chi-square (χ²) test of homogeneity or independence is reported (mathematically, the tests are equivalent). Also included in the output is a likelihood ratio chi-square, Mantel-Haenszel chi-square, phi, contingency coefficient, and Cramer’s V. For a 2 x 2 table, a Fisher’s exact test is performed (one and two-tailed). For example, examine the following code and resulting output:

    libname mydata 'a:\';
    data dixon;
        set mydata.dixon;
    PROC FREQ;
        TABLES CORONARY*GROUP/CHISQ;
    run;

The output for this job is (in part):

    CORONARY     GROUP

    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |A       |B       |  Total
    ----------------------------
           0 |     85 |     89 |    174
             |  42.50 |  44.50 |  87.00
             |  48.85 |  51.15 |
             |  85.00 |  89.00 |
    ----------------------------
           1 |     15 |     11 |     26
             |   7.50 |   5.50 |  13.00
             |  57.69 |  42.31 |
             |  15.00 |  11.00 |
    ----------------------------
    Total         100      100      200
                50.00    50.00   100.00

    STATISTICS FOR TABLE OF CORONARY BY GROUP

    Statistic                     DF     Value      Prob
    ------------------------------------------------------
    Chi-Square                     1     0.707     0.400
    Likelihood Ratio Chi-Square    1     0.710     0.400
    Continuity Adj. Chi-Square     1     0.398     0.528
    Mantel-Haenszel Chi-Square     1     0.704     0.402
    Fisher's Exact Test (Left)                     0.264
                        (Right)                    0.853
                        (2-Tail)                   0.529
    Phi Coefficient                     -0.059
    Contingency Coefficient              0.059
    Cramer's V                          -0.059

The table provides counts for each combination of group and coronary, plus percentages by row, column and total. All of the various tests for this table have p-values (Prob) that are greater than 0.05, so there is no evidence that Coronary is related to Group for these data.
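If you have only the cell counts of a table rather than one record per subject, the same analysis can be sketched using the WEIGHT statement in PROC FREQ. The data set name CELLS and the variable COUNT are made up for this illustration; the counts are copied from the table above.

```sas
* Cell counts copied from the CORONARY by GROUP table *;
DATA CELLS;
  INPUT CORONARY GROUP $ COUNT;
  CARDS;
0 A 85
0 B 89
1 A 15
1 B 11
;
PROC FREQ DATA=CELLS;
  WEIGHT COUNT;   * Each record stands for COUNT subjects *;
  TABLES CORONARY*GROUP / CHISQ;
RUN;
```

Because the counts match the DIXON table, this job should reproduce the crosstabulation and chi-square statistics shown above without needing the original data file.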


Using PROC TTEST - Independent Group t-test

The SAS TTEST procedure is used to perform calculations for an independent group t-test. That is, when the means of two groups are to be compared (where each group consists of subjects that are not related) then PROC TTEST may be used.

Assumptions: Subjects are randomly assigned to one of two groups. The distribution of the means by group are normal with equal variances. Sample sizes between groups do not have to be equal.

Test: The hypotheses for the comparison of means from two independent groups are:

    Ho: μ1 = μ2   (means of the two groups are equal)
    Ha: μ1 ≠ μ2   (means are not equal)

The test statistic is a Student’s t with N-2 degrees of freedom (when the variances are assumed equal), where N is the total number of subjects. A low p-value for this test indicates evidence to reject the null hypothesis in favor of the alternative. In other words, there is evidence that the means are not equal.

For example, suppose we are interested in comparing SCORE across GROUPS, where there are two groups. The purpose is to determine if the mean SCORE on a test is different for the two groups tested (e.g., control and treatment groups).

The syntax for the TTEST procedure is:

    PROC TTEST options;
        CLASS variable;

The CLASS statement is required, and specifies the grouping variable for the analysis. The data for this grouping variable must contain two and only two values. Example:

    PROC TTEST;
        CLASS GROUP;
        VAR SCORE;

In this example, the GROUP variable contains the two values, 1 or 2. A t-test will be performed on the variable SCORE.

General Information statements used with PROC TTEST include:

    BY variable list;
    VAR variable list;

The BY-group specification causes t-tests to be run separately for groups specified by the BY statement. The following example performs an independent group t-test:

    ********************************************
    * SAS CODE FOR AN INDEPENDENT GROUP T-TEST *
    ********************************************;
    DATA TTEST;
        INPUT GROUP 1-4 OBS 5-12;   * GROUP in columns 1-4, OBS in columns 5-12 *;
        CARDS;
    1   23.00
    1   23.00
    1   32.00
    1   24.00
    1   25.00
    2   25.00
    2   46.00
    2   56.00
    2   45.00
    2   56.00
    2   55.00
    ;
    PROC TTEST;
        CLASS GROUP;
        VAR OBS;
    RUN;

The output from this analysis is (in part):

    TTEST PROCEDURE

    Variable: OBS

    GROUP  N         Mean      Std Dev    Std Error     Variances        T    DF  Prob>|T|
    ----------------------------------------------     ---------------------------------------
        1  5  25.40000000   3.78153408   1.69115345     Unequal    -4.2134   6.2    0.0053
        2  6  47.16666667  11.95686693   4.88137048     Equal      -3.8811   9.0    0.0037

    For H0: Variances are equal, F' = 10.0    DF = (5,4)    Prob>F' = 0.0445

This output is a little busy, so you have to be careful what you are reading. Notice that there are actually two tables (a left and right table). The left table reports the means, standard deviations and standard errors for the two groups. The right table reports two versions of the t-test: one for which variances are assumed to be “Unequal” and the other for which the variances are assumed to be “Equal.” The F-test below the table is a test for equality of variance.

For this example, the p-value for the equal variance test is p = 0.0445. Since this is < 0.05, it gives evidence that the variances for the two groups are not equal. Thus, we use the “Unequal” version of the t-test in the right table. For this t-test, the t-statistic is -4.2134, and the p-value is 0.0053. Our conclusion is that there is a difference in means across groups.
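As a rough check on the “Unequal” line of the output, the t-statistic can be reproduced from the group means and standard errors in the left table. This is a sketch using values rounded from the output, so the result agrees only to about two decimal places.

```latex
\[
t \;=\; \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{SE_1^2 + SE_2^2}}
  \;=\; \frac{25.40 - 47.17}{\sqrt{1.691^2 + 4.881^2}}
  \;\approx\; \frac{-21.77}{5.17}
  \;\approx\; -4.21
\]
```

Here \(SE_1\) and \(SE_2\) are the values in the Std Error column for groups 1 and 2.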

Note

When you are comparing two “groups” where the groups contain the same subjects, such as in a before-after experiment, or where the two observations are related or are taken from the same subject, then a paired t-test is appropriate rather than an independent group t-test. For this analysis, see the PROC MEANS description earlier in these notes.


Using PROC ANOVA for One Way Analysis of Variance

A one-way analysis of variance is an extension of the independent group t-test where there are more than two groups.

Assumptions: Subjects are randomly assigned to one of n groups. The distribution of the means by group are normal with equal variances. Sample sizes between groups do not have to be equal, but large differences in sample sizes by group may affect the outcome of some multiple comparisons tests.

Test: The hypotheses for the comparison of independent groups are (k is the number of groups):

    Ho: μ1 = μ2 = ... = μk        (means of all the groups are equal)
    Ha: μi ≠ μj for some i, j     (means of two or more groups are not equal)

The results of the analysis are reported in an Analysis of Variance (ANOVA) table. The test statistic is an F test with k-1 and N-k degrees of freedom, where N is the total number of subjects. A low p-value for the F-test indicates evidence to reject the null hypothesis in favor of the alternative. In other words, there is evidence that at least one pair of means is not equal.

For example, suppose (in the DIXON data) we are interested in comparing AGE_1952 across the 5 SOCIO levels, to determine if the mean age of individuals across groups is significantly different. To do this, we would use the statements:

    PROC ANOVA DATA=ANYNAME;
        CLASSES SOCIO;
        MODEL AGE_1952=SOCIO;
        TITLE 'Compare AGE across SOCIO';

SOCIO is the "CLASS" or grouping variable (containing five levels), and AGE_1952 is the continuous variable, whose means across groups are to be compared. The MODEL statement can be thought of as

    DEPENDENT VARIABLE = INDEPENDENT VARIABLE(S);

where the DEPENDENT variable is the "response" variable, or the one you measured, and the independent variable(s) are the grouping variable(s) whose levels are being compared.

Since the rejection of the null hypothesis does not specifically tell you which means are different, a multiple comparison test is often performed following a significant finding in the One-Way ANOVA. To request multiple comparisons in PROC ANOVA, include a MEANS statement with a multiple comparison option. The syntax for this statement is

    MEANS SOCIO /testname;

where testname is a multiple comparison test. Some of the tests available in SAS are:

    BON       Performs Bonferroni t-tests of differences
    DUNCAN    Duncan’s multiple range test
    SCHEFFE   Scheffe multiple comparison procedure
    SNK       Student Newman Keuls multiple range test
    LSD       Fisher’s Least Significant Difference test
    TUKEY     Tukey’s studentized range test

You may also specify

    ALPHA = p   Selects level of significance for comparisons (default is 0.05)

For example, to select the TUKEY test, you would use the statement

    MEANS SOCIO /TUKEY;

Graphical comparison: A graphical comparison allows you to visually see the distribution of the groups. If the p-value is low, chances are there will be little overlap between the two or more groups. If the p-value is not low, there will be a fair amount of overlap between all of the groups. A simple graph for this analysis can be created using the PROC PLOT or PROC GPLOT procedure. For example:

    PROC GPLOT;
        PLOT SOCIO*AGE_1952;

will produce a dot plot showing the ages by group. Note that if you want a character based plot, substitute “PLOT” for “GPLOT” in the above statement. If you want a plot with more substance, you’ll have to do more SAS programming. One book that can help is Quick Results with SAS/Graph Software by Arthur L. Carpenter and Charles Shipp, published by SAS Institute. There are also some plot examples in Chapter 5.

Following is an example job that performs a one-way ANOVA, and produces a plot.


One-Way ANOVA Example

Suppose you are comparing the time to relief of three different headache medicines noted as brands 1, 2 and 3. The time to relief data is reported in minutes. For this experiment, 15 subjects were randomly placed on one of the three medicines. Which medicine (if any) is the most effective?

    Brand 1    Brand 2    Brand 3
     24.5       28.4       26.1
     23.5       34.2       28.3
     26.4       29.5       24.3
     27.1       32.2       26.2
     29.9       30.1       27.8

Notice that SAS expects the data to be entered as two variables, a group and an observation. Here is the SAS code to analyze this data.

    DATA ACHE;
    INPUT BRAND RELIEF;
    CARDS;
    1 24.5
    1 23.5
    1 26.4
    1 27.1
    1 29.9
    2 28.4
    2 34.2
    2 29.5
    2 32.2
    2 30.1
    3 26.1
    3 28.3
    3 24.3
    3 26.2
    3 27.8
    ;
    PROC ANOVA DATA=ACHE;
    CLASSES BRAND;
    MODEL RELIEF=BRAND;
    MEANS BRAND/TUKEY;
    TITLE 'Compare RELIEF across MEDICINES - ANOVA Example';
    run;
    PROC GPLOT;
    PLOT RELIEF*BRAND;
    run;

Following is the (partial) output for the headache relief study:

    Analysis of Variance Procedure

    Dependent Variable: RELIEF

    Source             DF    Sum of Squares     Mean Square   F Value   Pr > F
    Model               2       66.77200000     33.38600000      7.14   0.0091
    Error              12       56.12800000      4.67733333
    Corrected Total    14      122.90000000

    R-Square        C.V.      Root MSE     RELIEF Mean
    0.543303    7.751664    2.16271434     27.90000000

    Source             DF          Anova SS     Mean Square   F Value   Pr > F
    BRAND               2       66.77200000     33.38600000      7.14   0.0091

    Tukey's Studentized Range (HSD) Test for variable: RELIEF

    Alpha= 0.05  df= 12  MSE= 4.677333
    Critical Value of Studentized Range= 3.773
    Minimum Significant Difference= 3.649

    Means with the same letter are not significantly different.

    Tukey Grouping      Mean    N   BRAND
                 A    30.880    5   2
                 B    26.540    5   3
                 B
                 B    26.280    5   1

Interpreting the ANOVA Table and Comparisons

The p-value in this ANOVA is listed as "Pr > F" and is p = 0.0091. The top table is a test of the overall model, and the bottom table is a specific test for the "main effect" BRAND. In the case of a one-way ANOVA, these are the same tests. The F-test tells you that a statistically significant difference exists between the brands.

The Tukey Studentized Range comparison (at the alpha = 0.05 level) concludes that the mean for brand 2 is significantly higher than the means of brands 1 and 3, and that there is no significant difference between brands 1 and 3. (The "Tukey Grouping" tells you that the "A" group is different from the "B" group. Since there is only one mean in the "A" grouping (30.88, brand 2), it is different from the other means. The "B" grouping contains two means (26.54 and 26.28) that are not significantly different from each other.)

Visual Comparison: A quick graph of relief across brand also shows you the distribution of relief across brands, which visually confirms the ANOVA results. Note that Brand 2 relief results tend to be higher than the levels for brands 1 and 3.
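The entries in the printout can be checked by hand, which helps show where each one comes from. The following is a worked sketch that uses only values from the output above. The formula for the minimum significant difference (MSD) assumes equal group sizes (n = 5 per brand), as is the case here, and q is the critical value of the studentized range reported by SAS.

```latex
\begin{align*}
MS_{\text{Model}} &= \frac{66.772}{2} = 33.386, \qquad
MS_{\text{Error}} = \frac{56.128}{12} \approx 4.6773 \\
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{33.386}{4.6773} \approx 7.14 \\
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{66.772}{122.900} \approx 0.5433 \\
\text{MSD} &= q\sqrt{\frac{MS_{\text{Error}}}{n}} = 3.773\sqrt{\frac{4.6773}{5}} \approx 3.649 \\
\bar{y}_2 - \bar{y}_3 &= 30.88 - 26.54 = 4.34 > 3.649 \\
\bar{y}_2 - \bar{y}_1 &= 30.88 - 26.28 = 4.60 > 3.649 \\
\bar{y}_3 - \bar{y}_1 &= 26.54 - 26.28 = 0.26 < 3.649
\end{align*}
```

Only the two comparisons involving brand 2 exceed the minimum significant difference, which is why brand 2 stands alone in grouping A while brands 1 and 3 share grouping B.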


Note

You can also use PROC ANOVA to perform a two-way ANOVA (and a lot of other kinds of ANOVAs), using the MODEL statement described in the PROC GLM section, but the data must be balanced (equal sample sizes in all cells). If speed is a problem for your analysis, use PROC ANOVA. Otherwise, you might as well use PROC GLM. For balanced data, you'll get the same answers either way.
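As a minimal sketch of this point, the headache example from the One-Way ANOVA section can be rerun with PROC GLM in place of PROC ANOVA. The data are balanced (five subjects per brand), so the ANOVA table and Tukey comparisons should agree with the PROC ANOVA results. The title text is an illustrative choice, not part of the original example.

```sas
* Headache data from the one-way ANOVA example;
DATA ACHE;
   INPUT BRAND RELIEF;
   CARDS;
1 24.5
1 23.5
1 26.4
1 27.1
1 29.9
2 28.4
2 34.2
2 29.5
2 32.2
2 30.1
3 26.1
3 28.3
3 24.3
3 26.2
3 27.8
;
* Same model as before, fitted with PROC GLM instead of PROC ANOVA;
PROC GLM DATA=ACHE;
   CLASS BRAND;
   MODEL RELIEF=BRAND;
   MEANS BRAND / TUKEY;
   TITLE 'Compare RELIEF across BRAND - PROC GLM version';
RUN;
QUIT;
```

PROC GLM also prints Type I and Type III sums of squares, which are identical to each other for a one-way design like this one.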


Using PROC GLM for Multi-Way Analysis of Variance

A two-way analysis of variance will be used to illustrate a multi-way ANOVA using GLM. This is only one of many kinds of analyses you can create using GLM. It is one of the most powerful PROCS in SAS for performing customized data analysis.

Definitions: A two-way (or two-factor) Analysis of Variance is a simultaneous analysis of the effects of two factors. Each factor is a "grouping" type variable such as type of treatment, gender, brand, etc. The factorial design is more complete than a single factor design (one-way analysis) since it allows examination of an interaction effect. An interaction effect is the combined effect of the two factors. For example, who would predict that the "interaction" of hydrogen with oxygen would create water, knowing only their individual (non-interacting) properties? The dimensions of a factorial design depend on how many levels of each factor are used. For example, if two levels of one factor are used and three of the other, the design is called a 2 x 3 (2 by 3) factorial design. Generally, the two-way ANOVA may be called a p x q factorial design.

Assumptions: There are several types of factorial designs that depend on the characteristics of the factors. Notice, for example, that some factors have levels that can be controlled by the experimenter, such as dosage. Other factors may be classifications, such as gender, that cannot be controlled. Factors are classified as fixed or random. When the factor levels completely define the possible classifications, it is called a fixed factor. Classifying subjects as Male and Female would be a fixed factor. When the levels are a subset of all possible classifications, the factor is called a random factor. For example, if dosages of a drug used in the experiment are representative dosages (such as 1 mg, 3 mg, 5 mg), the factor would be classified as a random factor. A two-way ANOVA with both factors fixed is called a Model I ANOVA. A factorial design with both factors random is called a Model II ANOVA, which is rarely used. A design with one fixed and one random factor is called a Model III ANOVA. Model type affects how tests are calculated. For this example, we'll assume that factors are all fixed.

Within the p x q possible combinations of factors (cells), subjects should be randomly assigned to a treatment in such a way as to balance the number of subjects per cell. An analysis may still be performed if the cells are not balanced. There are some additional calculation issues involved with the unbalanced case that are not discussed here. Generally, in the unbalanced case, the Type III sums of squares ANOVA output is recommended.

The observed data within cells should be a continuous variable with an approximate normal distribution. That is, it should make sense to compare means of the data between cells. For more information about multi-way ANOVA designs, see:

    Winer, B.J., Statistical Principles in Experimental Designs, Second Edition, McGraw-Hill Company, 1971.
    Neter, J., Wasserman, W., and Kutner, M.H., Applied Linear Statistical Models, Third Edition, Richard D. Irwin, Inc., 1990.

Tests: There are two types of hypotheses used in a two-way factorial design: interaction effects and main effects. Usually, the interaction effect is examined first:

    $H_0$: There is no interaction effect.
    $H_a$: There is an interaction effect.

Interaction implies that the pattern of means across groups is inconsistent. No interaction implies that the pattern of means across groups is close to parallel. If there is no interaction effect, then main effects (the direct effect of one factor summed over the levels of the other factor) can be examined. If there is an interaction effect, main effects cannot be examined directly, since the interaction effect shows that any differences across one main effect are not consistent across all levels of the other factor. Thus, if there is no interaction effect, it is okay to look at the hypotheses for main effects. The "main effects" hypotheses for factor a are (k is the number of levels of a):

    $H_0: \mu_{a_1} = \mu_{a_2} = \dots = \mu_{a_k}$   (means are equal across levels of a, summed over b)
    $H_a: \mu_{a_i} \neq \mu_{a_j}$ for some $i \neq j$   (means are not equal across levels of a, summed over b)

Similarly, for factor b:

    $H_0: \mu_{b_1} = \mu_{b_2} = \dots = \mu_{b_k}$   (means are equal across levels of b, summed over a)
    $H_a: \mu_{b_i} \neq \mu_{b_j}$ for some $i \neq j$   (means are not equal across levels of b, summed over a)

where k is here the number of levels of b. The three tests (interaction, and two main effects tests) are performed in an Analysis of Variance (ANOVA) table as F-tests. A low p-value (usually less than 0.05) for a test indicates evidence to reject the null hypothesis in favor of the alternative.

Two-Way ANOVA Analysis Example

Use PROC GLM to perform a 2-way ANOVA, using SOCIO and GROUP as the classification variables, and PULSE62 as our response variable (in the DIXON database). The SAS code is:

    *********************************************************
    * EXAMPLE TWO-WAY ANOVA USING DIXON DATA                *
    ********************************************************* ;
    libname mydata 'c:\mydir';
    data dixon;
    set mydata.dixon;
    PROC GLM;
    CLASSES SOCIO GROUP;
    MODEL PULSE62=SOCIO GROUP SOCIO*GROUP;
    TITLE 'Compare PULSE across SOCIO and GROUP';
    run;

In the MODEL statement, PULSE62 is the dependent or response variable, and GROUP and SOCIO are the classification variables. NOTICE the specification SOCIO*GROUP on the right side of the MODEL statement. This specifies that an interaction term, the relationship of SOCIO interacting with GROUP, should be taken into account in attempting to "predict" PULSE62.

The following is (partial) output from the example Two-Way ANOVA SAS job:

    Compare PULSE across SOCIO and GROUP

    Dependent Variable: PULSE62

    Source             DF    Sum of Squares     Mean Square   F Value   Pr > F
    Model               9      862.33868958     95.81540995      0.91   0.5202
    Error             190    20069.53631042    105.62913848
    Corrected Total   199    20931.87500000

    Source             DF         Type I SS     Mean Square   F Value   Pr > F
    SOCIO               4      383.00193516     95.75048379      0.91   0.4613
    GROUP               1       26.80818982     26.80818982      0.25   0.6150
    SOCIO*GROUP         4      452.52856460    113.13214115      1.07   0.3721

    Source             DF       Type III SS     Mean Square   F Value   Pr > F
    SOCIO               4      307.91281863     76.97820466      0.73   0.5733
    GROUP               1      177.39978071    177.39978071      1.68   0.1966
    SOCIO*GROUP         4      452.52856460    113.13214115      1.07   0.3721

The top table (the line beginning "Model") is an overall test for the entire design. The p-value associated with this F-test is p = 0.5202. This means that the model as a whole shows no significant effects, so you could stop your analysis with this result.

Nevertheless, notice that there are two versions of an analysis of variance table: one is labeled Type I SS and the other is Type III SS. Without going into the theory behind these different approaches, most statisticians would agree that the Type III SS table is the one that you will most often use. In this table, you would first examine the interaction effect (the SOCIO*GROUP line). Since p = 0.3721, there is no evidence of interaction, so it is permissible to examine the main effects tests. The test for the SOCIO main effect reports p = 0.5733 and for GROUP, p = 0.1966. Thus, in this example, neither main effect test is significant. This means that no significant difference in PULSE was observed across either socioeconomic level or across GROUPS.

This may be an important kind of preliminary test to perform on your data if you wish to show that certain factors (usually not important to your main study) are not biasing the outcomes that are important to your study.
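To see how the entries in the two-way table fit together, the degrees of freedom and the Type III F values can be reproduced from the printed output. This is a worked sketch using only numbers from the printout above; it takes SOCIO to have 5 levels and GROUP to have 2, as implied by their degrees of freedom (4 and 1).

```latex
\begin{align*}
df_{\text{Model}} &= (5-1) + (2-1) + (5-1)(2-1) = 4 + 1 + 4 = 9 \\
df_{\text{Error}} &= df_{\text{Total}} - df_{\text{Model}} = 199 - 9 = 190 \\
F_{\text{SOCIO}} &= \frac{76.97820}{105.62914} \approx 0.73 \\
F_{\text{GROUP}} &= \frac{177.39978}{105.62914} \approx 1.68 \\
F_{\text{SOCIO*GROUP}} &= \frac{113.13214}{105.62914} \approx 1.07
\end{align*}
```

Each F value is the mean square for that effect divided by the error mean square (105.62914), so all three tests share the same denominator.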

Performing Multiple Comparisons in GLM

If a main effect test is significant, it tells you that at least one mean is different, but it does not tell you which means are different. In this case a multiple comparison test may be used to determine specific differences.

Case 1. If the test for interaction is significant (p less than 0.05), you must perform multiple comparisons across both factors (compare cells to each other) to determine where significant differences exist.

Case 2. If the test for interaction is not significant, you may compare levels of each main effect, summing over the other factor, for each comparison.

There are some complicated decisions to be made when performing these multiple comparisons. You should consult a statistician for more information.
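One possible way to request the two kinds of comparisons in PROC GLM is sketched below. It is an illustration, not a recommendation for any particular study. It assumes the DIXON data set has already been read in as in the two-way example, and it uses the LSMEANS statement with the PDIFF and ADJUST=TUKEY options, which are PROC GLM features not otherwise covered in this workbook.

```sas
* Sketch only: assumes the DIXON data set from the two-way example exists;
PROC GLM DATA=DIXON;
   CLASS SOCIO GROUP;
   MODEL PULSE62=SOCIO GROUP SOCIO*GROUP;
   * Case 1: compare the cell means with Tukey-adjusted p-values;
   LSMEANS SOCIO*GROUP / PDIFF ADJUST=TUKEY;
   * Case 2: compare the levels of each main effect;
   LSMEANS SOCIO / PDIFF ADJUST=TUKEY;
   LSMEANS GROUP / PDIFF ADJUST=TUKEY;
   TITLE 'Multiple comparisons for the two-way ANOVA';
RUN;
QUIT;
```

LSMEANS is used here rather than MEANS because the DIXON cells are not balanced; least-squares means adjust for the unequal cell sizes. In practice you would normally request only the comparisons that match the case you are in.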


Comment


Statistical Correlation

One of the most common statistical techniques used in analyzing data is measuring the strength of association between two variables. When those two variables are of a continuous nature (they are measurements such as weight, height, length, etc.) the measure of association most often used is Pearson's correlation coefficient. This type of correlation is very useful and powerful, but care must be taken in using it.

Pearson's correlation coefficient allows researchers to determine if there is a possible linear relationship between two variables measured on the same subject. For example, there is a known relationship between height and weight in normal individuals: the taller a person is, the more they weigh (on the average). This association may be expressed as a number (the correlation coefficient) that ranges from -1 to +1. The population value is often written as the Greek letter rho ($\rho$), and the value calculated from a sample as r.

The correlation measures how well a straight line fits through a scatter of points when plotted on an x-y axis. If the correlation is positive, it means that when one variable increases, the other tends to increase. If the correlation is negative, it means that when one variable increases, the other tends to decrease. When a correlation coefficient is close to +1 (or -1), it means that there is a strong correlation: the points are scattered along a straight line. For example, a correlation r = 0.7 may be considered strong. However, the closer a correlation coefficient gets to 0, the weaker the relationship, and the cloud of points is not close to a straight line. For example, a correlation r = 0.1 might be considered weak. For scientific purposes, a t-test is used to determine whether the correlation coefficient is "significant" or not. This will be discussed later.

Before using the Pearson correlation coefficient as a measure of association, you should be aware of its assumptions and limitations. As mentioned earlier, this correlation coefficient measures a linear relationship. That is, it measures how close the two measurements come to forming a straight line when plotted on an x-y chart. Therefore, it is important that data be graphed before the correlation is interpreted. For example, it is possible that data, when plotted, may show a curved relationship instead of a straight line. When this is the case, a Pearson correlation may not be the best measure of association. There are other conditions when a correlation coefficient may appear important, but when considered in light of a graph, is not a good measure of relationship.

Another assumption of correlation is that both of the variables (the measurements) are continuous data measured on an interval/ratio scale. Data that are not continuous, such as categorical (e.g., hair color) or binomial (e.g., gender) data, would not be acceptable. Also, each variable should be approximately normally distributed. That is, when plotted in a histogram, the data form an approximate bell shaped curve.

Once you have calculated a Pearson correlation coefficient (there are many computer programs that will perform this calculation), you may want to know if r is statistically significant. In statistical terminology, this is a test of the following hypotheses:

    $H_0: \rho = 0$   (the null hypothesis)
    $H_a: \rho \neq 0$   (the alternative hypothesis)

That is, the statistical procedure tests whether r is far enough away from 0 to state that there is a relationship between the two variables. In most programs, the t-test is accompanied by a p-value. If the reported p-value for the test is small (usually less than 0.05) then the conclusion is that rho is not 0, and thus the relationship is statistically significant. A researcher will then have to make a professional judgement to determine if the association is significant in terms of the experiment performed.

Care must be taken when interpreting a statistically significant correlation. If your sample size is small or not representative of the population from which you sampled, you may not be able to generalize the correlation to your intended population. Also, a cause and effect relationship cannot be inferred except under special conditions when you have designed the study specifically to detect those phenomena.

A statistic related to the Pearson correlation coefficient is $R^2$, which is called the "coefficient of determination." This statistic is often interpreted to represent the proportion of the variance in one variable (Y) explained by the other (X). The correlation coefficient is also closely related to regression analysis, which is discussed in the next section.
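In SAS, one way to follow the advice above (plot first, then compute and test the correlation) is with PROC PLOT and PROC CORR. The following is a minimal sketch. The data set name HTWT and the height and weight values are made up purely for illustration and do not come from any study in this workbook.

```sas
* Hypothetical height (inches) and weight (pounds) values for illustration only;
DATA HTWT;
   INPUT HEIGHT WEIGHT;
   CARDS;
62 120
64 135
65 128
67 150
68 155
70 162
71 180
73 185
;
* Plot the data first to check that the relationship looks linear;
PROC PLOT DATA=HTWT;
   PLOT WEIGHT*HEIGHT;
RUN;
QUIT;
* Pearson correlation and the p-value for the test that rho = 0;
PROC CORR DATA=HTWT;
   VAR HEIGHT WEIGHT;
   TITLE 'Pearson correlation of HEIGHT and WEIGHT (made-up data)';
RUN;
```

PROC CORR prints the correlation matrix with a p-value under each coefficient for the null hypothesis that rho is 0.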


Using PROC GLM for Regression Analysis

Linear regression procedures are used to study linear relationships between independent and dependent variables. Unlike Analysis of Variance, where the goal is often to determine if means across groups are different, regression is often used to determine if a dependent variable can be predicted, given one or more independent variables.

Suppose we are interested in determining if SBP_62B can be predicted by AGE_1952 (in the DIXON data), using a linear relationship. To perform a simple linear regression on these two variables, we use the statements:

    PROC GLM;
    MODEL SBP_62B=AGE_1952;

Notice that a big difference between regression and ANOVA is the absence of the CLASSES statement.

It may be that you are interested in how several independent variables can be used to predict a response variable. In this case, specify a model in which AGE, PULSE, BLOOD CHOLESTEROL and WEIGHT can be used to predict SBP_62B.

    PROC GLM;
    MODEL SBP_62B=AGE_1952 PULSE62 BLDCHL62 WEIGHT52;

Simple Linear Regression

A Simple Linear Regression analysis is used to develop an equation (a linear regression line) for predicting a value of the dependent variable given a value of the independent variable. A regression line is the line described by the equation and the regression equation is the formula for the line. The regression equation is given by:

    $Y = a + bX$

where X is the independent variable, Y is the dependent variable, a is the intercept and b is the slope of the line.

Assumptions: For a fixed value of X (the independent variable), the population of Y (the dependent variable) is normally distributed with equal variances across Xs.

Related statistics: The correlation coefficient, r, measures the strength of the association between X and Y.

Test: A test that the slope of the regression line is 0 is used to determine if the regression line shows a statistically significant linear relationship between X and Y. The hypotheses are:

    $H_0: b = 0$   (the slope is 0)
    $H_a: b \neq 0$   (the slope is not 0)

A low p-value for this test (less than 0.05) means that there is evidence to believe that the slope of the line is not 0, or that there is a statistically significant linear relationship between the two variables.

Note

The test for slope is equivalent to the test rho = 0 in a correlation analysis.

Simple Linear Regression Example

A random sample of 14 elementary school students is selected from a school, and each student is measured on a creativity score (x) using a new testing instrument and on a task score (y) using a standard instrument. The task score is the mean time taken to perform several hand-eye coordination tasks. Since the Creativity test is much cheaper, you wonder if you can substitute it for the more expensive Task score. The data are:

    STUDENT   CREATIVITY(X)   TASKS(Y)
    AE             28            4.5
    FR             35            3.9
    HT             37            3.9
    IO             50            6.1
    DP             69            4.3
    YR             84            8.8
    QD             40            2.1
    SW             65            5.5
    DF             29            5.7
    ER             42            3.0
    RR             51            7.1
    TG             45            7.3
    EF             31            3.3
    TJ             40            5.2

Use the following SAS code to perform this analysis:

    ************************************************************
    * Simple Linear Regression Example                         *
    ************************************************************ ;
    DATA ART;
    INPUT SUBJECT $ CREATE TASK;
    CARDS;
    AE 28 4.5
    FR 35 3.9
    HT 37 3.9
    IO 50 6.1
    DP 69 4.3
    YR 84 8.8
    QD 40 2.1
    SW 65 5.5
    DF 29 5.7
    ER 42 3.0
    RR 51 7.1
    TG 45 7.3
    EF 31 3.3
    TJ 40 5.2
    ;
    PROC GLM;
    MODEL TASK=CREATE;
    TITLE 'Simple Linear Regression Example';
    SYMBOL1 V=STAR I=RL; * see explanation below *;
    PROC GPLOT;
    PLOT TASK*CREATE;
    RUN;

Output for Simple Linear Regression

    Dependent Variable: TASK

    Source             DF    Sum of Squares     Mean Square   F Value   Pr > F
    Model               1       13.70112004     13.70112004      5.33   0.0396
    Error              12       30.85387996      2.57115666
    Corrected Total    13       44.55500000

    R-Square        C.V.      Root MSE       TASK Mean
    0.307510    31.75213    1.60348267      5.05000000

    Source             DF         Type I SS     Mean Square   F Value   Pr > F
    CREATE              1       13.70112004     13.70112004      5.33   0.0396

    Source             DF       Type III SS     Mean Square   F Value   Pr > F
    CREATE              1       13.70112004     13.70112004      5.33   0.0396

                                    T for H0:     Pr > |T|    Std Error of
    Parameter         Estimate    Parameter=0                   Estimate
    INTERCEPT      2.164519286           1.64       0.1273      1.32140597
    CREATE         0.062533638           2.31       0.0396      0.02708943

The test for the slope is reported on the line that begins "CREATE", which gives p = 0.0396. This is evidence that the slope is not zero. (It is also a test for the significance of r.) The Estimate column of the same table gives the values of the intercept and slope that can be used to create the regression equation. In this case, the equation would be:

    TASK = 2.1645 + 0.06253 * CREATE

Thus, if you know a student's CREATIVITY score from the new instrument, you could predict the TASK score the student would have received on the standard instrument.

Simple Linear Regression Plot

It is always a good idea to plot your data to verify that the relationship you are interested in is really linear. The statements

    SYMBOL1 V=STAR I=RL; * see explanation below *;
    PROC GPLOT;
    PLOT TASK*CREATE;

produce a scatterplot of the data with the fitted regression line. The SYMBOL1 statement gives SAS information about how to plot data using the GPLOT procedure. (The 1 specifies the first pair of values in the PLOT statement (TASK*CREATE).) The V=STAR specification tells SAS to use stars for the points on the graph. The I option specifies interpolation, and I=RL means to fit a linear regression line to the points. For more information on these GPLOT options, see the chapter "Enhancing Your Graphics Output Designs" in the SAS/GRAPH User's Guide.
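To illustrate how the regression equation might be used for prediction, the sketch below applies it to a few new creativity scores in a DATA step. The intercept and slope are rounded from the parameter estimates in the output (2.164519 and 0.062534). The data set name PREDICT, the variable name TASK_HAT, and the three creativity scores are hypothetical choices made for this example; the scores were picked to lie inside the observed range of 28 to 84.

```sas
* Hypothetical new creativity scores, scored with the fitted equation;
DATA PREDICT;
   INPUT CREATE;
   * Fitted line: intercept 2.1645 and slope 0.06253, rounded from the output;
   TASK_HAT = 2.1645 + 0.06253*CREATE;
   CARDS;
30
50
70
;
PROC PRINT DATA=PREDICT;
   TITLE 'Predicted TASK scores from the fitted regression line';
RUN;
```

For these three scores the predicted task values work out to about 4.04, 5.29 and 6.54. Predictions for creativity scores outside the range seen in the sample should be treated with caution, since the data say nothing about whether the line still holds there.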

Multiple Linear Regression

Multiple Linear Regression is an extension of Simple Linear Regression where there is a dependent variable (Y) and more than one independent variable ($X_i$). The regression equation is of the form

    $Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

where n is the number of independent variables. The intercept, a, and the coefficients $b_i$ in this equation are calculated using a least squares method. This equation can then be used to calculate new predicted values of Y given known values of $X_i$.

Assumptions: For fixed values of $X_i$, the Y values are normally distributed with equal variances.

Statistics and tests: Each coefficient in the equation is tested to see if it contributes significant information. Each of these tests uses the hypotheses:

    $H_0: b_i = 0$
    $H_a: b_i \neq 0$

The test is performed with a t-statistic and a p-value is reported. If the p-value is low (less than 0.05), the conclusion is that the variable does contribute significant information to the equation. Care must be taken since each variable in the equation may be related to another variable, so decisions about the inclusion or exclusion of a particular variable must take that into consideration. A similar t-test is used to determine if the intercept, a, is equal to 0.

The $R^2$ (R-squared) statistic reports the strength of the relationship between the set of independent variables and the dependent variable. $R^2$ ranges from 0 (meaning no relationship) to 1.0 (meaning perfect relationship). The adjusted $R^2$ is adjusted for degrees of freedom.

The overall significance of the regression is tested in an analysis of variance table. This is an overall test that all $b_i$ are 0. The test statistic is F, and a p-value is reported. If the p-value is low (less than 0.05) the conclusion is that some $b_i$ is not 0, and thus the equation will have some predictive value for Y.


Multiple Regression Example

An employer wants to be able to predict how well applicants will do on the job once they are hired. He devises four tests that he thinks will measure the skills required for the job. Ten prospects are selected at random from a group of applicants and given the four tests. Then, they are placed on the shop floor and observed by a supervisor, who gives each of them a job proficiency score (JOBSCORE). The SAS code to perform this analysis using multiple linear regression is:

    DATA JOB;
    INPUT SUBJECT $ TEST1 TEST2 TEST3 TEST4 JOBSCORE;
    CARDS;
    1 75 100 90 88 78
    2 51 85 88 89 71
    3 99 96 94 93 85
    4 92 106 84 84 67
    5 90 89 83 77 69
    6 67 77 83 73 65
    7 109 67 71 65 50
    8 94 112 105 91 107
    9 105 110 99 95 96
    10 74 102 88 69 63
    ;
    PROC GLM;
    MODEL JOBSCORE=TEST1 TEST2 TEST3 TEST4;
    TITLE 'Job Score Analysis using GLM';
    RUN;

The output for the multiple linear regression (shown in the printout that follows) is very similar to the previous simple linear regression output. The "Model" test (p = 0.0003) is a test of the overall model. It also tests that R-square = 0.975 is not zero. The table at the bottom of the output contains tests that the intercept = 0 and that the coefficient for each of the independent variables = 0. In this case, the p-values for TEST2 and TEST4 are not significant. One strategy would be to then drop TEST2 and TEST4 from the model, and rerun the program to see how the newer (and smaller) model fits the data.

Output for Multiple Linear Regression

This printout is partial output for the SAS Multiple Regression Example for the JOB data:

    Job Score Analysis using GLM

    Dependent Variable: JOBSCORE

    Source             DF    Sum of Squares      Mean Square   F Value   Pr > F
    Model               4     2495.96647757     623.99161939     49.58   0.0003
    Error               5       62.93352243      12.58670449
    Corrected Total     9     2558.90000000

    R-Square        C.V.      Root MSE    JOBSCORE Mean
    0.975406    4.724067    3.54777458      75.10000000

    Source             DF         Type I SS      Mean Square   F Value   Pr > F
    TEST1               1      110.99866420     110.99866420      8.82   0.0312
    TEST2               1     1338.06178775    1338.06178775    106.31   0.0001
    TEST3               1     1020.14512294    1020.14512294     81.05   0.0003
    TEST4               1       26.76090269      26.76090269      2.13   0.2046

    Source             DF       Type III SS      Mean Square   F Value   Pr > F
    TEST1               1       89.38255135      89.38255135      7.10   0.0446
    TEST2               1       30.49915564      30.49915564      2.42   0.1803
    TEST3               1      497.57812932     497.57812932     39.53   0.0015
    TEST4               1       26.76090269      26.76090269      2.13   0.2046

                                     T for H0:     Pr > |T|    Std Error of
    Parameter          Estimate    Parameter=0                   Estimate
    INTERCEPT      -95.55938507          -7.45       0.0007     12.82483177
    TEST1            0.17631107           2.66       0.0446      0.06616209
    TEST2           -0.22343929          -1.56       0.1803      0.14353957
    TEST3            1.74602280           6.29       0.0015      0.27769962
    TEST4            0.26865059           1.46       0.2046      0.18424404
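The strategy of dropping the non-significant tests and refitting can be sketched as follows. This is one possible follow-up run, not output from the original analysis: it repeats the JOB data so that it can be run on its own and keeps only TEST1 and TEST3 in the MODEL statement. The title text is an illustrative choice.

```sas
* JOB data from the multiple regression example;
DATA JOB;
   INPUT SUBJECT $ TEST1 TEST2 TEST3 TEST4 JOBSCORE;
   CARDS;
1 75 100 90 88 78
2 51 85 88 89 71
3 99 96 94 93 85
4 92 106 84 84 67
5 90 89 83 77 69
6 67 77 83 73 65
7 109 67 71 65 50
8 94 112 105 91 107
9 105 110 99 95 96
10 74 102 88 69 63
;
* Reduced model: TEST2 and TEST4 dropped;
PROC GLM DATA=JOB;
   MODEL JOBSCORE=TEST1 TEST3;
   TITLE 'Job Score Analysis - reduced model';
RUN;
QUIT;
```

When comparing the two runs, look at how much R-Square falls and whether TEST1 and TEST3 remain significant. Because the tests are related to one another, the estimates and p-values for the remaining variables will generally change once the others are removed.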


WORKBOOK Exercise 4.2 - Two-Way Analysis of Variance

When two factors are considered in defining the groups, two-way ANOVA may be appropriate. Data for a two-way analysis of variance (Zar, p 207) are plasma calcium concentrations from male and female subjects under no hormone and hormone treatment.

                Two Way Factorial ANOVA

    No Hormone Treatment      Hormone Treatment
    --------------------      -----------------
    Female      Male          Female      Male
     16.5       14.5           39.1       32.0
     18.4       11.0           26.2       23.8
     12.7       10.8           21.3       28.8
     14.0       14.3           35.8       25.0
     12.8       10.0           40.2       29.3

Create the needed SAS code to run this analysis. Hint: the data will be in the form (1=hormone, 0=no hormone, M=Male, F=Female):

    0 F 16.5
    0 F 18.4
    :
    1 M 29.3

The classes will be HORMONE and SEX, and the model will be:

    MODEL PLASMA = HORMONE SEX HORMONE*SEX;

Use PROC GLM to perform this analysis.


Output for Workbook Exercises

EXERCISE 4.1 RESULTS - Sorting

    **************************************************
    * PROC SORT EXAMPLE                              *
    **************************************************;
    DATA MYDATA;
    INPUT GROUP RECTIME;
    CARDS;
    1 3.1
    2 3.6
    2 4.2
    1 2.1
    1 2.8
    2 3.8
    1 1.8
    ;
    PROC SORT;
    BY RECTIME;
    PROC PRINT;
    Title 'Sorting Example';
    PROC SORT;
    BY DESCENDING RECTIME;
    PROC PRINT;
    RUN;

Output for Sort Example:

Note: First output contains data sorted by rectime, and second contains data sorted (descending) by rectime.

    Sorting Example

    OBS    GROUP    RECTIME
      1      1        1.8
      2      1        2.1
      3      1        2.8
      4      1        3.1
      5      2        3.6
      6      2        3.8
      7      2        4.2

    Sorting Example

    OBS    GROUP    RECTIME
      1      2        4.2
      2      2        3.8
      3      2        3.6
      4      1        3.1
      5      1        2.8
      6      1        2.1
      7      1        1.8


EXERCISE 4.2 RESULTS – Two Way ANOVA

Code to produce the Two-Way Analysis of Variance results:

    ****************************************
    * EXAMPLE TWO-WAY ANOVA                *
    * ATWOWAY2.SAS                         *
    ****************************************;
    options ps=60;
    data twoway;
    input group gender $ calcium;
    cards;
    0 F 16.5
    0 F 18.4
    0 F 12.7
    0 F 14.0
    0 F 12.8
    0 M 14.5
    0 M 11.0
    0 M 10.8
    0 M 14.3
    0 M 10.0
    1 F 39.1
    1 F 26.2
    1 F 21.3
    1 F 35.8
    1 F 40.2
    1 M 32.0
    1 M 23.8
    1 M 28.8
    1 M 25.0
    1 M 29.3
    ;
    PROC GLM;
    CLASSES GROUP GENDER;
    MODEL CALCIUM=GENDER GROUP GENDER*GROUP;
    TITLE 'Compare CALCIUM across GENDER and GROUP';
    run;


Results for Two-Way ANOVA:

Note: The printout below was produced with the first data line entered as "0 F16.5" (no space between F and 16.5). SAS therefore treated F16.5 as a third GENDER value and read only 19 observations, as the Class Level Information shows. With the data line entered correctly as "0 F 16.5" (as in the code above), GENDER has two levels (F and M), 20 observations are read, and the statistics will differ from those printed here.

    General Linear Models Procedure
    Class Level Information

    Class     Levels    Values
    GROUP          2    0 1
    GENDER         3    F F16.5 M

    Number of observations in data set = 19

    Dependent Variable: CALCIUM

    Source             DF    Sum of Squares   F Value   Pr > F
    Model               4     1912.39354386     19.54   0.0001
    Error              14      342.55066667
    Corrected Total    18     2254.94421053

    R-Square        C.V.    CALCIUM Mean
    0.848089    23.40229      21.1368421

    Source             DF         Type I SS   F Value   Pr > F
    GENDER              2      597.02046053     12.20   0.0009
    GROUP               1     1300.75803571     53.16   0.0001
    GROUP*GENDER        1       14.61504762      0.60   0.4525

    Source             DF       Type III SS   F Value   Pr > F
    GENDER              2      168.41028270      3.44   0.0609
    GROUP               1     1313.50019048     53.68   0.0001
    GROUP*GENDER        1       14.61504762      0.60   0.4525

