Creativity.sas: Explanation of code Goals of code:

advertisement
Creativity.sas: Explanation of code
Goals of code:
• Display the creativity data (histogram, boxplot, dotplot)
– As one group or for each treatment
• Calculate summary statistics for each group
• Randomization test
• T-test with equal variances
Saving a copy of results in a rtf file: ods rtf file=’creativity.rtf’;
This line is optional. It tells SAS to save a copy of the output in the RTF format file called
creativity.rtf. This file is easily readable by WORD, and makes it easy to copy selected results to
another document.
Reading the data set: data creativity;
The data creativity; statement tells SAS that you want to create a SAS data set called creativity. All
the statements from data ; to the run; are part of the data step. The infile command names the file
in the working directory that contains the data. This file has a header line with variable names, so
the real data starts on the second line. The firstobs=2 option tells SAS to start reading data from
the second line. The input line tells SAS how many values to read on each line, what names to give
the variables, and how to read the values. This is described in the Introduction to SAS document.
Treatment is a character string, so it is followed by a ; scoreisnumericsoitisnotf ollowedbya.
I generally check the line in the log that says “The data set WORK.CREATIVITY has 47 observations and 2 variables”. I check that SAS read the number of observations I expected and produced
the number of variables I expected.
Descriptive statistics and histograms: proc univariate;
PROC UNIVARIATE calculates and reports many different statistics. The VAR statement names
the variable to be summarized. You can name multiple variables if you want more than one summary. The optional HISTOGRAM statement requests a histogram in addition to the numeric
summaries. The optional title statement provides a title (in quotes).
SAS reports a lot of numbers, including many that we won’t talk about. You will probably be
interested in:
In the box of results labeled Moments:
N: number of observations
Mean: sample average
Std Deviation: sample standard deviation
In the box of results labeled Basic Statistical Measures:
1
Mean: a repeat of the sample average
Median: sample median Std Deviation: a repeat of the sample standard deviation
Interquartile Range: the IQR
Coeff Variation: the coefficient of variation
You may be interested in:
In the Quantiles box: 100% Max, 75% Q3, 25% Q1 and 0% Min:
the max, 75’th percentile, 25’th percentile, and minimum.
Some of the other numbers will be used later. The rest have specialized uses.
Descriptive statistics and histograms for groups: proc univariate; class treatment;
Adding a class statement to the proc univariate code tells SAS to do everything twice, once for
each unique value of the treatment. Same format for the numeric output, but the top of the page
is labeled treatment = to indicate which treatment is being described. The histogram statement
with a class statement produces nice side by side histograms.
Note: Remember that the grouping variable goes in the class statement and the result variable (to
be described) goes in the var statement. If you reverse these, e.g. class score; var treatment;, you
get mush or worse.
Boxplots: proc boxplot;
Proc boxplot produces side-by-side boxplots. One quirk is that it requires all the intrinsic observations to be together and all the extrinsic observations to be together. The easiest way to ensure
this is to sort the data by treatment first. The proc sort; by treatment; run; does this. The
plot statement specifies the response variable and the grouping variable. The response is first, then
the group, separated by a *. Again, if you reverse these, the result is either mush or a lot of errors.
Note: If you expected two boxes (for two groups) and got many, the data set is probably not sorted.
Permutation test: proc npar1way scores=data;
The procedure name has the digit 1 in it. It is shorthand for NonPARametric ONE WAY. We’ll talk
later about nonparametric and one-way. The scores=data goes before the semi-colon. It is part of
the proc statement and tells SAS to permute the data. In a few weeks we’ll use proc npar1way for
non-parametric tests using ranks.
Just as with proc univariate, the class statement specifies the grouping variable and the var statement specifies the response. The exact statement specifies that we want to enumerate all possible
permutations of treatments to observations. For large data sets, this can be very time consuming
(and not necessary). The /maxtime=10 option tells SAS to spend no more than 10 seconds on the
problem. This is a failsafe in case you ask for a computation that will take months. You will see a
note in the log that this is one of those “very long” computations. SAS will stop after 10 seconds
and tell you that it’s stopping. If you see this, change to using a random sample of permutations
(next proc step).
2
Including the exact statement is important. If you omit it, SAS uses something called a normal
approximation, which in this case is no different from the usual t-test.
Randomization test: proc npar1way scores=data; exact /mc n=10000;
If the problem is very large, the p-value can be estimated from a sample of permutations of group
labels to observations. It isn’t necessary to enumerate them all. The /mc option requests a MonteCarlo analysis, i.e. a random sample of possible permutations. This is usually called a randomization
test. The n=10000 requests 10000 samples.
The most difficult part of this analysis is finding the number you really want. That is the two-sided
p-value from the Exact test (or Monte Carlo approximation). That number is the Two-sided ...,
Estimate number in the box labeled Monte Carlo Estimates for the Exact Test.
Two sample t-test: proc ttest;
Proc ttest requests a two-sample t-test. Again, the class statement names the groups and the var
statement names the response.
We will talk about many of the numbers reported here over the next week. For now, the p-value
that is usually reported in scientific papers is found in the third box of results: Look for the Pooled
Method, Equal variances. The p-value is found in the Pr >| t | column. Here it is 0.0060. That is
a two-sided p-value.
Closing the rtf output: ods rtf close;
This tells SAS to stop saving results in the RTF file. SAS will then open up WORD so you can
look at the results.
3
Download