MORE DETAILS ON SAS

In what follows, SAS keywords are typed in capital letters. You will not have to type
them in capitals, but you must spell them exactly as SAS does.
Variable names and dataset names are typed in lowercase bold letters. I have used
generic names here like x or y. You would replace these generic names with the names
you have decided on in your program. These names must start with a letter or underscore
_, and must contain only letters, numbers or underscores. Of course, you will not be
typing them in bold. SAS has some special variable names that it uses, which begin and
end with an underscore, so it is safer not to start your variable names with an underscore.
Optional information is typed in lowercase italics. You do not have to include this
information unless you want to customize SAS to perform the analysis in the way you
wish.
THE DATA PARAGRAPH
Data paragraphs create data sets, manipulate them to create new variables or to select
subsets of the data, and can output permanent data sets in SAS format. Data paragraphs
always begin with the DATA statement.
DATA mydata;
Instructs SAS to begin creating a working file named mydata. You do not have to give
names to working files, but it helps in complicated programs when many datasets might
be accessed.
To assign names to the variables that are being read in, you need an INPUT statement.
INPUT a1 $ a2 $ x1 x2;
would describe a situation where there are four variables recorded on each line of the data
set. The first two variables, a1 and a2, are character variables with non-numeric entries,
so their variable name is followed by a $. The last two variables, x1 and x2, are numeric.
Whenever you wish to refer to these variables later in the program, you must spell the
names exactly as they were spelled in the input statement.
In this simplest form of the INPUT statement, there is no formatting information
to tell SAS where on a line to find the variables. Hence, SAS will default to expecting
the variables to each be separated by one or more spaces.
After the INPUT statement, you can add lines which tell SAS how to create new
variables from the original variables listed in the INPUT statement. For example,
suppose I wanted to create a new variable y = x1/x2. After the INPUT statement, I could
put
y = x1/x2;
In general, these kinds of new variable lines have the form
Newvariablename = mathematical expression in terms of existing variables;
Your mathematical expression can use the usual arithmetic operators:
+ addition
- subtraction
* multiplication
** power (as in 10**y, which is 10 raised to the power y)
/ division
There are also a number of built in functions that you might find useful, such as
y=SQRT(x1); which computes the square root of x1.
y=LOG(x1); which computes the natural logarithm of x1.
y=LOG10(x1); which computes the logarithm base 10 of x1.
y=ABS(x1); which returns the absolute value of x1.
There are many dozens of functions listed in SAS Help.
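As a minimal sketch of the operators and functions above, the following paragraph computes several new variables from x1 and x2. The dataset name calc, the new variable names, and the data values are made up for illustration.

```sas
*sketch only: dataset name, variable names and data values are made up;
DATA calc;
INPUT x1 x2;
total = x1 + x2;
ratio = x1/x2;
square = x1**2;
root = SQRT(x1);
lg = LOG10(x1);
CARDS;
100 4
25 5
;
RUN;
PROC PRINT DATA=calc;
RUN;
```

For the first data line, total is 104, ratio is 25, square is 10000, root is 10 and lg is 2.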
You can use IF statements to perform different actions depending on some conditions.
For example, I might want to create a new variable called big which is 0 when x1 is less
than 100 and 1 when x1 is greater than or equal to 100. Then I could use two lines:
big = 0;
IF (x1 ge 100) THEN big=1;
SAS uses the abbreviations = or EQ for ‘equal’, ~= or NE for ‘not equal’, < or LT for
‘less than’, > or GT for ‘greater than’, <= or LE for ‘less than or equal to’, and >= or GE
for ‘greater than or equal to’.
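The symbol and letter forms of each comparison can be used interchangeably. A small sketch, assuming a made-up dataset named sizes with a single variable x1, sets one flag using a symbol and another using letters:

```sas
*sketch only: the dataset sizes and its values are made up;
DATA sizes;
INPUT x1;
big = 0;
IF (x1 >= 100) THEN big = 1;
small = 0;
IF (x1 LT 10) THEN small = 1;
CARDS;
5
50
100
;
RUN;
```

Here the line with x1 equal to 5 gets small=1, the line with x1 equal to 100 gets big=1, and the line with x1 equal to 50 gets 0 for both flags.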
Sometimes you only want to use part of a dataset. For example, if I were analyzing data
on second and third graders, I might just want to use the third graders for my analysis.
You can adapt the IF statement for this. The statement
IF (condition);
will keep any data line where the condition is true, and discard those
where it is false. For example, if I only wanted to keep data lines where x1 is greater than
50, I could write
IF (x1 gt 50);
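The second and third grader situation described above might be sketched as follows. The dataset name third, the variable names, and the data values are all made up for illustration.

```sas
*sketch only: dataset name, variable names and data values are made up;
DATA third;
INPUT student $ grade score;
IF (grade = 3);
CARDS;
Ann 2 30
Bob 3 41
Cal 3 37
;
RUN;
PROC PRINT DATA=third;
RUN;
```

The dataset third keeps only the two lines with grade equal to 3 (Bob and Cal). The line for Ann is discarded.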
The actual data comes last in the paragraph, following a CARDS statement.
DATALINES is an equivalent keyword that some people use instead of CARDS. End the data
with a semicolon on a line by itself, then a RUN; statement to actually execute the paragraph.
EXAMPLE
*this program reads in the student names and first three quiz scores;
*it checks to see if the lowest quiz score is less than 5;
*if the lowest quiz score is less than 5, it creates a warning flag;
DATA grades;
INPUT student $ quiz1 quiz2 quiz3;
worst=MIN(quiz1,quiz2,quiz3);
warning=0;
IF (worst lt 5) THEN warning=1;
CARDS;
Sam 8 10 9
Joe 10 4 8
Abby 8 9 7
;
RUN;
DATA trouble;
SET grades;
IF (warning = 1);
RUN;
PROC PRINT DATA=trouble;
RUN;
The first DATA paragraph created the original data set grades, used the MIN function to
find the minimum of the quiz scores, and created a variable called warning that is 1
when the worst quiz score is too low.
The second DATA paragraph creates a new data set named trouble. To begin, it uses the
SET statement to import the original dataset grades. Since grades was created in SAS,
SAS remembers all the variable names. There is no need for another INPUT statement.
The dataset trouble will only have one line in it – the line for the student (Joe) who has
warning=1.
Finally, PROC PRINT will print out the information in the dataset trouble, so I can see
which students have a low quiz score.
SOME USEFUL PROCS
PROC PRINT data=datasetname;
Prints out the information in the dataset specified in the optional data=datasetname field.
If you leave out data=datasetname, then it will print the contents of the most recently
created dataset.
VAR -- If you do not want to print out every variable in the dataset, list the ones
you do want to print in the VAR statement.
Example: PROC PRINT data=grades;
VAR student worst quiz1 quiz2 quiz3;
RUN;
PROC SORT data=datasetname;
BY variablename;
Sorts the lines in the dataset in ascending order of the variable specified in the BY
statement. If variablename is a character variable, the order is alphabetic.
Example: PROC SORT data=grades;
BY student;
RUN;
PROC PRINT data=grades;
VAR student worst;
RUN;
Would give output like
Abby 7
Joe 4
Sam 8
PROC MEANS data=datasetname;
VAR variablenames (separated by spaces);
BY variablenames;
Produces simple means, standard deviations, sample sizes, minimums and
maximums for the variables listed in the VAR statement. If there is more than one
variable, separate the names by spaces.
Using the BY statement is optional, but you can use it if you want the means,
etc., computed separately for each value of the variable named in the BY
statement. To use a BY statement, you must first have sorted the data by the value of
that same variable, using PROC SORT.
Example: Suppose your data set had vocabulary scores for 2nd graders and 3rd graders.
You want to get the means and standard deviations separately for the 2nd and 3rd graders.
DATA vocabulary;
INPUT student $ grade score;
CARDS;
Tim 2 28
Jack 3 42
Ellen 2 31
Shayna 2 18
Antwaan 3 39
;
RUN;
PROC SORT data=vocabulary;
BY grade;
RUN;
PROC MEANS data=vocabulary;
VAR score;
BY grade;
RUN;
PROC UNIVARIATE data=datasetname;
VAR variablenames;
BY variablename;
HISTOGRAM variablenames;
PROC UNIVARIATE is very similar to PROC MEANS but produces far more detailed output,
with medians, quartiles, etc. Use of the VAR and BY statements is just like for PROC
MEANS.
HISTOGRAM is optional. If you just list HISTOGRAM with some variablenames, SAS
will choose its own intervals and label the bars using the midpoints of the intervals. I like
to control the choice of intervals, and my preference is to label the left edges of the bars.
You can customize the choice of intervals the way you like by using the ENDPOINTS or
MIDPOINTS options. Options come after the main part of the HISTOGRAM statement,
following a / mark. For instance, suppose in the vocabulary score data I wanted
histograms that used a class width of 5, starting at 15. All the data must lie between the
first and last endpoints. Then I could say:
PROC UNIVARIATE data=vocabulary;
VAR score;
HISTOGRAM score / endpoints=(15 to 45 by 5);