Laboratory 1:
Introduction to SAS
This session and the next are designed to give you hands-on instruction so that you
develop a basic understanding of data handling in SAS (Statistical Analysis System),
including reading data to create SAS dataset(s) and running some basic procedures
using the latest version of SAS, version 8.2.
Datasets for the laboratory sessions are available on the web. Please download them
onto a floppy disk and bring it (along with a blank floppy disk) to each laboratory session.
Make copies of the data files on another floppy disk in case the first floppy disk gets
corrupted or you have no access to the internet. Today you will work on the
4 case-control978.* files. The others are for future use.
Description of the Data
The data come from a case-control study of esophageal cancer among men. Each row
(record) contains data for one subject. There are 978 rows containing information on 200
esophageal cancer cases and 778 population controls. There are 7 columns containing
data on 7 variables. The definitions of these 7 variables, with a description of the variable
codes, are given below:
----------------------------------------------------------------------
Column no   Variable Name   Range/Description
----------------------------------------------------------------------
1           Record          Record number, 1-978
2           ID              1-1640 = control (n = 778)
                            6000-6206 = case (n = 200)
3           Age             Age in years (21-91)
4           Tobacco         Cigarette smoking
                            0 = never smoked
                            1 = 1- 4 grams/day
                            2 = 5- 9 grams/day
                            3 = 10-14 grams/day
                            4 = 15-19 grams/day
                            5 = 20-29 grams/day
                            6 = 30-39 grams/day
                            7 = 40+ grams/day
                            8 = did not answer
                            9 = unknown
5           Alcohol         Daily consumption in grams
6           Status          Case-control status
                            0 = controls
                            1 = cases
7           Sex             1 = male
----------------------------------------------------------------------
Applied Epidemiologic Analysis - P8400
Lab1: Introduction to SAS
Page 1
Henian Chen
Fall 2002
The data set looks like this:
record    id   age   tobacco   alcohol   status   sex
     1     1    42         0       139        0     1
     2     2    45         2        66        0     1
     3     3    35         0        24        0     1
     .     .     .         .         .        .     .
     .     .     .         .         .        .     .
     .     .     .         .         .        .     .
   977  6205    71         5        91        1     1
   978  6206    40         2       141        1     1
We saved this data as:
1. case-control978.dat:  “raw” data file (text file)
2. case-control978.txt:  tab-delimited, write variable names to spreadsheet
3. case-control978.wk3:  Lotus 1-2-3 Release 3 spreadsheet
4. case-control978.dbf:  dBASE III file
Five Windows of SAS
When SAS is started, there are five main windows open, namely the Editor, Log,
Output, Results, and Explorer. The Editor, Log, and Explorer windows are visible. The
Results window is hidden behind the Explorer window and the Output window is hidden
behind the program Editor and Log windows. You can also use the function keys to
switch windows. F7 brings you to the Output window, F5 the Editor, and F6 the Log.
1. Editor: The Editor window is for typing in, editing, and running programs. Some
aspects of the Editor window will be familiar as standard features of Windows
applications. The File menu allows programs to be read from a file, saved to a file, or
printed. The File menu also contains the command to exit from SAS. The Edit menu
contains the usual options for cutting, copying, and pasting text and those for finding
and replacing text.
The Run menu is specific to the Editor window and will not be available if another
window is the active window. The program currently in the Editor window can be run by
choosing the Submit option from the Run menu. It is possible to run part of the program
in the Editor window by selecting the text and then choosing Submit from the Run
menu.
When a SAS program is run, two types of output are generated: the log and the
procedure output. These are displayed in the Log and Output windows.
2. Log: The log window shows the SAS statements that have been submitted together
with information about the execution of the program, including warning and error
messages. The contents of the Log window cannot be edited. The Clear all option in
the Edit menu will empty the window (the same if you use Ctrl + e).
3. Output: The Output window shows the printed results of any procedures. It is here
that the results of any statistical analyses are shown. The contents of the Output
window cannot be edited. The Clear all option in the Edit menu will empty the window
(the same if you use Ctrl + e).
4. Results: The Results window is a graphical index to the Output window useful for
navigating around large amounts of procedure output. Right-clicking on a procedure, or
section of output, allows that portion of the output to be viewed, printed, deleted, or
saved to file.
5. Explorer: The Explorer window allows the contents of SAS data sets and libraries to
be examined interactively, by double-clicking on them.
Managing the windows can be done with the normal windows controls, including the
Window menu. There is also a row of buttons and tabs at the bottom of the screen that
can be used to select a window.
SAS Language
In SAS there are almost no point and click commands, so it is necessary to learn to
write in code. SAS uses a color-coded system. Most SAS statements begin with a
keyword that identifies the type of statement. When SAS recognizes keywords as they are
typed, it changes their color to blue. If a word remains red, this indicates a problem.
The word may have been mistyped or is invalid for some other reason. This color-coded
system is very helpful in understanding the syntax and in finding your errors.
A typical SAS program consists of data steps and procedure (proc) steps. A data step is
used to prepare data for analysis. It creates a SAS data set and may reorganize the
data and modify it in the process. A proc step is used to perform a particular type of
analysis, or statistical test, on the data in a SAS data set.
Data and proc steps begin with a data or proc statement, respectively, and end at the
next data, proc or run statement. When a data step has the data included within it, the
step ends after the data. Understanding where steps begin and end is very important
because SAS programs are executed in whole steps. If an incomplete step is submitted,
it will not be executed. The statements that were submitted will be listed in the log, but
SAS will appear to have stopped at that point without explanation. In fact, SAS will
simply be waiting for the step to be completed before running it. For this reason it is
good practice to explicitly mark the end of each step by inserting a run statement and
especially important to include one as the last statement in the program. The Editor
offers several visual indicators of the beginning and end of steps. The data, proc, and
run keywords are color-coded in Navy blue, rather than the standard blue used for other
keywords.
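As a minimal sketch of step boundaries, the program below contains one data step and one proc step, each closed with its own run statement. The data set name demo and its values are made up for illustration.

```sas
/* data step: begins at the data statement */
data demo;                /* demo is a hypothetical data set name */
  input x y;
  cards;
1 2
3 4
;
run;                      /* explicit end of the data step */

/* proc step: begins at the proc statement */
proc print data=demo;
run;                      /* without this run, SAS would wait for more statements */
```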
Global statements can be placed anywhere. If they are placed within a step, they will
apply to that step and all subsequent steps until reset. A simple example of a global
statement is the title statement, which defines a title for procedure output. The title is
then used until changed or reset.
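The behavior of a global statement can be sketched with the title statement. In the example below (the data set demo and the title texts are made up), the first title stays in effect for two proc steps until a new title statement replaces it.

```sas
data demo;                   /* hypothetical data set used only for this sketch */
  input x;
  cards;
1
2
;
run;

title 'First listing';       /* global statement: stays in effect */
proc print data=demo;
run;

proc print data=demo;        /* still printed under 'First listing' */
run;

title 'Second listing';      /* replaces the earlier title */
proc print data=demo;
run;

title;                       /* a title statement with no text cancels all titles */
```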
Statements can extend over more than one line and there may be more than one
statement per line. However, keep to one statement per line, as far as possible, to avoid
errors.
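To illustrate, the sketch below writes one input statement across three lines and then, for contrast, puts two statements on a single line. Both are legal; the data set name demo is made up.

```sas
/* one statement written across several lines; it ends only at the semicolon */
data demo;
  input time
        sex $
        publication;
  cards;
6 f 3
8 m 17
;
run;

/* legal, but harder to read: two statements on one line */
proc print data=demo; run;
```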
Names must be given to variables and data sets when writing a SAS program. These can
contain letters, numbers, and underscore characters, and can be up to 32 characters in
length but cannot begin with a number. Variable names can be in upper or lower case,
or a mixture, but changes in case are ignored.
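A short sketch of these naming rules follows. The names faculty_2002, Time_Since_PhD, and pubs are hypothetical and chosen only to show underscores, digits, and mixed case.

```sas
data faculty_2002;             /* letters, digits, and an underscore; starts with a letter */
  input Time_Since_PhD pubs;   /* mixed case is allowed */
  cards;
6 3
8 17
;
run;

/* case is ignored: these spellings refer to the same data set and variables */
proc print data=FACULTY_2002;
  var time_since_phd PUBS;
run;

/* a name such as 2002_faculty would be rejected because it begins with a number */
```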
Don’t forget, all SAS statements must end with a semicolon. The most common mistake
for new users is to omit the semicolon and the effect is to combine two statements into
one.
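The sketch below shows what this mistake looks like. The faulty version is kept inside a comment so that the program still runs; the data set name demo is made up.

```sas
/* Incorrect: the semicolon after "data demo" is missing, so SAS reads
   the first two lines as ONE statement:
       data demo
       input x;
*/

/* Correct: every statement ends with its own semicolon */
data demo;
  input x;
  cards;
1
2
;
run;
```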
Data Step
Before data can be analyzed in SAS, they need to be read into a SAS dataset. Creating
a SAS data set for subsequent analysis is the primary function of the data step. A data
step is also used to manipulate or reorganize the data. The data can be “raw” data or
come from a previously created SAS data set. In SAS, “raw” data means data that you
type directly into a SAS program or that are stored in a text (ASCII) file. Such files include
only the printable characters plus tabs, spaces, and end-of-line characters.
Instructions follow for creating datasets by input cards, reading a raw data file, using a
previously created SAS dataset, and importing a dataset from an external data source.
1. Using “cards” statement to input your data to SAS and create a SAS dataset
We have a small set of data about seniority, publication, and sex on 6 faculty members.
Instead of storing this data as a “raw” data file it would be easier, in this case, to include
the data directly in the SAS program.
----------------------------------------------------------------------
Time since Ph.D.    Sex    No. of Publications
----------------------------------------------------------------------
6                   F      3
8                   M      17
9                   F      11
6                   M      6
10                  M      48
5                   F      30
----------------------------------------------------------------------
Write a SAS program in the Editor window to create a SAS dataset (named “exercise”)
by entering the data above.
data exercise;
input time sex $ publication;
cards;
6 f 3
8 m 17
9 f 11
6 m 6
10 m 48
5 f 30
;
run;
data statement gives a SAS dataset name: exercise
input statement in the example specifies three variables: time, sex, and publication,
and the dollar sign ($) after sex indicates that it is a character variable.
SAS has only two types of variables: numeric and character.
cards statement must follow all other statements in the DATA step. The data lines are
followed by a line with just a semicolon. This line is called a null statement and identifies
the end of the data.
Please run this program, and check the Log window. If the program ran successfully,
the Log window will show the note “The data set WORK.EXERCISE has 6 observations
and 3 variables.”
To view the SAS dataset (exercise), use the PRINT procedure and view the listing in the
Output window. Please run proc print as follows:
proc print data=exercise;
run;
Now you have a SAS dataset called exercise in the temporary “WORK” library. This is a
temporary SAS dataset and will be lost if you quit SAS without saving it as a permanent
SAS dataset.
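The WORK library can also be named explicitly. As a small sketch, the two proc print steps below refer to the same temporary data set, once by its one-level name and once by a two-level name. The data set demo is made up so that the example runs on its own without overwriting exercise.

```sas
data demo;                     /* hypothetical temporary data set */
  input time sex $ publication;
  cards;
6 f 3
8 m 17
;
run;

proc print data=demo;          /* one-level name: SAS uses the WORK library by default */
run;

proc print data=work.demo;     /* two-level name: library.dataset, same data set */
run;
```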
Using a “libname” statement to save your permanent SAS dataset.
libname mine 'a:';
data mine.myexercise;
set exercise;
run;
libname statement specifies the library name of your choice (here it is mine) linked to
the directory (here it is a: your floppy disk).
data statement gives the name you have chosen for this permanent SAS data set
(myexercise).
set statement gives the name of the temporary data set.
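Once saved, the permanent data set can be used in a later SAS session through the same library reference. The sketch below assumes that mine.myexercise was saved to the floppy disk as shown above.

```sas
libname mine 'a:';                  /* point the library name at the floppy disk again */

proc print data=mine.myexercise;    /* two-level name: library.dataset */
run;
```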
For the purpose of this course we will avoid requiring you to permanently save any file
as you don’t have access to the hard drive and your floppy disk may have insufficient
disk-space for saving a big permanent SAS dataset. On your home computer, though,
you are encouraged to use libnames and permanently save files.
Please save this program on your floppy disk as ‘a:exercise.sas’ using the ‘save as...’
option in the ‘file’ menu.
The double trailing at-sign (@@): In program exercise.sas, each data line holds just one
record. The double trailing at-sign lets you put more than one record on a single data
line. For example, we can change exercise.sas to:
data exercise;
input time sex $ publication @@;
cards;
6 f 3 8 m 17 9 f 11 6 m 6 10 m 48 5 f 30
;
run;
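To see what @@ changes, the sketch below reads the same data line twice, once without and once with the double trailing at-sign. The data set names no_at and with_at are made up for the comparison.

```sas
/* without @@: SAS moves to a new data line after each record,
   so only the first three values are read (1 observation) */
data no_at;
  input time sex $ publication;
  cards;
6 f 3 8 m 17 9 f 11 6 m 6 10 m 48 5 f 30
;
run;

/* with @@: SAS holds the line and keeps reading from it (6 observations) */
data with_at;
  input time sex $ publication @@;
  cards;
6 f 3 8 m 17 9 f 11 6 m 6 10 m 48 5 f 30
;
run;

proc print data=no_at;
run;

proc print data=with_at;
run;
```

Comparing the two notes in the Log window shows the difference in the number of observations.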
2. Using “infile” statement to read a “raw” data file and create a SAS dataset
Write a SAS program in the Editor window to create a SAS dataset (named
“case_control978”) by using an existing text file (case-control978.dat).
/*
*****************************************************************************
* Name:    Student's name
* Date:    9/26/02
* Program: a: case-control978.sas
* Purpose: reading raw data to create SAS dataset using "infile" statement
*****************************************************************************
*/
title1 '****************************************************************';
title2 'Laboratory 1: Case-Control Study of Esophageal Cancer Among Men';
title3 '****************************************************************';
data case_control978; /* create a SAS dataset called 'case_control978' */
infile 'a:case-control978.dat'; /* indicate the location of the raw data */
input record id age tobacco alcohol status sex;
run;
proc print; /* check the data structure */
run;
Please save this program on your floppy disk as ‘a: case-control978.sas’.
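Printing all 978 records produces a long listing. As a hedged sketch, assuming the data step above has already been run in the same SAS session, the obs= data set option limits the listing to the first 10 records:

```sas
proc print data=case_control978 (obs=10);   /* list only the first 10 records */
run;
```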
We started with a comment statement in this program. Comment statements are
global statements in the sense that they can occur anywhere. There are two forms of
comment statement. The first form begins with an asterisk and ends with a semicolon.
The second form begins with /* and ends with */.
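Both comment forms are sketched below; the data set name demo is made up.

```sas
* first form: starts with an asterisk and ends with a semicolon;

/* second form: it can span
   several lines and may contain semicolons; */

data demo;       /* the second form can also follow a statement on the same line */
  input x;
  cards;
1
;
run;
```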
The title statement is a global statement and provides a title that will appear on each
page of printed output. The text of the title must be enclosed in quotes. Multiple lines of
titles can be specified with the title2 statement for the second line, title3 for the third
line, and so on up to ten.
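The sketch below shows how numbered title lines interact. The data set demo and the title texts are made up. Resetting title2 replaces line 2 and cancels the higher-numbered title3, while title1 is kept.

```sas
data demo;                   /* hypothetical data set used only for this sketch */
  input x;
  cards;
1
2
;
run;

title1 'Laboratory 1';
title2 'First listing';
title3 'Printed with three title lines';
proc print data=demo;
run;

title2 'Second listing';     /* replaces line 2 and cancels title3; title1 is kept */
proc print data=demo;
run;
```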
infile statement specifies the file where the raw data are stored. infile statement
precedes the input statement.
3. Using a previously created SAS dataset
To read data from a SAS data set, rather than from a raw data file, the set statement is
used in place of the infile and input statements.
/* Reading Data from an existing SAS Data Set */
data new1; /* create a new SAS dataset called new1 */
set 'a:myexercise'; /* read in the data from a:myexercise */
female=0; /* create a new variable called female, let female=0 */
if sex='f' then female=1; /* let female=1 if sex is 'f' */
run;
proc print data=new1; /* check the new1 structure */
run;
data new2; /* create a new SAS dataset called new2 */
set new1; /* read in the data from new1 */
drop sex; /* delete variable sex */
run;
proc print data=new2; /* check the new2 structure */
run;
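The two data steps above can also be combined into one. The sketch below is an alternative, not a replacement for the exercise. It uses made-up data set names (demo, new3) so that it runs on its own.

```sas
data demo;                     /* hypothetical copy of two rows of the faculty data */
  input time sex $ publication;
  cards;
6 f 3
8 m 17
;
run;

data new3;                     /* hypothetical data set name */
  set demo;                    /* read in the data from demo */
  if sex='f' then female=1;    /* let female=1 if sex is 'f' */
  else female=0;               /* otherwise let female=0 */
  drop sex;                    /* sex can still be used in this step; it is only left out of new3 */
run;

proc print data=new3;          /* check the new3 structure */
run;
```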
IMPORT Procedure
In SAS, the files produced by database programs, spreadsheets, and word processors
are not normally “raw” data. SAS cannot read those data files by using the infile and input
statements. We have to use the IMPORT procedure to read data from an external data
source and write it to a SAS dataset. External data sources can include DBMS tables,
PC files, spreadsheets, and delimited external files (in which columns of data values are
separated by a delimiter such as a blank or comma).
/* import delimited file (tab-delimited values) to SAS */
proc import datafile='a:case-control978.txt' out=delimited dbms=tab replace;
getnames=yes;
run;
proc print data=delimited;
run;
/* import Lotus file to SAS */
proc import datafile='a:case-control978.wk3' out=lotus dbms=wk3 replace;
getnames=yes;
run;
proc print data=lotus;
run;
/* import dBASE file to SAS */
proc import datafile='a:case-control978.dbf' out=dbase dbms=dbf replace;
run;
proc print data=dbase;
run;
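A comma-delimited file can be imported in the same way. The sketch below assumes a hypothetical file a:case-control978.csv (not one of the four files provided for this lab) with variable names in the first row. The output data set name commas is also made up.

```sas
/* import a comma-delimited file to SAS (hypothetical file name) */
proc import datafile='a:case-control978.csv' out=commas dbms=csv replace;
  getnames=yes;              /* take variable names from the first row */
run;

proc print data=commas;
run;
```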
Today we did some typical SAS jobs that you will need to use throughout the semester
for homework exercises. You should practice as much as possible with SAS this week
in order to get yourself comfortable with its use.
We will practice data management and Proc Steps in next week’s laboratory.
Please bring back the dataset ‘case-control978.dat’ and the SAS program
‘case-control978.sas’.