Ingen bildrubrik

advertisement
How to Navigate the Guide
To navigate this SAS Guide, use the PageDown and PageUp
buttons on the keyboard.
A copy of this PowerPoint document can be downloaded
from
http://www.biostat.ku.dk/~lts/varians_regression/sasguide.ppt
1
Preface
This is The Beginners’ Guide To SAS. The document was
originally written by Anna Johansson, MEP, Stockholm.
It has been lightly edited by Peter Dalgaard and Lene Theil
Skovgaard for the Ph.D. course on SAS at the Faculty of
Health Sciences, University of Copenhagen, May 2002,
and later by LTS for the Ph.D. Course in Analysis of
Variance and Regression.
2
Introduction
What is SAS?
SAS is a software package for managing large amounts of
data and performing statistical analyses.
It was created in the early 1960s by the Statistical Department
at North Carolina State University. Today SAS is developed
and marketed by SAS Institute Inc. with head office in
Cary, North Carolina, U.S.A.
3
Introduction (cont.)
SAS in Denmark
The Danish subdivision of SAS Institute provides consulting
and a wide range of courses. It is located in Copenhagen.
SAS Institute A/S
Købmagergade 7-9
1150 Kbh. K
Tel: 70 28 28 70
Fax: 70 28 29 91
Email: info@sdk.sas.com
4
Introduction (cont.)
The SAS System
The SAS System is mainly used for
- Data Management (about 80% of all users)
- Statistical Analysis (about 20% of all users)
The power of SAS lies in its ability to manage large data sets. It is fast and
has many 5statistical and non-statistical features.
The disadvantage of SAS is its steep learning curve. It takes quite a bit of
an effort to get started. User-friendly interfaces do exist, though.
5
Introduction (cont.)
Start af SAS på kursussalen:
-
Flyt på musen (eller tænd maskinen)
Login er kursusxx
Password skifter
Vælg START, efterfulgt af STATISTIK og SAS 8.2
6
Introduction (cont.)
Getting Started
A very good start is to enter the SAS Online Training.
Choose in the menu Help + Getting Started with the SAS
Software, then click on the “book”.
7
Introduction (cont.)
SAS Files
If your data is not yet in a SAS data set, you access the raw
data by creating a SAS data set from it.
Once you have made the SAS data set, you use SAS
programs to analyse, manage and/or present the data.
SAS data sets can be permanent or temporary. A special
library called WORK is created on start-up and deleted on
exit.
8
Introduction (cont.)
SAS Programming
SAS programming works in two steps:
Data Step
1. reads data from file
2. makes transformations and adds new variables
3. creates SAS Data Set
Proc Step
4. uses the SAS Data Set
5. produces the information we want, such as tables, statistics,
graphs, web pages
9
Introduction (cont.)
Data and Proc Steps
Example of a SAS program:
data work.main;
set work.original;
age=1997-birthyr;
Data Step
bmi=weight/(height*height);
run;
proc print data=work.main;
var id age bmi;
run;
Proc Steps
proc means data=work.main;
var age bmi;
run;
10
Introduction (cont.)
SAS Modules
The SAS system is made up of several modules, each used for
different purposes.
This Guide deals only with the SAS BASE
and the GRAPH modules, giving knowledge on basic data
management and simple statistical analyses.
Other modules are SAS/Stat (statistical analyses),
SAS/Access (data base applications), SAS/Graph,
SAS/Assist (menu-driven info system), SAS/FSP (data
entry and retrieval), SAS/Connect (remote submit), etc.
11
Introduction (cont.)
SAS at Biostat Dept.
We primarily use SAS on a Unix server
whereas these notes assume that the programs are run locally
on a PC
The basic programming is the same regardless of what
platform you use. This is one of the big advantages of SAS.
We do tend to prefer running SAS non-interactively though.
12
The SAS Environment
Windows
The main feature of SAS is its division of the main window
into two halves. The left part is a navigator of SAS libraries
and Results (from the Output window).
The right part is divided into three separate windows:
- Program window or Enhanced Editor
- Log window
- Output window
13
The SAS Environment (cont.)
Windows
The log and output windows are always opened by default
when you start SAS (although they may be hidden behind
each other).
The program window and the Enhanced Editor are two
different windows but they are used for the same purpose,
i.e. writing code and executing it. One of them will open by
default.
Other windows are also available and are opened on request
(use View), for instance the Graphics window.
14
The SAS Environment (cont.)
Windows
(The program window is a reminiscent of the older SAS
version 6. The Enhanced Editor is a new feature of version
8, and is more user-friendly, since it colours the code and
works more like an ordinary text editor.)
15
The SAS Environment (cont.)
Windows
To check which windows are opened, choose Window in the
menu. At the bottom there is a list of opened windows.
The active window is indicated by a . A star * after the
window name indicates that the file has not been saved
since its latest alteration.
If you are missing any of the windows (Enhanced Editor,
Log, Output), you can open it by choosing in the menu
View + window-name
16
The SAS Environment (cont.)
Windows
You switch between the windows by choosing
Window + ENHANCED EDITOR
Window + OUTPUT
Window + LOG
in the menu.
17
The SAS Environment (cont.)
Windows
The window location on the screen can be changed by
choosing
Window + Tile
Window + Cascade
or by pulling the lower right corner of the window with the
mouse.
When you exit SAS, the window setting will be kept for the
next session (unless someone else ...).
18
The SAS Environment (cont.)
Enhanced Editor / Program Window
In the Enhanced Editor you write the SAS programs.
The programs tell SAS to produce the data sets, tables,
statistics, etc.
A program consists of data steps and proc steps.
A SAS program is executed (submitted) by choosing Run +
Submit in the menu (or by clicking on the “Running Man”
icon, fourth from the right in the menu).
19
The SAS Environment (cont.)
Output and Log Windows
The result of a program execution is printed to the Output
window. There you will find the prints, tables and reports,
etc.
A log file is printed to the Log window.
The log file contains information about the execution,
whether it was successful or not. It usually points out your
mistakes with warning and error messages so that you can
correct them.
20
The SAS Environment (cont.)
Example: SAS Log
65
proc gplot data=work.influnce;
66
plot di*pred / vaxis=axis1 haxis=axis1;
ERROR: Variable DI not found.
NOTE: The previous statement has been deleted.
67
run;
Make a habit of checking the Log window after every
execution.
Even if SAS has accepted and executed the program, you may
have made a methodological error. Check the note on how
many observations were read, and if there were any
missing values.
21
The SAS Environment (cont.)
Example: SAS Output
patientens alder
Cumulative
ALDER
Frequency
Frequency
__________________________________
0 - 24
25 - 44
45 - 64
65-
41
176
77
25
41
217
294
319
22
The SAS Environment (cont.)
File Types
These files are created by SAS:
- .sas file (SAS program)
- .log file (Log)
- .lst file (Output)
The SAS data sets are saved as .sd7 or .sas7bdat files.
(Other file types, e.g. catalogs, are also used and created by
SAS, but we will not pursue this any further.)
23
The SAS Environment (cont.)
Using the SAS System
You work with SAS using
- Menus and Toolbar
- Command Line
- Key Functions F1-F12
24
The SAS Environment (cont.)
Example
Three different ways to Open a File in the Enhanced Editor:
1. Menus: choose File + Open
2. Toolbar: press the icon for “Open”
3. Command line: write
include ‘N:\temp\bp.sas’
and press Enter.
25
The SAS Environment (cont.)
Commands and Keys
Command
Key
Open File
include ’P:\catalogue\filename.sas’
Ctrl-O
Save File
file ’P:\catalogue\filename.sas’
Ctrl-S
Clear Window
clear
F12
Zoom On/Off Window
zoom
F11
Recall Submitted Programs
recall
F4
Go To Enhanced Editor
wpgm
F5
Go To Log Window
log
F6
Go To Output Window
output
F7
Submit a Program
submit
F8
Change Key Definitions
keys
F9
26
The SAS Environment (cont.)
Write and Read
In the Enhanced Editor you can
- create new, or edit existing, programs
- submit programs
- save programs (an unsaved file is marked with * after the
file name)
You can NOT edit the log file or the output file in their
windows. They are only readable. If you wish to edit these
files, save them and use the Enhanced Editor or Word.
27
SAS syntax
Statements
The SAS code (syntax) consists of statements (sætninger).
Statements mostly begin with a keyword (nøgleord), and
they ALWAYS end with a SEMICOLON.
data work.cohort;
set course.males98;
run;
proc print data=work.cohort;
run;
Examples of keywords: data, set, run, proc.
28
SAS syntax (cont.)
Statements
SAS statements can begin and end anywhere on a line.
data work.cohort;
One or several blanks can be used between words.
data
work.cohort;
One or several semicolons can be used between statements.
data work.cohort;;;
;
29
SAS syntax (cont.)
Statements
The statement can begin and end on different lines.
data
work.cohort;
SAS will not object to several statements on the same line.
However, it is not considered good programming to have
more than one statement per line. It makes the code
difficult to read. Avoid this!
data work.cohort; set course.males98; run;
30
SAS syntax (cont.)
Indenting to improve readability
Improve the readability of your program by adding more
space to the code (= indenting).
Begin data steps and proc steps in the first position, as far left
as possible. The ending run statement should also be in the
first position.
All statements in between should start a few blanks in from
the left margin.
This creates blocks of data steps and proc steps, and you can
easily see where one ends and another begins.
31
SAS syntax (cont.)
Example of Indenting
data work.height;
infile 'h:\mep\rawdata_height.txt';
input name $ 1-20
kon
21
alder
22-23
height 24-30;
if kon=0 and (height ne .) then
do;
if 0<height<81.75 then lnapprx=50;
else
if 81.75<=height then lnapprx=100;
end;
else lnapprx=.;
run;
32
SAS syntax (cont.)
Indenting
Within statements it is also VERY useful to use indenting. Put
similar syntactic words in the same position below each
other.
Use blank lines a lot!
Markers of blocks should be placed in the same position
below one another (e.g. data-run, proc-run, if-else, do-end).
33
SAS Data Sets
What is a SAS Data Set?
A SAS data set is a special file type (.sas7bdat) which
consists of a descriptive part and a data part.
The DESCRIPTIVE part includes
- general information, such as data set name, date of creation,
number of observations and variables etc.
- variable information, such as variable name, type (character
or numeric), format, length, label etc.
34
SAS Data Sets (cont.)
The Data
The DATA part is the data values.
Data is organised with observations in the rows and variables
in the columns.
35
SAS Data Sets (cont.)
Descriptive Part
Proc CONTENTS prints the descriptive part of a data set.
The CONTENTS Procedure
Data Set Name: PPT_EX8.MAIN
Observations:
Member Type:
DATA
Variables:
Engine:
V8
Indexes:
Created:
17:17 Tuesday, August 7, 2001
Observation Length:
Last Modified: 17:17 Tuesday, August 7, 2001
Deleted Observations:
Protection:
Compressed:
Data Set Type:
Sorted:
Label:
-----Alphabetic List of Variables and Attributes----#
Variable
Type
Len
Pos
Format
Informat
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
2
BIRTHYR
Num
8
0
BEST8.
F8.
5
CASE_1
Num
8
24
1
ID
Char
8
56
$8.
$8.
4
LENGTH
Num
8
16
BEST8.
F8.
3
WEIGHT
Num
8
8
BEST8.
F8.
6
age
Num
8
32
4.
8
bmi
Num
8
48
9
generatn
Char
5
64
7
height
Num
8
40
64
9
0
72
0
NO
NO
36
SAS Data Sets (cont.)
Data Part
Proc PRINT prints the data part of a data set.
OBS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
ID
001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
BIRTHYR
WEIGHT
HEIGHT
AGE
1954
1956
1956
1962
1954
1953
1955
1955
1960
1962
1961
1954
1956
1962
1958
1959
1962
1957
62
68
65
56
58
52
69
75
82
68
65
62
58
61
58
62
59
73
1.65
1.67
1.72
1.68
1.59
1.62
1.75
1.73
1.7
1.72
1.68
1.69
1.68
1.64
1.63
1.65
1.64
1.8
43
41
41
35
43
44
42
42
37
35
36
43
41
35
39
38
35
40
BMI
22.7732
24.3824
21.9713
19.8413
22.9421
19.8141
22.5306
25.0593
28.3737
22.9854
23.0300
21.7079
20.5499
22.6800
21.8300
22.7732
21.9363
22.5309
37
SAS Data Sets (cont.)
Create a Data Set
A SAS data set is created from
- SAS data set (.sas7bdat file)
- raw data file (.txt file)
- another external file through importing (EXCEL file, etc.)
or by
- manually entering the data
38
SAS Data Sets (cont.)
Create a Data Set
To use an existing data set, a .sas7bdat file, is the most
common way to create a SAS data set.
How to create a SAS data set from a raw data file is described
in chapter Read Raw Data Into SAS.
Importing non-SAS data is not trivial. Use File + Import
Data. Ask for help if you run into trouble.
- or use the program STAT-Transfer
39
SAS Data Sets (cont.)
Create a Data Set
The easiest way to manually enter data into SAS is via the
Viewtable facility (see later on in this chapter).
You can also use the CARDS or DATALINES statement
(chapter Read Raw Data Into SAS).
40
SAS Data Sets (cont.)
Existing SAS Data Set (.sas7bdat)
Create a SAS data set from an existing SAS data set:
data work.main;
set work.original;
statements;
run;
This will yield an exact copy of the old data set “original”.
The name of the copy is “main”.
Usually we wish to change the new data set, by adding
programming statements after the SET statement.
41
SAS Data Sets (cont.)
Naming Data Sets
PLEASE, use descriptive names for your data sets.
It is not considered clever to name your data sets: final1,
final2, final3, etc.
Other names to avoid are: new, old, mydata, analys, yourname, etc.
More on this topic in the chapter Naming Data Sets and
Variables.
42
SAS Data Sets (cont.)
Viewtable
The Viewtable facility is a user-friendly tool to look at your
data set without using data steps or proc steps.
You enter the Viewtable window by issuing the “viewtable”
command in the Command line.
This will yield a window very similar to EXCEL, with cells,
rows and columns.
43
SAS Data Sets (cont.)
Viewtable
It is very easy to create a data set in the Viewtable window.
Just enter the data manually into the cells. The variable
names are created by clicking on the column header and
following the instructions.
When you click to save the data set, it is saved into a
.sas7bdat file, which may then be used in any data step or
proc steps in the Enhanced Editor.
44
SAS Data Sets (cont.)
Viewtable
If you wish to open an existing data set into the Viewtable
window, just issue the command
“viewtable name-of-data- set”
and it will open.
You can also open a data set from the Explorer window in the
window area to the left. Just navigate to the right library
and double click on the data set icon.
45
SAS Data Sets (cont.)
Variables
There are two types of variables in SAS: character (char) and
numerical (num).
The type refers to the values the variable have.
Examples of a variable called MONTH:
A character variable: MONTH with values ‘Jan’, ‘Feb’, …,
‘Dec’.
A numerical variable: MONTH with values 1, 2, … , 12.
46
SAS Data Sets (cont.)
Variables
The values of a character variable are between quotes ‘’.
When the value is printed, all characters within the quotes
are printed.
Typical character values are letters, while numerical values
always are digits.
Character variables may include digits as well.
47
SAS Data Sets (cont.)
Variables
Missing values of a character variable are represented by a
blank, while a period “.” (punktum) denotes missing values
of a numeric variable.
Character values can be 32767 characters long at most.
(200 characters in version 6)
Good rule: Never use char variables to store numeric values.
For example, always store Patient Number as a numeric.
48
SAS Data Sets (cont.)
Naming Variables
Variable names (e.g. age, bmi) and data set names (e.g. main,
original)
- can be 32 characters (letters, underscores and digits) long at
most
- can be uppercase or lowercase or mixed (mAiN = MaIn)
- must start with a letter (A-Z) or an underscore (_), not a
digit
49
SAS Data Libraries
What is a SAS Data Library?
A SAS data library is the catalogue where your data sets are
stored.
A data library is like a drawer in a filing cabinet. The cabinet
may have several drawers representing several different
libraries.
A data set is a file within a drawer. A drawer may contain
several files.
50
SAS Data Libraries (cont.)
Figure: SAS Data Libraries
51
SAS Data Libraries (cont.)
WORK and SASUSER
Two data libraries are created automatically by SAS:
WORK and SASUSER.
The WORK library is a temporary library. All its contents are
deleted when you exit SAS. If you wish to keep your data
sets, do not put them in the WORK library.
The SASUSER library is a permanent library. All its contents
are kept when you exit SAS.
52
SAS Data Libraries (cont.)
WORK and SASUSER
The physical location of the permanent SASUSER library is
under ‘C:\’.
It is not especially clever to save all the SAS programs and
data sets in the same folder.
Generally we wish to store them in separate folders for
separate projects or papers.
Therefore, it is possible to create your own permanent
libraries.
53
SAS Data Libraries (cont.)
Libraries and Folders
A library is a physical folder anywhere on your hard disk or
server disk, or even floppy disk.
The physical folder for the SASUSER library on Anna’s
computer is ‘C:\Winnt\Profiles\annaj\…\V8’.
Similarly, if you wish to store data sets in a particular folder,
you can create a library by using a libname statement.
54
SAS Data Libraries (cont.)
Libnames
SAS data sets have names of the form
work.original
WORK is the library where the data set is stored.
ORIGINAL is the data file (.sas7bdat) in that library.
The two components of the name are separated by a period
(punktum).
55
SAS Data Libraries (cont.)
Libnames
More formalised, SAS data sets have names of the form
libref.filename
The libref (library reference) is the name of the library.
The filename is the name of data file (.sas7bdat).
56
SAS Data Libraries (cont.)
The WORK Library
For the WORK library you can omit the libref WORK, and
simply write “original” as the data set name.
If no other libname is used, all data sets and formats (see
chapter Formats) created during a SAS session are saved to
the WORK library.
Be aware that since WORK is a temporary library, all its
contents are deleted when you exit SAS.
57
SAS Data Libraries (cont.)
Libname Statement
A libref is defined by a libname statement, which links the
libref to the physical location of a folder.
Libname statements are of the form
libname libref
‘location-of-folder ’;
Libname statements are written outside data steps and proc
steps, generally at the top of a program. Once it has been
submitted the libref will remain defined until you exit SAS.
58
SAS Data Libraries (cont.)
Libname Statement
Example of SAS program
libname sas engine ‘.../phd/artikel1/sasdata’;
data sammenlign;
set sas.glostrup;
proc ...
59
SAS Data Libraries (cont.)
Librefs
The LIBREF is just a link.
It makes the data set easier to access since you do not need to
specify the complete location (P:\catalogue\subcatalogue(s)...\filename.sas) in the data and proc steps.
When a LIBREF is deleted, i.e. the SAS session ends, the
folder it refers to still exists.
60
Submitting a SAS Program
Submit a Program
To submit (= execute) a SAS program:
- Menus: choose Run + Submit, or
- Toolbar: press the icon with “the running man”, or
- Command line: issue “submit” command.
To halt a submitted program press CTRL + BREAK. You may
have to press it a few times.
61
Submitting a SAS Program (cont.)
Submit a Program
When a program is submitted each step (data or proc) is
executed one at a time.
The contents of a data step is performed on each observation
one at a time (i.e. creation of new variables etc.).
Each step generates log to the Log window. Output (if any)
is generated to the Output window.
62
Submitting a SAS Program (cont.)
Check the Log
ALWAYS browse the Log window
after a submission!
If the submission is stopped by errors the data set is
unchanged and you might do incorrect analyses on an old
data set. You will only see it has been stopped by looking at
the log.
(We emphasise this, since we are too well experienced with
the consequences of the opposite.)
63
Submitting a SAS Program (cont.)
Error Messages in the Log
Note (blue): information which do not indicate errors. They
are usually informative to rule out any methodological
errors in you programming.
Warning (green): points out errors which SAS could correct
itself. The execution was performed with these changes.
Still you should check whether it was done properly.
Example: misspelled keywords.
64
Submitting a SAS Program (cont.)
Error Messages in the Log
Error (red): serious errors which SAS could not handle. The
execution was stopped. These errors must be corrected by
you. Example: forgotten semicolons, invalid options,
misspelled variable names.
Especially if you are updating data sets, be aware that red
errors mean NO updating!
65
Submitting a SAS Program (cont.)
Unbalanced Quotes
A special type of syntactic error is unbalanced quotes (‘’).
Quotes must come in pairs. If they do not, the execution will
keep on running forever. You halt it by submitting
’;
66
Submitting a SAS Program (cont.)
Enhanced Editor
When you press submit, all the code in the Enhanced Editor is
executed.
If you only wish to submit a limited number of rows of the
program code, mark it and press submit.
67
The Data Step
Data Statement
Data sets are created through a data step. The data step begins
with a DATA statement.
General form of the DATA statement:
data SAS-data-set;
The SAS-data-set is a name of the form
libref.filename
68
The Data Step (cont.)
Data Statement
To create a data set ORIGINAL in the temporary library
WORK:
data work.original;
When using the temporary WORK library it is possible to
skip the work prefix and just write:
data original;
69
The Data Step (cont.)
Create Variables
A new variable is created by:
variable=expression ;
The expression may consist of numbers, other variables and
operators such as
+
*
/
**
addition
subtraction
multiplication
division
exponentiation (potensopløftning)
70
The Data Step (cont.)
Examples
data work.main;
set work.original;
age=1997-birthyr;
height=height/100;
bmi=weight/(height*height);
run;
71
The Data Step (cont.)
Functions
Useful are the predefined functions in SAS, such as
exp(argument)
log(argument)
int(argument)
exponential function
natural logarithm
the integer part of a numeric argument
There are also non-mathematical SAS specific functions. A
list of useful functions may be found in the SAS manual
SAS Language (pp 521-616).
72
The Data Step (cont.)
Examples
data work.main;
set work.original;
highest=max(height1, height2, height3);
birthyr=year(brthdate);
total=sum(x1,x2,x3,x4,x5);
run;
73
The Data Step (cont.)
Variable Names
If you are creating a series of variables, such as repeated
measurements, put the order number at the end of the
name, e.g. x1, x2, x3, since
total=sum(x1,x2,x3,x4,x5) 
total=sum(of x1-x5)
The use of the notation x1-x5 is widely accepted in many
expressions and procedures. It will not work if the order
number is in the middle of the name.
Also see the chapter Naming Data Sets and Variables.
74
The Proc Steps
Procedures
A procedure (proc) is a predefined function that operate on
data sets. By specifying the predefined statements in the
procedure you can adapt it to your needs and wishes.
Examples of procedures:
proc contents
proc print
proc freq
proc means
prints the descriptive part of a data set
prints the data part of a data set
creates frequency tables, etc.
calculates means and other statistics
75
The Proc Steps (cont.)
Data Step Vs. Proc Step
Usually, for beginners as well as among advanced users, the
data step is more comprehensible as a concept. Not seldom
is extensive programming done in the data step, when the
same result easily could have been obtained through a
simple option in a procedure.
As a rule, operations on observations (within rows) are done
in the data step, e.g. adding two variables together to make
a third.
Operations on variables (within columns) are done in a proc
step, e.g. taking the mean of a variable.
76
The Proc Steps (cont.)
Proc CONTENTS
The CONTENTS procedure prints out the descriptive part of
a data set.
The descriptive part includes:
- General information: data set name, number of
observations, number of variables, etc.
- Variable information: variable name, type, length, position,
format, label, etc.
77
The Proc Steps (cont.)
Proc CONTENTS
The general form of the CONTENTS procedure:
proc contents data=SAS-data-set;
run;
Example:
libname course ‘h:\SasAtMEP\Course’;
proc contents data=course.main;
run;
78
The Proc Steps (cont.)
Proc PRINT
The PRINT procedure prints out the data part of a data set.
It is possible to choose which variables to print. If none are
chosen, all variables will be printed.
The first column is the OBS column, which indicates
observation. If there are more variables than will fit into the
output window, the output is split and the exceeding
variables printed on the following page. The OBS column
is reprinted to indicate observation.
79
The Proc Steps (cont.)
Proc PRINT
The general form of the PRINT procedure:
proc print data=SAS-data-set;
run;
OR to print a specified list of variables from the data set
proc print data=SAS-data-set;
var variable1 variable2 variable3 ... ;
run;
80
The Proc Steps (cont.)
Proc PRINT
Examples:
proc print data=course.main;
run;
proc print data=course.main;
var age bmi;
run;
81
The Proc Steps (cont.)
Proc FREQ
The FREQ procedure is mainly used to create frequency
tables, although it has a wide range of statistical features as
well.
It creates both one-way and multiple-way tables.
(FREQ is pronounced “frek” in Danish and “freek” in
English.)
82
The Proc Steps (cont.)
Proc FREQ
The general form of FREQ procedure:
proc freq data=SAS-data-set;
tables var1;
run;
OR for a two-way table
proc freq data=SAS-data-set;
tables var1 * var2 / nopercent norow nocol;
run;
83
The Proc Steps (cont.)
Proc FREQ
Example one-way table:
proc freq data=course.main;
tables age;
run;
The SAS System
Cumulative Cumulative
AGE
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
35
4
22.2
4
22.2
36
1
5.6
5
27.8
37
1
5.6
6
33.3
…
43
44
3
1
16.7
5.6
17
18
94.4
100.0
84
The Proc Steps (cont.)
Proc FREQ
Example two-way table:
proc freq data=course.main;
tables case*age/nopercent nocol norow;
run;
The nopercent option suppresses the printing of cell
percentages.
Nocol and norow suppress column and row cell percentages
respectively.
85
The Proc Steps (cont.)
Proc FREQ
The SAS System
TABLE OF AGE BY CASE_1
AGE
CASE_1
Frequency‚
0‚
1‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
35 ‚
3 ‚
1 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
36 ‚
1 ‚
0 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
37 ‚
0 ‚
1 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
...
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
44 ‚
0 ‚
1 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total
9
9
Total
4
1
1
1
18
86
The Proc Steps (cont.)
Proc SORT
The SORT procedure sorts the data set according to a chosen
variable. The sorted data set replaces the unsorted data set,
unless you define an OUT data set.
The SORT procedure can sort by several variables, in
ascending (default) or descending (option) order.
Missing values are defined as minus infinity, i.e. less than all
other numeric values.
87
The Proc Steps (cont.)
Proc SORT
The general form of the SORT procedure:
proc sort data=SAS-data-set out=SAS-data-set;
by variables;
run;
OR if you want descending order
proc sort data=SAS-data-set out=SAS-data-set;
by descending variables;
run;
88
The Proc Steps (cont.)
Proc SORT
Example:
proc sort data=course.main out=course.sortage;
by case_1 age;
run;
This will yield a data set called course.sortage, where the
observations are sorted by case_1 and within each category
of case_1 by age.
89
The Proc Steps (cont.)
Proc SORT
There is no result in the Output window from proc SORT, but
a proc PRINT of data set course.sortage gives:
OBS
ID
BIRTHYR
WEIGHT
HEIGHT
AGE
1
2
3
4
...
9
10
11
12
13
14
...
17
18
BMI
CASE_1
010
014
017
011
1962
1962
1962
1961
68
61
59
65
1.72
1.64
1.64
1.68
35
35
35
36
22.9854
22.6800
21.9363
23.0300
0
0
0
0
012
004
009
002
003
007
1954
1962
1960
1956
1956
1955
62
56
82
68
65
69
1.69
1.68
1.7
1.67
1.72
1.75
43
35
37
41
41
42
21.7079
19.8413
28.3737
24.3824
21.9713
22.5306
0
1
1
1
1
1
005
006
1954
1953
58
52
1.59
1.62
43
44
22.9421
19.8141
1
1
90
The Proc Steps (cont.)
Proc MEANS
The MEANS procedure calculates basic statistics. By default
the statistics are
N = number of non-missing observations
Mean = mean value, average
Std Dev = standard deviation
Minimum = minimum value
Maximum = maximum value
Optional statistics include Nmiss (number of missing
observations), range (maximum-minimum), etc.
91
The Proc Steps (cont.)
Proc MEANS
The general form of MEANS procedure:
proc means data=SAS-data-set;
run;
This will yield summary statistics on all variables in the data
set.
Missing values are excluded from the analysis.
92
The Proc Steps (cont.)
Proc MEANS
Example:
proc means data=course.main;
run;
The MEANS Procedure
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
BIRTHYR
62
1959.35
3.3294354
1953.00
1967.00
WEIGHT
63
61.7460317
6.5129356
47.0000000
80.0000000
LENGTH
64
1.6675000
0.0617213
1.4800000
1.8000000
CASE_1
64
0.4687500
0.5029674
0
1.0000000
age
62
37.6451613
3.3294354
30.0000000
44.0000000
height
64
1.6675000
0.0617213
1.4800000
1.8000000
bmi
63
22.1982718
1.9262282
17.9591837
29.3847567
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƑƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Naturally, the character variable ID is not displayed.
93
The Proc Steps (cont.)
Proc MEANS
You can modify the proc MEANS code to suit your wishes:
- To specify the number of decimals used in the printout add
the option MAXDEC.
- If you are only interested in a selection of variables, use a
VAR statement.
94
The Proc Steps (cont.)
Proc MEANS
Example:
proc means data=course.main maxdec=2;
var age bmi;
run;
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
AGE
BMI
18
18
39.44
22.65
3.24
1.95
35.00
19.81
44.00
28.37
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
95
The Proc Steps (cont.)
Proc MEANS
In proc MEANS it is possible to calculate statistics on
subgroups of the data set, e.g. the mean bmi and age for
cases and controls separately.
There are two different ways to deal with subgroup statistics,
depending on what output you are interested in:
- BY statement
- CLASS statement
96
The Proc Steps (cont.)
Proc MEANS
The BY statement:
proc means data=course.main maxdec=2;
var age bmi;
by case_1;
run;
The BY statement requires that the data set has previously
been sorted according to the BY variable.
97
The Proc Steps (cont.)
Proc MEANS
Result from BY statement:
CASE_1=0
The MEANS Procedure
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
age
32
37.50
3.08
33.00
44.00
bmi
34
22.09
1.95
17.96
29.38
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
CASE_1=1
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
age
30
37.80
3.62
30.00
44.00
bmi
29
22.33
1.92
19.61
27.34
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
98
The Proc Steps (cont.)
Proc MEANS
The CLASS statement:
proc means data=course.main maxdec=2;
var age bmi;
class case_1;
run;
The CLASS statement does NOT require any sorting.
99
The Proc Steps (cont.)
Proc MEANS
Result from CLASS statement:
The MEANS Procedure
N
CASE_1
Obs
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
0
34
age
32
37.50
3.08
33.00
44.00
bmi
34
22.09
1.95
17.96
29.38
1
30
age
30
37.80
3.62
30.00
44.00
bmi
29
22.33
1.92
19.61
27.34
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
100
The Online HELP
Quick Help on Syntax
SAS has a very good ONLINE HELP. In this help you can get
full information on syntax. For more theoretical issues you
should use the paperback manuals or the Online
Documentation (see later on how to use them).
The online help is accessed through the Command line and
command “help”.
help print
help means
(Do not write “proc” before the process name.)
101
The Online HELP (cont.)
Using the Online Help
When a help command is issued the HELP Window will open
with topics:
Introduction
Syntax
Additional Topics (occasionally)
To get full information on how to write the code, choose
SYNTAX.
102
The Online HELP (cont.)
Example
As an example, access online help for proc MEANS.
“help means” + choose SYNTAX
These are all the possible statements that proc MEANS
accept. If you want to know more about a specific
statement just click on it and read:
103
The Online HELP (cont.)
Example
PROC MEANS: Syntax
PROC MEANS <option(s)> <statistic-keyword(s)>; BY
<DESCENDING> variable-1 <... <DESCENDING> variablen><NOTSORTED>;
CLASS variable(s) </ option(s)>;
FREQ variable;
ID variable(s);
OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
<id-group-specification(s)> <maximum-id-specification(s)>
<minimum-id-specification(s)> </ option(s)> ;
TYPES request(s);
VAR variable(s) < / WEIGHT=weight-variable>;
WAYS list;
WEIGHT variable;
104
The Online HELP (cont.)
Explanation to the Online Help Text
- underlined word = keyword referring to a statement
(statements within a procedure are optional, the PROC and
the RUN statements are required)
- black word = required if the corresponding keyword is used
- words within “< >” = optional, not required
- words separated by “|” = possible choices of values for a
specific option
105
The Online HELP (cont.)
Example
If you click on the “PROC MEANS”, a list of possible
options will be displayed.
Among them is the MAXDEC= option which we have
already used. The equal sign is required. Next to
MAXDEC= is the black word “number”. If you use the
MAXDEC option you are required to fill in a number
corresponding to the maximum number of decimals to be
displayed.
(The exact conventions depend on which version of the help
you use…. –pd)
106
Labels
What are Labels?
Each variable has a variable name (e.g. birthyr) and a LABEL
(e.g. Year of Birth). The label is how the variable is written
on the output. By default the label = variable name unless
you specify it.
To define and assign a label, use the LABEL statement.
label variable1 = ‘label-name1’
variable2 = ‘label-name2’
...
;
107
Labels (cont.)
What are Labels?
Labels can be 256 characters long at most.
The output from proc CONTENTS include a column with
labels for all the variables in the data set.
To delete a label simply define the label equal to space:
label variable1 =
‘ ‘;
108
Labels (cont.)
Permanent or Temporary Labels
Labels can be assigned inside a data step or a proc step.
Labels assigned in a data step are permanent. They are also
transferred to new data sets.
Labels assigned in a proc step are temporary. A temporary
label replaces a permanent label throughout the execution
of the procedure step.
Most common are permanent labels defined in the data step.
109
Labels (cont.)
Example
Assigning permanent labels in a data step:
data course.main;
set course.original;
age=1997-birthyr;
height=height/100;
bmi=weight/(height*height);
label birthyr=‘Year of Birth’
age=‘Alder’
height=‘Højde’
bmi=‘BMI’;
run;
110
Labels (cont.)
Example
With label:
Without label:
Year of
OBS
1
2
3
4
5
6
7
...
18
Birth
1954
1956
1956
1962
1954
1953
1955
1957
OBS
1
2
3
4
5
6
7
...
18
BIRTHYR
1954
1956
1956
1962
1954
1953
1955
1957
111
Formats
What are Formats?
Formats are used on variable values to
- display the values differently from the raw values (e.g. with
fewer decimals, or as dates)
- group the values (values 0-25=low, values 26-100=high)
There are predefined formats in SAS which you may use, but
you can also create your own formats. The procedures are
designed to handle formats and use them accordingly.
112
Formats (cont.)
Assign Formats
To assign formats you use the FORMAT statement inside a
data step (permanently) or a proc step (temporarily).
The general form of the FORMAT statement is
format variable1 format1.;
The following yields a value with two digits, a decimal point
and two decimals (5 positions, of which two are decimals):
format bmi 5.2;
113
Formats (cont.)
Example Permanent Assignment
data course.main;
set course.original;
age=1997-birthyr;
height=height/100;
bmi=weight/(height*height);
format age 4.0
bmi 4.2
birthyr best4.;
run;
114
Formats (cont.)
Example Temporary Assignment
proc print data=course.main;
var birthyr age bmi;
format age 4.0
bmi 4.2
birthyr best5.;
run;
Usually, the format statement is at the end of the data or proc
step together with the label statement.
115
Formats (cont.)
Predefined SAS Formats
Formats are all of the form (where < > indicates optional and
is not to be typed in):
format-name<w>.<d>
w indicates maximum number of positions used to display the
value
d indicates optional number of decimals in a numeric format
116
Formats (cont.)
Predefined SAS Formats
Formats for character variables need a $ sign in the first
position:
$format-name<w>.<d>
All formats, numeric or character, MUST contain a period
(. punktum), either at the end or before the d value.
See examples.
117
Formats (cont.)
Predefined SAS Formats
w.d
=
$w.
=
COMMAw.d =
BESTw.
=
numeric values at most w positions long, and d of
these positions are decimals
character values w positions long
numeric values with commas and decimal points:
12,345.67
chooses the best notation with w positions for
numeric values
The period (.) occupies one position in all of these formats.
118
Formats (cont.)
Example
Stored Value
Format
Displayed Value
12345.9876
10.4
12345.9876
12345.9876
10.2
12345.99
12345.9876
8.2
12345.99
12345.9876
COMMA11.4
12,345.9876
12345.9876
BEST8.
12345.99
12345.9876
BEST4.
12E3
(E3=103)
119
Formats (cont.)
User-defined Formats
There are situations when the predefined formats do not
suffice.
An example, you wish to group the BMI values into three
categories; underweight, normal weight, overweight.
There is no predefined format to meet your demands in this
situation. The solution is to create your own format.
120
Formats (cont.)
User-defined Formats
To use your own formats you must
- define the format
- assign the format
Several variables may be assigned to the same format
and
A variable may be assigned to different formats in different
procedures
121
Formats (cont.)
Proc FORMAT defines formats
Formats are defined through the FORMAT procedure.
proc format;
value format-name range1 = ‘label’
range2 = ‘label’
... ;
run;
The labels must be inside quotes (‘’).
122
Formats (cont.)
User-defined Formats
Format names are like any other SAS names, however they
must not end in a number.
A format for a character variable must have a dollar sign $ as
its first character.
Format names do NOT end with a period (.) in proc
FORMAT. The period is only used when assigning the
format in a data or proc step.
123
Formats (cont.)
Example: User-defined Formats
A case/control format (case_1f) and a BMI format (bmif).
proc format;
value case_1f
value bmif
0=‘Case’
1=‘Control’
other=‘Other’;
low-20.0=‘Underweight’
20.0-25.0=‘Normal weight’
25.0-high=‘Overweight’
other=‘Other’;
run;
124
Formats (cont.)
Example: User-defined Formats
Above, a value of 20.0000 would fall into Underweight, but
20.0001 would fall into Normal weight.
The first true range alternative is used for a value of a
variable assigned by the format.
125
Formats (cont.)
Special Format Values
other = all other values, including missing values
low
= the lowest value (minimum) of the variable assigned
to the format, including missing values. (For
character formats low does not include missing
values.)
high
= the highest value (maximum) of the variable
assigned to the format
126
Formats (cont.)
Assigning User-defined Formats
User-defined formats are assigned by a FORMAT statement,
exactly as with the predefined formats.
proc freq data=course.main;
tables bmi;
format bmi bmif.;
run;
Cumulative Cumulative
BMI
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Underweight
2
11.1
2
11.1
Normal weight
14
77.8
16
88.9
Overweight
2
11.1
18
100.0
127
Formats (cont.)
Assigning User-defined Formats
proc means data=course.main maxdec=1;
class bmi;
var age;
format bmi bmif.;
run;
The MEANS Procedure
Analysis Variable : age
N
bmi
Obs
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Underweight
7
7
37.9
3.3
35.0
44.0
Normal weight
51
49
37.6
3.4
30.0
44.0
Overweight
5
5
38.0
3.3
35.0
42.0
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
128
Formats (cont.)
Assigning User-defined Formats
As shown above, user-defined formats are assigned in the
exact same way as the SAS formats:
format variable1 format1.;
129
Titles and Footnotes
Titles
You can add titles to the output with a TITLE statement. A
TITLE statement is one of the global statements which do
not have to be included in a data step or a proc step (other
global statements are the LIBNAME and OPTIONS
statements) .
The form of the TITLE statement is
title ‘here-you-write-the-title’;
The title must be surrounded by quotes (’).
130
Titles and Footnotes (cont.)
Example
title ‘BMI Body Mass Index’;
proc freq data=course.main;
tables bmi;
format bmi bmif.;
run;
BMI Body Mass Index
Cumulative Cumulative
BMI
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Underweight
2
11.1
2
11.1
Normal weight
14
77.8
16
88.9
Overweight
2
11.1
18
100.0
131
Titles and Footnotes (cont.)
Delete Titles
A title will stay defined, and be printed to all output, until it is
changed, or deleted.
To delete a title simply write
title;
132
Titles and Footnotes (cont.)
Several Titles
It is also possible to have second titles below the main title.
A maximum of 10 titles can be used simultaneously.
title1 ‘here-you-write-the-first-title’;
title2 ‘here-you-write-the-second-title’;
...
title10 ‘here-you-write-the-tenth-title’;
133
Titles and Footnotes (cont.)
Several Titles
The unnumbered title statement, is equal to the title1
statement.
It is possible to have, for example, title2 undefined or deleted
while title3 is defined. It will result in a gap between title1
and title3 on the printout representing title2.
However, when you delete say title3, all titles beneath it
(title4-title10) will also be deleted.
title3 ;
134
Titles and Footnotes (cont.)
Example
title ‘BMI Body Mass Index’;
title2 ‘Women 35-45 yrs’;
proc freq data=course.main;
tables bmi;
format bmi bmif.;
run;
135
Titles and Footnotes (cont.)
Example
BMI Body Mass Index
Women 35-45 yrs
Cumulative Cumulative
BMI
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Underweight
2
11.1
2
11.1
Normal weight
14
77.8
16
88.9
Overweight
2
11.1
18
100.0
136
Titles and Footnotes (cont.)
Titles Window
A shortcut to defining titles is the Titles window.
Issue the “title” command in the Command line.
The Titles window will open, with all your current title
definitions. From here the titles can be changed directly by
editing.
The disadvantage of this shortcut is that you can NOT save
the title definitions, as you could have if you had written
them in code. When a program is rerun later after many
title changes, the titles will not be as originally.
137
Titles and Footnotes (cont.)
Footnotes
Footnotes work in the exact same way as titles. The only
difference is that footnotes are written at the bottom of the
printout.
footnote ‘here-you-write-the-footnote’;
To delete a footnote write
footnote;
138
Titles and Footnotes (cont.)
Example
footnote ‘BMI Body Mass Index’;
footnote2 ‘Women 35-45 yrs’;
proc freq data=course.main;
tables bmi;
format bmi bmif.;
run;
139
Titles and Footnotes (cont.)
Example
Cumulative Cumulative
BMI
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Underweight
2
11.1
2
11.1
Normal weight
14
77.8
16
88.9
Overweight
2
11.1
18
100.0
BMI Body Mass Index
Women 35-45 yrs
140
Titles and Footnotes (cont.)
Footnotes Window
To open the Footnotes window and edit footnotes directly,
issue the command “footnote” in the Command line.
________________________________________________
There are lots of additional features to titles and footnotes
available, such as fonts, sizes and orientation, etc. See the
manual SAS/GRAPH Software Vol. I.
141
Subsetting a Data Set
Subsets of Data
Often one wants to use only a subset of a data set, e.g.
persons older than 60 years, women, cases etc.
This is particularly useful when performing data cleaning, and
you only want to print the observations with extreme values
of a variable, say blood pressure > 200.
142
Subsetting a Data Set (cont.)
WHERE option
In procedures you use the WHERE data set option to subset
the data set.
proc print data=SAS-data-set(where=(expression));
run;
The WHERE data set option may be used in any procedure. It
can also be used in data steps, although it is less usual.
data course.cases;
set course.main(where=(case_1=1));
run;
143
Subsetting a Data Set (cont.)
WHERE option
The expression must be a logical one, resulting in true or
false.
Only observations for which the expression is true will be
used in the proc step.
Examples of expressions are:
where=(birthyr gt 1950)
where=(1947<birthyr<1950)
where=(birthyr=1947)
144
Subsetting a Data Set (cont.)
Conditional Operators
Possible conditional operators (use sign or abbreviation):
=
^=
>
<
>=
<=
eq
ne
gt
lt
ge
le
equal to
not equal to
greater than
less than
greater than or equal to
less than or equal to
145
Subsetting a Data Set (cont.)
Logical Operators
You can also combine several expressions by using logical
operators:
AND
OR
if all expressions are true then the
compound expression will be true,
else it is false
if any of the expressions are true, then the
compound expression will be true,
else it is false
146
Subsetting a Data Set (cont.)
Examples
proc freq data=course.main(where=(birthyr<1950 and case_1=1));
... ;
run;
proc means data=course.main(where=((birthyr<1950 or
birthyr>1953)
and
(case_1 ne .)));
... ;
run;
We recommend that you use parentheses as much as you can.
It will enhance the reading as well as reduce the errors.
There is no limit to how many pairs you are allowed to use.
147
Subsetting a Data Set (cont.)
Special Operators
Special operators to use with the WHERE option:
IS MISSING
BETWEEN - AND
true if value is missing
true if value falls within the range
Examples:
where=(birthyr is missing)
 where=(birthyr=.)
where=(birthyr between 1950 and 1952) 
 where=(1950<=birthyr<=1952)
148
Subsetting a Data Set (cont.)
Special Operators
Special character operators to use with the WHERE option:
LIKE
true if value is identical to the argument
where=(lastname like ‘Johnsson’);
149
Subsetting a Data Set (cont.)
Special Operators
The power of the LIKE operator is that you can use different
types of matching arguments:
‘Smi%’
matches all character values that begin
with “Smi”
‘%th’
matches all character values that end
with “th”
150
Subsetting a Data Set (cont.)
Special Operators
‘Smi_’
matches all character values that are
four characters long and begin with
“Smi”
‘Smi_ _’ matches all character values that are
five characters long and begin with
“Smi”
151
Subsetting a Data Set (cont.)
Special Operators
Another special operator is CONTAIN.
CONTAIN yields true if any string of the value matches the
argument.
where=(lastname contains ‘nss’);
All last names with the letters “nss” in them will be selected.
152
Subsetting a Data Set (cont.)
Create a Subset
Sometimes you want to make a new data set from a subset.
These are two ways to accomplish this:
data course.cases(where=(case_1=1));
set course.main;
run;
which will delete the redundant observations at the end of the
data step.
153
Subsetting a Data Set (cont.)
Create a Subset
data course.cases;
set course.main(where=(case_1=1));
run;
which will delete the redundant observations at the beginning
of the data set.
However, a good advice is to minimise the number of data
sets as much as possible. The fewer data sets the better!
There is nearly always a procedure to use on the main data
set instead of creating a new data set.
154
Subsetting a Data Set (cont.)
WHERE statement
There is a third way too, the WHERE statement:
data course.cases;
set course.main; where (case_1=1);
run;
Note that no equal sign is used in the WHERE statement!
Apart from the equal sign, the WHERE statement works as
the WHERE option.
May also be written
if case_1=1;
155
Subsetting a Data Set (cont.)
OBS and FIRSTOBS
In both data and proc steps you can use the OBS data option
to subset a specific range of observations.
To subset the ten first observations:
proc print data=course.main(obs=10);
To subset observations 5 to 10:
proc print data=course.main(firstobs=5 obs=10);
Note that no WHERE is used!
156
Subsetting a Data Set (cont.)
Subsetting IF
There is one occasion when a WHERE statement will not
work, namely if you wish to condition on a variable which
you create in that same data step.
Instead you can use a subsetting IF statement:
if (your-expression);
157
Subsetting a Data Set (cont.)
Example
data course.less20yr;
set course.main;
age75=1975-birthyr;
if (age75 < 20);
run;
This will yield a data set where all the observations were
younger than 20 yrs in 1975.
158
Subsetting a Data Set (cont.)
Delete Observations
It is possible to delete both observations and variables from a
data set.
Observations are deleted through the DELETE statement:
if expression then delete;
(The IF-THEN statement is explained in detail later on.)
159
Subsetting a Data Set (cont.)
Example
data course.after65;
set course.main;
if birthyr<1965 then delete;
run;
This will yield a data set where all observations with
birthyr<1965 have been deleted.
160
Subsetting a Data Set (cont.)
DELETE vs. WHERE vs. IF
The same result could have been obtained through a WHERE
statement, or by a subsetting IF
if birthyr>=1965;
However, if variable BIRTHYR is created in the same data
step, then you need to use IF or a WHERE data set option
on the target data set:
data course.after65(where=(birthyr>65));
...;
run;
161
Subsetting a Data Set (cont.)
Delete Variables
To delete variables use either of the KEEP and DROP
statements.
keep variable1 variable2 ... ;
drop variable1 variable2 ... ;
The KEEP statement will yield a data set where all variables
NOT listed in the KEEP statement list are deleted.
The DROP statement will yield a data set where all variables
listed in it are dropped.
162
Subsetting a Data Set (cont.)
Delete Variables
You use KEEP and DROP in the same way as WHERE, either
as a data set option, or as a statement in a data step.
Thus KEEP and DROP data set options can be used in
procedures as well.
Be aware that you can NOT use both KEEP and DROP in the
same data step.
163
Subsetting a Data Set (cont.)
Examples
KEEP statement:
data course.half;
set course.main;
keep birthyr age case_1;
run;
164
Subsetting a Data Set (cont.)
Examples
KEEP data set option:
data course.half(keep=birthyr age case_1);
set course.main;
run;
or
data course.half;
set course.main(keep=birthyr age case_1);
run;
165
IF-THEN-ELSE
IF-THEN-ELSE
Quite often one wishes to create a variable of the form
1 if age<30
X =
0 else
The value of X depends on the value of variable age.
166
IF-THEN-ELSE (cont.)
IF-THEN-ELSE Statement
To define the X variable in SAS is simple. The code is as
follows:
data course.groups;
set course.main;
if (age<30) then X=1;
else X=0;
run;
Note, that since a missing value is interpreted as negative infinity, it will fall under
X=1. More about this problem shortly.
167
IF-THEN-ELSE (cont.)
IF-THEN-ELSE Statement
The general form of the IF-THEN-ELSE statement is
if expression1 then expression2 ;
else expression3 ;
Expression1 must be a logical expression yielding either true
or false.
If expression1 is true than expression2 is calculated.
If expression1 is false then expression3 is calculated.
168
IF-THEN-ELSE (cont.)
Example
It is very useful to use several IF-THEN-ELSE statements in
a series.
if (age<30) then generatn=1;
else
if (30<=age<60) then generatn=2;
else
if (60<=age) then generatn=3;
else generatn=.;
169
IF-THEN-ELSE (cont.)
Example
Say that the age=67, then (age<30) will yield false, and we
will go right to the ELSE line.
There the expression is a new IF-THEN-ELSE statement
where we continue our execution.
(30<=age<60) is false too, and so we continue at the next
ELSE line.
There the expression (60<=age) is true, and consequently we
calculate the THEN-expression generatn=3.
170
IF-THEN-ELSE (cont.)
Good Programming
This definition of the variable generatn was perfect in the
sense that it included all possible values of age in the IFexpressions.
A good IF-THEN-ELSE statement should!
If all the IF-expressions yield FALSE (that is, if you have not
included all possible values of the conditioning variable),
then the new variable will automatically be set to missing.
171
IF-THEN-ELSE (cont.)
Good Programming
There should always be an IF-expression taking care of
missing values of the conditioning variable (age in our
example).
Usually the missing value option is put at the end of a long
series of IF-THEN-ELSE statements, as the final ELSEstatement.
However, this is not always appropriate. In our example
age<30 will be true for a missing value of age (missing =
negative infinity).
172
IF-THEN-ELSE (cont.)
Good Programming
This problem is avoided if an extra IF-THEN-ELSE
statement is added at the beginning:
if (age eq .) then generatn=.;
else
if (age<30) then generatn=1;
else
if (30<=age<60) then generatn=2;
else generatn=3;
This will prevent any missing values from falling under the
expression (age<30).
173
IF-THEN-ELSE (cont.)
ELSE Statement
Question:
What happens if I don’t use the ELSE statement in the IFTHEN statement?
I.e.
if
if
if
if
(age eq .)
(0<age<30)
(30<=age<60)
(60<=age)
then
then
then
then
generatn=.;
generatn=1;
generatn=2;
generatn=3;
174
IF-THEN-ELSE (cont.)
ELSE Statement
Answer:
Nothing. The result is the same as before.
However, the ELSE statement is a far better alternative when
it comes to complicated IF-THEN statements (see the next
picture).
Make a habit of using the ELSE statement.
175
IF-THEN-ELSE (cont.)
ELSE Statement
The difference between IF-THEN and IF-THEN-ELSE:
As soon as an IF statement is found to be true, the following
IF statements in the series of IF-THEN-ELSE statements
are not calculated.
If you only use IF-THEN statements, all IF statements will be
checked, even after you have found the true one.
Have you by any chance not made them exclusive of each
other, the latest assignment will be in effect.
176
IF-THEN-ELSE (cont.)
ELSE Statement
Example of a faulty definition:
if
if
if
if
(age eq .)
(age<30)
(age<60)
(60<=age)
then
then
then
then
generatn=.;
generatn=1;
generatn=2;
generatn=3;
All persons younger than 60 years will be have generatn=2!
177
IF-THEN-ELSE (cont.)
DO-END
If you wish to assign several variables in a IF-THEN-ELSE
statement you can use a DO-END statement.
data course.main;
set course.original;
if (age eq .) then
do;
generatn=.;
credit=.;
end;
else
if (age<30) then
do;
generatn=1;
credit=1;
end;
...;
run;
178
IF-THEN-ELSE (cont.)
DO-END
The general form of the DO-END statement is
IF expression THEN
DO;
statements
END;
ELSE
DO;
statements
END;
179
Set and Merge
Joining Data Sets
Previously we have taken a subset of a data set. It is also
possible to join (concatenate and merge) data sets together.
Concatenate= adding observations (“lægge dataset efter
hinanden”)
Merge = adding variables, match data sets (“lægge dataset
ved siden af hinanden”)
180
Set and Merge (cont.)
SET
To concatenate two or several data sets, use the SET
statement.
data SAS-data-set;
set data-set1 data-set2 data-set3 ... ;
run;
The top observations in the SAS-data-set will be from dataset1, the next observations from data-set2 and so on.
181
Set and Merge (cont.)
SET
If there are different variables in the data sets, missing values
will be generated for the observations which did not have
the variables before the concatenation..
Variables not present in all the data sets will be found to the
far right of the concatenated data set.
182
Set and Merge (cont.)
Example
We wish to add three observations to our data set
course.main. The three observations are stored in data set
course.extraobs.
data course.concatenate;
set course.main course.extraobs;
run;
183
Set and Merge (cont.)
Example
Concatenate.sas7bdat:
OBS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
ID
001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
029
030
031
BIRTHYR
WEIGHT
AGE
BMI
ALDER
1954
1956
1956
1962
1954
1953
1955
1955
1960
1962
1961
1954
1956
1962
1958
1959
1962
1957
.
.
.
62
68
65
56
58
52
69
75
82
68
65
62
58
61
58
62
59
73
.
.
.
43
41
41
35
43
44
42
42
37
35
36
43
41
35
39
38
35
40
.
.
.
22.77319
24.38237
21.97134
19.84127
22.94213
19.81405
22.53061
25.05931
28.3737
22.9854
23.03005
21.70792
20.54989
22.67995
21.82995
22.77319
21.93635
22.53086
28.59
22.14
23.85
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
48
51
42
course.main
course.extraobs
184
Set and Merge (cont.)
Example
Since alder = age it would be nice to transform the three alder
values to age values. Use a RENAME data set option.
rename=(old-var-name=new-var-name);
In the example:
data course.concatenate;
set course.main
course.extraobs(rename=(alder=age));
run;
185
Set and Merge (cont.)
Example
OBS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
ID
001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
029
030
031
BIRTHYR
WEIGHT
AGE
BMI
1954
1956
1956
1962
1954
1953
1955
1955
1960
1962
1961
1954
1956
1962
1958
1959
1962
1957
.
.
.
62
68
65
56
58
52
69
75
82
68
65
62
58
61
58
62
59
73
.
.
.
43
41
41
35
43
44
42
42
37
35
36
43
41
35
39
38
35
40
48
51
42
22.77319
24.38237
21.97134
19.84127
22.94213
19.81405
22.53061
25.05931
28.3737
22.9854
23.03005
21.70792
20.54989
22.67995
21.82995
22.77319
21.93635
22.53086
28.59
22.14
23.85
course.main
course.extraobs
186
Set and Merge (cont.)
MERGE
To merge or match two or several data sets, use the MERGE
statement:
data SAS-data-set;
merge data-set1 data-set2 data-set3 ... ;
by variable;
run;
The first variables in the SAS-data-set will be from data-set1,
the next variables from data-set2 etc.
187
Set and Merge (cont.)
MERGE
The BY variable (matching variable) is usually an
identification variable unique to all observations and
present in all data sets to be merged.
The data sets must be sorted according to the matching
variable before MERGE is used.
If there are different observations in the data sets, missing
values will be generated for observations not present in
other data sets. These observations will be found at the end
of the merged data set.
188
Set and Merge (cont.)
Example
We wish to add a variable to our data set course.main.
The new variable is the number of children each woman has,
variable CHILD.
This variable is stored together with the variable ID in a data
set called course.extravar. ID is our matching variable,
unique for all observations.
189
Set and Merge (cont.)
Example
proc sort data=course.main course.extravar;
by id;
run;
proc sort data=course.extravar;
by id;
run;
data course.merged;
merge course.main course.extravar;
by id;
run;
190
Set and Merge (cont.)
Example
merged.sas7bdat:
OBS
ID
BIRTHYR
WEIGHT
AGE
BMI
1
2
3
4
5
6
7
8
9
10
11
...
17
18
19
20
21
001
002
003
004
005
006
007
008
009
010
011
1954
1956
1956
1962
1954
1953
1955
1955
1960
1962
1961
62
68
65
56
58
52
69
75
82
68
65
43
41
41
35
43
44
42
42
37
35
36
22.7732
24.3824
21.9713
19.8413
22.9421
19.8141
22.5306
25.0593
28.3737
22.9854
23.0300
1
0
2
2
0
3
1
2
4
.
2
017
018
029
030
031
1962
1957
.
.
.
59
73
.
.
.
35
40
48
51
42
21.9363
22.5309
28.59
22.14
23.85
2
0
0
3
1
course.main
CHILD
course.extravar
191
Read Raw Data Into SAS
Raw Data
Sometimes you have a raw data file you want to read into
SAS. The data are not stored in a SAS data set, or .sas7bdat
file.
Perhaps data are of the form:
001
002
003
004
005
006
...
016
017
018
Anderson 1954 62 43
Askelund 1956 68 41
Bengtson 1956 65 41
Carner 1962 56 35
Holmfors 1954 58 43
Johnsson 1953 52 44
Torberg 1959 62 38
Turner 1962 59 35
Öhlund 1957 73 40
192
Read Raw Data Into SAS (cont.)
List Input Format
This form of data is called list input format. It means that
each data value is separated by a blank.
List input formatted data have the following characteristics:
- all values contain eight or fewer characters
- values are separated by one or several blanks
- missing values (both numeric and character) must be
written as a single period (.)
- numerical values must include all necessary decimal points
193
Read Raw Data Into SAS (cont.)
INPUT and CARDS
To create a data set using list input formatted data, write
data course.nameincl;
input id $ name $ birthyr weight age;
cards;
001 Anderson 1954 62 43
002 Askelund 1956 68 41
003 Bengtson 1956 65 41
...
016 Torberg 1959 62 38
017 Turner 1962 59 35
018 Öhlund 1957 73 40
;
run;
194
Read Raw Data Into SAS (cont.)
INPUT and CARDS
In the INPUT statement you list the variables in the same
order as they are presented.
If a variable is a character variable, put a dollar sign ($) after
the name.
When there is no dollar sign, then SAS interprets the variable
as a numeric.
195
Read Raw Data Into SAS (cont.)
INPUT and CARDS
The CARDS statement indicates that data lines are to follow.
It is written below the INPUT statement.
Note that no semicolons are present at the end of each data
line, ONLY at the end of all the data. It marks the end of
data entry.
Equivalent to CARDS is the keyword DATALINES.
196
Read Raw Data Into SAS (cont.)
Column Input Format
If the data you wish to read into a data set is specified in
columns (= positions), you may use the column input
format.
001
002
003
004
005
006
...
016
017
018
Andersson
Askelund
Bengtsson
Carner
Holmfors
Johansson
1954
1956
1956
1962
1954
1953
62
68
65
56
58
52
43
41
41
35
43
44
Torberg
Wernersson
Öhlund
1959 62 38
1962 59 35
1957 73 40
----+----1----+----2----+----3
197
Read Raw Data Into SAS (cont.)
Column Input Format
Each variable occupies a specified number of characters or
positions (1 position = 1 column).
The first variable ID is in columns 1-3, NAME in 5-12
(including blanks at the end), BIRTHYR in 13-16 etc.
The ruler below the data is a SAS ruler. It is often used in the
Log window.
Each column is indicated by “-”. The digit 1 indicates column
10, 2 column 20 etc. The + indicates columns 5, 15, 25 etc.
198
Read Raw Data Into SAS (cont.)
Column Input Format
To create a data set using column input formatted data, write
data course.nameincl;
input id $ 1-3 name $ 5-12
birthyr 13-16 weight 18-19 age 21-22;
cards;
001
002
003
...
016
017
018
Andersson
Askelund
Bengtsson
1954 62 43
1956 68 41
1956 65 41
Torberg
Wernersson
Öhlund
1959 62 38
1962 59 35
1957 73 40
;
run;
199
Read Raw Data Into SAS (cont.)
Column vs. List Input Format
The only difference from the code for list input formatted data
is that you must specify the columns in the INPUT
statement. (ID $ 1-3, name $ 5-12, etc.)
Differences in data characteristics:
- values must not contain more characters than are specified
in the input statement
- missing values are written as blank for character variable,
period (.) for numerical
200
Read Raw Data Into SAS (cont.)
Data from .TXT File
To read data (list or column formatted) from external text files
(.txt) use an INFILE statement.
data course.nameincl;
infile ‘h:\SAS-folder\data-file.txt’;
input id
$ 1-3
name
$ 5-12
birthyr 13-16
weight
18-19
age
21-22
;
run;
201
Read Raw Data Into SAS (cont.)
Data from .TXT File
The INPUT statement is written as before, depending on
whether you have column or list formatted data in the text
file.
No CARDS (or DATALINES) statement is used.
The INFILE statement must be written above the INPUT
statement.
202
The SAS Manuals
The SAS Manuals
There is a heap of heavy manuals for SAS, often not easy to
read for a beginner but indispensable when you get more
experienced.
These are the most useful of them:
SAS Language, Reference
SAS Procedures Guide
To lower the threshold of SAS manuals, you can begin by
reading the thin (!) SAS Language and Procedures,
Introduction. Most of its contents are presented in this
guide, so it is easy to compare.
203
Importing and Exporting Files
It is also possible to import data from other systems file
formats, including Excel and Access, as well as delimited
file types.
Use the entries on the Files menu for this.
Or use the program STAT/transfer.
204
The SAS Manuals (cont.)
Manuals vs. Online Help
The SAS manuals/Online Documentation work similarly to
the Online Help (see chapter The Online HELP).
The Online Help is easy to use when you know what you are
looking for, a specific syntactic problem, options or
keywords.
The manuals are best used when you are looking for a
solution to a bigger SAS problem, such as which procedure
to use, why does it not work as I thought it would, or what
is SAS actually doing.
205
Comments
Commenting Programs
Comments in SAS programs are code that will not be
executed when the program is submitted.
It is often useful to comment out parts of a program.
Also it is good to add comments which explain the code, such
as new variables, dates at changes, general purpose of the
program, etc.
206
Comments (cont.)
General Comment
Comments in programs should be inside /* and */ .
/* comment */
It is NOT possible to have comments within comments
/* comment /* comment */ comment */
207
Comments (cont.)
Statement Comment
Another way to comment out text is to use a *.
*age=1996-birthyr;
All text after the * and until a semicolon is reached will be
commented.
This will avoid the comment-inside-comment problem, but it
is more cumbersome, since you must comment every
statement.
208
Comments (cont.)
RUN CANCEL
If you wish to prevent a data step or procedure from being
executed you can change the RUN statement to
RUN CANCEL;
209
Naming Data Sets and Variables
Data Set and Variable Names
The name of a data set or a variable should reflect its
contents.
It is not obvious to an outsider that the variable “rel51yrd”
refers to “relapse 5 years after first diagnosis”. A better
name would perhaps be “relapse5”.
Add comments in the data step to clarify what the variables
are!
210
Naming Data Sets and Variables (cont.)
Data Set and Variable Names
- can be 32 characters (letters, underscores and digits) long at
most
- can be uppercase (versaler) or lowercase (gemena) or
mixed (OrIgInAl = oRiGiNaL)
- must start with a letter (A-Z) or an underscore (_), not a
digit
211
Naming Data Sets and Variables (cont.)
Data Set and Variable Names
The increased character limit to 32 characters in version 8.0 is
welcomed by many SAS users (formerly it was only 8
characters in version 6.12).
But a long name is not always to prefer either. It makes the
code tiresome to read, not to mention write.
A good idea is therefore to try to keep to a maximum of 8 to
12 characters also in future. Use labels.
212
Naming Programs
Program File Names
Contrary to the SAS data set and variable names, there is no
limit to 32 characters when you name the program files.
Do not hesitate to use long names, although the same rules as
for the variables names apply.
Be consistent, no cryptic abbreviations.
213
Categorising Variables
Dichotomous Variables
The most common categorising is dichotomization, a variable
that only takes two values, i.e. woman/man, case/control,
etc.
Dichotomised variables should take values 1 and 0
(not 1 and 2).
Example:
1 if overweight
0 if not
214
Categorising Variables (cont.)
Dichotomous Variables
Dichotomous variables are often called indicator variables,
since the 1 indicate a characteristic and the 0 does not.
It is therefore clever to name dichotomous variables after the
“1” characteristic.
Example:
1 if BMI>25
fat =
0 otherwise
215
Categorising Variables (cont.)
Defining Dichotomous Variables
This is a how you may define the 1/0 variable in a data step:
data course.main;
set course.original;
if (bmi ne .) then
do;
if bmi>25 then fat=1;
else fat=0;
end;
else fat=.;
run;
Or simply
fat=(bmi>25);
....but
216
Categorising Variables (cont.)
Dummy Variables
Some procedures can not use variables with several
categories, only dichotomised, so you need to create 1/0
variables (dummy variables) for each category.
Example:
A BMI variable is categorised into three categories, using a
format:
bmi =
< 20
20-25
> 25
217
Categorising Variables (cont.)
Dummy Variables
The three new dummy variables should have bmi as a prefix:
1 if bmi<20
bmi1=
1 if 20<bmi<25
bmi2=
0 else
1 if bmi>25
bmi3=
0 else
0 else
With the same prefix as the mother variable it will be very
easy to understand their origin.
The suffices 1, 2 and 3 are much easier to comprehend than
say bmilow, bmimiddl and bmihigh.
218
Categorising Variables (cont.)
Example
This is one way to define dummy variables in a data step:
data course.main;
set course.original;
if (bmi ne .) then
/* Checks for non-missing values */
do;
if bmi<20 then bmi1=1;
else bmi1=0;
if (20<=bmi<=25) then bmi2=1;
else bmi2=0;
if bmi>25 then bmi3=1;
else bmi3=0;
end;
else
/* If missing value */
do;
bmi1=.;
bmi2=.;
bmi3=,;
end;
run;
219
Categorising Variables (cont.)
Example to Avoid
This is BAD way to define dummy variables (well, Anna
doesn’t like it….):
data course.main;
set course.original;
if (bmi ne .) then
do;
bmi1=(bmi<20);
bmi2=(20<=bmi<=25);
bmi3=(bmi>25);
end;
run;
220
Categorising Variables (cont.)
Example to Avoid
First of all, what does it mean?
Expressions (bmi<20), (20<=bmi<=25) and (bmi>25) are
logical expressions. They are either TRUE or FALSE.
SAS interprets the logical value TRUE as the numeric value
1, and FALSE as 0.
So, if bmi<20 say, then bmi1=1, bmi2=0 and bmi3=0.
221
Organise Your Programming
Structuring SAS Files
Here is a suggestion of how to organise your SAS programs.
It is not presumed to be The Ultimate Structure, but it is a
good start to get organised.
All ideas to improve this structure are most welcome.
222
Organise Your Programming (cont.)
We Want To Avoid This
The most common SAS programming is:
You make a program file which begins with a data step
reading original.sas7bdat. In the data step you do the
changes and adds, create formats, and then the procs follow
using this data set.
The reason why people write programs like this is probably
because it is the natural way to do it. First data, then
analysis.
223
Organise Your Programming (cont.)
We Want To Avoid This
The problem with this programming is:
All programs create new data sets.
So whenever you want to update the data you must update
ALL the program files.
Eventually you will not know which ones you have updated,
and which you have not.
224
Organise Your Programming (cont.)
Our Idea
To avoid this time consuming dilemma, I suggest the
following file structure:
Separate the data steps from the procs by saving them to
different files. Data files and analysis files.
Preferably, create only ONE main data set where you make
ALL your changes, and then use it in every proc file.
Whenever you make a change to the data, all the analyses are
changed accordingly (when rerun).
225
Organise Your Programming (cont.)
Data Files
Four files handles the data:
the original data file saved as a
1.
.sas7bdat file
the data set programming is in
2.
main.sas
which creates
3.
main.sas7bdat
formats are created by code in the file
4.
formats.sas
226
Organise Your Programming (cont.)
Analysis Files
Other files can be divided into three groups:
- data cleaning files
- analysis files
- published analysis files
However, these groups are not clearly defined. Some files
could fall into several groups.
We have yet to come up with a clever structure of these files.
Suggestions are kindly accepted.
227
Organise Your Programming (cont.)
Organising the Data Files
The original data file (a SAS data set or an external data base
file), should NOT be used when you run SAS.
SAS executions may go wrong and create permanent changes
to the data.
Start by making a copy of the original data file.
228
Organise Your Programming (cont.)
.sas7bdat File
If the original data file is a .sas7bdat file, save the copy as
copy-of-name.sas7bdat (or a similar name).
229
Organise Your Programming (cont.)
Non-SAS File
If the original data file is not a SAS data file (.sas7bdat) then
you should create a SAS data file (name.sas7bdat) from it.
This is not always trivial and you may need help from the
Analysis group.
Golden Rule: NEVER try to import THE original file. Always
make another COPY of it before you import it to SAS.
Imports can crash.
230
Organise Your Programming (cont.)
Do Not Alter Original Data
You now have the original data stored as the SAS data file
name.sas7bdat.
Golden Rule No 2: The file name.sas7bdat should ONLY be
altered whenever the original data file is altered!
It is an exact copy of the original data and should remain so.
231
Organise Your Programming (cont.)
Main.sas
From name.sas7bdat we create the data set main.sas7bdat in
the program called main.sas. This is our version of
main.sas:
/***
your-libname.main creates the main.sas7bdat data set for project
‘project-name’.
Created: 2001-08-07 by Anna Johansson
Used by: all analysis files, otherwise mentioned
***/
data your-libname.main;
set your-libname.name;
/* Changes and adds (new variables) */
continues ...
232
Organise Your Programming (cont.)
Main.sas
/* Assign variable labels */
label var1=‘label1’
var2=‘label2’
...;
/* Assign formats */
format var1 format1.
Var2 format2.
...;
run;
proc contents data=your-libname.main;
run;
proc print data=your-libname.main;
run;
/* END of main.sas */
233
Organise Your Programming (cont.)
Main.sas
The comment header of main.sas includes some general
information. It is optional, but very useful.
The data set your-libname.main is created from yourlibname.name in the data step that follows.
Changes and adds to the original data are made.
At the end, labels and formats are assigned to the variables.
These lists become longer as the work progress.
234
Organise Your Programming (cont.)
Main.sas
After the data step, two procedures, CONTENTS and PRINT,
are found.
The logs and outputs they generate are used to check whether
the main.sas7bdat data set was created satisfactorily.
If you have a large data set it might be a good idea to put a
restriction to the PRINT procedure, so that only a fraction
of the data is printed.
proc print data=course.main(obs=20);
235
Organise Your Programming (cont.)
Formats.sas
The formats assigned at the end of the main data step are
created in a file we call formats.sas:
proc datasets library=your-libname memtype=(catalog);
delete formats / memtype=catalog;
run;
proc format library=your-libname fmtlib;
value bmif
low-20.0=‘Underweight’
20.0-25.0=‘Normal weight’
25.0-high=‘Overweight’
other=‘Other’;
continues ...
236
Organise Your Programming (cont.)
Formats.sas
value case_1f
value yes10no
0=‘Case’
1=‘Control’
other=‘Other’;
0=‘No’
1=‘Yes’
.=‘Missing’;
run;
/* END of formats.sas */
237
Organise Your Programming (cont.)
Formats.sas
All formats are saved permanently to a format catalogue (see
section on Formats).
The first part (proc datasets) will delete the existing SAS
format catalogue.
Make sure the occurrences of your-libname are correct.
238
Organise Your Programming (cont.)
Formats.sas
The formats are created by proc FORMAT. Here the format
names are bmif, case_1f and yes10no.
The name should reflect the variable it refers to.
Occasionally there are more general formats like “yes10no”
(0=‘No’, 1=‘Yes’). They may have a general name and be
assigned to several variables.
A good idea is to place the format definitions in alphabetical
order in formats.sas.
239
Organise Your Programming (cont.)
Working With Several Formats
Often you switch between several formats for the same
variable, for instance when you are checking different ways
to divide categories.
Do not use one format which you change and redefine, but
several formats and change the assignment of them (either
in main.sas or in the proc you are using, see the next
picture).
These formats should have similar names, like bmi3grp,
bmi5grp, bmi7grp. (Note that a format name can not end
with a digit.)
240
Organise Your Programming (cont.)
Working With Several Formats
The permanent assignment in main.sas should reflect the
format you wish to keep as the permanent. It is possible to
temporarily change a format in ANY procedure by adding a
FORMAT statement.
Example:
proc freq data=course.main;
tables bmi*case_1;
format bmi bmi5grp.;
run;
241
Organise Your Programming (cont.)
Working With Several Formats
To delete all format assignments simply assign no format:
proc freq data=course.main;
tables bmi*case_1;
format bmi ;
run;
242
Organise Your Programming (cont.)
When Must I Rerun Main.sas?
Whenever you have
- made a change to the data step, or
- changed the permanent format assignment in main.sas
you must rerun main.sas.
You do not need to rerun main.sas when you have changed
the formats.sas, UNLESS you have also changed the
assignments in main.sas.
243
Organise Your Programming (cont.)
Organising the Analysis Files
Organising the proc files (analysis files) is more complicated.
How are they naturally grouped?
By the tables presented in the article?
By the different procs (MEANS in one file, FREQ in
another)?
The only advise I have is to give the files good names, and
perhaps add dates in comments. And use comments, since
the proc files are the documentation of your analysis work.
244
Organise Your Programming (cont.)
Documentation
Remember:
The SAS files are the documentation of your data analysis.
Documentation should be easy to access and understand.
You should be able to reproduce all published results from the
original data file.
245
Appendix
List of Useful Procedures
PROC APPEND = concatenates data sets (compare SET
statement)
PROC CATALOG = administrates SAS Catalogues
PROC CHART = creates simple charts, bar charts, block
charts, pie charts, etc.
PROC COMPARE = compares the contents of two data sets,
or variables within a data set
PROC CONTENTS = prints the descriptive part of a data set
PROC COPY = copies a SAS Data Library or parts of it
PROC CORR = calculates correlation coefficients
PROC DATASETS = lists, renames, deletes data sets etc.
246
Appendix (cont.)
List of Useful Procedures (cont.)
PROC FORMAT = defines formats and informats
permanently or temporarily
PROC FREQ = creates frequency tables and basic statistical
tests
PROC MEANS = creates descriptive statistics
PROC OPTIONS = lists all current SAS System options
values
PROC PLOT = creates simple Y-X plots
PROC PRINT = prints the data part of a data set
PROC PRINTTO = defines destination of output and log, e.g.
when you wish to use an output as input for another
procedure
247
Appendix (cont.)
List of Useful Procedures (cont.)
PROC RANK = calculates ranking of different sorts
PROC REPORT = creates better layout for output
PROC SORT = sorts data sets by one or several variables
PROC SQL = implementing query language SQL, used to
retrieve data sets
PROC STANDARD = standardises variables to a given mean
and variance
PROC SUMMARY = descriptive statistics in form of
summations
PROC TABULATE = descriptive statistics in tables
PROC TRANSPOSE = transposes data sets, i.e. transforms
248
rows to columns, and vice versa
Appendix (cont.)
List of Useful Procedures (cont.)
PROC UNIVARIATE = creates descriptive distribution
statistics and plots, histograms etc.
249
Acknowledgements
Anna wishes to thank the following MEP colleagues for their
support.
Hanna Svensson, Statistician, Doctoral Student
Lena Brandt, Statistician
Paul Dickman, Senior Statistician
Gunnar Petersson, Programmer
-- and Peter and Lene wish to thank Anna!!
250
Download