SAS tutorial - School of Informatics

SAS Programming Notes
For
Data Mining and Exploration
Lecturer: Amos Storkey
School Of Informatics
University of Edinburgh
Acknowledgements: These notes are extensively based on
notes developed over a long period by the School of
Accounting, Economics & Statistics, Napier University.
People who have worked on or contributed to these notes
over that time include Amos Storkey, Ana Costa Da Silva, Phil
Darby, Helen Storkey, Jeff Dodgson, Dorothy Currie, Kate
Houston and Kirsty Davidson.
I am very grateful for permission to use and develop these
notes for the Data Mining and Exploration course.
First published: September 2000
Updated:
September 2001 (SAS version 8)
February, July and September 2002
October 2004 (SAS version 8.1) and September 2005
December 2006 (SAS version 9.1.3)
January 2008 (SAS 9.2 and Linux differences)
File: SASv9.2.doc
Contents
1. GETTING STARTED
1.1 What is the SAS system?
1.2 The SAS Workspace
1.3 Creating and running a SAS program
1.4 Submitting and correcting your program
1.5 Saving files and clearing text from windows
1.6 Reading a saved program
1.7 A Data Analysis Flow Chart
1.8 Importing data using a wizard
1.9 Viewing a data set
1.10 Creating a SAS Program
1.11 Rules for entering SAS statements
1.12 Adding comments to a program
1.13 Including titles in your SAS output
1.14 Creating new variables
1.15 Printing and saving SAS output
2. DATA FILES AND SAS DATA SETS
2.1 Reading data files using the INFILE statement
2.2 LIBNAME and permanent SAS Data Sets
2.3 Referencing a permanent SAS data set
2.4 Contents of a file
2.5 Importing data from other packages
2.6 Missing values
2.7 The INPUT statement
3. SAS PROCEDURES
3.1 Structure of a SAS program
3.2 Sample program
4. SUMMARISING DATA
4.1 SAS System Options
4.2 HTML output
4.3 Summary Procedures
4.4 PROC SORT
4.5 PROC MEANS
4.6 PROC UNIVARIATE
4.7 PROC FREQ
4.8 General syntax for a procedure
4.9 Help
5. GRAPHS AND CHARTS
5.1 Graphics procedures
5.2 PROC PLOT
5.3 PROC CHART
6. CORRELATION AND REGRESSION
6.1 PROC CORR
6.2 PROC REG
7. EXPLORATORY DATA ANALYSIS
7.1 SAS/INSIGHT
7.2 Accessing SAS/INSIGHT
7.3 Features of SAS/INSIGHT
7.4 Using SAS/INSIGHT
Tools
8. MODIFYING DATA AND OUTPUT
8.1 Introduction
8.2 SET statement
8.3 DROP and KEEP
8.4 Labelling output
8.5 PROC PRINT
8.6 PROC FORMAT
8.7 Recoding data
8.8 Conditional statements
8.9 VALUE statement
8.10 OUTPUT
9. PROC TABULATE
10. FUNCTIONS AND FORMATS
10.1 MEAN function
10.2 NMISS function
10.3 N function
10.4 Functions to handle character variables
10.5 Date and Time Formats
11. ITERATIVE PROCESSING
11.1 Do loops and arrays
11.2 Reading data in repeated patterns
11.3 Arrays
11.4 Generating random numbers
11.5 Random numbers from a uniform distribution
11.6 Random numbers from a normal distribution
11.7 The SAS Program Data Vector
11.8 The RETAIN and Sum statements
12. FURTHER TOPICS
12.1 Combining Data Sets
12.2 Hints on Using Word with SAS and SAS/INSIGHT
SOLUTIONS TO EXERCISES
Various files are referred to in these notes. These can be found in a zip file on
the Data Mining and Exploration web site
www.inf.ed.ac.uk/teaching/courses/dme/
1. Getting started
1.1. Introduction
The SAS system is a widely used resource for statistical analysis and data
mining. It is rare to find a job advert for a data mining practitioner that does
not ask for SAS skills. The main strengths of SAS are its ability to handle
large files fairly transparently, the ease and comprehensiveness with which
standard analyses can be done, the way analyses can be built interactively
alongside a systematic programming environment, and its data handling
capabilities. Its main weaknesses are its graphical capabilities, and that
adding your own extensions to the techniques using macros and the
Interactive Matrix Language is slightly more cumbersome than in other
languages (e.g. MATLAB, R) and than with more modern language constructs.
This tutorial will introduce you to the SAS System. This tutorial should be
suitable for those working on either a Linux or Windows system. Interface
tools in SAS for Windows are much better and so where there are differences
these will also be mentioned.
SAS is, at its heart, a piece of software for data handling and storage,
statistical and data analysis, data mining, decision support and report writing.
It has been extended to a whole business intelligence package, but the best
way of understanding SAS is from the inside out, and so this tutorial will teach
the base SAS software to get you started. With base SAS software you can
store data values and retrieve them, modify data, compute simple statistics,
and create reports all in one SAS session. The difference between SAS and
most statistical packages is that SAS incorporates both a database
management system and a high-level programming language. There is also
SAS software which provides graphics, forecasting, data entry, and statistics.
The SAS system also contains other sophisticated applications that are
valuable to large enterprises. All are available in one system.
1.2. The SAS Workspace
To start SAS on a Linux system type SAS at the command prompt. On
Windows, select SAS from the Start menu.
When you go into SAS, the first thing you see is a set of windows as shown in
Figure 1. Your display may appear a little different since this has been
adjusted to allow all the windows to be seen at once. There are five different
windows shown in this figure. Two further windows are available in SAS
version 9; you can switch between them by clicking on the buttons at the
bottom of the SAS window.
Figure 1 SAS window on opening in Windows (labelled parts: Run, Libraries,
Editor, Output, Explorer, Results, Log)
Figure 2 SAS window on opening in Linux (labelled parts: Explorer, Libraries,
Log, Editor, Output)
The five windows are:
- the EDITOR window, where you enter the SAS statements you wish to
execute. The EDITOR has handy features like colour coding and expandable
and collapsible sections.
- the LOG window, which contains information on your SAS run, e.g. date and
time of run, a listing of your SAS statements as they are executed and any
errors which have occurred during processing.
- the OUTPUT window, which displays the actual results of the program.
- the EXPLORER window, which allows you to view and manage your SAS files
and create shortcuts to non-SAS files. For example you can use this window
to create new libraries or to open any SAS file.
- the RESULTS window, which helps you navigate and manage output from SAS
programs you submit. You can view, save, and print individual items of output.
(By default, the Results window is positioned behind the Explorer window but
when you submit a SAS program that creates output it moves to the front of
your display.)
The two windows not shown are:
- the GRAPH window, which will appear when graphical output is to be displayed.
- a seventh window, which will appear when HTML output is used. The Output
Delivery System (ODS) can be turned on using programming code or by using the
menu options.
You may turn a window on or off by using View from the main menu.
Just choose the window you need (use this if you ‘lose’ a window).
Task 1
Resize the 3 windows on the right hand side so that you see the OUTPUT as
well as the EDITOR and the LOG. Make the EDITOR the largest window.
You can activate any of the windows by
- clicking on the window (Windows or Linux)
- selecting Window from the menu, then the window you want (Windows)
- selecting View from the menu (Windows or Linux)
1.3. Creating and running a SAS program
The following lines of code are a simple SAS program.
When they are typed into the editor window the words will become colour
coded:
- Reserved words appear blue (e.g. proc, print, input).
- Comments appear green in Windows and black in Linux (see below
for details of entering comments).
- Errors appear red.
data class1;
input height weight sex $;
datalines;
152 45.4 F
178 73.0 M
178 68.8 M
175 59.7 M
157 44.5 F
165 61.7 M
175 74.1 M
160 49.5 F
run;
proc print;
run;
Task 2
Enter the SAS program in the EDITOR window.
1.4. Submitting and correcting your program
There are several methods of submitting your program.
1. Highlight the section of code you wish to run and press the running
man icon (in Windows).
2. Ensure that your cursor is in the EDITOR window, then select
Run -> Submit (in Windows or Linux).
3. You can also run just a few lines of code by selecting Run -> Submit top
line or Submit N lines (in Windows or Linux).
4. Right click with the mouse and select Submit All, or press the running
man icon.
An alternative to pressing the running man icon is to press the F3 key in
Windows or the End key in Linux.
Examine your LOG window to check that there were no error messages. If all
is well, examine your output in the OUTPUT window.
If you have error messages in your log you will need to correct the
mistakes and resubmit the program.
After submitting your code you may find that it has disappeared from the
editor window. To overcome this problem select Run -> Recall Last Submit.
1.5. Saving files and clearing text from windows
When you have succeeded in getting your program to run you can save it as
filename.sas (SAS automatically gives it a .sas ending to remind you that it is
a SAS program). Make sure your EDITOR window is active before doing
File -> Save (or pressing the floppy disk icon). Otherwise you might be
saving the contents of your log or output window instead of your program.
Save log files as filename.log and output files as filename.lst if you want to
save them too. It is usually not necessary to save the log file.
Important: In order to avoid getting confused about which output and log
refers to which program or version of a program, make it a habit to clear your
windows before submitting a new program. Do this by selecting
Edit -> Clear All.
Run -> Recall Last Submit returns the program you have just run to the
EDITOR window. This is useful if you have cleared the program by mistake.
Task 3
Create a new folder in your personal disk space called
MA71064 Statistical computing.
Submit the SAS program from the program editor. When it is working
satisfactorily save the file as class1.sas in the folder you have just created.
1.6. Reading a saved program
A SAS program needs to be in an EDITOR window before it can run. To open
a saved SAS program activate the EDITOR window and use File -> Open. The
program can then be submitted in the usual way.
You can have more than one EDITOR window open at the same time.
However this can be confusing and it is easiest at first to have only one
program open at a time.
1.7. A Data Analysis Flow Chart
Data analysis can be thought of in terms of a process flow. Actions proceed
in a sequence. Often the output from one action leads to the input of another.
A simple flow chart is given below.
Start -> Read or create a data set (DATA step) -> Display the data (PROC PRINT)
-> Calculate the averages (PROC MEANS) -> End
Figure 3 A simple data analysis flow diagram
SAS programs can contain combinations of DATA steps and PROCEDURES.
The SAS program you used above executed the first 2 blocks in the flow
diagram. Quite quickly you will be producing more complicated programs that
will have many DATA steps and PROCS.
1.8. Importing data using a wizard
The next example reads the Excel file Class0 into the temporary SAS library
called Work. The format of the data is displayed and the summary statistics
(count, average and standard deviation) of the height readings are calculated.
The simplest method of entering data into SAS is using the import wizard.
File -> Import Data will display the dialogue shown in Figure 4.
A. The source type default is Excel but others are available from the
pull down menu. Press Next.
B. Locate the source file by pressing the Browse button. Press OK.
C. Select the appropriate worksheet. From the options ensure that
‘Use data in the first row as SAS names’ is ticked. Press OK, then Next.
D. Enter the Member as Class0. Press Finish.
Check the log window for errors.
Figure 4 Import data wizard
In Linux there is no option to import from Excel. However,
there is the option of importing csv files. An Excel file can be opened using
OpenOffice and saved as a comma-separated (csv) file. It can then
be imported straight into SAS.
In Linux steps B and C above are replaced by the dialogue in Figure 5.
Similar options are available when pressing the respective button. The
remainder of the dialogue is similar to that in Figure 4.
Figure 5 Import data wizard in Linux
The final step of the wizard, in both Linux and Windows, is optional and offers
the possibility of saving the importation command in a specified file, which
can be opened with the Program Editor. This can be copy-pasted into any
program and be run, without need to follow the steps of the wizard again.
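The saved importation command is a short piece of SAS code that can be rerun in place of the wizard. As a rough sketch only, a PROC IMPORT step for a csv file might look like the following; the file name class0.csv and its location are assumptions for illustration, and the code the wizard actually saves may differ in detail.

```sas
/* Sketch: import a csv file into the temporary library Work.       */
/* The file name and path are hypothetical; adjust to your own file. */
proc import datafile="class0.csv"
            out=work.class0
            dbms=csv
            replace;          /* overwrite work.class0 if it already exists */
    getnames=yes;             /* take variable names from the first row */
run;

/* Display the imported data to check that it was read correctly */
proc print data=work.class0;
run;
```

Here GETNAMES=YES plays the same role as ticking ‘Use data in the first row as SAS names’ in the wizard.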
1.9. Viewing a data set
Once the data are in SAS format you can look at them in a variety of ways.
1 Proc print; run;
2 From the explorer window, double click on the libraries icon to reveal
libraries that are present. These libraries are simply pointers to
Windows XP folders where the data sets are stored. Double clicking
on the work library reveals the data set Class0.
3 Double click on the data set to open it.
4 Right click on the data set to display a set of options. These include
Open, View Columns and View in Excel (only in Windows).
Task 4
Import the Excel file class0.xls into SAS using the import wizard, then display
the imported data set using Excel.
1.10. Creating a SAS Program
You have already submitted a simple SAS program which created and then
printed out a set of data. The following is an extension of that program. The
line numbers have been included to help explain the structure of the
program: they are not part of the program itself and should not be typed.
Line number   Program
001   data class2;
002   input height weight sex $ bends pulse1 pulse2;
003   datalines;
004   152 45.4 F 6 61 84
005   178 73.0 M 8 59 102
006   178 68.8 M 12 58 95
007   175 59.7 M 5 76 83
008   157 44.5 F 5 53 102
009   165 61.7 M 10 70 110
010   175 74.1 M 5 76 102
011   160 49.5 F 2 67 118
012   161 52.6 M 5 80 103
013   180 85.4 M 7 84 102
014   160 57.2 F 7 98 115
015   170 69.9 M 7 69 102
016   178 67.0 M 11 60 79
017   163 57.0 F 8 70 98
018   160 60.9 F 12 57 84
019   185 73.1 M 5 68 .
020   188 79.1 M 3 53 69
021   159 49.5 F 6 69 112
022   run;
023   proc print;
024   run;
025   proc means;
026   run;
Line number   Explanation
001   The DATA statement tells SAS to create a data set called class2.
002   The INPUT statement names the variables in the order they appear in the
      data lines. Variable names must start with a letter, be no more than 32
      characters in length (eight characters in version 6) and must not contain
      blanks, commas and so on. To read data as characters, rather than
      numbers, a dollar sign is put after the variable name.
003   The DATALINES statement indicates that the next lines are data.
004 to 021   The data are entered with one or more spaces separating each item.
      The data must be in the same order as declared in the input statement.
      A new line is used for each record.
019   A full stop indicates a missing numerical value.
022   RUN tells SAS to execute the preceding statements.
023   PROC PRINT is a procedure to print data in the Output Window.
025 & 026   PROC MEANS is a procedure to calculate the mean and other statistics
      of all the numeric variables. RUN completes the procedure.
This example illustrates the basic structure of a SAS program:
- A DATA step, consisting of a DATA statement and other statements that form
part of this step.
- SAS PROCEDURES, each beginning with a PROC statement. A PROC statement
may also be followed by statements that are part of the procedure step,
although there are none in these two examples.
1.11. Rules for entering SAS statements
SAS statements:
- usually begin with an identifying keyword
- always end with a semicolon
(check carefully before you submit any program!)
- can be in uppercase or lowercase letters
SAS statements are free format:
- they can begin and end in any column
- one statement can continue over several lines
- several statements can be on one line
Readability is improved if you add comments and leave spaces between the
DATA and PROC steps and perhaps also indent code within a DATA or
PROC step. Develop your own style and stick with it.
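The free-format rules can be seen in a short sketch. The three layouts below are only illustrations of the rules, not a recommended style; the data set name class1 and the two data lines are taken from the earlier example.

```sas
* Layout 1: one statement per line, indented within the step;
data class1;
    input height weight sex $;
    datalines;
152 45.4 F
178 73.0 M
;
run;

* Layout 2: one statement continued over several lines;
proc print
    data=class1;
run;

* Layout 3: several statements on one line, written in uppercase;
PROC MEANS DATA=class1; RUN;
```

All three steps are valid because SAS uses the semicolon, not the line break, to decide where a statement ends.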
1.12. Adding comments to a program
There are two ways of writing comments in a SAS program:
begin the comment line with an asterisk and end with a semi-colon
e.g. *This program was developed by J Smith;
begin with a forward slash asterisk and end with an asterisk forward slash
e.g. /* J Smith February 2005 */
Inserting comments is essential if you are doing any serious programming.
The /* style */ is also useful for ‘commenting out’ blocks of a program when
testing or debugging.
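As a small sketch of both comment styles in use (the program itself is only an illustration built from the class1 example), including a block that has been commented out for testing:

```sas
* Statement-style comment: starts with an asterisk, ends with a semicolon;

/* Block-style comment: can run over
   several lines */

data class1;
    input height weight sex $;   /* sex is read as a character variable */
    datalines;
152 45.4 F
178 73.0 M
;
run;

/* Commented out while testing; remove the comment marks to restore it:
proc means data=class1;
run;
*/

proc print data=class1;
run;
```

The block style can enclose statements that contain semicolons, which is why it is the more convenient of the two for switching off several lines at once.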
Task 5
Read the file class1.sas into the program editor. Edit the program so that it is
the same as the sample program in section 1.10 but with the addition of a
comment which gives your name and today’s date. Submit the program and
when it is working properly save it as class2.sas.
What information did the Proc Means procedure give you?
1.13. Including titles in your SAS output
The TITLE statement is used to provide titles on your output. The TITLE
statement can appear anywhere in a program (an example of a global
statement) and subsequently each page of output (and each graph) will have
the title until it is reset. For example program class2.sas could be enhanced
as follows:
...
proc print;
title 'Information on Students in Class';
run;
proc means;
title 'Summary Statistics of Students in Class';
run;
However the final title will adorn all future output until it is reset with another
title or ‘cancelled’ with
title;
run;
Task 6
Experiment with the TITLE statement in program class2.sas.
1.14. Creating new variables
If you need to analyse variables that are derived from the input variables,
then you must create these variables in the DATA step. For example, if you
want to use two new variables, ‘the difference in pulse rates’ and ‘the log of
the number of bends’, then the statements that define them must appear in
the DATA step after the INPUT statement and before the DATALINES statement.
The rules about naming new variables are the same as for input variables.
data class2;
input height weight sex $ bends pulse1 pulse2;
diff=pulse2-pulse1;
lnbends=log(bends);
datalines;
...
Run;
Some commonly used operators and functions are as follows:
Operator   Meaning          Function   Meaning
*          multiplication   log( )     natural log
/          division         exp( )     exponential
**         exponentiation   sqrt( )    square root
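A brief sketch of these operators and functions in a DATA step follows. The data set name ops, the variable names and the two data lines are invented for illustration only.

```sas
data ops;
    input x y;
    product  = x * y;     /* multiplication */
    quotient = x / y;     /* division */
    square   = x ** 2;    /* exponentiation: x squared */
    lnx      = log(x);    /* natural log */
    expy     = exp(y);    /* exponential */
    rootx    = sqrt(x);   /* square root */
    datalines;
4 2
9 3
;
run;

proc print data=ops;
run;
```

Each assignment statement is evaluated once for every line of data, so the printed data set has one row per data line with the new variables alongside x and y.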
1.15. Printing and saving SAS output
The contents of the OUTPUT window may be sent to a printer using the
OUTPUT window print command. You can change the way the output looks
using, for example, the LINESIZE and PAGESIZE options (see Section 4.1 or
SAS Help). The whole of the OUTPUT window listing may be saved as
filename.lst using the OUTPUT window save command.
In WINDOWS, it is often convenient to copy all or part of the OUTPUT
window into a Word document or another text processing software. This can
be achieved with the copy and paste operation. However the results might be
disappointing. The appearance may be improved by
- using a fixed space font such as SAS Monospace (available if SAS is
running)
- avoiding ‘wrap around’ by reducing the font size and avoiding unnecessary
leading spaces.
See Section 12.2 for further advice on incorporating SAS numeric and
graphical output into a Word document.
Task 7 (WINDOWS ONLY)
Create a Word document (using copy and paste) which consists of program
class2.sas and the output it produces. Experiment with improving the layout
of the document.
Exercises
1.1 The following table shows the heights and weights of 16 eleven-year-old
girls.
ID no   Height(cm)   Weight(kg)     ID no   Height(cm)   Weight(kg)
59      135          25             71      133          30
82      146          33             78      149          35
27      153          56             12      141          33
52      154          51             37      164          48
55      139          31             28      146          37
13      131          25             48      149          45
01      149          43             69      147          36
15      137          32             16      152          47
(a) Write the SAS statements to create a SAS data set called ELEVEN.
The ID number should be stored as a character variable.
(b) Insert a comment statement to indicate that this data came from
Exercise 1.1.
(c) Create a new variable which is the ratio of weight to height.
(d) Produce a printout of the data and a table which shows the mean,
standard deviation, the maximum and the minimum values for the
variables height, weight and the ratio of weight to height. The output
should be suitably labelled.
(e) (WINDOWS ONLY) Copy and paste your program and its output into a
Word document (edit to ensure an attractive appearance).
(f) How would the output have differed if you had input the ID number as a
numeric variable?
1.2 Modify your program in 1.1 above in order to determine the body mass
index (BMI = weight in kilograms/(height in metres)^2 or BMI = W/(H * H) ).
1.3 Several measurements of water quality were taken at eight different sites
along the Firth of Forth. The data are shown below.
Site   Salinity   Phosphate   Nitrogen   Chlorophyll   Faecal Coliforms
CR     30.11      0.068       0.297      1.693         2.917
WG     31.48      0.059       0.165      1.464         3.149
EG     31.79      0.068       0.144      1.100         3.196
SF     31.37      0.185       0.278      1.787         3.418
PB     31.50      0.116       0.223      2.099         3.049
JO     31.60      0.106       0.207      1.067         2.903
SS     30.50      0.047       0.162      1.563         2.895
FN     31.96      0.060       0.130      0.753         2.797
(a) Write the statements to create a SAS data set called FORTH.
(b) The units of phosphate are mg/litre. Create a new variable which gives
phosphate in units of µg/litre, where 1 mg = 1000 µg (1 milligram = 1,000
micrograms).
(c) Produce a table which shows summary values for each variable.
(d) Save the program in a file called forth.sas.
2. Data files and SAS data sets
2.1. Reading data files using the INFILE statement
In the examples in the previous section you created temporary SAS data sets
from data, which were included in the program, with a DATALINES statement.
In practice, a large set of data is more likely to be available as a raw data file
(known as an ASCII or text file) and it will be more convenient to read the
external data directly into SAS.
To illustrate this we will create a small ASCII data set using the Notepad
editor, read the data file into SAS and then print the contents.
Task 1
Open Notepad / text editor and type in the following data set. Save it as
blood.txt on your floppy disk. (Note that Notepad automatically gives the
extension .txt.)
1 107 100
2 110 114
3 123 105
4 129 112
5 112 115
6 111 116
7 107 106
8 112 102
9 136 125
10 102 104
The variables in the data set are patient number and blood pressure
measurements before and after treatment.
The code required to input the data into SAS and get a printout in the Output
Window is as follows.
data blood;
infile 'a:\blood.txt';
input patient $ before after;
run;
proc print;
run;
The only changes that are required to the previous method of data input are
that:
- the INPUT statement is preceded by an INFILE statement to tell SAS where
to find the external data file;
- the DATALINES statement and the lines of data are omitted.
Task 2
Type the above SAS program into the EDITOR window and save the program
under a suitable name. Submit the program and confirm that the values of
the variables together with variable names have been printed in the OUTPUT
window.
You can verify that the data is stored in the correct location on your hard
drive. An example is given in Figure 6.
Figure 6 The data set Blood in the SAS Work library and temporary directory
2.2. LIBNAME and permanent SAS Data Sets
In the programs you have written so far the data set used in any analysis has
been created in the data step. Such a data set is described as temporary in
the sense that it only exists during your current SAS session and will be
deleted when you close SAS.
This kind of temporary file is stored in a SAS library called WORK. You can
check what files you have created in the current session by going to the
EXPLORER window and clicking on Libraries and then on the library WORK.
(Use View -> Up One Level or View -> Show Tree
to navigate back to the original EXPLORER window.)
SAS files are given a two part name. The first part of the name is the library
name in which the file is stored and the second part is the name of the
particular file. You probably noticed that when you created the previous data
sets, for example class2, that SAS referred to this file in the log window as
WORK.CLASS2.
If you wanted to do further analyses on this type of data set in a different
session you would need to recreate the data set by running the data step
once again. This can be a time-consuming process especially if you have a
large amount of data and have created many new variables, changed the
format of variables and so on. The alternative approach is to create a
permanent SAS data set. This is a special type of file, unique to SAS, which
stores the data, variable names and other information such as formats.
You can set up a library to store your data sets and save them so that they
can be used in another SAS session. The SAS LIBNAME statement defines
the name of the library where the file is to be stored. For example, if you want
to store your data on your own disk in drive A, then you need to give this a
SAS libname using a statement like the first one below. The actual name of the
library (myadisk, mydisk or myhome in these examples) is chosen by the
programmer and is just a convenient name that can be referred to later in the
program. It must be no more than eight characters long.
LIBNAME myadisk 'a:\';
LIBNAME mydisk 'h:\MA71064 Statistical computing\';
LIBNAME myhome 'c:\My documents\Napier\MA71064 Stat Comp\';
You can then save your SAS data sets in this library using a two level SAS
name. The first part of the name is the libname and the second part is the
name given to the SAS data set. So to create the permanent SAS data set
called class2, on the H: drive, would require the following SAS code.
libname mydisk 'h:\MA71064 Statistical computing\';
* The data library called 'mydisk' will;
* be located on the H: drive;
data mydisk.class2;
*Create the new data set class2;
input height weight sex $ bends pulse1 pulse2;
datalines;
152 45.4 F 6 61 84
178 53.0 M 8 59 102
165 61.7 M 10 70 110
...................
175 74.1 M 5 76 102
160 49.5 F 2 67 118
run;
Task 3
Modify your program, class2.sas, to create a permanent data set. (If you want
the data set stored on a hard drive make sure you give the full path name of
the required directory.) Check the messages in the LOG Window and check
you can see the permanent data set in the EXPLORER window.
SAS version 8 puts an automatic SAS7BDAT extension on permanent data
sets (version 6 uses SD2).
Task 4
Go to Windows explorer and check that you have a file class2.SAS7BDAT in
the appropriate directory.
2.3. Referencing a permanent SAS data set
Suppose that you have a permanent SAS data set stored in a particular
directory. You may have created this yourself or possibly have downloaded it
from the web. You may carry out procedures on the data set directly by using
the DATA option in the procedure statement. All the details of variable
names and so on will be held in the data set.
In the following example permanent SAS data set prac1 is stored in directory
h:\sas\sasdata.
libname xyz 'h:\sas\sasdata';
proc print data = xyz.prac1;
proc means data = xyz.prac1;
run;
Note that the first part of the name given by LIBNAME is a pointer to a
directory and does not have to be the same name as was used when the data
set was created. It is the second part of the name that refers to the particular
data set.
2.4. Contents of a file
You can use the SAS procedure CONTENTS to get information about a data
set and a list of the variables it contains. This procedure is useful for larger
data sets that would be too long or have too many variables to list completely,
and it gives you information about when and where the data set was last
modified. For example,
proc contents data = xyz.prac1;
run;
Note: if you have already submitted a LIBNAME statement in the current SAS
session, it is not necessary to do so again. You can simply refer to the
two-level data set name.
Task 5
Get information about the SAS data set COMPANY stored in SASHELP.
What information is given in each of the columns?
An alternative way of inspecting what variables are in a large file is to print out
only the first few observations. This can be done using an option in the PROC
PRINT statement.
proc print data=sashelp.company (obs=6);
run;
Remember you can also view data sets from the EXPLORER window
2.5. Importing data from other packages
Software such as Excel, Minitab and SPSS store data in file types unique to
themselves. Some packages have the ability to export to, or import from,
other formats.
Use the import wizard or PROC IMPORT (see SAS Help).
A safe approach to importing data from such application software into SAS is
to export from the other package into ASCII format and input the resulting file
into SAS in a DATA step.
Large data sets from outside sources (other companies or organisations) are
usually supplied in ASCII format since such data is often held in proprietary
databases. Most software allows data to be written to an ASCII or raw data
file. This approach is illustrated below:
Application data file -> (export) -> ASCII data file -> (INFILE statement) -> SAS data set
It is a good idea to check the ASCII file with an editor such as WordPad (and
possibly ‘tidy up’ if necessary). The ASCII file can then be read into SAS
using the INFILE statement assigning variable names with the INPUT
statement (as explained in Section 2.1).
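As a hedged sketch of this step, the program below reads an exported ASCII file whose first line still holds column headings. The file name export.txt, the data set name imported and the variable names are all hypothetical. FIRSTOBS=2 asks SAS to start reading at the second line, which is one alternative to deleting the heading line in an editor.

```sas
/* Sketch: read an exported ASCII file that has one heading line.   */
/* File name, data set name and variable names are hypothetical.    */
data imported;
    infile 'export.txt' firstobs=2;   /* skip the heading line */
    input id $ value1 value2;         /* names are assigned here */
run;

/* Check the first few observations against the original file */
proc print data=imported (obs=6);
run;
```

It is still worth comparing the printed observations with the ASCII file, since list input relies on every value, including missing ones, being present on each line.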
Advice on importing data into SAS from popular applications software is
summarised below.
Excel (WINDOWS)
For a spreadsheet containing data only (values of variables in columns):
Right align columns (necessary for character data)
File -> Save As -> Formatted Text (Space delimited)
to give ASCII file filename.prn.
Edit filename.prn with WordPad if column headings need deleting, missing
values need replacing with ‘.’ etc.
Minitab
File -> Other Files -> Export Special Text
Specify columns (accept Period Decimal Separator).
Results in ASCII file filename.dat.
Note that Minitab’s missing value symbol is ‘*’. SAS will find this invalid and
replace by ‘.’.
SPSS
File -> Other Files -> Fixed ASCII
Results in filename.dat.
Note that SPSS’s missing value symbol is ‘.’ However this will be blank in
filename.dat and cause SAS to misread the data set when using simple list
input.
Alternative: SPSS allows data to be saved directly as a permanent SAS data
set:
File -> Save As -> SASv7 Windows long extension
In recent versions of SAS (e.g. 9.2), SPSS files can be imported directly.
2.6. Missing values
Uncoded missing values present special problems for using list input. To
provide some protection for the integrity of your output data set when input
data contain uncoded missing input values, use the MISSOVER or
STOPOVER options in the INFILE statement. Use the MISSOVER option to
set all remaining variables in the INPUT statement to missing. Use the
STOPOVER option to prevent an observation from being written to the data
set when the input line does not contain a value for each variable in the
INPUT statement and to stop the DATA step from further processing.
e.g. the program
data test1;
input id $ var1 var2 var3 var4 var5;
datalines;
1001 115 45 65 83  78
1002 86  27 55     86
1004 93  52 63 76  88
1015 73  35 43 112 108
;
run;
would result in the following inaccurate data set
obs  id    var1  var2  var3  var4  var5
1    1001  115   45    65    83    78
2    1002  86    27    55    86    1004
3    1015  73    35    43    112   108
If we use the MISSOVER option i.e.
data test1;
infile cards missover;
input id $ var1 var2 var3 var4 var5;
cards;
1001 115 45 65 83  78
1002 86  27 55     86
1004 93  52 63 76  88
1015 73  35 43 112 108
;
run;
we will get the following data set
obs  id    var1  var2  var3  var4  var5
1    1001  115   45    65    83    78
2    1002  86    27    55    86    .
3    1004  93    52    63    76    88
4    1015  73    35    43    112   108
Using the MISSOVER option prevents the uncoded missing value in the
second data line from causing the third record to be read incorrectly as well.
The second observation is still incorrect, but the errors have been restricted to
one observation. The STOPOVER option would instead stop the DATA step at the
second data line, so observation 2 would not be written to the data set at all
and no further lines would be read. In order to read the data in properly, either
column input or formatted input would have to be used (see the next section).
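As an illustration of the last point, the same four records can be read with column input, provided each value sits in fixed columns. This is a sketch only: the data set name test2 and the column positions are chosen for this example and assume the alignment shown in the data lines.

```sas
/* Sketch only: column positions assume the alignment of the data lines below. */
data test2;
   input id $ 1-4 var1 6-8 var2 10-11 var3 13-14 var4 16-18 var5 20-22;
   datalines;
1001 115 45 65  83  78
1002  86 27 55      86
1004  93 52 63  76  88
1015  73 35 43 112 108
;
run;

proc print data=test2;
run;
```

Because columns 16-18 are blank on the second line, var4 is set to missing for that observation only and var5 is still read from columns 20-22.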
2.7. The INPUT statement
The INPUT statement names the variables being read in via a DATALINES or
INFILE statement and tells SAS where on the DATALINES, or on the lines of
INFILE, the values of the variables can be found. There are three main types
of INPUT that you can use to describe a record’s values: LIST, COLUMN
and FORMATTED. The choice of which type of input you use will depend on
the type and arrangement of the incoming data. The $ symbol is placed after
a variable name to indicate a character variable.
In the previous examples you have used only the simplest type of INPUT,
LIST INPUT. List INPUT is seldom useful for large commercial or scientific
work because it is too easy to get missing values or errors in big files. It is
commoner for real data to come in fixed column format, where the fields on
each line are aligned in columns one under each other.
2.7.1. LIST INPUT
- the values are separated by spaces
- missing values must be represented by full stops
- by default, character values cannot be longer than 8 characters
- character values cannot contain embedded blanks
- fields must be read in order
e.g.
data one;
input height weight name $ age;
datalines;
65 150 Chris  50
60 125 Kelly  35
68 180 Leslie 29
;
run;
2.7.2. COLUMN INPUT
- data must be aligned within the column positions specified
- character values can contain embedded blanks
- input values can be read in any order
- character values can be of length 1 to 32,767 characters
- leading and trailing blanks within a field are ignored
e.g.
data two;
input name $ 1-7 age 9-10 birthdate $ 11-22 sport $ 23-30
;
datalines;
Ronald  40Dec  3 1954 golf
Michael 37Jul  4 1957 fishing
Laurel  33Jun 23 1961 softball
;
run;
2.7.3. FORMATTED INPUT
- character values can be of length 1 to 32,767 characters
- a full stop is not needed for numeric missing values
- nonstandard data, such as dates or numbers with commas, can be read in
- with the use of pointer controls, values can be read in any order
This method of input uses pointer controls and informats for reading in
nonstandard data from external data files. An informat is used for reading in
data containing dates, numbers with commas, etc.
The informat w.d after a variable specifies the width w and the number of
decimal places d to be used in reading in a number.
e.g. for the number 2346,
- the informat 4.2 would result in the number 23.46 being read in
- the informat 4. with no ‘d’ specified would result in the number 2346
being read in
The informat $w. after a variable specifies the length of a character variable.
Dates such as 21/10/89 can be read using the informat DDMMYY8. (Note
the full stop at the end of the informat.)
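The informats described above can be tried out on a single data line. The following is a sketch only; the data set name informats and the variable names are illustrative, and @n is the column pointer used to position each read.

```sas
/* Sketch only: data set and variable names are illustrative. */
data informats;
   input @1  twodec 4.2        /* columns 1-4: 2346 is read as 23.46        */
         @1  nodec  4.         /* same columns again: 2346 is read as 2346  */
         @6  code   $3.        /* columns 6-8: a three-character value      */
         @10 when   ddmmyy8.;  /* columns 10-17: 21/10/89 read as a date    */
   format when ddmmyy8.;       /* print the date in the same layout         */
   datalines;
2346 abc 21/10/89
;
run;

proc print data=informats;
run;
```

Reading the same four columns twice shows that the d part of w.d only changes how a value without a decimal point is scaled.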
Pointers indicate the position of a variable e.g.
@n   go to column n
+n   move the pointer on n positions
e.g.
/* A line of place counters is often useful to put to help alignment
0000000001111111111222222222233333333334444444444555555
1234567890123456789012345678901234567890123456789012345 */
data three;
input @1 name $7. @10 age 2.0 @14 birthdate $11.
      @28 sport $8. / @9 gradyr 4.0 @16 numchild 1.0
      @20 occupation $20.;
datalines;
Ronald   40  Dec  3 1954   golf
        1973   2   masonry contractor
Michael  37  Jul  4 1957   fishing
        1975   2   bricklayer
Laurel   33  Jun 23 1961   softball
        1979   0   attorney
;
run;
/ tells the pointer to go to the next line. Once you go to the next line, you
cannot move back to the previous line.
Exercises
2.1
Create a permanent data set of the data given in Exercise 1.1. How
are you going to retrieve this data without having to retype it? You
should be able to modify the program you have saved.
2.2
(a) Download the pulse data file Minitab version from the web or
WebCT (pulse.mtw not pulse.prn). Open Minitab and load pulse.mtw
using File > Open Worksheet (not Open Project). Use File >
Other Files > Export Special Text (not File > Save Current
Worksheet as) to export the PULSE file as an ASCII file. You have
to highlight the variables to export then press select. Press OK.
Enter a suitable file name. Change the file type to ANSI Text Files
(*.TXT). Finally press save.
(b) Create a permanent SAS data set of the data.
2.3
To illustrate the dangers of list format input, take the data file
blood.dat and edit it with a text editor (notepad or Word). Make one
or two mistakes in it by removing some of the entries in one or more
lines. Now save it as a text file and use it to input and print a SAS
data set. Examine your log file and output, to see what has gone
wrong, and how you are warned.
2.4
(a) (WINDOWS ONLY) Create an Excel file containing the following
data where column 1 is size, column 2 is colour, column 3 is price
and column 4 is transport cost. Save it as a formatted text space
delimited file.
Large    Red     18.97  0.25
Medium   Blue    24.68  1.10
X-Large  Black   29.99  1.75
Small    Orange  15.89  0.90
(b) Write and submit a SAS program to read in the data using list
input and print the variables colour, size and price in that order.
(c) Redo (b) using column input
(d) Redo (b) using formatted input
2.5 Copy the text file houses.dat from the web. The file contains the following
five variables for each of the 120 houses in a survey of house prices.
Examine it with an editor (Wordpad or Word).
VARIABLE  CONTENTS             COLUMN LOCATION
style     Type of house        1
sqfeet    Floor area           3-6
bedroom   Number of bedrooms   8
baths     Number of bathrooms  10-12
Price     Price of house       14-19
Use column input to create a permanent SAS data set for the housing data
and print the contents.
2.6
(a) Download the cars Excel file from the web.
To create a file of raw data for reading into a SAS data set:
Open the file up in Excel.
Right align the columns.
Delete the coding information about the origin variable (in column L).
Save the data as a formatted text space delimited file (.prn extension),
or as a csv file in Linux.
(b) Use this file to create a SAS data set (use column input).
To identify the column location for each variable, open the .prn file
up in Notepad. Move the cursor along the row of data, taking a
note of the column locations.
3. SAS procedures
3.1. Structure of a SAS program
Once you have got the data organised, a simple SAS program consists of a
series of procedures. You have already used three of these procedures:
PROC PRINT, PROC MEANS and PROC CONTENTS. Apart from specifying
which data set to use you had no control on the type of output that SAS
produced. This may have given the impression that SAS is rather inflexible.
However, this is far from the truth. Most procedures have several options
which can be invoked and in addition there are statements which can be
incorporated into a program (which themselves have options). The
procedures and subsequent statements determine the nature of the output
produced. Most SAS procedures use the following syntax:
PROC PROCNAME options;
STATEMENTS / statement options;
RUN;
A program will typically consist of several such blocks of code.
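As a concrete instance of this pattern, the sketch below uses SASHELP.CLASS, a small sample data set supplied with SAS (chosen here only for illustration; it is not one of the course data sets).

```sas
/* Sketch only: SASHELP.CLASS is used purely as an illustrative data set. */
proc means data=sashelp.class maxdec=1;   /* procedure options follow the PROC name */
   var height weight;                      /* a statement with no options            */
run;

proc freq data=sashelp.class;
   tables sex / nocum;                     /* statement options follow the slash     */
run;
```

Each block starts with a PROC statement, contains one or more statements, and ends with RUN.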
3.2. Sample program
libname unit3 'c:\sas\sasdata';
proc sort data=unit3.pulse out=sorted;
by activity;
run;
proc print data=sorted noobs N; *NOOBS removes observation numbers;
format height 6.0;
title 'Pulse data from Minitab sorted by activity';
var pulse1 pulse2 weight height;
by activity;
run;
proc freq;
tables ran smokes activity;
tables sex/nocum nopercent;
tables sex*smokes;
run;
proc means maxdec=2 mean std;
title 'Pulse rates before and after exercise';
var pulse1 pulse2;
run;
Task 1
Print the pulse data that you saved as a permanent SAS data set in Exercise
2.2. Now run the first two parts of the sample program (PROC SORT and
PROC PRINT) and compare the output. Remember to specify an appropriate
library.
One way of printing separate tables for different subgroups is to use a BY
statement. In order to do this the data set must be already sorted by this BY
variable. If you do not want to overwrite the original file then the sorted data
must be stored in a new file. The statements:
proc sort data=unit3.pulse out=sorted;
by activity;
run;
sort the pulse data by activity level and store the sorted data in a new file
called sorted.
Task 2
Look in the libraries to see where this file is stored. Is it a permanent or a
temporary data set?
The option NOOBS suppresses the observation numbers and the option N
allows the sample size to be printed at the end of each table. The format
statement gives an instruction to print the values of height with a maximum of
six characters and no decimal places.
Task 3
Type in the rest of the sample program and see if you can work out what the
remaining statements and options are doing. Look carefully at the titles. What
happens if no title statement is made in a procedure?
Individual procedures will be looked at in more detail in the next few sections.
Information about the options available for individual procedures is given in
the SAS help though it is not always very easy to follow!
It is not strictly necessary to have a run statement between each procedure.
SAS recognises that a new procedure statement indicates that the previous
statements refer to the preceding procedure. However, it is generally
advisable to include additional run statements and it is essential to put a run
statement at the end of the program.
4. Summarising data
4.1. SAS System Options
You have probably noticed that the date and a page number are included on
all the output produced by SAS. The type of output produced by SAS is
determined by the system but may be changed by making use of SAS
System Options.
There are dozens of options available which deal with hardware and software
interfacing, and the input and processing as well as just the output of jobs.
A list of the options may be found in help. The following are some
commonly used options which may be used to change the output.
Option              Action
CENTRE/NOCENTRE     Output centred / left aligned
DATE/NODATE         Date shown / date not shown
NUMBER/NONUMBER     Pages numbered / not numbered
PAGESIZE=           Determines the number of lines per page
LINESIZE=           Determines the printer line width
FIRSTOBS=           Specifies the first observation to include from the data set
OBS=                Specifies the last observation to include. This is
                    useful for testing code using large data sets.
OBS=MAX             Includes all observations
The following lines of code will produce a print out of observations 20 to 45
inclusively of the pulse data, with no page numbering, no date, left aligned
and with 20 rows on the page.
options nonumber nodate nocentre pagesize=20 linesize=80 firstobs=20
obs=45;
libname unit4 'c:\sas\sasdata';
proc print data=unit4.pulse;
run;
Options firstobs = 1 obs = max; /* Uses all observations in any analysis that follows. */
SAS system options remain in place for the whole of a SAS session unless
subsequently changed. If an OPTIONS statement is entered within a DATA or
PROC step then it takes effect immediately. An OPTIONS statement entered
outside of a step takes effect with the following step.
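Because system options stay in force for the rest of the session, it can be convenient to restrict the observations for a single step only. The sketch below is not taken from these notes: it uses the data set options FIRSTOBS= and OBS=, written in parentheses after the data set name, and the sample data set SASHELP.CLASS supplied with SAS.

```sas
/* Sketch only: data set options apply to this one step and need no resetting. */
proc print data=sashelp.class(firstobs=5 obs=10);   /* observations 5 to 10 */
run;

proc print data=sashelp.class;                      /* all observations again */
run;
```

Used this way there is no need for a later OPTIONS statement to restore FIRSTOBS=1 and OBS=MAX.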
4.2. HTML output
HTML output can be turned on from the menu
Tools > Options > Preferences.
Select the Results tab and select the Create HTML box. The dialogue
windows are shown below, for both Linux and Windows. It includes an option
to write the output into a specified folder. If this option is not used the output
file is written into the folder specified for the work library.
Figure 7 Dialogue to turn on HTML output (in Windows / in Linux)
4.3. Summary Procedures
Four procedures PROC SORT, PROC MEANS, PROC UNIVARIATE and
PROC FREQ may be used to summarise data. The most commonly used
options and statements for these procedures together with sample
programmes are given below. The complete set of options can be obtained in
SAS help,
Help > SAS Help and Documentation
Choose the SAS Products, Base SAS, SAS Procedures then Procedures.
From there you should click on the procedure you require.
4.4. PROC SORT
Options           Description
DATA=             Data set to be used; uses the last data set created by default.
OUT=              Specifies the name of the file in which to store the sorted data.
                  If no OUT= option is used the original file will be overwritten.
Statements
BY <DESCENDING>   A list of variables to sort by must be specified. DESCENDING
                  placed before a variable name will sort the data in descending
                  order for that variable.
options centre pagesize=50 firstobs=1 obs=92;
proc sort data=unit4.pulse out=sorted;
by sex descending ran;
run;
proc print;
run;
The lines of code sort the pulse data by sex (males first, followed by females)
and within sex by whether the students ran. Those who did not run (coded 2)
are placed before those that did run (coded 1) because DESCENDING has
been specified. Note that no DATA= option is used with PROC PRINT. SAS
automatically uses the sorted data set because that was the last data set
created.
Task 1
Using the first 50 observations only of the pulse data, create a data set sorted
by smoking (non-smokers first) and by activity. Print out the sorted data set.
Check carefully that the output is what you expect.
4.5. PROC MEANS
Options       Description
DATA=         Data set to be used; by default the last data set created is used.
MAXDEC=       Gives the maximum number of decimals to be used in the output
              (must be between 0 and 8).
NOPRINT       Suppresses the printing if the procedure is only being used to send
              summary output to a file (see OUTPUT statement).
ALPHA=        Gives value for confidence limits (ALPHA=0.05 for 95% C.I.).
statistic keyword list
              By default PROC MEANS prints out the variable name, count, mean,
              std dev, min and max values. Particular statistics may be requested.
Procedure options
N             Number of non-missing observations in a subgroup
NMISS         Number of missing observations
MEAN          Mean
STD           Standard deviation
MIN           Minimum value
MAX           Maximum value
RANGE         Range
STDERR        Standard error
CLM           Confidence limits for the mean
(For additional keywords see SAS Help)
Statements
VAR           Specify a list of numeric variables for which statistics are required.
BY            Specify a list of alphanumeric variables (data must be sorted by these
              variables). Descriptive statistics are given for each subgroup.
CLASS         Specify a list of alphanumeric variables. Descriptive statistics are
              given for each subgroup. Uses more memory than the BY command but
              does not need the data to be sorted.
OUTPUT        There are various ways of storing all or some of the summary
              statistics requested. Need to specify a file name using OUT=filename
              and which variables/statistics are required. See specimen programme
              for a simple example of how this can be done.
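The difference between CLASS and BY, and the use of CLM with ALPHA=, can be seen in the following sketch. It uses the sample data set SASHELP.CLASS supplied with SAS, and the name class_sorted is chosen only for this illustration.

```sas
/* Sketch only: SASHELP.CLASS and the name class_sorted are illustrative. */

/* CLASS: no sorting needed, one table with a row for each subgroup */
proc means data=sashelp.class n mean clm alpha=0.05 maxdec=1;
   class sex;
   var height weight;
run;

/* BY: the data must first be sorted by the BY variable */
proc sort data=sashelp.class out=class_sorted;
   by sex;
run;

proc means data=class_sorted n mean clm alpha=0.05 maxdec=1;
   by sex;
   var height weight;
run;
```

The statistics should agree between the two runs; only the layout differs, with BY producing a separate table for each subgroup.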
The following lines of code may be used to get summary statistics (the
means, standard deviations and standard errors) for pulse1 and pulse2 in
subgroups defined by sex and whether the students ran. These summary
values are stored in a file named summary.
proc means data=unit4.pulse maxdec=2 mean std stderr;
var pulse1 pulse2;
class sex ran;
output out=summary
mean = mean_p1 mean_p2
std = std_p1 std_p2
stderr = se_p1 se_p2;
run;
proc print data=summary;
run;
Task 2
Submit the previous sample program. (Remember you may need to change
the library name and file name of the data set.) What data has been stored in
the file ‘summary’? What does the _TYPE_ variable indicate in the print output?
4.6. PROC UNIVARIATE
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
PLOT          Produces stem-and-leaf plots, boxplot and normal probability plots
              of the data.
NOPRINT       Suppresses all printing.
Statements
VAR           Specify a list of numeric variables for which statistics are required.
BY            Specify a list of character variables (data must be sorted).
OUTPUT        Need to specify a file name using OUT=filename and which
              statistics/variable names are required.
Task 3
Submit the following program and see how the printout differs from that
produced by PROC MEANS. How does the output file containing summary
values differ?
proc univariate data=sorted plot;
var height weight;
by sex;
output out=summary
mean = mean_ht mean_wt
std = std_ht std_wt;
run;
proc print data=summary;
run;
4.7. PROC FREQ
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
Statements
TABLES        Specify a list of alphanumeric variables for which tallies are
              required. Smaller subgroups may be defined by the use of an *
              e.g. sex*ran*activity
Table statements options
NOCOL         Does not show column percentages
NOCUM         Does not show cumulative frequencies or percentages
NOFREQ        Does not show cell frequencies
NOPERCENT     Does not show cell percentages
NOROW         Does not show row percentages
CHISQ         Gives results of chi-squared tests of independence
The following code produces frequency tables for sex and smoking habit
separately and a two way table of sex and smoking habit. The output also
includes the results of a chi-squared test of independence for these two
variables.
proc freq data=unit4.pulse;
tables sex smokes smokes*sex/nocol norow nocum chisq;
run;
4.7.1. Chi-square test
Proc freq is used to carry out a chi-square test for the association of 2
categorical variables. In this case the null hypothesis is that there is no
association between smoking and sex. The same proportion of smokers
should be found amongst males and females.
It is convenient to add the row percentage to the cross tabulation as an easy
way to look for a possible association. This is achieved by removing the
option “norow”. The options “nocol” and “nopercent” have been left in the
statement to remove clutter from the output.
proc freq data=unit4.pulse;
tables smokes*sex / chisq nocol nopercent;
run;
The output from SAS gives
The FREQ Procedure
Table of Smokes by Sex

Smokes     Sex
Frequency|
Row Pct  |       1|       2|  Total
---------+--------+--------+
       1 |     20 |      8 |     28
         |  71.43 |  28.57 |
---------+--------+--------+
       2 |     37 |     27 |     64
         |  57.81 |  42.19 |
---------+--------+--------+
Total          57       35       92

Statistics for Table of Smokes by Sex
Statistic                      DF     Value      Prob
------------------------------------------------------
Chi-Square                      1    1.5321    0.2158
Likelihood Ratio Chi-Square     1    1.5699    0.2102
Continuity Adj. Chi-Square      1    1.0089    0.3152
Mantel-Haenszel Chi-Square      1    1.5154    0.2183
Phi Coefficient                      0.1290
Contingency Coefficient              0.1280
Cramer's V                           0.1290
The probability of the chi-square statistic being as large as 1.5321 by chance
alone is 0.2158. This is well above the conventional 0.05 level, so there is no
evidence of an association between sex and smoking in this sample.
Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        20
Left-sided Pr <= F          0.9310
Right-sided Pr >= F         0.1576
Table Probability (P)       0.0886
Two-sided Pr <= P           0.2502
Sample Size = 92
Sometimes your data may simply be the counts of each table cell. The data
set must contain a variable called something like “count” which contains the
number of observations in each cell. In this case use the weight statement in
the procedure,
e.g. weight count;
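As a sketch of this approach, the four cell counts from the Smokes by Sex table in the output above can be entered directly. The data set name smoke_counts is chosen only for this illustration.

```sas
/* Sketch only: cell counts are copied from the Smokes by Sex table above. */
data smoke_counts;
   input smokes sex count;
   datalines;
1 1 20
1 2 8
2 1 37
2 2 27
;
run;

proc freq data=smoke_counts;
   weight count;                               /* each line stands for count observations */
   tables smokes*sex / chisq nocol nopercent;
run;
```

Since the counts are the same, the chi-square results should match those obtained from the individual observations.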
4.8. General syntax for a procedure
A general example of a procedure is given below. Each procedure uses only
a certain combination of statements but the action of each statement is
common across the procedures in which it can be used.
PROC PROCNAME DATA = lib1.data1 OUT = lib2.data2 noprint;
WEIGHT WeightVar ;               * Specifies which variable weights the analysis;
FORMAT NumVar 8.0 CatVar $3. ;   * Specifies formats of variables;
BY CatVar ;                      * Give 1 output per group, SORT needed;
CLASS CatVar ;                   * Similar to BY statement, no SORTing needed.
                                   Makes a numeric variable act as a categorical variable;
VAR NumVar ;                     * Restricts analysis to named variables;
OUTPUT OUT = Summary keyword= DescriptiveStatistic;
                                 * Named output dataset, specifies names;
WHERE NumVar2 > 1000 ;           * Only uses certain cases;
TABLE CatVar * NumVar ;          * Specifies an output table;
FREQ NumVar3 ;                   * Variable giving the observation frequency;
MODEL YVar= XVar + . .;          * Fits models;
PLOT YVar * XVar ;               * Plots a scatter plot;
RUN;
4.9. Help
Extensive documentation on each procedure can be found by using
Help > SAS Help and Documentation. An example of the help screen for
the BASE SAS procedures is shown in Figure 8. Other useful help modules
are SAS/STAT and SAS/GRAPH.
Figure 8 Base SAS Procedures help
Exercises
4.1
(a) For the pulse data, get a printout which shows the number of
observations, the mean, standard deviation, maximum and
minimum values of pulse2 in each of the four subgroups defined
by whether the student smoked/did not smoke and ran/did not
run. Make the printout left-aligned with the summary values
shown to one decimal place.
(b) Obtain confidence limits for the four means produced in part (a).
Does it appear that smoking or running on the spot had any effect
on the second pulse rate?
(c) Obtain comparative boxplots of the second pulse rate in each of
the four subgroups.
(d) What percentage of smokers were made to run on the spot? Get
suitable SAS output to give you this information.
4.2
For the SAS data set RETAIL (From the explorer tab look in the
library SASHELP) get a printout which shows the number of
observations and the mean and standard deviation of retail sales in
each year. Print the mean and standard deviation to two decimal
places. Output the mean sales for each year to a new file called
‘summary’ and get a print out of this file.
5. Graphs and charts
5.1. Graphics procedures
SAS can produce two types of graphics, high or low resolution. High
resolution graphics are sent to a special graphics window where the graphs
can be edited and copied into Word documents. GPLOT and GCHART are
two procedures which produce high resolution graphics: the equivalent low
resolution procedures are PLOT and CHART.
5.2. PROC PLOT
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
Statements
PLOT          Specify yvariable*xvariable. Can produce several plots with a
              single statement by including a list of variables in parentheses
              e.g. (list of n yvariables)*(list of m xvariables) will produce
              n × m separate plots.
BY            Specify a list of character variables (data must be sorted) to
              produce separate graphs for subgroups.
PLOT options
='symbol'     Specify a symbol to be used for plotting
=variable     Identifies each point by the value of another variable
The following code produces a plot of weight against height for all students,
separate plots of weight against height for each sex and a single plot with a
different symbol for males and females.
proc sort data=unit5.pulse out=sorted;
by sex;
proc plot data=sorted;
plot weight*height;
proc plot;
plot weight*height='*';
by sex;
run;
proc plot;
plot weight*height=sex;
run;
Task 1
Input the program into SAS and examine the output. (Remember to assign a
LIBNAME as the first statement.) Resubmit the program using PROC GPLOT
instead of PROC PLOT. What differences in the output do you observe?
When it has run successfully in PROC GPLOT, you will find that you are in a
graph window. The graph can be edited in SAS by clicking on the painting
icon. To come out of editing the plot, click on file and then down to end. You
can save the graph to a file or cut and paste it into Word where it can be
further edited if required. The graph window must be closed down (by clicking
on its close button) before another SAS program can be run.
5.3. PROC CHART
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
Statements
HBAR          Specify variable to produce a frequency bar chart (horizontal bars)
VBAR          Specify variable to produce a frequency bar chart (vertical bars)
PIE           Specify variable to produce a pie chart.
BLOCK         Specify variable to use on the x-axis. Used in conjunction with
              GROUP and SUMVAR options to produce three-dimensional bar charts.
BY            Specify a list of character variables (data must be sorted) to
              produce separate charts for subgroups.
HBAR/VBAR/PIE options
SUMVAR=       Specify an analysis variable the sum of which is to be shown on
              the y-axis
TYPE=         May be used on its own or in conjunction with SUMVAR to produce
              statistics other than the frequency or sum on the y-axis. The
              options for TYPE are FREQ (frequency counts), PCT (percentages),
              CFREQ (cumulative frequencies), CPCT (cumulative percentages),
              SUM (totals), MEAN (means). The default is TYPE=SUM if SUMVAR is
              used, otherwise the default is TYPE=FREQ.
LEVELS=       Specifies the number of equal width classes for numeric variables.
MIDPOINTS=    Specifies the midpoints of classes for numeric variables
              MIDPOINTS=lower_limit TO upper_limit BY interval
DISCRETE      Prevents SAS from dividing a discrete variable into inappropriate
              intervals e.g. ensures a variable coded from 1 to 5 will produce
              5 classes.
GROUP=        Produces separate bar charts on the same graph for different
              discrete values of the GROUP variable.
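The DISCRETE, MIDPOINTS= and SUMVAR=/TYPE= options can be tried on a small data set. The sketch below uses the sample data set SASHELP.CLASS supplied with SAS; the midpoint values are chosen only for illustration.

```sas
/* Sketch only: SASHELP.CLASS and the midpoint values are illustrative. */
proc chart data=sashelp.class;
   vbar age / discrete;                      /* one bar for each distinct age         */
   hbar height / midpoints=50 to 75 by 5;    /* classes centred on 50, 55, ..., 75    */
   vbar sex / sumvar=weight type=mean;       /* mean weight shown for each sex        */
run;
```

Removing DISCRETE from the first statement shows how SAS would otherwise group the ages into intervals.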
The following program illustrates some of the features of PROC CHART.
proc chart data=sorted;
hbar smokes;
by sex;
run;
proc chart;
vbar height/levels=6 group=sex sumvar=pulse2 type=mean;
run;
Task 2
Run this program and look carefully at the output obtained. Adapt the code to
show other statistics on the y-axis of the vertical bar chart, a different number
of bars and so on to familiarise yourself with the procedure.
Exercises
5.1
Plot weight against height for the data from Exercise 1.1. This should
be stored somewhere as a permanent SAS data set. Use a plus sign
as your plotting symbol.
5.2
For the pulse data:
(i) Plot the second pulse rate against weight, using a different symbol
for males and females.
(ii) Produce a pie chart which shows the percentage of students who
usually have particular levels of activity.
(iii) Produce a horizontal bar chart which shows the mean of the
second pulse rate for those students that did and did not run.
(iv) Produce vertical bar charts side by side which show the
percentage of students who have different levels of physical activity
for males and females separately.
(v) Produce a three dimensional bar chart which shows the mean of
the first pulse rate for subgroups defined by level of activity and
smoking habit.
6. Correlation and Regression
Two procedures may be used to obtain information about the relationship
between two or more continuous variables. PROC CORR determines the
correlation coefficients between selected variables and PROC REG fits a
regression model to data and allows output to be saved for further analyses.
Remember that the statements and options given in these notes are only a
very small subset of those that can be used with particular procedures. The
help facility may be used to investigate further possibilities.
6.1. PROC CORR
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
SPEARMAN      Calculates the Spearman rank correlation coefficient. By default
              the Pearson product moment correlation is calculated.
Statements
VAR           Specify a variable list (essential)
WITH          Specify a variable list to be used with VAR. The VAR variables
              are given at the top of the table of correlations and the WITH
              variables at the side. If WITH is not used a matrix of the
              correlations between all pairs of variables is produced.
BY            Specify a list of character variables (data must be sorted) to
              produce separate tables of correlations.
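The SPEARMAN option is added on the PROC CORR statement itself. A brief sketch, using the sample data set SASHELP.CLASS supplied with SAS rather than one of the course data sets:

```sas
/* Sketch only: SASHELP.CLASS is used purely as an illustrative data set. */
proc corr data=sashelp.class spearman;
   var height weight;     /* shown across the top of the table */
   with age;              /* shown down the side of the table  */
run;
```

With WITH present, only the correlations of age with height and with weight are produced, not the full matrix.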
An example of the use of PROC CORR is shown below using the data
concerned with heights and weights of eleven-year-olds (Exercise 1.1).
proc corr data=unit6.eleven;
var height weight;
run;
proc corr data=unit6.eleven;
var height ;
with weight;
run;
Task 1
Try running this program and look at the difference in output produced when
the WITH statement is included or not included. What do the values under the
correlation coefficients indicate?
6.2. PROC REG
Options       Description
DATA=         Data set to be used; uses the last data set created by default.
CORR          Prints a correlation matrix for all variables listed in the MODEL
              statement.
Statements
MODEL         For simple linear regression, specify a response variable and an
              independent variable, response=independent variable
OUTPUT        Need to specify an output file using OUT=filename and a list of
              keywords(statistics)/names required. Commonly used keywords
              are p (predicted values), r (residuals), student (standardised
              residuals). See the specimen program for syntax.
PLOT          Allows scatter plots to be produced using any variables in the
              model or keywords in the OUTPUT statement. Note the
              keywords in the OUTPUT must be followed by a full stop when
              used as variables in the PLOT statement.
BY            Specify a list of character variables (data must be sorted) to
              produce a separate regression analysis for each subgroup.
PLOT options
OVERLAY       Superimposes several scatter plots on the same graph.
The following code produces output which shows:
- the correlation matrix of weight and height
- output from a simple linear regression analysis of weight on height
- a plot of weight against height with the predicted values (line)
superimposed
- a plot of the standardised residuals against height
- a printout of the output file
- a normal probability plot for each of the residuals and standardised
residuals together with the results of tests of normality.
proc reg data=unit6.eleven corr;
model weight=height;
output out=regout
p=pred
r=resid
student=stdresid;
plot weight*height='*' p.*height= 'P'/overlay;
plot student.*height;
run;
proc print data=regout;
proc univariate plot normal data=regout;
var resid stdresid;
run;
Task 2
Obtain the output from the previous program. Why is it important to produce
residual plots and normal probability plots for the residuals from a linear
regression model?
Exercises
6.1
Data set beetle, available as a permanent SAS data set on the web
page, gives information on a sample of beetles and the damage they
cause.
(a) Download the file and examine the data.
(b) (i) plot area against male beetle length
    (ii) regress area against male beetle length
    (iii) obtain a plot of the standardised residuals against length
    (iv) obtain a normal probability plot of the standardised residuals.
(c) Why isn’t a simple linear regression model satisfactory? What
    might be a useful thing to try in order to improve the model?
(d) Similarly, investigate the relationship between the amount of frass
    produced and female beetle length.
7. Exploratory data analysis
7.1. SAS/INSIGHT
SAS Insight software is a tool for data exploration and analysis. It is
interactive which means you can explore data through graphs and analyses
linked across multiple windows. This facility can be used to identify outliers,
highlight subgroups on a graph and so on. SAS/INSIGHT allows you to
analyse univariate distributions, investigate multivariate distributions and fit
explanatory models such as a simple linear regression model to your data.
7.2. Accessing SAS/INSIGHT
SAS/INSIGHT may be used with any SAS data sets that have been created
previously. If you wish to investigate a permanent SAS data set make sure
that you have set up an appropriate library using a LIBNAME statement
before invoking SAS/INSIGHT. It is also possible to enter data directly into the
SAS/INSIGHT data window.
Within SAS, select Solutions > Analyse > Interactive data analysis. The
data of interest can then be accessed using Library > Data Set > Open or a
new data set created in the data window using New.
7.3. Creating a Scatter Plot
To investigate SAS/INSIGHT initially, choose a data set you are fairly familiar
with and try out some of the features. This example used the Pulse data set.
Analyse > Interactive Analysis
Figure 9 Opening the SAS/INSIGHT dialogue in Windows
Figure 10. Opening the SAS/INSIGHT dialogue in Linux
You can create charts using the Analyse menu. An example of creating a
scatter plot is shown in Figure 11.
Analyse > Scatter Plot (X/Y)
Figure 11 The Scatter Plot ( X / Y ) dialogue
Figure 12 Two SAS/INSIGHT scatter plots with 3 points highlighted
A key feature of SAS/INSIGHT graphs is that they are interactive. Click on a
point on one chart and the corresponding points on all charts become highlighted. An
example of this interactivity is shown in Figure 12. The 3 points labelled with
‘1’ use a larger symbol and indicate the level of the variable ran (1 = yes). To
select several points, hold down the Ctrl key while clicking. To highlight
an area of points, left click and hold in one corner of the region, then drag the
cursor to the opposite corner. All points in the rectangle will be highlighted.
You can then deselect the points by clicking on a point outside the rectangle.
7.4. Features of SAS/INSIGHT
SAS/INSIGHT software provides an extensive range of tools for investigating
data and carrying out analyses. Some of the activities that you can carry out
using SAS/INSIGHT are shown below.
• enter data from the keyboard
• identify observations in plots
• examine all values for selected observations
• brush observations in graphs
• create overlaid line plots
• rotate data in three dimensional plots
• manipulate histograms to explore the distribution of data
• compare distributions in box plots and mosaic plots
• compute descriptive statistics
• fit parametric (normal, lognormal, exponential, Weibull) and kernel density estimates
• fit parametric cumulative distribution functions
• create quantile-quantile plots
• calculate correlations and principal components to find the structure of your data
• fit a general linear model
• create residual and leverage plots
• transform variables
• process data by groups for every analysis
7.5. Using SAS/INSIGHT
• Once you have a data set in SAS/INSIGHT, manipulations and analyses
are carried out by using either Edit or Analyse on the main menu.
• Operations are also available from pop-up menus by:
   - clicking the left mouse button in the corners of graphs and tables
   - pressing the right mouse button over an appropriate object.
• Variables to be analysed may either be selected before clicking on
Analyse or entered as requested within the particular analysis window.
• In the data set window:
   - a variable is selected by clicking on the name
   - several variables can be selected by holding down the left mouse
     button and dragging across the selection
   - non-contiguous variables or observations may be selected by
     holding down the control button and clicking on individual names or
     row numbers.
• In Windows, any plots produced can be printed directly or copied and
pasted into Word documents.
• Tabular output can be saved into the normal SAS Output Window using
File > Save > Tables.
Commands from your SAS/INSIGHT session can be recorded and later
resubmitted. The FILE and INFILE options allow you to produce a file
containing commands to document and reproduce a SAS/INSIGHT session.
This is very useful for exploratory analysis that you need to interrupt or repeat
on different sets of data.
Example
filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight file = note;
run;
After doing your analysis and then exiting, your file will contain the commands
that were used to create and close your Insight session. You can begin your
second Insight session from where you left your first session with the
following code:
filename note 'h:\MA71064 Statistical Computing\insight.txt';
proc insight infile = note;
run;
Alternatively, just code the FILE keyword without a filename specified and
your commands will be recorded in the SAS Log window.
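A minimal sketch of the log-recording variant just described. The data set sashelp.class is only a stand-in (it is one of the sample data sets shipped with SAS); substitute your own data set.

```sas
/* Sketch: FILE with no fileref, so the session commands are     */
/* written to the SAS Log rather than to an external file.       */
/* sashelp.class is an assumed stand-in data set.                */
proc insight data=sashelp.class file;
run;
```

The recorded commands can then be copied from the Log window into a text file if you later want to replay them with INFILE=.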
Task 1
To get a feel for what SAS/INSIGHT can do, work through the following
exercises which are based on the pulse data.
(a) Obtain a histogram of height.
Obtain a histogram of sex in a new window.
Click on the bar representing males on the sex histogram.
Look at the histogram of height. What has changed on this histogram?
(b) Obtain comparative boxplots of weight for each sex. (Input weight as
the Y-variable and sex as the X-variable.)
Click on the outlier for male weights. Which observation number is
this?
Double click on this outlier. What information do you get?
(c) Highlight observation numbers 1, 31 and 67 in the data window.
Press the right mouse button and click ‘Label in plots’.
Obtain a scatter plot using pulse2 as the Y-variable, pulse1 as the
X-variable, sex as the group variable and ran as the label. What
information is shown on these plots?
Double click on one of the points. What information do you get?
(d) Highlight the variable names for pulse1, pulse2, height and weight in
the data window.
Obtain a scatter plot.
What plots do you get? What are the values shown in each plot?
(e) Using the Fit option in the Analyse menu, input pulse2 as the
Y-variable, weight as the X-variable and ran as the group variable.
Look carefully at the output obtained.
Redo the analysis to show Residual Normal QQ plots and store
predicted values and standardised residuals in the data sheet.
7.6. Tools
Edit > Windows > Tools opens a tools window that allows data points to be
coloured or given different symbols according to data value.
Figure 13 The tools dialogue labelling observations by sex
Press a coloured square in the tools window to set a particular set of
points to that colour, e.g. colour red all those points representing people who
smoke.
Similarly press one of the symbol buttons to select a given symbol for a set of
observations.
Other features of SAS/INSIGHT can be investigated using the help facility.
Task 2
(a) Find out what is meant by ‘brushing observations’.
(b) Produce summary statistics and graphs for each of the continuous
variables in the pulse data.
(c) Input, into the data sheet, the following data which are chloride
content (mg/l) of waters draining from a particular type of rock.
6.0  0.4  0.2  5.0  6.0  0.8
0.5  1.2  0.2  0.5  0.2  1.7
0.6  0.7  0.5  10.0  0.3  6.0
(i) Produce a boxplot of the data and comment on the distribution.
(ii) Create a new variable which is the log of the chloride content.
(iii) Check whether the log values could reasonably be assumed to
have come from a normal distribution.
8. Modifying data and output
8.1. Introduction
In the previous units you have discovered how to create permanent SAS data
sets and produce output using a variety of procedures. The OPTIONS
statement, introduced in the unit on Describing Data, allows you to
make some changes to the output produced. Generally speaking though,
there has been very little flexibility in either changing the style of the output or
modifying the data set that has been used. In this unit you will learn how to
make changes to a permanent SAS data set and to customise some types of
output.
8.2. SET statement
The SET statement is a very versatile statement which is used in a DATA
step and enables a variety of tasks to be carried out depending on which
options are used. One of its most common uses is reading observations and
variables from existing SAS data sets so that further processing can take
place. Another use is combining two or more data sets so that analyses can
be carried out on a larger set of variables or observations.
The same operations can often be done in different ways because some SAS
statements can be incorporated as options into either the DATA or SET
statements. Have a look at the following examples of code which both
achieve the same thing.
data unit8.beetles1 (drop=site);
   set unit8.beetles;
   where site='1';
run;

data unit8.beetles1 (drop=site);
   set unit8.beetles (where=(site='1'));
run;
Both pieces of code produce a new permanent SAS data set called beetles1
which has data from site one only and does not include site as a variable. The
number 1 is shown in single quotes because site is a character variable in this
data set. The DROP option allows variables not required to be omitted from
the new data set. It can be included in either the DATA or SET statement but
you have to be a bit careful. If DROP is used in the SET statement then the
variables involved cannot be used for further processing. In this example
using the DROP option with the variable site in the SET statement would
result in an error message – try it!
The following table shows some of the commonly used options in the DATA
and SET statements.
Data Set Options
Description
DROP=
Specify one or more variables to exclude either from further
processing or from the new data set
FIRSTOBS=
Specify the first observation required for processing
KEEP=
Specify one or more variables to include in further processing or
in the new data set
OBS=
Specify the last observation required for processing
LABEL=
Specify names to be given to variables (see section Labelling
output)
RENAME=
Specify new names for variables
WHERE=
Specify a condition to select certain observations from a SAS
data set
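The options in the table can be combined on a single data set reference. The following is a sketch only, using the sample data set sashelp.class shipped with SAS as a stand-in; the output data set name subset and the new variable name height_in are made up for illustration.

```sas
/* Sketch: several data set options on one SET statement.        */
/* sashelp.class is an assumed stand-in data set.                 */
data work.subset;
   set sashelp.class (firstobs=5 obs=10          /* read observations 5 to 10 */
                      keep=name age height       /* read only these variables */
                      rename=(height=height_in)  /* rename as it is read      */
                     );
run;

proc print data=work.subset;
run;
```

Note that OBS= gives the number of the last observation to read, not a count, so this step reads six observations. KEEP= refers to the original variable name because it is applied before RENAME=.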
8.3. DROP and KEEP
The following code shows how the DROP and KEEP options may be used in
a program.
data lengths (keep=height cond mlength flength)
damage (drop=mlength flength);
set unit8.beetles (drop=site);
run;
proc print data=lengths;
proc print data=damage;
run;
Note that more than one data set can be specified in a single DATA
statement. In this example two temporary data sets are produced: lengths and
damage.
Task 1
Look at the preceding code and see if you can work out which variables will
be contained in each data set. Submit the code and see if you are correct. If
you wanted to create permanent data sets what changes to the code would
you have to make?
8.4. Labelling output
Variable names in SAS are restricted in length (32 characters from SAS
version 7 onwards, eight characters in earlier versions), and
by default these variable names are used as column headings. A LABEL
statement may be used to associate a descriptive label with a variable. If the
labels are required in the output then either a LABEL or SPLIT option must be
used in the PROC PRINT statement. If a LABEL statement is used in a
DATA step then the labels are permanently associated with the
variables in that SAS data set.
An example of the use of labels follows.
proc sort data=unit8.beetles1 out=sorted;
   by cond height;
run;
proc print data=sorted split='*';
   var height mlength area frass;
   by cond;
   label
      mlength='male*length'
      area='leaf area*consumed'
      frass='number of*frass*pellets';
   pageby cond;
   sum frass;
run;
Task 2
Submit the code in SAS and inspect the output. What effect have the
PAGEBY and SUM statements had?
If you had used label names without the asterisks and used the LABEL option
in the PROC PRINT statement then SAS would split the label names
automatically at a suitable place but you have no control over the process.
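As a sketch of the LABEL alternative just mentioned, the following uses the sample data set sashelp.class (shipped with SAS) as a stand-in; the label texts are made up for illustration.

```sas
/* Sketch: LABEL option instead of SPLIT=.                        */
/* SAS chooses where to break long column headings.               */
proc print data=sashelp.class label;
   var name height weight;
   label height='Height of the student'
         weight='Weight of the student';
run;
```

Because the LABEL statement appears inside the PROC step, the labels apply to this output only and are not stored in the data set.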
8.5. PROC PRINT
Options
Description
DATA=
Data set to be used, uses the last data set created by default
LABEL
Ensures that column labels are used in the output
SPLIT=
Specify a character in the label names which splits the column
headings onto two or more lines
NOOBS
Suppresses the printing of observation numbers in the output
Statements
VAR
Specify variables to be printed
ID
Specify variable to use as identification instead of the observation
number
BY
Specify a list of character variables (data must be sorted) to produce
separate tables.
PAGEBY
Used with BY statement to output each table on a separate page.
SUMBY
Prints subtotals for the specified BY variable
SUM
Specify numeric variables for which the sum of the values is required.
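The following sketch combines several of the options and statements in the table above. It again assumes the sample data set sashelp.class as a stand-in; class_sorted is a made-up name for the temporary sorted copy.

```sas
/* Sketch: NOOBS, VAR, BY and SUM together.                       */
proc sort data=sashelp.class out=work.class_sorted;
   by sex;                      /* BY processing needs sorted data */
run;

proc print data=work.class_sorted noobs;
   var name age weight;
   by sex;                      /* one table per value of sex      */
   sum weight;                  /* totals for weight               */
run;
```

With a BY statement present, SUM prints a subtotal for each BY group as well as a grand total.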
8.6. PROC FORMAT
The LABEL statement allows you to give longer names to variables so that
any output is easier to interpret. It is also possible to assign names to
individual categories for character variables and to save these names as
permanent formats. This is done using PROC FORMAT. The permanent
formats are saved in a location which is specified using the LIBNAME
statement with the special libref name LIBRARY. For example,
libname library 'C:\sasdata';
will store the formats in a directory sasdata on the hard disk when the
LIBRARY option is used in the FORMAT procedure.
Task 3
Assign a library called LIBRARY in a suitable location. (The location where
you have stored your SAS permanent data sets is probably the most
appropriate.)
The individual names are assigned using a value statement. These formats
are independent of any particular data set and if appropriate may be used
with any variable. The following example shows suitable labels for the plant
height and plant condition categories from the beetles data but the format
$health, for example, could be used for any variable where 1, 2 and 3
represent poor, satisfactory and good respectively.
proc format library=library;
value $height
'1'='less than 10cm'
'2'='10cm<20cm'
'3'='20cm<30cm'
'4'='30cm<40cm'
'5'='40cm or more';
value $health
'1'='poor'
'2'='satisfactory'
'3'='good';
run;
When these labels are required in output the format names must be shown
with a full-stop in the FORMAT statement.
proc print data=unit8.beetles;
format height $height. cond $health.;
run;
Task 4
Create the above formats and print out the beetles data set to see the effect
of using these formats. Check the library called LIBRARY. You should find a
catalog called FORMATS. Double-clicking on FORMATS will give a list of all
permanent formats you have created.
The permanent formats which have been created may be used at any time. If
you want to use them in a future session remember to use a LIBNAME
statement initially to assign the libref LIBRARY.
(To use more than one permanent format library use options fmtsearch - see
HELP for details)
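A minimal sketch of the FMTSEARCH option. The libref myfmts and its directory are hypothetical and stand for any second location where a FORMATS catalog has been saved.

```sas
/* Sketch: search two permanent format libraries.                 */
libname library 'c:\sasdata';
libname myfmts  'c:\sasdata\formats2';   /* hypothetical directory */

/* Each libref listed is searched for a catalog called FORMATS.   */
options fmtsearch=(library myfmts);
```

The catalogs are searched in the order listed, so a format name that exists in both libraries is taken from the first one.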
_____________________________________________________________
8.7. Recoding data
You may find that when you are presenting results, or carrying out an analysis
of a set of data, you wish to code a continuous variable such as height
into discrete categories, for example, short, medium and tall. In some
circumstances you may wish to combine categories for an analysis. This type of
operation can be done either by using conditional statements or by making use
of the VALUE statement in PROC FORMAT.
_____________________________________________________________
8.8. Conditional statements
Conditional statements can take several forms (see the SAS Help). Two
commonly used in recoding are:
• if expression then statement;
• if expression then statement; else statement;
The following code shows the use of the IF-THEN/ELSE statement for
recoding pulse1 and pulse2 into four numeric categories.
data unit8.pulse2 (keep=pulse1 pulse2 ran);
set unit8.pulse;
if 40 <= pulse1 < 60 then pulse1=1;
else if 60 <= pulse1 < 80 then pulse1=2;
else if 80 <= pulse1 < 100 then pulse1=3;
else pulse1=4;
if 40 <= pulse2 < 60 then pulse2=1;
else if 60 <= pulse2 < 80 then pulse2=2;
else if 80 <= pulse2 < 100 then pulse2=3;
else pulse2=4;
run;
proc freq data=unit8.pulse2;
tables pulse1*pulse2/nocol nocum nopercent norow;
by ran;
run;
_____________________________________________________________
8.9. VALUE statement
An alternative way to code data is to create new formats for the required
variables and use these formats when they are needed to produce particular
output. For example putting the pulse rates into categories and creating a two
way table could be done by creating a new format as follows.
libname library 'c:\sasdata';
run;
proc format library=library;
value pulse
40-59=1
60-79=2
80-99=3
100-200=4;
run;
proc freq data=unit8.pulse;
tables pulse1*pulse2/nocol norow nocum nopercent chisq;
by ran;
format pulse1 pulse2 pulse.;
run;
Task 5
Try out these alternative ways of recoding the data. What do you think are the
advantages and disadvantages of each method? What do the resulting tables
tell you about the relationship between the first and second pulse rates in
each group?
The values that may be assigned to a particular code or description may be
specified in the following ways.
Range specification in
the VALUE statement
Description
value
a single value
value1-valuen
a range of values
value1, value2, ….
a list of values
HIGH
the highest possible value
LOW
the lowest possible value
OTHER
anything that does not fall into any range
For example a format for age groups could be created as follows.
libname library 'c:\sasdata\';
run;
proc format library=library;
   value agegroup
      low-24='under 25'
      25-49='25 or more but less than 50'
      50-high='50 or over';
run;
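Ranges such as 40-59 and 60-79 leave a gap for non-integer values: a value of 59.5 falls into neither range. The sketch below shows one way round this, using the < symbol in a range to exclude an end point. The format name pulsegrp, the data set demo and its values are made up for illustration, and the format is temporary because no LIBRARY= option is given.

```sas
/* Sketch: ranges with an excluded upper end point.               */
proc format;
   value pulsegrp
      low-<60  = 'under 60'
      60-<80   = '60 to under 80'
      80-<100  = '80 to under 100'
      100-high = '100 or over';
run;

/* Made-up values chosen to sit on or near the boundaries.        */
data work.demo;
   input pulse;
   datalines;
59
59.5
60
100
;
run;

proc print data=work.demo;
   format pulse pulsegrp.;
run;
```

Here 59 and 59.5 are both printed as 'under 60', while 60 falls into '60 to under 80'.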
____________________________________________________________________
8.10. OUTPUT
The OUTPUT statement is used in conjunction with the SET statement to
create multiple SAS data sets. The IF statement is used with the OUTPUT
statement to control which observations are output to which SAS data sets.
e.g.
data american japan british;
   set mydata.cars;
   if origin = 1 then output american;
   else if origin = 2 then output japan;
   else if origin = 3 then output british;
run;
Exercises
8.1 Create a new temporary data set, using the pulse data, named mpulse
which contains data for males only. Omit the variables sex, height and weight
in this set. Label pulse1 ‘First pulse rate’, pulse2 ‘Second pulse rate’ and print
out the data set using these variable labels.
8.2 Create formats for the alphanumeric variables in the pulse data to give the
following information:
ran        1=ran in place        2=did not run in place
smokes     1=smokes regularly    2=does not smoke regularly
sex        1=male                2=female
activity   1=slight              2=moderate                   3=a lot
Construct two way tables which show frequencies only for the following pairs
of variables, using the formats you have created.
(i) sex and smokes
(ii) smokes and activity
(iii) sex and ran
8.3 In Task 5 there was a warning that the chi-squared test may not be valid
because of small expected numbers. Create a new format for the pulse rates
which has two categories only. (Choose the categories so that there are
roughly equal numbers in each category for this particular set of data.) Repeat
the chi-squared tests for independence for each of the two groups (ran/did not
run) as in Task 5 and comment on the results.
9. PROC TABULATE
Proc tabulate displays descriptive statistics in tabular format. It computes
many of the same statistics that are computed by other descriptive statistical
procedures such as MEANS, FREQ, and SUMMARY, but incorporates more
flexibility.
Statements   Description
CLASS        Identifies the categories on which calculations are carried out
             Are either character or discrete numeric, e.g. a department code
             Supplies the values used in the structure of the table
             Must be present in PROC TABULATE statements
VAR          Contain values appropriate for calculating statistics
             Are continuous numeric
             Supplies the values in the table cells
             Optional in PROC TABULATE statements
TABLE        element1 * element2 where the elements are class variables and optionally,
             the ALL variable and/or various statistics
             Table operators: comma - produces a multidimensional table
                              asterisk - produces a hierarchical table
                              blank - concatenates tables
BY           Specify a list of character variables (data must be sorted)
FORMAT       Specify variables with the formats wanted
FREQ         Specify the variable
KEYLABEL     Used to label the statistics available e.g. keylabel all='Grand Total';
LABEL        Specify a label for a variable
WEIGHT       Specify a variable to be used for weighting the entries in the table
PROC TABULATE is the only procedure that has a SAS manual of its own. It
is worth understanding the format of the table statement that controls the
position of the variables.
There are 3 operators that determine where the variables are positioned in
the output. Notice that the variables must be categorical. If the variables are
numeric then the CLASS statement is used to tell the SAS system to treat the
numeric variables as if they were categorical.
9.1. The comma table operator
Determines page, row and column positions, i.e. cross tabulation.
Table <page variable>, <Row variable>, <Column variable> ;
e.g.
proc tabulate data = students;
class faculty sex;
table faculty, sex;
run;
              SEX
              Male     Female
              N        N
FACULTY
Science       450      360
Arts          350      550
9.2. The asterisk table operator
Nests two variables in the column or row
Table <Column variable1> * <Column variable2> ;
e.g.
proc tabulate data = students;
class faculty sex;
table faculty*sex;
run;
FACULTY
Science                Arts
SEX                    SEX
Male       Female      Male       Female
N          N           N          N
450        360         350        550
9.3. Using a blank table operator
Places variables side by side
Table <Row variable>, <Column variable1> <Column variable2> ;
e.g.
proc tabulate data = students;
class faculty sex;
table faculty sex;
run;
FACULTY                SEX
Science    Arts        Male       Female
N          N           N          N
810        900         800        910
9.4. Using the ALL variable
e.g.
proc tabulate data = students;
class faculty sex;
table faculty all, sex all;
run;
              SEX
              Male     Female     ALL
              N        N          N
FACULTY
Science       450      360        810
Arts          350      550        900
ALL           800      910        1710
9.5. Other Statistics
The statistics that can be requested in PROC TABULATE include the
following
N       number of nonmissing observations
SUM     sum of the VAR variable for each class of the CLASS variables
NMISS   number of missing observations
MEAN    arithmetic mean
STD     standard deviation
MIN     minimum value
MAX     maximum value
e.g.
proc tabulate data = students;
class faculty sex;
var exammark;
table faculty*sex*exammark*max;
run;
FACULTY
Science                    Arts
SEX                        SEX
Male         Female        Male         Female
EXAMMARK     EXAMMARK      EXAMMARK     EXAMMARK
MAX          MAX           MAX          MAX
83           85            79           76
_____________________________________________________________
The operators can be mixed to produce any output you require. For example
it is possible to nest 2 variables on the rows and columns of a table.
Table <Row var1>*<Row var2>, <Column var1>*<Column var2> ;
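A sketch of how the operators can be mixed in one TABLE statement. The small students data set below is made up purely so that the step can be run as it stands; only the variable names follow the earlier examples.

```sas
/* Made-up data, for illustration only.                           */
data work.students;
   input faculty $ sex $ exammark;
   datalines;
Science Male 83
Science Female 85
Science Male 61
Arts Male 79
Arts Female 76
Arts Female 58
;
run;

/* Rows: each faculty followed by a total row (blank operator).   */
/* Columns: N and MEAN of exammark nested within sex (asterisk).  */
/* The comma separates the row and column dimensions.             */
proc tabulate data=work.students;
   class faculty sex;
   var exammark;
   table faculty all, sex*exammark*(n mean);
   keylabel all='Total';
run;
```

The parentheses group the two statistics so that both are nested under each level of sex.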
Exercises
9.1 Use proc tabulate on the pulse data to produce
(a) the average of pulse 1 in a table of activity by sex
(b) the average difference in pulse in a table showing whether they smoke by
whether they ran or not
_____________________________________________________________
10. Functions and formats
Functions are a useful means of writing SAS code because they simplify the
coding involved and often result in you having to write fewer lines of code.
Over 120 functions are available within the SAS system. Some functions
operate on numeric values, others on character values. Some are specialised
and operate on specific types of values such as dates and times. All functions
operate on arguments, which may be variable names or specific values.
_____________________________________________________________
10.1. MEAN function
Example
data myinfo;
set info;
m_val = mean (var1, var2, var3);
run;
If the info data set consisted of the following data,
var1   var2   var3
2.5    5      1.5
6.0    3      3.0
then myinfo would be as follows:
var1   var2   var3   m_val
2.5    5      1.5    3
6.0    3      3.0    4
The mean function calculates the mean of the three variables listed.
Alternatively we could have written the expression as
m_val = mean (of var1-var3);
An important difference between the MEAN function and the expression:
m_val = (var1 + var2 + var3)/3;
is that the MEAN function returns the mean of the nonmissing values. So if
we had a missing value for var2, the function would return the mean of var1
and var3 whereas the expression above would return a missing value if any of
the var values were missing.
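The difference can be seen with a small made-up data set in which var2 is missing on the second line. This is a sketch; the data set name meandemo and the variable names m_func and m_expr are chosen for illustration.

```sas
/* Sketch: MEAN function versus an arithmetic expression.         */
data work.meandemo;
   input var1 var2 var3;
   m_func = mean(var1, var2, var3);    /* ignores missing values   */
   m_expr = (var1 + var2 + var3) / 3;  /* missing if any is missing */
   datalines;
2.5 5 1.5
6.0 . 3.0
;
run;

proc print data=work.meandemo;
run;
```

For the second observation m_func is 4.5, the mean of 6.0 and 3.0, while m_expr is missing and SAS writes a note to the Log about missing values being generated.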
10.2. NMISS function
The NMISS function returns the number of missing values in a list of
variables. This can be useful if, for example, we want to exclude observations
from a calculation where there are too many missing values. Suppose we
have recorded 50 readings for each instrument and want to compute the
mean of these 50 readings, but only for those instruments with at least 30
readings (that is, at most 20 missing values):
data myavges;
   input x1-x50;
   if nmiss (of x1-x50) le 20 then ave = mean (of x1-x50);
cards;
etc.
_____________________________________________________________
10.3. N function
The N function operates in a similar fashion, but returns the number of nonmissing values.
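A short sketch showing N and NMISS side by side on made-up data with five readings per line. The data set name ndemo and the threshold of three readings are chosen only for illustration.

```sas
/* Sketch: N counts nonmissing values, NMISS counts missing ones. */
data work.ndemo;
   input x1-x5;
   n_ok   = n(of x1-x5);
   n_miss = nmiss(of x1-x5);
   /* only average when at least 3 readings are present           */
   if n_ok ge 3 then ave = mean(of x1-x5);
   datalines;
1 2 3 4 5
1 . . . 5
;
run;

proc print data=work.ndemo;
run;
```

The second observation has n_ok=2 and n_miss=3, so ave is left missing for it.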
_____________________________________________________________
10.4. String Handling Functions
The above functions are for numeric variables and do not work with strings of
characters. Some of the most commonly used character functions are
SUBSTR (char_variable, starting_position, length)
which extracts a substring
INDEX (char_variable, index_string)
which returns the position of a substring (0 if it is not found)
VERIFY (char_variable, verify_string)
which returns the position of the first character in char_variable that is not
present in verify_string
Example
data dept;
   set mydir.jobs;
   tot = substr ('ABCDEFG', 3, 2);
   dept = substr (account, 4, 3);
   ind = index (account, 'tch');
   ver = verify (account, 'sth');
run;
If the data set mydir.jobs consisted of
account   code
spsmrk    m003
spsmrk    m005
spstch    t003
then the data set dept would consist of
account   code   tot   dept   ind   ver
spsmrk    m003   CD    mrk    0     2
spsmrk    m005   CD    mrk    0     2
spstch    t003   CD    tch    4     2
The substr function operates on character values to extract part of the
value. The structure of the substr function is substr(argument, position,
length). The argument may be a character value or a variable name, the
position gives the position from which to start reading, and the length gives
the number of characters to read.
Task
Investigate what functions are available in SAS by selecting
Help > SAS Help and Documentation
then choose SAS Products, Base SAS, SAS Language Dictionary, Dictionary
of Language Elements and Functions and CALL Routines.
10.5. Date and Time Formats
The SAS System processes calendar date values by converting dates to
integers representing the number of days between January 1 1960, and a
specified date.
For example, the following calendar date values represent the date July 26
1989:
072689
07/26/89
26JUL89
890726
26JUL1989 26 Jul 1989
The SAS date value representing July 26, 1989 is 10799.
The trick is to convert dates to numerics and back again. SAS has many date,
time and datetime informats and formats. We read the data in with date/time
informats and get them back out of SAS using date/time formats. Many of the
date/time informats are more or less the inverse of formats of the same
name.
The above dates would be read in using the following informats in the input
statement e.g.
data test;
   input var1 MMDDYY6. +1 var2 DATE7.
         +1 var3 MMDDYY8. +1 var4 DATE9.;
cards;
072689 26JUL89 07/26/89 26JUL1989
;
run;
To print them all out using different formats we reverse the process e.g.
proc print;
   format var1 DATE9. var2 MMDDYY6. var3 DATE7. var4 YYMMDD6.;
run;
The dot at the end of the informat (or format) indicates that it is an informat
(or format) and not a variable name.
Details of the different formats and informats available in SAS can be found in
the SAS System help.
Blanks and other special characters can be placed between day, month, and
year values. Width values must allow space for blanks and special
characters.
Note: SAS defaults to a date in the 1900s if yy is two digits. Use the
YEARCUTOFF= system option to override the system default and specify a
date range of your choice.
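A sketch of the effect of YEARCUTOFF=. The cutoff value 1950 is chosen only for illustration; the default window depends on the SAS release, so check the value in your own installation.

```sas
/* Sketch: YEARCUTOFF= sets the first year of the 100-year        */
/* window used to interpret two-digit years.                      */
options yearcutoff=1950;      /* window is now 1950 to 2049       */

data _null_;
   d1 = input('01JAN49', date7.);
   d2 = input('01JAN50', date7.);
   put d1= date9. d2= date9.;  /* written to the SAS Log          */
run;
```

With this setting the Log shows d1=01JAN2049 and d2=01JAN1950. The option stays in force for the rest of the SAS session unless it is changed again.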
Example
Data Lines     SAS Statement         Results
1jan1990       input day date10.;    10958
01 jan 90                            10958
1 jan 90                             10958
1-jan-1990                           10958
The TIMEw. informat reads time values in the form hh:mm:ss.ss, where hh
and mm are integers representing the hour and minute, and ss.ss is an
optional fractional field representing seconds and decimal fractions of
seconds. If you do not enter a value for seconds, SAS assumes a value of 0.
Example
Data Line      SAS Statement          Result
14:22:25       input begin time8.;    51745
The DATETIMEw. informat reads date and time values e.g. 8:30 p.m. on
May 6 1989 could be represented as 6MAY89:20:30 using DATETIME12.
Another way to specify SAS date/time values is with special constants e.g. 18
February 1951 is represented as '18FEB51'D, high noon as '12:00'T and a
moment in date and time as '1OCT82:15:27:05'DT.
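A small sketch that uses date and time constants to check the numeric values quoted earlier in this section. Nothing is read from a file; the results are written to the SAS Log.

```sas
/* Sketch: constants are stored as plain numbers.                 */
data _null_;
   d = '26JUL1989'd;           /* days since 1 January 1960       */
   t = '14:22:25't;            /* seconds since midnight          */
   put d= t=;                  /* unformatted values              */
   put d= date9. t= time8.;    /* same values with formats        */
run;
```

The first PUT statement writes d=10799 t=51745 and the second writes d=26JUL1989 t=14:22:25.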
Exercises
10.1 A character variable alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
Work out what you think the following functions will return, then check your
answers with a SAS program.
(a) index(alphabet, 'FGHIJ')
(b) index(alphabet, 'JNP')
(c) substr(alphabet, 3, 3)
(d) verify(alphabet, 'ABDEI')
10.2 Write a SAS program to find out how old you were on 1 October 1999 (in
years).
10.3 Download the SAS data set napier.SAS7BDAT from the web. This
contains data on first year Napier students in 1993/4. There are 1872 records
and 33 variables. The variables that you need to consider are
crsecd   Numeric variable giving course code
dob      Date of birth
Identify the oldest student. Which course is he/she in? _______________
Which course has the youngest students (on average)? _______________
Which courses have the largest numbers of students? _______________
In order to answer these questions you will need to
(a) create a variable giving the age in years.
(b) Sort the data set by age, storing your sorted data set in a temporary data
set. Examine the data set from the Explorer window to find the oldest
student and their course.
(c) Sort the data set by crsecd, storing your sorted data set in a temporary
data set. Use proc means to make a new temporary data set that will
contain the mean, minimum and maximum age of students on each
course. Examine your data set of means to find the course with the
youngest students.
(d) Sort the data set of means by the number of records for each course (_freq_) to find
the course with the largest number of students.
11.
Iterative processing
11.1.
Do loops and arrays
The DO statement allows us to perform an iterative loop
e.g.
do i = 1 to 5;
(lines of SAS code)
end;
would result in the lines of SAS code being repeated 5 times, with the value of
i taking on values 1,2,3,4 and 5.
i is used as the counter, 1 is the start value and 5 is the end value. The
default increment is 1. We can specify the increment
e.g.
do i = 1 to 7 by 2;
(lines of SAS code)
end;
would result in the lines of code between the do and the end statements
being repeated 4 times, with the value of i taking on the values of 1,3,5,7.
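As a minimal sketch of the loop with an increment, the following DATA step writes one observation on each pass through the loop. The data set name odds and the variable square are illustrative and do not come from the notes.

```sas
data odds;
  do i = 1 to 7 by 2;
    square = i * i;   * executed 4 times, for i = 1, 3, 5, 7;
    output;           * write one observation per iteration;
  end;
run;

proc print data=odds;
run;
```

The printed data set should contain four observations, with i = 1, 3, 5, 7 and square = 1, 9, 25, 49.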
_____________________________________________________________
11.2.
Reading data in repeated patterns
The quality control department takes 4 sample cans of oil from a production
line and weighs them, every hour for 12 hours. Each record in the raw data
contains the following fields :
hour       : the hour in which the samples were taken
weight 1-4 : weights of the four sample cans
The quality control department wants to analyse these data. The first step is
to create a SAS data set so that it contains a single observation for each
measurement taken. The DATA step must create four observations from each
record.
i.e.
first record of raw data: 1 8.024 8.135 8.151 8.065
first four observations in the data set:
HOUR    WEIGHT
1       8.024
1       8.135
1       8.151
1       8.065
The first INPUT statement reads a value from the first field and assigns it to
HOUR. The value for HOUR is the same for all four observations to be created
from the first record.
data oil1;
input hour @;
The single trailing @ sign is used to hold the current record, preventing the
next INPUT statement from reading a new record.
The next step is to read each value for WEIGHT (four in each record) and
write an observation after each is read. An iterative do loop enables us to
write a single pair of INPUT and OUTPUT statements to read a value and
write an observation multiple times.
data oil1 (drop=i);
input hour @;
do i = 1 to 4;
input weight @;
output;
end;
cards;
1 8.024 8.135 8.151 8.065
2 7.971 8.165 8.166 8.157
3 8.024 8.135 8.151 8.065
etc.
12 7.971 8.165 8.166 8.157
;
proc print;
run;
The results would be
OBS    HOUR    WEIGHT
1      1       8.024
2      1       8.135
3      1       8.151
4      1       8.065
5      2       7.971
6      2       8.165
7      2       8.166
etc.
_____________________________________________________________
11.3.
Arrays
Arrays in SAS are used as a shorthand way of processing many variables
with a few statements. An array is an ordered list of variable names. It is often
used along with a DO statement to carry out an action repeatedly on a
sequence of variables. When defining an explicit array, the ARRAY statement
must contain
• an array name
• a subscript that indicates the number of elements in the array
• a dollar sign if the array is of character variables
• a description of the elements (for example, the variables a, b, c, d and e)
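The requirements listed above can be illustrated with a minimal sketch. The array names nums and names, and the character variables forename and surname, are assumptions made for illustration only.

```sas
data _null_;
  array nums{5} a b c d e;               * numeric array of five variables;
  array names{2} $ forename surname;     * character array, so a dollar sign follows the subscript;
  do i = 1 to 5;
    nums{i} = i * 10;                    * assigns a=10, b=20, c=30, d=40, e=50;
  end;
  names{1} = 'Ann';
  names{2} = 'Lee';
  put a= b= c= d= e= forename= surname=; * write the values to the log;
run;
```

Because the step uses data _null_, no data set is created; the values appear only in the log.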
If we do not know the number of elements in an array we can use an asterisk
to define the array
e.g.
array s{*} score1 score2 score5 score8;
although this is a lot less efficient in processing time than specifying the
actual array dimensions.
Multidimensional arrays can be specified similarly
e.g.
array x{3,5} test1 - test15;
could specify an array where the first dimension is the class number and the
second is the test number (i.e. 5 tests for each of 3 classes). The elements
of this array are referred to by (for example) x{2,3}, which gives the element
in the second row and third column of the array (here test8, because the
variables fill the array row by row).
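A short sketch can be used to check which variable a two-dimensional subscript refers to. It assumes nothing beyond the array statement given above; the value 99 is arbitrary.

```sas
data _null_;
  array x{3,5} test1-test15;
  x{2,3} = 99;      * class 2, test 3;
  put test8=;       * elements are stored row by row, so this writes test8=99 to the log;
run;
```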
_____________________________________________________________
Example
The following program recodes missing scores in a test to 0
data results;
infile class6;
input id age score1-score5;
if score1 = . then score1 = 0;
if score2 = . then score2 = 0;
if score3 = . then score3 = 0;
if score4 = . then score4 = 0;
if score5 = . then score5 = 0;
run;
This can be rewritten using arrays as follows
data results;
infile class6;
input id age score1 - score5;
array ss(5) score1-score5;
do I = 1 to 5;
if ss(I) = . then ss(I) = 0;
end;
drop I;
run;
The reduction in code is not very much with 5 scores but if we had 150 scores
arrays would be much more efficient.
_____________________________________________________________
11.4.
Generating random numbers
In the data step examples we have examined so far, the SAS system has
read a single record during the data step and written it to the output data set
at the end of the data step. The SAS instruction to write a record to a data set
is
OUTPUT;
The variables to be saved can be restricted with a separate KEEP statement,
KEEP vars;
where vars stands for the variables to be saved.
If you write a SAS data step that does not contain an OUTPUT statement,
then SAS will assume that you want a record output at the end of the data
step. If your program does contain an OUTPUT statement, then SAS will write
a record at this point in the program and not at the end of the data step.
The OUTPUT statement can be used to write a program that will produce a
series of 100 records each containing a different uniform random number with
this code. The seed 2762 can be replaced with any number you like.
data random;
do i = 1 to 100;
x = ranuni(2762);
output;
end;
keep x;
run;
The keep statement is optional; if it is not included all variables will be saved
to the file.
In generating random numbers, the ones at the beginning of a sequence can
sometimes not be very random for certain choices of seed (check this from
your own sequence), but they are generally OK after 500 numbers or so. To
make sure that your sequence will be OK, use the following code at the
beginning of any program that generates random numbers.
do I = 1 to 500;
x = ranuni (1279);
* or any number as a seed;
end;
This will give the random number generator a whirl to ensure it is running
smoothly.
_____________________________________________________________
11.5.
Random numbers from a uniform distribution
In order to be able to select a random sample or randomly assign subjects to
groups we can use SAS functions to generate random numbers. The two
functions UNIFORM(0) and RANUNI(0) generate uniform random numbers in
the range from 0 to 1. Random number generators require an initial number,
called a seed, which they use to calculate the first random number. This
number is then used to generate the next and so on. For both of these
functions, a seed of zero will cause the function to use a seed derived from
the time clock, thus generating a different series each time it is used. The
RANUNI function can be seeded with any number, but the UNIFORM function
must be seeded with a 5, 6 or 9 digit odd number. In either case, if you supply
the seed, the function will generate the same series of random numbers each
time. In order to generate a series of random numbers from 1 to 100, we
could use
X = 1 + 99*RANUNI (0)
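As a sketch of how this expression might be used, the DATA step below generates ten values. The variable n, which gives whole numbers from 1 to 100 using the CEIL function, is an addition for illustration and is not part of the notes; the data set name sample is also an assumption.

```sas
data sample;
  do i = 1 to 10;
    x = 1 + 99 * ranuni(0);       * real value between 1 and 100;
    n = ceil(100 * ranuni(0));    * whole number from 1 to 100;
    output;
  end;
  drop i;
run;

proc print data=sample;
run;
```

With a seed of 0 the values should differ on every run; replacing 0 with a positive seed makes the series repeatable.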
_____________________________________________________________
11.6.
Random numbers from a normal distribution
To generate random numbers from a normal distribution with a mean of 0 and
a standard deviation of 1 we can use the RANNOR function which works in a
similar way to RANUNI.
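A standard normal value can be rescaled to any mean and standard deviation by multiplying and adding. The sketch below assumes a target mean of 50 and standard deviation of 10, chosen purely for illustration; the data set name normal is also an assumption.

```sas
data normal;
  do i = 1 to 100;
    z = rannor(2762);     * standard normal with mean 0 and standard deviation 1;
    x = 50 + 10 * z;      * rescaled to mean 50 and standard deviation 10;
    output;
  end;
  drop i;
run;

proc means data=normal mean std;
  var z x;
run;
```

The sample means and standard deviations reported by proc means should be close to, but not exactly equal to, the target values.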
_____________________________________________________________
11.7.
The SAS Program Data Vector
While at work on a DATA step, the SAS System maintains a temporary data
structure in computer memory called the SAS program data vector, or PDV. The
program data vector represents one observation, a data row, and can be
thought of as a linear set of boxes in which values of SAS variables can be
contained. Unlike a SAS data set, which survives between steps, the PDV is a
dynamic entity that is created during a DATA step, and goes away after the
step has completed execution.
Constructing the PDV is one of the DATA step compiler’s first jobs as it
passes through DATA step source code. The compiler looks at all the SAS
statements in the step’s source to find out what variables are named - in
INPUT, attribute, or other statements - and creates a PDV with space for
each variable’s length.
There are a couple of automatic system variables also in the program data
vector: the variables _N_ and _ERROR_. These are maintained by the
compiler and may be accessed by the program, though they do not get written
to the new data set(s). _N_ contains a count of how many times the DATA
step has begun execution from the top (i.e. the number of records), and
_ERROR_ is set to 1 (true) when a data error occurs. When a DATA
step is executing, each time it begins another iteration the values of variables
to be created by INPUT or by assignment in the PDV are initialised to missing
(unless a RETAIN or Sum statement has been used), _N_ is incremented, and
_ERROR_ is set to zero. When the DATA step returns to the top for the next
iteration, the PDV is reinitialised and the process repeats.
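The contents of the PDV can be inspected by writing them to the log with a PUT statement. In this minimal sketch the data set name pdvdemo and the variables x and y are illustrative; the second data line is deliberately invalid so that _ERROR_ should be set to 1 on that iteration.

```sas
data pdvdemo;
  input x;
  if _n_ = 1 then put 'First iteration of the DATA step';
  y = 2 * x;
  put _n_= _error_= x= y=;   * write the current contents of the PDV to the log;
datalines;
5
abc
7
;
run;
```

The log should show _N_ rising from 1 to 3, and _ERROR_ returning to 0 on the third iteration.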
_____________________________________________________________
11.8.
The RETAIN and Sum statements
Normally, variables in the PDV that are named with assignment or with INPUT
statements are initialised to missing each time the DATA step begins a new
iteration. The RETAIN statement,
RETAIN [<variables [value]> ....];
causes variables to keep their values from the previous iteration at
initialisation. They can still be changed if INPUT reads a new observation, or
when an assignment statement (including the Sum statement ) is executed.
The RETAIN statement can only be applied to ‘new’ variables, that is, ones
which are being created within the DATA step. If a constant value is specified
after the variable list, that value is given at the first iteration; otherwise,
the variables start with a missing value.
Example
Given the SAS data set history :
Year    Month    No_sold
1990    1        10
1990    2        12
1990    3        8
1990    4        6
1990    5        9
the program :
data changes;
retain no_last;
set course.history;
compare = no_sold - no_last;
no_last = no_sold;
run;
produces the PDV
_N_    _ERROR_    YEAR    MONTH    NO_SOLD    COMPARE    NO_LAST
1      0          1990    1        10         .          10
2      0          1990    2        12         2          12
3      0          1990    3        8          -4         8
4      0          1990    4        6          -2         6
5      0          1990    5        9          3          9
The Sum statement is a special type of assignment statement,
provided as a convenience for incrementing variables during the DATA step.
The statement
a + 7;
is identical in action to the statements
RETAIN A 0;
A = SUM(A, 7);
i.e. 7 is added to the previous value of a, which starts at zero.
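A common use of the Sum statement is a running total. The sketch below assumes the data set course.history from the earlier example (so the libref course must already be assigned); the data set name totals and the variable total are illustrative.

```sas
data totals;
  set course.history;
  total + no_sold;     * running total of no_sold: 10, 22, 30, 36, 45;
run;

proc print data=totals;
run;
```

No RETAIN statement is needed here, because a variable named in a Sum statement keeps its value between iterations and starts at zero.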
Exercises
11.1 Results of a survey are recorded for 1996 and 1997. However an extra
question was asked in 1997. Create a SAS data set from the following data
showing year and the answers to the questions.
1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8
11.2 Rewrite the following program using arrays:
data test;
input a b c x1-x3 y1-y3;
if a = 999 then a = .;
if b = 999 then b = .;
if c = 999 then c = .;
if x1 = 999 then x1 = .;
if x2 = 999 then x2 = .;
if x3 = 999 then x3 = .;
if y1 = 999 then y1 = .;
if y2 = 999 then y2 = .;
if y3 = 999 then y3 = .;
datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
run;
11.3
A chi-square random variable with k degrees of freedom is the sum of the
squares of k independent standard normal random variables. Generate and plot
the frequency distributions of chi-squared variables with 5, 10 and 30 degrees
of freedom.
12.
Further topics
12.1.
Combining Data Sets
12.1.1.
Concatenating Data Sets
To concatenate data sets means to combine similar data sets into a single
new data set. In its simplest form the original data sets contain the same
variables and the combined data set will contain the original data sets ‘one on
top of the other’.
Suppose SAS data set wood08 contains values for the five variables id
(alpha-numeric), age, diameter, height and variety (alpha-numeric)
obtained from a survey of trees in a wood having reference number 08.
Suppose further that similar results are available in SAS data sets wood48
and wood69. The following code will combine the three data sets into a new
SAS data set woodcom (in the order 08, followed by 48, followed by 69).
data woodcom;
set wood08 wood48 wood69;
run;
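If the original data sets contain different variables then the combined data set
will contain all of the variables, and each observation will have missing values
for any variable that was not present in the data set it came from.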
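The effect of concatenating data sets with different variables can be seen in a small sketch. The data sets a, b and both, and the data values, are invented for illustration only.

```sas
data a;
  input id $ age;
datalines;
T1 40
T2 55
;
run;

data b;
  input id $ height;
datalines;
T3 12.5
;
run;

data both;
  set a b;    * observations from a first, then observations from b;
run;

proc print data=both;
run;
```

The combined data set should have three observations, with height missing for T1 and T2 and age missing for T3.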
12.1.2.
Merging Data Sets
We merge data sets when we combine data sets containing different
information. In its simplest form (match-merging) we combine the data sets
on the basis of a common variable which typically identifies each case or row.
Consider again SAS data set wood08 containing values for the five variables
id (alpha-numeric), age, diameter, height and variety (alpha-numeric).
Suppose a related SAS data set woodrs08 contains values for the following
variables: id (alpha-numeric) plus numerical scores for damage, bark_dep
and condit.
We will merge the two data sets into a combined data set having variables id,
age, diameter, height, variety, damage, bark_dep and condit.
Matching will be by the variable id (termed the ‘BY variable’). However, we
must sort both data sets by the BY variable. This is illustrated in the following
code.
proc sort data=wood08;
by id;
run;
* Sort the first data set by id ;
proc sort data=woodrs08;
by id;
run;
* Sort the second data set by id ;
data woodall08;
* Merge the two data sets using id as a key ;
merge wood08 woodrs08;
by id;
run;
The MERGE operation will combine cases having the same values for id in
data sets wood08 and woodrs08. If values exist for id that are not
included in both data sets then missing values for the relevant variables will
be inserted.
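If only the cases present in both data sets are wanted, the IN= data set option can be used to record where each observation came from. This is a sketch rather than part of the original notes: the temporary variable names in_main and in_scores are assumptions, and both data sets are assumed to be already sorted by id.

```sas
data woodall08;
  merge wood08 (in=in_main) woodrs08 (in=in_scores);
  by id;
  if in_main and in_scores;   * keep only trees present in both data sets;
run;
```

The IN= variables exist only during the DATA step and are not written to the output data set.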
Exercises
12.1. The ASCII data set mod273ft.txt contains results from a module
taught to full-time students with values for student name (alphanumeric), CW1, CW2, CW3, combined coursework and exam. Data
set mod273pt.txt contains the same information for part-time
students. The data sets are available on the module web page and
should be examined with Notepad or WordPad.
(a)
Write a program which reads in the two ASCII data sets and
combines them into a single permanent SAS data set with
suitable variable names. Confirm that the data set has been
constructed correctly.
(b)
Modify your program so that the combined data set contains a
new variable indicating which group (full-time or part-time) each
student belongs to.
12.2. The ASCII data set mod273ptss.txt gives background information on
the statistical software that is used at work by the part-time students.
Values are given for name, excel, sas, spss (all alpha-numeric,
software results are Y or N).
Write a program which reads in the two data sets mod273pt.txt and
mod273ptss.txt and merges them into a single permanent SAS data
set. Confirm that the data set has been constructed correctly.
12.2.
Hints on Using Word with SAS and SAS/INSIGHT
12.2.1.
Copying a Selection from the Output Window
(WINDOWS)
In SAS Output Window
Highlight text
Edit → Copy
In Word document
Edit → Paste
Note: In the event of difficulty use
Ctrl/C for Copy
Ctrl/V for Paste
12.2.2.
Saving the Whole Output Window
In SAS Output Window
File → Save As
Choose directory and file name
The automatic file extension is .lst
Word and other text editors have no difficulty opening or inserting a list file.
12.2.3.
Choice of Font within Word Document
For tables of figures you are recommended to use a monospace font, i.e. one
that has a constant width for all characters. Arial and Times New Roman are
not monospace fonts.
Examples are
SAS Monospace (12 point)
1 2 3 4 5
(You may need SAS running to get this font.)
SAS Monospace (10 point)
1 2 3 4 5
Courier New (10 point)
1 2 3 4 5
Courier New (12 point)
1 2 3 4 5
To avoid tables ‘wrapping round’ you could remove unnecessary spaces to
the left or reduce the size of the font.
It can be effective to use different fonts for different parts of the documents.
For example you might use a standard font like Arial or Times New Roman for
text and a monospace font for tables. Monospace fonts might also be used
for file names etc.
12.2.4.
Copying from a Graphics Window (WINDOWS)
In SAS Output Window
Edit → Copy
In Word document
Edit → Paste Special
Choose Device Independent Bitmap
To reduce the file size
Edit → Paste
Followed by Edit → Cut;
Edit → Paste Special
Choose png or jpeg
When you try to reposition the picture you may find that the picture jumps
around the document in an uncontrollable manner. This can be eliminated by
adding dummy returns that will lie under the picture. Next, choose the ‘In front of
text’ wrapping style from the Layout tab of the Format Picture dialogue box (right
click on the picture to select the format option). The dummy returns must be
held together using paragraph formatting and the picture anchor locked onto
the dummy returns. This is shown in the diagram below.
[Figure: a picture placed over dummy returns. Annotations: picture anchor locked (Advanced setting in the Format Picture dialogue); enter returns, select all, then Format → Paragraph, Keep lines together + Keep with next.]
Figure 14 Picture control using ‘In front of text’ wrapping style.
12.2.5.
Copying and Pasting from SAS/INSIGHT
Tables
Tables can be saved in graphics form or as text (recommended).
In SAS/INSIGHT Analysis Window
Click on ‘arrow’ at top of table
Choose Save
The table (as text) is put into the base SAS Output Window.
Copy from Output Window as described above.
See below for saving table in graphics form.
Graphs (WINDOWS)
In SAS/INSIGHT Analysis Window
Click on border of graph (or table)
Edit → Copy
In Word document
Edit → Paste Special
Choose Device Independent Bitmap
Note: If you have highlighted points on your graph hold down Ctrl when you
click on the border.
12.2.6.
Left Alignment of SAS Output
Copying and pasting from the SAS Output Window is easier if the output is
already aligned to the left. The following option will ensure that all future
output is left aligned:
options nocentre;
run;
13.
Solutions to exercises
1.1
data ELEVEN;
input ID $ height weight;
ratio=weight/height;
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
run;
/* Data came from Exercise 1.1*/
proc means;
run;
1.2
...
bmi = 100*weight/height;
…
1.3
data FORTH;
input site $ salinity phos nitrogen chloro faecal_c;
phos2=1000*phos;
datalines;
CR 30.11 0.068 0.297 1.693 2.917
WG 31.48 0.059 0.165 1.464 3.149
EG 31.79 0.068 0.144 1.100 3.196
SF 31.37 0.185 0.278 1.787 3.418
PB 31.50 0.116 0.223 2.099 3.049
JO 31.60 0.106 0.207 1.067 2.903
SS 30.50 0.047 0.162 1.563 2.895
FN 31.96 0.060 0.130 0.753 2.797
run;
proc means;
run;
2.1
libname week2 'c:\sas\sasdata';
data week2.eleven;
input ID $ height weight;
ratio=weight/height;
/*Program must be cut and pasted from word document*/
datalines;
59 135 25
82 146 33
27 153 56
52 154 51
55 139 31
13 131 25
01 149 43
15 137 32
71 133 30
78 149 35
12 141 33
37 164 48
28 146 37
48 149 45
69 147 36
16 152 47
run;
2.2 (b)
libname week2 'c:\sas\sasdata';
data week2.pulse;
infile 'c:\sas\sasdata\pulse.dat';
input pulse1 pulse2 ran $ smokes $ sex $ height weight activity $;
proc print ;
run;
2.4 (b)
libname mydir 'c:\temp';
data mydir.ex2_4;
infile 'c:\temp\ex2_4.prn';
input size $ colour $ price cost;
run;
proc print data = mydir.ex2_4;
var colour size price;
run;
2.4 (c)
libname mydir 'c:\temp';
data mydir.ex2_4;
infile 'c:\temp\ex2_4.prn';
input size $ 1-8 colour $ 9-19 price 20-24 cost 25-32;
run;
proc print data = mydir.ex2_4;
var colour size price;
run;
2.4 (d)
libname mydir 'c:\temp';
data mydir.ex2_4;
infile 'c:\temp\ex2_4.prn';
input @1 size $8. @9 colour $11. @20 price 5.2 @29 cost 4.2;
run;
proc print data = mydir.ex2_4;
var colour size price;
run;
2.5
libname mydir 'c:\temp';
data mydir.houses;
infile 'c:\temp\houses.dat';
input style 1 sqfeet 3-6 bedroom 8 baths 10-12 price 14-19;
run;
proc print;
run;
2.6
libname mscsas 'c:\kirsty\mscsas';
data cars;
infile 'd:\kirsty\temp\cars.prn' firstobs = 2 ;
input mpg 1-8
cylndrs 9-16
displace 17-24
hrsepwr 25-33
accel 34-41
year 42-49
weight 50-57
origin 58-65
make $ 66-75
model $ 76-89
price 90-93;
run;
4.1(a)
options nocentre;
libname unit4 'c:\sas\sasdata';
proc sort data=unit4.pulse out=smokes;
by smokes ran;
proc means maxdec=1;
var pulse2;
by smokes ran;
run;
(b)
options centre;
proc means data=smokes alpha=0.05 maxdec=1 clm;
var pulse2;
by smokes ran;
run;
(c) and (d)
proc univariate data=smokes plot;
var pulse2;
by smokes ran;
proc freq data=smokes;
tables smokes*ran;
run;
4.2
proc sort data = sashelp.retail out = sorted;
by year;
proc means maxdec=2 N mean std;
var sales;
by year;
output out=summary
mean=mean_sls;
run;
proc print data = summary;
run;
5.1
proc plot data=unit5.eleven;
plot weight*height='+';
run;
5.2
proc plot data=unit5.pulse;
plot pulse2*weight=sex;
run;
proc chart data=unit5.pulse;
pie activity;
run;
proc chart data=unit5.pulse;
hbar ran/sumvar=pulse2 type=mean;
run;
proc chart data=unit5.pulse;
vbar activity/type=pct group=sex;
run;
proc chart data=unit5.pulse;
block activity/group=smokes sumvar=pulse1 type=mean;
run;
6.1
libname unit6 'c:\sasdata';
run;
proc plot data=unit6.beetles;
plot area*mlength;
run;
proc reg data=unit6.beetles;
model area=mlength;
output out=regout
student=stdresid;
plot student.*mlength;
run;
proc univariate data=regout plot normal;
var stdresid;
run;
8.1
libname unit8 'c:\sasdata';
run;
data mpulse (drop=sex height weight);
set unit8.pulse (where=(sex='1'));
label pulse1='First pulse rate'
pulse2='Second pulse rate';
run;
proc print data=mpulse label;
run;
8.2
Note that the formats must correspond to the types of the variables in the data set pulse, i.e.
numeric variables need numeric formats and character variables need character formats (whose
names start with $). Numeric values are written without quotes.
libname library 'c:\sasdata';
run;
proc format library=library;
value $ran
'1'='ran in place'
'2'='did not run in place';
value $smokes
'1'='smokes regularly'
'2'='does not smoke regularly';
value $sex
'1'='male'
'2'='female';
value $activity '1'='slight'
'2'='moderate'
'3'='a lot';
run;
proc freq data=unit8.pulse;
tables sex*smokes smokes*activity sex*ran/nocol norow nocum nopercent;
format sex $sex. ran $ran. smokes $smokes. activity $activity.;
run;
8.3
proc format library=library;
value pulsrate low-76=1
77-high=2;
run;
proc freq data=unit8.pulse;
tables pulse1*pulse2/nocol norow nocum nopercent chisq;
by ran;
format pulse1 pulse2 pulsrate.;
run;
9.1 (a)
proc format;
value sex
1 = 'male'
2 = 'female';
value activity
1 = 'slight'
2 = 'moderate'
3 = 'a lot';
run;
proc tabulate data = mydir.pulse;
class sex activity;
var pulse1;
table activity,sex * pulse1*mean;
format sex sex. activity activity.;
title 'Average pulse without exercise';
run;
9.1(b)
proc format;
value smokes
1 = 'smokes regularly'
2 = 'does not smoke regularly';
value ran
1 = 'ran in place'
2 = 'did not run in place';
run;
data mydir.pulsev2;
set mydir.pulse;
pulsedif = pulse2 - pulse1;
run;
proc tabulate data = mydir.pulsev2;
class smokes ran;
var pulsedif;
table smokes, ran * pulsedif*mean;
format smokes smokes. ran ran.;
title 'Average difference in pulse by whether they smoke and/or ran';
run;
10.1
data test;
input alphabet $26.;
pt1 = index(alphabet,'FGHIJ');
pt2 = index(alphabet,'JNP');
pt3 = substr(alphabet,3,3);
pt4 = verify(alphabet,'ABDEI');
datalines;
ABCDEFGHIJKLMNOPQRSTUVWXYZ
run;
proc print;
run;
10.2
data ageatoct;
input dob YYMMDD6.;
date = '01OCT99'D;
ageatoct = (date-dob)/365;
datalines;
740625
run;
10.3
libname mydisc 'd:\kirsty\temp';
* first set up a variable for age;
data mydisc.napier2;
set mydisc.napier;
enddate = '31DEC93'D; * sets up a variable as a date constant;
age = (enddate - dob)/365.25;
run;
* create a temporary SAS data set called agesort, sorted by age
and examine data set to identify oldest student;
proc sort data = mydisc.napier2 out = agesort;
by age ;
run;
* create a temporary SAS data set called napsort, sorted by course;
proc sort data = mydisc.napier2 out = napsort;
by crsecd;
run;
* create a temporary SAS data set called means containing the mean, max and
min ages by course. Examine the data set to find which course has the
youngest students (on average) Anything strange?! ;
proc means data = napsort;
var age;
by crsecd;
output out = means mean = mnage max = maxage min = minage ;
run;
* sort by number of records for each course and examine data set to find the
course with the largest number of students;
proc sort data = means;
by _freq_;
run;
11.1
data survey;
input year @ ; * hold the line;
if year = 1996 then input q1-q6;
else if year = 1997 then input q1-q7;
datalines;
1996 4 8 3 5 6 5
1996 5 7 4 5 6 8
1997 3 5 4 7 4 5 6
1997 5 3 4 3 6 7 8
;
run;
proc print;
run;
11.2
data test;
input a b c x1-x3 y1-y3;
array tt (9) a b c x1-x3 y1-y3;
do i = 1 to 9;
if tt(i) = 999 then tt(i) = .;
end;
drop i;
datalines;
3 5 2 7 5 999 2 5 999
2 9 4 7 2 4 999 4 999
3 8 3 0 3 2 999 7 1
run;
proc print;
run;
11.3
data chisq;
n = 5; * degrees of freedom required;
retain chi;
* discard the first 500 numbers to ensure numbers are random;
do i = 1 to 500;
x = rannor(3059);
end;
* generate 500 chi-squared values;
do j = 1 to 500;
chi = 0;
do i = 1 to n;
x = rannor(3059);
chi = chi + x*x;
end;
output;
end;
keep chi;
run;
proc gchart;
vbar chi;
run;
12.1.(b)
/* concatmod.sas */
libname xyz 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modft;
infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ft.txt';
input name $ 1-16 cw1 cw2 cw3 com_cw exam;
group = 'FT';
run;
data modpt;
infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
input name $ 1-16 cw1 cw2 cw3 com_cw exam;
group = 'PT';
run;
data xyz.mod273com;
set modft modpt;
run;
proc print data=xyz.mod273com;
run;
12.2.
/* mergemod.sas */
libname abc 'c:\Documents and Settings\mp12\My Documents\sasfiles\sasdata8';
data modpt;
infile 'c:\Documents and Settings\mp12\My Documents\data\mod273pt.txt';
input name $ 1-16 cw1 cw2 cw3 com_cw exam;
run;
proc sort data=modpt;
by name;
run;
data modptss;
infile 'c:\Documents and Settings\mp12\My Documents\data\mod273ptss.txt';
input name $ 1-16 excel $ sas $ spss $;
run;
proc sort data=modptss;
by name;
run;
data abc.modptcom;
merge modpt modptss;
by name;
run;
proc print data=abc.modptcom;
run;
proc univariate data = pulse;
var before;
histogram before / midpoints = 0 to 200 by 10;
title 'Histogram Pulse Data';
run;