Introduction to SAS ® by Kaz Download from www.src.uchicago.edu/users/ueka
Introduction to SAS® Version 1.4 updated 9/29/2002
by Kazuaki Uekawa, Ph.D.
Visiting Scholar, The Department of Sociology, The
University of Chicago; Population Research Center at NORC; Address: 1155 E. 60th. St,
Room 340, Chicago, IL 60637
www.src.uchicago.edu/users/ueka
kuekawa@alumni.uchicago.edu
Copyright © 2002 By Kazuaki Uekawa All rights reserved.
Table of Contents
I. Introduction
II. How to start?
III. LIBNAME: Assigning library name
IV. Create SAS data for a practice
V. Creating New Variables
VI. Procedures
   A. PROC CONTENTS: Description of Contents
   B. PROC PRINT: See Data
   C. PROC SORT: Sorting Observations based on a value of variable
   D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
   E. PROC FREQ: Get Frequencies
   F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
   G. PROC PLOT: Plotting Two Variables
   H. PROC TIMEPLOT: Time Plot
   I. PROC CORR: Correlation
   J. PROC REG: OLS Regression
   K. PROC LOGISTIC: Logistic Regression
   L. MAKE AN ASCII FILE
VII. More Procedures
   M. PROC STANDARD: Standardize Values
   N. PROC RANK: Rank observations
   O. PROC SQL: Creating group-level mean variables
VIII. Merging Data Sets
IX. Temporary and Permanent Data Sets
I. Introduction
I recommend SAS® over other statistical packages because:
a) ODS (Output Delivery System) allows users to save statistical results as data. A
user can build tables from the result data set within one single program (as opposed to
printing out the results on paper and using Excel to finish the tables). The table can be as
sophisticated as
http://www.src.uchicago.edu/users/ueka/SAS/proc_mixed_example1output.txt and
it can be further saved in Excel format using PROC EXPORT.
b) A rich array of macro functions.
c) Email support service with quick response:
support@sas.com
d) Users come from many fields, including the social and natural sciences as well as
business. Thus, SAS® programming skill can be an asset in the job market.
I discuss both ODS and macros in Introduction to SAS 2, which is available from the
same website.
Idiosyncrasy of this document
I wrote this document on a Japanese PC, where the backslash is displayed as ¥.
If you see ¥ in a path anywhere in this document, type a backslash (\) instead.
U. of Chicago People can access SAS on-line on the web!
SAS On-line for version 8
http://gsbapp2.uchicago.edu/sas/sashtml/main.htm
Note on SAS email support:
When you email SAS support with a question, you need to identify yourself as a legitimate
SAS customer. Look at the head of a log file and copy and paste the information at the
beginning of your email text.
NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2M0)
Licensed to UNIVERSITY OF XXXXX, Site XXXXX.
NOTE: This session is executing on the WIN_ME platform.
II. How to start?
1. Start SAS. You can find the shortcut under START > PROGRAM > The SAS System.
2. Type syntax in the EDITOR window. Syntax is what you learn in this document.
3. Click on the runner icon to run the program. Alternatively, you can highlight the part of
the syntax that you want to run and then click the runner to run only that part. (The
downside of using UNIX instead of WINDOWS is that UNIX does not let you do this
selective run.)
The LOG file contains messages. Watch for the words error and warning.
The OUTPUT file contains output.

If you mistype syntax and want to undo it, press Control-Z. This is the same
command that is used in Microsoft Office products.

To cancel a run while it is in progress, click on the stop icon (which looks
like “!”) right next to the runner icon.
III. LIBNAME: Assigning library name
Typing full path names for directories (e.g., C:\temp\abc\old) is too tedious, so we
give nicknames to them at the beginning of a program.
libname here "C:\TEMP";
libname there "C:\";
So from now on,
here.abc means the data set named “abc” placed in the directory nicknamed “here.”
there.xyz means the data set named “xyz” placed in the directory nicknamed “there.”
IV. Create SAS data for a practice
Description of Practice Data
The data come from TIMSS (Third International Mathematics and Science Study), in
which some 40 nations participated with three population groups (3rd and 4th graders,
7th and 8th graders, and high school seniors). I aggregated the data at the national level.
The variables are:
acro: acronym for the participant nation
nation: short name of the country
name: complete name of the country
mat8: 8th graders’ average math test score
mat7: 7th graders’ average math test score
GNP14: GNP per capita
prop: proportion of 8th graders in schooling
NATEXAM: administers a national-level exam
NATSYLB: syllabus is decided at the national level
NATTEXT: text is chosen at the national level
block: regional block of the nation (e.g., weuro, seasia, namer)
libname here "C:\TEMP";
libname there "C:\";
data kaz;
/*NATION is read from columns 6-14 and NAME from columns 15-33 of each data line*/
input acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
cards;
aus  Australi Australia           498 529.63 -0.15526 84 0 1 0 ocea
aut  Austria  Austria             509 539.43 -0.29163 100 0 0 1 weuro
bfl  Belgi_FL Belgium (Fl)        558 565.18 -0.25157 100 1 1 0 weuro
bfr  Belgi_FR Belgium (Fr)        507 526.26 -0.25157 100 0 1 0 weuro
can  Canada   Canada              494 527.24 0.07184 88 0 0 0 namer
col  Colombia Colombia            369 384.76 -0.23699 62 0 1 0 samer
cyp  Cyprus   Cyprus              446 473.59 -0.41906 95 0 1 1 seuro
csk  Czech    Czech Republic      523 563.75 -0.34840 86 0 1 0 eeuro
dnk  Denmark  Denmark             465 502.29 -0.34057 100 1 0 0 weuro
fra  France   France              492 537.83 0.55791 100 0 1 0 weuro
deu  Germany  Germany             484 509.16 0.91992 100 0 0 0 weuro
grc  Greece   Greece              440 483.90 -0.32620 99 0 1 1 seuro
hkg  HongKong Hong Kong           564 588.02 -0.31638 98 1 1 1 seasia
hun  Hungary  Hungary             502 537.26 -0.37602 81 0 0 0 eeuro
isl  Iceland  Iceland             459 486.78 -0.42606 100 0 0 0 neuro
irn  Iran     Iran, Islamic Rep.  401 428.33 -0.17095 66 0 1 1 meast
irl  Ireland  Ireland             500 527.40 -0.38919 100 1 1 0 weuro
isr  Israel   Israel              . 521.59 -0.35464 87 0 1 0 meast
jpn  Japan    Japan               571 604.77 1.85543 96 0 1 0 seasia
kor  Korea    Korea               577 607.38 -0.01168 93 0 1 1 seasia
kwt  Kuwait   Kuwait              . 392.18 -0.40359 60 0 1 1 meast
lva  Latvia   Latvia (LSS)        462 493.36 -0.42319 87 0 0 0 eeuro
ltu  Lithuani Lithuania           428 477.23 -0.41785 78 1 1 1 eeuro
nld  Netherla Netherlands         516 540.99 -0.18184 93 1 0 0 weuro
nzl  NewZeala New Zealand         472 507.80 -0.38319 100 1 1 0 ocea
nor  Norway   Norway              461 503.29 -0.35450 100 0 1 1 neuro
prt  Portugal Portugal            423 454.45 -0.32588 81 0 1 0 weuro
rom  Romania  Romania             454 481.55 -0.35396 82 1 1 1 eeuro
rus  RussianF Russian Federation  501 535.47 0.12827 88 1 0 0 eeuro
sco  Scotland Scotland            463 498.46 0.48017 100 0 0 0 weuro
sgp  Singapor Singapore           601 643.30 -0.37279 84 1 1 1 seasia
slv  SlovakRe Slovak Republic     508 547.11 -0.40217 89 0 1 0 eeuro
svn  Slovenia Slovenia            498 540.80 -0.41310 85 0 1 1 eeuro
esp  Spain    Spain               448 487.35 0.03461 100 0 1 1 weuro
swe  Sweden   Sweden              477 518.64 -0.30049 99 0 1 0 neuro
che  Switzerl Switzerland         506 545.44 -0.27916 91 0 0 0 weuro
tha  Thailand Thailand            495 522.37 -0.14533 37 0 1 1 seasia
usa  USA      United States       476 499.76 5.37506 97 0 0 0 namer
;
run;
/*this prints out the data*/
proc print;
run;
Advanced Topic:
Alternatively, you can save the data above (just the data lines) as a plain text file named
kaz.txt in the temp directory of your C drive. (In case you only have this document as a
hard copy, visit www.src.uchicago.edu/users/ueka for a digital version of this document,
so you can copy and paste.) Then use the program below to read in the file.
/*these two lines are not crucial in this example, but let's just put these at the beginning of
your program*/
libname here "C:\TEMP";
libname there "C:\";
data kaz;
infile "C:\TEMP\kaz.txt" missover;
input acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
run;
MISSOVER tells SAS what to do when a data line ends before every variable in the INPUT
statement has received a value: the remaining variables are set to missing, and SAS does
not jump to the next line to look for the values. It is safe to use it.
$ means that the variable named just before it is a character variable as opposed to numeric.
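To see what MISSOVER does, here is a minimal sketch that reads a few made-up data lines in-line with DATALINES instead of an external file. The data set name demo_missover and the variables id, x, and y are hypothetical and not part of the TIMSS data. The second data line is one value short on purpose.

```sas
data demo_missover;
  /* MISSOVER: if a line runs out of values, set the remaining variables to missing */
  infile datalines missover;
  input id x y;
  datalines;
1 10 20
2 30
3 50 60
;
run;

proc print data=demo_missover;
run;
```

With MISSOVER the data set has three observations, and y is missing for id=2. Without it, SAS would move on to the next line to look for y, so the 3 at the start of the third line would be read as y for id=2 and the third line would not become an observation of its own.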
V. Creating New Variables
Data kaz2;
set kaz;
/*ADDITION*/
var1=mat7+mat8;
/*OR*/
var2=sum(of mat7 mat8);
/*SUBTRACTION*/
var3=mat8-mat7;
/*MULTIPLICATION*/
var4=mat7*mat8;
/*DIVISION*/
var5=mat7/mat8;
/*Use brackets effectively*/
var6=1/(mat7+mat8);
/*MEAN of several variables*/
var7=mean(of mat7 mat8);
/*MAX of several variables*/
var8=max(of mat7 mat8);
/*MIN of several variables*/
var9=min(of mat7 mat8);
/*LOG: a value to enter must be positive*/
var10=log(mat7);
/*Absolute values: this takes out negative signs*/
var11=abs(gnp14);
run;
/*TO SEE WHAT YOU DID, USE PROC PRINT*/
proc print data=kaz2;
title "Lots of manipulations: See results";
var mat7 mat8 var1 var2 var3 var4 var5
var6 var7 var8 var9 var10 var11;
run;
Advanced Topics:
How is Z=mean(of X1 X2 X3) different from Z=(X1+X2+X3)/3;?
How is Z=sum(of X1 X2 X3) different from Z=X1+X2+X3;?
Functions such as mean(of …) or sum(of …) compute their statistics from the non-missing
values. They return a value even when some of the variables in the brackets are missing.
For example, if X1 is missing:
Z=mean(of X1 X2 X3); will return the average of X2 and X3.
In contrast,
Z=(X1+X2+X3)/3; will return a missing value, namely, “.”
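The difference can be checked with a one-observation data set. This is only a small sketch; the data set name demo_missing and the values 4 and 6 are made up for illustration.

```sas
data demo_missing;
  x1 = .;   /* missing on purpose */
  x2 = 4;
  x3 = 6;
  z_func  = mean(of x1 x2 x3);   /* uses the non-missing values only */
  z_arith = (x1 + x2 + x3) / 3;  /* any missing operand makes the result missing */
  s_func  = sum(of x1 x2 x3);
  s_arith = x1 + x2 + x3;
run;

proc print data=demo_missing;
run;
```

Here z_func is 5 and s_func is 10, while z_arith and s_arith are both missing. The log also carries a note that missing values were generated, which is a useful warning sign when it appears unexpectedly.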
Read this after you study PROC REG later in the document.
When we compare several regression models (e.g., coefficients, R2, goodness-of-fit, etc.),
we want to keep the number of observations the same across the models. Because
predictors may have different patterns of missing values, this does not happen
automatically; you have to make it happen. For example, mat7, the 7th graders’
mathematics score, has some missing cases, because some nations only let their 8th
graders participate in this international test.
Use the NMISS function to create a new variable john.
data kaz2;set kaz;
john=nmiss(of GNP14 mat8 mat7);/*this returns the number of missing values among these three variables*/
run;
/*check what the data look like now*/
proc print data=kaz2;
var name gnp14 mat8 mat7 john;
run;
/*Apply OLS regression with cases with perfect data (no missing cases). In this way,
model 1 and model 2 will have the same number of cases, or to be more precise, the same
data.*/
proc reg data=kaz2;
where john=0; /*Run only when john=0, namely, number of missing cases is 0*/
model mat8=mat7;
model mat8=mat7 gnp14;
run;
VI. Procedures
A. PROC CONTENTS: Description of Contents
PROC CONTENTS data=kaz;
run;
Advanced topic: by default the variables are listed in alphabetical order. They can also
be shown by their position in the data set (left to right) by adding “position”:
proc contents data=kaz position;
run;
I like this option because related variables tend to sit close to each other in the data set,
so they are easy to find.
B. PROC PRINT: See Data
PROC PRINT data=kaz;
VAR nation mat7 mat8 natexam; /*without this, all variables will be printed*/
run;
Advanced topic: You can selectively print observations.
/*print only when natexam=1*/
proc print data=kaz;where natexam=1;var nation mat7 mat8;run;
/*print by group units; the data must be sorted by the group variable first*/
proc sort data=kaz out=kaz2;by block;run;
proc print data=kaz2;by block;var nation mat7 mat8;run;
/*print only up to a certain number of observations*/
proc print data=kaz2 (obs=5); /*shows only five observations*/
run;
If you want a nicer print-out, try proc report.
C. PROC SORT: Sorting Observations based on a value of variable
You will use this procedure a lot, but be careful with large data sets: sorting them
consumes a lot of computation time.
PROC SORT data=kaz out=kaz2;
/*If you don’t want to create a new data set, just write “out=kaz”*/
by mat8;
run;
Advanced topics:
proc sort data=kaz out=kaz2 nodupkey;
by block;
run;
proc print data=kaz2;run;
This keeps only the first observation of each block. Imagine that you have data with
individual-level variables (e.g., for 100 students) and group-level variables (e.g., for 10
schools), and that you want school-level information from these data. The procedure
above would take just the first observation of each school and give you ten lines of data
for the 10 schools. Ignore the individual-level variables in the result, however: they
belong to whichever student happened to come first.
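The student and school scenario can be sketched as follows. The data set students and its variables (school, student, score, schsize) are made up for illustration. The KEEP= data set option drops the individual-level variables from the output, so only school-level information is left.

```sas
data students;
  input school $ student $ score schsize;
  datalines;
A s1 80 300
A s2 90 300
B s3 70 500
B s4 60 500
B s5 75 500
;
run;

/* One line per school, keeping only the school-level variables */
proc sort data=students out=schools (keep=school schsize) nodupkey;
  by school;
run;

proc print data=schools;
run;
```

The result has two observations: school A with schsize 300 and school B with schsize 500.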
You can use more than one variable in by line.
proc sort data=kaz out=kaz2;
by natexam block;
run;
/*What would the new data look like?*/
proc print data=kaz2;run;
D. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
PROC MEANS data=kaz;
VAR mat7 mat8;
run;
Advanced topic: Group means.
/*Report group means*/
proc sort data=kaz out=kaz2;by block;run;
proc means data=kaz2;
by block;
var mat7 mat8;
run;
You can also use a “class” statement instead of a “by” statement. The class statement is
easier because you don’t need to sort the data by the grouping variable beforehand. I
forgot what the downside of it was.
proc means data=kaz2; /*now, kaz2 does not have to be sorted by block*/
class block;
var mat7 mat8;
run;
/*Save group means*/
ods listing close; /*printing of results suppressed*/
proc means data=kaz2; /*make sure kaz2 is already sorted by group ID*/
by block;
var mat7 mat8;
ods output summary=john; /*Output Delivery System used. See Introduction to SAS 2*/
run;
ods listing; /*printing of results resumed*/
proc print data=john;
run;
/*Get standard errors by adding STDERR*/
/*But listing STDERR alone would print only the standard error, so you must also list the
other statistics you want. Here: mean, N, STD, MAX, and MIN*/
PROC MEANS data=kaz mean n std max min stderr;
VAR mat7 mat8;
run;
I recommend reading the chapter on PROC MEANS in the SAS on-line documentation. It is
a very versatile procedure.
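Group means can also be saved with the OUTPUT statement of PROC MEANS, as an alternative to the ODS approach. The sketch below uses made-up data; the data set names scores and grpmeans and the variables grp, a, and b are hypothetical.

```sas
data scores;
  input grp $ a b;
  datalines;
east 10 1
east 20 3
west 30 5
west 50 7
;
run;

/* NOPRINT suppresses the printed table; NWAY keeps only the per-group rows */
proc means data=scores noprint nway;
  class grp;
  var a b;
  output out=grpmeans mean=mean_a mean_b;
run;

proc print data=grpmeans;
run;
```

The data set grpmeans has one line per group (east: mean_a=15, mean_b=2; west: mean_a=40, mean_b=6), plus the automatic variables _TYPE_ and _FREQ_. Because CLASS is used, the input data do not have to be sorted.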
E. PROC FREQ: Get Frequencies
PROC FREQ data=kaz;
Tables natexam;
Run;
Advanced topics:
Get cross tabulation:
PROC FREQ data=kaz;
tables natexam*block;
run;
F. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
PROC UNIVARIATE PLOT DATA=KAZ;
var mat7 mat8 gnp14;
run;
Advanced topic: Get a box-and-whisker plot by subgroup, so you can compare group
values. But the output is text-based and pretty ugly.
proc sort data=kaz out=kaz2;
by block;
run;
PROC UNIVARIATE data=kaz2 plot;
by block;
var mat8;
run;
G. PROC PLOT: Plotting Two Variables
This produces a text-based graph. Use proc gplot for nicer graphics.
PROC PLOT data=KAZ;
Plot mat7*mat8;
run;
H. PROC TIMEPLOT: Time Plot
proc timeplot data=KAZ;
plot mat8= '*';
id NAME;
run;
Advanced topics:
/*Sort first by the variable of your interest and see it*/
/*you will be seeing a ranking of nations*/
proc sort data=kaz out=kaz2;
by mat8;
run;
proc timeplot data=KAZ2;
plot mat8= '*';
id NAME;
run;
Add bells and whistles. Below, I am asking, “Does GNP have anything to do with test
scores?”
/*First sort by GNP*/
proc sort data=kaz out=kaz2;
by gnp14;
run;
proc timeplot data=KAZ2;
title "TIMSS countries sorted by GNP";
plot mat7 mat8/overlay hiloc npp ;
id NAME block gnp14 prop;
run;
I. PROC CORR: Correlation
PROC CORR DATA=KAZ;
VAR mat7 mat8 gnp14;
Run;
J. PROC REG: OLS Regression
PROC REG DATA=KAZ;
MODEL mat8=natexam gnp14;
Run;
Advanced Topic:
See www.src.uchicago.edu/users/ueka for the creation of an OLS table using ODS. Also
see the PROC IML instruction on the same page to learn how OLS estimates its coefficients.
K. PROC LOGISTIC: Logistic Regression
/*I don’t know if natexam can be considered a dependent variable, but for the sake of
demonstration*/
PROC logistic data=kaz descending;
Model natexam=gnp14;
run;
/*option descending makes sure that PROC LOGISTIC is modeling the probability that the
outcome=1. Without this option, it would model the probability that the outcome=0*/
L. MAKE AN ASCII FILE
To use a stand-alone software program, you may have to create a simple ASCII (plain text)
file. But I rarely do this lately because many software packages read SAS data directly.
data timss;set kaz;
file "aschi_example.txt";
put (nation) ($10.) (mat7 mat8) (8.0); /*$10. because nation is a character variable*/
run;
VII. More Procedures
M. PROC STANDARD: Standardize Values
Make Z-score with a mean of 0 and standard deviation of 1
proc standard data=kaz out=kaz2 mean=0 std=1;
var mat7 mat8;
run;
/*then see what you did*/
proc print data=kaz2;
run;
Advanced technique: Standardize within groups.
/*First sort by group ID*/
proc sort data=kaz out=kaz2;
by block;
run;
/*Use by statement*/
proc standard data=kaz2 out=kaz3 mean=0 std=1;
by block;
var mat7 mat8;
run;
N. PROC RANK: Rank observations
proc rank data=kaz out=kaz2 groups=3;
/*Creates 3 groups. The new values will be 0, 1, and 2. */
var mat7 mat8;
RANKS Rmat7 Rmat8;
/*give names to the new variables*/
Run;
/*see what happened*/
proc print data=kaz2;
var mat7 Rmat7 mat8 Rmat8;
RUN;
Research Tip:
Why do we use rank?
a. We can split the sample based on the rank, e.g., a high SES student sample versus a
low SES student sample.
b. We can create dummy variables quickly by specifying groups=2, e.g., high SES
students will receive 1; else 0. This grouping occurs at the median point of a variable,
which may or may not be the best strategy. An alternative is to assign 1 and 0 based on
some meaningful threshold. For example, if I have temperature data, I may use the
median point to split the data if it makes sense, but maybe I use 0 degrees (the freezing
point) as a meaningful point to split the data instead.
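A threshold-based dummy variable like the temperature example can be created in a DATA step. This is a minimal sketch with made-up data; the data set names weather and weather2 and the variables day, temp, and freezing are hypothetical, and I assume that "at or below 0 degrees" is the intended cut point.

```sas
data weather;
  input day temp;
  datalines;
1 -5
2 3
3 .
4 0
5 12
;
run;

data weather2;
  set weather;
  /* SAS treats a missing value as smaller than any number, so guard
     against it; otherwise day 3 would be coded as freezing */
  if temp ne . then freezing = (temp <= 0);
run;

proc print data=weather2;
run;
```

The comparison (temp <= 0) returns 1 when true and 0 when false, so freezing is 1 for days 1 and 4, 0 for days 2 and 5, and stays missing for day 3.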
O. PROC SQL: Creating group-level mean variables
One could use proc means to derive group-level means. I don’t recommend this, since it
involves the extra step of merging the mean data back into the main data set. Extra steps
always create room for errors. PROC SQL does it all at once.
proc sql;
create table kaz2 as
select *,
mean(mat7) as mean_mat7,
mean(mat8) as mean_mat8,
mean(gnp14) as mean_gnp
from kaz
group by block;
quit; /*proc sql ends with quit; a run statement has no effect in proc sql*/
proc print data=kaz2;
run;
VIII. Merging Data Sets
libname here "C:\";
/*Create two data sets A and B.*/
data A;
set kaz; /*I am assuming that you already have the data set "kaz" from running the program
in Section IV of this document. */
keep nation mat7;
run;
data B;
set kaz;
keep nation mat8;
run;
/*MERGE DATA SETS*/
/*First sort them by a common ID*/
/*Here they are already sorted, so the following two lines are not really necessary*/
proc sort data=A;by nation;run;
proc sort data=B;by nation;run;
data NEW;
merge A B;
by nation;
run;
/*Confirm*/
proc print data=NEW;
run;
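When the two data sets do not contain exactly the same IDs, the IN= data set option shows where each observation came from. Below is a minimal sketch, assuming two small made-up data sets (scores7 and scores8, with a handful of values borrowed from the practice data) that only partly overlap. The flag names in7 and in8 are hypothetical.

```sas
data scores7;
  input nation $ mat7;
  datalines;
Japan 571
Korea 577
Spain 448
;
run;

data scores8;
  input nation $ mat8;
  datalines;
Japan 604.77
Korea 607.38
Sweden 518.64
;
run;

proc sort data=scores7; by nation; run;
proc sort data=scores8; by nation; run;

data both;
  merge scores7 (in=in7) scores8 (in=in8);
  by nation;
  /* keep only the nations found in both data sets */
  if in7 and in8;
run;

proc print data=both;
run;
```

The data set both contains Japan and Korea only. Without the "if in7 and in8;" line, Spain and Sweden would also be kept, each with a missing value for the score that its data set does not have.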
IX. Temporary and Permanent Data Sets
There are temporary and permanent SAS data sets. When you turn off SAS, the
temporary data will be erased. Throughout the exercise, you have seen “kaz” and “kaz2.”
They are temporary data sets.
To actually see these data, go to the Explorer (leftish side of the SAS window),
then to Libraries, and find folders in there. The default directory is called Work. (You
will also find folders that you nicknamed.) Click them to open and find data in them.
If you want to make them permanent, so they don’t disappear when you turn off
SAS, add the directory nickname in front of the new data set.
For example:
Data here.abc;set kaz;
keep nation growth;
growth=mat8-mat7;
run;
You are bringing in a temporary data set “kaz” and creating a new permanent data set
called abc in the directory “C:\TEMP” (nicknamed “here” by a libname statement). You are
creating a variable called “growth,” and it is now in here.abc. Only nation and growth are
kept in the new data set.
You can also do the opposite: bring in a permanent data set this time and create a temporary
data set.
Data xyz; set here.abc;
if growth ne .; /*keep only the nations whose growth is not missing*/
run;
You are bringing in a permanent data set called “abc” placed in C:\TEMP and creating a
new data set xyz in SAS’s default directory (Work). Because here.abc contains only nation
and growth, these are the only variables you can use here; mat8 and mat7 were not kept
when here.abc was created.
(Of course, reading in a permanent data set and creating a permanent data set is also
possible, e.g., “data here.xyz; set here.xyz;”)
Research Tip:
I recommend that you make permanent data sets as infrequently as possible. Just save
your syntax program and create fresh temporary data each time you start. This saves disk
space, because all you keep is your small syntax program. Also, research is a lot easier
if you have only a few programs and data sets. For example, this program creates the data
for one of my studies:
http://www.src.uchicago.edu/users/ueka/SAS/Dataextractor8.3.txt
Every time I need to work on this study, I can just run this one single program to reproduce
the data. I don’t have to remember the naming convention and location of the data sets
that I have to deal with.
For this particular study, I only need to deal with the file above and one more file
that actually does the analyses.
http://www.src.uchicago.edu/users/ueka/SAS/MakeFinalTables7.2.txt
If I need to make changes to my analyses, I know I just have to look into these two
files. This would be impossible if I had too many files and data sets flying all over the
place, even in one directory.
HOWEVER, if your data are huge (e.g., census data), then you may be better off
saving permanent data, so that it is quicker.
END of Document