
LISA SHORT COURSE SERIES:
INTRODUCTION TO SAS
UNIVERSITY
William DeShong
Fall 2015
Upcoming LISA Short Courses
http://www.lisa.stat.vt.edu/?q=short_courses
Outline
1. SAS Overview
2. SAS University Environment
3. Data Step
   1. Importing Data Sets
   2. Merging Data Sets
4. Procedure Step
   1. Manipulate/View Data
      1. Proc Print
      2. Proc Sort
   2. Aggregate Data
      1. Proc Summary
      2. Proc Freq
      3. Proc Means
   3. Model Data
      1. Proc Reg (If time permits)
SAS
• SAS (an acronym for Statistical Analysis System) is a
data-driven programming language that provides
information from data.
• The functionality of SAS is built around four data-driven
tasks.
• Data Access: Addresses or locates the data required by the programmer.
• Data Analysis: Summarizes, reduces, or transforms raw data into meaningful and useful information.
• Data Management: Shapes the data into a form required by the programmer.
• Data Presentation: Communicates information in ways that clearly demonstrate its significance.
SAS Program
• A SAS program (also called "SAS code") is a series of
statements (or "steps") for SAS to execute. There are three
types of SAS statements:
• DATA statements
• PROC statements
• global statements
• All DATA statements end with a RUN statement.
• All PROC statements end with either:
• a RUN statement (for almost all procedures)
• a QUIT statement (for very few procedures, such as PROC SQL)
Flow of Programming
• A DATA statement can be used to (1) create a SAS dataset from scratch, (2) create a SAS dataset from a raw dataset, (3) check for and correct errors in a dataset, and (4) create a SAS dataset by merging, subsetting, and updating existing SAS datasets.
[Flow diagram boxes: Raw Dataset; Built-In SAS Dataset(s); DATA Statement; SAS Dataset; PROC Statement; Report]
SAS Pointers
• When programming in SAS, keep in mind the following
pointers to prevent syntax errors:
• Semicolon Check: Every SAS statement ends with a semicolon ( ; ). A statement may run over several lines (a long FORMAT or LABEL statement, for example), but it still needs one semicolon at the end. One missing semicolon can break an entire SAS program.
• Use Comments: You can make a one-line comment by placing an asterisk ( * ) in front of your comment and a semicolon ( ; ) at the end of it. For a multi-line comment, start with ( /* ) on the first line and end with ( */ ) on the last line. Commented lines of code are ignored by the SAS processor. Comments help the programmer remember what parts of the SAS code do.
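Here is a small sketch of both comment styles next to a tiny DATA step. The dataset name work.demo and the variable x are made up for illustration.

```sas
* This is a one-line comment. It starts with an asterisk and ends with a semicolon;

/* This is a multi-line comment.
   SAS ignores everything up to the closing marker. */

data work.demo ;      /* work.demo is a hypothetical dataset name */
    x = 1 ;           /* every statement ends with a semicolon    */
run ;
```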
SAS University Edition Environment
• Let’s take a look at SAS University now!
Data Step
Importing Datasets
• Let's use the Data Importing Wizard!
Accessing Permanent SAS Datasets
• To access existing SAS datasets, use the following code:
libname name_of_library 'location_of_file' ;
• The name_of_library (the "libref") is a nickname that you choose for the folder in which SAS datasets are stored or accessed.
• The location_of_file is the location where SAS should go to find or save permanent SAS datasets.
• LIBNAME is a global statement, so it takes effect immediately and does not need a RUN statement after it.
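As a rough sketch, here is what this might look like in SAS University Edition. The libref clinic is made up, and the path /folders/myfolders is an assumption about how your shared folder is set up; substitute your own location.

```sas
/* Assumption: /folders/myfolders is the shared folder of your setup */
libname clinic '/folders/myfolders' ;

/* List the SAS datasets stored in the clinic library (nods = names only) */
proc contents data = clinic._all_ nods ;
run ;
```

After the LIBNAME statement runs, a dataset in that folder is referenced as clinic.datasetname.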
Accessing Permanent SAS Datasets
• Note that in giving the location, you do not name the particular SAS dataset that you want to use.
• Rather, you give the folder or extension (if there is no folder) where the SAS dataset(s) are located.
• Most SAS programmers put all of their SAS programs in one folder so that they can access them all at one time.
libname name_of_library 'location_of_file' ;
Accessing Permanent SAS Datasets
• The name_of_library must be 1 to 8 characters long, must begin with a letter or underscore, and may contain only letters, numbers, or underscores.
libname name_of_library 'location_of_file' ;
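• Legal vs. Illegal Names of Libraries
How many of the following seven library names are legal library names?
• clinic1
• 1_clinic
• _%clinic
• _clinic1
• _1clinic
• clinic_1
• 1clinic_1
Answer: 4 (clinic1, _clinic1, _1clinic, and clinic_1). The other three either begin with a number or contain the illegal character %.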
Descriptive Statistics Functions
• Below are a few of the descriptive statistics functions.
Most of these descriptive statistics can be found using
PROC MEANS or PROC UNIVARIATE.
Function   Syntax                             Calculates
SUM        sum(argument, argument, …) ;       sum of values
MEAN       mean(argument, argument, …) ;      average of nonmissing values
MIN        min(argument, argument, …) ;       minimum value
MAX        max(argument, argument, …) ;       maximum value
VAR        var(argument, argument, …) ;       variance of the values
STD        std(argument, argument, …) ;       standard deviation
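A minimal sketch of these functions inside a DATA step follows. The dataset work.scores and the variables exam1 to exam3 are hypothetical, and the two data lines are invented for illustration.

```sas
data work.scores ;                       /* hypothetical dataset */
    input exam1 exam2 exam3 ;
    total   = sum(exam1, exam2, exam3) ;   /* missing values are skipped */
    average = mean(exam1, exam2, exam3) ;  /* average of nonmissing values */
    lowest  = min(exam1, exam2, exam3) ;
    highest = max(exam1, exam2, exam3) ;
    datalines ;
8 6 9
7 . 5
;
run ;

proc print data = work.scores ;
run ;
```

For the second row, the missing exam2 is ignored, so total is 12 and average is 6.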
Date and Time Functions
Function   Syntax                                  Calculates
TODAY      today( ) ;                              today's date as a SAS date value; requires no arguments
TIME       time( ) ;                               current time of day as a SAS time value; requires no arguments
MDY        mdy(month_val, day_val, year_val) ;     the numeric SAS date value for that month, day, and year
DAY        day(date_val) ;                         day of the month of the SAS date value (1-31)
QTR        qtr(date_val) ;                         quarter of the year of the SAS date value (1-4)
WEEKDAY    weekday(date_val) ;                     numeric day of the week of the SAS date value (1-7, where 1 = Sunday)
MONTH      month(date_val) ;                       month of the SAS date value (1-12)
YEAR       year(date_val) ;                        year of the SAS date value (4 digits)
Date and Time Functions
• Here are some interesting ones, however.
Function   Syntax                                        Calculates
INTCK      intck('day' , SASdate1 , SASdate2) ;          the number of {days, weeks, months, quarters, years}
           intck('week' , SASdate1 , SASdate2) ;         between two SAS date values, counted as the number
           intck('month' , SASdate1 , SASdate2) ;        of interval boundaries crossed
           intck('qtr' , SASdate1 , SASdate2) ;
           intck('year' , SASdate1 , SASdate2) ;
INTNX      intnx('interval' , SAS_start_date ,           a SAS_end_date which is a multiple of the time
           increment, alignment_character) ;             interval added to SAS_start_date

alignment_characters (shown here for a 'month' interval)
• 'b' = beginning of the interval (1st of the month)
• 'm' = middle of the interval (about the 15th of the month)
• 'e' = end of the interval (last day of the month)
• 's' = same day as SAS_start_date
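To make the date functions concrete, here is a small sketch that writes its results to the log. The two dates are arbitrary examples, and the variable names are made up.

```sas
data _null_ ;
    start = mdy(1, 15, 2015) ;                 /* 15 January 2015 */
    stop  = mdy(3, 10, 2015) ;                 /* 10 March 2015   */

    /* Counts month boundaries crossed (1 Feb and 1 Mar), so this is 2 */
    months_between = intck('month', start, stop) ;

    /* Move one month ahead of start and align to the end of that month */
    next_month_end = intnx('month', start, 1, 'e') ;

    put months_between= ;
    put next_month_end= date9. ;               /* 28FEB2015 */
run ;
```

Note that INTCK reports 2 here even though fewer than two full months separate the dates, because it counts boundaries rather than elapsed time.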
Mathematical Functions
• Below are a few of the many mathematical functions. There are too many to list them all. You learn them as you learn how to program.
Function   Syntax                      Calculates
ROUND      round( argument , d ) ;     rounds to the nearest d, where
                                       • d = 10 (tens)
                                       • d = 1 (integer)
                                       • d = .1 (tenths)
                                       • d = .01 (hundredths)
LOG        log(argument) ;             takes the natural log
LOG10      log10(argument) ;           takes the log base 10
FLOOR      floor(argument) ;           rounds down to nearest integer
CEIL       ceil(argument) ;            rounds up to nearest integer
INT        int(argument) ;             returns integer part of value only
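The rounding functions are easiest to tell apart on a negative number. The sketch below uses made-up values and writes the results to the log.

```sas
data _null_ ;
    x = -2.7 ;
    down  = floor(x) ;              /* -3: next integer below    */
    up    = ceil(x) ;               /* -2: next integer above    */
    whole = int(x) ;                /* -2: drops the decimal part */
    near  = round(x, 1) ;           /* -3: nearest integer       */

    y = 1234.567 ;
    tens       = round(y, 10) ;     /* 1230    */
    hundredths = round(y, .01) ;    /* 1234.57 */

    put down= up= whole= near= tens= hundredths= ;
run ;
```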
Character Functions
Function   Syntax                                  Calculates
SCAN       scan(argument, n, delimiters) ;         returns the nth word from a character value
SUBSTR     substr(argument, position, length) ;    extracts a substring; on the left side of an assignment, replaces character values
TRIM       trim(argument) ;                        trims trailing blanks
INDEX      index(source, excerpt) ;                searches a character value for a specific string and returns its position
UPCASE     upcase(argument) ;                      converts to uppercase letters
LOWCASE    lowcase(argument) ;                     converts to lowercase letters
PROPCASE   propcase(argument) ;                    capitalizes the first letter of each word
TRANWRD    tranwrd(source, target, replace) ;      replaces or removes all occurrences of a pattern of characters
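Here is a short sketch of the character functions at work. The value 'smith, john' and the variable names are invented for illustration; results go to the log.

```sas
data _null_ ;
    name = 'smith, john' ;

    last    = scan(name, 1, ', ') ;            /* smith (comma and blank are delimiters) */
    first   = scan(name, 2, ', ') ;            /* john        */
    start5  = substr(name, 1, 5) ;             /* smith       */
    where_j = index(name, 'john') ;            /* 8           */
    shout   = upcase(name) ;                   /* SMITH, JOHN */
    proper  = propcase(name) ;                 /* Smith, John */
    swapped = tranwrd(name, 'john', 'jane') ;  /* smith, jane */

    put last= first= start5= where_j= shout= proper= swapped= ;
run ;
```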
PROC SORT Statement
• The purpose of PROC SORT is to reorganize a SAS
dataset by a subset of its variables.
proc sort data = libref.datasetname ;
by var1 var2 … vark ;
run ;
• The PROC SORT statement can:
• sort by one variable or more than one variable
• sort in ascending order or descending order
• remove duplicates while sorting (not by default; you must request it, for example with the NODUPKEY option)
PROC SORT Statement
proc sort data = libref2.dataset1 out = libref2.dataset2 ;
by var1 var2 … vark ;
run ;
• If you specify the out= option, SAS leaves the original SAS dataset (dataset1) unchanged and writes the sorted copy to the SAS dataset named in out= (dataset2).
• If you do not use the out= option, SAS will sort dataset1 and store the result back into dataset1.
• Thus, it overwrites the dataset and you lose the original order.
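The following sketch shows the out= option, a descending sort key, and duplicate removal together. The dataset work.visits and its values are hypothetical.

```sas
data work.visits ;                 /* hypothetical unsorted data */
    input idnum score ;
    datalines ;
2054 7
1226 9
2054 5
1226 9
;
run ;

/* Sorted copy: idnum ascending, then score from high to low.
   work.visits itself keeps its original order. */
proc sort data = work.visits out = work.visits_sorted ;
    by idnum descending score ;
run ;

/* Keep only the first observation for each idnum */
proc sort data = work.visits out = work.visits_unique nodupkey ;
    by idnum ;
run ;
```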
Merging Data sets with Match-Merging
• With simple match-merging, the SAS programmer is
trying to link observations together using the values in the
variables listed in the BY statement.
proc sort data = SAS-dataset-1 ;
by <descending> variable_1 variable_2 … variable_n ;
SAS statements ;
run ;
…
proc sort data = SAS-dataset-k ;
by <descending> variable_1 variable_2 … variable_n ;
SAS statements ;
run ;
data newSASdataset ;
merge SAS-dataset-1 SAS-dataset-2 … SAS-dataset-k ;
by <descending> variable_1 variable_2 … variable_n ;
SAS statements ;
run ;
Match-Merging
• Before match-merging, every original SAS dataset being merged must first be sorted by the variables in the BY statement.
Example #1
• Flight attendants for International Airlines need to pass three exams (federal regulations, customer service, and safety procedures) in order to become certified flight attendants. They can take them at any time, but they must pass the federal regulations exam before moving on. Below are three permanent SAS datasets showing the attempts by id number and their scores. A score higher than 6 is needed to pass.
Match-Merging [Step 1: Use a PROC SORT]
• The PROC SORT steps will
sort the three SAS datasets
by the idnum variable. This
will set us up to begin the
simple match-merging
procedure.
Match-Merging [Step 2: The DATA Statement]
• The DATA step will link the
observations together by the
idnum variable.
• But how does SAS
accomplish this?
Match-Merging [Step 3: The Merging]
• From all three SAS
datasets, SAS
searches for the first
set of observations with
the lowest value for
idnum. In this case, it
is the missing value in
the third dataset. Why?
• Notice, however, that
there are no
observations in the
other SAS datasets
with an idnum also
equal to blank.
If an input SAS dataset does not have a matching BY
value, then the observation in the output SAS dataset
contains missing values for the variables that are unique
to that input dataset.
Match-Merging [Step 3: The Merging]
• SAS now searches for
the next lowest value
for the idnum variable.
• Here, the value
appears in only two of
the three SAS datasets.
• Again, SAS will put
missings in for the
fr_score variable.
Match-Merging [Step 3: The Merging]
• The next idnum value
is 1226. Fortunately, it
appears in all three
once.
• SAS simply links them
together.
• So when a BY variable value appears the same number of times in all of the SAS datasets, SAS has no problem at all linking them together by order.
Match-Merging [Step 3: The Merging]
• Similar to the last
idnum value, SAS is
going to do the same
for the value of 2054.
• Since there is an equal
number of observations
in all three SAS
datasets, SAS is going
to link them together by
the order in which they
appear.
• The first observations in
each dataset will link
together and the second
observations will link
together.
Match-Merging [Step 3: The Merging]
• Now look at this
situation. Not only are
we missing an
observation in the third
SAS dataset, but there
is an uneven number of
observations in the first
two.
• SAS only knows how to
match if there are the
same number of
observations in the
SAS datasets that
share the same BY
variable values.
Match-Merging [Step 3: The Merging]
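• SAS links the observations in the order in which they appear. When a dataset runs out of observations for a BY value, SAS carries forward the values from its last observation for that BY value in the background.
• A SAS dataset with no observation at all for the BY variable value contributes missing values.
• Please note that the replicated observations do not appear in the input SAS datasets.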
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
3362    8                   4                   .
3362    8                   8                   .
Match-Merging [Step 3: The Merging]
• This is similar to the
last example (but with
more observations).
• Questions
• How many observations
will SAS create for idnum
4524?
• What observations will be
replicated to perform the
match-merging?
• How will SAS link these
records together?
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
4524    4                   2                   4
4524    5                   3                   7
4524    8                   4                   7
4524    8                   5                   7
4524    8                   7                   7
Match-Merging [Step 3: The Merging]
• I think you get the point
now, right?
• And so you should
know what appears
next in the SAS
dataset.
• There will be two
observations for idnum
5702…
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
5702    9                   9                   5
5702    9                   9                   9
Match-Merging [Step 3: The Merging]
• There will be three
observations for idnum
6256…
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
6256    1                   8                   9
6256    5                   8                   9
6256    9                   8                   9
Match-Merging [Step 3: The Merging]
• There will be one
observation for 7803…
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
7803    9                   8                   8
Match-Merging [Step 3: The Merging]
• There will be two
observations for idnum
8008…
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
8008    9                   7                   5
8008    9                   7                   9
Match-Merging [Step 3: The Merging]
• And finally, there will be
four observations for
idnum 9890.
idnum   score (dataset 1)   score (dataset 2)   score (dataset 3)
9890    3                   2                   9
9890    8                   4                   9
9890    8                   5                   9
9890    8                   9                   9
Voila! … the SAS
dataset is
complete.
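The whole walkthrough can be tried in miniature with the sketch below. The three dataset names, the variables cs_score and sp_score, and the handful of data values are hypothetical stand-ins, not the course datasets; only idnum and fr_score are names taken from the slides.

```sas
/* Hypothetical miniature versions of the three exam datasets */
data work.fedregs ;
    input idnum fr_score ;
    datalines ;
3362 8
1226 9
;
run ;

data work.custserv ;
    input idnum cs_score ;
    datalines ;
3362 4
1226 7
3362 8
;
run ;

data work.safety ;
    input idnum sp_score ;
    datalines ;
1226 8
;
run ;

/* Step 1: sort every input dataset by the BY variable */
proc sort data = work.fedregs  ; by idnum ; run ;
proc sort data = work.custserv ; by idnum ; run ;
proc sort data = work.safety   ; by idnum ; run ;

/* Steps 2 and 3: match-merge by idnum */
data work.exams ;
    merge work.fedregs work.custserv work.safety ;
    by idnum ;
run ;

proc print data = work.exams noobs ;
run ;
```

In the result, idnum 1226 appears once with all three scores. idnum 3362 appears twice: fr_score is carried forward as 8 on the second row, and sp_score is missing on both rows because work.safety has no observation for 3362.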
Common Variable (Simple Match-Merging)
• Keep in mind that all four common variable rules apply for
the simple match-merging process.
• The common variable must have the same variable type (i.e. numeric or character) in each of the original SAS datasets. Otherwise, SAS will return an error message.
• The values from the last original SAS dataset listed overwrite the previous values stored for that variable.
• If a common variable has different formats, SAS will use the first format it sees for that variable.
• If a common variable has different lengths, SAS will use the first length it sees for that variable.
It is this common variable rule that we are going to investigate
more right now. The last thing that we want to do is overwrite data.
The PROC PRINT Statement
• The PROC PRINT statement is one of the most commonly used procedures in SAS. This statement lets you output a SAS dataset (or a subset of it) in the output window.
• The most basic format of the PROC PRINT statement is the following:
proc print data = libref.datasetname ;
run ;
• In this format, SAS will print all of the variables in the SAS dataset into the output window unformatted. Of course, there are ways to enhance the output (some of which we will cover now).
PROC PRINT: Options
• If you want SAS to print specific variables, you can adjust
the code by including a var statement.
proc print data = libref.datasetname ;
var variable1 variable2 … variablek ;
run ;
• You can also produce column totals for numeric variables
by using a sum statement.
proc print data = libref.datasetname ;
sum num_variable ;
run ;
PROC PRINT: Options (cont.)
• You can also suppress the observation number by including the noobs option in the PROC PRINT statement.
proc print data = libref.datasetname noobs;
run ;
• Alternatively, if you have a variable that identifies each observation, you can use the id statement to replace the default observation number.
proc print data = libref.datasetname ;
id variable1 ;
run ;
PROC PRINT: Options (cont.)
• Rather than use the variable name, you can substitute a label for the variable by including a label statement. But notice that the label option must also appear in the PROC PRINT statement itself.
proc print data = libref.datasetname label ;
label variable1 = 'Variable 1' ;
run ;
• You can also specify to print a subset of observations
from the SAS dataset based on a condition or a set of
conditions using a where statement in the code.
proc print data = libref.datasetname ;
where insert_condition_here ;
run ;
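Several of these options can be combined in one step. The sketch below uses sashelp.class, a small sample dataset that ships with SAS; the label text and the age cutoff are arbitrary choices for illustration.

```sas
proc print data = sashelp.class noobs label ;
    var name age height ;                /* print only these columns, in this order */
    where age >= 14 ;                    /* subset of observations */
    sum height ;                         /* column total for height */
    label height = 'Height (inches)' ;   /* shown because the label option is set */
run ;
```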
PROC CONTENTS Statement
• The purpose of PROC CONTENTS is to provide a
detailed listing of:
• the variables listed in a SAS dataset
proc contents data = libref.datasetname ;
run ;
• the SAS datasets located in a SAS library
proc contents data = libref._all_ ;
run ;
• The _all_ is a SAS keyword to reference all of the SAS datasets in a SAS library.
PROC FREQ Statement
• Now, we turn our attention to procedures that will help produce
results in the output window.
• The purpose of PROC FREQ is to create a frequency or
relative frequency table over a subset of SAS variables. The
code to do this is the following:
proc freq data = libref.datasetname ;
tables var1 var2 … vark ;
run ;
• The PROC FREQ statement can not only create a table by one
or more variables, but it can also save the results as a SAS
dataset.
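As a sketch of both uses, the step below builds a one-way table that is also saved as a SAS dataset, plus a two-way table. It uses the sashelp.class sample dataset; the output dataset name work.sex_counts is made up.

```sas
proc freq data = sashelp.class ;
    tables sex / out = work.sex_counts ;   /* one-way table, saved as a dataset */
    tables sex * age ;                     /* two-way (cross) table */
run ;

/* Look at the saved counts and percentages */
proc print data = work.sex_counts ;
run ;
```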
PROC MEANS
• Let's start from the basics. The basic form of the PROC
MEANS is the following:
proc means data = libref.datasetname ;
run ;
• This basic form:
• produces statistical output for all of the numeric variables in the SAS
dataset
• produces the sample size, mean, standard deviation, minimum, and
maximum values by default
• We will use our baseball SAS dataset to understand how this
procedure works.
Scenario
• Here is a SAS dataset called baseball. It is located in the
' ia ' library.
Scenario
• Here is the breakdown of the variables.
PROC MEANS
• Here is the application of the PROC MEANS without any
options:
proc means data = ia.baseball ;
run ;
• Again, without any options, SAS calculates the sample mean,
sample standard deviation, sample size, minimum, and
maximum values for each numeric variable in the SAS dataset.
• The output is placed in a table and is posted in the output
window (i.e. no new output window is created from the MEANS
procedure unless specified otherwise).
• Let's adjust the code to get better output.
PROC MEANS [var keyword]
• Notice that in the last slide, all of the numeric variables were provided in the output. To choose specific variables in the SAS dataset, include a var statement followed by only the variables that you want in the output.
proc means data = libref.datasetname ;
var variable_1 variable_2 variable_3 … variable_k ;
run ;
• SAS will only output the statistics for those that you
provided (and in that order).
• Note: if you have SAS variables with names that differ by a
number at the end of the variable name (for example: exam1
exam2 exam3 exam4 exam5), you can reference all of them by
saying the following: var variable_1 - variable_k
• For our example, we can say: var exam1 - exam5
PROC MEANS [ <stat-keywords> ]
• You can specify which descriptive statistics you want in the output by listing them after the name of the dataset.
proc means data = libref.datasetname <stat-keywords> ;
var variable_1 variable_2 variable_3 … variable_k ;
run ;
• By using this option, you override the default statistics. Now, SAS will only produce the statistics that you specify.
• There are dozens of statistical keywords to choose from.
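For instance, a sketch along these lines requests four statistics for two variables of the sashelp.class sample dataset. The choice of keywords and the maxdec= setting are only examples.

```sas
proc means data = sashelp.class n mean median std maxdec = 2 ;
    var height weight ;     /* only these two variables are summarized */
run ;
```

Because statistic keywords are listed, the default minimum and maximum no longer appear in the output.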
PROC REG
• The basic form of the PROC REG is the following:
proc reg data = libref.datasetname ;
id idvar ;
model responsevar = var1 var2 … vark ;
run ;
quit ;
• PROC REG keeps running after RUN, so it is one of the few procedures that is ended with QUIT.
• This basic form:
• produces a linear regression model with model fit and parameter estimates, and
• produces the residual diagnostics
• We will use our Salary of Major League Baseball Players SAS dataset to understand how this procedure works.
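If you want to try the procedure before the baseball data is loaded, here is a minimal sketch on the sashelp.class sample dataset. The choice of response and predictors is arbitrary and is not the course example.

```sas
proc reg data = sashelp.class ;
    id name ;                        /* identifies observations in the output */
    model weight = height age ;      /* response = predictors */
run ;
quit ;                               /* ends the interactive procedure */
```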
Questions?
Special Thanks
• Dr. Chris Franck, Assistant Director of LISA
• Tonya Pruitt, Administrative Specialist, LISA
• Dr. Marlow Lemons
• Kris Patton
• Elaine Perrin
• Weibin Xu