Advanced Data Manipulation
Greg Jenkins
Topics
• How a data step really works.
• Advanced data step features: arrays, loops,
SAS system variables, retain, output
statements, advanced “input” statements,
using the data step as an output device.
• Data manipulation procedures: proc
transpose, and an overview of proc sql.
Compile Part of a Data Step
• SAS scans the syntax of the data step
(checks for syntax errors).
• Translates the source code to machine
language.
• Defines the input/output files.
• Creates:
- the input buffer (only for non-SAS data)
- the Program Data Vector (PDV)
• Sets variable attributes for the output dataset.
• Determines which variables to reset to missing
at the start of each iteration.
Execute Part of a Data Step
• Executes the code written in the data step.
• Reads one line of data from the data source
(if one exists) into the PDV.
• The PDV is a temporary storage area that can
be thought of as a single row of data.
• Then any calculations, subsetting, etc. that are part
of the data step code are carried out.
• The contents of the PDV are written to the output
dataset when an “output” statement is reached or,
if there is none, after the last executable statement
of the data step.
Example
Sample Code
data main;
set old;
length y $20.;
z = x + 2;
keep x y z;
run;
OLD dataset
X    A
3    Qsrt
5    f
2    #
Compile Part of Example Data Step
• SAS scans code for “syntax errors”, none
there(at least I hope no typos), so it moves
on to the next step.
• SAS translates the “SAS code” into
machine language.
• Defines the input file as a SAS dataset
named “old”, and the output file as a SAS
dataset named “main”.
Compile Part of Example Data Step
• Since the input file is a SAS dataset (no
input statement), SAS doesn’t create an input
buffer.
• SAS creates the PDV:
X    A    Y    Z    SAS System Variables
                    …
Compile Part of Example Data Step
• SAS now creates the attributes of the variables in
the output dataset.
• Attributes of a variable are:
- type(character, or numeric)
- formats/informats
- labels
- length
• We’ll suppose that the variables in the old dataset
were defined as:
- “x” numeric, “a” character
- no formats, informats, or labels for either
- “x” has a length of 8, and “a” has a length of 4
Compile Part of Example Data Step
• SAS creates the attributes in the new
“main” dataset for the variables in the “old”
dataset, using the same attributes as in the
“old” dataset since no commands were
issued to change the attributes.
Compile Part of Example Data Step
• The attributes of the variables that will be
created are:
- “z” numeric, “y” character (this is defined
in the length statement, since a $ appears
after the name).
- no formats, informats, or labels, since none
are defined in the data step.
- the length of “z” will be 8 by default
(this could be changed, though that is not
recommended), and the length of “y” will be
20 due to the length statement.
Compile Part of Example Data Step
• Then SAS initializes the new variables to
missing and starts the executable phase of
the data step.
• It’s good to note here that there are other
things that SAS is doing during this phase
of the data step, but for a basic
understanding of what’s going on here we’ll
ignore them and move on (there are some
books and papers on this subject if you’re
really interested).
Executable Part of the Data Step
• Read the 1st row of data from the “old” dataset
into the PDV:
data main;
set old;
length y $20.;
z = x + 2;
keep x y z;
run;
OLD dataset
X    A
3    Qsrt
5    f
2    #
PDV
X    A       Y    Z    SAS System Variables
3    Qsrt              _n_=1 …
Executable Part of the Data Step
• Run through the executable code:
z = x + 2;
PDV
X    A       Y    Z    SAS System Variables
3    Qsrt         5    _n_=1 …
Executable Part of the Data Step
• End of executable code: the “implicit output”
writes the data in the PDV to the output dataset.
PDV
X    A       Y    Z    SAS System Variables
3    Qsrt         5    _n_=1 …
MAIN dataset (row added)
X    Y    Z
3         5
Executable Part of the Data Step
• Read the 2nd row of data from the “old” dataset
into the PDV:
data main;
set old;
length y $20.;
z = x + 2;
keep x y z;
run;
OLD dataset
X    A
3    Qsrt
5    f
2    #
PDV
X    A       Y    Z    SAS System Variables
5    f                 _n_=2 …
Executable Part of the Data Step
• Run through the executable code:
z = x + 2;
PDV
X    A       Y    Z    SAS System Variables
5    f            7    _n_=2 …
Executable Part of the Data Step
• End of executable code: the “implicit output”
writes the data in the PDV to the output dataset.
PDV
X    A       Y    Z    SAS System Variables
5    f            7    _n_=2 …
MAIN dataset (row added)
X    Y    Z
5         7
Executable Part of the Data Step
• Read the 3rd row of data from the “old” dataset
into the PDV:
data main;
set old;
length y $20.;
z = x + 2;
keep x y z;
run;
OLD dataset
X    A
3    Qsrt
5    f
2    #
PDV
X    A       Y    Z    SAS System Variables
2    #                 _n_=3 …
Executable Part of the Data Step
• Run through the executable code:
z = x + 2;
PDV
X    A       Y    Z    SAS System Variables
2    #            4    _n_=3 …
Executable Part of the Data Step
• End of executable code: the “implicit output”
writes the data in the PDV to the output dataset.
PDV
X    A       Y    Z    SAS System Variables
2    #            4    _n_=3 …
MAIN dataset (row added)
X    Y    Z
2         4
Executable Part of the Data Step
• SAS can’t find any more observations in the
input dataset so the data step is ended and
the PDV is destroyed.
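One way to watch the PDV change during execution is to write it to the log with put _all_. The sketch below is an illustration, not part of the original example: the first step is an assumed way of recreating the OLD dataset from the values shown earlier, and the two put statements print every PDV variable (including _n_ and _error_) before and after the assignment.

```sas
/* Assumed recreation of the OLD dataset used in the walkthrough */
data old;
  input x a :$4.;
  datalines;
3 Qsrt
5 f
2 #
;
run;

data main;
  set old;
  length y $20;
  put 'After set:        ' _all_;   /* z is still missing here */
  z = x + 2;
  put 'After assignment: ' _all_;   /* z now holds x + 2 */
  keep x y z;
run;
```

The log should show one pair of lines per iteration, with _n_ increasing from 1 to 3.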
Output Statement
• “Implicit” vs. “explicit” output statement.
• The previous example used the “implicit” output:
after all executable code in the data step has run,
SAS writes a row of data from the PDV to the
output dataset.
• An “explicit” output statement is used to control
when data is written from the PDV to the
output dataset. If a data step contains an explicit
output statement, the implicit output no longer occurs.
Output Statement
• Whenever an output statement is executed in a data
step, data is written from the PDV to the output
dataset:
data gsdfs;
x = 3;
output;
x = 5;
output;
run;
Output Statement
• In the previous example there is no data to be
read in, so SAS runs through the executable
code only once.
• First x is set to 3 in the PDV and written
to the output dataset gsdfs.
• Then x is set to 5 in the PDV and written
to the output dataset gsdfs.
GSDFS Dataset
X
3
5
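An explicit output statement can also name the dataset to write to, which lets one data step build several datasets. The following is a minimal sketch, assuming the OLD dataset (with numeric x) from the earlier example; the dataset names low and high are made up for illustration.

```sas
/* Assumes the OLD dataset (numeric x) from the earlier example */
data low high;
  set old;
  if x < 4 then output low;   /* rows with x = 3 and x = 2 */
  else output high;           /* row with x = 5 */
run;
```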
SAS System Variables
• SAS system (automatic) variables are created during
the data step but are not stored in the output
dataset, although they can be used in the
executable code.
• _n_ counts the number of times the data step
has iterated (it is not a count of the rows output).
• _error_ is 0 by default and is set to 1 when an
error, such as a data error while reading input,
occurs during the iteration.
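A small sketch of how _n_ behaves, under the assumption that no input data is read (so the data step iterates exactly once). The variable names rep and iteration are illustrative only.

```sas
data counts;
  do rep = 1 to 2;
    iteration = _n_;   /* copy _n_ into a variable so it is kept */
    output;            /* two rows are written in a single iteration */
  end;
run;
```

Both rows should carry iteration = 1, since _n_ tracks iterations of the data step and not the number of rows written.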
SAS System Variables
• Other variables that can be used in the
executable code are first.“by-variable” and
last.“by-variable” (created by a by statement), and
the variables named in the in=, end=, and point=
options of the set statement (in= and end= can
also be used on merge).
• An in= option creates a variable that has
a user-specified name and takes on the value
1 if the current observation was read from that
input dataset, and 0 if it was not.
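The end= option is not demonstrated in these slides, so here is a hedged sketch of one common use: writing a single summary row after the last observation. It assumes a dataset DATA1 with a numeric variable x; the names last_row, sum_x, and total are hypothetical.

```sas
/* Assumes DATA1 has a numeric variable x */
data total;
  set data1 end=last_row;   /* last_row = 1 only on the final observation */
  sum_x + x;                /* sum statement: sum_x is kept across rows */
  if last_row then output;  /* write a single row holding the total */
  keep sum_x;
run;
```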
SAS System Variables
DATA1
ID   X
3    4
DATA2
ID   X
2    -6
data main;
set data1(in=myvar)
data2(in=other1);
nv1 = myvar;
nv2 = other1;
run;
MAIN
ID   X    NV1  NV2
3    4    1    0
2    -6   0    1
SAS System Variables
• The first. and last. variables can only be used if a by
statement is specified:
data temp;
set temp2;
by id date;
if first.id then f = 1;
else f = 0;
if last.id then l = 1;
else l = 0;
run;
SAS System Variables
TEMP2 (input)
ID   DATE
1    1/5/02
1    2/6/02
2    12/31/01
2    6/15/02
2    7/7/02
3    4/7/02
TEMP (output)
ID   DATE      F    L
1    1/5/02    1    0
1    2/6/02    0    1
2    12/31/01  1    0
2    6/15/02   0    0
2    7/7/02    0    1
3    4/7/02    1    1
Retain Statement
• Keeps a variable’s value in the PDV from one
iteration of the data step to the next, instead of
re-initializing it to missing each time a new
line is read from an input statement or dataset.
• Often useful in combination with the first. and
last. system variables.
Retain Statement
data data1;
set temp;
by id date;
retain visit;
if first.id then visit = 1;
else visit = visit + 1;
run;
TEMP
ID   DATE
1    1/5/02
1    2/6/02
2    12/31/01
2    6/15/02
2    7/7/02
3    4/7/02
Retain Statement
DATA1
ID   DATE      VISIT
1    1/5/02    1
1    2/6/02    2
2    12/31/01  1
2    6/15/02   2
2    7/7/02    3
3    4/7/02    1
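The same visit counter can be sketched with a sum statement, which retains its variable automatically, so no separate retain statement is needed. This is an alternative to the retain example, not the author's code, and it assumes TEMP is already sorted by id and date.

```sas
/* Assumes TEMP is sorted by id and date, as in the retain example */
data data1;
  set temp;
  by id date;
  if first.id then visit = 0;  /* restart the count for each id */
  visit + 1;                   /* sum statement: retains visit and adds 1 */
run;
```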
Arrays
• Arrays are a way of representing a group of
variables in a possibly more efficient way.
• Two parts of using an array are the array
“declaration” statement and the array “call”.
• Names of arrays can be any SAS name providing
that the name is not used by any of the variables in
the data step.
• The array “declaration” statement that will take on
the values of a group of variables has the
following form:
array arrayname variable-list;
Arrays
• This statement makes the first element
(“column”) of arrayname refer to the first variable in
the variable-list, the second element to the second
variable in the list, …
• To use the array, reference an element with the general call:
arrayname[column #]
• The array statement can also create new variables
and initialize the elements to starting values. The
elements can be made temporary (i.e. not written to
the output dataset) with the _temporary_ keyword.
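As a rough sketch of the _temporary_ keyword and of looping over an array, the step below multiplies each of three variables by a constant held in a temporary array. It assumes a dataset DATA2 with numeric variables a, b, and c; the array names vars and factor, the output name scaled, and the constants are invented for the example.

```sas
/* Assumes DATA2 has numeric variables a, b, and c */
data scaled;
  set data2;
  array vars[3] a b c;
  array factor[3] _temporary_ (10 100 1000);  /* not written to the output dataset */
  do i = 1 to dim(vars);
    vars[i] = vars[i] * factor[i];
  end;
  drop i;
run;
```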
Arrays
data data1;
set data2;
array x a b c;
q = x[1] + x[3] - x[2];
run;
DATA2
A    B    C
1    3    -7
4    2    6
DATA1
A    B    C    Q
1    3    -7   -9
4    2    6    8
Arrays
• To create a different array element index
than that of the default starting at one,
change the array declaration statement:
array array-name[start:stop] varlist;
Arrays
data data1;
set data2;
array x[2:4] a b c;
q = x[2] + x[4] - x[3];
run;
DATA2
A    B    C
1    3    -7
4    2    6
DATA1
A    B    C    Q
1    3    -7   -9
4    2    6    8
Arrays
• There are also multidimensional arrays. To
define these, use the following general array
declaration statement:
array array-name[dim1,dim2, …] varlist;
Arrays
data data1;
set data2;
array x[2,2] a b c d;
q = x[1,1] + x[1,2] - x[2,1] + x[2,2];
run;
DATA2
A    B    C    D
1    3    -7   2
4    2    6    3
DATA1
A    B    C    D    Q
1    3    -7   2    13
4    2    6    3    3
Loops
• Iterative do loops, conditional do loops.
• Iterative loops use a counter to run a
specified group of code a specified number
of times.
• Conditional loops continue to loop until, or
while, a condition is true.
• All loops start with a do statement
and end with an end statement.
Iterative Do Loop
• Basic syntax:
do counter-variable = start-value to end-value
<by by-value>;
< programming statements(body of the loop);>
end;
Iterative Do Loop
data temp;
do i = 1 to 5;
x = i + 2;
output;
end;
run;
TEMP
I    X
1    3
2    4
3    5
4    6
5    7
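The optional by value in the iterative do syntax changes the step size of the counter. A short sketch (the dataset name evens and the variable square are illustrative):

```sas
data evens;
  do i = 2 to 10 by 2;   /* i takes the values 2, 4, 6, 8, 10 */
    square = i**2;
    output;
  end;
run;
```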
Conditional Do Loop
• There are two types of conditional do loops:
do until and do while.
• A do until loop continues looping until the
condition becomes true. The condition is checked
at the bottom of the loop, so the body always
executes at least once.
• A do while loop continues looping while the
condition is true. The condition is checked at the
top of the loop, so the body may not execute at all.
Do While Loop
• General syntax:
do while(conditional statements);
< programming statements(body of the loop);>
end;
Do While Loop
data temp;
i = 1;
do while(i le 5);
x = i + 2;
output;
i = i + 1;
end;
run;
TEMP
I    X
1    3
2    4
3    5
4    6
5    7
Do Until Loop
• General syntax:
do until(conditional statements);
< programming statements(body of the loop);>
end;
Do Until Loop
data temp;
i = 1;
do until(i > 5);
x = i + 2;
output;
i = i + 1;
end;
run;
TEMP
I    X
1    3
2    4
3    5
4    6
5    7
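The two loop types give the same result above, but they differ when the condition is already satisfied at the start. The sketch below starts both loops at i = 10; the dataset names while_rows and until_rows are made up for the comparison.

```sas
/* do while checks the condition first: the body never runs */
data while_rows;
  i = 10;
  do while (i le 5);
    output;
    i = i + 1;
  end;
run;

/* do until checks the condition last: the body runs once */
data until_rows;
  i = 10;
  do until (i > 5);
    output;
    i = i + 1;
  end;
run;
```

Because each step contains an explicit output statement, while_rows should end up with no observations, while until_rows should contain one row with i = 10.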
Advanced Input Statements
• Up to this point we have looked at the basic input
statement, assuming the input is given as
space-delimited values with each row as an
observation, and mainly at the syntax for
reading numeric data.
• General syntax:
input variable-list;
Basic Input Statement Arguments
• For general character variables, the variable name
must be followed by a $:
data temp;
input d $;
cards;
Banana
Orange
;
Basic Input Statement Arguments
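• For character variables with an embedded blank
use the $ then &. Values must then be separated by
two or more blanks, and a length statement is needed
for values longer than the default 8 characters:
data temp;
length d $20;
input d $ &;
cards;
Banana Rama
Orange Roughy
;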
Basic Input Statement Arguments
• To keep quotation marks as part of a character value
use the $ then ~ (note that the ~ modifier only has an
effect when the dsd option is in effect on the infile
statement):
data temp;
input d $ ~;
cards;
Banana's
hasn't
;
Basic Input Statement Arguments
• To input missing values of character or
numeric data in list input, use a period (.).
• Missing values that are indicated by a special
value in the data can be read by declaring that
value in the missing statement.
data temp;
missing r;
input a b;
cards;
2 r
. 4
;
A    B
2    .R
.    4
Changing Delimiters in Input Statement
• To specify the delimiter used in list
input, add the delimiter= option to the
infile statement:
data data1;
infile 'U:/sasclass/datafile.txt' delimiter=',';
input a b c;
run;
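With delimiter= alone, two delimiters in a row are treated as one, so an empty field is skipped. The dsd option on the infile statement is one way around that: it uses a comma as the default delimiter and reads consecutive delimiters as a missing value. A minimal sketch with inline data (the values are invented):

```sas
data data1;
  infile datalines dsd;   /* comma-delimited; ",," is read as a missing value */
  input a b c;
  datalines;
1,,3
4,5,6
;
run;
```

In the first row b should be missing, with a = 1 and c = 3.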
Formatted Input
• To read formatted input such as dates, follow the
variable name with a : and the informat to use:
data temp;
input date : mmddyy10.;
cards;
11/05/1996
12/13/2000
;
Column Input
• Read in data by specifying the “columns” the data
occupies in the input file:
data temp;
infile in;
input x 1-5 y 12-14;
run;
Input file “in”
Column #  123456789012345678901234567890
          3234 4321  132344
Column Input
• The previous example would read from the
input file and create the following dataset:
TEMP Dataset
X     Y
3234  132
Pointer Controls
• The @ character controls the starting
column a variable is read from. Note that the
value is still read as in list input (up to the
next blank).
data temp;
input @2 a @4 b;
cards;
12345
67890
;
TEMP Dataset
A     B
2345  45
7890  90
Pointer Controls
data temp;
input @2 a @4 b;
cards;
12 45
67 90
;
TEMP Dataset
A    B
2    45
7    90
Pointer Controls
• The + character moves the pointer a given number of
columns forward from where the previous variable
finished being read.
data temp;
input @2 a 1. +2 b;
cards;
12345
67890
;
TEMP Dataset
A    B
2    5
7    0
Pointer Controls
• There are also record (input file row) pointer
controls such as #n, which moves to the nth
record of the group of records read for each
observation.
data temp;
input #2 a;
cards;
1
2
3
4
;
TEMP Dataset
A
2
4
Pointer Controls
• Be careful, though: this can be tricky with
multiple variables.
data temp;
input #2 a #1 b;
cards;
1 5
2 6
3 7
4 8
;
TEMP Dataset
A    B
2    1
4    3
Pointer Controls
• The / symbol moves the pointer to
the 1st column of the next line of input.
data temp;
input a / b c;
cards;
1
2 3
4
5 6
;
TEMP Dataset
A    B    C
1    2    3
4    5    6
Using Data Step as an Output Device
• Use the put statement in a data step to output to
somewhere other than a data file.
data temp;
x = 3;
put x;
run;
• The above example will output a 3 to the log
window or .log file.
Using Data Step as an Output Device
• If you want to output to a place other than the log
window or .log file use the file statement.
data temp;
file print;
x = 3;
put x;
run;
• This will output to the output window or the .lst
file.
Using Data Step as an Output Device
• You can also specify a file to output to instead of
the output or .lst file:
data temp;
file 'U:/myfile.txt';
x = 3;
put x;
run;
• This will output to U:/myfile.txt
Using Data Step as an Output Device
• Sometimes you don’t need to create a data
set if you’re just using the data step as an
output device, so use the _null_ keyword in
the data statement:
data _null_;
x = 3;
put x;
run;
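The same idea extends to printing the contents of an existing dataset. The sketch below is an assumption-laden illustration: it expects a dataset TEMP with variables i and x (as built in the loop examples) and uses named output (variable=) in the put statement.

```sas
/* Assumes TEMP has variables i and x, as in the loop examples */
data _null_;
  set temp;
  file print;                               /* write to the output window */
  if _n_ = 1 then put 'I and X values';     /* heading on the first iteration only */
  put i= x=;                                /* writes e.g. i=1 x=3 */
run;
```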
Proc Transpose
• This procedure transposes data. The data
manipulation that it does can also be done in the
data step, but this procedure is often simpler
to use.
• Example: You have a data set with many
observations for each person, say blood
pressure measurements taken at several
clinical visits, and want a data set
with one observation per person and a
variable for each blood pressure
measurement.
YOU HAVE THIS:
ID   Visit  BP
1    1      90
1    2      98
1    3      76
2    1      82
2    2      104
3    1      115
BUT YOU WANT THIS:
ID   BP1  BP2  BP3
1    90   98   76
2    82   104  .
3    115  .    .
Proc Transpose
• General syntax:
proc transpose data = dsname1 out = dsname2;
var varname;
id idvar;
by byvar;
run;
Proc Transpose
• The data= option gives the input dataset name (i.e.
the dataset to be transposed), and the out= option gives
the output dataset (i.e. the transposed dataset).
• The var statement indicates the variable to be
transposed.
• The id statement indicates the variable whose values
name the variables created for the transposed data; the
default is to create the variables col1 to coln.
• The by variables are like by variables in all the other
procedures and indicate how to group the data to be transposed.
Previous Example
• Using the previous blood pressure example, one way to
name the new variables is to add another variable to the
input data set that holds the new variable names.
OLD Dataset
ID   Visit  idvar  BP
1    1      BP1    90
1    2      BP2    98
1    3      BP3    76
2    1      BP1    82
2    2      BP2    104
3    1      BP1    115
Previous Example
• So the code needed to transpose the data and
achieve the data structure desired would be:
proc sort data = old; by id; run;
proc transpose data = old out = new;
var bp;
id idvar;
by id;
run;
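An alternative that avoids adding the idvar variable is to let the procedure build the names from the visit number with the prefix= option. This is a sketch of that approach, not the method used in the slides; it assumes OLD contains id, visit, and bp, and it drops the _name_ variable that the procedure adds to the output.

```sas
proc sort data = old; by id visit; run;
proc transpose data = old out = new(drop = _name_) prefix = BP;
  by id;
  id visit;   /* visit values 1, 2, 3 become BP1, BP2, BP3 */
  var bp;
run;
```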
Proc SQL
• Proc SQL can do much of, if not more than,
what a data step can do.
• It has the ability to access data from other
sources (databases, etc.) and in the case of
some databases can pass native SQL
language to the database for added
efficiency.
• Another ability that SQL has is to do
complex merging of data.
Proc SQL
• This is an overview of the procedure, so
we’ll just look at some basic parts of the
procedure.
• Proc SQL is an interactive procedure: it is
started by issuing the statement
proc sql;
and keeps running until a quit; statement (or
another step) is submitted.
• There is no data= option for this
procedure; data is read in a little differently.
Proc SQL
• The main option for the proc sql statement is
print|noprint.
• In one respect this is a “query” procedure
intended to be used with databases, meaning
that you are asking SAS to tell you
something about the data. The noprint
option suppresses the printed “query” results,
so the procedure just builds datasets (or tables,
as they are referred to in the SQL procedure
language).
Proc SQL
• There is a nice graphical user interface that
is often helpful since it will output code.
• The SQL GUI can be found by selecting
“Tools” then “Query”.
• This interface is also mainly intended as a
“query” tool but can also be used to perform
most of the procedure’s capabilities.
Select Statement
• The next most important statement in the
SQL procedure is the select statement.
• The select statement allows you to decide
which variables are of interest or will
be used in the “query” or dataset building.
The clauses must appear in the order shown:
proc sql;
select <options>
from <options>
where <options>
group by <options>
having <options>
order by <options>;
Select Statement
• To get data into the procedure use the from
statement and include a table name(an example is
a SAS dataset name).
• In the select statement you can use an asterisk (*)
to indicate you are interested in all the variables in
the dataset, or supply a list of variables separated
by commas.
Proc sql;
select *
from data1;
Proc sql;
select a, b, c
from data1;
Select Statement
• Proc SQL has many summary functions, such as min,
max, avg, etc., which can be used to
create new variables. However, they
work across rows (observations) instead of
columns (variables) as in the data step.
• To create a new variable using one of these
functions, write the function call
and then the keyword as followed by
the new variable name.
Select Statement
• Example 1 creates a new variable y that will
take on the value one-half of x.
• Example 2 creates a new variable named
dsmin that will be the minimum x value in
the entire dataset data1.
EXAMPLE 1
proc sql;
select x*0.5 as y
from data1;
EXAMPLE 2
proc sql;
select min(x) as dsmin
from data1;
Select Statement
• If you want to use any variable that you have
created in proc sql (in other proc sql clauses)
you have to precede the new variable name with the
word calculated.
• The following example uses the order by
clause to sort the data, as in the proc sort
procedure.
Proc sql;
select x*0.5 as y
from data1
order by calculated y;
Quit;
Select Statement
• The group by statement “collapses” rows of
like data decided by the variables following
the statement. It is similar to the familiar by
statement in other SAS procedures.
• Variables in the group by clause do not
have to be specified in the select clause.
• This is like a by statement but no sorting is
necessary prior to running proc sql for this
or any other part of the SQL procedure.
DATA1 Dataset
ID   DATE     INCOME
1    1/13/00  3213
1    2/7/01   545
1    6/3/99   654
2    2/7/02   5235
2    1/8/00   8768
3    12/2/89  2155
proc sql;
select id, count(*) as days, sum(income) as totinc
from data1
group by id;
quit;
Result of Query
ID   DAYS  TOTINC
1    3     4412
2    2     14003
3    1     2155
Select Statement
• The where clause works the same as where statements
in other SAS procedures and the data step: it subsets
rows before any grouping. The having clause is applied
after group by and subsets the groups, so it can use
summary functions.
• If the procedure is connected to a database,
however, you can pass database “native” SQL
code instead of using SAS code.
• Also, if the procedure is connected to another
database and SAS statements are used instead of
SQL “native” to the database, SAS may have to make a
temporary copy of the entire table before it
subsets (not very efficient).
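To make the difference between where and having concrete, here is a sketch against the DATA1 income table shown earlier. The cut-off values 600 and 5000 are arbitrary choices for the illustration.

```sas
proc sql;
  select id, count(*) as days, sum(income) as totinc
    from data1
    where income > 600           /* drops individual rows before grouping */
    group by id
    having sum(income) > 5000;   /* drops whole groups after summing */
quit;
```

With the DATA1 values above, the where clause removes the 545 row, and only id 2 (2 rows, total 14003) passes the having condition.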
SQL Joins
• Joins in SQL are basically merging and
concatenating datasets together.
• There are many types of joins: left, right,
inner, outer, etc.
• It is often easier to use the proc SQL
graphical user interface than trying to figure
out how to code this, but if you do code
this, you need to put the datasets of interest
in the from clause and create aliases for
them.
SQL Joins
• An alias for a dataset works in the same way that
a libname works as an alias for a SAS library.
Proc sql;
select a.id, a.x, b.y
from data1 a, data2 b
where a.id = b.id;
DATA1 Dataset
ID   X
3    14
4    17
DATA2 Dataset
ID   Y
3    6
4    -5
SQL Joins
• The previous example performs the same merge that a
data step would with a merge and a by statement
(here every id appears in both datasets).
Result of Query
ID   X    Y
3    14   6
4    17   -5
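The same query can be sketched with explicit join syntax, which states the join type in the from clause instead of relying on a where condition. It assumes the DATA1 and DATA2 datasets shown above.

```sas
proc sql;
  select a.id, a.x, b.y
    from data1 as a
         inner join data2 as b
         on a.id = b.id;   /* keep only ids found in both tables */
quit;
```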
SQL Joins
• More complex joins are written by putting the
join type (e.g. left join) between the tables in the
from clause. Set operators such as union go between
two or more select … from queries.
Proc sql;
select * from data4
union
select * from data5;
Proc sql;
select * from data4 d4
left join data5 d5
on d4.id = d5.person;
Creating a SAS Dataset from a Query
• Up until now we have just been creating queries
which basically do not store any information.
• To create a SAS dataset from a query use the
“create table table-name as” statement
PROC SQL SYNTAX
proc sql;
create table data2 as
select * from data1;
quit;
DATA STEP SYNTAX
Data data2;
set data1;
Run;
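Create table also works with any of the query clauses covered earlier. As a hedged sketch, the step below stores the per-id income totals from the DATA1 income example in a new dataset; the table name totals is made up.

```sas
proc sql;
  create table totals as
    select id, sum(income) as totinc
      from data1
      group by id
      order by id;
quit;
```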
Accessing Databases
• There are many different statements, depending on
the operating environment you are working in, as
well as the type of database you’re connecting to
and how that database is set up.
• The basic statements you can use are the connect to
and disconnect from statements.
Proc sql;
connect to odbc …
< SQL statements >
disconnect from odbc … ;
Quit;