tweaking_223_20121017

Creating and Tweaking Data
HRP223 – 2010
October 17, 2012
Copyright © 1999-2012 Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties.
Unauthorized reproduction of this presentation, or any portion of it, may result in
severe civil and criminal penalties and will be prosecuted to maximum extent possible
under the law.
1
Topics
• Creating data with loops in data steps
• Creating variables
• Modifying variables
2
Making 100 records
Click here to
add new
variables.
Once you have the 100 records you can add
in details based on the value of dude.
You can easily add in a random number for
everyone’s height and make half the people
male.
3
Case Clauses
• You can add new variables using functions and
simple assignment statements inside casewhen-else-end phrases within the SQL.
This is what we are building.
Notice the computed column, the
name of the new node and the
name of the new data set.
Click Preview to see the SQL code.
4
1st
2nd
Computed
At the end of the process we want a character variable with dudes
1 to 50 to be male and 51 to 100 to be female and anything else
labeled as *** BAD SEX ***.
5
6
Always specify a
value for when it
does not know
what value to use.
Be sure to specify a
character column if you
are making strings of
characters.
Add in the rules on
how to replace.
7
8
Click Modify Task and
click Computed Column
to add in the random
height.
9
Display 5 digits including the .
Two are after the decimal
To change how the new values
appear click here.
10
rand("normal", 67, 3)
You can find
functions here. Use
OnLineDoc to find
more information.
11
More Complex Variables
• You can compute a new value for different levels
of a existing variable. Say I want to add 2 inches
to all the males.
• Open the with height and gender
• Click computed columns
• Recode the sex column
• Specify it is a numeric column
• Replace female with the height variable
• Replace male with the height variable + 2
12
13
Nested Loops
• How to create data.
– Use loops.
– Be sure to include an end with every do.
– Include an output inside the innermost loop.
• If you forget the output, the only time it will write a
record to the new dataset is at the end of the data step.
14
Advanced Expressions
• If you get sick of clicking you can write complex
case statements yourself using:
• Computed Column
• Advanced Expression
• Type the case logic
case
when
when
when
else
end
(treat = "Placebo") then rand("norm", 10, 1)
(treat = "Low") then rand("norm", 19, 2)
(treat = "High") then rand("norm", 20, 2)
.E
15
16
Fixing Bad Values
• You will eventually need to fix bad data.
– Say you want to set Placebo5 to be a score of 10.
Name the node and
output.
Select the variables
that are not
modified.
17
Fixing Bad Values
• Tell it to compute a column and choose either
Recode column or do a case-when-else-end
statement in an Advanced expression.
18
19
To get a better look click
validate
20
Hang on to this syntax
case
when (logic check) then new value
else originalVariable
end
21
Collapsing Groups
• Often you will have a categorical variable and
you will want to reduce the number of groups.
– High Dose and Low Dose are the same as being on
a drug.
• You can create a new variable or just use a
custom format to change how the values
appear.
22
Adding a New Column
• Choose Computed Column and recode a
column.
23
Adding a User
Defined Format
Here we are
changing characters
to appear as other
characters.
24
1
2
4
3
Repeat
until you
have filled
in all the
values you
want to
appear
differently.
25
Using Formats
• The formats are not automatically associated
with any variables. You need to tell SAS to
apply the format when it is creating a dataset
or when it is processing a variable.
• Some processing nodes do better if you have
assigned the format in a previous step.
26
Select the variable
that needs the
format and click
properties.
Click Change… and
then pick the User
Defined format.
27
Same Information Formatted
28
Combining
• When you have data in two tables, you need
to tell SQL how the two tables are related to
each other.
– Typically you have a subject ID number in both
files. The variable that can be used to link
information is called the key.
29
Here the two tables have
different variables (except
ID) and they are in a
different sort order.
Response to Treatment
We want the favorite color
merged in to see if it is
related to response to
treatment.
Demographics
30
Merging
• Merging is trivially easy with EG. Choose a table and
do the Query Builder…. And push the Join Tables
button.
31
Double click on the dividing lines
to make the columns wide
enough to read.
32
Notice the name t1. In the SQL
statements, variables from this
table will have the prefix t1.
This table will be referred to as
t2.
It noticed that the two tables
have the common variable ID.
Therefore it is going to match
records that have a common
value in ID.
Double click the link for details.
33
Joins
• You will typically do inner joins and left joins.
– Inner Joins: select the marching records
– Left Joins: select all records on the left side and
any records that match on the right.
34
Inner Joins
• Inner Joins are useful when you want to keep
the information from the tables, if and only if,
there are matches in both tables.
– Here you keep the records where you have
demographic and response to treatment
information on people.
35
Left Joins
• Left joins are useful when you have a table
with everybody on the left side of the join and
not everyone has records in the right table.
– A typical example has the left side with the IDs of
everyone in a family and the right table has
information on diagnoses. Not everyone is sick so
you want to keep all the IDs on the left and add in
diagnoses where you can.
36
Typical Left Join
Notice the numeric
variable is formatted to
display with words.
37
38
Coalesce
• The previous example leaves NULL for the
people who are disease free. You probably
want to list the rest as healthy.
• The coalesce function returns the first nonmissing value.
– Coalesce works on numeric lists.
– Coalesce works on character lists.
39
40
Coalesce
• If you are using left joins from multiple tables,
coalesce can be really useful.
– Say you have people who have reported disease,
other people have verified disease and the rest
are assumed to be healthy. You can coalesce an
indicator variable from the verified table and
reported table and call everybody else healthy.
41
If the tables have indicator variables, once the tables
are linked, the coalesce function is easy:
COALESCEC(t3.status2 , t2.status1, "Healthy"))
42
No indicator variables?
• If the tables you are coalescing do not have
indicator variables, just make them as part of
the query by adding a column which has the
ID in the child tables (e.g., reported and
verified) recoded to a word like “reported” or
“verified”.
43
The two new indicator
columns.
44
Coalesce the new columns
• Once the new columns are created, create a
new variable using the Advanced expression
option for a new computed column. Then do
coalesce on the new variables. Double click
on the new variables and it will insert the
code.
45
After double clicking
the ver variable the
code is inserted.
Don’t forget the comma
before double clicking
the rep variable.
After inserting reported
and verified, put in
another comma and the
“healthy” option.
46
47