Point and Click SAS - MedBlog

Point and Click SAS:
Getting Started with SAS™ Enterprise Guide™ 4.1
Raymond R. Balise, PhD
Stanford University
Department of Health Research and Policy
From Start to Finish
• Today I am going to walk you through the
process of doing analyses using a package
called SAS/Enterprise Guide. Along the way I
will hit the high points in these areas:
– Importing
– Cleaning
– Visualizing
– Analyzing
– Reporting
Data Management and Analysis Choices
• You have many choices for managing and
analyzing data:
– Excel™ by Microsoft
– R™ by R Foundation for Statistical Computing
• Rcmdr by John Fox
– S-Plus™ by Insightful
– SPSS™
– SAS™
– SAS/Enterprise Guide
• Excel affords a very nice interface for quickly building simple tables of data.
• It has some data validation tools.
• It has nice visualization prototyping tools but there are MAJOR graphical bugs.
• It can do some common analysis methods but it has next to no built-in diagnostic tools.
• Its report writing abilities are pathetic.
Excel
• Excel’s built-in analysis
tools are hidden away and
are extremely limited.
R
• R has no obvious way to do anything.
R with Rcmdr
If you learn to download and run a library called R Commander, you can get a nice point-and-click system to do many common statistics, but important ones for medicine (like survival analysis) are missing.
R’s big brother (commercial software) called S-Plus has a very nice graphical user interface. You can point and click your way to lots of analyses and in some cases you can see the code and use it to learn the language. Unfortunately, the language is extremely unintuitive for things like data management. Also, the code generated by the GUI sometimes includes things that amount to “you clicked on a menu to do the task.”
SPSS
SPSS has a brilliant graphical user interface for graphics and analyses. Its major downfall is in data manipulation involving multiple tables. Also, when you get stuck, you’ll need to find an SPSS expert, which is not as easy as with other packages (within the medical school).
SAS
The latest version of SAS has a robust system for point-and-click analyses, which they hid away on a submenu.
SAS/Analyst
Once you have the Analyst running, you get the graphics and statistics menus you would expect. The interface is primitive but it does allow you to generate many analyses. But the format of the output (especially the graphics) is substandard.
SAS/Enterprise Guide
• Getting EG
– If you have a SAS 9.1.3 license through Stanford, you
can get EG for free by contacting software licensing
(software.stanford.edu).
• Using EG
– Project Explorer shows a “tree” view of the different
things in the project.
– Project Designer shows flowcharts of the tasks you
have done.
– Task List gives you hyperlinks to common tasks like
analyses and graphics.
– Task Status shows what EG is working on.
– Notice the pushpins. A “pushed in” pin keeps the
window displayed. A sideways pin says the window
can retract to the edge of the screen.
Orientation to EG
• The model for EG is to have a flowchart
beginning with data import, moving
through data management, into analysis
with visualization and ending at well
formatted pages.
• You can right click on the objects in the
flowchart and set and reset properties until
you are happy.
To Code or Not to Code
• You don’t have to memorize SAS code syntax
anymore!
• Enterprise Guide builds SAS code for you and
passes it on to the SAS analysis and
visualization engines.
• If you want to learn the programming, you can
see the code as it develops and if you already
know some SAS, all the procedures are readily
available to you. You can augment the code that
EG writes or you can get an enhanced editor to
write code from scratch.
Excel and Analysis Software
• Everyone wants to import data from Excel. Use
extreme caution with every package that can
read in Excel files, even other Microsoft
programs.
– There is a common, but not commonly known, bug
with how Windows processes Excel files. The
fundamental issue is how it figures out if a column in
Excel has character or numeric data. If it thinks that
the column is character data, no problem, but what
happens when it thinks a column has numbers, and
part way down the column, it has some letters in a
cell? Sometimes the cells with the characters are
unceremoniously blanked out!
• You can end up with missing data.
[Screenshots: the import result in R and in SAS]
The Registry
• Deep inside of Windows is a repository of
information on all the software on your
computer. It is called the registry.
• In the registry there is a key that tells
applications which are talking to Excel how
many rows to check, going down a column, to
figure out if a column should be called character
or numeric.
– It is set by default to only look in the first 8 rows!!!!!
So if you have character data for the first time in a cell
after the first 8 rows, it guesses incorrectly that you
have only numeric data in the column and your
character cells will be erased without warning on
import.
You can fix this.
• Make sure to follow these instructions carefully. If you
tweak the wrong thing in the registry you can render your
machine unable to reboot!
• Click the Windows Start menu and choose Run
• In the dialog type regedit and click ok
• Open up the tree to this path
•
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel
• Double click TypeGuessRows
• Type 0, that is zero not the letter o, in the DWORD editor
and click ok
• Microsoft ACCESS will silently change this setting!
– So watch this setting if you use ACCESS.
After the Tweak…Back to EG
• Even with that fix, the typical “import” menu choice can still be
problematic in EG (and also in the other packages).
• After you tweak the registry, you can copy and paste a tiny
little program and use it to import your data correctly.
/* Import into a temporary file called bugged that will
   disappear when you quit EG. */
proc import out = bugged datafile = "C:\blah\bugged.xls" replace;
    mixed = YES;   /* read columns that mix numbers and text as character */
run;

data bugged;
    set bugged;
run;
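After the import runs, it is worth checking that nothing was blanked out. The following is a minimal sketch of one way to do that, assuming the program above created the temporary dataset WORK.BUGGED; the format names missfmt and $missfmt are made up for this illustration.

```sas
/* Sketch only: assumes the import above created WORK.BUGGED. */

/* 1. List every column with its type (Char or Num). */
proc contents data = work.bugged varnum;
run;

/* 2. Count missing versus non-missing values in every column.
      The format names below are hypothetical. */
proc format;
    value $missfmt ' ' = 'Missing' other = 'Not missing';
    value  missfmt  .  = 'Missing' other = 'Not missing';
run;

proc freq data = work.bugged;
    format _character_ $missfmt. _numeric_ missfmt.;
    tables _all_ / missing nocum nopercent;
run;
```

If a column shows more missing cells than the spreadsheet has blanks, the import is the first suspect.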
Importing
• You want to add an import data icon to the
flowchart.
• You can go to the file menu and choose “Import
Data ….” but given the Excel problem, tread with
extreme caution.
• Instead, I recommend going to the file menu and
choosing New > Code and then pasting in the
code from the last slide:
Importing with EG
1) Paste in the code
2) Push the play button
Imported
• There are many things to notice when you play
(aka run) the code.
• A view table opens.
• The red letter A in the column heading means
that the data imported as a column of character
data. The fact that this is character data will
prevent you from doing numeric statistics on this
column.
• The Project Explorer gets a log entry and a data
object added to the “tree”. The Process Flow
gets a dataset item. The selected dataset now
shows up on the task/button bar.
Column Headings in Excel
• You can have anything you want for the column
headings in your Excel files but you will save
yourself many headaches if you make the
heading into a single “word” with no punctuation.
• In Excel, change your column heading “is
dead…” into isDead. This convention will let you
read the data into any popular analysis package
without having to rename over and over again.
More Complicated Import
• You can import worksheets with many columns and
use named tabs in the Excel workbook. Add in
another code object, then paste in:
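The code from the original slide is not reproduced here. As a hedged sketch of what such a program could look like, the example below reads one named tab from a workbook. The dataset name, workbook path, and tab name are all hypothetical, and it assumes the Excel import engine (dbms = excel) is available in your SAS installation.

```sas
/* Sketch only: every name and path below is a placeholder. */
proc import out      = work.visits            /* hypothetical dataset name */
            datafile = "C:\blah\clinic.xls"   /* hypothetical workbook     */
            dbms     = excel
            replace;
    sheet    = "Visits";   /* hypothetical tab name in the workbook   */
    getnames = yes;        /* use the first row as the column names   */
    mixed    = yes;        /* keep mixed number/text columns as text  */
run;
```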
Storing Data with the Project
• The datasets I have shown you so far are kept in
a temporary store called “work” and when you
quit EG, it will forget the data.
• You can permanently store the data in a format
EG knows. You do this by setting up a “library”
using the Tools menu. This is just a “pointer to”
or “alias for” a folder on your hard drive.
Use 8 or fewer letters for the library name
Click on Local
Push Next if you are not accessing a database
SAS can read and write many files seamlessly. If you specify an Engine other than BASE you can access different types of files in the folder (e.g., SPSS .sav files). You can even have a library refer to an Excel workbook. That voodoo lets you work directly inside of Excel files from within SAS.
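In code, pointing a library at an Excel workbook might look like the sketch below. It assumes SAS/ACCESS Interface to PC Files is licensed (it supplies the excel engine); the libref xl and the workbook path are hypothetical.

```sas
/* Sketch only: assumes the excel LIBNAME engine is available. */
libname xl excel "C:\blah\bugged.xls";   /* hypothetical libref and path */

/* List the worksheets and named ranges SAS can see in the workbook. */
proc datasets library = xl;
quit;

/* Release the workbook so Excel can open it again. */
libname xl clear;
```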
Name it something!
• Once you have the library, change the label on
the flowchart. You can click on the icon in the
process flow and then push the F2 key on your
keyboard or click on the name in the project
explorer and when it highlights, type over the
name.
• You can change the name on any object like
this.
You can put any label you want into the flowchart but you probably want to use the real library name.
Libraries
• Once the library is set, SAS is aware that
there is a place that can hold data, but you
still need to tell it to move the data to the
permanent store.
• Next is an example where the data gets
stored in a permanent library on your hard
drive.
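For those who prefer code to the wizard, a minimal sketch of the same idea follows. It assumes a library named BABIES, a hypothetical folder path that already exists on disk, and the temporary WORK.BUGGED dataset from the earlier import.

```sas
/* Sketch only: the folder path is a placeholder and must already exist. */
libname babies "C:\blah\babies";

/* Copy the temporary dataset into the permanent library. */
data babies.bugged;
    set work.bugged;
run;
```

The libname statement only creates the pointer; the data step is what actually writes the file into the folder.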
Working with Data
• You can import, then fix, problematic data and
then save a permanent copy in a library.
Double click the table icon to view the table.
• You will be asked if you really want to change
the data when you start to type the changes.
Keeping Data
• I recommend moving the data to a
permanent library before you make any
changes. First make a library. Next load
the data into the temporary work spot.
Single click the data set, then make a
query using the “Filter and Query…” menu
option on the Data menu.
Drag and drop the variables or double click them.
Name your query.
Push Change to change the destination to be the library.
Navigate to the BABIES library and assign a name to the data set. Use a single word of fewer than 31 letters, with only letters in the name.
When you are done, push Run.
Add in a note icon (from the File > New menu) and always document every change.
Save your project!
• EG is very stable if it is patched up to date
but I save often.
[Screenshots: the project before the first save and after the first save]
Patch Software
• The first release of EG 4.1 had major
crashing issues.
• You will want to get the SAS patches and
EG patches from the web:
ftp.sas.com/techsup/download/hotfix/e9_win_sbcs.html
ftp.sas.com/techsup/download/hotfix/ent_guide41.html
Returning to a Project
• When you get back to a project the files that were
imported into the temporary work location are
gone and SAS will have forgotten about the library
reference. The files in the permanent library are
there. You just need to tell SAS to remember the
library.
• If you double click a view table immediately after
you return, you will run into trouble. Just rerun the
code.
Returning to a Project(2)
• Replay the import code if you need the
temporary files.
• To replay the library creation, click on the library
and use the play button or right click on the
library and choose Run.
So far…
• That is the process for importing, cleaning
and saving your data.
• Next you will want to visualize and do
statistics on the data.
For the Old Timers
• If you were exposed to SAS prior to EG
you were told a LOT about data steps
(keyword DATA) and analysis procedures
(keyword PROC). Everything you learned
still works in a code object in the flowchart.
The data manipulation you used to do in data steps
is now done by EG-created code, mostly with SQL
(what is called PROC SQL in SAS 9.1.3),
and the procs are now nicely hidden away
in the menus.
Coding
• Should you choose to write some
code, I have written macros to
have SAS autocomplete with the
syntax for the procs. Download
the file here:
www.stanford.edu/class/hrp223/2006/programming/macros.kmf
• To get this functionality, open a
code object then go to the Code
menu and choose “Editor
Macros…” and “Macros…”
• Click “Import…” then select the
file that you downloaded.
• Then as you type in procedures,
you can hit the tab key to have it
complete your syntax.
Why would you code?
• You need tools to do data validation and to fix
systematic problems. You can manually change
your dataset like you would in Excel by changing
one cell at a time or you can learn to write basic
data step code. The right way to clean data is
with code.
– You want to have an audit trail of every change to
your data.
• Check out HRP 223 if you need to do real data
management with validation.
www.stanford.edu/class/hrp223/
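To make the "clean with code" point concrete, here is a small hedged sketch of a data step that fixes two systematic problems. The column names sex and weight are hypothetical, as is the output name bugged_clean; the point is that the program itself is the audit trail.

```sas
/* Sketch only: sex (character) and weight (numeric) are hypothetical columns. */
data work.bugged_clean;
    set work.bugged;

    /* Systematic fix 1: standardize the coding of sex. */
    sex = upcase(strip(sex));

    /* Systematic fix 2: a weight of zero or less is impossible,
       so set it to missing rather than keep a bad number. */
    if not missing(weight) and weight <= 0 then weight = .;
run;
```

Because the original dataset is left untouched and the changes live in a saved program, anyone can rerun or review exactly what was done.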
Data Menu for Old Timers
EG does provide you with a data manipulation menu. Data validation is limited and is buried within the “Filter and Query…” menu item!
Procedures behind the menu items (screenshot callouts):
– Proc sql
– Proc sort. Sort can also find and remove duplicates.
– proc sql or proc append or data step
– Proc format. Formats are very important tools for changing the appearance of variables (e.g., how dates and dollars look).
– Proc transpose
– Proc surveyselect
– Proc rank
– Proc standard
– Proc datasets / contents
– Proc compare. Want to know how two spreadsheets or datasets differ? Use Compare.
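Two of these are easy to try in a code object. The sketch below assumes the WORK.BUGGED dataset from the import and a permanent copy named BABIES.BUGGED; the folder path and the output name bugged_nodup are hypothetical.

```sas
/* Sketch only: paths and dataset names are placeholders. */
libname babies "C:\blah\babies";   /* folder must already exist */

/* Remove rows that are exact duplicates across every column. */
proc sort data = work.bugged out = work.bugged_nodup nodupkey;
    by _all_;
run;

/* Report how the temporary copy differs from the permanent copy. */
proc compare base = babies.bugged compare = work.bugged;
run;
```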
The Describe Menu
• You use this menu for summary descriptions
of both character and numeric data.
Proc means +
Proc print
Proc tabulate
Proc Means + univariate graphics
Proc univariate
Proc freq and gplot
Proc tabulate
Proc freq
Proc freq
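If you want to see roughly what these menu items hand to SAS, the following is a hedged sketch written by hand (it is not the code EG generates). It assumes the WORK.BUGGED dataset from the import and that it has at least one numeric and one character column.

```sas
/* Sketch only: hand-written equivalents, not EG's generated code. */

/* Summary statistics for every numeric column. */
proc means data = work.bugged n nmiss mean std min max;
    var _numeric_;
run;

/* One-way frequency tables for every character column. */
proc freq data = work.bugged;
    tables _character_ / missing;
run;

/* Distribution details (quantiles, extremes) for the numeric columns. */
proc univariate data = work.bugged;
    var _numeric_;
run;
```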
The Graph Menu
• These plots are almost all done with proc gplot
with 3D defaults. Don’t ever use the fancy 3D
effects unless you are modeling 3D data!
• Right click on all the graphics
objects in the flowchart,
choose Properties and
uncheck the 3D effects box.
Adding to the Graphics
• A couple months back, SAS added in the ability to do
very basic interactive exploratory graphics.
www.sas.com/apps/demosdownloads/setupcat.jsp?cat=SAS+Enterprise+Guide
Once installed, you get an extra item on the Graph menu.
You can easily add a second plot to the display.
You can click a point and then click “Show Selected Data” and subset the displayed data.
Analyze Menu
• The core statistics from SAS/BASE,
SAS/STAT and SAS/QC modules of SAS
are under the Analyze menu.
• Some univariate
statistics are
mixed into the
Describe menu.
Getting Analyses
• These menus work by assigning variables to
roles.
– Add the sample Cars data set to a new project
• File > Open > Data…
– Browse the data
– Get descriptive statistics and graphics
Notice you can only use numbers here.
You can choose other statistics.
All objects in the flowchart have properties.
If you wanted a Microsoft Word document instead of HTML, you can change the output format easily by tweaking the properties.
1) Click on results.
2) Check to override defaults.
3) Check output you want.
4) Click ok.
5) Rerun the analysis.
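In a code object, the same switch from HTML to a Word-readable file can be sketched with ODS. This assumes the Cars data is available as WORK.CARS with a numeric weight column; the dataset name, column name, and output path are assumptions for illustration.

```sas
/* Sketch only: dataset, column, and file path are placeholders. */
ods rtf file = "C:\blah\cars_summary.rtf" style = journal;

proc means data = work.cars n mean std min max;
    var weight;
run;

ods rtf close;   /* the file is not complete until the destination is closed */
```

RTF files open directly in Microsoft Word, and the style option controls the overall look of the tables.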
About those options…
• Go to the Tools > Options… menu item
and tweak the settings. You can always
reset them to default.
I quickly got tired of telling it to overwrite the old output.
You can see and tweak the general output appearance templates here.
About those options…(2)
The default text was not useful.
Printing the name of the analysis procedure is not useful.
Roles in Analyses
Notice cars is selected.
• Say you want to compare the average weight for American vs. Japanese cars.
• You need to tell the t-test the name of the variable that has the weight that will be used in comparing the averages and the grouping variable that says if a car is from the USA or Japan.
Notice it is saying it needs a grouping variable.
Notice the symbol. This can be character or numeric data.
The analysis variable (weight in this case) has to be numeric.
You can drag and drop the variables into the boxes to fill the roles or click one and push the arrow.
A t-test compares two means and the Country variable has 3 levels (USA, Japan and other). Happily the program complains. Ideally it would advise doing an ANOVA…. Oh well.
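For the three-level case, a one-way ANOVA is one option. A minimal sketch follows, assuming the Cars data is available as WORK.CARS with a grouping column named country and a numeric column named weight (names taken from the description above, but the exact spelling in the sample data is an assumption).

```sas
/* Sketch only: dataset and column names are assumptions. */
proc glm data = work.cars;
    class country;             /* grouping variable with 3 levels */
    model weight = country;    /* one-way ANOVA on weight         */
    means country;             /* print the mean weight per group */
run;
quit;
```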
Building a Data Set
• To subset the data down:
– Click on the dataset.
– Choose “Filter and Query…” from the Data menu.
– Drag and drop in all the variables (or a subset if you prefer
minimal datasets) into the “Selected Data” window/tab.
– Click on the filter data tab and move over the country
variable.
You can control click to pick more than one choice at once.
Name the query and the new dataset.
Do the analysis on the new data.
• Click on the new data set.
• Click the t-test menu item.
• Fill in the blanks.
Notice it is saying it needs an analysis variable.
ALWAYS plot your analyses!
As you are setting up your analysis you can look at the code to learn SAS coding.
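Written by hand, the subset, test, and plot could be sketched as below. This is not the code EG generates; it assumes WORK.CARS has a character column country with the values "USA" and "Japan" and a numeric column weight, and the output name cars_usa_japan is hypothetical.

```sas
/* Sketch only: dataset, column names, and values are assumptions. */

/* Keep only the two countries being compared. */
proc sql;
    create table work.cars_usa_japan as
    select *
    from work.cars
    where country in ("USA", "Japan");
quit;

/* Two-sample t-test: weight is the analysis variable,
   country is the grouping variable. */
proc ttest data = work.cars_usa_japan;
    class country;
    var weight;
run;

/* Always plot: side-by-side box plots (input must be sorted by group). */
proc sort data = work.cars_usa_japan;
    by country;
run;

proc boxplot data = work.cars_usa_japan;
    plot weight * country;
run;
```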
This is the display format set to “journal”.
The rest…
• The rest of the analyses work as easily as
the t-test.
• Be sure to look at the data visualization
options that go with every test.
• There is a very nice graphical user
interface for working with multiple tables
accessible from the “Filter and Query…”
tool. See the “Cow” bonus material.
Learning More
• The Little SAS Book for Enterprise Guide 4.1 by Slaughter and
Delwiche is a nice friendly introduction.
• There is a free tutorial at SAS:
www.sas.com/apps/elearning/elearning_courses.jsp?cat=Free+Tutorials
• There are fairly inexpensive additional online tutorials:
www.sas.com/apps/elearning/elearning_courses.jsp?cat=SAS+Enterprise+Guide
• There are many “course notes” through SAS but they usually
are written for EG 3, not 4. Ask me before you spend the money.
• The only beyond-the-basics book for EG 4 is Statistics Using
SAS Enterprise Guide by James Davis:
www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=57255
Next?!!!
• If you need pre-grant statistical support, come talk to us at
SPCTRM!
clinicaltrials.stanford.edu
• If you are interested in more education opportunities, visit
clinicaltrials.stanford.edu/education
• You can sign up for our class mailing list using the form on that page.
• Next quarter I am teaching data management with SAS. It will be
using EG and will have a focus on programming.
www.stanford.edu/class/hrp223/
• … and I will be teaching first quarter graduate level statistics
(hrp259) using EG.
• Also, Lane and SPCTRM are collaborating to give a 3-day short
course on R/Rcmdr but it is very full.
lane.stanford.edu/services/workshops/laneclasses.html
Cows…
The bonus level!
The Task
• I have some cows whose milk is graded. I
want to give scores of 80 or higher “Pass”
and below 80, “Fail”. The cows were
assigned IDs to anonymize them for the
study but now that I have the scores, I
want to use their names.
Loading the Data
• I could make the files in Excel but just to
be different, I did it with code.
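The code from the slide is not reproduced here. A sketch of how two such tables could be built in a code object follows; the table names, IDs, cow names, and grades are all made up for illustration.

```sas
/* Sketch only: all values below are invented example data. */
data work.names;
    input id name $;
    datalines;
1 Bessie
2 Clover
3 Daisy
;
run;

data work.grades;
    input id grade;
    datalines;
1 92
2 80
3 71
;
run;
```

Including a score of exactly 80 is deliberate, since that boundary value matters when the grades are recoded later.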
Add in the Names to the Grades
1) Click on the grades table.
2) Use “Filter and Query” from the Data menu.
3) Click on Add Tables.
4) Find the data on the local SAS server because the data is already in the project in the temporary work location.
The tables are related.
• The ID variable in the two tables explains which name is for which grade. Push the tiny “Join…” button and it notices the common variable name (id) and links the two tables.
This is an equijoin. Only records with a matching ID in both tables will be in the final dataset.
Preliminary Look
• Add the name and grade variables (and
the ID if you like) to the Select Data tab
and perhaps use the Sort Data tab to
order by name. Then run the query.
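The query built by pointing and clicking corresponds roughly to the hand-written PROC SQL sketch below. It assumes two tables in the temporary work location, WORK.GRADES (id, grade) and WORK.NAMES (id, name); the table and output names are assumptions, and the code EG actually generates will look different.

```sas
/* Sketch only: assumes WORK.GRADES(id, grade) and WORK.NAMES(id, name). */
proc sql;
    create table work.named_grades as
    select n.id,
           n.name,
           g.grade
    from work.grades as g
         inner join work.names as n   /* equijoin: keep only matching IDs */
         on g.id = n.id
    order by n.name;
quit;
```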
Create the Pass/Fail Score from the Grade
• Double click on the query to reopen it for editing.
Then push the Computed Columns button and
the New button and pick “Recode a Column…”
and pick grade.
• Recoding categorical data is easy. There
is a gotcha when working with continuous
scale data.
1) Name the variable you are making.
2) Specify that the new variable is a character string.
3) Push Add… to tell it how to categorize the data.
The gotcha is how it will handle scores that are 80.
Specify the value when the values are out of range.
Remember to name your query and the new dataset.
Trouble with less-than-or-equal-to
• The new column looks good except 80
went into the lower category. Open the
“Last Submitted Code” to fix this.
Fixing the Range
• While the code is mostly unintelligible, you can
quickly see that it is using less-than-or-equal-to
80 to call the sample “Fail”. Just tweak it to less-than.
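Stripped of the generated clutter, the corrected recode amounts to something like the sketch below. It is written by hand, not copied from EG, and assumes a table WORK.GRADES with numeric columns id and grade; the output table and the column name result are hypothetical.

```sas
/* Sketch only: assumes WORK.GRADES(id, grade); other names are placeholders. */
proc sql;
    create table work.pass_fail as
    select id,
           grade,
           case
               /* SAS sorts missing below every number, so test it first */
               when grade is missing then " "
               when grade < 80 then "Fail"   /* strictly less than 80 */
               else "Pass"                   /* 80 or higher          */
           end as result
    from work.grades;
quit;
```

With a strict less-than, a score of exactly 80 falls through to the else branch and is labeled "Pass", which matches the stated rule.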