Welcome to the Data Analysis in SAS class!

Where to find this presentation and data

http://www.lisa.stat.vt.edu

Short Courses

“Data Analysis in SAS”

Course Materials

Download Data to Desktop
• Please log into Windows System!
Data Analysis in SAS
Zhe Bao
Dept. of Statistics & Dept. of Biological Sciences
December 02, 2014
* Credit to Mark Seiss and Matthew Lanham for materials of this presentation
Laboratory for Interdisciplinary
Statistical Analysis
LISA helps VT researchers benefit from the use of
Statistics
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Designing Experiments • Analyzing Data • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions
that take less than 30 minutes. See our website for additional times and locations.
All services are FREE for VT researchers. We assist with research—not class projects or homework.
www.lisa.stat.vt.edu
Why SAS?
SAS versus R
• SAS requires a paid license; R is open source.
• Any statistical analysis you can do in SAS, you can probably also do in R.
• SAS is commercial software, so additions take longer. R has new libraries added frequently (though some may be unreliable).
• SAS help and documentation are professionally maintained. R documentation is lean.
• SAS handles large data sets with ease. R stores everything in RAM, which makes it vulnerable when data sets are large.
• SAS and R are both fairly easy to learn, but R feels more like natural language programming.
Popularity
• TIOBE Software, which ranks software popularity, currently has SAS ranked #22 and R #24.
• KDnuggets 2013 software poll for data science or big data: R (37%), SAS (11%), MATLAB (10%)
• Rblogger.com - Number of jobs on Indeed.com requesting skills in analytics: #1 SAS (12,272), #2 SPSS (3,289), #3 R (1,693)
SAS and R comparison code
http://sas-and-r.blogspot.com/p/statistics-examples.html
Presentation Outline
1. Introduction to the SAS Environment
2. Data Manipulation
3. Summary Procedures
4. Basic Statistical Analysis Procedures
• Linear regression and ANOVA
• Logistic regression
Reference Material

The Little SAS Book – Delwiche and Slaughter

SAS Programming I: Essentials
SAS Programming II: Data Manipulation
Techniques (by SAS)

Presentation and Data

http://www.lisa.stat.vt.edu

http://www.ats.ucla.edu/stat/sas
Presentation Outline
Questions/Comments
Individual Goals/Interests
Part I: Introduction to the
SAS Environment
1. SAS Programs
2. SAS Data Sets and Data Libraries
3. SAS System Help
4. Creating SAS Data Sets
SAS Environment
When you begin SAS, you should see 3 windows by default:
• Log window
• Editor window
• Output window
SAS Environment
On the left hand side is where you will find:
1. Explorer window – allows you to view and manage your SAS files
2. Results window – helps you navigate and manage output from submitted programs. Uses a tree structure to list various types of output.
SAS Programs
• Editor Window
• File extension - .sas
• SAS program – sequence of steps that the user submits for execution
• Editor window has four uses:
• Access and edit existing SAS programs
• Write new SAS programs
• Submit SAS programs for execution
• Save SAS programs
• Submitting SAS programs
• Entire program or a selected portion of the program
• Keyboard shortcuts
• Tips: change preferences, show line numbers
SAS Programs
• Syntax Rules for SAS statements
• Free-format – can use upper or lower case
• Usually begin with an identifying keyword (data, proc, by etc.)
• Always end with a semicolon “;”
• Can span multiple lines
• Multiple statements can be on the same line
• Add comments to make the program easier to read: * comment; or /* comment */
• Errors
• Misspelled keywords or file names
• Missing or invalid punctuation (a missing semicolon is common)
• Invalid options
• Indicated in the Log window
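A minimal sketch of the syntax rules above. The data set name work.demo and its variables are made up for illustration and are not part of the course data.

```sas
/* Block comment: keywords may be written in upper or lower case. */
* Statement comment: starts with an asterisk and ends with a semicolon;

DATA work.demo;             /* statement begins with the keyword DATA */
   x = 1; y = 2;            /* two statements on one line */
   total = x
         + y;               /* one statement spanning two lines */
RUN;

proc print data=work.demo;  /* lower case works just as well */
run;
```

Deleting any one of the semicolons and resubmitting is a quick way to see how the Log window reports a syntax error.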
SAS Programs
• Two Basic steps in SAS programs:
• Data Steps
• Typically used to create SAS datasets and manipulate data,
• Begins with DATA statement
• Proc Steps
• Typically used to process SAS data sets and carry out statistical
analysis
• Begins with PROC statement
• The end of a DATA or PROC step is indicated by:
• RUN statement – most steps
• QUIT statement – some steps
• Beginning of another step (DATA or PROC statement)
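The three ways a step can end might look like this in practice. This is only a sketch; work.one is a hypothetical data set created for the example.

```sas
data work.one;                      /* DATA step, ended by RUN */
   x = 1;
run;

proc print data=work.one;           /* PROC step with no RUN of its own:   */
proc means data=work.one;           /* it ends where the next step begins  */
run;

proc datasets library=work nolist;  /* interactive procedure, ended by QUIT */
   delete one;
quit;
```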
SAS Programs
• Output generated from SAS program – 2 Windows
• SAS log
• Information about the processing of the SAS program
• Includes any warnings or error messages
• Accumulated in the order the data and procedure steps are
submitted
• SAS output
• Reports generated by the SAS procedures
• Accumulates output in the order it is generated
• Clear the log window: DM "log; clear;";
SAS Data Sets and Data Libraries
• SAS Data Set
• Specifically structured file that contains data values.
• File extension - .sas7bdat
• Rows and Columns format – similar to Excel
• Columns – variables in the table corresponding to fields of data
• Rows – single record or observation
• Two types of variables
• Character – can contain any characters (letters, numbers, symbols, etc.)
• Numeric – floating point numbers (including dates and times)
(More on variable attributes: http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001103996.htm)
• Located in SAS Data Libraries
SAS Data Sets and Data Libraries
• SAS Data Libraries
• Contain SAS data sets
• Identified by assigning a library reference name – libref
• Temporary
• Work library
• SAS data files are deleted when session ends
• Library reference name not necessary
• Permanent
• SAS data sets are saved after session ends
• SASUSER library
• You can create and access your own libraries
SAS Data Sets and Data Libraries
• SAS Data Libraries (cont.)
• Assigning library references
• Syntax
LIBNAME libref 'SAS-data-library';
• Rules for Library References
• 8 characters or fewer
• Must begin with a letter or underscore
• Other characters are letters, numbers, or underscores
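As a sketch, assigning and releasing a library reference could look like the following. The libref mylib and the folder path are hypothetical; substitute a folder that exists on your machine.

```sas
LIBNAME mylib 'C:\Users\student\Desktop\sasdata';  /* hypothetical folder */
LIBNAME mylib LIST;    /* writes the library's attributes to the log */
LIBNAME mylib CLEAR;   /* releases the libref; files on disk are untouched */
```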
SAS Data Sets and Data Libraries
• SAS Data Libraries (cont.)
• Identifying SAS data sets within SAS Data Libraries
libref.filename
• Accessing SAS data sets within SAS Data Libraries
Example: DATA new_data_set;
set libref.filename;
run;
• Creating SAS data sets within SAS Data Libraries
Example: DATA libref.filename_new;
set old_data_set;
run;
If we closed our SAS session now what do you think would happen?
SAS System Help
• SAS Help and Documentation
• Help → SAS Help and Documentation
• Icon
• SAS Online Help
• http://support.sas.com/
Creating SAS Data Sets
• Creating SAS data sets from raw data
• Three common methods:
1. Importing existing data sets using the Import Wizard
2. Importing existing raw data in a SAS program using PROC IMPORT
3. Manually entering raw data in a SAS program using a DATA step
DATA libref.filename;
INPUT <variables>;
DATALINES;
<data goes here>
;
RUN;
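A hedged sketch of methods 2 and 3. The folder in DATAFILE= is an assumption, as is the guess that the text file has a header row; work.scores and its values are made up for illustration.

```sas
/* Method 2: PROC IMPORT for a tab-delimited text file. */
PROC IMPORT DATAFILE='C:\Users\student\Desktop\State_region_data.txt'
            OUT=work.state_region_data
            DBMS=TAB
            REPLACE;
   GETNAMES=YES;   /* assumes the first row holds the variable names */
RUN;

/* Method 3: type the data into the program.
   A $ after a variable name marks it as character. */
DATA work.scores;
   INPUT name $ score;
   DATALINES;
Ann 90
Bob 85
;
RUN;
```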
Creating SAS Data Sets
• Using the import data menu option
1. File → Import Data
2. Standard data source → select the file format
3. Specify the file location or Browse to select the file
4. Create a name for the new SAS data set and specify its location
5. Click “Finish”
6. Review the log for errors
7. Review the data file
Creating SAS Data Sets
• Compatible file formats
• Microsoft Excel Spreadsheets
• Microsoft Access Databases
• Comma-Separated Values Files (.csv)
• Tab Delimited Files (.txt)
• dBASE Files (.dbf)
• JMP data sets
• SPSS Files
• Lotus Spreadsheets
• Stata Files
• Paradox Files
• ……
Creating SAS Data Sets
Assignment
Import State_SAT_data.xls → assign as work.state_sat_data.sas7bdat
Import State_region_data.txt → assign as work.state_region_data.sas7bdat
Creating SAS Data Sets
• Example Data Sets
• 1) Excel File – State_SAT_data.xls
• http://www.stat.ucla.edu/labs/datasets/sat.dat
• Extracted from the 1997 Digest of Education Statistics, an annual publication of the U.S. Department of Education
• Contains variables that show the relationship between public school expenditure and SAT performance
• Variables:
– State (state)
– Current expenditure per pupil (expend)
– Average pupil to teacher ratio (PT_ratio)
– Estimated annual salary of teachers (salary)
– Percentage of eligible students taking the SAT (students)
– Average verbal SAT score (verbal)
– Average math SAT score (math)
– Average total score (total)
Creating SAS Data Sets
• Example Data Sets (Cont.)
• 2) Text file – State_region_data.txt
• Contains region assignments for each state
• 1 = New England
• 2 = Middle Atlantic
• 3 = East North Central
• 4 = West North Central
• 5 = South Atlantic
• 6 = East South Central
• 7 = West South Central
• 8 = Mountain
• 9 = Pacific
Part I: Introduction to the
SAS Environment
Questions/Comments
Part II: Data Manipulation
1. Data Set Information
2. Data Set Manipulation
• Data Set Processing
3. Combining Data Sets
A. Concatenating/Appending
B. Merging
4. Saving Data Sets
Data Set Information
• Proc Contents
• Output a table of contents/structure of the specified data set
• Data Set Information
• Data set name
• Number of observations
• Number of Variables
• Variable Information
• Type (numeric or character)
• Length
• Formats
• Syntax:
PROC CONTENTS DATA=libref.input_data_set <options>;
RUN;
Data Set Information
Assignment
Obtain Data Set Information for work.state_sat_data and
work.state_region_data
You can try to add these useful options:
position, short, out=filename noprint
Data Set Information
Solution
proc contents data=state_sat_data out=state_sat_contents noprint;
run;
proc contents data=state_region_data;
run;
Data Set Manipulation
• Create a new SAS data set using an existing SAS data set
as input with modifications
• Specify name of the new SAS data set after the DATA statement
• Use SET statement to identify SAS data set being read
• Syntax:
DATA output_data_set;
SET input_data_set;
<additional SAS statements>;
RUN;
• By default the SET statement reads all observations and variables
from the input data set into the output data set.
• Create new variables or Change variable formats
Data Set Manipulation
• Assignment Statements
• Evaluate an expression
• Assign resulting value to a variable
• General Form: variable = expression;
• Example: miles_per_hour = distance/time;
Note: make sure the order of variables in statements is correct!
• SAS Functions
• Perform arithmetic functions, compute simple statistics, manipulate
dates, etc.
• General Form: variable=function_name(argument1, argument2,…);
• Example: Time_worked = sum(Day1,Day2, Day3, Day4, Day5);
• More useful functions:
http://www2.sas.com/proceedings/sugi25/25/btu/25p057.pdf
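A small sketch of an assignment statement and a function call side by side. The data set work.trips and its values are invented for illustration; the point is that SUM skips missing arguments while the + operator does not.

```sas
DATA work.trips;
   INPUT distance time Day1 Day2 Day3;
   miles_per_hour = distance / time;         /* assignment statement */
   time_worked    = SUM(Day1, Day2, Day3);   /* SUM ignores missing arguments */
   time_added     = Day1 + Day2 + Day3;      /* + gives missing if any term is missing */
   DATALINES;
120 2 8 . 7
90 1.5 6 8 8
;
RUN;

PROC PRINT DATA=work.trips;
RUN;
```

In the first row Day2 is missing, so time_worked is 15 while time_added is missing.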
Data Set Processing
• Data Set Processing
• “DATA steps execute line by line and observation by observation.” DATA steps read in data from existing data sets or raw data files one row at a time, like a loop
• DATA step reads data from the input data set in the following way:
1. Read the current row from the input data set into the Program Data Vector (PDV)
2. Process SAS statements
3. Write the PDV to the output data set
4. Set the current row to the next row in the input data set
5. Return to Step 1
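One way to watch this loop is to print the automatic variable _N_, which counts DATA step iterations. The sketch below uses sashelp.class, a sample table that ships with SAS, rather than the course data; the variable ratio is made up for the example.

```sas
DATA work.pdv_demo;
   SET sashelp.class;         /* read one row into the PDV */
   ratio = weight / height;   /* process statements on that row */
   PUT _N_= name= ratio=;     /* write the current PDV values to the log */
RUN;                          /* the row is written out at the end of each iteration */
```

The log shows one line per observation, in the order the rows were read.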
Data Set Manipulation
• Selecting Variables
• Use DROP and KEEP to determine which variables are written to
new SAS data set.
• 3 Ways
• DROP and KEEP as statements
– Form: DROP Variable1 Variable2;
KEEP Variable3 Variable4 Variable5;
• DROP and KEEP options in SET statement
– Form: SET input_data_set (KEEP=Var1 Var2);
• DROP and KEEP options in data statement
– Form: DATA output_data_set (KEEP=Var1 Var2);
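Notice the difference: KEEP= or DROP= on the SET statement controls which variables are read into the step, while the DROP and KEEP statements and the options on the DATA statement control which variables are written to the new data set.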
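The read-side versus write-side distinction can be sketched with sashelp.class (a sample table shipped with SAS, not the course data). The output data set names are hypothetical.

```sas
/* KEEP= on SET: height is never read, so ratio1 is missing for
   every row and the log notes that height is uninitialized. */
DATA work.read_side;
   SET sashelp.class (KEEP=name weight);
   ratio1 = weight / height;
RUN;

/* KEEP= on DATA: every variable is read, so ratio2 is computed,
   but only name and ratio2 are written to the output data set. */
DATA work.write_side (KEEP=name ratio2);
   SET sashelp.class;
   ratio2 = weight / height;
RUN;
```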
Data Set Manipulation
• Conditional Processing
• Uses IF-THEN-ELSE logic
• General Form: IF <expression1> THEN <statement>;
ELSE IF <expression2> THEN <statement>;
ELSE <statement>;
• <expression> is a true/false statement, such as:
• Day1=Day2, Day1 > Day2, Day1 < Day2
• Day1+Day2=10
• Sum(day1,day2)=10
• Day1=5 and/or Day2=5
Data Set Manipulation
• Conditional Processing
Symbolic    Mnemonic          Example
=           EQ                IF region = 'Spain';
~= or ^=    NE                IF region ne 'Spain';
>           GT                IF rainfall > 20;
<           LT                IF rainfall lt 20;
>=          GE                IF rainfall ge 20;
<=          LE                IF rainfall <= 20;
&           AND               IF rainfall ge 20 & temp < 90;
| or !      OR                IF rainfall ge 20 OR temp < 90;
            IN                IF region IN ('Rain', 'Spain', 'Plain');
            IS NOT MISSING    WHERE region IS NOT MISSING;
            BETWEEN-AND       WHERE region BETWEEN 'Plain' AND 'Spain';
            CONTAINS          WHERE region CONTAINS 'ain';
Note: IS NOT MISSING, BETWEEN-AND, and CONTAINS are valid only in WHERE expressions (the WHERE statement or the WHERE= data set option), not in IF statements.
Data Set Manipulation
• Conditional Processing (cont.)
• If <expression1> is true, <statement> is processed
• ELSE IF and ELSE are only processed if <expression1> is false
• Only one statement specified using this form
• Use DO and END statements to execute a group of statements
• General Form: IF <expression> THEN DO;
<statements>;
END;
ELSE DO;
<statements>;
END;
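A minimal sketch of DO and END groups, again using the sashelp.class sample table. The variables group and older_flag are made up for the example.

```sas
DATA work.class_groups;
   SET sashelp.class;
   LENGTH group $ 8;        /* declare the length first so 'younger' is not truncated */
   IF age >= 14 THEN DO;
      group = 'older';
      older_flag = 1;
   END;
   ELSE DO;
      group = 'younger';
      older_flag = 0;
   END;
RUN;
```

Without the LENGTH statement, group would take its length from the first assignment ('older', 5 characters).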
Data Set Manipulation
• Subsetting Sample (Observations)
• We will look at two ways
• Using IF statement
• Using WHERE option in SET statement
• To select a random sample: PROC SURVEYSELECT
• IF statement
• Only writes observations to the new data set in which an
expression is true;
• General Form: IF <expression>;
• Example: IF career = 'Teacher';
IF sex ne 'M';
• In the second example, only observations where sex is not equal
to ‘M’ will be written to the output data set
Data Set Manipulation
• Subsetting Sample (Observations)
• WHERE Option in SET statement
• Use option to only read rows from the input data set in which the
expression is true
• General Form: SET input_data_set (where=(<expression>));
• Example: SET vacation (where=(destination='Bermuda'));
• Only observations where the destination equals ‘Bermuda’ will be
read from the input data set
• Comparison
• Resulting output data set is equivalent
• IF statement – all rows read from the input data set
• Where option – only rows where expression is true are read from
input data set
• Difference in processing time when working with big data sets
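The comparison can be tried on any table. The sketch below uses sashelp.class (not the course data) with hypothetical output names; PROC COMPARE is used here only as a convenient check that the two results match.

```sas
/* Subsetting IF: every row is read, then filtered. */
DATA work.girls_if;
   SET sashelp.class;
   IF sex = 'F';
RUN;

/* WHERE= option: only the matching rows are read. */
DATA work.girls_where;
   SET sashelp.class (WHERE=(sex = 'F'));
RUN;

/* Confirm that the two data sets hold the same values. */
PROC COMPARE BASE=work.girls_if COMPARE=work.girls_where;
RUN;
```

The "observations read" notes in the log differ between the two steps, which is where the time saving on large data sets comes from.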
Data Set Manipulation
• Assignments
1. Create new dataset work.state_SAT_data2 from
work.state_SAT_data
Assign a new variable → upper_ind
If total > 1000 then upper_ind=1
Otherwise upper_ind=0
2. Create new dataset work.south from work.state_region_data
Specify work.south contains only records from regions 5, 6, or 7
Specify work.south only contains the state variable
Data Set Manipulation
• Solutions
1.
data state_sat_data2;
set state_sat_data;
if total>1000 then upper_ind=1;
else upper_ind=0;
run;
Data Set Manipulation
• Solutions
2.
data south;
set state_region_data;
if region=5 or region=6 or region=7;
keep state;
run;
OR
data south;
set state_region_data(where=(region=5 or region=6 or region=7));
keep state;
run;
Data Set Manipulation
• PROC SORT sorts data according to specified variables
• General Form:
PROC SORT DATA=input_data_set <options>;
BY Variable1 Variable2;
RUN;
• Sorts data according to Variable1 and then Variable2
• By default, SAS sorts data in ascending order
• Numbers low to high
• Letters A to Z
• Use the DESCENDING keyword before a variable in the BY statement for numbers high to low and letters Z to A
• BY City DESCENDING Population;
• SAS sorts data first by City A to Z and then by Population high to low
Data Set Manipulation
• Some Options
• NODUPKEY
• Eliminates observations that have the same values for the BY variables
• NODUPRECS
• Deletes duplicate observations (exact match for all variables)
• OUT=output_data_set
• By default, PROC SORT replaces the input data set with the
sorted data set
• Using this option, PROC SORT creates a newly sorted data set
and the input data set remains unchanged
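A short sketch of these options on the sashelp.class sample table. The output data set names are hypothetical, and OUT= is used so the sample table itself is left untouched.

```sas
PROC SORT DATA=sashelp.class OUT=work.class_sorted;
   BY sex DESCENDING age;    /* sex A to Z, then age high to low within sex */
RUN;

PROC SORT DATA=sashelp.class OUT=work.one_per_sex NODUPKEY;
   BY sex;                   /* keeps the first observation for each value of sex */
RUN;
```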
Combining Data Sets
• Concatenating (or Appending)
• Stacks each data set upon the other
• If one data set does not have a variable that the other data sets do, the variable in the new data set is set to missing ('.' for numeric variables, blank for character variables) for the observations from that data set.
• General Form: DATA output_data_set;
SET data1 data2;
run;
• PROC APPEND may also be used
• If the two data files have different variable names for the same thing, you
can use RENAME in set statement.
• SET data1(RENAME=(var1=common_name))
data2(RENAME=(var2=common_name));
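The sketch below stacks two small data sets whose names and values are invented for illustration. One calls the amount sales and the other calls it revenue, so RENAME= aligns them.

```sas
DATA work.q1;
   INPUT id sales;
   DATALINES;
1 100
2 150
;
RUN;

DATA work.q2;
   INPUT id revenue region $;
   DATALINES;
3 120 East
4 90 West
;
RUN;

/* Stack q1 on top of q2; region is blank for the rows that come from q1. */
DATA work.both;
   SET work.q1
       work.q2 (RENAME=(revenue=sales));
RUN;

/* PROC APPEND adds q2's rows to q1 in place. FORCE lets it proceed even
   though region does not exist in the BASE data set (region is dropped). */
PROC APPEND BASE=work.q1 DATA=work.q2 (RENAME=(revenue=sales)) FORCE;
RUN;
```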
Combining Data Sets
• Merging Data Sets
• One-to-One Match Merge
• A single record in a data set corresponds to a single record in all
other data sets
• Example: Patient and Billing Information
• One-to-Many Match Merge
• Matching one observation from one data set to multiple
observations in other data sets
• Example: County and State Information
• Note: Each data set must be sorted by the BY variables before merging (PROC SORT)
Combining Data Sets
• One-to-One Match Merge
• Usually need at least one common variable between data sets –
matching purposes
• For the example, a patient ID would be needed
• Do not need common variable if all data sets are in exactly the same
order
• General Form: DATA output_data_set;
MERGE input_data_set1 input_data_set2;
By variable1 variable2;
RUN;
Combining Data Sets
• One-to-One Match Merge
• Example:
Performance           Goals
Month   Sales         Month   Goal
1       8223          1       9000
2       6034          2       6000
3       4220          3       5000
Code:
DATA work.compare;
MERGE work.performance work.goals;
BY month;
difference=sales-goal;
RUN;
Combining Data Sets
• One-to-One Match Merge
• Example cont.:
Compare
Month   Sales   Goal   Difference
1       8223    9000   -777
2       6034    6000   34
3       4220    5000   -780
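To run this example end to end, the two input tables can be typed in with DATALINES. This is a sketch that reuses the values from the slide; both tables are already in month order, so no PROC SORT is needed here.

```sas
DATA work.performance;
   INPUT month sales;
   DATALINES;
1 8223
2 6034
3 4220
;
RUN;

DATA work.goals;
   INPUT month goal;
   DATALINES;
1 9000
2 6000
3 5000
;
RUN;

DATA work.compare;
   MERGE work.performance work.goals;
   BY month;                     /* match rows on month */
   difference = sales - goal;
RUN;

PROC PRINT DATA=work.compare NOOBS;
RUN;
```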
Combining Data Sets
• One-to-Many Match Merge
• Requires at least one common variable in the data sets for matching
purposes
• For the example, State information is in both the state and county
files
• If two data sets have variables with the same name, the variables in
the second data set will overwrite the variable in the first.
• General Form: DATA output_data_set;
MERGE Data1 Data2 Data3;
BY Variable1 Variable2;
RUN;
Combining Data Sets
• One-to-Many Match Merge
• Example:
Videos                 Adjustment
Category   Sales       Category   Adjustment
Aerobics   12.99       Aerobics   .20
Aerobics   13.99       Step       .30
Aerobics   13.99       Weights    .25
Step       12.99
Step       12.99
Weights    15.99
Code:
DATA work.prices;
MERGE work.videos work.adjustment;
BY category;
NewPrice=(1-adjustment)*sales;
RUN;
Combining Data Sets
• One-to-Many Match Merge
• Example cont.:
Prices
Category   Sales   Adjustment   NewPrice
Aerobics   12.99   .20          10.39
Aerobics   13.99   .20          11.19
Aerobics   13.99   .20          11.19
Step       12.99   .30          9.09
Step       12.99   .30          9.09
Weights    15.99   .25          11.99
Combining Data Sets
• Assignment
Create the dataset work.state_data
Merge work.state_sat_data2 with work.state_region_data by the
state variable
Combining Data Sets
• Solution
proc sort data=state_sat_data2;
by state;
run;
proc sort data=state_region_data;
by state;
run;
data state_data;
merge state_sat_data2 state_region_data;
by state;
run;
*****Check: Has the state_data been created correctly?*****
Saving Data Sets
• Save as permanent SAS data set (.sas7bdat)
DATA libref.filename;
SET current_name;
<optional statements>;
RUN;
• Save as other formats
1. PROC EXPORT
PROC EXPORT DATA=current_name
OUTFILE="C:\Users\student\Desktop\SAS" DBMS=xlsx;
RUN;
2. Export Wizard
1) File → Export Data
2) Specify SAS data set
3) Standard data source → select the file format
4) Specify File Folder and Filename
Saving Data Sets
• Assignment
Save the dataset state_data on your desktop as .csv or .xlsx file
Part II: Data Manipulation
Questions/Comments
Part III: Summary Procedures
Print Procedure
Plot Procedure
Univariate Procedure
Means Procedure
Freq Procedure
Print Procedure
• PROC PRINT is used to print data to the output window
• By default, prints all observations and variables in the SAS data set
• General Form:
PROC PRINT DATA=input_data_set <options>;
<optional SAS statements>;
RUN;
• Some Options
• input_data_set (obs=n) - Specifies the number of observations to be printed in the output
• NOOBS - Suppresses printing of the observation number
• LABEL - Prints variable labels instead of variable names
Print Procedure
• Optional SAS statements
• BY variable1 variable2 variable3;
• Starts a new section of output for every new value of the BY
variables
• ID variable1 variable2 variable3;
• Prints ID variables on the left hand side of the page and
suppresses the printing of the observation numbers
• SUM variable1 variable2 variable3;
• Prints sum of listed variables at the bottom of the output
• VAR variable1 variable2 variable3;
• Prints only listed variables in the output
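A sketch combining these statements on the sashelp.class sample table (not the course data). The sorted copy work.class_by_sex is a hypothetical name.

```sas
PROC SORT DATA=sashelp.class OUT=work.class_by_sex;
   BY sex;                   /* BY-group processing needs sorted input */
RUN;

PROC PRINT DATA=work.class_by_sex (OBS=10);
   BY sex;                   /* new section of output for each value of sex */
   ID name;                  /* name replaces the observation number */
   VAR age height;           /* print only these variables */
   SUM height;               /* totals of height */
RUN;
```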
Print Procedure
• Assignment
Use PROC PRINT to print out the state variable separately for
each region
Note: All procedures in this summary statistics section of the course will be run on the data set work.state_data.
(If for some reason your SAS session shuts down or restarts, simply import the state_data file we just exported.)
Print Procedure
• Solution
proc sort data=state_data;
by region;
run;
proc print data=state_data;
var state;
by region;
run;
Plot Procedure
• Used to create basic scatter plots of the data
• Use PROC GPLOT (with a SYMBOL statement) or PROC SGPLOT for more sophisticated plots; use PROC GCHART for bar charts and pie charts
• General Form: PROC PLOT DATA=input_data_set;
PLOT vertical_variable * horizontal_variable/<options>;
RUN;
• By default, SAS uses letters to mark points on plots
• A for a single observation, B for two observations at the same point, etc.
• To specify a different character to represent a point
• PLOT vertical_variable * horizontal_variable = '*';
Plot Procedure
• To specify a third variable to use to mark points (to see how the relationship between Y and X differs at different levels of a third variable)
• PLOT vertical_variable * horizontal_variable = third_variable;
• To plot more than one variable on the vertical axis
• PLOT vertical_variable1 * horizontal_variable='2'
vertical_variable2 * horizontal_variable='1'/OVERLAY;
Plot Procedure
• Assignment
Use the PLOT PROCEDURE to plot SAT Verbal scores versus
SAT Math Scores
Use the value of the region variable to mark points (the third
variable)
Plot Procedure
• Solution
proc plot data=state_data;
plot math*verbal=region;
run;
* To add a regression line for the relationship between math score and verbal score, use PROC GPLOT (the code can be found in the code file).
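The course code file uses PROC GPLOT for this. As an alternative sketch, and not the author's code, PROC SGPLOT can overlay a least-squares line with its REG statement. The example uses the sashelp.class sample table rather than state_data.

```sas
PROC SGPLOT DATA=sashelp.class;
   REG X=height Y=weight;    /* scatter plot with a fitted regression line */
RUN;
```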
Univariate Procedure
• PROC UNIVARIATE is used to examine the distribution of data
• Produces distribution and summary statistics for a single variable
• Includes mean, median, mode, standard deviation, skewness, kurtosis, quantiles, etc.
• Used for detecting missing values and extreme observations
• General Form: PROC UNIVARIATE DATA=input_data_set <options>;
VAR variable1 variable2 variable3;
<optional SAS statements>;
RUN;
• If the VAR statement is not used, summary statistics will be produced for all numeric variables in the input data set.
Univariate Procedure
• Options include:
• PLOT – produces a stem-and-leaf plot or horizontal bar chart, a box plot, and a normal probability plot
• NORMAL (alias NORMALTEST) – produces tests of normality
• Statements include:
• HISTOGRAM
HISTOGRAM var1 var2 / NORMAL MIDPOINTS= CTEXT=;
• ID – identifies observations in the extreme observations table
• QQPLOT – produces a quantile-quantile plot for checking whether a variable follows a specified distribution
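A sketch of these options and statements together, using the sashelp.class sample table instead of the course data.

```sas
PROC UNIVARIATE DATA=sashelp.class NORMAL;
   VAR height;
   ID name;                                   /* label extreme observations by name */
   HISTOGRAM height / NORMAL;                 /* histogram with a fitted normal curve */
   QQPLOT height / NORMAL(MU=EST SIGMA=EST);  /* reference line from estimated mean and SD */
RUN;
```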
Univariate Procedure
• Assignment
Use PROC UNIVARIATE to produce a normal probability plot and test
the normality of the SAT Total variable and Expenditure variable.
Univariate Procedure
• Solution
proc univariate data=state_data normal plot;
var expend total;
run;
Means Procedure
•
Similar to the Univariate procedure, produces summary statistics
•
General Form:
PROC MEANS DATA=input_data_set <options>;
<Optional SAS statements>;
RUN;
•
With no options or optional SAS statements, the Means procedure prints the number of non-missing values, mean, standard deviation, minimum, and maximum for all numeric variables in the input data set
Means Procedure
• Options
• Statistics Available
Keyword           Statistic
CLM               Two-Sided Confidence Limits
RANGE             Range
CSS               Corrected Sum of Squares
SKEWNESS          Skewness
CV                Coefficient of Variation
STDDEV            Standard Deviation
KURTOSIS          Kurtosis
STDERR            Standard Error of Mean
LCLM              Lower Confidence Limit
SUM               Sum
MAX               Maximum Value
SUMWGT            Sum of Weight Variables
MEAN              Mean
UCLM              Upper Confidence Limit
MIN               Minimum Value
USS               Uncorrected Sum of Squares
N                 Number Non-missing Values
VAR               Variance
NMISS             Number Missing Values
PROBT             Probability for Student’s t
MEDIAN (or P50)   Median
T                 Student’s t
Q1 (P25)          25% Quantile
Q3 (P75)          75% Quantile
P1                1% Quantile
P5                5% Quantile
P10               10% Quantile
P90               90% Quantile
P95               95% Quantile
P99               99% Quantile
Note: The default confidence level is 95% (ALPHA=0.05). Use the ALPHA= option to specify a different level.
Means Procedure
•
Optional SAS Statements
• VAR Variable1 Variable2;
• Specifies the numeric variables for which statistics will be produced
• BY Variable1 Variable2;
• Calculates statistics for each combination of values of the BY variables
• OUTPUT OUT=output_data_set;
• Creates a data set with the default statistics
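A sketch that requests specific statistics and saves some of them to a data set. It uses the sashelp.class sample table; work.class_by_sex, work.class_stats, and the output variable names are hypothetical.

```sas
PROC SORT DATA=sashelp.class OUT=work.class_by_sex;
   BY sex;
RUN;

PROC MEANS DATA=work.class_by_sex N MEAN VAR CLM ALPHA=0.10;  /* 90% confidence limits */
   VAR height weight;
   BY sex;
   OUTPUT OUT=work.class_stats MEAN=mean_height mean_weight;  /* one row per BY group */
RUN;
```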
Means Procedure
• Assignment
Use PROC MEANS to calculate the mean and variance of the
expend variable for each region
Means Procedure
• Solution
proc sort data=state_data;
by region;
run;
proc means data=state_data mean var;
var expend;
by region;
run;
FREQ Procedure
• PROC FREQ is used to generate frequency tables
• Most common usage is to create a table showing the distribution of categorical variables
• General Form:
PROC FREQ DATA=input_data_set;
TABLE variable1*variable2*variable3/<options>;
RUN;
• Options
• LIST – prints cross tabulations in list format rather than grid
• MISSING – specifies that missing values should be included in the tabulations
• OUT=output_data_set – creates a data set containing frequencies, list format
• NOPRINT – suppresses printing in the output window
• Use a BY statement to get percentages within each category of a variable
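A brief sketch of a two-way table with these options, using the sashelp.class sample table. The output data set name is hypothetical.

```sas
PROC FREQ DATA=sashelp.class;
   TABLE sex*age / LIST MISSING OUT=work.sex_age_counts;
RUN;
```

The OUT= data set holds one row per combination of sex and age, with COUNT and PERCENT columns.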
FREQ Procedure
• Assignment
Use PROC FREQ to find the number of states within each region
FREQ Procedure
• Solution
proc freq data=state_data;
table region;
run;
Part III: Summary Procedures
Questions/Comments
Part IV: Basic Statistical Analysis
Procedures
A. Linear Regression and ANOVA
1. Correlation – PROC CORR
2. Regression – PROC REG
3. Analysis of Variance – PROC ANOVA
B. Categorical Data and Generalized Linear Model
1. Chi-square Test of Association – PROC FREQ
2. Generalized Linear Models – PROC GENMOD
CORR Procedure
•
PROC CORR is used to calculate the correlations between variables
•
Correlation coefficient measures the linear relationship between two
variables
•
Values Range from -1 to 1
• Negative correlation - as one variable increases the other decreases
• Positive correlation – as one variable increases the other increases
• 0 – no linear relationship between the two variables
• 1 – perfect positive linear relationship
• -1 – perfect negative linear relationship
•
General Form:
PROC CORR DATA=input_data_set <options>;
VAR Variable1 Variable2;
WITH Variable3;
RUN;
CORR Procedure
PROC CORR (cont.)
•
If the VAR and WITH statements are not used, correlation is computed
for all pairs of numeric variables
•
Options include
• SPEARMAN – computes Spearman’s rank correlations
• KENDALL – computes Kendall’s Tau coefficients
• PLOTS=MATRIX – produces a scatter plot matrix (requires ODS Graphics)
CORR Procedure
• Assignment
We will use a new data set: petrol2.sas7bdat
First import this data set into your Work library as “petrol”.
Then use PROC CORR to find the correlations between all the variables.
(Optional: create the correlation plots for the variables.)
CORR Procedure
• Solution
proc corr data=petrol;
run;
OR
ods graphics on;
title 'Petrol Consume Data';
proc corr data=petrol cov plots=matrix;
var petroltax income highway_mile driver_pr consumpetrol;
with petroltax income highway driver consumpetrol;
run;
ods graphics off;
REG Procedure
•
PROC REG is used to fit linear regression models by least squares estimation
•
One of many SAS procedures that can perform regression analysis (others include PROC GLM and PROC MIXED)
•
Handles only numeric independent variables (there is no CLASS statement)
•
General Syntax:
PROC REG DATA=input_data_set <options>;
MODEL dependent=independent1 independent2/<options>;
<optional statements>;
RUN;
•
PROC REG statement options include
• PCOMIT=m - performs incomplete principal component estimation, omitting the last m principal components
• CORR – displays the correlation matrix for the variables in the model
REG Procedure
•
MODEL statement options include
• SELECTION=
• Specifies a model selection procedure be conducted –
FORWARD, BACKWARD, and STEPWISE
• ADJRSQ – computes the adjusted R-square
• MSE – computes the mean square error
• VIF – computes variance inflation factors, used to detect multicollinearity
• CLB – computes confidence limits for parameter estimates
• ALPHA=
• Sets the significance level for confidence and prediction intervals and tests
REG Procedure
•
Optional statements include
• PLOT Dependent*Independent – generates plot of data
REG Procedure
• Assignment
Use PROC REG to generate a multiple linear regression model
Dependent Variable – consumpetrol:
1) Use all the other variables as independent variables
2) Use stepwise selection → SELECTION=STEPWISE
REG Procedure
• Solution
1)
proc reg data=petrol corr;
model consumpetrol = petroltax income highway driver /vif;
run;
2)
proc reg data=petrol;
model consumpetrol = petroltax income highway driver/
selection=Stepwise slentry=0.5 slstay=0.1;
quit;
ANOVA Procedure
•
PROC ANOVA performs analysis of variance
•
Designed for balanced data (use PROC GLM for unbalanced data)
•
Can handle nested and crossed effects and repeated measures
•
General Form:
PROC ANOVA DATA=input_data_set <options>;
CLASS independent1 independent2;
MODEL dependent=independent1 independent2;
<optional statements>;
Run;
•
Class statement must come before model statement, used to define
classification variables
ANOVA Procedure
•
Useful PROC ANOVA statement option – OUTSTAT=output_data_set
• Generates output data set that contains sums of squares,
degrees of freedom, statistics, and p-values for each effect in the
model
•
Useful optional statement – MEANS independent1/<comparison type>
• Used to perform multiple comparisons analysis
• Set <comparison type> to:
• TUKEY – Tukey’s studentized range test
• BON – Bonferroni t test
• T – pairwise t tests
• Duncan – Duncan’s multiple-range test
• Scheffe – Scheffe’s multiple comparison procedure
ANOVA Procedure
•
Question: In state_data,
1) Are there significant differences between the Math SAT
scores of students from different regions?
2) If there are significant differences, which regions are different?
• Assignment
Use PROC ANOVA to determine if there are significant
differences in the Math SAT variable between regions
Perform multiple comparisons between regions using
Tukey’s Adjustment
ANOVA Procedure
•
Solution
proc anova data=state_data;
class region;
model math=region;
means region/tukey;
run;
Assumptions for Linear Regression
Assumptions of linear regression:
(1) IID (“random”) samples
(2) Equal variances → if violated, use an unequal variance test (Satterthwaite)
(3) Normally distributed → if violated, try a transformation or a nonparametric test
Nonparametric methods relax underlying assumptions about how the data are generated, which helps when the distribution is unknown or the parametric assumptions are not satisfied.
The nonparametric equivalent of the two-sample t-test is the Wilcoxon rank sum test (Wilcoxon-Mann-Whitney test): PROC NPAR1WAY
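A sketch of both options for a two-group comparison, using the sashelp.class sample table. PROC TTEST is not covered elsewhere in this course; it is shown here on the assumption that it is where the Satterthwaite result would come from.

```sas
/* Two-sample t-test: the output reports the pooled (equal variance)
   and the Satterthwaite (unequal variance) results. */
PROC TTEST DATA=sashelp.class;
   CLASS sex;
   VAR height;
RUN;

/* Nonparametric alternative: Wilcoxon rank sum test. */
PROC NPAR1WAY DATA=sashelp.class WILCOXON;
   CLASS sex;
   VAR height;
RUN;
```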
Part IV: Basic Statistical Analysis Procedures
Part A: Linear regression and ANOVA
Questions/Comments?
FREQ Procedure
• PROC FREQ can also be used to perform analysis with categorical data
• General Form:
PROC FREQ DATA=input_data_set;
TABLE variable1*variable2/<options>;
RUN;
• TABLE statement options include:
• AGREE – Tests and measures of classification agreement including McNemar’s test, Bowker’s test, Cochran’s Q test, and Kappa statistics
• CHISQ – Chi-square test of homogeneity and measures of association
• MEASURES – Measures of association including Pearson and Spearman correlation, gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ D, lambda, odds ratios, risk ratios, and confidence intervals
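A sketch of a chi-square test of association on a two-way table. It uses sashelp.cars, a sample table shipped with SAS, rather than the course data.

```sas
PROC FREQ DATA=sashelp.cars;
   TABLE origin*drivetrain / CHISQ MEASURES;  /* the * requests a two-way table */
RUN;
```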
GENMOD Procedure
•
PROC GENMOD is used to fit generalized linear models, in which the response
is not necessarily a continuous variable
•
Logistic and Poisson regression are examples of generalized linear
models
•
General Form:
PROC GENMOD DATA=input_data_set;
CLASS independent1;
MODEL dependent = independent1 independent2/
dist= <option>
link=<option>;
run;
GENMOD Procedure
• DIST= - specifies the distribution of the response variable
• LINK= - specifies the link function from the linear predictor to the mean of the response
• Example – Logistic Regression
• DIST = binomial
• LINK = logit
• Example – Poisson Regression
• DIST = poisson
• LINK = log
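A minimal sketch of the Poisson case. The data set work.counts and its values are made up purely to give the procedure something to run on; they are not course data.

```sas
DATA work.counts;          /* made-up illustration data */
   INPUT dose events;
   DATALINES;
1 2
2 3
3 6
4 7
5 12
;
RUN;

PROC GENMOD DATA=work.counts;
   MODEL events = dose / DIST=poisson LINK=log;
RUN;
```

For logistic regression the same pattern applies with DIST=binomial and LINK=logit.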
GENMOD Procedure
• Assignment
Use PROC GENMOD to perform Logistic Regression on the
apple_juice data set
• Dependent variable – CRA7152
• Independent variables
• pH (3.5-5.5)
• Brix (i.e., sugar content of an aqueous solution, 11-19)
• Temperature (25-50 °C)
• Nisin concentration (0-70) (variable name Nisin)
GENMOD Procedure
• Solution
proc genmod data=apple_juice descending;
model CRA7152=PH Brix Temperature Nisin/dist=bin link=logit;
run;
Reference Material

The Little SAS Book – Delwiche and Slaughter

SAS Programming I: Essentials
SAS Programming II: Data Manipulation
Techniques (by SAS)


SAS help file; support.sas.com
Presentation and Data

http://www.lisa.stat.vt.edu

http://www.ats.ucla.edu/stat/sas
Questions/Comments