SAS Short Course Presentation 2011 Part 2

Presentation and Data

http://www.lisa.stat.vt.edu

Short Courses

Intro to SAS

Download Data to Desktop
Introduction to SAS Part 2
Mark Seiss, Dept. of Statistics
February 22, 2011
Reference Material

The Little SAS Book – Delwiche and Slaughter

SAS Programming I: Essentials


SAS Programming II: Manipulating Data with the DATA Step
Presentation Outline
Part 1
1. Introduction to the SAS Environment
2. Working With SAS Data Sets
Part 2
1. Summary Procedures
2. Basic Statistical Analysis Procedures
Presentation Outline
Questions/Comments
Individual Goals/Interests
Summary Procedures
1. Print Procedure
2. Plot Procedure
3. Univariate Procedure
4. Means Procedure
5. Freq Procedure
Print Procedure
• PROC PRINT is used to print data to the output window
• By default, prints all observations and variables in the SAS data set
• General Form:
PROC PRINT DATA=input_data_set <options>;
<optional SAS statements>;
RUN;
• Some Options
• input_data_set (OBS=n) - Limits the output to the first n observations
• NOOBS - Suppresses printing of the observation number
• LABEL - Prints variable labels instead of variable names
Print Procedure
• Optional SAS statements
• BY variable1 variable2 variable3;
• Starts a new section of output for every new value of the BY variables; the data should first be sorted by those variables (e.g. with PROC SORT)
• ID variable1 variable2 variable3;
• Prints ID variables on the left hand side of the page and
suppresses the printing of the observation numbers
• SUM variable1 variable2 variable3;
• Prints sum of listed variables at the bottom of the output
• VAR variable1 variable2 variable3;
• Prints only listed variables in the output
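The options and statements above can be combined in one step. The following is a minimal sketch, assuming the course data set work.state_data with the variables state, region, and expend; the OBS= value of 10 is an arbitrary choice for illustration.

```sas
/* BY-group processing expects the data sorted by the BY variable. */
proc sort data=state_data;
  by region;
run;

/* Print the first 10 observations without observation numbers.   */
/* LABEL only changes the output if the variables have labels.    */
proc print data=state_data (obs=10) noobs label;
  by region;          /* new output section for each region       */
  var state expend;   /* print only these variables               */
  sum expend;         /* totals of expend, per region and overall */
run;
```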
Print Procedure
• Assignment
Use PROC PRINT to print out the state variable separately for
each region
Note: All procedures for the remainder of the course will be run on the
data set work.state_data.
Print Procedure
• Solution
proc sort data=state_data;
by region;
run;
proc print data=state_data;
var state;
by region;
run;
Plot Procedure
• Used to create basic scatter plots of the data
• Use PROC GPLOT or PROC SGPLOT for more sophisticated plots
• General Form:
PROC PLOT DATA=input_data_set;
PLOT vertical_variable * horizontal_variable / <options>;
RUN;
• By default, SAS uses letters to mark points on plots
• A for a single observation, B for two observations at the same point, etc.
• To specify a different character to represent a point
• PLOT vertical_variable * horizontal_variable = '*';
Plot Procedure
• To specify a third variable to use to mark points
• PLOT vertical_variable * horizontal_variable = third_variable;
• To plot more than one variable on the vertical axis
• PLOT vertical_variable1 * horizontal_variable = '2'
vertical_variable2 * horizontal_variable = '1' / OVERLAY;
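As an illustration of the OVERLAY form, the sketch below plots both SAT section scores against expenditure on one set of axes. It assumes the variables math, verbal, and expend in work.state_data; the plotting characters 'M' and 'V' are arbitrary choices.

```sas
/* Two plot requests drawn on the same axes.     */
/* 'M' marks math scores, 'V' marks verbal.      */
proc plot data=state_data;
  plot math*expend='M' verbal*expend='V' / overlay;
run;
```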
Plot Procedure
• Assignment
Use the PLOT PROCEDURE to plot SAT Verbal scores versus
SAT Math Scores
Use the value of the region variable to mark points
Plot Procedure
• Solution
proc plot data=state_data;
plot math*verbal=region;
run;
Univariate Procedure
• PROC UNIVARIATE is used to examine the distribution of data
• Produces summary statistics for a single variable
• Includes mean, median, mode, standard deviation, skewness, kurtosis, quantiles, etc.
• General Form:
PROC UNIVARIATE DATA=input_data_set <options>;
VAR variable1 variable2 variable3;
RUN;
• If the VAR statement is not used, summary statistics will be produced for all numeric variables in the input data set.
Univariate Procedure
• Options include:
• PLOT – produces Stem-and-leaf plot, Box plot, and Normal probability plot
• NORMAL – produces tests of Normality
Univariate Procedure
• Assignment
Use PROC UNIVARIATE to produce a normal probability plot and test
the normality of the SAT Total variable and Expenditure variable
Univariate Procedure
• Solution
proc univariate data=state_data normal plot;
var expend total;
run;
Means Procedure
• Similar to the Univariate procedure: produces summary statistics for numeric variables
• General Form:
PROC MEANS DATA=input_data_set <options>;
<optional SAS statements>;
RUN;
• With no options or optional SAS statements, the Means procedure will print out the number of non-missing values, mean, standard deviation, minimum, and maximum for all numeric variables in the input data set
Means Procedure
• Options
• Statistics Available
Keyword          Statistic                     Keyword       Statistic
CLM              Two-Sided Confidence Limits   RANGE         Range
CSS              Corrected Sum of Squares      SKEWNESS      Skewness
CV               Coefficient of Variation      STDDEV        Standard Deviation
KURTOSIS         Kurtosis                      STDERR        Standard Error of Mean
LCLM             Lower Confidence Limit        SUM           Sum
MAX              Maximum Value                 SUMWGT        Sum of Weight Variables
MEAN             Mean                          UCLM          Upper Confidence Limit
MIN              Minimum Value                 USS           Uncorrected Sum of Squares
N                Number Non-missing Values     VAR           Variance
NMISS            Number Missing Values         PROBT         Probability for Student’s t
MEDIAN (or P50)  Median                        T             Student’s t
Q1 (P25)         25% Quantile                  Q3 (P75)      75% Quantile
P1               1% Quantile                   P5            5% Quantile
P10              10% Quantile                  P90           90% Quantile
P95              95% Quantile                  P99           99% Quantile
Note: The default confidence level for confidence limits is 95% (ALPHA=0.05). Use the ALPHA= option to specify a different alpha level.
Means Procedure
• Optional SAS Statements
• VAR Variable1 Variable2;
• Specifies the numeric variables for which statistics will be produced
• BY Variable1 Variable2;
• Calculates statistics for each combination of the BY variables (the data should first be sorted by those variables)
• OUTPUT OUT=output_data_set;
• Creates a data set with the default statistics
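The statistic keywords, the ALPHA= option, and the OUTPUT statement can be used together. The sketch below is one possible combination, assuming the variables expend and region in work.state_data; the output data set name expend_stats and the variable names mean_expend, lower_expend, and upper_expend are hypothetical.

```sas
/* BY-group processing expects the data sorted by region. */
proc sort data=state_data;
  by region;
run;

/* ALPHA=0.10 requests 90% confidence limits for the mean. */
proc means data=state_data n mean clm alpha=0.10;
  var expend;
  by region;
  /* Save the mean and its confidence limits for each region. */
  output out=expend_stats mean=mean_expend
         lclm=lower_expend uclm=upper_expend;
run;

/* Inspect the saved statistics (one row per region). */
proc print data=expend_stats noobs;
run;
```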
Means Procedure
• Assignment
Use PROC MEANS to calculate the mean and variance of the
expenditure variable for each region
Means Procedure
• Solution
proc sort data=state_data;
by region;
run;
proc means data=state_data mean var;
var expend;
by region;
run;
FREQ Procedure
• PROC FREQ is used to generate frequency tables
• Most common usage is to create a table showing the distribution of categorical variables
• General Form:
PROC FREQ DATA=input_data_set;
TABLE variable1*variable2*variable3/<options>;
RUN;
• Options
• LIST – prints cross tabulations in list format rather than grid
• MISSING – specifies that missing values should be included in the tabulations
• OUT=output_data_set – creates a data set containing the frequencies in list format
• NOPRINT – suppresses printing in the output window
• Use a BY statement to get percentages within each category of a variable
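A short sketch of these TABLE options follows. It assumes work.state_data contains region and the indicator variable upper_ind (used in the GENMOD assignment later in the course); the output data set name region_counts is hypothetical.

```sas
/* Cross-tabulate region by upper_ind in list format,   */
/* count missing values as their own category, and save */
/* the frequencies to a data set.                        */
proc freq data=state_data;
  table region*upper_ind / list missing out=region_counts;
run;

/* The saved data set holds one row per combination, */
/* with COUNT and PERCENT columns.                    */
proc print data=region_counts noobs;
run;
```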
FREQ Procedure
• Assignment
Use PROC FREQ to find the number of states within each region
FREQ Procedure
• Solution
proc freq data=state_data;
table region;
run;
Summary Procedures
Questions/Comments
Statistical Analysis Procedures
1. Correlation – PROC CORR
2. Regression – PROC REG
3. Analysis of Variance – PROC ANOVA
4. Chi-square Test of Association – PROC FREQ
5. Generalized Linear Models – PROC GENMOD
CORR Procedure
• PROC CORR is used to calculate the correlations between variables
• Correlation coefficient measures the strength of the linear relationship between two variables
• Values Range from -1 to 1
• Negative correlation – as one variable increases the other decreases
• Positive correlation – as one variable increases the other increases
• 0 – no linear relationship between the two variables
• 1 – perfect positive linear relationship
• -1 – perfect negative linear relationship
• General Form:
PROC CORR DATA=input_data_set <options>;
VAR Variable1 Variable2;
WITH Variable3;
RUN;
CORR Procedure
• If the VAR and WITH statements are not used, correlation is computed for all pairs of numeric variables
• Options include
• SPEARMAN – computes Spearman’s rank correlations
• KENDALL – computes Kendall’s Tau coefficients
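To illustrate how VAR and WITH interact, the sketch below correlates one variable against two others rather than computing every pair. It assumes the variables total, expend, and salary in work.state_data.

```sas
/* Request all three coefficient types in one run.          */
/* WITH variables form the rows and VAR variables form the  */
/* columns, so only total-vs-expend and total-vs-salary are */
/* computed.                                                 */
proc corr data=state_data pearson spearman kendall;
  var total;
  with expend salary;
run;
```

Each coefficient is printed with a p-value for the test that the correlation is zero.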
CORR Procedure
• Question: What is the correlation between the SAT Total variable and Expenditure variable? Is it significant?
Based on previous exercises, which correlation coefficient should we use?
• Assignment: Use PROC CORR to find the correlation between the SAT Total variable and Expenditure variable
CORR Procedure
• Solution
If the normality assumption is valid
proc corr data=state_data;
var total expend;
run;
If the normality assumption is not valid
proc corr data=state_data spearman;
var total expend;
run;
REG Procedure
• PROC REG is used to fit linear regression models by least squares estimation
• One of many SAS procedures that can perform regression analysis
• Only continuous independent variables: PROC REG has no CLASS statement (use GENMOD for categorical variables)
• General Form:
PROC REG DATA=input_data_set <options>;
MODEL dependent=independent1 independent2/<options>;
<optional statements>;
RUN;
• PROC REG statement options include
• PCOMIT=m – performs incomplete principal component analysis, computing estimates with the last m principal components omitted
• CORR – displays the correlation matrix for all variables listed in the MODEL or VAR statement
REG Procedure
• MODEL statement options include
• SELECTION=
• Specifies the model selection method to use; choices include FORWARD, BACKWARD, and STEPWISE
• ADJRSQ – Computes the Adjusted R-Square
• MSE – Computes the Mean Square Error
• COLLIN – Performs collinearity analysis
• CLB – Computes confidence limits for parameter estimates
• ALPHA=
• Sets the significance level for confidence and prediction intervals and tests
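Several of these MODEL statement options can be requested at once. The following is a minimal sketch, assuming the variables total, expend, and salary in work.state_data; the choice of predictors and of ALPHA=0.10 is arbitrary and for illustration only.

```sas
/* CLB with ALPHA=0.10 gives 90% confidence limits for the */
/* parameter estimates; COLLIN adds collinearity           */
/* diagnostics for the predictors.                          */
proc reg data=state_data;
  model total = expend salary / clb alpha=0.10 collin;
run;
quit;  /* PROC REG is interactive, so QUIT ends the procedure */
```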
REG Procedure
• Optional statements include
• PLOT Dependent*Independent1 – generates plot of data
REG Procedure
• Assignment
Use PROC REG to generate a multiple linear regression model
Dependent Variable – SAT Total (total)
Use Stepwise Selection
Possible Independent Variables
– Average pupil to teacher ratio (PT_ratio)
– Current expenditure per pupil (expend)
– Estimated annual salary of teachers (salary)
– Percentage of eligible students taking the SAT (students)
REG Procedure
• Solution
proc reg data=state_data;
model total=pt_ratio expend salary students/selection=stepwise;
run;
ANOVA Procedure
• PROC ANOVA performs analysis of variance
• Designed for balanced data (PROC GLM is used for unbalanced data)
• Can handle nested and crossed effects and repeated measures
• General Form:
PROC ANOVA DATA=input_data_set <options>;
CLASS independent1 independent2;
MODEL dependent=independent1 independent2;
<optional statements>;
RUN;
• The CLASS statement must come before the MODEL statement; it is used to define classification variables
ANOVA Procedure
• Useful PROC ANOVA statement option – OUTSTAT=output_data_set
• Generates output data set that contains sums of squares, degrees of freedom, statistics, and p-values for each effect in the model
• Useful optional statement – MEANS independent1/<comparison type>;
• Used to perform multiple comparisons analysis
• Set <comparison type> to:
• TUKEY – Tukey’s studentized range test
• BON – Bonferroni t test
• T – pairwise t tests
• DUNCAN – Duncan’s multiple-range test
• SCHEFFE – Scheffe’s multiple comparison procedure
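The OUTSTAT= option and the MEANS statement can be sketched together as follows, assuming the variables math and region in work.state_data. The data set name anova_stats is hypothetical, and BON is used here only as an example of a comparison type.

```sas
/* Save the ANOVA table entries to a data set and request */
/* Bonferroni-adjusted pairwise comparisons of regions.    */
proc anova data=state_data outstat=anova_stats;
  class region;
  model math = region;
  means region / bon;
run;
quit;  /* PROC ANOVA is interactive, so QUIT ends the procedure */

/* One row per effect (plus error), with DF, SS, F, and PROB. */
proc print data=anova_stats noobs;
run;
```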
ANOVA Procedure
• Question: Are there significant differences between the Math SAT scores of students from different regions?
If there are significant differences, which regions are different?
• Assignment: Use PROC ANOVA to determine if there are significant differences in the Math SAT variable between regions
Perform multiple comparisons between regions using Tukey’s Adjustment
ANOVA Procedure
• Solution
proc anova data=state_data;
class region;
model math=region;
means region/tukey;
run;
FREQ Procedure
• PROC FREQ can also be used to perform analysis with categorical data
• General Form:
PROC FREQ DATA=input_data_set;
TABLE variable1*variable2/<options>;
RUN;
• TABLE statement options include:
• AGREE – Tests and measures of classification agreement, including McNemar’s test, Bowker’s test, Cochran’s Q test, and Kappa statistics
• CHISQ – Chi-square tests of homogeneity or independence and measures of association
• MEASURES – Measures of association, including Pearson and Spearman correlation, gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ D, lambda, odds ratios, risk ratios, and confidence intervals
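A chi-square test of association can be requested as in the sketch below. It assumes work.state_data contains region and the indicator variable upper_ind (the dependent variable in the GENMOD assignment); whether the chi-square approximation is adequate depends on the cell counts in the actual data.

```sas
/* Two-way table of region by upper_ind with chi-square */
/* tests and measures of association.                    */
proc freq data=state_data;
  table region*upper_ind / chisq measures;
run;
```

If many cells have small expected counts, SAS prints a warning that the chi-square test may not be valid.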
GENMOD Procedure
• PROC GENMOD is used to fit generalized linear models, in which the response is not necessarily normal
• Logistic and Poisson regression are examples of generalized linear models
• General Form:
PROC GENMOD DATA=input_data_set;
CLASS independent1;
MODEL dependent = independent1 independent2 / DIST=<option> LINK=<option>;
RUN;
GENMOD Procedure
• DIST= - specifies the distribution of the response variable
• LINK= - specifies the link function from the linear predictor to the mean of the response
• Example – Logistic Regression
• DIST = binomial
• LINK = logit
• Example – Poisson Regression
• DIST = poisson
• LINK = log
GENMOD Procedure
• Question:
How do we model the probability of having a high
total SAT average based on other variables in the
dataset?
Is the dependent variable normal, or does it have a
different distribution?
What link function would you specify?
• Assignment: Use PROC GENMOD to perform Logistic
Regression on the work.state_data data set
• Dependent variable – upper_ind
• Independent variables
– Average pupil to teacher ratio (PT_ratio)
– Current expenditure per pupil (expend)
– Estimated annual salary of teachers (salary)
– Percentage of eligible students taking the SAT (students)
– Region (region)
GENMOD Procedure
• Solution
proc genmod data=state_data descending;
class region;
model upper_ind=pt_ratio expend salary students region/dist=bin link=logit;
run;
Statistical Analysis Procedures
Questions/Comments
Attendee Questions
If time permits