Applied Statistical Models – Appendix Part 
SAS for Windows v6.12 – Introduction & Reference
Gunther Matthys
University Center for Statistics
Katholieke Universiteit Leuven
1 Introduction

1.1 The SAS-environment
Windows
When you start SAS, three windows are opened:
- the PROGRAM EDITOR, where you can type SAS-programs to create and to analyse datasets,
- the LOG-window, where you get information from the system about the execution of commands and programs,
- the OUTPUT-window, where you find the output of analyses, tests and other procedures.
In the upper left-hand corner you find the COMMAND LINE where you can enter SAS-commands that
influence or customise the SAS-session, e.g.
- call modules and user interfaces (ASSIST, INSIGHT, …),
- manage windows (CLEAR, MANAGER, PGM, LOG, OUTPUT, …).
Working with SAS
Basically there are two ways to get SAS doing what you want it to do:
1. With SAS-programs which you type in the PROGRAM EDITOR. A SAS-program consists of
- DATA-steps, to create and edit datasets,
- PROC-steps, to analyse datasets, perform tests and produce output.
Each entity (‘line’) in a SAS-program ends with a semicolon “;”.
The program itself ends with “run;” and you can execute it with the <F8>-button or by clicking the icon
with the little running man on top.
In the PROGRAM EDITOR you can recall previously executed programs with the <F4>-button.
When a SAS-program is saved as a file, its name gets extension “.sas”.
Note: in the following we will indicate SAS-programs which have to be typed in the PROGRAM EDITOR with PGM and put them in this font.
2. By clicking and selecting in menu-oriented modules and interfaces. We will mainly use
- SAS/INSIGHT, for exploratory data analysis,
- SAS/ASSIST, a menu-oriented user interface.
1.2 Data-management in SAS
SAS-datasets are matrices; each column represents a variable and each observation’s features are in a row.
Libraries
Datasets are collected in libraries.
In fact a library-name (libname) is just a shorter name for a Windows File Manager directory in which
datasets are located.
By default SAS provides 2 libraries:
- WORK, the default, temporary library,
- SASUSER, a permanent library.
You can manage libraries and assign your own libnames to Windows-directories by clicking on the library-button on top. For instance, if you want to save SAS-datasets on floppy disk, you assign a libname to the directory “a:\” with the “New Library...” option.
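Alternatively, a libname can be assigned with a LIBNAME statement typed in the PROGRAM EDITOR. The sketch below is a minimal example; the libname “mylib” and the directory are placeholders.
PGM
libname mylib 'a:\';
run;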
Names of datasets are of the form
- “libname.dataset”, where libname refers to the library the dataset is located in,
- “dataset”, when the dataset is in the default library WORK (this is equivalent to “WORK.dataset”).
Names of datasets and libraries are at most 8 characters long.
In Windows File Manager a SAS-dataset is a file with extension “.sd2” in the folder that is assigned to its
library.
Creating SAS-datasets
If necessary you first create the library you want to put your SAS-dataset in. After that there are 2 easy ways
to create datasets that SAS can handle.
1. By importing data from a text-file, an Excel-spreadsheet, etc. Call the Import Wizard with “File >
Import”, make the appropriate selections and click “Next >” after each step.
2. With a SAS-program using DATA-steps.
A DATA-step typically consists of the following statements:
PGM
data libname.datasetname;
input variablenames;
cards;
(data)
;
- The “data”-line creates a dataset with the specified name in the specified library. The libname is optional; if you omit it, the dataset will be put in the WORK library.
- The “input”-line determines the structure of the dataset by enumerating the names of the variables in the dataset. SAS works with 2 kinds of variables: character and numeric variables. Character variables are defined by a “$”-sign after their name.
- “cards” indicates that the data follow in the DATA-step.
- The data must be given one line per observation and in the same order as the variable names.
- You end the DATA-step with a semicolon “;” and execute it with <F8> or by clicking on the button with the running man.
E.g.
PGM
data sasuser.clouds;
input rainfall type $;
cards;
1202.6 unseeded
830.1 unseeded
...
2745.6 seeded
1697.1 seeded
...
;
Note: you can also enter the data one after another, without starting a new line for each observation, by typing “@@” at the end of the input line.
E.g.
PGM
data sasuser.clouds;
input rainfall type $ @@;
cards;
1202.6 unseeded 830.1 unseeded 372.4 unseeded ...
2745.6 seeded 1697.1 seeded 1656.4 seeded ...
;
1.3 SAS/INSIGHT and SAS/ASSIST
SAS/INSIGHT
The module SAS/INSIGHT is a very useful tool to investigate the distribution of data, i.e. for exploratory data
analysis. You can call it by typing “insight” in the command line and pressing <Enter>.
Select the appropriate library and the dataset you want to explore and click “Open”. Now you see the dataset
as a matrix.
Basic techniques to handle the data in INSIGHT involve selecting observations and variables by clicking on
them, double-clicking to view their properties, and sorting the dataset using the data popup menu (with the arrow in the upper left-hand corner) and “Sort...”.
Observations and variables can be deleted by selecting them and choosing
Edit > Delete.
If you just want certain observations to be excluded in calculations and/or not to appear in graphs (without
deleting them) you can select them and click
Edit > Observations > Exclude in Calculations
and/or
Edit > Observations > Hide in Graphs.
You can transform variables with
Edit > Variables
and by selecting the appropriate transformation.
When changes have been made to the dataset (e.g. observations are deleted or a new variable is created)
you can save the modified dataset with
File > Save > Data....
Otherwise the original dataset remains unaltered after quitting INSIGHT.
SAS/ASSIST
SAS/ASSIST is a menu-oriented interface to perform various kinds of tests. This module is called by typing
“assist” in the command line and pressing <Enter>.
Note: in the following, whenever analyses have to be performed with submenus of SAS/ASSIST or SAS/INSIGHT, this will be indicated with the tags ASSIST and INSIGHT respectively.
2 Comparing two data groups

2.1 Methodology
Aim: compare two groups of data
→ do they have the same distribution?

Parametric techniques
Basic assumption: relevant data are normally distributed
→ check for equal means
2 cases are distinguished:

Unpaired data
- check normality of both groups separately
  - normal density on histogram
  - Kolmogorov-Smirnov test
  - normal QQ-plot
  (if not normal: try a transformation of both data groups and check again)
- F-test for equal variances
- standard 2-sample t-test

Paired data (→ correlation; visual check with scatter plot)
- create a difference-variable & check its normality
  - normal density on histogram
  - Kolmogorov-Smirnov test
  - normal QQ-plot
  (if not normal: try the difference of log-transformed data)
- one-sample t-test on the difference variable (= paired t-test)

Nonparametric techniques
Unpaired data
- 2-sample quantile plot with correlation coefficient
- Wilcoxon(-Mann-Whitney) rank sum test
- Terry-Hoeffding-Van der Waerden test

Paired data
- sign test
- Wilcoxon signed rank sum test

2.2 Exploratory Data Analysis
Sample distribution
In order to investigate the distribution of a sample and to check the assumption of normality one can use the
following tools. All of them are available in SAS/INSIGHT.
- histogram (with normal density)
- boxplot
- empirical and normal distribution functions with Kolmogorov-Smirnov test
- normal quantile plot with correlation coefficient
Remark: when working with unpaired data it is necessary to specify a group variable in order to perform the
analyses for each group separately!
INSIGHT
#boxplot, histogram, moments and quantiles :
Analyze > Distribution (Y) (Y : Rainfall)
#best fitting normal density on histogram :
Curves > Parametric Density... > Normal - Sample Estimates/MLE
#empirical and normal distribution functions with Kolmogorov-Smirnov test :
Curves > Test for Distribution... > Normal
#normal quantile plot :
Graphs > QQ Plot... > Normal
#correlation coefficient :
Analyze > Multivariate (Y’s) (Y : Rainfall, N_Rainfal)
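The same kind of normality checks can also be produced with a PROC-step instead of SAS/INSIGHT. The sketch below is one possible approach, using PROC UNIVARIATE on the Clouds example; the data are sorted first so that the analysis is done for each group separately.
PGM
#tests for normality and plots per group
proc sort data=sasuser.clouds;
by type;
proc univariate data=sasuser.clouds normal plot;
var rainfall;
by type;
run;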
Dependence
The dependence and/or similarity between paired datasets can be visualised with
INSIGHT
#paired boxplot
Analyze > Histogram/Bar Chart (Y) (Y : February, March)
#scatter plot with correlation coefficient
Analyze > Scatter Plot (Y X) (Y : March, X : February)
Analyze > Multivariate (Y’s) (Y : February March)
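The correlation coefficient can also be obtained with a PROC-step; a minimal sketch, assuming the paired data are in a dataset Gariep with variables February and March (as in the ASSIST example in Section 2.3; add a libname prefix if needed):
PGM
#correlation between the paired variables
proc corr data=Gariep;
var February March;
run;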
2.3 t-Tools
INSIGHT
#one-sample t-test
Analyze > Distribution (Y)
Tables > Location Tests... > Student’s T Test
#paired t-test
Edit > Variables > Other... > Y-X
(Y : March, X : February, Name : Diff)
Analyze > Distribution (Y) (Y : Diff)
Tables > Location Tests... > Student’s T Test
The Student’s t-test in SAS/INSIGHT gives a Student's t-statistic

    T = (Ȳ − μ₀) / (S / √n).

Assuming that the null hypothesis (population mean = μ₀) is true and the population is normally distributed, the t-statistic has a Student's t-distribution with n−1 degrees of freedom. The P-value is the probability of obtaining a Student’s t-statistic greater in absolute value than the absolute value of the observed statistic T.
ASSIST
#paired t-test
Data Analysis > Anova > T-tests... > Paired comparisons...
(Active data set : Gariep, Paired variables : February March)
Locals > Run
#standard 2-sample t-test
Data Analysis > Anova > T-tests... > Compare two group means...
(Active data set : Clouds, Dependent : Rainfall,
Classification : Type)
Locals > Run
The standard 2-sample t-test in SAS/ASSIST returns the result of a 2-sided F-test for equal variances, and t-statistics with corresponding 2-sided P-values assuming either equal or unequal variances.
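Both t-tests can also be run with PROC-steps. The following sketch assumes the Clouds and Gariep datasets used above (add a libname prefix where necessary); the paired t-test is carried out as a one-sample t-test on the difference variable.
PGM
#standard 2-sample t-test (with F-test for equal variances)
proc ttest data=Clouds;
class Type;
var Rainfall;
run;
#paired t-test via the difference variable
data GariepD;
set Gariep;
Diff=March-February;
proc means data=GariepD n mean std t prt;
var Diff;
run;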
2.4 Non-parametric alternatives
Unpaired data
Some non-parametric alternatives to the t-tools for unpaired data groups are:
- 2-sample quantile plot with correlation coefficient
- Wilcoxon(-Mann-Whitney) rank sum test
- Terry-Hoeffding-Van der Waerden test
In order to make a 2-sample quantile plot with correlation coefficient you first have to create a dataset with
the ordered data; this only works when both data groups contain the same number of observations. After
that you can make the quantile plot and its correlation coefficient in SAS/INSIGHT.
PGM
data Seeded;
set Clouds;
rename Rainfall=SeRain;
if Type="Seeded";
keep Rainfall;
data Unseeded;
set Clouds;
rename Rainfall=UnRain;
if Type="Unseeded";
keep Rainfall;
proc sort data=Seeded;
by SeRain;
proc sort data=Unseeded;
by UnRain;
data Combine;
merge Seeded Unseeded;
run;
INSIGHT
Analyze > Scatter Plot (Y X) (Y : UnRain, X : SeRain)
Analyze > Multivariate (Y’s) (Y : SeRain UnRain)
ASSIST
#Wilcoxon rank sum test
Data Analysis > Anova > Nonparametric Anova...
(Active data set : Clouds, Dependent : Rainfall,
Classification : Type)
> Analysis based upon : Wilcoxon scores
Locals > Run
#Terry-Hoeffding-Van der Waerden test
Data Analysis > Anova > Nonparametric Anova...
(Active data set : Clouds, Dependent : Rainfall,
Classification : Type)
> Analysis based upon : Van der Waerden scores
Locals > Run
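The same rank tests can be requested directly with PROC NPAR1WAY (the procedure that is also used for the one-way layout in Chapter 3):
PGM
#Wilcoxon rank sum test and Terry-Hoeffding-Van der Waerden test
proc npar1way data=Clouds wilcoxon vw;
class Type;
var Rainfall;
run;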
SAS uses normal approximations to calculate the 2-sided P-values of the Wilcoxon rank sum test and the
Terry-Hoeffding-Van der Waerden test. This is done by centring the test statistic with its expected value
under the null hypothesis, and dividing this by its standard deviation under the null hypothesis.
For rather small sample sizes, however, it is more accurate to use tables for the Wilcoxon rank sum test.
Paired data
For paired data one can use the following non-parametric techniques in SAS/INSIGHT:
- sign test
- Wilcoxon signed rank sum test
These tests are performed on the difference variable in its distributional analysis window.
INSIGHT
#sign test
Tables > Location Tests... > Sign Test
The sign statistic is given by

    M = (n(+) − n(−)) / 2,

where n(+) is the number of observations with values greater than μ₀, and n(−) is the number of observations with values less than μ₀.
INSIGHT
#Wilcoxon signed rank sum test
Tables > Location Tests... > Signed Rank Test
The signed rank test assumes that the distribution is symmetric. The signed rank statistic is computed as

    S = Σ R(i+) − n(t)(n(t) + 1)/4,

where n(+) is the number of observations with values greater than μ₀, n(−) is the number of observations with values less than μ₀, n(t) = n(+) + n(−) is the number of Y(i) values not equal to μ₀, R(i+) is the rank of |Y(i) − μ₀| after discarding Y(i) values equal to μ₀, and the sum is calculated over the values of Y(i) for which Y(i) − μ₀ is greater than 0.
Average ranks are used for tied values.
The P-value is the probability of obtaining (by chance alone) a signed rank statistic greater in absolute value than the absolute value of the observed statistic S.
3 One-way layout ANOVA

3.1 Testing the hypothesis of equal means
You get a visual impression of the data for the different groups with
INSIGHT
#side-by-side boxplots
Analyze > Box Plot/Mosaic Plot (Y) (Y : Time, X : Type)
It is useful to check normality of the data in the different groups. This can be done as before with
SAS/INSIGHT, specifying the appropriate group variable. When the assumption of normality holds (possibly
for a transformation of the data) an ANOVA table with F-test can be used to test the hypothesis that all group
means are equal. Otherwise non-parametric tests such as the Kruskal-Wallis test and the test of Van der
Waerden can be used.
ASSIST
#one-way ANOVA table with F-test
Data Analysis > Anova > Analysis of variance...
(Active data set : Bearings, Dependent : Time,
Classification : Type)
Locals > Run
#Kruskal-Wallis test and test of Van der Waerden
Data Analysis > Anova > Nonparametric Anova...
(Active data set : Bearings, Dependent : Time,
Classification : Type)
> Analysis based upon :
Wilcoxon scores
Van der Waerden scores
Locals > Run
In SAS a χ²(k−1)-distribution is used instead of an F(k−1, N−k)-distribution to calculate the 2-sided P-value of the test of Van der Waerden. This is a good approximation when N−k is big enough (i.e. when k is small w.r.t. N), as then (k−1)·F(k−1, N−k) is approximately χ²(k−1)-distributed.
PGM
#one-way ANOVA table with F-test
proc anova data=Bearings;
class Type;
model Time=Type;
run;
#Kruskal-Wallis test and test of Van der Waerden
proc npar1way data=Bearings wilcoxon vw;
class Type;
var Time;
run;
3.2 Multiple comparison procedures
ASSIST
#confidence intervals for pairwise differences
Data Analysis > Anova > Analysis of variance...
(Active data set : Bearings, Dependent : Time,
Classification : Type)
> Additional options > Output statistics...
> Means... : Type
> Options for means...
> Comparison Tests :
Fisher’s least-significant-difference test
Tukey’s HSD test
Bonferroni t test
Scheffe’s multiple-comparison procedure
> Specify comparison options... :
Level of significance for comparison of means : 0.05
Confidence intervals for differences between means
Locals > Run
#testing contrasts
Data Analysis > Anova > Analysis of variance...
(Active data set : Bearings, Dependent : Time,
Classification : Type)
> Additional options > Model hypotheses... > Contrast...
> Select effect : Type
> Specify number of contrasts : 1
> Specify contrast label : 1_2
> Supply contrast values : 1 -1 0 0 0
Locals > Run
PGM
#confidence intervals for pairwise differences
proc anova data=Bearings;
class Type;
model Time=Type;
means Type / t bon scheffe tukey cldiff alpha=.05;
run;
#testing and estimating contrasts
proc glm data=Bearings;
class Type;
model Time=Type;
contrast '1_2' Type 1 -1 0 0 0;
estimate '1_2' Type 1 -1 0 0 0;
run;
The "estimate"-statement provides the standard error to create an individual student’s t-confidence interval
for the specified contrast.
4 Simple & multiple linear regression

4.1 Model fitting
SAS/INSIGHT is a very helpful tool to interactively fit simple as well as multiple linear regression models.
The model statement or the data can be easily modified, which is especially useful to verify the influence of
presumed outliers, or to include transformations of variables in the model. Moreover, all techniques to
investigate distributions are available for the analysis of residuals.
Exploratory analysis
INSIGHT
#pairwise scatterplot of the response and all covariates
Analyze > Scatter Plot (Y X) (Y : Y X1 X2 ..., X : Y X1 X2 ...)
#correlation between the response and all covariates
Analyze > Multivariate (Y’s) (Y : Y X1 X2 ...)
Simple linear regression
For simple linear regression SAS/INSIGHT provides, among others, the following output:
- parameter estimates, with confidence intervals and significance t-tests
- R2- and adjusted R2-statistics
- F-test for β1 = 0
- residuals and standardised residuals, which can be analysed in the usual way
- fitted values for given datapoints
- confidence bounds for individual means
- prediction bounds for individual predictions
INSIGHT
#model fitting
Analyze > Fit (Y X)
(Y : Y, X : X, Intercept)
> Output > Tables : Model Equation
Summary of Fit
Analysis of Variance/Deviance
Parameter Estimates
95% C.I. (LR) For Parameters
> Residual Plots : Residual by Predicted
Apply
Individual confidence bounds for the means and prediction bounds for individual predictions are created with
#individual confidence and prediction bounds
Curves > Confidence Curves
> Mean : 95%
> Prediction : 95%
2 new variables have been added to the dataset: P_Y, the predicted or fitted values, and R_Y, the
corresponding residuals.
Standardised residuals, which get the name RS_Y and which should approximately have a standard normal distribution, are calculated with the following command. You can investigate these residuals and standardised residuals with the usual techniques.
#standardised residuals
Vars > Standardized Residual
A normal QQ-plot of the standardised residuals, and a scatterplot of the standardised residuals versus the
fitted values, where about 95% of the residuals should lie between -2 and 2, are very informative about the
quality of the regression model.
You can detect possible 'outliers' through their standardised residuals or through their Cook's distances, which indicate influential observations and which should be compared to the critical value of an F(p−1, N−p)-distribution.
#Cook's distances
Vars > Cook's D.
The influence of presumed outliers can be investigated by clicking on them in a graph or in the dataframe
and selecting
Edit > Observations > Exclude in Calculations.
When you want to apply a linear regression to a transformation of the response or the explanatory variable,
you can add this transformed variable to the dataset with
Edit > Variables
and re-fit your regression model, including the transformed variable.
Multiple linear regression
The basic techniques to fit a multiple linear regression model are the same as for simple linear regression.
However, especially the paired scatterplot and the adjusted R2-statistic become more interesting for multiple
covariates.
After having fitted a multiple regression model, it is useful to construct a partial residual (or leverage) plot for
each explanatory variable, which reflects the variation of the response variable that is induced by this
explanatory variable only. This can be used as an indicator of the relative influence of each observation on
the parameter estimates.
#partial leverage plots
Graphs > Partial Leverage
Two reference lines are displayed in each plot. For a given datapoint, its residual without the explanatory
variable under consideration is the vertical distance between the point and the horizontal line; its residual
with the explanatory variable is the vertical distance between the point and the fitted line.
4.2 Model selection
Regression analysis can also be performed with a SAS-procedure. Additionally, this offers the possibility to
make automatic selections among multiple covariates, based on various criteria, in order to get a good
regression model.
PGM
#regression analysis and model selection
proc reg data=dataset;
model Y = X1 X2 ... / <options>;
run;
Covariates are separated from each other with a blank. When you want to include transformations of
variables (power transforms, for instance) in the “model”-statement, you first have to create them explicitly
(e.g. in SAS/INSIGHT "XSQR = X*X "); then the new variables can be used in the regression procedure.
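For instance, a squared covariate can be created in a DATA-step before running the regression; the dataset and variable names in this sketch are placeholders.
PGM
#create a transformed covariate and use it in the regression
data dataset2;
set dataset;
XSQR=X*X;
proc reg data=dataset2;
model Y=X XSQR;
run;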
Some useful options that can be specified in the “model”-statement are (an example follows the list):
- P: calculates predicted values from the input data and the estimated model. The printout includes the observation number, the actual and predicted values, and the residual. (If CLI, CLM, or R is specified, P is unnecessary.)
- NOINT: suppresses the intercept term that is otherwise included in the model.
- R: requests an analysis of the residuals. The printed output includes everything requested by the P option plus the standard errors of the prediction for the mean and of the residual, the studentised residual, and Cook's D statistic to measure the influence of each observation on the parameter estimates.
- CLI: requests the 95% upper- and lower-confidence limits for the prediction of the dependent variable for each observation. The confidence limits reflect variation in the error, as well as variation in the parameter estimates.
- CLM: prints the 95% upper- and lower-confidence limits for the expected value (mean) of the dependent variable for each observation. This is not a prediction interval because it takes into account only the variation in the parameter estimates, not the variation in the error term.
- SELECTION= : specifies the method used to select an optimal subset of covariates. This can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, or CP.
- ADJRSQ: computes the adjusted R2-statistic for each model selected. (Only available when SELECTION= RSQUARE, ADJRSQ, or CP.)
- CP: computes Mallows' Cp-statistic for each model selected. (Only available when SELECTION= RSQUARE, ADJRSQ, or CP.)
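As an illustration of these options (with placeholder variable names), the first call below requests residual diagnostics together with confidence and prediction limits, and the second an all-subsets comparison based on R2, adjusted R2 and Mallows' Cp:
PGM
#residual analysis with confidence and prediction limits
proc reg data=dataset;
model Y=X1 X2 X3 / r clm cli;
run;
#model selection over all subsets
proc reg data=dataset;
model Y=X1 X2 X3 / selection=rsquare adjrsq cp;
run;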
Furthermore, it is possible to test hypotheses about the regression parameters in a “test”-statement, to
make plots of the results of the regression, and to save the output in a SAS-dataset.
PGM
#regression analysis, model selection, hypothesis testing, plots, and data output
proc reg data=dataset;
model Y = X1 X2 ... / <options>;
test equation <,..., equation >;
plot variable.*variable. <...>;
output out=newdataset predicted=pred residual=res student=stud;
run;
- The hypothesis-equations for the parameters are written with the names of the corresponding covariates. For instance, "test X1, X2 - X3 = 2;" tests whether the coefficient of X1 equals 0, and the difference between the coefficients of X2 and X3 is equal to 2.
- In the “plot”-line you can use all variables that result from the “model”-statement, such as r., student., p., u95., u95m., etc. For instance, "plot r.*p. student.*p." plots the residuals and the studentised residuals versus the fitted values of the linear regression.
- The “output”-statement creates an output dataset containing statistics calculated for each observation. For each statistic, specify the keyword, an equal sign, and a variable name for the statistic in the output data set. The following keywords are available: PREDICTED or P, RESIDUAL or R, L95M, U95M, L95, U95, STDP, STDR, STDI, STUDENT, COOKD, and PRESS.
5 Multivariate data analysis

5.1 Principal component analysis
Principal component analysis (PCA) is meant to find an optimal representation of a multivariate data cloud in
a lower-dimensional subspace. Therefore one creates independent new variables by taking linear
combinations of the original ones such that the projections of the data points upon the consecutive
subspaces generated by these so-called principal components retain maximal spread (variance). For
instance, the 2-dimensional representation that distinguishes best among the data is found by projecting
upon the plane generated by the first 2 principal components.
The interpretation of the principal components has to be inferred from the original variables' coefficients in
the linear combinations that make up the PCA.
A PCA can be performed with the following SAS-procedure:
PGM
#principal component analysis
proc princomp data=dataset <options>;
var X1 X2 ...;
run;
Some options that can be specified are (an example follows the list):
- cov: uses the diagonalisation of the variance-covariance matrix of the data in order to calculate the principal components, instead of the correlation matrix, which is the default method.
- out= newdataset: creates an output dataset that contains the principal component scores for all observations, together with the original variables. The principal components are given the variable names PRIN1, PRIN2, etc. This output dataset can be used e.g. in SAS/INSIGHT to plot the projection of the data cloud upon the first principal components.
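For example (with placeholder names), the following sketch computes the principal components from the covariance matrix and stores the scores in an output dataset:
PGM
#PCA on the covariance matrix with an output dataset
proc princomp data=dataset cov out=pcscores;
var X1 X2 X3;
run;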
5.2 Factor analysis
Factor analysis (FA) is aimed at finding underlying factors that explain contrasts and similarities between the
original variables. This procedure leads to projections in lower-dimensional spaces that optimally reflect the
configuration or spread of the original variables.
PGM
#factor analysis
proc factor data=dataset <options>;
var X1 X2 ...;
run;
Some possible options are (an example follows the list):
- nfactors= : specifies the number of factors to be retained.
- scree: prints a scree plot of the eigenvalues.
- rotate= v: asks for a varimax rotation of the factors.
- out= newdataset: creates an output dataset that contains the factor scores for all observations, together with the original variables. The factor variables are given the names FACTOR1, FACTOR2, etc. This output dataset can be used e.g. in SAS/INSIGHT to plot the projection of the data cloud upon the first underlying factors.
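For example (with placeholder names), a factor analysis retaining two factors, with a scree plot, a varimax rotation and an output dataset containing the factor scores:
PGM
#factor analysis with two factors, scree plot and varimax rotation
proc factor data=dataset nfactors=2 scree rotate=v out=fascores;
var X1 X2 X3;
run;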
A biplot gives an optimal 2-dimensional representation of the configuration of the variables, combined with
the projections of the data points on the plane generated by the first 2 underlying factors. In order to create a
biplot one first has to submit the following macro:
PGM
/* This macro produces a correlation biplot of a general form.
Last update : July 95.
It is written by Magda Wauters (E-mail Magda.Vuylsteke@kuleuven.ac.be).
Comments and information can be obtained on this E-mail address. */
%macro biplot(
target=cgmmwwc,   /* device for hardcopy plot */
skore=scoreset,   /* dataset with observation scores */
text=id,          /* label variable - must be a character variable */
multlad=1,        /* scalingfactor for variable-arrows */
multsco=1,        /* scalingfactor for observations */
radius=0,         /* min distance of visible points */
print=0,          /* print=1 then print loadings and scores */
varlist=,         /* list of variables to be plotted by means of arrows */
mean=0,           /* mean=1 then only the mean-score of the obs belonging to the
                     different categories of the text-variable will be plotted */
x_as=FACTOR1,     /* factor for factoranalysis */
y_as=FACTOR2      /* other orthogonal axes are possible e.g. in can. analysis */
);
goptions hsize=15 cm vsize=15 cm gunit=pct ftext=zapfb htext=2;
axis1 length=12 cm order=(-1,0,1)
value = (tick=1 justify=c "-1:&multsco"
tick=2 "0"
tick=3 "+1:&multsco")
label=("&x_as" f=zapfb h=2.5);
axis2 length=12 cm order=(-1 to 1)
value = (tick=1 justify=c "-1:&multsco"
tick=2 "0"
tick=3 "+1:&multsco")
label=("&y_as" f=zapfb h=2.5);
goptions targetdevice=&target;
%annomac;
* scaling to length = 1 of scores;
proc means data=&skore uss noprint;
var &x_as &y_as;
output out=gemid uss=ussx ussy;
data skore; if _n_ = 1 then set gemid; set &skore;
x_as = &x_as / sqrt(ussx) * &multsco;
y_as = &y_as / sqrt(ussy) * &multsco;
text=&text;
run;
%if &mean %then %do;
proc sort data=skore; by text;
proc means data=skore noprint mean; by text;
var x_as y_as;
output out=gemid mean=x_as y_as;
run;
data skore; set gemid;
%end;
* Saving the loadings;
proc corr data=&skore out=loadset noprint;
var &x_as &y_as;
with &varlist;
data lading(keep=text x_as y_as); set loadset; if _type_ = 'CORR';
x_as=&x_as * &multlad; y_as=&y_as * &multlad;
text='+'||_name_; run;
* concatenation of loadings and scores;
data totaal; set skore lading; keep text x_as y_as;
radius = x_as ** 2 + y_as ** 2 ;
if sqrt(radius) ge &radius;
data anno; set totaal;
length function $ 8 style $ 8;
retain xsys '2' ysys '2';
if _n_=1 then do; hsys='2'; %circle(0,0,1); end ;
hsys='4';
x=x_as;y=y_as;
if substr(text,1,1)='+' then do;
size=.8;
iden='var';
text=substr(text,2);
style='zapfb ';
function='label';
output;
end;
else do;
style='zapfbe';
size=.5;
function='label';
output;
end;
if iden='var' then do;
size=1;
%line(x,y,0,0,*,1,1);
output;
end;
symbol1 v=none;
proc gplot data=totaal anno=anno;
plot y_as*x_as/href=0 vref=0 haxis=axis1 vaxis=axis2;
* printout of the coordinates ;
%if &print %then %do;
proc print data =totaal; var text x_as y_as; run; %end;
footnote "Radius of the circle = 1";
%mend biplot;
options mlogic mprint;
run;
After submitting the factor procedure and this biplot macro you can make a biplot with a SAS-program of the
following form. After skore= you must give the name of the output dataset that contains the factor scores
and that was created with out= in the factor procedure. text= specifies the variable that contains the labels
of the observations (this must be a character variable), and varlist= lists the variables used in the FA.
PGM
%biplot(
target=cgmmwwc,
/*device for hardcopy plot */
skore= newdataset,
/*dataset with observation scores */
text= labelvar,
/*label variable - must be a character variable */
multlad=1,
/*scalingfactor for variable-arrows */
multsco=1,
/*scalingfactor for observations */
radius = 0,
/*min distance of visible points */
print=0,
/*print=1 then print loadings and scores. */
varlist= X1 X2 ...,
/* list of variables to be plotted by means of arrows */
mean=0,
/*mean=1 then only the mean-score of the obs belonging to the
different categories of the text-variable will be plotted */
x_as=FACTOR1,
/* factor for factoranalysis */
y_as=FACTOR2
/* Other orthogonal axes are possible e.g. in can. analysis */
);
run;
6 Canonical discriminant analysis
The purpose of canonical discriminant analysis (CDA) is to find a projection of the data in a low-dimensional
subspace which optimally distinguishes between different groups of multivariate data. As such it is an
application of PCA to a transformation of the data.
PGM
#canonical discriminant analysis
proc candisc data=dataset <options>;
class classvar;
var X1 X2 ...;
run;
- class: specifies the variable that contains the group labels.
- var: lists the variables to be used in the CDA.
Some options (an example follows the list):
- ncan= : specifies the number of canonical components to be retained.
- out= newdataset: creates an output dataset that contains the CDA scores for all observations, together with the original variables. The CDA scores are given the variable names CAN1, CAN2, etc. This output dataset can be used e.g. in SAS/INSIGHT to make a scatter plot of the data cloud in the canonical co-ordinate system.
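For example (with placeholder names), keeping the first two canonical components and saving the scores for plotting:
PGM
#CDA with two canonical components and an output dataset
proc candisc data=dataset ncan=2 out=cdascores;
class classvar;
var X1 X2 X3;
run;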