Chapter 9

advertisement
Chapter 9
Producing Descriptive Statistics
PROC MEANS;
Summarize descriptive statistics for
continuous numeric variables.
PROC FREQ;
Summarize frequency tables for discrete
numeric variables or categorical variables.
Objectives
• Compute statistical summaries such as mean,
median, std, min, max, and so on for numeric
continuous variables
• Control # of decimals for reporting the
summary statistics
• Difference between PROC MEANS and PROC
SUMMARY procedures.
• Create one-way frequency table
• Create 2-way, n-way cross frequency table
2
PROC MEANS Output
Salary by Job Code
The MEANS Procedure
Analysis Variable : Salary
Job
N
Code
Obs
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
FLTAT1
14 14
25642.86
2951.07
21000.00
30000.00
FLTAT2
18
18
35111.11
1906.30
32000.00
38000.00
FLTAT3
12
12
44250.00
2301.19
41000.00
48000.00
PILOT1
8
8
69500.00
2976.10
65000.00
73000.00
PILOT2
9
9
80111.11
3756.48
75000.00
86000.00
PILOT3
8
8
99875.00
7623.98
92000.00
112000.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
3
Calculating Summary Statistics
for Numeric Variables
The MEANS procedure displays simple descriptive statistics
for the numeric variables in a SAS data set.
General form of a simple PROC MEANS step:
PROC MEANS DATA=SAS-data-set;
RUN;
Example:
proc means data=mylib.crew;
title 'Salary Analysis';
run;
4
Calculating Summary Statistics
Salary Analysis
The MEANS Procedure
Variable
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
HireDate 69
9812.78
1615.44
7318.00
12690.00
Salary
69
52144.93
25521.78
21000.00
112000.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
NOTE: PROC MEANS computes summary statistics for
any variable we want. However, it is meaningless to
compute some variables, such as Hiredate.
5
Calculating Summary Statistics
By default, PROC MEANS
– analyzes every numeric variable in the
SAS data set
– prints the statistics N, MEAN, STD, MIN, and MAX
– excludes missing values before calculating
statistics.
6
Specifying summary statistics to be computed
PROC MEANS data = mylib.crew mean median
range std ;
To specify the summary statistics to be
computed, add them to the PROC MEANS
statement as options.
Limitting Decimal Places
By default, RPOC MEANS uses the BEST. Format
to display values in the report. It can be many
decimal places such as 52.000000
To specify the # of decimal places to k places:
PROC MEAN Data = Mylib.crew MAXDEC=k ;
Maxdec =2 will result in 2 decimals in the report.
Maxdec =0 will result in no decimal place in the report.
8
Selecting Variables
The VAR statement restricts the variables
processed by PROC MEANS.
General form of the VAR statement:
VAR SAS-variable(s);
9
Selecting Variables
Mylib.crew
HireDate LastName
FirstName
Location Phone EmpID JobCode Salary
07NOV1992 BEAUMONT
SALLY T.
LONDON
1132
E00525 PILOT1
72000
1151
E02466 FLTAT3
41000
BARBARA ANN FRANKFURT 1163
E00802 PILOT2
81000
12MAY1985 BERGAMASCO CHRISTOPHER CARY
04AUG1988 BETHEA
proc means data=Mylib.crew;
var Salary;
title 'Salary Analysis';
run;
Salary Analysis
The MEANS Procedure
Analysis Variable : Salary
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
69
52144.93
25521.78
21000.00
112000.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
10
Grouping Observations
Using CLASS statement
The CLASS statement in the MEANS procedure
groups the observations of the SAS data set
for analysis.
General form of the CLASS statement:
CLASS SAS-variable(s);
11
Grouping Observations
Mylib.crew
HireDate LastName
FirstName
Location Phone EmpID JobCode Salary
07NOV1992 BEAUMONT
SALLY T.
LONDON
1132
E00525 PILOT1
72000
1151
E02466 FLTAT3
41000
BARBARA ANN FRANKFURT 1163
E00802 PILOT2
81000
12MAY1985 BERGAMASCO CHRISTOPHER CARY
04AUG1988 BETHEA
proc means data=mylib.crew maxdec=2;
var Salary;
class JobCode;
title 'Salary by Job Code';
run;
NOTE: The MAXDEC= option controls the number of
decimal places displayed in the output.
12
Grouping Observations using CLASS statement
Salary by Job Code
The MEANS Procedure
Analysis Variable : Salary
Job
N
Code
Obs
N
Mean
Std Dev
Minimum
Maximum
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
FLTAT1
14 14
25642.86
2951.07
21000.00
30000.00
FLTAT2
18
18
35111.11
1906.30
32000.00
38000.00
FLTAT3
12
12
44250.00
2301.19
41000.00
48000.00
PILOT1
8
8
69500.00
2976.10
65000.00
73000.00
PILOT2
9
9
80111.11
3756.48
75000.00
86000.00
PILOT3
8
8
99875.00
7623.98
92000.00
112000.00
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
The summary is displayed based on the order of the categories of the
CLASS variable.
Variables in CLASS statement can be character or numeric.
13
Results due to CLASS Statement
• The summary is displayed based on the order of the
categories of the CLASS variable.
• Variables in CLASS statement can be character or
numeric. It is important to make sure you do not use
continuous numeric variable in the CLASS statement.
• If there are two or more variables in CLASS
statement, the order of the variables in the CLASS
statement determined the order in the output
report.
14
PROC MEAN procedure Using BY Statement
PROC MEANS;
VAR variable list ;
BY Variable;
• It is important to know that when using BY statement, the data
set MUST be sorted in ascending order based on the variables in
the BY statement first using PROC SORT.
•The result using BY statement is displayed as separate tables
each is for the category of the variable in the BY statement.
•If there are two or more variables in the BY statement, the order
determines the order of the displayed tables in the report.
15
Exercise
Write a program to read diabetes data set and use PROC Means to produce
summary statistics
• for variables Age, Height and Weight.
Run the program and see the results.
• Produce the summary statistics N, mean median, max, min std, and range,
and Set decimal places to two by Maxdec =2.
Run the program and see the results.
• Ass the CLASS statement to produce summary results for each sex.
Run the program to see the results.
• Practice using BY statement for each sex. Before you add the BY SEX
statement, Make sure you sort the data by SEX.
Run the program and see the result.
• Add a WHERE statement to select cases for AGE > 30 to the program.
Run the program and see the results.
16
Create data set for summary statistics
in PROC MEANS
In many occasions, we may want to create a SAS data
set consisting of the summary statistics calculated by
PROC MEANS.
OUTPUT OUT=sas-data-set
summary-keyword(s) = variablename(s);
NOTE: summary-keywords are: Mean, Min, Max, Range,
Std, etc.
Variablenames are the variable names you want to call
for each summary statistics for each variable.
17
Create Summary Data Set using
PROC MEANS
Examples:
PROC MEANS data= mylib.crew;
VAR Hiredate salary;
OUTPUT OUT = mylib.discrip
mean = avghiredate avgsalary
Median= medhiredate medsalary;
Run;
18
Exercise
Revise the following program to do the following task: Use the
OUTPUT OUT= statement to save the summary statistics Mean,
Median and Std to a sas data set dia_summary, then print this data
set to see what’s in there.
PROC MEANS data = mylib.diabetes maxdec =2 ;
var age height weight;
class sex;
run;
19
PROC SUMMARY procedure
PROC SUMMARY procedure uses the same program codes
as PROC MEANS.
PROC SUMMARY does not produce report by default. In
order to produce the report, you need to add PRINT as
the option:
PROC SUMMARY data = sasdataset PRINT;
When do we use PROC SUMMARY?
If you only want to produce and save the summary to a
SAS data set, you can use PROC SUMMARY.
OR you can use the option: NOPRINT in PROC MEANS.
PROC FREQ procedure Objectives
– Generate simple descriptive statistics using the
MEANS procedure.
– Group observations of a SAS data set for analysis
using the CLASS statement in the MEANS
procedure.
– Create one-way and two-way frequency tables
using the FREQ procedure.
– Restrict the variables processed by the FREQ
procedure.
21
PROC FREQ Output
Distribution of Job Code Values
The FREQ Procedure
Job
Cumulative
Cumulative
Code
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
FLTAT1
14
20.29
14
20.29
FLTAT2
18
26.09
32
46.38
FLTAT3
12
17.39
44
63.77
PILOT1
8
11.59
52
75.36
PILOT2
9
13.04
61
88.41
PILOT3
8
11.59
69
100.00
22
Goal Report 1
International Airlines wants to know how many
employees are in each job code.
Distribution of Job Code Values
The FREQ Procedure
Job
Cumulative
Cumulative
Code
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
FLTAT1
14
20.29
14
20.29
FLTAT2
18
26.09
32
46.38
FLTAT3
12
17.39
44
63.77
PILOT1
8
11.59
52
75.36
PILOT2
9
13.04
61
88.41
PILOT3
8
11.59
69
100.00
23
Goal Report 2
Categorize job code and salary values to determine how
many employees fall into each group.
Salary Distribution by Job Codes
The FREQ Procedure
Table of JobCode by Salary
JobCode
Salary
Frequency
Percent
Row Pct
Col Pct
‚
‚
‚
‚Less tha‚25,000 t‚More tha‚ Total
‚n 25,000‚o 50,000‚n 50,000‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Flight Attendant ‚
5 ‚
39 ‚
0 ‚
44
‚
7.25 ‚ 56.52 ‚
0.00 ‚ 63.77
‚ 11.36 ‚ 88.64 ‚
0.00 ‚
‚ 100.00 ‚ 100.00 ‚
0.00 ‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Pilot
‚
0 ‚
0 ‚
25 ‚
25
‚
0.00 ‚
0.00 ‚ 36.23 ‚ 36.23
‚
0.00 ‚
0.00 ‚ 100.00 ‚
‚
0.00 ‚
0.00 ‚ 100.00 ‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total
5
39
25
69
7.25
56.52
36.23
100.00
24
Creating a Frequency Report
• PROC FREQ displays frequency counts of the
data values in a SAS data set.
• General form of a simple PROC FREQ step:
PROC FREQ DATA=SAS-data-set;
RUN;
Example:
proc freq data=mylib.crew;
run;
25
Creating a Frequency Report
• By default, PROC FREQ
– analyzes every variable in the SAS data set
– displays each distinct data value
– calculates the number of observations in which each
data value appears (and the corresponding
percentage)
– indicates for each variable how many observations
have missing values.
26
Default Frequency Reports
mylib.crew
HireDate LastName
•
07NOV1992 BEAUMONT
FirstName
Location Phone EmpID JobCode Salary
SALLY T.
LONDON
1132
E00525 PILOT1
72000
1151
E02466 FLTAT3
41000
BARBARA ANN FRANKFURT 1163
E00802 PILOT2
81000
12MAY1985 BERGAMASCO CHRISTOPHER CARY
04AUG1988 BETHEA
Distribution of
proc freq data=mylib.crew;
run;
HireDate
Distribution of
Salary
Distribution of
Distribution of
LastName
JobCode
Distribution of
FirstName
Distribution of
Distribution of
Distribution of
Location
Phone
EmpID
27
...
One-Way Frequency Report
• Use the TABLES statement to limit the variables
included in the frequency counts. These are typically
variables that have a limited number of distinct
values.
• General form of a PROC FREQ step with a TABLES
statement:
PROC FREQ DATA=SAS-data-set ;
TABLES SAS-variables / NOCUM;
RUN;
NOCUM option in the TABLES statement suppress
Cumulative frequency and Cumulative percentage
28
Creating a Frequency Report
proc freq data=mylib.crew;
tables JobCode;
title 'Distribution of Job Code Values';
run;
Distribution of Job Code Values
The FREQ Procedure
Job
Cumulative
Cumulative
Code
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
FLTAT1
14
20.29
14
20.29
FLTAT2
18
26.09
32
46.38
FLTAT3
12
17.39
44
63.77
PILOT1
8
11.59
52
75.36
PILOT2
9
13.04
61
88.41
PILOT3
8
11.59
69
100.00
29
Using PROC FORMAT to redefine
Categories of Values in TABLES statement
International Airlines wants to use formats to
categorize the flight crew by job code.
Stored values
Formatted values
PILOT1
PILOT2
PILOT3
Pilot
FLTAT1
FLTAT2
FLTAT3
Flight Attendant
30
Analyzing Categories of Values
proc format;
• value $codefmt
'FLTAT1'-'FLTAT3'='Flight Attendant'
'PILOT1'-'PILOT3'='Pilot';
run;
proc freq data = mylib.crew;
format JobCode $codefmt.;
tables JobCode;
run;
NOTE: The original data values for Jobocde are not changed.
They are still FLTAT1 FLTAT2, and so on.
31
Analyzing Categories of Values
Distribution of Job Code Values
The FREQ Procedure
Cumulative
Cumulative
JobCode
Frequency
Percent
Frequency
Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
Flight Attendant
44
63.77
44
63.77
Pilot
25
36.23
69
100.00
32
Crosstabular Frequency Reports
• A two-way, or crosstabular, frequency report
analyzes all possible combinations of the distinct
values of two variables.
• The asterisk (*) operator in the TABLES statement is
used to cross variables.
• General form of the FREQ procedure to create a
crosstabular report:
PROC FREQ DATA=SAS-data-set;
TABLES variable1*variable2;
RUN;
Variable1 is ROW and Variable2 is Column
33
Crosstabular Frequency Reports
proc format;
• value $codefmt
'FLTAT1'-'FLTAT3'='Flight Attendant'
'PILOT1'-'PILOT3'='Pilot';
value money
low-<25000 ='Less than 25,000'
25000-50000='25,000 to 50,000'
50000<-high='More than 50,000';
run;
proc freq data=mylib.crew;
tables JobCode*Salary;
format JobCode $codefmt. Salary money.;
title 'Salary Distribution by Job Codes';
run;
34
Crosstabular Frequency Reports
Salary Distribution by Job Codes
The FREQ Procedure
Table of JobCode by Salary
JobCode
Salary
Frequency
Percent
Row Pct
Col Pct
‚
‚
‚
‚Less tha‚25,000 t‚More tha‚ Total
‚n 25,000‚o 50,000‚n 50,000‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Flight Attendant ‚
5 ‚
39 ‚
0 ‚
44
‚
7.25 ‚ 56.52 ‚
0.00 ‚ 63.77
‚ 11.36 ‚ 88.64 ‚
0.00 ‚
‚ 100.00 ‚ 100.00 ‚
0.00 ‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Pilot
‚
0 ‚
0 ‚
25 ‚
25
‚
0.00 ‚
0.00 ‚ 36.23 ‚ 36.23
‚
0.00 ‚
0.00 ‚ 100.00 ‚
‚
0.00 ‚
0.00 ‚ 100.00 ‚
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total
5
39
25
69
7.25
56.52
36.23
100.00
35
Additional Syntax for TABLES statement in
PROC FREQ; statement
Syntax
Equivalent to
tables A*(B C);
tables A*B A*C;
tables (A B)*(C D);
tables A*C B*C A*D B*D;
tables (A B C)*D;
tables A*D B*D C*D;
tables A - - C;
tables A B C;
tables (A - - C)*D;
tables A*D B*D C*D
TABLES A*B*C;
Produces separate two-way tables of B*C
for each value of A.
To Suppress some columns in the
PROC FREQ summary report
PROC FREQ;
TABLES var1*var2/ <OPTIONS>;
Options for suppressing cell frequency: NOFREQ
Options for suppressing cell percent: NOPERCENT
Options for suppressing ROW percent: NOROW
Options for suppressing COLUMN percent: NOCOL
Additional usages of PROC FREQ statement
In addition to reporting tables, PROC FREQ;
statement also conduct many statistical tests
for analyzing categorical data such as
Chi-square test,
Cochran-Mantel-Haenszel test,
Fisher’s exact test,
Kappa coefficient,
Risk, Odds ratio and so on.
This is beyond the programming course.
Exercise
The Diabetes data set consists of Sex, Age, Height, Weight, Pulse FastGluc PostGluc for 20
patients. Revise the following program by using PROC FREQ procedure to perform the
following tasks:
1. Use IF statement to create AGE_G variable : IF AGE > 45 then, AGE_G = ‘Senior’ ,
otherwise AGE_G = ‘Young’. Create one-way table for variables SEX , Age_G, and Pulse
using user-defined format.
Run the program and see the results.
2. Create cross tabular table sex*(Age_G Pulse), make sure the
user-defined format is applied for Pulse variable.
Run the program and see the results.
3. Suppressing ROW percent and Column percent.
Run the program and see the results.
proc format;
value pulft LOW-70 = 'Low' 71-High = 'High'; run;
data diab; set mylib.diabetes; run;
39
Download