Chapter 2 - Describing Data
1. Summary Statistics - Proc Means
2. More Statistics and Plots - Univariate
Note that for good plots, SAS is not recommended.
Other packages such as Mathematica, Matlab, Maple
and R are much better.
3. Proc Sort
This covers sections: 2.A-H. You should also read section
19I.
Creating a SAS data set: Example
/* Population, population density, births and deaths for
Western European countries, 1995 */
DATA EUROPE_W; /* this creates a SAS Data Set called EUROPE_W */
/* Source: Organisation for Economic Co-op. and Devel. Labour
Force Stat., 1976-1996, Paris, 1997 Ed.*/
LENGTH COUNTRY $ 12; /* the default length of 8 would truncate e.g. Netherlands */
INPUT COUNTRY $ POP DENSITY BRATE DRATE;
/* POP = population in 1000s, DENSITY = residents/km^2,
BRATE, DRATE = birth, death rate per 1000 */
DATALINES;
Austria 8047 95.9 . .
Belgium 10137 332.4 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1
Germany 81661 228.8 9.4 10.8
Greece 10454 79.2 9.7 9.6
Iceland 267 2.61 7.2 6.0
Ireland 3598 51.2 . .
Italy 57283 190.2 . .
Luxembourg 413 158.8 13.2 9.3
Netherlands 15459 378.91 2.3 8.8
Norway 4348 13.4 13.8 10.3
Portugal 9918 107.3 10.8 10.5
Spain 39210 77.7 9.2 8.7
Sweden 8847 19.7 11.6 10.6
Switzerland 7062 171.0 11.6 8.9
UK 58606 239.4 12.5 11.0
;
Questions of interest:
1. How many missing birth rates are in the sample?
2. What is the mean population density?
3. How variable is population density from country to
country?
4. What is the distribution of population? population
density?
Another SAS Data Set: Infile and Input
The file snails.txt contains data from an experiment in
which groups of 20 snails were held for periods of 1, 2, 3
or 4 weeks in carefully controlled conditions of
temperature and relative humidity.
There were two species of snail: A and B.
At the end of the exposure time the snails were tested to
see if they had survived; the process itself is fatal for the
animals.
Using the INFILE and INPUT statements, the data can be
read into a SAS data set called SNAILS.
Species Time Humidity Temperature Fatalities N
A       1    60.0     10          0          20
A       1    60.0     15          0          20
...........................................
B       4    75.8     20          7          20
Questions of interest:
1. What are the mean and standard deviation of the
number of fatalities of species B for each level of
exposure (TIME)?
2. What is the distribution of the number of fatalities?
3. What is an approximate 95% confidence interval for
the mean number of fatalities?
4. How many times did 0 fatalities occur?
Proc Means
Syntax:
PROC MEANS DATA = SASdata options;
(optional statements)
Explanation:
The DATA option specifies a SAS data set. If this option is
not used, SAS uses the most recently created SAS data set.
Examples:
PROC MEANS DATA = EUROPE_W;
PROC MEANS DATA = SNAILS;
PROC MEANS;
Europe Sample Example
PROC MEANS DATA = EUROPE_W;
The SAS System
The MEANS Procedure
Variable    N          Mean       Std Dev       Minimum       Maximum
POP        18      21321.61      25376.41   267.0000000      81661.00
DENSITY    18   132.7122222   108.3853853     2.6100000   378.9100000
BRATE      14    10.6785714     3.0579477     2.3000000    13.8000000
DRATE      14     9.6571429     1.4318972     6.0000000    12.0000000
Optional Statements for Proc Means
To compute specific kinds of statistics, list keywords such as N,
NMISS, MEAN, STD, STDERR, CLM, MIN, MAX,
SUM, VAR, CV, SKEWNESS, KURTOSIS and T in the PROC MEANS
statement. MAXDEC=n controls the number of decimal places printed.
An additional option is the NOPRINT option, which
suppresses printing of output in the Output Window.
PROC MEANS DATA = EUROPE_W NMISS MEAN STD
VAR MAXDEC=4;
gives the number of missing observations for each variable
in the SAS data set EUROPE_W, as well as the mean,
standard deviation and variance.
The MAXDEC option restricts the number of decimal
places to 4.
A number of optional statements can be used,
including the TITLE, VAR, CLASS, BY and OUTPUT statements.
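As a sketch of how these options and statements combine, the following requests the count, the number of missing values, the mean, the standard deviation and the coefficient of variation for two of the EUROPE_W variables. The particular statistics and MAXDEC=2 are only an illustrative choice.

```sas
/* Illustrative choice of statistics; assumes EUROPE_W was created as above */
PROC MEANS DATA = EUROPE_W N NMISS MEAN STD CV MAXDEC=2;
  VAR DENSITY BRATE;   /* restrict the summary to these two variables */
RUN;
```

The NMISS column addresses the question about missing birth rates, while MEAN, STD and CV describe the level and variability of population density.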
Europe Sample Example
PROC MEANS DATA = EUROPE_W NMISS MEAN STD
VAR MAXDEC=4;
The SAS System
The MEANS Procedure
Variable   N Miss         Mean       Std Dev       Variance
POP             0   21321.6111    25376.4144   643962409.08
DENSITY         0     132.7122      108.3854     11747.3917
BRATE           4      10.6786        3.0579         9.3510
DRATE           4       9.6571        1.4319         2.0503
Subcommand statements for Proc Means
The TITLE statement is useful for preparing reports.
The VAR statement specifies which variables the summary
statistics should be computed for.
Example:
PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR;
TITLE 'Demographic Statistics for Western Europe';
VAR DENSITY BRATE DRATE;
Subgrouping with the Class Statement
The CLASS statement is used when we require computation
of the various summary statistics for different subgroups
or classes. For example, to estimate the mean number of
fatalities for each of the two species of snail, we use
SPECIES as a class variable:
Example (try this one yourself using the data file from the
web):
DATA SNAILS;
INFILE 'snails.txt';
INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
PROC MEANS DATA=SNAILS MEAN;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES;
RUN;
QUIT;
Subgrouping with Class
After execution, the Output window contains the two
averages:
Mean Fatalities For Each Species of Snail
A   0.708333
B   4.020833
Subgrouping with Class
We are actually interested in the mean number of
fatalities for each type of snail at each level of exposure
(TIME).
Thus, TIME is a second classification variable, nested
within the first classification variable SPECIES.
We can obtain all of the required averages, as well as
95% confidence limits for the true mean in each case, by
employing the following:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
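One possible variation, assuming only species B is of interest: a WHERE statement restricts the analysis to those rows, and the ALPHA= option sets the level used by CLM (0.05, the default, gives 95% limits based on the t distribution). This is a sketch, not the only way to subset the data.

```sas
/* Sketch: species B only, one row of statistics per exposure time */
PROC MEANS DATA=SNAILS N MEAN STD CLM ALPHA=0.05;
  WHERE SPECIES = 'B';   /* keep species B only */
  VAR FATALITY;
  CLASS TIME;
RUN;
```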
Subgrouping with the By Statement
The BY statement is almost interchangeable with the CLASS
statement. However, it will only work when the data set
is sorted according to the BY variable. The CLASS
statement does not have this restriction.
Example:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
BY SPECIES TIME;
This works since SPECIES and TIME are already sorted. For
each value of SPECIES the variable TIME is sorted.
The CLASS statement uses more memory than BY, but the
BY will tend to be slower than CLASS, since sorting is a slow
operation. These differences are only noticeable for large
data sets.
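When the data set is not already in order, a PROC SORT step is needed before the BY statement can be used. A minimal sketch, assuming the SNAILS data set from above:

```sas
/* Sort first so that the BY statement in PROC MEANS is valid */
PROC SORT DATA=SNAILS;
  BY SPECIES TIME;
RUN;

PROC MEANS DATA=SNAILS MEAN CLM;
  VAR FATALITY;
  BY SPECIES TIME;     /* one block of output per SPECIES and TIME combination */
RUN;
```

Note that PROC SORT without an OUT= option replaces SNAILS with its sorted version.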
Using Output from Proc Means
The OUTPUT statement is used to create a new SAS data
set consisting of the summary statistics computed by PROC
MEANS.
Example: The following creates a new SAS data set
called SNAILSUM containing the 3 variables M_FATAL, S_FATAL, and
V_FATAL, along with SPECIES and the automatic variables _TYPE_
and _FREQ_. It holds one observation for each species and,
because the NWAY option is not used, one overall observation
with _TYPE_ = 0.
PROC MEANS DATA=SNAILS MEAN STD VAR NOPRINT;
VAR FATALITY;
CLASS SPECIES;
OUTPUT OUT=SNAILSUM
MEAN=M_FATAL
STD =S_FATAL
VAR =V_FATAL;
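To keep only the per-species rows, the NWAY option can be added to the PROC MEANS statement, and PROC PRINT shows what the new data set contains. This is a sketch; the data set name SNAILSUM2 is made up for illustration.

```sas
/* NWAY keeps only the rows for the full CLASS combination (here: each species) */
PROC MEANS DATA=SNAILS MEAN STD VAR NOPRINT NWAY;
  VAR FATALITY;
  CLASS SPECIES;
  OUTPUT OUT=SNAILSUM2   /* hypothetical data set name */
    MEAN=M_FATAL
    STD =S_FATAL
    VAR =V_FATAL;
RUN;

/* Inspect the result, including the automatic variables _TYPE_ and _FREQ_ */
PROC PRINT DATA=SNAILSUM2;
RUN;
```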
Output: Another Example
The following creates a SAS data set consisting of a
single observation on the two variables M_BRATE and M_DRATE.
The names listed after each statistic keyword in the OUTPUT
statement are assigned, in order, to the variables in the VAR
statement, so supply one name per VAR variable.
PROC MEANS DATA=EUROPE_W MEAN;
VAR BRATE DRATE;
OUTPUT OUT=EUROPSUM
MEAN=M_BRATE M_DRATE;
These new SAS data sets can later be used by SAS
procedures, if desired.
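If naming every output variable by hand becomes tedious, the AUTONAME option of the OUTPUT statement is one alternative: SAS then builds names of the form variable_statistic (for example BRATE_Mean). A sketch, with the data set name EUROPSUM2 chosen only for illustration:

```sas
/* AUTONAME generates the output variable names automatically */
PROC MEANS DATA=EUROPE_W NOPRINT;
  VAR BRATE DRATE;
  OUTPUT OUT=EUROPSUM2 MEAN= STD= / AUTONAME;   /* EUROPSUM2 is a hypothetical name */
RUN;

/* Use the new data set in a later procedure */
PROC PRINT DATA=EUROPSUM2;
RUN;
```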
Proc Means: Example
Here we plot a histogram of the averages of the numbers
of fatalities. Note that we have used the NOPRINT option
here to suppress output to the Output window.
PROC MEANS DATA=SNAILS MEAN NOPRINT;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
/* no semicolon until the end of the OUTPUT statement */
OUTPUT OUT = SNAILSUM
MEAN = M_FATAL;
PROC CHART DATA=SNAILSUM;
VBAR M_FATAL;
RUN;
PROC UNIVARIATE
Syntax:
PROC UNIVARIATE DATA = SASdata options;
statements;
Many of the options are the same as for PROC MEANS.
Some additional ones are available: see page 29 of the
textbook.
The default output is quite extensive and includes the
median and quartiles, the extreme percentiles, and lowest
and highest 5 observations.
These last are useful for ensuring that the data has been
read in sensibly.
PROC UNIVARIATE options
The PLOT option gives a crude normal QQ plot, an
informal, yet useful, check of normality. The NORMAL
option adds formal tests of normality.
It is a plot of the ordered observations versus the
expected value of ordered normal observations.
If the plot is close to a straight line, then the data are
approximately normally distributed. Otherwise, the data
are likely non-normal.
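The construction just described can also be carried out by hand. The sketch below uses PROC RANK with NORMAL=BLOM to compute approximate expected normal order statistics and then plots the data against them. The names QQ and NSCORE are made up for illustration.

```sas
/* Blom normal scores approximate the expected values of ordered normal observations */
PROC RANK DATA=EUROPE_W OUT=QQ NORMAL=BLOM;
  VAR DENSITY;
  RANKS NSCORE;        /* NSCORE is a hypothetical name for the normal scores */
RUN;

/* Crude line-printer plot: observations versus normal scores */
PROC PLOT DATA=QQ;
  PLOT DENSITY*NSCORE;
RUN;
```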
Normal QQ Plot: Example
This checks whether the distribution of Western
European population densities is approximately normal.
PROC UNIVARIATE DATA=EUROPE_W NORMAL PLOT;
VAR DENSITY;
To train your eye to recognize typical departures from
normality, simulation of normal and non-normal data
sets having various sample sizes is helpful:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANNOR(0);
PUT X;
END;
RUN; QUIT;
Normal QQ Plotting
Now, construct the normal QQ plot:
DATA NORTEST;
INFILE 'normal.dat';
INPUT X;
PROC UNIVARIATE NORMAL PLOT;
VAR X;
RUN; QUIT;
Repeating this for a number of different simulation runs
will give you a good notion as to what the normal QQ
plot should look like.
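Several simulation runs can also be generated in one step, without going through an external file. This is a sketch under that assumption; the data set name NORSIM, the variable SAMPLE and the choice of 4 samples are illustrative.

```sas
/* Generate 4 independent normal samples of size 20 in one data set */
DATA NORSIM;
  DO SAMPLE = 1 TO 4;
    DO I = 1 TO 20;
      X = RANNOR(0);   /* seed 0 uses the clock, so every run differs */
      OUTPUT;
    END;
  END;
RUN;

/* One set of plots and tests per sample; NORSIM is already ordered by SAMPLE */
PROC UNIVARIATE DATA=NORSIM NORMAL PLOT;
  VAR X;
  BY SAMPLE;
RUN;
```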
Normal QQ Plotting of Non-Normal Data
To see what a normal QQ plot shouldn’t look like, try
something like the following:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
U = UNIFORM(0);
IF U < .8 THEN
X = RANNOR(0);
ELSE X = 5*RANNOR(0);
PUT X;
END;
RUN; QUIT;
or
Normal QQ Plots of Non-Normal Data
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANEXP(0);
PUT X;
END;
RUN; QUIT;
In each case, create the normal QQ plot to see what
happens when the data is really not normally distributed.
The Plot options and Proc Means
Crude stem-and-leaf and boxplots can be produced using
the PLOT option.
Most of the statements that can be used with PROC MEANS
can be used with PROC UNIVARIATE. The exception, in older SAS
releases, is the CLASS statement. There you must make sure the
data are sorted properly and use the BY statement instead.
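A minimal sketch of the sort-then-BY approach with PROC UNIVARIATE, assuming the SNAILS data set read in earlier:

```sas
/* Sort by the grouping variable before using BY */
PROC SORT DATA=SNAILS;
  BY SPECIES;
RUN;

/* Stem-and-leaf plot, boxplot and summary statistics for each species */
PROC UNIVARIATE DATA=SNAILS PLOT;
  VAR FATALITY;
  BY SPECIES;
RUN;
```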
PROC SORT
Syntax
PROC SORT DATA=SASdata;
BY var1 var2 ... ;
Example 1:
PROC SORT DATA = EUROPE_W;
BY DENSITY;
The SAS data set then becomes
Country       POP  DENSITY  BRATE  DRATE
Iceland       267     2.61    7.2    6.0
Norway       4348    13.40   13.8   10.3
Finland      5108    15.10   12.3    9.6
................................
Netherlands 15459   378.91    2.3    8.8
The following sorts the data set so that DENSITY appears in
descending order.
PROC SORT DATA = EUROPE_W;
BY DESCENDING DENSITY;
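Because PROC SORT overwrites the input data set by default, it can be safer to write the sorted copy elsewhere with the OUT= option. A sketch, with the data set name EUROPE_SORTED chosen only for illustration:

```sas
/* Keep EUROPE_W unchanged and write the sorted copy to a new data set */
PROC SORT DATA = EUROPE_W OUT = EUROPE_SORTED;   /* EUROPE_SORTED is a hypothetical name */
  BY DESCENDING DENSITY;
RUN;

/* List the sorted observations */
PROC PRINT DATA = EUROPE_SORTED;
RUN;
```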