proc means

EPIB 698C lecture 7
Raul Cruz-Cano
1
Sorting, Printing and Summarizing Your Data
• SAS procedures (PROCs) perform a specific analysis or
function and produce results or reports
• E.g.: PROC PRINT DATA = new; RUN;
• All procedures have required statements, and most have
optional statements
• All procedures start with the keyword PROC, followed
by the name of the procedure, such as PRINT or CONTENTS
• Options, if there are any, follow the procedure name
• The DATA=data_name option tells SAS which data set to use as
input for the procedure. NOTE: if you omit it, SAS uses
the most recently created data set, which is not
necessarily the same as the most recently used data set.
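A minimal sketch of the DATA= note above. The data set names first and second are hypothetical and exist only for this illustration.

```sas
/* Two small data sets, created in this order (names are illustrative) */
DATA first;
   x = 1;
RUN;

DATA second;
   y = 2;
RUN;

/* DATA= names the input explicitly, so this prints FIRST */
PROC PRINT DATA = first;
RUN;

/* No DATA= option: SAS falls back to the most recently created
   data set, which is SECOND, even though FIRST was used last */
PROC PRINT;
RUN;
```

Naming the input with DATA= in every procedure avoids depending on this default.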
2
BY statement
• The BY statement is required for only one procedure, Proc
sort
PROC Sort data = new;
By gender;
Run;
• For all the other procedures, BY is an optional statement,
and tells SAS to perform analysis for each level of the
variable after the BY statement, instead of treating all
subjects as one group
Proc Print data =new;
By gender;
Run;
• All procedures except PROC SORT assume your data are
already sorted by the variables in your BY statement
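The sort-then-BY pattern can be sketched as follows. The contents of the data set new (id, gender, age) are made up for illustration.

```sas
/* Hypothetical data set with a gender variable */
DATA new;
   INPUT id gender $ age;
   DATALINES;
1 F 34
2 M 41
3 F 29
4 M 50
;
RUN;

/* Sort first so that each BY group is contiguous */
PROC SORT DATA = new;
   BY gender;
RUN;

/* Produces one listing per value of gender */
PROC PRINT DATA = new;
   BY gender;
RUN;
```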
3
PROC Sort
• Syntax
Proc Sort data =input_data_name out =out_data_name ;
By variable-1 … variable-n;
• The variables in the by statement are called by variables.
• With one by variable, SAS sorts the data based on the
values of that variable
• With more than one variable, SAS sorts observations by
the first variable, then by the second variable within the
categories of the first variable, and so on
• The DATA= and OUT= options specify the input and output
data sets. Without the DATA= option, SAS will use the
most recently created data set. Without the OUT= option,
SAS will replace the original data set with the newly sorted
version
4
PROC Sort
• By default, SAS sorts data in ascending order, from the
lowest to the highest value or from A to Z. To reverse the
order, add the keyword DESCENDING before each variable
that you want sorted from highest to lowest or from Z to A
• The NODUPKEY option tells SAS to eliminate any
duplicate observations that have the same values for the
BY variables
5
PROC Sort
• Example: The sealife.txt contains information on the average length in
feet of selected whales and sharks. We want to sort the data by the
family and length
Name     Family  Length
beluga   whale   15
whale    shark   40
basking  shark   30
gray     whale   50
mako     shark   12
sperm    whale   60
dwarf    shark   .5
whale    shark   40
humpback .       50
blue     whale   100
killer   whale   30
6
PROC Sort
DATA marine;
INFILE 'F:\sealife.txt';
INPUT Name $ Family $ Length;
run;
* Sort the data;
PROC SORT DATA = marine OUT = seasort
NODUPKEY;
BY Family DESCENDING Length;
run;
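To try the same sort without the external file, here is a sketch that reads the sealife data in-line with DATALINES. The PROC PRINT step is an addition so that the result can be inspected.

```sas
/* Same data as sealife.txt, read in-line instead of from a file */
DATA marine;
   INPUT Name $ Family $ Length;
   DATALINES;
beluga whale 15
whale shark 40
basking shark 30
gray whale 50
mako shark 12
sperm whale 60
dwarf shark .5
whale shark 40
humpback . 50
blue whale 100
killer whale 30
;
RUN;

/* Sort by Family (ascending), then by Length (descending) within Family.
   NODUPKEY drops observations that repeat a Family/Length combination. */
PROC SORT DATA = marine OUT = seasort NODUPKEY;
   BY Family DESCENDING Length;
RUN;

PROC PRINT DATA = seasort;
RUN;
```

The second "whale shark 40" line repeats a Family/Length combination, so NODUPKEY should leave 10 observations in seasort. The observation with the missing Family value sorts first.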
8
Title and Footnote statement
• Title and Footnote statements are global statements, and
are not technically part of any step.
• You can put them anywhere in your program, but since
they apply to the procedure output, it usually makes sense
to put them with the procedure
• Syntax
Title 'This is a title for this procedure';
Footnote 'This is the footnote for this procedure';
• To cancel the current title or footnote, use the following
null statement:
Title;
Footnote;
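A short sketch of setting and then cancelling a title and footnote. The data set demo and the title text are hypothetical.

```sas
/* Hypothetical one-row data set to print */
DATA demo;
   x = 1;
RUN;

/* Global statements: they apply to the output that follows */
TITLE 'Listing of the demo data';
FOOTNOTE 'Hypothetical example data';
PROC PRINT DATA = demo;
RUN;

/* Null statements cancel the title and footnote for later output */
TITLE;
FOOTNOTE;
PROC PRINT DATA = demo;
RUN;
```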
9
Label Statement
• The label statement can create descriptive labels, up to 256
characters long, for each variable
• Eg:
Label Shipdate = 'Date merchandise was shipped'
ID = 'Identification number of subject';
• When a label statement is used in a data step, the labels
become part of the data set; but when used in a PROC step,
the labels stay in effect only for the duration of that step
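The difference between the two placements can be sketched like this. The data set orders and its two rows are invented for the example.

```sas
/* LABEL in the DATA step: the label is stored with the data set */
DATA orders;
   INPUT ID Shipdate :MMDDYY10.;
   FORMAT Shipdate MMDDYY10.;
   LABEL Shipdate = 'Date merchandise was shipped';
   DATALINES;
1 05/04/2001
2 06/11/2001
;
RUN;

/* The LABEL option asks PROC PRINT to use labels as column headings.
   The LABEL statement in this step lasts only for this step. */
PROC PRINT DATA = orders LABEL;
   LABEL ID = 'Identification number of subject';
RUN;

/* PROC CONTENTS shows the stored label for Shipdate and none for ID */
PROC CONTENTS DATA = orders;
RUN;
```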
10
PROC Format statement
• PROC FORMAT allows you to create your own
formats. It is useful when you work with coded data.
• PROC FORMAT creates formats that will later be
associated with variables in a FORMAT statement
• Syntax of PROC FORMAT:
PROC FORMAT;
Value name range-1 = 'formatted-text-1'
range-2 = 'formatted-text-2'
range-n = 'formatted-text-n';
• Name is the name of the format you are creating; if the format is for
character data, then you need to use $name instead of name. In addition,
the name cannot be the name of an existing format
11
PROC Format statement
• Each range is the value (or set of values) of the variable that is
assigned to the text given in the quotation marks
• The text can be up to 32,767 characters long, but some
procedures print only the first 8 to 16 characters
• The following are some examples of valid range specifications:
'A' = 'Asian'
   character values must be put in quotation marks
1,3,5,7,9 = 'ODD'
   separate a list of values with commas; use a hyphen (-) for a
   continuous range of values
5000-high = 'high price'
   the keywords LOW and HIGH can be used in ranges to indicate the
   lowest and highest non-missing values for the variable
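The range forms above can be combined in one PROC FORMAT step. This is a sketch only: the format names (parity, price, $race) and the data set items are hypothetical.

```sas
PROC FORMAT;
   /* Comma-separated lists of single values */
   VALUE parity 1, 3, 5, 7, 9 = 'Odd'
                0, 2, 4, 6, 8 = 'Even';
   /* LOW and HIGH stand for the lowest and highest non-missing values.
      -< excludes the upper end point from the range. */
   VALUE price  LOW -< 5000 = 'Low price'
                5000 - HIGH = 'High price';
   /* Character format: name starts with $, values are quoted */
   VALUE $race  'A' = 'Asian';
RUN;

/* Hypothetical data to which the formats are applied */
DATA items;
   INPUT digit cost race $;
   DATALINES;
3 1200 A
8 7500 A
;
RUN;

PROC PRINT DATA = items;
   FORMAT digit parity. cost price. race $race.;
RUN;
```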
12
PROC Format statement
• A survey asked subjects about their preferred car color.
The data contain each subject's age, sex (coded as 1 for male
and 2 for female), annual income, and preferred car color
(yellow, green, blue, or white). Here are the data:
age sex income color
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 2 83000 W
13
DATA carsurvey;
INFILE 'C:\car.txt';
INPUT Age Sex Income Color $ ;
run;
* Define formats for the coded variables;
PROC FORMAT;
VALUE gender 1 = 'Male'
2 = 'Female';
VALUE agegroup 13 -< 20 = 'Teen'
20 -< 65 = 'Adult'
65 - HIGH = 'Senior';
VALUE $col 'W' = 'Moon White'
'B' = 'Sky Blue'
'Y' = 'Sunburst Yellow'
'G' = 'Green';
run;
* Apply the formats when printing;
PROC PRINT DATA = carsurvey;
FORMAT Sex gender. Age agegroup.
Color $col.
Income DOLLAR8.;
RUN;
14
Subsetting in procedures with a where statement
• The WHERE statement tells a procedure to use a subset of
data
• It is an optional statement for any PROC step
• Unlike subsetting in the DATA step, using a WHERE
statement in a procedure does not create a new data set
• The basic form is
Where condition; (e.g.: where gender = 'female';)
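A sketch contrasting the two ways of subsetting. The data set people and its values are hypothetical.

```sas
/* Hypothetical data */
DATA people;
   INPUT name $ gender $;
   DATALINES;
Ann female
Bob male
Cara female
;
RUN;

/* WHERE in the procedure: only matching rows are printed,
   and no new data set is created */
PROC PRINT DATA = people;
   WHERE gender = 'female';
RUN;

/* Subsetting in a DATA step: creates the new data set FEMALES */
DATA females;
   SET people;
   IF gender = 'female';
RUN;
```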
15
Subsetting in procedures with a where statement
• A data set contains information about well-known painters:
Name                  Style              Nation of origin
Mary Cassatt          Impressionism      U
Paul Cezanne          Post-impressionism F
Edgar Degas           Impressionism      F
Paul Gauguin          Post-impressionism F
Claude Monet          Impressionism      F
Pierre Auguste Renoir Impressionism      F
Vincent van Gogh      Post-impressionism N
• Goal: we want a list of impressionist painters
16
DATA style;
INFILE 'C:\style.txt';
INPUT Name $ 1-21 style $ 23-40 Origin $ 42;
RUN;
PROC PRINT DATA = style;
WHERE style = 'Impressionism';
TITLE 'Major Impressionist Painters';
FOOTNOTE 'F = France N = Netherlands U = US';
RUN;
17
Summarizing your data with PROC MEANS
• The PROC MEANS procedure provides simple statistics on
numeric variables. Syntax: Proc means options ;
• Simple statistics that PROC MEANS can produce
(the first five are produced by default):
MAX: the maximum value
MIN: the minimum value
MEAN: the mean
N: number of non-missing values
STDDEV: the standard deviation
NMISS: number of missing values
RANGE: the range of the data
SUM: the sum
MEDIAN: the median
18
Proc means
• Optional statements for PROC MEANS:
- BY variable-list: performs the analysis for each level
of the variables in the list. The data need to be sorted
first
- VAR variable-list: specifies which variables to use
in the analysis
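A sketch of requesting specific statistics and using BY and VAR together. The data set scores and its values are made up for illustration.

```sas
/* Hypothetical data with one missing score */
DATA scores;
   INPUT group $ score;
   DATALINES;
A 10
A 20
B 30
B .
;
RUN;

/* No statistic keywords: the default statistics are printed */
PROC MEANS DATA = scores;
   VAR score;
RUN;

/* BY processing needs sorted data */
PROC SORT DATA = scores;
   BY group;
RUN;

/* Keywords listed in the PROC MEANS statement replace the defaults */
PROC MEANS DATA = scores N NMISS MEAN MEDIAN RANGE SUM;
   BY group;
   VAR score;
RUN;
```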
19
Proc means
• A wholesale nursery sells garden flowers and wants to
summarize its sales figures by month. The data are as
follows:
ID      Date       Lily SnapDragon Marigold
756-01  05/04/2001 120  80         110
756-01  05/14/2001 130  90         120
834-01  05/12/2001 90   160        60
834-01  05/14/2001 80   60         70
901-02  05/18/2001 50   100        75
834-01  06/01/2001 80   60         100
756-01  06/11/2001 100  160        75
901-02  06/19/2001 60   60         60
756-01  06/25/2001 85   110        100
20
DATA sales;
INFILE 'C:\Flowers.txt';
INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily
SnapDragon Marigold;
Month = MONTH(SaleDate);
PROC SORT DATA = sales;
BY Month;
* Calculate means by Month for flower sales;
PROC MEANS DATA = sales; *OUTPUT OUT= values;
BY Month;
VAR Lily SnapDragon Marigold;
TITLE 'Summary of Flower Sales by Month';
RUN;
21
OUTPUT statement
• We can use the OUTPUT statement to write summary statistics to a
SAS data set
• Syntax:
OUTPUT out = data_name output-statistic-list;
• E.g.:
Proc means data = new;
Var age BMI;
Output out = new1 mean(age BMI) = mean_age mean_BMI;
Run;
• The output data set new1 contains the means of age and BMI,
stored in the variables mean_age and mean_BMI respectively.
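A self-contained sketch of the example above. The three rows in new are invented. NOPRINT (which suppresses the printed report) and the PROC PRINT step are additions so that only the output data set is shown.

```sas
/* Hypothetical input data */
DATA new;
   INPUT age BMI;
   DATALINES;
30 22.5
40 27.0
50 24.0
;
RUN;

/* Write the two means to NEW1 instead of printing a report */
PROC MEANS DATA = new NOPRINT;
   VAR age BMI;
   OUTPUT OUT = new1 MEAN(age BMI) = mean_age mean_BMI;
RUN;

/* NEW1 has one observation: _TYPE_, _FREQ_, mean_age, mean_BMI */
PROC PRINT DATA = new1;
RUN;
```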
22
PROC MEANS DATA = sales;
BY Month;
VAR Lily SnapDragon Marigold;
output out=new1
mean(Lily SnapDragon Marigold)=mean_lily
mean_SnapDragon mean_Marigold
sum (lily SnapDragon Marigold)=sum_lily
sum_SnapDragon sum_Marigold;
TITLE 'Summary of Flower Sales by Month';
RUN;
24
OUTPUT statement
• The SAS data set created by the OUTPUT statement will
contain all the variables defined in the output statistic list,
any variables in a BY or CLASS statement, plus two new
variables: _TYPE_ and _FREQ_
• Without a BY or CLASS statement, the data set will have just
one observation
• If there is a BY statement, the data set will have one
observation for each level of the BY group
• A CLASS statement produces one observation for each level
of interaction of the class variables
• The value of _TYPE_ depends on the level of interaction of
the CLASS variables.
• _TYPE_ = 0 is the grand total
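A sketch of how _TYPE_ behaves with two CLASS variables. The data set sales2, its variables (region, product, amount) and the output names summary and total are all hypothetical.

```sas
/* Hypothetical data: two regions, two products */
DATA sales2;
   INPUT region $ product $ amount;
   DATALINES;
East A 10
East B 20
West A 30
West B 40
;
RUN;

/* CLASS does not require sorted data */
PROC MEANS DATA = sales2 NOPRINT;
   CLASS region product;
   VAR amount;
   OUTPUT OUT = summary SUM(amount) = total;
RUN;

/* Expect _TYPE_ = 0 for the grand total, 1 for product only,
   2 for region only, and 3 for each region-by-product cell */
PROC PRINT DATA = summary;
RUN;
```

With this input the summary data set should hold 9 observations: 1 grand total, 2 product totals, 2 region totals and 4 region-by-product cells.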
25