The print procedure

PROC PRINT DATA=data-set NOOBS LABEL;
  BY variable-list;         Group the output by the BY variables. The data must be presorted.
  ID variable-list;         Replace the observation numbers with the ID variables.
  SUM variable-list;        Print sums for the variables in the list.
  VAR variable-list;        Specify which variables to print and in what order.
  LABEL variable='label';   Use the label for the specified variable.

The NOOBS option suppresses the observation numbers. The LABEL option prints variable labels instead of variable names, provided labels have been defined with a LABEL statement. Note that when a LABEL statement is used in a DATA step, the labels become part of the data set; when it is used in a PROC step, the labels stay in effect only for the duration of that step.

data shoes;
  /* style is read from columns 1-15; ExerciseType needs length 10 for "basketball" */
  input style $ 1-15 ExerciseType :$10. Sales price;
  datalines;
Max Flight      running 1930 142.99
Zip fit leather walking 2250 83.99
zoom airborne   running 4150 112.99
Light step      walking 1130 73.99
Max step woven  walking 2230 75.99
zip sneak       c-train 1190 92.99
Air             basketball 1000 150
;
run;

proc sort data=shoes;
  by ExerciseType;
run;

proc print data=shoes label;
  by ExerciseType;
  sum sales;
  var style sales price;
  label Sales="sales in 2009";
run;

The sort procedure

PROC SORT DATA=messy OUT=neat NODUPKEY;
  BY state DESCENDING city;
RUN;

The NODUPKEY option tells SAS to keep only the first observation for each combination of BY-variable values and to drop the other observations that share those values. The DESCENDING keyword before the city variable requests sorting in descending order of city. By default, SAS sorts in ascending order.

proc sort data=shoes out=shoes_sorted NODUPKEY;
  by ExerciseType;
run;

proc print data=shoes_sorted;
run;

The format procedure

Mainly used to recode variable values through user-defined formats.

PROC FORMAT LIBRARY=libref.catalogname;
  VALUE numfmt   value1='formatted-value-1'
                 value2='formatted-value-2'
                 ........
                 valuen='formatted-value-n' ;
  VALUE $charfmt 'value1'='formatted-value-1'
                 'value2'='formatted-value-2'
                 ........
                 'valuen'='formatted-value-n' ;
RUN;

PROC FORMAT statement: Without the LIBRARY= option, formats are stored in a catalog called FORMATS in the temporary WORK library and exist only for the duration of the SAS session. If the LIBRARY= option specifies only a libref, formats are permanently stored in that library in a catalog called FORMATS.

data temp;
  infile cards dlm=',';
  input id $ sex $ emp_stat yr_edu jobcat $;
  cards;
A, m, 2, 18, 42
B, F, 0, 16, 00
C, f, 2, 16, 32
D, M, 1, 12, 52
E, f, 1, 18, 01
;
run;

proc format;
  value $job    '01'='Teacher'
                '31'-'33'='Computing Consultant'
                '41'-'49','51'-'59'='Medical Professional'
                other='N/A' ;
  value empst   0='NotEmployed'
                1='Part-time Employed'
                2='Full-time Employed' ;
  value edu_cat 1-11='Less than High School'
                12='High School'
                12<-high='More than High School' ;
  value $gen    'M','m'='Males'
                'F','f'='Females' ;
run;

/* Attaching formats permanently in a DATA step */
data rec;
  set temp;
  format emp_stat empst. jobcat $job.;
run;

proc print data=rec;
run;

/* Using formats temporarily in a PROC step */
/* (the data set rec is in WORK here; use libref.rec if it was saved to a permanent library) */
proc freq data=rec;
  tables sex yr_edu;
  format sex $gen. yr_edu edu_cat.;
run;

data new;
  set rec;
  /* Detaching formats from these variables */
  format emp_stat jobcat;
run;

proc print data=new;
  title "Data with NO user-written formats";
run;

Specifying ranges of values:

1. Ranges can be constant values or lists of values separated by commas:
   a. 'a', 'b', 'c'
   b. 1,22,43
2. Ranges can include intervals such as:
   <lower> - <higher>      the interval includes both endpoints.
   <lower> <- <higher>     the interval includes the higher endpoint, but not the lower.
   <lower> -< <higher>     the interval includes the lower endpoint, but not the higher.
   <lower> <-< <higher>    the interval includes neither endpoint.
3. The numeric missing value (.) and the character missing value (' ') can each be assigned a formatted value of their own.
4. Ranges can be specified with special keywords:
   a. LOW: From the least (most negative) possible number.
   b. HIGH: To the largest (positive) possible number.
   c.
OTHER: All other values not otherwise specified.
5. The LOW keyword does not format missing values.
6. The OTHER keyword does include missing values.

The means procedure

proc means options;
  statements;

Commonly used options: n, nmiss, mean, median, std, stderr, clm, lclm, uclm, min, max, sum, var, q1, q3, qrange, cv, skewness, kurtosis, t, prt (p-value for the t-test), maxdec=.

Commonly used statements:
  class variable-list;   Request a separate summary analysis for each group. The data need not be sorted first.
  by variable-list;      Request a separate summary analysis for each group. The data must be sorted by variable-list first.
  var variable-list;     Specify the analysis variables.
  output out=data-set-name statKeywords=names;

data htwt;
  input subject $ gender $ height weight score $;
  datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
run;

proc means data=htwt maxdec=2 n mean std stderr clm;
run;

The univariate procedure

proc univariate options;
  statements / statement-options;

Some useful options:
  normal: test for normality
  plot: produce three text plots: stem-and-leaf, box plot, and normal probability plot (Q-Q plot)

Statements:
  var variable-list;
  by variable-list;
  histogram variable-list / normal;   Generate a histogram with a normal density curve superimposed.
  qqplot: quantile-quantile plot
  probplot: quantile-probability plot
  inset: add a box that displays selected statistics

proc univariate data=htwt plot normal;
  var weight;
  histogram weight / normal;
  inset mean='Mean' (5.2) std='Standard deviation' (6.3) / font='Arial' pos=nw height=3;
  qqplot weight / normal(mu=160 sigma=44 color=red);
  probplot weight / normal(mu=160 sigma=44 color=red);
run;

See http://www.ats.ucla.edu/stat/sas/output/univ.htm for a detailed annotation of the output.

Two sample comparisons

T-test: testing the difference between two independent group means

Assumptions:
1. The two groups are independent, and the samples within each group are independent
2. The means of the two groups are normally distributed
3.
The variances of the two groups are approximately equal

data grouptime;
  do group="C", "T";
    do sub=1 to 5;
      input time @;   /* trailing @ holds the record so the next INPUT keeps reading it */
      output;
    end;
  end;
  drop sub;
  datalines;
80 93 83 89 98
100 103 104 99 102
;
run;

proc ttest data=grouptime;
  class group;
  var time;
run;

Wilcoxon rank-sum test

Appropriate for non-normal distributions, small sample sizes, and ordinal data. The null hypothesis is that the distributions of X in both groups are the same.

Group A: 3.1 2.2 1.7 2.7 2.5
Group B: 0.0 0.0 1.0 2.3

Ordered data: 0.0  0.0  1.0  1.7  2.2  2.3  2.5  2.7  3.1
Group:        B    B    B    A    A    B    A    A    A
Rank:         1.5  1.5  3    4    5    6    7    8    9

Rank-sum A: 4+5+7+8+9 = 33
Rank-sum B: 1.5+1.5+3+6 = 12
Test statistic: min(Rank-sum A, Rank-sum B)

data tumor;
  infile datalines missover;
  input group $ mass1-mass5;
  datalines;
A 3.1 2.2 1.7 2.7 2.5
B 0.0 0.0 1.0 2.3
;
run;

/* Reshape to one observation per measurement; the values end up in the variable mass1 */
proc transpose data=tumor out=tumor1 prefix=mass;
  by group;
  var mass1-mass5;
run;

proc npar1way data=tumor1 wilcoxon;
  class group;
  var mass1;
  exact wilcoxon;
run;

Paired t-test

The same subject is measured under two different treatment conditions.

Assumptions: The mean of the within-pair differences is normally distributed

data grouptime1;
  set grouptime;
  ctime=lag5(time);    /* control time from five observations earlier */
  if ctime ^=.;        /* keep only the treatment rows, which now carry their paired control time */
  rename time=ttime;
  drop group;
run;

proc ttest data=grouptime1;
  paired ctime*ttime;
run;

PROC UNIVARIATE can also be used for the paired t-test by analyzing the within-pair differences. It has an advantage over PROC TTEST in that it also reports nonparametric tests (the sign test and the Wilcoxon signed-rank test).

data grouptime2;
  set grouptime1;
  change=ctime-ttime;
  keep change;
run;

proc univariate data=grouptime2;
run;

One way analysis of variance (one way ANOVA)

Comparison of one continuous variable among multiple groups

Assumptions:
1. The groups are independent, and the samples within each group are independent
2. The data are normally distributed
3. The variances of the groups are approximately equal

F-test:
1. Total Sum of Squares: TSS
2. Sum of Squares due to treatment: SST
3.
Sum of Squares due to error: SSE

TSS = SST + SSE

     SST/(k-1)
F = -----------,   N = total sample size, k = number of groups
     SSE/(N-k)

data reading;
  input group $ words @@;
  datalines;
X 700 X 850 X 820 X 640 X 920
Y 480 Y 460 Y 500 Y 570 Y 580
Z 500 Z 550 Z 480 Z 600 Z 610
;
run;

proc anova data=reading;
  class group;
  model words=group;
  means group;
run;

Categorical data

One binomial or multinomial variable

proc freq data=data-set;
  tables variable-list / statement-options;

TABLES statement options that control what is printed:
  MISSING: includes missing values in frequency statistics
  NOCUM: no cumulative frequency
  NOPERCENT: no percentage
  LIST: prints cross-tabulations in list format rather than as a grid
  NOCOL: suppresses printing of column percentages in cross-tabulations
  NOROW: suppresses printing of row percentages in cross-tabulations

TABLES statement options that request statistics:
  AGREE: requests tests and measures of classification agreement, including McNemar's test, kappa statistics, etc.
  BIN: requests the binomial proportion, confidence limits, and test for one-way tables
  CHISQ: requests chi-square tests of homogeneity and measures of association
  CL: requests confidence limits for measures of association
  EXACT: requests Fisher's exact test
  MEASURES: requests measures of association, including Pearson and Spearman correlation coefficients, etc.
  RELRISK: requests relative risk measures for 2x2 tables
  RISKDIFF: requests risk differences and confidence limits for 2x2 tables

data htwt;
  input subject gender $ height weight score $;
  datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
run;

proc freq data=htwt;
  /* test H0: proportion of score "L" equals 0.6 */
  tables score / bin(p=.6 level="L");
  *tables score;
  *exact bin;
run;

data htwt1;
  set htwt;
  score1=(score="H");   /* 1 if score is "H", 0 otherwise */
run;

proc print data=htwt1;
run;

proc freq data=htwt1;
  tables gender*score1 / riskdiff;
run;

2-way contingency table (cross tabulations)

Useful for
1. Comparing two proportions with independent samples
2.
Testing independence between two categorical variables for one sample

Commonly used tests:
1. Chi-square test (chisq): appropriate when the expected count in each cell is greater than 5
2. Fisher exact test (fisher): for small sample sizes

data fisher;
  input gender $ vote $ count;
  datalines;
M Y 5
M N 0
F Y 1
F N 4
;
run;

proc freq data=fisher;
  tables gender*vote / fisher;
  weight count;
run;

proc freq data=htwt;
  tables gender*score / chisq fisher;
  exact chisq;
run;

proc gplot

proc gplot data=data_name;
  plot y1*x=symbol y2*x=symbol / overlay haxis vaxis;
run;

The example below is from data collected on a series of plots in Maryland to examine the relationship between gypsy moth egg mass densities and subsequent defoliation. The plots are 60 ha in size, and a random sample of 0.01 ha subplots was obtained in each plot. Egg masses were counted and defoliation (as a percent) was measured.

options ls=70;
title 'create defol means';

data dd;
  infile 'C:\Documents and Settings\anna\Desktop\597\87md.dat';
  input plot $ subplot egg def;
run;

proc print data=dd;
run;

proc means data=dd nway;
  class plot;
  var egg def;
  output out=result2 mean=meanegg meandef stderr=seegg sedef;
run;

proc print data=result2;
run;

The NWAY option in the PROC MEANS statement: Limit the output statistics to the observations with the highest _TYPE_ value (here, one row per plot, with no overall summary row).

data c;
  set result2;
  up = meanegg+(1.96*seegg);
  low = meanegg-(1.96*seegg);
run;

proc print data=c;
run;

title1 'Mean egg mass and two standard errors';
title2 ' Maryland 1987';
title2;   /* a null TITLE2 statement cancels the TITLE2 just defined */
axis2 label=("Egg mass");
symbol1 value=u color=red;
symbol2 value=l color=red;
symbol3 value=m color=black;

proc gplot data=c;
  plot up*plot=1 low*plot=2 meanegg*plot=3 / overlay vaxis=axis2;
run;

Title statement:
1. Global statement
2. TITLE1 is twice the height of all other titles and uses the SWISS font.
3. All other TITLE statements are one unit high and use the default hardware font.
4.
The following quoted passage is from the SAS online documentation:

"Using TITLE and FOOTNOTE Statements

You can define TITLE and FOOTNOTE statements anywhere in your SAS program. They are global and remain in effect until you cancel them or until you end your SAS session. All currently defined FOOTNOTE and TITLE statements are automatically displayed.

You can define up to ten TITLE statements and ten FOOTNOTE statements in your SAS session. A TITLE or FOOTNOTE statement without a number is treated as a TITLE1 or FOOTNOTE1 statement. You do not have to start with TITLE1 and you do not have to use sequential statement numbers. Skipping a number in the sequence leaves a blank line.

You can use as many text strings and options as you want, but place the options before the text strings they modify.

The most recently specified TITLE or FOOTNOTE statement of any number completely replaces any other TITLE or FOOTNOTE statement of that number. In addition, it cancels all TITLE or FOOTNOTE statements of a higher number. For example, if you define TITLE1, TITLE2, and TITLE3, resubmitting the TITLE2 statement cancels TITLE3.

To cancel individual TITLE or FOOTNOTE statements, define a TITLE or FOOTNOTE statement of the same number without options (a null statement):

title4;

But remember that this will cancel all other existing statements of a higher number.

To cancel all current TITLE or FOOTNOTE statements, use the RESET= graphics option in a GOPTIONS statement:

goptions reset=footnote;

Specifying RESET=GLOBAL or RESET=ALL also cancels all current TITLE and FOOTNOTE statements as well as other settings."

Symbol statement:
1. Global statement
2. Syntax: Symbol<1...255> keyword=value;
   Keywords include: color, line, value, interpol
3. A new symbol definition of any number replaces the old symbol definition of the same number with the same keywords

Axis statement:
1. Global statement
2.
Syntax: axis<1...99> label=("value");

proc reg data=htwt;
  model weight=height;
  plot weight*height p.*height / overlay;
run;

symbol1 value=plus color=black;
symbol2 i=rlclm95 line=1 color=red;
symbol3 i=rlcli95 line=4 color=blue;

proc gplot data=htwt;
  plot weight*height weight*height=2 weight*height=3 / overlay;
run;
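
The interpolation values in the SYMBOL statements above request a fitted linear regression line together with 95% limits: RLCLM95 draws confidence limits for the mean predicted value, and RLCLI95 draws limits for an individual predicted value. As a sketch of what those bands represent, the standard simple linear regression formulas are given below. The symbols are introduced here for illustration and do not appear in the original notes: x_0 is a height value, b_0 and b_1 are the fitted intercept and slope, n is the number of observations used in the fit, and SSE is the error sum of squares of the regression.

```latex
% Fitted value at x_0, residual standard deviation, and spread of the x values
\hat{y}(x_0) = b_0 + b_1 x_0, \qquad
s = \sqrt{\frac{SSE}{n-2}}, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2

% 95% confidence limits for the mean response (the RLCLM95 band)
\hat{y}(x_0) \;\pm\; t_{0.975,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% 95% prediction limits for an individual response (the RLCLI95 band)
\hat{y}(x_0) \;\pm\; t_{0.975,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root is why the individual-value band is always wider than the band for the mean, and both bands are narrowest at the mean height.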