EPIB698D Lecture 3
Raul Cruz-Cano
Spring 2013

Proc Univariate
• The UNIVARIATE procedure provides data summarization on the distribution of numeric variables.

PROC UNIVARIATE <option(s)>;
  VAR variable-1 ... variable-n;
RUN;

Options:
• PLOTS: create low-resolution stem-and-leaf, box, and normal probability plots
• NORMAL: request tests for normality

data blood;
  infile 'C:\blood.txt';
  input subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

proc univariate data=blood;
  var cholesterol;
run;

OUTPUT (1)
The UNIVARIATE Procedure
Variable: cholesterol

                          Moments
N                      795    Sum Weights             795
Mean             201.43522    Sum Observations     160141
Std Deviation   49.8867157    Variance          2488.6844
Skewness        -0.0014449    Kurtosis         -0.0706044
Uncorrected SS    34234053    Corrected SS     1976015.41
Coeff Variation 24.7656371    Std Error Mean   1.76929947

Moments are statistical summaries of a distribution.

N - The number of valid observations for the variable. The total number of observations is the sum of N and the number of missing values.

Sum Weights - A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight is 1 for each observation, so by default this field equals N. This field is the sum of the weight variable over all observations.

Sum Observations - The sum of the observation values. If a weight variable is specified, this field is the weighted sum.
Mean - The mean of the variable: the sum of observations divided by the sum of weights.

Std Deviation - The standard deviation is the square root of the variance. It measures the spread of a set of observations: the larger the standard deviation, the more spread out the observations are.

Variance - The variance is a measure of variability. It is the sum of the squared distances of the data values from the mean, divided by N-1. We generally do not use the variance as an index of spread because it is in squared units; instead, we use the standard deviation.

Skewness - Skewness measures the degree and direction of asymmetry. A symmetric distribution, such as a normal distribution, has skewness 0. A distribution that is skewed to the left, e.g. when the mean is less than the median, has negative skewness.

Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SAS, a normal distribution has kurtosis 0.
Extremely nonnormal distributions can have large positive or negative kurtosis values, while nearly normal distributions have kurtosis values close to 0. Kurtosis is positive if the tails are "heavier" than those of a normal distribution and negative if the tails are "lighter".

Uncorrected SS - The uncorrected sum of squares: the sum of the squared data values.

Corrected SS - The corrected sum of squares: the sum of squared distances of the data values from the mean. This number divided by N-1 gives the variance.

Coeff Variation - The coefficient of variation is another way of measuring variability. It is a unitless measure, defined as the ratio of the standard deviation to the mean, and is generally expressed as a percentage. It is useful for comparing variation between different variables.

Std Error Mean - The estimated standard deviation of the sample mean, computed as the standard deviation of the sample divided by the square root of the sample size. It provides a measure of the variability of the sample mean.
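These moment statistics can be reproduced directly from their definitions. The sketch below is plain Python, not SAS; it computes the sample skewness and excess kurtosis with the same bias corrections SAS uses, plus the coefficient of variation and standard error of the mean. The function name and example data are illustrative.

```python
import math

def moments(xs):
    """SAS-style moment statistics for a list of numbers (needs n >= 4)."""
    n = len(xs)
    mean = sum(xs) / n
    css = sum((x - mean) ** 2 for x in xs)       # Corrected SS
    var = css / (n - 1)
    sd = math.sqrt(var)
    z3 = sum(((x - mean) / sd) ** 3 for x in xs)
    z4 = sum(((x - mean) / sd) ** 4 for x in xs)
    skew = n / ((n - 1) * (n - 2)) * z3          # bias-corrected skewness
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))  # excess kurtosis
    cv = 100 * sd / mean                         # Coeff Variation (%)
    sem = sd / math.sqrt(n)                      # Std Error Mean
    return {"mean": mean, "std": sd, "skew": skew,
            "kurt": kurt, "cv": cv, "sem": sem}

print(moments([1, 2, 3, 4, 5]))
```

Running this on the cholesterol column would reproduce the OUTPUT (1) values above; for instance, Std Error Mean = 49.8867157 / sqrt(795) = 1.76929947.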
OUTPUT (2)

Location                      Variability
Mean      201.4352    Std Deviation          49.88672
Median    202.0000    Variance                   2489
Mode      208.0000    Range                 314.00000
                      Interquartile Range    71.00000

NOTE: The mode displayed is the smallest of 2 modes with a count of 12.

Median - The median is a measure of central tendency. It is the middle number when the values are arranged in ascending (or descending) order. It is less sensitive than the mean to extreme observations.

Mode - The mode is another measure of central tendency. It is the value that occurs most frequently in the variable.

Range - The range is a measure of the spread of a variable. It is equal to the difference between the largest and the smallest observations. It is easy to compute and easy to understand.

Interquartile Range - The interquartile range is the difference between the upper quartile (75%, Q3) and the lower quartile (25%, Q1). It measures the spread of a data set and is robust to extreme observations.

OUTPUT (3)

Tests for Location: Mu0=0
Test          -Statistic-     -----p Value------
Student's t   t   113.8503    Pr > |t|    <.0001
Sign          M      397.5    Pr >= |M|   <.0001
Signed Rank   S     158205    Pr >= |S|   <.0001

Sign - The sign test is a simple nonparametric procedure to test a null hypothesis about the population median. It is used when we have a small sample from a nonnormal distribution. The statistic M is defined as M = (N+ - N-)/2, where N+ is the number of values greater than Mu0 and N- is the number of values less than Mu0. Values equal to Mu0 are discarded. Under the null hypothesis that the population median equals Mu0, the sign test computes the p-value for M using a binomial distribution.
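The location and spread measures above are simple to compute by hand. A plain-Python sketch (illustrative names, not SAS) that mirrors SAS's tie-breaking rule of reporting the smallest of several modes:

```python
from collections import Counter

def describe(xs):
    """Median, mode, and range for a list of numbers."""
    s = sorted(xs)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    counts = Counter(s)
    top = max(counts.values())
    # like SAS, report the smallest value among tied modes
    mode = min(v for v, c in counts.items() if c == top)
    rng = s[-1] - s[0]
    return median, mode, rng

print(describe([1, 2, 2, 3, 10]))
```

Note how the single extreme value 10 moves the mean a lot but leaves the median at 2, which is exactly the robustness property described above.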
The interpretation of the p-value is the same as for the t-test. In our example the M statistic is 397.5 and the p-value is less than 0.0001, so we conclude that the median of the variable is significantly different from zero.

Signed Rank - The signed rank test is also known as the Wilcoxon test. It is used to test the null hypothesis that the population median equals Mu0, and it assumes that the distribution of the population is symmetric. The Wilcoxon signed rank statistic is computed from the rank sum and the numbers of observations above and below the median. The interpretation of the p-value is the same as for the t-test. In our example the S statistic is 158205 and the p-value is less than 0.0001, so we conclude that the median of the variable is significantly different from zero.

OUTPUT (4)

Quantiles (Definition 5)
Quantile       Estimate
100% Max            331
99%                 318
95%                 282
90%                 267
75% Q3              236
50% Median          202
25% Q1              165
10%                 138
5%                  123
1%                   94
0% Min               17

95% - Ninety-five percent of all values of the variable are equal to or less than this value.

OUTPUT (5)

Extreme Observations
----Lowest----    ----Highest---
Value    Obs      Value    Obs
   17    829        323    828
   36    492        328    203
   56    133        328    375
   65    841        328    541
   69     79        331    191

Missing Values
                            -----Percent Of-----
Missing Value   Count       All Obs   Missing Obs
.                 205         20.50        100.00

Extreme Observations - A list of the five lowest and five highest values of the variable.

PROC FREQ
• PROC FREQ and PROC MEANS have literally been part of SAS for over 30 years.
• They are probably THE most used of the SAS analytical procedures.
• They provide useful information directly and indirectly and are easy to use, so people run them daily without thinking about them.
• These procedures can facilitate the construction of many useful statistics and data views that are not readily evident from the documentation.
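The sign test described above is easy to implement from its definition. A plain-Python sketch (not SAS) that discards ties and uses the exact Binomial(n, 1/2) distribution for the two-sided p-value; the function name is illustrative:

```python
from math import comb

def sign_test(xs, mu0=0):
    """Two-sided sign test for the population median (illustrative sketch)."""
    n_plus = sum(1 for x in xs if x > mu0)    # values above Mu0
    n_minus = sum(1 for x in xs if x < mu0)   # values below Mu0
    n = n_plus + n_minus                      # ties with Mu0 are discarded
    m = (n_plus - n_minus) / 2                # SAS's M statistic
    k = min(n_plus, n_minus)
    # two-sided p-value: 2 * P(X <= k) for X ~ Binomial(n, 1/2), capped at 1
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return m, p
```

For example, five values all above zero give M = 2.5 and p = 2 * (1/32) = 0.0625, too large to reject the null at the 5% level with so few observations.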
Proc Freq
• PROC FREQ can be used to count frequencies of both character and numeric variables.
• Counts for one variable are called one-way frequencies.
• Counts for two or more variables are called two-way, three-way, and so on up to n-way frequencies, or simply cross-tabulations.
• Syntax:
proc freq;
  table(s) variable-combinations;
• To produce one-way frequencies, just put the variable name after TABLES; to produce cross-tabulations, put an asterisk (*) between the variables.

Proc Freq
• The blood.txt data contain information on 1000 subjects. The variables are: subject ID, gender, blood type, age group, red blood cell count, white blood cell count, and cholesterol.
• Here are the data for the first few subjects:
1 Female AB Young 7710 7.4  258
2 Male   AB Old   6560 4.7    .
3 Male   A  Young 5690 7.53 184
4 Male   B  Old   6680 6.85   .
5 Male   A  Young    . 7.72 187
• We want to derive frequencies of gender, age group, and blood type.

PROC FREQ options
• NOCOL: suppress the column percentage for each cell
• NOROW: suppress the row percentage for each cell
• NOPERCENT: suppress the percentages in cross-tabulation tables, or the percentages and cumulative percentages in one-way frequency tables and in list format

data blood;
  infile "c:\blood.txt";
  input ID Gender $ Blood_Type $ Age_Group $ RBC WBC cholesterol;
run;

proc freq data=blood;
  tables Gender Blood_Type;
  tables Gender * Age_Group * Blood_Type / nocol norow nopercent;
run;

Continuous Values
• A continuous variable (tuition, math SAT score, salary, age, etc.) would produce literally hundreds of categories, one for each distinct value.
• You need to transform such numeric variables into categorical variables.
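What PROC FREQ does for one-way tables and cross-tabulations is, at heart, counting. A minimal plain-Python analogue (illustrative, using a few rows patterned on the blood data shown above):

```python
from collections import Counter

# (gender, blood type, age group) for a handful of subjects
rows = [
    ("Female", "AB", "Young"),
    ("Male",   "AB", "Old"),
    ("Male",   "A",  "Young"),
    ("Male",   "B",  "Old"),
    ("Male",   "A",  "Young"),
]

# one-way frequency: tables Gender;
gender_freq = Counter(r[0] for r in rows)

# two-way cross-tabulation: tables Gender * Blood_Type;
crosstab = Counter((r[0], r[1]) for r in rows)

print(gender_freq)
print(crosstab)
```

Each extra `*` in a TABLES statement simply adds one more element to the key being counted.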
A naive cross-tabulation of the raw continuous variables produces a huge, useless table:

proc freq data=blood;
  tables RBC*WBC / norow nocol nopercent;
run;

Use PROC UNIVARIATE to find quartile cutoffs for the grouping:

proc univariate data=blood;
  var RBC WBC;
run;

Then recode the continuous variables into categories:

data blood;
  length rbc_group $ 12 wbc_group $ 12;
  set blood;
  if 0 <= RBC < 6375 then rbc_group = "Low";
  else if 6375 <= RBC < 7040 then rbc_group = "Medium Low";
  else if 7040 <= RBC < 7710 then rbc_group = "Medium High";
  else rbc_group = " High";
  if 0 <= WBC < 4.84 then wbc_group = "Low";
  else if 4.84 <= WBC < 5.52 then wbc_group = "Medium Low";
  else if 5.52 <= WBC < 6.11 then wbc_group = "Medium High";
  else wbc_group = " High";
run;

proc freq data=blood;
  tables rbc_group*wbc_group;
run;

PROC FREQ options
• MISSPRINT: display missing value frequencies
• MISSING: treat missing values as nonmissing

data one;
  input A Freq;
  datalines;
1 2
2 2
. 2
;
run;

proc freq data=one;
  tables A;
  title 'Default';
run;

proc freq data=one;
  tables A / missprint;
  title 'MISSPRINT Option';
run;

proc freq data=one;
  tables A / missing;
  title 'MISSING Option';
run;

PROC FREQ options
• The ORDER= option orders the values of the frequency and cross-tabulation table variables according to the specified order:
  – DATA: orders values by their order in the input data set
  – FORMATTED: orders values by their formatted values
  – FREQ: orders values by descending frequency count
  – INTERNAL: orders values by their unformatted values

PROC FREQ output
• OUT=: specify an output data set to contain the variable values and frequency counts (with percentages and, when requested, expected cell frequencies). Note that OUT= goes on the TABLES statement, not on the PROC statement.

data blood;
  infile "c:\blood.txt";
  input ID Gender $ Blood_Type $ Age_Group $ RBC WBC cholesterol;
run;

proc freq data=blood order=freq;
  tables Blood_Type / out=proccount;
run;

proc print data=proccount;
run;

PROC FREQ options
• Some of the statistics available through the FREQ procedure include:
  – CHISQ: provides chi-square tests of independence for each stratum and computes measures of association.
  – A significant chi-square (p < .001, for example) indicates that there is a strong dependence between the variables.

proc freq data=blood;
  tables rbc_group*wbc_group / norow nocol nopercent chisq;
run;

Proc corr
The CORR procedure is a statistical procedure for numeric random variables that computes correlation statistics. The default correlation analysis includes descriptive statistics, Pearson correlation statistics, and probabilities for each analysis variable.

PROC CORR options;
  VAR variables;
  WITH variables;
  BY variables;

proc corr data=blood;
  var RBC WBC cholesterol;
run;

Proc Corr Output

Simple Statistics
Variable          N        Mean     Std Dev        Sum    Minimum    Maximum
RBC             908        7043        1003    6395020       4070      10550
WBC             916     5.48353     0.98412       5023    1.71000    8.75000
cholesterol     795   201.43522    49.88672     160141   17.00000  331.00000

N - This is the number of valid (i.e., non-missing) cases used in the correlation. By default, proc corr uses pairwise deletion for missing observations: a pair of observations (one from each variable being correlated) is included if both values are non-missing. If you use the nomiss option on the proc corr statement, proc corr uses listwise deletion and omits all observations with missing data on any of the named variables.

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

                  RBC         WBC   cholesterol
RBC           1.00000    -0.00203       0.06583
                           0.9534        0.0765
                  908         833           725
WBC          -0.00203     1.00000       0.02496
               0.9534                    0.5014
                  833         916           728
cholesterol   0.06583     0.02496       1.00000
               0.0765      0.5014
                  725         728           795

Pearson Correlation Coefficients - These measure the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no correlation at all.
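Pairwise deletion and the Pearson coefficient itself are straightforward to sketch in plain Python (not SAS; `None` stands in for a SAS missing value, and the function name is illustrative):

```python
import math

def pearson_pairwise(xs, ys):
    """Pearson r with pairwise deletion: keep only the pairs where both
    values are present (None marks a missing value). Returns (r, n)."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

print(pearson_pairwise([1, 2, 3, None, 5], [2, 4, 6, 8, None]))
```

This is why each cell of the SAS matrix above reports its own N (833, 725, 728, ...): every pair of variables keeps a different set of complete observations.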
SAS ODS (Output Delivery System)
• ODS is a powerful tool that can enhance the efficiency of statistical reporting and meet the needs of the investigator.
• It creates output objects that can be sent to destinations such as HTML, PDF, RTF (rich text format), or SAS data sets.
• It eliminates the need for the macros that used to be required to convert standard SAS output to a Microsoft Word or HTML document.

Fish Measurement Data
The data set contains 35 fish of the species Bream caught in Finland's lake Laengelmavesi, with the following measurements:
• Weight (in grams)
• Length3 (length from the nose to the end of the tail, in cm)
• HtPct (max height, as a percentage of Length3)
• WidthPct (max width, as a percentage of Length3)

data Fish1 (drop=HtPct WidthPct);
  title 'Fish Measurement Data';
  input Weight Length3 HtPct WidthPct @@;
  Weight3 = Weight**(1/3);
  Height = HtPct*Length3/100;
  Width = WidthPct*Length3/100;
datalines;
242.0 30.0 38.4 13.4   290.0 31.2 40.0 13.8   340.0 31.1 39.8 15.1
363.0 33.5 38.0 13.3   430.0 34.0 36.6 15.1   450.0 34.7 39.2 14.2
500.0 34.5 41.1 15.3   390.0 35.0 36.2 13.4   450.0 35.1 39.9 13.8
500.0 36.2 39.3 13.7   475.0 36.2 39.4 14.1   500.0 36.2 39.7 13.3
500.0 36.4 37.8 12.0     .   37.3 37.3 13.6   600.0 37.2 40.2 13.9
600.0 37.2 41.5 15.0   700.0 38.3 38.8 13.8   700.0 38.5 38.8 13.5
610.0 38.6 40.5 13.3   650.0 38.7 37.4 14.8   575.0 39.5 38.3 14.1
685.0 39.2 40.8 13.7   620.0 39.7 39.1 13.3   680.0 40.6 38.1 15.1
700.0 40.5 40.1 13.8   725.0 40.9 40.0 14.8   720.0 40.6 40.3 15.0
714.0 41.5 39.8 14.1   850.0 41.6 40.6 14.9  1000.0 42.6 44.5 15.5
920.0 44.1 40.9 14.3   955.0 44.0 41.1 14.3   925.0 45.3 41.4 14.9
975.0 45.9 40.6 14.7   950.0 46.5 37.9 13.7
;
run;

ods graphics on;
title 'Fish Measurement Data';
proc corr data=Fish1 nomiss plots=matrix(histogram);
  var Height Width Length3 Weight3;
run;
ods graphics off;

Weighted Data
• Problem: the percentage of minority vs. white voters in the voting population differs from their percentage among the people who have a phone.
• Solution: give more "weight" to the minority people with a telephone.

Weighted Data
• The idea is not limited to 2 categories: the phone population can be mapped to the voting population across Minority/Dem., Minority/Rep., White/Dem., White/Rep., and so on.
• How many categories? As many as are significant.

Proportion
Suppose minority voters are 1/3 of the voting population but only 1/6 of the people with a phone:
• Minority: 1/6 of the phone population must represent 1/3 of the voting population, so the weight is (1/3)/(1/6) = 2.
• White: 5/6 of the phone population must represent 2/3 of the voting population, so the weight is (2/3)/(5/6) = .8.
Needless to say, in reality this is a much more complex issue.

Which weight do we need to use?
• Oversimplified example (don't take it seriously): the phone population is fixed, but the voting population changes over time, so the weights that map people with a phone to the 2008 voting population differ from the weights for the 2010 voting population.

Proportion
Suppose minority voters are 1/3 of the voting population but only 1/6 of the people with a phone:
1. 100 minority + 500 white people answer the phone survey.
2. 75 minority will vote for candidate X.
3. 250 white will vote for candidate X.
4. Non-weighted conclusion: 325/600 = 54.16% of the voters will vote for candidate X.
5. Weighted conclusion:
   a) 75 minority = 75% of minority with a phone => (.75)*(1/6) = 12.5% of people with a phone; times the weight of 2 => 25% of the voting population.
   b) 250 white = 50% of white people with a phone => (.5)*(5/6) = 41.66% of people with a phone; times the weight of .8 => 33.33% of the voting population.
   c) 25% + 33.33% = 58.33%.

SAS Weighted Mean

proc means data=sashelp.class;
  var height;
run;

proc means data=sashelp.class;
  weight weight;
  var height;
run;

Both PROC FREQ and PROC CORR also allow the use of weights via the WEIGHT statement.

Another (better?) approach for weighted data
• Experimental design data have all the properties that we learned about in statistics classes:
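The weighted conclusion above is just a weighted mean of the vote indicator. A plain-Python sketch of the arithmetic, using the numbers from the example (group labels and structure are illustrative):

```python
# phone-survey results: (group, respondents, votes for X, weight)
# the weights map the phone population back to the voting population:
# minority (1/3 of voters but 1/6 of phones) -> 2, white -> 0.8
survey = [
    ("minority", 100,  75, 2.0),
    ("white",    500, 250, 0.8),
]

# non-weighted: 325 / 600
unweighted = (sum(v for _, _, v, _ in survey)
              / sum(n for _, n, _, _ in survey))

# weighted: each respondent counts with their group's weight
weighted = (sum(w * v for _, _, v, w in survey)
            / sum(w * n for _, n, _, w in survey))

print(f"{unweighted:.2%}  {weighted:.2%}")
```

The weighted figure (58.33%) matches step 5c above; this is exactly the computation the WEIGHT statement performs inside PROC MEANS.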
  – The data are independent.
  – Observations are identically distributed, with some known error distribution.
  – There is an underlying assumption that the data come to us as a finite number of observations from a conceptually infinite population.
  – Simple random sampling without replacement is used for the sample data.
• Sample survey data, in contrast:
  – do not have independent errors;
  – may cover many small sub-populations, so we do not expect the errors to be identically distributed;
  – do not come from a conceptually infinite population.

PROC MEANS vs PROC SURVEYMEANS
1. We have a target population of 647 receipt amounts, classified by company region. We need to perform an audit, but it is too expensive to audit every one of the receipts in the company database, so we take a sample.
2. We want to sample the larger receipt amounts more frequently.
3. We sample 'proportional to size' (PPS): we choose a multiplier variable and make our choices based on the size of that multiplier. If the multiplier for receipt A is five times the multiplier for receipt B, then receipt A will be five times more likely to be selected than receipt B.
We will use the receipt amount as the multiplier.

data AuditFrame (drop=seed);
  seed=18354982;
  do i=1 to 600;
    if i<101 then region='H';
    else if i<201 then region='S';
    else if i<401 then region='R';
    else region='G';
    Amount = round(9990*ranuni(seed)+10, 0.01);
    output;
  end;
  do i=601 to 617;
    if i<603 then region='H';
    else if i<606 then region='S';
    else if i<612 then region='R';
    else region='G';
    Amount = round(10000*ranuni(seed)+10000, 0.01);
    output;
  end;
  do i=618 to 647;
    if i<628 then region='H';
    else if i<638 then region='S';
    else if i<642 then region='R';
    else region='G';
    Amount = round(9*ranuni(seed)+1, 0.01);
    output;
  end;
run;

PROC MEANS vs PROC SURVEYMEANS
We can perform this sample selection using PROC SURVEYSELECT:

proc surveyselect data=AuditFrame out=AuditSample3
                  method=PPS seed=39563462 sampsize=100;
  size Amount;
run;

This gives us a weighted random sample of size 100. Now we'll just (artificially) create the data set of the audit results. We will have a validated receipt amount for each receipt. Ideally, the validated amount would be exactly equal to the listed amount in every case.

data AuditCheck3;
  set AuditSample3;
  ValidatedAmt = Amount;
  if region='S' and mod(i,3)=0 then
    ValidatedAmt = round(Amount*(.8+.2*ranuni(1234)), 0.01);
  if region='H' then do;
    if floor(Amount/100)=13 then ValidatedAmt=1037.50;
    if floor(Amount/100)=60 then ValidatedAmt=6035.30;
    if floor(Amount/100)=85 then ValidatedAmt=8565.97;
    if floor(Amount/100)=87 then ValidatedAmt=8872.92;
    if floor(Amount/100)=95 then ValidatedAmt=9750.05;
  end;
  diff = ValidatedAmt - Amount;
run;

PROC MEANS vs PROC SURVEYMEANS
• The WEIGHT statement in PROC MEANS allows a user to give some data points more emphasis. But that isn't the right way to address the weights we have here.
• We built a sample using a specific sample design, and we have sampling weights which have a real, physical meaning.
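The idea behind PPS sampling weights can be sketched in plain Python (a hypothetical toy population, not the audit data): each unit's inclusion probability is proportional to its size, and its sampling weight is the reciprocal of that probability, so a weighted sample total estimates the population total.

```python
# toy population of receipt amounts (hypothetical numbers)
amounts = [100.0, 200.0, 300.0, 400.0]
n_sample = 2
total = sum(amounts)

# PPS inclusion probability for each receipt: n * size / total size
incl_prob = [n_sample * a / total for a in amounts]

# sampling weight = 1 / inclusion probability: the number of population
# units that each sampled unit "stands for"
weights = [1 / p for p in incl_prob]

# the 400 receipt is 4x as likely to be selected as the 100 receipt,
# but it carries 1/4 the weight, so estimates stay unbiased
print(incl_prob)
print(weights)
```

PROC SURVEYSELECT with METHOD=PPS computes exactly this kind of weight and writes it to the output data set as the SamplingWeight variable used in the code above.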
• A sampling weight for a given data point is the number of receipts in the target population that the sample point represents.
• The primary difference below is the inclusion of the TOTAL= option. PROC SURVEYMEANS allows us to compute a finite population correction factor and adjust the error estimates accordingly. This factor adjusts for the fact that we already know the answers for some percentage of the finite population, so we really only need to make error estimates for the remainder of that finite population.

proc means data=AuditCheck3 mean stderr clm;
  var ValidatedAmt diff;
run;

proc means data=AuditCheck3 mean stderr clm;
  var ValidatedAmt diff;
  weight SamplingWeight;
run;

proc surveymeans data=AuditCheck3 mean stderr clm total=647;
  var ValidatedAmt diff;
  weight SamplingWeight;
run;

Are the point estimates different? What about the confidence intervals?

Household Component of the Medical Expenditure Panel Survey (MEPS HC)
• The MEPS HC is a nationally representative survey of the U.S. civilian noninstitutionalized population.
• It collects medical expenditure data as well as information on demographic characteristics, access to health care, health insurance coverage, and income and employment.
• MEPS is cosponsored by the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS).
• For the comparisons reported here we used the MEPS 2005 Full Year Consolidated Data File (HC-097).
• This is a public use file available for download from the MEPS web site (http://www.meps.ahrq.gov).

Transforming from SAS transport (SSP) format to a SAS data set (SAS7BDAT)
• The MEPS is not a simple random sample; its design includes:
  – stratification
  – clustering
  – multiple stages of selection
  – disproportionate sampling
• The MEPS public use files (such as HC-097) include variables for generating weighted national estimates and for use of the Taylor method for variance estimation.
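The finite population correction that the TOTAL= option triggers is a simple shrinkage of the standard error. A plain-Python sketch using the simple-random-sampling form of the correction and toy numbers (not the audit results):

```python
import math

def fpc_stderr(sd, n, pop_total):
    """Standard error of the mean with the finite population
    correction sqrt(1 - n/N) applied (simple-random-sampling form)."""
    fpc = math.sqrt(1 - n / pop_total)
    return fpc * sd / math.sqrt(n)

# with n = 100 receipts sampled out of N = 647, the standard error
# shrinks by about 8% relative to the infinite-population formula
se_infinite = 50.0 / math.sqrt(100)
se_finite = fpc_stderr(50.0, 100, 647)
print(se_infinite, se_finite)
```

Intuitively: once you have audited 100 of 647 receipts, there is simply less of the population left to be uncertain about, so the confidence interval tightens.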
These variables are:
  – person-level weight (PERWT05F on HC-097)
  – stratum (VARSTR on HC-097)
  – cluster/PSU (VARPSU on HC-097)
The stratum and PSU variables are needed for even better estimates of the confidence intervals.

LIBNAME PUFLIB 'C:\';
FILENAME IN1 'C:\H97.SSP';
PROC XCOPY IN=IN1 OUT=PUFLIB IMPORT;
RUN;

Note: H97.SAS7BDAT occupies 408 MB, vs. 257 MB for H97.SSP and 14 MB for H97.ZIP.

PROC SURVEYFREQ Simple Example

PROC SURVEYFREQ DATA=PUFLIB.H97;
  TABLES HISPANX*INSCOV05 / ROW;
  WEIGHT PERWT05F;
RUN;
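A survey-weighted frequency table is just a frequency table in which each respondent contributes their survey weight instead of a count of 1. A plain-Python sketch with hypothetical respondents (not MEPS data):

```python
from collections import defaultdict

# (category, person-level survey weight) for hypothetical respondents:
# unweighted, category "A" is 2 of 3 respondents (66.7%), but the
# weights say those respondents represent a smaller share of the nation
respondents = [("A", 1000.0), ("A", 1000.0), ("B", 4000.0)]

weighted = defaultdict(float)
for cat, w in respondents:
    weighted[cat] += w          # add the weight, not 1

total = sum(weighted.values())
percents = {cat: 100 * w / total for cat, w in weighted.items()}
print(percents)
```

This is the core of what PROC SURVEYFREQ does with the WEIGHT statement; the stratum and PSU variables then come into play only for the variance (standard error) estimation, not for the point estimates themselves.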