Analysis of Complex Survey Data
Day 2: Univariate and Bivariate Analysis

Today’s schedule
• Part I: Introduction to SUDAAN
  – Specifying the study design and design options
• Part II: Introduction to
  – PROC RECORDS
  – PROC CROSSTAB
  – PROC DESCRIPT

SUDAAN
• Developed by RTI (Research Triangle Institute) in the 1970s to deal with complex survey data (no statistical software packages at the time could do this)
• Although originally introduced for statistical analysis of sample survey data from stratified, multi-stage cluster samples, SUDAAN applies directly to the analysis of clustered efficacy and safety data from clinical trials, toxicology studies, and epidemiology studies.
• Now in its 10th version
  – New to Version 10: SUDAAN has procedures that can compute sample weight adjustments (e.g., nonresponse and post-stratification) and can perform imputation with a weighted sequential hot deck approach.

Some datasets that I have worked with and others that I know about
• NESARC (http://aspe.hhs.gov/hsp/06/catalog-ai-anna/nesarc.htm)
• Monitoring the Future (http://monitoringthefuture.org/)
• AddHealth (http://www.cpc.unc.edu/projects/addhealth)
• NLSY (http://www.bls.gov/nls/nlsy79.htm)
• PSID (http://psidonline.isr.umich.edu/)
• BRFSS (http://www.cdc.gov/BRFSS/)
• NSDUH (https://nsduhweb.rti.org/)
• Collaborative Psych Epi Surveys (http://www.icpsr.umich.edu/icpsrweb/CPES/)

SUDAAN Statements
• Procedure statements (PROC), which define the procedure you are asking SUDAAN to run
• Sample design statements, which tell SUDAAN how to compute standard errors
• Procedure definition statements, which tell SUDAAN what sort of analysis is desired
• Computation statements, which tell SUDAAN what to compute
• Output statements, which tell SUDAAN how to display results in printed tables and how to save the results for further processing

Specifying your study design
• The choice of study design in SUDAAN is very important when analyzing correlated data.
  It is only through the correct study design choice that you will compute correct standard errors.

How does SUDAAN estimate standard errors?
• Taylor series linearization (TSL; equivalent to GEE in regression procedures) or replication methods (Balanced Repeated Replication, BRR, and Jackknife) for robust variance estimation of descriptive statistics and regression parameters
  – Most design options will use TSL
• I will not go through details on TSL, but for further reading go to: Tepping 1968, Kish and Frankel 1974, Folsom 1974, Shah et al. 1977, Woodruff 1971, Binder 1983

Specifying your study design
• If no design is selected, the With Replacement (WR) design will be assumed.
• Options include:

Specifying your study design: With Replacement options
• DESIGN=WR
• Sampling with replacement at the first stage (or with small sampling fractions) in every first-stage stratum. The sampling fraction in a first-stage stratum is the number of primary sampling units (PSUs) selected into the sample divided by the population number of PSUs in the stratum.
• Sampling with or without replacement at subsequent stages
• Sampling with equal or unequal probabilities of selection at both the first and subsequent stages
• The design is valid when the PSUs are independent
• In the absence of complete design information, the WR design is often chosen to approximate variances for more complicated designs.

Equal versus unequal selection probabilities?
• For most surveys, each PSU has an equal probability of selection within each stratum. However, there may be cases in which selection probabilities are unequal.
• For example, lower selection probabilities may be assigned to units with higher data collection costs and higher selection probabilities to PSUs from small subpopulations of particular interest.
• These design probabilities are a feature of the survey design and are assumed known before data analysis.
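To make the sampling-fraction definition above concrete, here is a minimal Python sketch (not SUDAAN code). It assumes equal-probability selection of PSUs within each stratum; the `Stratum` class, the stratum names, and the counts are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    population_psus: int  # number of PSUs in the population for this stratum
    sampled_psus: int     # number of PSUs selected into the sample


def sampling_fraction(stratum: Stratum) -> float:
    """First-stage sampling fraction: sampled PSUs / population PSUs."""
    return stratum.sampled_psus / stratum.population_psus


def base_weight(stratum: Stratum) -> float:
    """Inverse of the selection probability, assuming equal-probability
    selection of PSUs within the stratum (an assumption of this sketch)."""
    return stratum.population_psus / stratum.sampled_psus


# Hypothetical strata: a small fraction in one, a large fraction in the other.
strata = [Stratum("urban", 400, 8), Stratum("rural", 50, 10)]
for s in strata:
    print(f"{s.name}: fraction={sampling_fraction(s):.3f}, "
          f"base weight={base_weight(s):.1f}")
```

A stratum with a small fraction (like the hypothetical "urban" one) is the situation in which a with-replacement design is a reasonable approximation; a large fraction points toward the without-replacement options.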
Specifying your study design: With Replacement options
• DESIGN=STRWR
• A single-stage design (no clustering), stratified random sampling with replacement (or small sampling fractions within each stratum). Equal or unequal probabilities of selection within each stratum (e.g., you take a sample of students and stratify by classroom and sex).
• DESIGN=SRS
• A single-stage design (no clustering or stratification), simple random sampling (equal probabilities of selection), small sampling fraction

Specifying your study design: Without Replacement options
• DESIGN=WOR
• Sampling without replacement at the first stage (or with large sampling fractions in any first-stage stratum). The sampling fraction in a first-stage stratum is the number of PSUs selected into the sample divided by the population number of PSUs in the stratum.
• Sampling with or without replacement at subsequent stages.
• Sampling with equal probabilities of selection within each stratum and at each stage of without replacement sampling.
** In SUDAAN, the WOR design requires knowledge of the population counts in each stratum or PSU at each stage of without replacement sampling. These population counts are needed because the WOR design computes variances according to a multi-stage formula, which computes the finite population correction factors (FPCs) at each stage.

Specifying your study design: Without Replacement options
• DESIGN=UNEQWOR
• Sampling without replacement, with unequal probabilities of selection at the first stage
• Sampling with equal probabilities at subsequent stages, with or without replacement
• DESIGN=STRWOR
• A single-stage design (no clustering). Stratified random sampling without replacement (or large sampling fractions in at least one stratum). Equal probabilities of selection within each stratum.
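The role of the population counts in the WOR designs can be illustrated with the textbook single-stage case. The Python sketch below is not SUDAAN's multi-stage formula; it only shows how a finite population correction, 1 - n/N, shrinks the variance of a sample mean under simple random sampling without replacement. The function name `se_mean_srs` and the data values are hypothetical.

```python
import math
from statistics import variance


def se_mean_srs(values, population_size=None):
    """Standard error of a sample mean under simple random sampling.

    With population_size=None the sample is treated as drawn with
    replacement (no correction). With a population size N, the variance
    is multiplied by the finite population correction (1 - n/N).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    var_mean = variance(values) / n  # s^2 / n
    if population_size is not None:
        if population_size < n:
            raise ValueError("population size cannot be smaller than the sample")
        var_mean *= 1 - n / population_size
    return math.sqrt(var_mean)


sample = [2, 4, 4, 6]  # hypothetical measurements
print("with replacement:   ", round(se_mean_srs(sample), 4))
print("without replacement:", round(se_mean_srs(sample, population_size=8), 4))
```

When the sampling fraction is small the correction is close to 1 and the two standard errors nearly coincide, which is why the with-replacement designs do not need population counts.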
Specifying your study design: Replication Methods
• DESIGN=JACKKNIFE
• Alternative method to estimate variance in “with replacement” designs: delete one PSU (or cluster, for correlated data) at a time; the weights for the remaining PSUs in the same stratum are adjusted to account for the deleted PSU.
• DESIGN=BRR
• Used when the sample design is specified by a series of replicate weights

Specifying your study design: Summary

Specifying your study design
• SUDAAN has nine sample design statements. Each statement has a specific purpose, and some are used with specific design options only.

Specifying your study design
• WEIGHT – identifies analysis weights used in computing estimates
• NEST – lists variable(s) whose values identify the design stages
Other:
• TOTCNT – lists the variable(s) whose values are the population counts at each sampling stage (not needed with “WR”)
• SAMCNT – lists in order the variable(s) whose values are the sample counts at each sampling stage (optional)
• JOINTPROB – lists in order the variable(s) whose values are the single and joint inclusion probabilities for each primary sampling unit (PSU) and each pair of PSUs in each first-stage stratum
• REPWGT – use with BRR
• IDVAR – use with BRR
• JACKWGTS – use with JACKKNIFE
• JACKMULT – use with JACKKNIFE

Specifying your study design
• Useful NEST option:
  – MISSUNIT: specifies that when only one sample unit is encountered within a stage, the variance contribution of that unit is estimated using the difference between that unit’s value and the overall mean value for the population.
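The delete-one-PSU idea described under Replication Methods can be sketched in a few lines of Python. This is a generic stratified jackknife for a weighted mean, written under stated assumptions; it is not SUDAAN's implementation, and details such as centering on the full-sample estimate may differ from what SUDAAN does. The record layout `(stratum, psu, weight, y)` and the data are hypothetical.

```python
from collections import defaultdict


def weighted_mean(records):
    """records: list of (stratum, psu, weight, y) tuples."""
    total_weight = sum(w for _, _, w, _ in records)
    return sum(w * y for _, _, w, y in records) / total_weight


def jackknife_variance(records):
    """Delete-one-PSU jackknife variance of the weighted mean.

    For each stratum with n_h PSUs, each PSU is dropped in turn and the
    weights of the remaining PSUs in that stratum are multiplied by
    n_h / (n_h - 1). Squared deviations of the replicate estimates from
    the full-sample estimate are scaled by (n_h - 1) / n_h and summed.
    """
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _, _ in records:
        psus_by_stratum[stratum].add(psu)

    full_estimate = weighted_mean(records)
    total_variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than two PSUs")
        scale = n_h / (n_h - 1)
        for dropped in psus:
            replicate = []
            for s, p, w, y in records:
                if s == stratum:
                    if p == dropped:
                        continue      # delete this PSU
                    w = w * scale     # reweight the rest of the stratum
                replicate.append((s, p, w, y))
            deviation = weighted_mean(replicate) - full_estimate
            total_variance += (n_h - 1) / n_h * deviation ** 2
    return full_estimate, total_variance


# Hypothetical data: one stratum, four PSUs, one record per PSU.
data = [("A", 1, 1.0, 1.0), ("A", 2, 1.0, 2.0),
        ("A", 3, 1.0, 3.0), ("A", 4, 1.0, 4.0)]
estimate, var = jackknife_variance(data)
print(estimate, var)
```

In this equal-weight, single-stratum case the jackknife variance reduces to the familiar s²/n, which is a handy sanity check on the reweighting step.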
Specifying your study design
Two examples
• National Longitudinal Alcohol Epidemiology Survey
  – Sampling without replacement, with unequal probabilities of selection at the first stage
  – Sampling with equal probabilities at subsequent stages, with or without replacement

The keyword _ZERO_ causes SUDAAN to generate, for every observation (record), a variable with the value 0.
The keyword _MINUS1_ causes SUDAAN to generate, for every observation (record), a variable with the value -1.

Use _MINUS1_ as a second or subsequent TOTCNT variable name to indicate with replacement sampling for all levels of a variable.

Use _ZERO_ as a variable name on the TOTCNT statement to denote a stratification variable (no variance contribution from any level of a particular variable). A NEST variable with a corresponding TOTCNT variable of _ZERO_ is assumed to be a stratification variable, and thus SUDAAN does not compute the corresponding variance component. SUDAAN uses its computed record count per stratum as the population count per stratum for a stratification variable.
Two examples
• National Longitudinal Alcohol Epidemiology Survey

    proc sort data=suicidenlaes;
      by stratrec psuid substrec mseg;
    run;

    proc crosstab data=suicidenlaes design=UNEQWOR;
      nest stratrec psuid substrec mseg / missunit;
      totcnt _ZERO_ _ZERO_ _MINUS1_ _ZERO_;
      /* JOINTPROB lists the variables whose values are the single and joint
         inclusion probabilities for each primary sampling unit (PSU) and
         each pair of PSUs in each first-stage stratum */
      jointprob prob1 prob2;
      weight wssa;
      subgroup suicidecat sex agecat1 ethrace2a race native;
      levels 4 2 4 5 4 2;
      tables sex*suicidecat agecat1*suicidecat ethrace2a*suicidecat
             native*suicidecat;
    run;

Two examples
• National Epidemiologic Survey on Alcohol and Related Conditions

    proc sort DATA=suicidenesarc;
      by stratum psu;
    run;

    PROC CROSSTAB DESIGN=WR DATA=suicidenesarc;
      /* MISSUNIT specifies that when only one sample unit is encountered
         within a stage, the variance contribution of that unit is estimated
         using the difference in that unit's value and the overall mean
         value for the population */
      NEST stratum psu / MISSUNIT;
      WEIGHT weight;
      subgroup attempt thought felt none sex agecat1 ethrace2a race native;
      levels 2 2 2 2 2 4 5 4 2;
      tables attempt*sex thought*sex felt*sex none*sex
             attempt*agecat1 thought*agecat1 felt*agecat1 none*agecat1
             attempt*ethrace2a thought*ethrace2a felt*ethrace2a none*ethrace2a
             attempt*native thought*native felt*native none*native;
    run;

Part II: PROC RECORDS and univariate statistics

General useful options on all procedure statements in SUDAAN
• PROC procedure_name options;
• CONF_LIM=number – change default confidence interval from 95% to something else
• DATA=file
• DESIGN=design (e.g., “WR”)
• EST_NO=count, EST_PSU=count, and EST_STR=count – optional statements that improve runtime efficiency
• FILETYPE=filetype (optional; in SAS-Callable SUDAAN, the default is SAS. The only options are SAS export files, SUDAAN files, and SAS files)
• INCLUDE (optional parameter that sets missing values of a variable to be a legitimate level of all variables on the SUBGROUP and CLASS statements)
General useful options on all procedure statements in SUDAAN
• PROC procedure_name options;
• RECODE variable=(code_list)
  – Recode a 0,1 variable to be a 1,2 variable (very useful)
    Example:
      RECODE zerone=(0 1);
      SUBGROUP zerone;
      LEVELS 2;
  – Recode a continuous variable to be a 0,1 variable
    Example:
      RECODE X=(4.5);
    All values of X less than 4.5 will be coded 0; all values of X greater than or equal to 4.5 will be coded as 1 internally in SUDAAN.

General useful options on all procedure statements in SUDAAN
• PROC procedure_name options;
• RECODE variable=(code_list);
• SUBGROUP variables;
• LEVELS levels;
• Categorical variables should be declared on the SUBGROUP statement. The number of categorical levels should be declared on the LEVELS statement.
• The values on the LEVELS statement must correspond one-to-one, in order, to the variables listed on the SUBGROUP statement
  Example:
    SUBGROUP gender;
    LEVELS 2;
  Or
    SUBGROUP gender / INCLUDE=missing;
    LEVELS 2;

General useful options on all procedure statements in SUDAAN
• PROC procedure_name options;
• RECODE variable=(code_list);
• SUBGROUP variables;
• LEVELS levels;
• SUBPOPN expression
  Similar to a “where” statement in SAS, e.g.,
    SUBPOPN gender=1 / NAME="Men only";
  Or
    SUBPOPN RACE=2 & SEX=2 & (AGE<18 | AGE>65) / NAME="African-American Females not in the Labor Force";

General useful options on all procedure statements in SUDAAN
• Output statements
• PRINT statements produce a set of formatted and labeled tables that go by default to the .LST file in SAS-callable SUDAAN.
  – SUDAAN can generate printed results in RTF format (specify FILETYPE=RTF on the PRINT statement). When you specify FILETYPE=RTF, you must also specify FILENAME=filename. The filename is the name of the external file that will hold the output (it should be surrounded by double quotes).
• OUTPUT statements produce an output dataset (SAS, SUDAAN, or SUDXPORT).
• The SETENV statement is used to alter the default environment parameters.
  Position the SETENV statement ahead of one or more PRINT or OUTPUT statements. The environment it defines applies to all subsequent PRINT or OUTPUT statements until SUDAAN encounters another SETENV statement.
• TITLE and FOOTNOTE statements can add text before and after your PRINT statement tables.

General useful options on all procedure statements in SUDAAN
• SETENV options:

PROC RECORDS
• RECORDS is a non-analytic procedure that prints observations from the input data set, obtains the contents of the input data set, and converts an input data set from one type to another. You can use the SUBPOPN statement to create a subset of a given data set, and you can use the SORTBY statement to sort your data.
• PROC RECORDS is particularly useful when you wish to verify that SUDAAN is reading your data properly.
• Similar to PROC PRINT in SAS

PROC CROSSTAB
• Computes frequencies, percentage distributions, odds ratios, relative risks, and their standard errors (or confidence intervals) for cross tabulations, as well as chi-square tests of independence and the Cochran-Mantel-Haenszel chi-square test for stratified two-way tables.
• Similar to PROC FREQ in SAS

PROC CROSSTAB
• Tests available in PROC CROSSTAB and when to use them:
  – CHISQ (standard chi-square test, observed compared to expected)
  – LLCHISQ (tests the null hypothesis that the odds of the outcome in the population are the same for the exposed and the unexposed)
  – CMH (Cochran-Mantel-Haenszel test)
  – TCMH – a CMH test for trend; assumes that both row and column variables lie on an ordinal scale (e.g., you want to compare two Likert scales)
  – ACMH – ANOVA-type CMH test; assumes that the row variable lies on a nominal scale and the column variable lies on an ordinal scale (e.g., do men and women differ on values of a Likert scale)
• NOTE: when the row variable has only two levels, TCMH=ACMH.
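As a rough illustration of two of the quantities PROC CROSSTAB reports, the Python sketch below builds a weighted 2x2 table and computes the odds ratio and relative risk point estimates from the weighted cell totals. It deliberately stops at point estimates: standard errors and tests need a design-based variance method (TSL or replication), which this sketch does not attempt. The function names, the record layout `(exposed, outcome, weight)`, and the data are hypothetical.

```python
def weighted_2x2(records):
    """Sum analysis weights into the four cells of an exposure-by-outcome table.

    records: iterable of (exposed, outcome, weight), with exposed and
    outcome given as booleans.
    """
    cells = {(True, True): 0.0, (True, False): 0.0,
             (False, True): 0.0, (False, False): 0.0}
    for exposed, outcome, weight in records:
        cells[(bool(exposed), bool(outcome))] += weight
    return cells


def odds_ratio(cells):
    """(a * d) / (b * c) computed from weighted cell totals."""
    a, b = cells[(True, True)], cells[(True, False)]
    c, d = cells[(False, True)], cells[(False, False)]
    return (a * d) / (b * c)


def relative_risk(cells):
    """Weighted risk among the exposed divided by risk among the unexposed."""
    a, b = cells[(True, True)], cells[(True, False)]
    c, d = cells[(False, True)], cells[(False, False)]
    return (a / (a + b)) / (c / (c + d))


# Hypothetical weighted records.
records = [
    (True, True, 10.0), (True, True, 10.0),
    (True, False, 40.0), (True, False, 40.0),
    (False, True, 10.0),
    (False, False, 45.0), (False, False, 45.0),
]
table = weighted_2x2(records)
print("odds ratio:", odds_ratio(table))
print("relative risk:", relative_risk(table))
```

Note that the unweighted counts here (2, 2, 1, 2) would give quite different answers, which is the practical reason the WEIGHT statement matters even for simple cross tabulations.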
PROC RATIO
• Computes estimates, standard errors, and confidence limits of generalized ratios; also computes standardized estimates and tests single-degree-of-freedom contrasts among levels of a categorical variable.

PROC DESCRIPT
• Computes estimates of means, totals, proportions, percentages, geometric means, quantiles, and their standard errors and confidence limits; also computes standardized estimates and tests of single-degree-of-freedom contrasts among levels of a categorical variable.
• Similar to PROC MEANS or PROC UNIVARIATE in SAS

LAB 2: Univariate and bivariate statistics in SUDAAN