Getting to know your variables
Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.

Overview
• Why “get to know” your variables?
• Unit of analysis
• Restrictions on your analytic sample
• Information about each variable
  – Level of measurement
  – Missing values
  – Valid range of values
  – Substantive interpretation of values
  – Distribution of observed values in your data set

Why is it important to get to know your variables?
• Each variable measures
  – A specific concept
    • Numeric values have particular meanings that differ depending on the nature of that concept
  – In a particular context
    • When, where, to whom do those numbers pertain?
  – Collected with a specific study design
    • Need to understand why some values are missing
      – By design
      – Due to non-response

Example of failing to get to know variables
• In a nationally representative survey sample from a developing country circa 2002, observed birth weights in grams ranged up to 9999.
  – Data set downloaded from a research data web site; not cleaned or evaluated before use.
  – Mean birth weight over 8000 in the sample
• First red flag: Implausible as an actual birth weight, given its meaning and units. 9,999 grams is about 22 lbs.
  – 9999 was a code for missing value
• Lesson learned: Must become familiar with what a particular value means for that concept and context.

Second red flag
• 2/3 of sample had a birth weight value of 9999
  – Very high value for a substantial share of the sample, unlikely to be explained solely by
    • outliers
    • data entry errors
• Lesson learned: Look at study documentation and questionnaire to find out why this distribution was observed.
  – Occurred due to a skip pattern designed to minimize recall bias in birth weight reporting.
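
The birth weight example can be illustrated with a few lines of code. This is a minimal Python sketch, not the procedure used in the original study: the weights are hypothetical values made up for illustration, and the function name `valid_values` is an assumption. It shows how leaving the 9999 missing value code in the data inflates the mean, and how large a share of cases carry that code.

```python
from statistics import mean

MISSING_CODE = 9999  # missing value code for birth weight, as in the example above

# Hypothetical birth weights in grams (illustration only);
# two thirds of the cases carry the missing value code.
raw_weights = [3200, 2900, 9999, 9999, 9999, 9999]

def valid_values(values, missing_code=MISSING_CODE):
    """Return only the values that are not the missing value code."""
    return [v for v in values if v != missing_code]

naive_mean = mean(raw_weights)      # wrongly treats 9999 as a real weight
valid = valid_values(raw_weights)
clean_mean = mean(valid)            # mean of reported weights only
share_missing = 1 - len(valid) / len(raw_weights)

print(f"naive mean: {naive_mean:.0f} g")
print(f"mean of valid cases: {clean_mean:.0f} g")
print(f"share coded missing: {share_missing:.2f}")
```

Both red flags show up here: the naive mean is implausibly high for a birth weight, and the share of cases with the code is far too large to be explained by outliers or data entry errors.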

Resources needed for this exercise
• Documentation on the data source
  – Description of study design
  – Questionnaire
  – Codebook for electronic data file
• Electronic file of database
• Statistical software
• Your research question
• Articles, books, web sites etc. on your topic
  – Dependent and key independent variables

Time needed for this exercise
• This is not an assignment that can be done overnight
• Multi-step process involving multiple resources
  – About data source
  – About topic
• Results from early steps will inform later steps in the exercise
• May also need feedback from
  – Mentors
  – Colleagues
  – Persons involved in original data collection

Getting to know your variables is project-specific
• Several issues that inform this assignment are specific to research question and data set.
  – Unit of analysis
  – Restrictions on your analytic sample
  – Roles of different variables in your analysis
    • Dependent, independent, control, filter
• Even experienced researchers should complete this assignment when undertaking a project with a new topic or data set.

Valuable information for all parts of a well-written research paper
• Reading the literature on your topic will provide information needed for
  – Introduction
  – Literature review
  – Discussion
• Detailed knowledge of study design and variables from documentation, questionnaire and codebook will provide information for
  – A comprehensive data section
  – Appropriate model specification
  – Interpretation of statistical results

Unit of analysis
• Do data pertain to
  – Individual person?
  – Family?
  – Census tract?
  – Institution?
• Knowing unit of analysis helps ascertain plausible range of values, e.g.,
  – mean number of family members will be much lower than population of a census tract or a school

Analytic sample
• Before you acquaint yourself with the range of values for each variable in your analysis, impose any limits related to your research question.
• Exclude cases
  – to whom the topic does not pertain
  – that are part of a group with too few cases
  – for whom a key variable was not collected

Restrictions on your analytic sample
• Limit your analytic sample to cases that meet certain criteria related to your research question, e.g.,
  – particular demographic traits
  – minimum test scores
  – a specific disease
• Exclude subgroups that don’t meet minimum sample size if
  – there aren’t enough cases in one or more subgroups of a key variable to provide sufficient statistical power
  – it would not be theoretically sensible to combine them with other subgroups used in your analysis

Attributes of each variable to familiarize yourself with prior to analysis

Labeling, coding, and missing value information for your variables
• To help you create a comprehensive record of information on each of the variables in your analysis, fill out a grid like this one
  – An electronic version can be found online in the “getting to know your variables” assignment.

| Variable name | Variable label (e.g. acronym on your data set) | Type of variable (nominal, ordinal, interval, or ratio) | Coding (for categorical variables) OR Units (for continuous variables) | Plausible range of values (excluding missing values) | Missing value codes (if any) | Skip pattern?* (e.g., conditions under which variable not collected) | Original or created variable?† |
|---|---|---|---|---|---|---|---|
| Saw doctor last year | DOCLY | Nominal | 1 = yes; 2 = no | 1, 2 | 7 = refused; 8 = don’t know; 9 = missing | None for this variable | Original |
| Birth weight | BWGRMS | Ratio | Grams | 0–6000 | 9999 = missing | Asked only about children under age 5 years at time of survey. | Original |

“Old” and “new” variables
• Familiarize yourself with all variables to be used in your analysis
  – Variables analyzed in the same form in which they appeared in the original data set
  – Variables you created from those variables, e.g.,
    • Dummy variables created from categorical variables
    • Categorical versions of continuous variables
    • Aggregated variables, e.g.,
      – income calculated from several sources
      – scales that combine responses to multiple items
    • Calculated variables
      – E.g., body mass index calculated from weight and height
    • Transformed variables (logged, standardized)

Organizing variables within the grid
• Using major row headings, label sections for each of the following, based on their role in your analysis
  – Dependent variable(s)
  – Key independent variables
  – Control variables
  – Sampling weights
  – Filter questions (e.g., used to restrict sample)
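
One way to keep the grid usable during analysis is to record it as a small data structure. The sketch below is a hypothetical Python illustration, not part of the original assignment: the dictionary layout, the `classify` function, and the sample responses are all assumptions, while the codes and ranges for DOCLY and BWGRMS are transcribed from the example grid.

```python
from collections import Counter

# Grid entries transcribed from the two example variables (DOCLY, BWGRMS).
# The dictionary layout itself is an assumption made for this sketch.
CODEBOOK = {
    "DOCLY": {
        "type": "nominal",
        "valid": {1: "yes", 2: "no"},
        "missing": {7: "refused", 8: "don't know", 9: "missing"},
    },
    "BWGRMS": {
        "type": "ratio",
        "range": (0, 6000),            # plausible range, in grams
        "missing": {9999: "missing"},
    },
}

def classify(variable, value, codebook=CODEBOOK):
    """Label a raw value as 'valid', a missing value reason, or 'out of range'."""
    entry = codebook[variable]
    if value in entry["missing"]:
        return entry["missing"][value]
    if "valid" in entry:               # categorical: value must be a listed code
        return "valid" if value in entry["valid"] else "out of range"
    low, high = entry["range"]         # continuous: value must fall in the range
    return "valid" if low <= value <= high else "out of range"

# Hypothetical raw responses, for illustration only.
docly_values = [1, 2, 2, 7, 9, 1, 3]
tally = Counter(classify("DOCLY", v) for v in docly_values)
print(tally)
```

Tallying values by category makes undocumented codes (such as the 3 above) stand out before any statistics are computed.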

Variable names and labels
• For each variable, fill in
  – Variable name: a short (up to ~8 character) acronym used to identify the variable in the software program you are using
  – Variable label: a descriptive phrase of up to 40 characters that helps convey the meaning of the variable
• If you rename an item with a more informative variable name (e.g., “gender” instead of Q117), include the original question name in the variable label

Level of measurement
• Categorical variables are those that are classified into ranges or categories.
• Continuous variables
  – Measured in numeric units, but not grouped.
  – Two types of continuous variables:
    • Interval
      – Zero is not lowest possible value
      – e.g., temperature °Fahrenheit
    • Ratio
      – Zero is lowest possible value
      – e.g., temperature °Kelvin, height, weight
• Helps to anticipate limits on range of values

Categorical variables
• Nominal variables
  – No inherent order to the categories
  – Numeric value labels have NO mathematical interpretation
  – Examples:
    • Gender
    • Race
    • Geographic region
• Ordinal variables
  – Categories have an inherent numeric order
  – Examples:
    • Letter grades
    • Age group
    • Likert scale items, e.g., from strongly disagree to strongly agree

Units of measurement
• System of measurement
  – E.g., Metric or British or other?
    • income in dollars or euros or pesos?
• Level of aggregation
  – E.g., income per hour or per week or per year?
• Scale
  – E.g., income in dollars or thousands of dollars or millions of dollars?
• See also podcast on “reporting one number”

Missing values
• Missing values on a variable can occur because they are
  – Not applicable
  – Missing by design
  – Non-response
• See chapters 4 and 13 of The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition for more on missing values and missing by design.

Not applicable
• Some questions are not asked of specific respondents because they don’t pertain.
  – E.g., if someone reports that they are unemployed, they wouldn’t be asked about their current job type or earnings
• Look
  – At the questionnaire or form used to collect the data for
    • a filter question
    • a skip pattern
  – At the codebook for one or more missing value codes for non-response

Missing by design
• Some topics are not asked of specific subgroups due to concern about the accuracy of their responses.
  – E.g., to minimize recall bias, mothers asked birth weight only for children under age 10 years.
• Surveys sometimes administer specialized topic modules only to a randomly selected subsample of respondents.
  – Used to obtain a smaller representative sample that meets statistical power requirements while reducing study costs.
• Read the study design documentation to find out whether these pertain to the question you are using.

Item non-response
• Another reason for missing values is when a respondent does not answer a question that was asked of them.
• Item non-response is particularly common for
  – stigmatized topics
  – questions that require complex or detailed answers
  – unclear instructions about number of allowed responses
• Examples: Respondent was asked
  – to report income, but didn’t know it
  – immigration status, but had concerns about deportation

Types of non-response
• Don’t know
• Refused to answer the question
• Didn’t answer the question (unspecified reason)
• Other
  – marked too many answers to a single response question
  – wrote an illegible answer
• Look up the pertinent missing value codes for each of your variables in the codebook for your data set.

Valid range of values
• Definitional limits
• Conceptually plausible range
• Context of measurement
• Observed range
• Watch for numeric values for missing values
  – Label them in your electronic database, so they are treated correctly during analysis.

Definitional limits on values
• Some variables by definition have limits on the range of values they can assume:
  – A percentage share of a whole must fall between 0 and 100
    • Likewise, a proportion must fall between 0 and 1
  – However, a percentage change can be
    • Negative (<0)
    • Greater than 100
  – Variables at the ratio level of measurement cannot take on negative values
  – Other topic- or field-specific variables also have such restrictions
    • E.g., a Gini coefficient must fall between 0 and 1

Plausible range of values for the concept being measured
A value of 10,000
• Makes sense in at least some contexts for
  – Annual family income in dollars
  – Population of a census tract
  – An annual death rate per 100,000 persons
• Does NOT make sense for
  – Hourly income in dollars
  – Birth weight in grams
  – Number of persons in a family
  – A Likert scale item
  – A proportion
  – An annual death rate per 1,000 persons
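
The definitional limits listed above lend themselves to a simple automated check. The following is a minimal Python sketch under stated assumptions: the function name and the `kind` labels are hypothetical, chosen only for this illustration, and the check covers definitional limits only, not the context-specific plausible range.

```python
def violates_definitional_limits(kind, value):
    """Return True if value is impossible by definition for this kind of variable.

    The kind labels are hypothetical names chosen for this sketch.
    """
    if kind == "percentage_share":     # share of a whole: 0 to 100
        return not 0 <= value <= 100
    if kind == "proportion":           # 0 to 1
        return not 0 <= value <= 1
    if kind == "ratio_level":          # ratio-level variables cannot be negative
        return value < 0
    if kind == "percentage_change":    # may be negative or exceed 100;
        return False                   # no limit is checked in this sketch
    raise ValueError(f"unknown kind: {kind}")

# Hypothetical values, for illustration only.
checks = [
    ("percentage_share", 104.0),
    ("proportion", 0.35),
    ("ratio_level", -1),
    ("percentage_change", -12.5),
    ("percentage_change", 250.0),
]
for kind, value in checks:
    status = "impossible" if violates_definitional_limits(kind, value) else "allowed"
    print(f"{kind:>18} {value:>7}: {status}")
```

A value that passes this check can still be implausible for the concept and context, so it complements rather than replaces a review of the plausible range.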

Another example of plausible range of values
A value of –1
• Makes sense in at least some contexts for
  – Temperature in degrees Fahrenheit or Celsius
  – Change in rating on a 5 point scale
  – Change in death rate
  – Percentage change in income
• Does NOT make sense for
  – Temperature in degrees Kelvin
  – Number of persons in a family
  – Death rate
  – A Likert scale item
  – A proportion

Context of measurement
• When, where, who, e.g., family income will be
  – Higher now than it was 200 years ago in a given place and group
  – Higher in a currently developed than in a developing country
  – Higher in a sample of all households than in a sample of low-income households
• Remember to take into account restrictions you imposed on the analytic sample to fit your research question and data.

Descriptive statistics on your variables
• After you have
  – Imposed restrictions on your analytic sample
  – Filled in missing value codes for each variable
• Complete a grid like the one below, with descriptive statistics on each of the variables in your analysis.
• An electronic version of this grid can be found online in the “getting to know your variables” assignment.

Observed values from data set

| Variable name | Variable label (e.g. acronym on your data set) | Number of valid cases for that variable (excl. missing values) | Minimum | Maximum | Mean (for continuous variables) | Modal value | Reference value from external source | Values & range consistent w/ codebook? |
|---|---|---|---|---|---|---|---|---|

Familiarizing yourself with the concepts under study
• To identify plausible ranges of values for each of your variables, read the literature on your dependent and key independent variables.
• Read for
  – how each concept is operationalized in the data set
  – standards, cutoffs, or transformations commonly used for that variable in your field
  – range of values observed
    • but pay attention to differences in who, when, where studied

Check each distribution against the codebook for the original source
• Codebooks for some data sets provide information on
  – frequency distribution of categorical variables
  – range and/or mean values for continuous variables
  – number of cases with missing values, by reason for missing value (not applicable, refused, etc.)
• Check the distribution of values observed in your analytic sample for each variable against the codebook for your data set.
• If any distributions are inconsistent, do NOT analyze the data until you have resolved the discrepancies!

Identify reasons for inconsistencies
• Review your answers to the previous steps in this exercise to identify possible reasons for discrepancies between your statistics and the codebook, such as:
  – Units of analysis
    • e.g., family instead of individual
  – Restrictions on your analytic sample
    • e.g., excluding a subgroup that is included in the statistics shown in the overall codebook
  – Scale
    • e.g., grams instead of kilograms
  – Transformations you have made to the variables, e.g.,
    • logged values
    • multiples of standard deviations rather than original units

Check each distribution against the literature on similar variables
• Track down information in the published literature on each of your main variables for a similar population.
• Check the distribution of values for each variable in your data set against the values from the external source of information about that variable.
• Again, if the values of your data are substantially different from those used in other studies of the same concepts, do NOT analyze the data until you have resolved the discrepancies!

Identify reasons for inconsistencies
• Review your answers to the previous steps in this assignment to explain possible reasons for discrepancies between your data and other similar data sets, such as:
  – Population studied, e.g., substantially different time, place, and/or subgroup
  – Units of analysis, e.g., family instead of individual
  – Units of measurement, e.g., metric instead of British units
  – Scale, e.g., grams instead of kilograms
  – Transformations of the variables, e.g., percentiles instead of original value

Summary
• Before you conduct your analyses, it is critical that you and other members of your research team become familiar with the following for each variable
  – Levels of measurement
  – Units and categories
  – Plausible and observed ranges of values
  – Missing values and their reasons
• Compare observed values against
  – Documentation for the data set you use in your analysis
  – The published literature on your topic

Summary, continued
• These attributes are essential information for
  – Data preparation
    • Inclusion criteria for your analytic sample
    • Creation of new variables
  – Choice of pertinent descriptive and multivariate statistics
  – Design of correct charts and tables
  – Writing correct prose descriptions for the data and methods and results sections of your paper.
• Even experienced researchers should complete this assignment when they undertake a project with a new topic or data set.
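
The comparison of observed values against the codebook can be sketched as one row of a descriptive statistics grid. This is a hypothetical Python illustration: the `describe` function, its parameter names, and the sample birth weights are assumptions, and the consistency check is limited to whether the observed minimum and maximum fall within the range given in the codebook.

```python
from statistics import mean, mode

def describe(values, missing_codes, codebook_range):
    """Summarize one continuous variable for a descriptive statistics grid.

    Assumes at least one valid (non-missing) value is present.
    """
    valid = [v for v in values if v not in missing_codes]
    low, high = codebook_range
    return {
        "n_valid": len(valid),
        "minimum": min(valid),
        "maximum": max(valid),
        "mean": mean(valid),
        "modal_value": mode(valid),
        "consistent_with_codebook": low <= min(valid) and max(valid) <= high,
    }

# Hypothetical birth weights in grams; 9999 is the missing value code.
weights = [3100, 2950, 9999, 3100, 4200, 9999, 2500]
row = describe(weights, missing_codes={9999}, codebook_range=(0, 6000))
print(row)
```

If the consistency flag comes back false, the reasons listed above (unit of analysis, sample restrictions, scale, transformations) are the first things to review.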

Reasons for getting to know your variables, redux
• Exercises in this podcast are time-consuming but very valuable for generating in-depth knowledge needed for your paper
• Reading the literature on your topic will yield information needed for the introduction, literature review and discussion sections.
• Detailed knowledge of study design and variables from documentation, questionnaire and codebook will yield information for the data and methods and results sections.

Resources on your topic
• Articles, books, reports, or web sites related to the main independent and dependent variables in your data
  – Definitions of concepts under study
  – Operationalization (how those concepts are actually measured in a particular data set)
  – Observed distributions in populations similar to those from which your sample is drawn
  – Commonly used transformations of those variables prior to analysis

Resources on your data set
• Documentation on study design
  – Context (who, when, where)
  – Sampling
  – Unit of analysis
• Questionnaire or other data collection instrument
  – Modules
  – Wording of questions
  – Skip patterns
• Codebook
  – Levels of measurement, categories, units
  – Missing value codes
  – Distribution of observed values for the study sample

Suggested readings
• Miller, J. E. 2013. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
  – chapter 4 on levels of measurement, units, standards and cutoffs
  – chapters 7 and 10 on choice of contrasts to suit the variable
  – chapter 13 on data and methods
  – chapters 4 and 13 on missing values and missing by design
• Chambliss, Daniel F., and Russell K. Schutt. 2012. Making Sense of the Social World: Methods of Investigation, 4th Edition. Thousand Oaks, CA: Sage Publications, or other research methods book for information on
  – study design, conceptualization, and measurement

Suggested online resources
• Podcasts on
  – Reporting one number (re: units)
  – Comparing two numbers or series of numbers (re: levels of measurement)
  – Defining the Goldilocks problem

Suggested practice exercises
• Study guide to The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
  – Problem sets for
    • chapter 4, questions #6 and 13
    • chapter 10, questions #1 and 3
  – Suggested course extensions for
    • chapter 4
      – “Reviewing” exercise #1
      – “Estimating statistics and writing” exercises #1 and 2
    • chapter 10
      – “Reviewing” exercise #1
      – “Estimating statistics and writing” exercises #1 and 2
      – “Revising” exercises #1, 2 and 3

Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at http://press.uchicago.edu/books/miller/multivariate/index.html