September 2, 2015

Brandeis University
Heller Graduate School
Sustainable International Development (SID) Program

Evaluating survey data: Questioning answers & cleaning data (HS 238f)
Syllabus – Fall 2015 (Module II)

Instructor: Ricardo Godoy
Office: Heller 153
Tel. 6-2784
E-mail: rgodoy@brandeis.edu
Time & place: Thursdays 9-11:50 am. Room: G54

The topic and main aim of the course: As part of many entry-level jobs, young professionals often have to evaluate and clean survey data. The main aim of the course is to expose students to the principles and best practices for evaluating and cleaning survey data.

Intended audience: The course is for students (a) without a background in statistics but who feel comfortable with algebra and quantitative reasoning and (b) who want to learn how to clean survey data.

The course in the SID curriculum: The course forms a logical sequel to the course on survey design, which was offered in Module I, and is a foundation for the courses on data analysis and on field experiments (or randomized controlled trials), both of which are offered in the spring semester. This course fulfills the MA/SID core elective requirement.

Approach: The course emphasizes:

(1) Exposure to principles: There is growing consensus among survey designers, statisticians, and empirical analysts about what constitutes a reliable, clean data set. The main aspects include meaningful sample size, sufficient variation in the variables of main interest, clear units of measurement, and low measurement error. During class time we will try to understand, from a theoretical or statistical point of view, why these aspects matter. Some of these topics will be a review from the course on survey design.

(2) Exposure to tools and best practices: Having a clear idea of the main aspects to explore when cleaning or assessing a data set, we then move to practical steps for assessing the quality of data produced from a survey.
Using Stata, students will learn how to assess effective sample size and variance, and how to transform variables (either to correct mistakes or to ease interpretation). This is the rationale for the weekly computational assignments, which will require the use of Stata 14. The problems are not graded; they are designed to reinforce the statistical principles covered in class.

(3) Repeated exposure to material previously covered in class: We rarely learn anything from a one-time exposure. For this reason I will give weekly quizzes; besides allowing you to gauge your progress in the course, the quizzes should be an instrument to help you learn. We will go over each quiz in class right after it is given so you can identify areas for improvement. You should expect to see some questions and topics from earlier lectures, readings, and computational exercises reappear in later quizzes. The quizzes will be in-class and closed-book.

Requirements and grading:

[A] Four quizzes (15% each; total=60%). During weeks 2-5 (inclusive) there will be a short quiz at the start of the class. The quiz will be followed by a short discussion of the quiz. The material for the quiz includes all class notes and readings up to and including the material to be covered on the day of the quiz. For example, quiz #3 will cover all the material up to and including the material covered in class 4, when quiz #3 is administered. Some of the questions that appear in a previous quiz will reappear in subsequent quizzes. Each quiz will be graded as follows: the top 20% of the scores will receive an A, the next 20% will receive an A-, the next 20% will receive a B+, the next 20% a B, and the bottom 20% will receive a B-. I will use the three best quizzes out of the four quizzes to compute the final grade. I will post the quizzes from 2014 on Latte, but without the suggested solutions.

[B] Final project (40%).
For the final project I will hand out a data set and a set of questions about it in class 5, ahead of the final class (December 10); to answer the questions you will need to use the principles and tools learned in the course. We will reserve the final class for an in-class discussion of your solutions.

Structure of class time: The typical class during weeks 2-4 will run as follows:
Quiz: 30 minutes
Review of quiz and problem set: 60 minutes
Break: 10 minutes
Lecture: 70 minutes

Stata: The problem sets will require you to use Stata 14 (freely available to Brandeis students). A tutorial and on-line notes will be provided at the end of the first week. You will need a student version of Stata on your laptop so you can follow parts of the lecture with Stata in front of you during the class.

Readings: There is a vast literature on survey design and implementation, but a scant literature explicitly devoted to assessing the quality of data already collected. Much of the research on techniques for assessing the quality of survey data in fact comes not from researchers doing survey design but from econometricians. Much of econometrics is devoted to techniques for evaluating and handling imperfectly collected data. To my knowledge, there is no review article or book devoted to discussing how to evaluate survey data; the literature seems scattered across fields and is more academic than practical. For this reason, I do not assign required readings for the course. Some relevant readings are listed in Appendix I, but they are optional. Instead of readings from books or articles, I will hand out notes before each lecture. A complete set of lecture notes from last year will be on Latte, and on or about the day of each class I will hand out a revised version of last year’s lecture notes for that class.

Feedback to students: You will be able to assess how well you are doing every week because we will go over the quiz and problem set in class each week.
Gender perspective: The course does not deal with a gender perspective, but the tools you learn in the course should make it easier for you to evaluate survey data bearing on gender issues.

Office hours: By appointment, Wednesdays 8-10 & 12-1:45.

Students with disabilities: See me if you have a documented disability and wish to request a reasonable accommodation for this class. Brandeis cannot provide reasonable accommodations retroactively.

Policy about academic honesty: Academic integrity is central to the mission of educational excellence at Brandeis University. Each student is expected to turn in work completed independently, except when assignments specifically authorize collaborative effort. It is not acceptable to use the words or ideas of another person without proper acknowledgement of that source. This means that you must use footnotes and quotation marks to indicate the source of any phrases, sentences, paragraphs or ideas found in published volumes, on the internet, or created by another student. Violations of University policy on academic integrity, described in Section 3 of Rights and Responsibilities, may result in failure in the course or on the assignment, and could end in suspension from the University. If you are in doubt about the instructions for any assignment in this course, you must ask for clarification.

Table 1: Summary of schedule and topics

Class #  Date   Topics [see detailed schedule below]                                Quiz
1        10/22  Understanding surveys & files & introduction to Stata 14            No
2        10/29  Variables I: Types                                                  Yes
3        11/5   Variables II: Assessment                                            Yes
4        11/12  Changing original variables                                         Yes
5        11/19  Principal component factor analysis                                 Yes
         11/26  HOLIDAY
6        12/3   Power analysis                                                      No
7        12/10  In-class discussion of final assignment, due electronically at 9am  No

Detailed schedule of topics covered in class (Stata commands shown in parentheses or brackets)

Class #1.
Understanding surveys & files & introduction to Stata 14
• Understanding the survey: Econometric model, main outcome and explanatory variables, stratification/clustering
• Map overall data set to econometric model to see what survey data you need to extract and use. Variables to merge.
• Transferring files to Stata
• Structure of individual files. Understanding
  o Horizontal dimensions:
    ▪ Clustering at what level & real number of observations (subjid, egen)
    ▪ Unit of observation or measurement vs unit of analysis (def of hhid, subjid, etc.)
  o Vertical dimensions (Variables):
    ▪ Name, definition, label, & value of variables
    ▪ Meaningful prefixes and suffixes of variables [labeling, define vars, describe]

Class #2. Variables I: Identifying the type of variables
• Typology for outcome and explanatory variables [tab; graph]
  o Why is typology of outcome variable important?
  o Identifying type of outcome variables
  o Main explanatory variables: Continuous and categorical variables
• Creating and using do files

Class #3. Variables II: Assessment.
Evaluating whether values of each variable are reasonable by:
• Examining mean, SD, and median via summary statistics or graphs [sum, graph; sum, detail]
• Assessing variance: coefficient of variation
• Estimating # of observations & missing values [sum, inspect]
• Spotting outliers through graphical analysis and/or descriptive statistics & how to correct (transformations and robust SE)
• Identifying mistakes:
  o Systematic & classical errors (t-test for telescoping bias; estimating digit heaping)
  o Ad-hoc (innocent) mistakes (tabulate)

Class #4.
Changing original variables
• Re-coding to
  o Reduce mistakes (replace, generate)
  o Achieve greater variance
  o Ease interpretation of results
• Transforming original variables even though they might have no mistakes
  o Rationale for main transformations
  o Common types of transformations: logs, z-scores, and new variables (e.g., income/person), egen to create summary statistics for higher-level units
  o Full set of dummy variables & dummy variable trap (Stata commands)

Class #5. Harmonizing data sets [Handout of final assignment]
• Labeling and defining newly created variables (labeling, defining)
• Re-visit assessment of reasonable values since you will have changed some of the variables
• Harmonize or rectangularize panel and cross-sectional data sets (merge, append)

Class #6. Bivariate analysis
• Typology of bivariate relations
• Reading and interpreting most common coefficients
• Using the “outreg” command in Stata

Class #7: In-class discussion of final assignment and discussion of the future of quantitative work with the arrival of big data sets. This is an exciting, new field that raises questions about the traditional econometrics approach. I have included three recent, very readable articles that map the future of quantitative data, plus a skeptical view published in Science. You might want to read the skeptical view plus one of the articles by advocates of big data sets.

Appendix I. Bibliography on how to evaluate survey data

[A] Journal articles:
Goldenberg, Karen L. 1994. Answering questions and questioning answers: Evaluating data quality in an establishment survey. Proceedings of the American Statistical Association 1357-1362.
Krosnick, J.A. 1999. Survey research. Annual Review of Psychology 50:537-567.
Sullivan, M., J. Karlsson, and J.E. Ware. 1995. The Swedish SF-36 Health Survey-I. Evaluation of data quality and construct validity across general populations in Sweden. Social Science & Medicine 41(10):1349-1358.
Green, Kathy E. 1996.
Applications of the Rasch model to evaluation of survey data quality. New Directions for Evaluation 70:81-92.
Barge, Scott and Hunter Gehlbach. 2012. Using the theory of satisficing to evaluate the quality of survey data. Research in Higher Education 53:182-200.

[B] Books (I have indicated in parentheses the most relevant chapters for this course):
Dillman, Don A. 2009. Internet, mail, and mixed-mode surveys: The tailored design method. 3rd edition. New Jersey: John Wiley. Chapter 4 (for Survey Design Course), Chapter 11.
Presser, S., ed. 2004. Methods for testing and evaluating survey questionnaires. Hoboken, NJ: John Wiley & Sons. (Chapter 1)
Skinner, C.J., D. Holt, and T.M.F. Smith. 1989. Analysis of complex surveys. New York: Wiley.
Biemer, Paul P. and Lars E. Lyberg. 2003. Introduction to survey quality. New York: Wiley. Chapters 1 & 10.
Alwin, Duane F. 2008. Margins of error: A study of reliability in survey measurement. New York: Wiley.
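
Appendix II. Illustrative Stata sketch for the Class #3 and Class #4 topics

The assessment and transformation steps listed under Classes #3 and #4 (summary statistics, coefficient of variation, missing values, logs, z-scores, egen summaries, and dummy variables) can be sketched in Stata roughly as follows. This is a minimal sketch and not the course's own do file: it assumes Stata's bundled auto.dta data set (loaded with sysuse) as a stand-in for survey data, and the new variable names (ln_price, z_price, mean_price_grp, rep_1 to rep_5) are hypothetical choices made for illustration.

```stata
* Load a data set that ships with Stata (assumption: stand-in for survey data)
sysuse auto, clear

* Structure of the file: variable names, storage types, and labels
describe

* Mean, SD, median, and other percentiles of one variable
summarize price, detail

* Coefficient of variation = SD / mean, using results stored by summarize
quietly summarize price
scalar cv_price = r(sd) / r(mean)
display "CV of price = " %6.3f cv_price

* Number of observations and missing values
inspect rep78
count if missing(rep78)

* Ad-hoc mistakes: tabulate every value, including missing ones
tabulate rep78, missing

* Common transformations: natural log and z-score
generate ln_price = ln(price)
egen z_price = std(price)

* egen summary statistic for a higher-level unit (here: group mean by foreign)
bysort foreign: egen mean_price_grp = mean(price)

* Full set of dummy variables from a categorical variable
tabulate rep78, generate(rep_)

* Dummy variable trap: leave one category out (rep_1 is the reference group)
regress price rep_2 rep_3 rep_4 rep_5

* Label newly created variables so the file stays self-documenting
label variable ln_price "Natural log of price"
label variable z_price "Standardized price (z-score)"
label variable mean_price_grp "Mean price within foreign/domestic group"
```

To adapt the sketch, replace sysuse auto with a use command pointing at your own file, and swap price, rep78, and foreign for the outcome, categorical, and grouping variables in your survey.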