September 2, 2015
Brandeis University
Heller Graduate School
Sustainable International Development (SID) Program
Evaluating survey data: Questioning answers & cleaning data (HS 238f)
Syllabus – Fall 2015 (Module II)
Instructor: Ricardo Godoy
Office: Heller 153
Tel. 6-2784
E-mail: rgodoy@brandeis.edu
Time & place: Thursdays 9-11:50 am. Room: G54
The topic and main aim of the course: As part of many entry-level jobs, young professionals often have
to evaluate and clean survey data. The main aim of the course is to expose students to the principles and to
the best practices for evaluating and cleaning survey data.
Intended audience: The course is for students (a) who have no background in statistics but feel
comfortable with algebra and quantitative reasoning, and (b) who want to learn how to clean survey data.
The course in the SID curriculum: The course forms a logical sequel to the course on survey design,
which was offered in Module I, and is a foundation for the course on data analysis and field experiments
(or randomized controlled trials), both of which are offered in the spring semester. This course fulfills the
MA/SID core elective requirement.
Approach: The stress in the course lies on:
(1) exposure to principles: There is growing consensus among survey designers, statisticians, and
empirical analysts about what constitutes a reliable, clean data set. These aspects include such
things as meaningful sample size, sufficient variation in the variables of main interest, clear units
of measurement, and low measurement error. During class time we will try to understand, from a
theoretical or statistical point of view, why these aspects matter. Some of these topics will be a
review from the course on survey design.
(2) exposure to tools and best practices: Having a clear idea of the main aspects to explore when
cleaning or assessing a data set, we then move to practical steps for how to assess the quality of
data produced from a survey. Using Stata, students will learn how to assess effective sample size
and variance, and transform variables (either to correct mistakes or to ease interpretation). This is
the rationale for the weekly computational assignments, which will require the use of Stata 14.
The problems are not graded; they are designed to reinforce the statistical principles covered in
class.
(3) repeated exposure to previous material covered in class: we rarely learn anything from a one-time exposure. For this reason I will give weekly quizzes; besides allowing you to gauge your
progress in the course, the quizzes should be an instrument to help you learn. We will go over the
quizzes in class right after the quiz so you can identify areas for improvement. You should expect
to see some questions and topics from earlier lectures, readings, and computational exercises
reappear in later quizzes. The quizzes will be in-class, closed-book.
Requirements and grading:
[A] Four quizzes (15% each; total=60%). During weeks 2-5 (inclusive) there will be a short quiz at the
start of the class. The quiz will be followed by a short discussion of the quiz. The material for the
quiz includes all class notes and readings up to and including the material to be covered on the day
of the quiz. For example, quiz #3 will cover all the material up to and including the material
covered in class 4, when quiz #3 is administered. Some of the questions that appear in a previous
quiz will re-appear in subsequent quizzes. Each quiz will be graded as follows: the top 20% of the
scores will receive an A, the next 20% will receive an A-, the next 20% will receive a B+, the next
20% a B, and the bottom 20% will receive a B-. I will use the three best quizzes out of the four
quizzes to compute the final grade. I will put in Latte the quizzes from 2014, but without the
suggested solutions.
[B] Final project (40%). For the final project I will hand out a data set and a set of questions about it in class 5,
before the final class (December 10); to
answer the questions you will need to use the principles and tools learned in the course. We will
reserve the final class for an in-class discussion of your solutions.
Structure of class time: The typical class during weeks 2-4 will run as follows:
Quiz: (30 minutes)
Review of quiz and problem set: (60 minutes)
Break: (10 minutes)
Lecture: (70 minutes)
Stata: The problem sets will require you to use Stata 14 (freely available to Brandeis students). A tutorial and
on-line notes will be provided at the end of the first week. You will need a student version of Stata on your
laptop so you can follow parts of the lecture with Stata in front of you during the class.
Readings: There is a vast literature on survey design and implementation, but a scant literature explicitly
devoted to assessing the quality of data already collected. Much of the research on techniques for assessing the
quality of survey data in fact comes not from researchers doing survey design but from econometricians. Much
of econometrics is devoted to techniques for evaluating and handling imperfectly collected data. To my
knowledge, there is no review article or book devoted to discussing how to evaluate survey data; the literature
seems scattered across fields and is more academic than practical. For this reason, I do not include required readings for
the course. Some relevant readings are listed in Appendix I, but they are optional. Instead of readings from books
or articles, I will hand out notes before each lecture. A complete set of lectures from last year will be in Latte,
but on or about the day of the class I will hand out a revised version of last year’s lecture notes for that class.
Feedback to students: You will be able to assess how well you are doing every week because we will go over
the quiz and problem set in class each week.
Gender perspective: The course does not deal with a gender perspective, but the tools you learn in the course
should make it easier for you to evaluate survey data bearing on gender issues.
Office hours: By appointment, Wednesdays 8-10 & 12-1:45.
Students with disabilities: See me if you have a documented disability and wish to request a reasonable
accommodation for this class. Brandeis cannot provide reasonable accommodations retroactively.
Policy about academic honesty: Academic integrity is central to the mission of educational excellence at
Brandeis University. Each student is expected to turn in work completed independently, except when
assignments specifically authorize collaborative effort. It is not acceptable to use the words or ideas of another
person without proper acknowledgement of that source. This means that you must use footnotes and quotation
marks to indicate the source of any phrases, sentences, paragraphs or ideas found in published volumes, on the
internet, or created by another student. Violations of University policy on academic integrity, described in
Section 3 of Rights and Responsibilities, may result in failure in the course or on the assignment, and could end
in suspension from the University. If you are in doubt about the instructions for any assignment in this course,
you must ask for clarification.
Table 1: Summary of schedule and topics
Class #  Date   Topics [see detailed schedule below]                                Quiz
1        10/22  Understanding surveys & files & introduction to Stata 14            No
2        10/29  Variables I: Types                                                  Yes
3        11/5   Variables II: Assessment                                            Yes
4        11/12  Changing original variables                                         Yes
5        11/19  Principal component factor analysis                                 Yes
         11/26  HOLIDAY
6        12/3   Power analysis                                                      No
7        12/10  In-class discussion of final assignment, due electronically at 9am  No
Detailed schedule of topics covered in class (Stata commands in brackets or parentheses)
Class #1. Understanding surveys & files & introduction to Stata 14
• Understanding the survey: econometric model, main outcome and explanatory variables, stratification/clustering
• Map the overall data set to the econometric model to see what survey data you need to extract and use. Variables to merge.
• Transferring files to Stata
• Structure of individual files. Understanding:
  o Horizontal dimensions:
    - Clustering at what level & real number of observations (subjid, egen)
    - Unit of observation or measurement vs. unit of analysis (definition of hhid, subjid, etc.)
  o Vertical dimensions (variables):
    - Name, definition, label, & value of variables
    - Meaningful prefixes and suffixes of variables [labeling, define vars, describe]
Class #2. Variables I: Identifying the type of variables
• Typology for outcome and explanatory variables [tab; graph]
  o Why is typology of outcome variable important?
  o Identifying type of outcome variables
  o Main explanatory variables: continuous and categorical variables
• Creating and using do files
Class #3. Variables II: Assessment. Evaluating whether values of each variable are reasonable by:
• Examining mean, SD, and median via summary statistics or graphs [sum, graph; sum, detail]
• Assessing variance: coefficient of variation
• Estimating # of observations & missing values [sum, inspect]
• Spotting outliers through graphical analysis and/or descriptive statistics & how to correct (transformations and robust SE)
• Identifying mistakes:
  o Systematic & classical errors (t-test for telescoping bias; estimating digit heaping)
  o Ad-hoc (innocent) mistakes (tabulate)
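As an illustration only of three of the checks listed for Class #3 (the course itself uses Stata, so this is not the course's method), here is a minimal Python sketch. The function names and the sample ages are hypothetical. Missing values are marked with None, and the heaping check assumes integer-valued responses.

```python
import statistics


def coefficient_of_variation(values):
    """Return SD / mean of the non-missing values (None marks a missing value)."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(observed) / mean


def count_missing(values):
    """Count how many entries are missing."""
    return sum(1 for v in values if v is None)


def heaping_share(values, digits=(0, 5)):
    """Share of non-missing integer values whose last digit is in `digits`."""
    observed = [v for v in values if v is not None]
    heaped = sum(1 for v in observed if abs(v) % 10 in digits)
    return heaped / len(observed)


# Hypothetical self-reported ages; None marks a missing answer.
ages = [20, 25, 30, 30, 35, 40, 41, None, 50, 50, None, 37]

print("missing:", count_missing(ages))
print("CV:", round(coefficient_of_variation(ages), 3))
print("share ending in 0 or 5:", heaping_share(ages))
```

If last digits were spread evenly, about 20% of values would end in 0 or 5, so a share far above that is a rough sign of digit heaping. That benchmark is an assumption of this sketch, not a formal test.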
Class #4. Changing original variables
• Re-coding to:
  o Reduce mistakes (replace, generate)
  o Achieve greater variance
  o Ease interpretation of results
• Transforming original variables even though they might have no mistakes
  o Rationale for main transformations
  o Common types of transformations: logs, z-scores, and new variables (e.g., income/person), egen to create summary statistics for higher-level units
  o Full set of dummy variables & dummy variable trap (Stata commands)
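The transformations named for Class #4 can be sketched in a few lines. The following Python example is an illustration under stated assumptions (the course itself uses Stata). The function names and the sample data are hypothetical, and z-scores here use the sample standard deviation.

```python
import math
import statistics


def log_transform(values):
    """Natural log of each value; only valid for strictly positive values."""
    return [math.log(v) for v in values]


def z_scores(values):
    """Center each value on the mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def dummies(categories, drop_first=True):
    """Build 0/1 indicator columns, one per category level.

    With drop_first=True the first level (in sorted order) is omitted and
    serves as the reference category.
    """
    levels = sorted(set(categories))
    if drop_first:
        levels = levels[1:]
    return {level: [1 if c == level else 0 for c in categories] for level in levels}


# Hypothetical household data.
income = [100.0, 250.0, 400.0]
household_size = [4, 5, 2]
region = ["north", "south", "east", "north"]

income_per_person = [y / n for y, n in zip(income, household_size)]
log_income = log_transform(income)
standardized_income = z_scores(income)

# The full set of dummies sums to 1 in every row, which duplicates the
# constant term of a regression: this is the dummy variable trap.
full_set = dummies(region, drop_first=False)
row_sums = [sum(column[i] for column in full_set.values()) for i in range(len(region))]

# Dropping one level as the reference category avoids the trap.
reduced_set = dummies(region)
```

Which category to drop is a modeling choice. Dropping the first level in sorted order is only a convenience of this sketch.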
Class #5. Harmonizing data sets [Hand out of final assignment]
• Labeling and defining newly-created variables (labeling, defining)
• Re-visit assessment of reasonable values since you will have changed some of the variables
• Harmonize or rectangularize panel and cross-sectional data sets (merge, append)
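To make the difference between appending and merging concrete, here is a minimal Python sketch. It is only loosely analogous to the Stata commands named above and does not reproduce their behavior. The function names, the `_matched` flag, and the sample records are hypothetical.

```python
def append_rows(rows_a, rows_b):
    """Stack two data sets: the result has more observations (rows)."""
    return rows_a + rows_b


def merge_one_to_one(master, using, key):
    """Add variables from `using` to each `master` row that shares `key`.

    Each key must appear at most once in `using`. A `_matched` flag records
    whether a master row found a partner, so unmatched rows can be inspected.
    """
    using_by_key = {}
    for row in using:
        if row[key] in using_by_key:
            raise ValueError(f"duplicate key in using data: {row[key]!r}")
        using_by_key[row[key]] = row

    merged = []
    for row in master:
        partner = using_by_key.get(row[key], {})
        # Master values win if both data sets define the same variable.
        combined = {**partner, **row}
        combined["_matched"] = row[key] in using_by_key
        merged.append(combined)
    return merged


# Hypothetical household files keyed by hhid.
incomes = [{"hhid": 1, "income": 100}, {"hhid": 2, "income": 250}]
sizes = [{"hhid": 1, "size": 4}]
second_wave = [{"hhid": 3, "income": 90}]

merged = merge_one_to_one(incomes, sizes, key="hhid")
stacked = append_rows(incomes, second_wave)
```

Checking the `_matched` flag after a merge is a simple way to spot households that appear in one file but not the other.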
Class #6. Bivariate analysis
• Typology of bivariate relations
• Reading and interpreting most common coefficients
• Using the “outreg” command in Stata
Class #7: In-class discussion of final assignment and discussion of the future of quantitative work with
the arrival of big data sets. This is an exciting, new field that raises questions about the traditional
econometric approach. I have included three recent, very readable articles that map the future of
quantitative data, plus a skeptical view published in Science. You might want to read the skeptical view and
one of the articles by advocates of big data sets.
Appendix I. Bibliography on how to evaluate survey data
[A] Journal articles:
Goldenberg, Karen L. 1994. Answering questions and questioning answers: Evaluating data quality in an
establishment survey. Proceedings of the American Statistical Association 1357-1362.
Krosnick, J.A. 1999. Survey research. Annual Review of Psychology 50: 537-567.
Sullivan, M., J. Karlsson, and J.E. Ware. 1995. The Swedish SF-36 Health Survey-I. Evaluation of data quality
and construct validity across general populations in Sweden. Social Science & Medicine 41(10): 1349-1358.
Green, Kathy E. 1996. Applications of the Rasch model to evaluation of survey data quality. New Directions for
Evaluation 70: 81-92.
Barge, Scott and Hunter Gehlbach. 2012. Using the Theory of Satisficing to evaluate the quality of survey data.
Research in Higher Education 53: 182-200.
[B] Books (I have indicated in parenthesis the most relevant chapters for this course):
Dillman, Don A. 2009. 3rd edition. Internet, mail, and mixed-mode surveys: The tailored design method. New
Jersey: John Wiley. Chapter 4 (for Survey Design Course), Chapter 11.
Presser, S., ed. 2004. Methods for testing and evaluating survey questionnaires. Hoboken, NJ: John Wiley &
Sons. (Chapter 1)
Skinner, C.J., D. Holt, and T.M.F. Smith 1989. Analysis of Complex Surveys. New York: Wiley.
Biemer, Paul P. and Lars E. Lyberg. 2003. Introduction to Survey Quality. New York: Wiley. Chapter 1 & 10.
Alwin, Duane F. 2008. Margins of error: A study of reliability in survey measurement. New York: Wiley.