SOC610 Topics in Social Science Data Analysis
A two-quarter, variable-credit seminar & workshop: Winter & Spring 2014
Seminar meets every other week during Winter and Spring terms

Prof. Caleb Southworth
Dept. of Sociology, 740 PLC
caleb@uoregon.edu

This course is a seminar in data analysis. The goal of the course is to expand and refine your toolbox of data management and data analysis techniques. The course differs from a traditional, textbook-based statistics course in that the seminar will cover a wide variety of data cleaning tasks and present many alternative ways to look at your data. The intended audience is graduate students who are working on a thesis, research project, or dissertation.

The seminar is relevant to many areas of study. Students in Journalism may be interested in methods of extracting web-based data and analyzing time trends; in Political Science, analyzing the life history of regimes; in Economics and Business, models of organizational success or failure; in Sociology, questions of spatial or network structure; in Psychology, analysis of social context and individual decisions. The tools in the course are quite general and apply to all sorts of scientific problems. The emphasis is on clear questions, visual presentation and grant writing, all of which should have interdisciplinary appeal.

In the seminar, we will use a wide variety of software. The main package for the course is Stata, but we will also use R, SAS, NVivo, StatsNet, Mathematica, LaTeX, GeoDa and a variety of other free and open-source software. You are encouraged to bring your own favorite software to class and put it to use. Every effort will be made to keep the substance of the course platform- and software-independent.

Course Structure

The course is spread out over two quarters and there are five workshops per quarter.
Each workshop demonstrates a particular technique, offers a set of exercises with an existing dataset, and then turns to students’ own particular interests, problems and data. You are welcome to attend any portion of the seminar.

Hour 1: Lecture & Demonstration of a specific technique.
Hour 2: Hands-on laboratory. Data provided. Students tackle a problem within the particular topic area.
Hour 3: Consultation. Office hours dedicated to data management and analysis. Analyze your data and problems in a group setting.

Enrollment and Grading Options

Everyone is welcome at any or all of the workshops and/or the consulting office hours. The department asks that students enroll under one of the following options:

Option 1a: P/NP, attend and participate and receive 2 credits per quarter. (This is the default enrollment option; students may enroll in one or both quarters.)
Option 1b: P/NP, attend and participate in two workshops and receive 1 credit per quarter.
Option 2: Graded, attend and participate in all workshops over two quarters and receive 4 graded credits.
Option 3: Research Methods credit. Students receive credit for a graduate methods course when they attend over two terms and turn in a research project using one or more of the techniques presented in the course. Students receive 5 graded credits for this choice.

Required Software & Materials

The course requires that you have a laptop with Stata 13 installed. A one-year graduate student license costs $98. There are other purchase options (http://www.stata.com/order/new/edu/gradplans/#3).

Recommended software includes:
A Unix operating system of any flavor
Stat/Transfer (2-year license = $69)
R-Project for Statistical Computing (free, http://www.r-project.org/)

Topic Areas

These are proposed topic areas. Many of them we will definitely cover, such as making tables for publication. Others will be determined by the interest of the participants.
Within topic areas, there is substantial room for emphasis on specific problems from students’ dissertations or research projects.

1: Constructing Regression or ANOVA Tables for Publication

Most statistics courses focus on the creation and interpretation of point estimates and overlook the mechanical and conceptual issues that go with presentation of results in tables. At the end of this workshop you will know how to run an analysis and automatically print a table of coefficients in the format of a journal. This is done without typing anything into the table by hand. You can change one parameter and receive a revised, publication-ready table immediately. We will discuss best practices for tables and data summary.

Ian Watson. “Publication quality tables in Stata: a tutorial for the tabout program.” The Stata Journal.
Fear, S. 2003. “Publication Quality Tables in LaTeX.” Documentation for the booktabs package, www.ctan.org/tex-archive/macros/latex/contrib/booktabs/booktabs.pdf.
Tufte, E. R. 2001. The Visual Display of Quantitative Information. Graphics Press.

2: Analysis of Time-Series Data

Here the emphasis is on establishing a time trend and being able to identify a case or cases in a set that exhibit a specific temporal pattern. Does crime respond to state- or country-level social policy? Do welfare programs change public health outcomes? Does urban planning affect human behavior in cities?

Hochheiser, H., Shneiderman, B. “Visual Specification of Queries for Finding Patterns in Time-Series Data.” Proceedings, Discovery Science 2001. University of Maryland, Computer Science Dept. Technical Report #CS-TR-4326, UMIACS-TR-2001-25.
Jeffrey Wooldridge. 2012. Introductory Econometrics: A Modern Approach. Part II: Regression with Time-Series Data. 5th edition (any edition is acceptable).

3: Analysis of Panel Data

Panel data are common in the social sciences: repeated observations on people, cities, states, countries or organizations over time.
It is equally common for students to have a panel data problem and to have difficulty putting the data in the correct format, especially when pooling from multiple sources. The reward for that work, however, is statistical control of original states or starting conditions and the additional leverage of observing the same unit over time.

Robert Yaffee. 2003. “A Primer for Panel Data Analysis.”

4: Event History Analysis

Event history analysis, survival analysis and life history analysis all refer to the same set of methods. These methods deal with the time until an event occurs and have been used to analyze diseases and treatments, domestic violence, regime change, and organizational mortality. A key component of survival analysis is the necessity of dealing with censored data, where observations join the analysis at different points and drop out in a possibly non-random pattern.

Paul Allison. 2003. Survival Analysis Using SAS: A Practical Guide.

5: Visualizing Data, Graphical Methods

Exploratory data analysis (EDA) is a family of graphical techniques for describing data visually. EDA includes plots, line drawings, mapped data, and many graphical ways of summarizing the relationships present. It is particularly useful in developing concepts and hypotheses. EDA has answers for some common problems in the social sciences: What would replications of a social science experiment look like? What can be done if my data are not a random sample? How well do my data meet the assumptions of specific techniques, such as regression? How can interactions in continuous data be graphed?

John W. Tukey. 1977. Exploratory Data Analysis. Addison-Wesley Publishing Co.
William S. Cleveland. 1985. The Elements of Graphing Data. Wadsworth.
Edward Tufte. 2006. Beautiful Evidence.

6: Qualitative Comparative Analysis

Social science research often involves comparison of composite cases, which are bundles of discrete events.
This approach draws on Boolean algebra and Bayesian inference methods to analyze patterns in rare events where increasing the sample size is not practicable. Under what conditions do revolutions occur? When will a labor union go on strike? When does a protest turn violent? Models of these events as continuous processes make little sense and the number of viable cases is always small. Here the emphasis is on describing a pattern of unique events and asking to what extent a particular hypothesis could have generated the observed pattern.

Charles C. Ragin. 1987. The Comparative Method. University of California Press.

7: Categorical Dependent Variables

Categorical dependent variables are a common occurrence in the social sciences. Which candidate did you vote for? Did you complete high school? Are you unemployed? Did you participate in a protest? Black-and-white decisions about social action describe many choices. However, like other bounded dependent variables, such outcomes require models with a particular type of non-linear function. Likewise, these models require particular attention to marginal change, something best described graphically.

J. Scott Long and Jeremy Freese. 2006. Regression Models for Categorical Dependent Variables. Stata Press.

8: Missing Data Problems

Missing data occur in all scientific research. Deleting the affected cases is unsatisfactory, as such data are often not missing at random. Reweighting to make the remaining distribution resemble the population likewise ignores selection in the process generating the missing data. Instead, social scientists need to understand why the data are missing and arrive at a solution that permits the analysis of the remaining cases without bias.

Paul D. Allison. 2001. Missing Data. Sage.

9: Grant Writing for Social Scientists

Grant writing is an integral part of academic life, but students are rarely schooled in it. The goal of this unit is to get you to see the grant proposal from the reviewers' perspective.
It is based on my decade of service at the National Science Foundation and review work for many other granting agencies. Writing a clear, concise grant proposal will help you write convincing papers and theses.

Przeworski, Adam and Salomon, Frank. 2000. "The Art of Proposal Writing." Social Science Research Council.

10: Analysis of Interactions, Categorical & Continuous

Interactions are present in many types of models and often ignored. This unit will walk you through the simple, two-variable binomial interaction. From there we will learn to graph and understand complicated interactions of continuous variables, many of which do not have a unimodal distribution. How does the effect of income on obesity vary across social class? How do class, race and religion vary across the geography of a city to explain voter choice?

Leona Aiken and Stephen G. West. 1991. Multiple Regression: Testing and Interpreting Interactions. Sage.
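
As a preview of the simple two-variable case in Topic 10, here is a minimal sketch in Python. It is not course material: the data are invented for illustration, and the names x1, x2, rows and cell_mean are hypothetical. With two 0/1 predictors and their product in the model, least squares reproduces the four cell means exactly, so the interaction coefficient can be read off as a difference in differences.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
# Each row is (x1, x2, y), with x1 and x2 both coded 0/1.
rows = [
    (0, 0, 10.0), (0, 0, 12.0),
    (1, 0, 15.0), (1, 0, 17.0),
    (0, 1, 11.0), (0, 1, 13.0),
    (1, 1, 24.0), (1, 1, 26.0),
]


def cell_mean(x1, x2):
    """Mean of y among observations with the given x1 and x2 values."""
    return mean(y for a, b, y in rows if (a, b) == (x1, x2))


# Saturated model: y = b0 + b1*x1 + b2*x2 + b3*x1*x2.
# The fitted values equal the four cell means, which pins down each coefficient.
b0 = cell_mean(0, 0)                            # baseline cell (x1 = 0, x2 = 0)
b1 = cell_mean(1, 0) - b0                       # effect of x1 when x2 = 0
b2 = cell_mean(0, 1) - b0                       # effect of x2 when x1 = 0
b3 = (cell_mean(1, 1) - cell_mean(0, 1)) - b1   # difference in differences

# The effect of x1 is not one number: it depends on the level of x2.
effect_x1_when_x2_is_0 = b1
effect_x1_when_x2_is_1 = b1 + b3

print("b0, b1, b2, b3:", b0, b1, b2, b3)
print("effect of x1 at x2 = 0:", effect_x1_when_x2_is_0)
print("effect of x1 at x2 = 1:", effect_x1_when_x2_is_1)
```

The point of the sketch is that once an interaction is in the model, the coefficient on x1 alone describes only the x2 = 0 group. This is one reason interactions are easy to misread in a table and easier to understand in a graph.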