Lecture 10
Causal Inference with Python
March 28, 2024
Data Science Using Python
Jaeung Sim
Assistant Professor
School of Business, University of Connecticut

Contents
• Causation vs. Correlation
• Time Series and Events
• Panel Data Analysis
• Panel Analysis with Events
• Python Hands-On
• Notice for Upcoming Weeks

Causation vs. Correlation
Basic Concepts of Causal Relationships

Causation vs. Correlation
• Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.
• A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable.
• Causation indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events.

Causation vs. Correlation
[Diagram: Popularity of Artists, Illegal Music Sharing, Legal Music Sales]

Causation vs. Correlation
• Why does it matter?
  • Avoiding Spurious Relationships
    • Many variables may be correlated due to coincidence, the presence of a third variable (confounding variable), or other complex interactions.
  • Understanding Mechanisms
    • Causation implies that a change in one variable directly results in a change in another. Identifying causal relationships helps us understand the underlying mechanisms or processes that lead to an outcome.
  • Implications for Decision Making
    • Decisions based on causal relationships are more likely to produce the intended outcomes. If decisions were based solely on correlations, they might be ineffective or lead to unintended consequences because the true causal factors are not addressed.

Causation vs. Correlation
• Does Causality Matter in Predictive Analysis?
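The point about spurious relationships can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the coefficients are made up, and it assumes that artist popularity drives both illegal sharing and legal sales while sharing has no direct effect on sales. The variable names are hypothetical and do not come from any dataset used in the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Assumed data-generating process (hypothetical numbers):
# popularity is a confounder that drives both variables.
popularity = rng.normal(size=n)
illegal_sharing = 2.0 * popularity + rng.normal(size=n)
# Sharing has NO direct effect on sales in this simulation.
legal_sales = 3.0 * popularity + rng.normal(size=n)


def ols(y, *regressors):
    """Least-squares coefficients: intercept first, then one per regressor."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Correlation between the two outcomes, ignoring the confounder.
raw_corr = np.corrcoef(illegal_sharing, legal_sales)[0, 1]

# Slope on sharing without and with the confounder as a control.
naive_slope = ols(legal_sales, illegal_sharing)[1]
adjusted_slope = ols(legal_sales, illegal_sharing, popularity)[1]

print(f"correlation(sharing, sales): {raw_corr:.2f}")
print(f"slope without controlling for popularity: {naive_slope:.2f}")
print(f"slope controlling for popularity: {adjusted_slope:.2f}")
```

In this setup the two outcomes are strongly correlated, yet the slope on sharing shrinks toward zero once popularity is controlled for. In real data the confounder is often unobserved, which is why the designs later in the lecture are needed.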
Time Series and Events
Causal Inference with Time Series

Events
• An event is anything that changes the underlying process that generates time series data, such as
  • Changes in level
  • Changes in trend (slope)
• The analysis of events includes two activities:
  • exploration to identify the functional form of the effect of the event
  • inference to determine whether the event has a statistically significant effect
• Other names for the analysis of events are the following:
  • intervention analysis
  • interrupted time series analysis

Events
• In retail sales, the term event is often used and includes the following:
  • promotional events: discounts, sales, featured displays, and so on
  • advertising events: broadcast, internet, and print media advertising campaigns, sponsored events, celebrity spokespersons, and so on
  • Other examples?
• In economics and the social sciences, the term intervention is often used and includes these:
  • catastrophic events
  • events related to a key player (CEO, spokesperson): imprisonment, scandal, illness, injury, or death
  • public policy changes
  • Other examples?

Events
• Changes in Level and Trend for Events
• What is valid inference here?
[Figure: Y against t, with the before-event average and the after-event average marked on either side of the event]
[Figures on the next three slides: Y against t around the event]

Events
• Other Types of Changes?

Interrupted Time Series
• An interrupted time series (ITS) design is a statistical method involving observations before and after an interruption.
• ITS evaluates the effects of the interruption by changes in the level and slope of the time series and their statistical significance.
• To use an ITS design, you need two things:
  • an intervention that can quickly produce a measurable effect when introduced at a specific time point
  • the ability to collect time series data (sequential observations from before and after the intervention) to evaluate the effects of introducing the intervention

Interrupted Time Series
$Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_I) D_t + \varepsilon_t$
• $D_t$ indicates the post-intervention period and $T_I$ is the intervention time.
• $\beta_2$ is the change in level and $\beta_3$ is the change in slope after the intervention.

Interrupted Time Series
• Pros
  • It is a strong design for estimating the effects of an intervention when randomization is not suitable or possible.
  • Many ITS designs involve comparing participants to themselves, which means that the design is more sensitive to differences in the effects of the intervention.
  • It can be conducted with a small sample size.
• Cons
  • Without randomization, definitive conclusions about the effects of the intervention are limited.
  • The pre-existing trend may not be comparable to the counterfactual trend after the interruption.

Panel Data Analysis
Basic Concept of Panel Data Structure

Cross-sectional Data
• Observations of many different individuals (subjects, objects) at a given time, each observation belonging to a different individual
• Examples
  • The list of UConn students enrolled in Fall 2022
  • US infrared satellite map on Nov 14, 2022
  • Various survey data
• Advantage
  • You can compare the states of different individuals at a given time.
• Disadvantage
  • You can’t observe how each individual has changed.

Time Series Data
• A series of data points indexed in time order
• Examples
  • Total US population trend
  • US recorded music revenues trend
  • Daily temperature in NYC in 2022
• Advantage
  • You can understand how an individual has changed over time.
• Disadvantage
  • You can’t compare this trend with other individuals.

Panel Data
• Multi-dimensional data involving measurements over time
• Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only.
• Examples
  • Per capita GDP by country over time
  • Monthly sales by customer over time
• Advantages
  • You can track how relative differences across individuals change over time.

Time-invariant/varying Factors
• Time-invariant factors
  • Factors that do not change over time within an individual
  • Examples
    • Gender (in most cases)
    • Age (within a year)
    • Date/place of birth
    • Mobile OS (rarely changes)
    • Other examples?
• Time-varying factors
  • Factors that change over time
  • Examples
    • Season, year, holidays
    • Temperature, precipitation
    • US president (among US residents)
    • Other examples?

Fixed Effects
• Variables that are constant within each observation group
• De-meaning the variables using the within transformation
• Individual fixed effects
  • A set of dummy variables indicating individuals
  • If the effect of each factor is constant over time, time-invariant factors are entirely captured by individual fixed effects.
• Time fixed effects
  • A set of dummy variables indicating time units
  • If the effect of each factor is the same across individuals, time-varying factors common across individuals are fully captured by time fixed effects.

Two-way Fixed Effects
• Individual fixed effects + Time fixed effects

Two-way Fixed Effects
• Internet adoption and print newspapers
  • Country-year level data on
    • Newspaper circulation
    • Newspaper titles
    • Broadband penetration rates
    • GDP per capita
    • Population
    • Tertiary school enrollment
    • Cell phone penetration
  • Country fixed effects + Year fixed effects
Cho, Daegon, Michael D. Smith, and Alejandro Zentner. "Internet adoption and the survival of print newspapers: A country-level examination." Information Economics and Policy 37 (2016): 13-19.
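The within transformation behind two-way fixed effects can be sketched on a simulated balanced panel. This is a minimal illustration, not the specification used in the cited paper: the panel dimensions, the coefficient `true_beta`, and all variable names are hypothetical, and the shortcut of subtracting unit means and period means (then adding back the grand mean) is only exact for balanced panels.

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, n_periods = 30, 12   # hypothetical balanced panel (rows: units, columns: periods)
true_beta = -0.5              # assumed effect of x on y in the simulation

# Unobserved unit-specific and period-specific factors.
unit_effect = rng.normal(scale=2.0, size=(n_units, 1))
time_effect = rng.normal(scale=2.0, size=(1, n_periods))

# x is correlated with both sets of effects, so pooled OLS is biased.
x = unit_effect + time_effect + rng.normal(size=(n_units, n_periods))
y = (true_beta * x + 3.0 * unit_effect + 2.0 * time_effect
     + rng.normal(scale=0.5, size=(n_units, n_periods)))


def two_way_demean(a):
    """Subtract unit means and period means, add back the grand mean."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def slope(x, y):
    """Simple-regression slope of y on x (with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())


pooled_beta = slope(x, y)                               # ignores fixed effects
twfe_beta = slope(two_way_demean(x), two_way_demean(y))  # two-way within estimator

print(f"true effect: {true_beta}")
print(f"pooled OLS:  {pooled_beta:.3f}")
print(f"two-way FE:  {twfe_beta:.3f}")
```

Regressing the demeaned outcome on the demeaned regressor gives the same slope as including a full set of unit and period dummies in a balanced panel. With unbalanced panels or when standard errors are needed, a dedicated panel estimator is the safer choice.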
Two-way Fixed Effects
• Internet adoption and advertising expenditure on traditional media
Zentner, Alejandro. "Internet adoption and advertising expenditures on traditional media: An empirical analysis using a panel of countries." Journal of Economics & Management Strategy 21, no. 4 (2012): 913-926.

Panel Analysis with Events
Panel Structure and Analysis with Internal/External Shocks

Time Series with Events
• Assumption
  • The same trend would continue if the event did not occur.
• Problems
  • Observing only one individual
  • A change after the event might be attributable to unobserved factors other than the event.
[Figure: Y against t around the event, annotated "What is the true counterfactual?"]

Panel Data with Events
• Assumption
  • The trend of individuals without the event (control) can be a counterfactual of the individual with the event (treated).
  • Without the event, the control and treated individuals will present parallel trends in their outcome variables.
[Figure: Y against t for the control and treated individuals around the event]

Difference-in-Differences
• A statistical technique using observational data that measures a differential effect of a treatment on a 'treatment group' versus a 'control group' in a natural experiment.

Difference-in-Differences
• After removing NBC content from Apple’s iTunes store in Dec 2007
  • 11.4% increase in demand for NBC’s pirated content
  • Insignificant decline in piracy for the same content when NBC came back to iTunes
  • No change in demand for NBC’s DVD content at Amazon.com
Danaher, Brett, Samita Dhanasobhon, Michael D. Smith, and Rahul Telang. "Converting pirates without cannibalizing purchasers: The impact of digital distribution on physical sales and internet piracy." Marketing Science 29, no. 6 (2010): 1138-1151.

Difference-in-Differences
• COVID-19 and E-Commerce Operations in Alibaba
Han, Brian Rongqing, Tianshu Sun, Leon Yang Chu, and Lixia Wu. "COVID-19 and E-commerce Operations: Evidence from Alibaba." Manufacturing & Service Operations Management 24, no. 3 (2022): 1388-1405.

Synthetic Control
• Background
  • The parallel trends assumption may not hold in many cases.
  • It can be difficult to find a sufficient number of control units having similar trends.
• Method
  • Pooling of a combination of untreated units to create a composite control against which the treated unit can be compared
  • The outcomes of the control units are weighted so as to construct the counterfactual treatment-free outcome for the treated unit.
  • The weights are chosen such that the treated unit and synthetic control have similar outcomes and covariates over the pretreatment period.
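The weighting idea in the synthetic control method can be sketched as a constrained least-squares problem. This is a simplified illustration under stated assumptions: the data are simulated, the weights are matched on pretreatment outcomes only (no covariates), and the solver is a plain projected gradient descent chosen to keep the example free of extra dependencies. Names such as `donors`, `true_weights`, and `true_effect` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_pre, n_post, n_donors = 40, 10, 5   # hypothetical sizes
t = np.arange(n_pre + n_post)
post = t >= n_pre

# Simulated untreated (donor) units: own level + loading on a common cycle + noise.
levels = rng.uniform(-1.0, 1.0, size=n_donors)
loadings = rng.uniform(0.5, 1.5, size=n_donors)
common = np.sin(t / 5.0)
donors = levels + np.outer(common, loadings) + rng.normal(scale=0.1, size=(t.size, n_donors))

# Treated unit: a mix of donors plus an assumed post-treatment jump.
true_weights = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
true_effect = 2.0
treated = donors @ true_weights + rng.normal(scale=0.05, size=t.size) + true_effect * post


def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_weights(Y0, y1, n_iter=5000):
    """Minimize mean((y1 - Y0 @ w)**2) over the simplex by projected gradient descent."""
    n = Y0.shape[0]
    w = np.full(Y0.shape[1], 1.0 / Y0.shape[1])
    step = n / (2.0 * np.linalg.norm(Y0, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2.0 / n * Y0.T @ (y1 - Y0 @ w)
        w = project_to_simplex(w - step * grad)
    return w


# Fit weights on the pretreatment period only, then compare after treatment.
weights = fit_weights(donors[~post], treated[~post])
synthetic = donors @ weights
pre_rmse = float(np.sqrt(np.mean((treated[~post] - synthetic[~post]) ** 2)))
effect_hat = float(np.mean(treated[post] - synthetic[post]))

print("weights:", np.round(weights, 2))
print(f"pretreatment RMSE: {pre_rmse:.3f}")
print(f"estimated effect:  {effect_hat:.2f} (simulated true effect: {true_effect})")
```

The non-negative weights that sum to one keep the synthetic unit inside the range of the donors, which avoids extrapolation. A small pretreatment RMSE is what justifies reading the post-treatment gap as an effect.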
Synthetic Control
• Paywall in Online Newspapers
  • Comparing New York Times with national newspapers of similar popularity (i.e., USA Today, Washington Post, Wall Street Journal, Chicago Tribune, New York Daily News)
  • The control newspapers are not strictly comparable to NYT because they experienced different temporal trends prior to the erection of the paywall.
• Findings
  • More negative impact on online heavy users
  • Positive spillovers on print readership

Python Hands-On
Using Real-world Datasets

Notice for Upcoming Weeks
What you need to do, what you will do

What You Will Do
• Homework #3 canceled
• April 4 (In person)
  • Group Presentation
• April 11 (In person)
  • Course Review & Exam Preview
• April 18: Final Exam
• April 25 (Recording)
  • Data Scraping Techniques

Jaeung Sim
Assistant Professor
School of Business, University of Connecticut
jaeung.sim@uconn.edu
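As a hands-on complement, the interrupted time series regression from earlier in the lecture (outcome on time, a post-event indicator, and time since the event) can be sketched with ordinary least squares. This is a minimal example on simulated data, not one of the course datasets: the event time, the coefficients in `beta_true`, and the noise level are all assumptions. The standard errors are the classical OLS ones and ignore autocorrelation, which real time series often have.

```python
import numpy as np

rng = np.random.default_rng(1)
n_periods, event_time = 60, 36   # hypothetical series length and event date

t = np.arange(n_periods)
D = (t >= event_time).astype(float)   # 1 from the event onward, 0 before
time_since = (t - event_time) * D     # periods elapsed since the event

# Assumed coefficients: intercept, pre-event trend, level change, slope change.
beta_true = np.array([10.0, 0.2, 3.0, -0.1])
X = np.column_stack([np.ones(n_periods), t, D, time_since])
y = X @ beta_true + rng.normal(scale=0.3, size=n_periods)

# OLS estimates of the four coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical standard errors and t-statistics.
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n_periods - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se

for name, b, s, ts in zip(["intercept", "trend", "level change", "slope change"],
                          beta_hat, se, t_stats):
    print(f"{name:>12}: {b:7.3f}  (se {s:.3f}, t {ts:6.2f})")
```

The coefficient on the indicator estimates the change in level at the event and the coefficient on time since the event estimates the change in slope. With real data, autocorrelation-robust standard errors would be a sensible refinement.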