OPIM5512 Lecture 10 (2023-03-28)(1)

Lecture 10
Causal Inference with Python
March 28, 2024
Data Science Using Python
Jaeung Sim
Assistant Professor
School of Business, University of Connecticut
Contents
• Causation vs. Correlation
• Time Series and Events
• Panel Data Analysis
• Panel Analysis with Events
• Python Hands-On
• Notice for Upcoming Weeks
Causation vs. Correlation
Basic Concepts of Causal Relationships
Causation vs. Correlation
• Correlation is a statistical measure (expressed as a number) that describes the size
and direction of a relationship between two or more variables.
• A correlation between variables, however, does not automatically mean that the
change in one variable is the cause of the change in the values of the other variable.
• Causation indicates that one event is the result of the occurrence of the other event;
i.e., there is a causal relationship between the two events.
Causation vs. Correlation
[Diagram relating three variables: Popularity of Artists, Illegal Music Sharing, Legal Music Sales]
Causation vs. Correlation
• Why does it matter?
• Avoiding Spurious Relationships
• Many variables may be correlated due to coincidence, the presence of a third variable
(confounding variable), or other complex interactions.
• Understanding mechanisms
• Causation implies that a change in one variable directly results in a change in another.
Identifying causal relationships helps us understand the underlying mechanisms or processes
that lead to an outcome.
• Implications for Decision Making
• Decisions based on causal relationships are more likely to produce the intended outcomes. If
decisions were based solely on correlations, they might be ineffective or lead to unintended
consequences because the true causal factors are not addressed.
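A minimal simulation sketch of a spurious relationship driven by a confounding variable. The variable names borrow the music example from the earlier diagram, and all coefficients and the sample size are made-up illustration values, not estimates from any study. By construction, sharing has no effect on sales here, yet the two are strongly correlated because both depend on popularity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000  # hypothetical number of artists

# Assumed data-generating process: popularity drives both outcomes.
popularity = rng.normal(size=n)
sharing = 2.0 * popularity + rng.normal(size=n)  # sales do not enter here
sales = 3.0 * popularity + rng.normal(size=n)    # sharing does not enter here

# The raw correlation is strong even though neither variable causes the other.
raw_corr = np.corrcoef(sharing, sales)[0, 1]

# Regress sales on sharing AND popularity: once the confounder is
# controlled for, the sharing coefficient is close to zero.
X = np.column_stack([np.ones(n), sharing, popularity])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

print(f"raw correlation(sharing, sales): {raw_corr:.2f}")
print(f"sharing coefficient controlling for popularity: {coef[1]:.2f}")
```

In real data the confounder is often unobserved, so it cannot simply be added as a regressor; that is the motivation for the designs in the rest of the lecture.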
Causation vs. Correlation
• Does Causality Matter in Predictive Analysis?
Time Series and Events
Causal Inference with Time Series
Events
• An event is anything that changes the underlying process that generates
time series data, such as
• Changes in level
• Changes in trend (slope)
• The analysis of events includes two activities:
• exploration to identify the functional form of the effect of the event
• inference to determine whether the event has a statistically significant effect
• Other names for the analysis of events are the following:
• intervention analysis
• interrupted time series analysis
Events
• In retail sales, the term event is often used and includes the following:
• promotional events: discounts, sales, featured displays, and so on
• advertising events: broadcast, internet, and print media advertising campaigns,
sponsored events, celebrity spokespersons, and so on
• Other examples?
• In economics and the social sciences, the term intervention is often used
and includes these:
• catastrophic events
• events related to a key player (CEO, spokesperson): imprisonment, scandal,
illness, injury, or death
• public policy changes
• Other examples?
Events
• Changes in Level and Trend for Events
• What is valid inference here?
[Four figures: Y plotted against t, with the event marked on the time axis. The first figure marks the average of Y before and after the event.]
Events
• Other Types of Changes?
Interrupted Time Series
• An interrupted time series (ITS) design is a statistical method involving
observations before and after an interruption.
• ITS evaluates the effects of the interruption by changes in the level and slope
of the time series and their statistical significance.
• To use an ITS design, you need two things:
• an intervention that can quickly produce a measurable effect when
introduced at a specific time point
• the ability to collect time series data (sequential observations from before
and after the intervention) to evaluate the effects of introducing the
intervention
Interrupted Time Series
$Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_I) D_t + \varepsilon_t$
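A minimal sketch of estimating the model above by ordinary least squares. The slide does not define the symbols, so this assumes the usual reading: D_t is a dummy equal to 1 from the event onward and T_I is the event time, so that β2 is the change in level and β3 the change in slope. The series is simulated, and the "true" effects (level jump 5, slope change 0.3) are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
T, T_I = 60, 30                 # hypothetical series length and event time
t = np.arange(T)
D = (t >= T_I).astype(float)    # assumed: D_t = 1 from the event onward

# Simulated outcome following the ITS equation with made-up parameters.
y = 10 + 0.5 * t + 5 * D + 0.3 * (t - T_I) * D + rng.normal(scale=0.5, size=T)

# Design matrix columns: intercept, t, D_t, (t - T_I) * D_t
X = np.column_stack([np.ones(T), t, D, (t - T_I) * D])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, b in zip(["beta0", "beta1 (pre-trend)", "beta2 (level change)",
                    "beta3 (slope change)"], beta):
    print(f"{name}: {b:.3f}")
```

This sketch only produces point estimates. Judging statistical significance also requires standard errors, and time series errors are often autocorrelated, which plain OLS standard errors ignore.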
Interrupted Time Series
• Pros
• It is a strong design for estimating the effects of an intervention when randomization is
not suitable or possible.
• Many ITS designs involve comparing participants to themselves, which means that the
design is more sensitive to differences in the effects of the intervention.
• It can be conducted with a small sample size.
• Cons
• Lack of randomization limits how definitively the observed change can be attributed to
the intervention.
• The pre-existing trend may not be comparable to the counterfactual trend after the
interruption.
Panel Data Analysis
Basic Concept of Panel Data Structure
Cross-sectional Data
• Observations of many different individuals
(subjects, objects) at a given time, each
observation belonging to a different individual
• Examples
• The list of UConn students enrolled in Fall 2022
• US infrared satellite map on Nov 14, 2022
• Various survey data
• Advantage
• You can compare the states of different individuals
at a given time.
• Disadvantage
• You can’t observe how each individual has
changed.
Time Series Data
• A series of data points indexed in time order
• Examples
• Total US population trend
• US recorded music revenues trend
• Daily temperature in NYC in 2022
• Advantage
• You can understand how an individual has
changed over time.
• Disadvantage
• You can’t compare this trend with other
individuals.
Panel Data
• Multi-dimensional data involving
measurements over time
• Time series and cross-sectional data can be
thought of as special cases of panel data that
are in one dimension only.
• Examples
• Per capita GDP by country over time
• Monthly sales by customer over time
• Advantages
• You can track how relative differences across
individuals change over time.
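A small sketch of the idea that time series and cross-sectional data are one-dimensional slices of a panel. The unit labels, years, and values are placeholders chosen only to make the slicing visible.

```python
import numpy as np

countries = ["A", "B", "C"]        # hypothetical units
years = [2019, 2020, 2021, 2022]   # hypothetical time periods

# Panel stored as a 2-D array: one row per unit, one column per period.
# The values are placeholders (0, 1, 2, ...), not real measurements.
panel = np.arange(12, dtype=float).reshape(len(countries), len(years))

# Time series: one unit followed over all periods (a single row).
time_series = panel[countries.index("B"), :]

# Cross-section: all units observed in one period (a single column).
cross_section = panel[:, years.index(2021)]

print("time series for B:", time_series)
print("cross-section for 2021:", cross_section)
```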
Time-invariant/varying Factors
• Time-invariant factors
• Factors that do not change over time within an individual
• Examples
• Gender (in most cases)
• Age (within a year)
• Date/place of birth
• Mobile OS (rarely changes)
• Other examples?
• Time-varying factors
• Factors that change over time
• Examples
• Season, year, holidays
• Temperature, precipitation
• US president (among US residents)
• Other examples?
Fixed Effects
• Variables that are constant within each observation group
• De-meaning the variables using the within transformation
• Individual fixed effects
• A set of dummy variables indicating individuals
• If the effect of each factor is constant over time, time-invariant factors are entirely
captured by individual fixed effects.
• Time fixed effects
• A set of dummy variables indicating time units
• If the effect of each factor is the same across individuals, time-varying factors common
across individuals are fully captured by time fixed effects.
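A minimal sketch of the within transformation on simulated data, assuming a single regressor x whose true effect on y is 2.0 and an unobserved individual effect that is correlated with x (all values are made up). It compares pooled OLS, the de-meaned (within) estimator, and a regression with one dummy variable per individual.

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods = 50, 8                     # hypothetical panel size
unit = np.repeat(np.arange(n_units), n_periods)

alpha = rng.normal(scale=3.0, size=n_units)    # time-invariant individual effect
x = 0.8 * alpha[unit] + rng.normal(size=unit.size)   # x correlated with alpha
y = 2.0 * x + alpha[unit] + rng.normal(scale=0.5, size=unit.size)

def demean_by(values, groups):
    """Subtract each group's mean from its observations (within transformation)."""
    means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - means[groups]

# Pooled OLS slope ignores alpha and is biased upward in this setup.
x_c, y_c = x - x.mean(), y - y.mean()
pooled_slope = (x_c @ y_c) / (x_c @ x_c)

# Within estimator: de-mean x and y by individual, then regress.
x_w, y_w = demean_by(x, unit), demean_by(y, unit)
within_slope = (x_w @ y_w) / (x_w @ x_w)

# Equivalent estimator: x plus a full set of individual dummies.
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
dummy_slope = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

print(f"pooled: {pooled_slope:.3f}  within: {within_slope:.3f}  dummies: {dummy_slope:.3f}")
```

The within and dummy-variable slopes coincide, which is why de-meaning is the usual shortcut when the number of individuals is large.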
Two-way Fixed Effects
• Individual fixed effects + Time fixed effects
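A minimal sketch of a two-way fixed effects estimate on a simulated balanced panel. It assumes one regressor with a true effect of 1.5 and additive individual and time effects (made-up values). On a balanced panel, including both sets of dummies is equivalent to subtracting the individual mean and the time mean and adding back the grand mean; the shortcut below relies on that balance and does not carry over directly to unbalanced panels.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 40, 10                          # hypothetical balanced panel

alpha = rng.normal(scale=2.0, size=(n_units, 1))     # individual effects
gamma = rng.normal(scale=2.0, size=(1, n_periods))   # time effects

# x is correlated with both effects, so ignoring them would bias the slope.
x = 0.5 * alpha + 0.5 * gamma + rng.normal(size=(n_units, n_periods))
y = 1.5 * x + alpha + gamma + rng.normal(scale=0.5, size=(n_units, n_periods))

def two_way_demean(a):
    """Remove individual (row) and time (column) means; add back the grand mean."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

x_dd, y_dd = two_way_demean(x), two_way_demean(y)
twfe_slope = (x_dd * y_dd).sum() / (x_dd ** 2).sum()

print(f"two-way fixed effects slope: {twfe_slope:.3f}")
```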
Two-way Fixed Effects
• Internet adoption and print newspapers
Cho, Daegon, Michael D. Smith, and Alejandro Zentner. "Internet adoption and the survival of print
newspapers: A country-level examination." Information Economics and Policy 37 (2016): 13-19.
Two-way Fixed Effects
• Internet adoption and print newspapers
• Country-year level data on
• Newspaper circulation
• Newspaper titles
• Broadband penetration rates
• GDP per capita
• Population
• Tertiary school enrollment
• Cell phone penetration
• Country fixed effects + Year fixed effects
Cho, Daegon, Michael D. Smith, and Alejandro Zentner. "Internet adoption and the survival of print
newspapers: A country-level examination." Information Economics and Policy 37 (2016): 13-19.
Two-way Fixed Effects
• Internet adoption and advertising expenditure on traditional media
Zentner, Alejandro. "Internet adoption and advertising expenditures on traditional media: An empirical analysis
using a panel of countries." Journal of Economics & Management Strategy 21, no. 4 (2012): 913-926.
Panel Analysis with Events
Panel Structure and Analysis with Internal/External Shocks
Time Series with Events
• Assumption
• The same trend would continue if the event did not occur.
• Problems
• Observing only one individual
• A change after the event might be attributable to unobserved factors other than the event.
[Figure: Y plotted against t with the event marked; callout: "What is the true counterfactual?"]
Panel Data with Events
• Assumption
• The trend of individuals without the event (control) can be a counterfactual of the
individual with the event (treated).
• Without the event, the control and treated individuals will present parallel trends in
their outcome variables.
[Figure: Y plotted against t for the Control and Treated individuals, with the event marked]
Difference-in-Differences
• A statistical technique using observational data that measures a differential effect of a
treatment on a 'treatment group' versus a 'control group' in a natural experiment.
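A minimal sketch of the 2x2 difference-in-differences calculation on simulated data. The group gap (0.5), the common time shift (1.5), and the treatment effect (2.0) are made-up values, and parallel trends hold by construction. The estimate is computed twice: from the four cell means, and as the coefficient on the treated × post interaction in a regression.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000                                  # hypothetical number of observations
treated = rng.integers(0, 2, size=n)       # 1 = treatment group, 0 = control group
post = rng.integers(0, 2, size=n)          # 1 = after the event, 0 = before
true_effect = 2.0

y = (1.0 + 0.5 * treated + 1.5 * post
     + true_effect * treated * post + rng.normal(size=n))

def cell_mean(group, period):
    """Mean outcome for one group in one period."""
    return y[(treated == group) & (post == period)].mean()

# (treated after - treated before) - (control after - control before)
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Same estimate from a regression with an interaction term.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"DiD from cell means: {did:.3f}")
print(f"interaction coefficient: {beta[3]:.3f}")
```

The regression form is the one that extends to covariates and to individual and time fixed effects.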
Difference-in-Differences
• After removing NBC content from Apple’s
iTunes store in Dec 2007
• 11.4% increase in demand for NBC’s pirated
content
• Insignificant decline in piracy for the same
content when NBC came back to iTunes
• No change in demand for NBC’s DVD
content at Amazon.com
Danaher, Brett, Samita Dhanasobhon, Michael D. Smith, and Rahul Telang. "Converting pirates without cannibalizing purchasers: The impact of
digital distribution on physical sales and internet piracy." Marketing Science 29, no. 6 (2010): 1138-1151.
Difference-in-Differences
• COVID-19 and E-Commerce Operations in Alibaba
Han, Brian Rongqing, Tianshu Sun, Leon Yang Chu, and Lixia Wu. "COVID-19 and E-commerce Operations:
Evidence from Alibaba." Manufacturing & Service Operations Management 24, no. 3 (2022): 1388-1405.
Synthetic Control
• Background
• The parallel trends assumption may not hold in many cases.
• It can be difficult to find a sufficient number of control units with similar trends.
• Method
• Pooling of a combination of untreated units to create a composite control against which
the treated unit can be compared
• The outcomes of the control units are weighted so as to construct the counterfactual
treatment-free outcome for the treated unit.
• The weights are chosen such that the treated unit and synthetic control have similar
outcomes and covariates over the pretreatment period.
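A minimal sketch of choosing synthetic control weights, under simplifying assumptions: the weights are non-negative and sum to one, and they are fitted to pretreatment outcomes only (the method described above also matches covariates). The donor series are simulated random walks, and the treated unit is built as a 0.6/0.4 mix of the first two donors, so the "true" weights are made-up values. The optimizer is SciPy's general-purpose `minimize` with the SLSQP method, not a dedicated synthetic control package.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
T_pre, n_donors = 30, 5                     # hypothetical pre-period length and donor pool

# Simulated pretreatment outcomes for the untreated (donor) units.
donors_pre = rng.normal(size=(T_pre, n_donors)).cumsum(axis=0)

# Treated unit built from known weights plus small noise (illustration only).
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
treated_pre = donors_pre @ true_w + rng.normal(scale=0.05, size=T_pre)

def pre_period_loss(w):
    """Squared pretreatment gap between the treated unit and the weighted donors."""
    return np.sum((treated_pre - donors_pre @ w) ** 2)

result = minimize(
    pre_period_loss,
    x0=np.full(n_donors, 1 / n_donors),           # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_donors,               # weights between 0 and 1
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},  # weights sum to 1
    options={"ftol": 1e-12, "maxiter": 500},
)
weights = result.x
synthetic_pre = donors_pre @ weights              # synthetic control in the pre-period

print("estimated weights:", np.round(weights, 3))
```

After the event, the same weights applied to the donors' post-period outcomes give the counterfactual path, and the gap between the treated unit and that path is the estimated effect.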
Synthetic Control
• Paywall in Online Newspapers
• Comparing New York Times with national
newspapers of similar popularity (i.e., USA
Today, Washington Post, Wall Street Journal,
Chicago Tribune, New York Daily News)
• The control newspapers are not strictly
comparable to NYT because they experienced
different temporal trends prior to the erection
of the paywall.
• Findings
• More negative impact on online heavy users
• Positive spillovers on print readership
Python Hands-On
Using Real-world Datasets
Notice for Upcoming Weeks
What you need to do, what you will do
What You Will Do
• Homework #3 canceled
• April 4 (In person)
• Group Presentation
• April 11 (In person)
• Course Review & Exam Preview
• April 18: Final Exam
• April 25 (Recording)
• Data Scraping Techniques
Jaeung Sim
Assistant Professor
School of Business, University of Connecticut
jaeung.sim@uconn.edu