RP 2 NOTES
Session 8 - Introduction to experimental design
Correlation versus Causation
Correlation refers to the direction of change of the variables (e.g. in positive correlation both
variables go down / up together).
-> However, based on correlation we cannot say what drives what. It is not possible to say for
variables A and B that are correlated whether A drives B, B drives A, or whether some other
variable C is actually responsible for both (e.g. ice cream consumption and shark casualties are
correlated, but the causation lies in high temperatures)
-> Correlation is a much wider term than causation. Correlation between X and Y might arise
without any causation between X and Y.
The reasons X and Y may correlate:
- X causes Y
- Y causes X
- Z causes both X and Y
- Spurious correlation
Reverse causality - X appears to cause Y, but it is actually the Y that causes X.
Example 1: Profitability and R&D investment -> seems that profitability means more R&D is
possible, but actually R&D causes more profitability in the first place.
Example 2: advertising expenditure and soda consumption -> does advertising increase soda
consumption, or does a prior knowledge of upcoming high demand / consumption increase
advertising expenditure?
Third variable - X appears to cause Y, but both X and Y are actually caused by Z
-> Example: the more toys a child has, the higher the IQ. However, a third variable is possible:
parent income, which increases both the number of toys and child intelligence.
Spurious correlation - a correlation between two variables with no underlying mechanism
behind it (i.e. an accidental correlation). Such a correlation is of no practical use.
-> Example: number of movies with Nicolas Cage released and swimming pool deaths - someone
has uncovered such a correlation, but it is accidental and useless / nonsensical.
When can we infer that X causes Y? Below are the three necessary (but not sufficient!)
conditions for causality:
- Relationship between X and Y (they vary together, are correlated)
- Time order (X cannot happen after Y, otherwise it is not possible for it to cause Y)
- Elimination of all other possible causal factors (by holding variables constant / controlling)
Experimental design elements:
- Independent variables (IV) that are manipulated
-> Between-subject: one participant is assigned to one condition of the IV (e.g. to A -or- B), for
instance, the participant is given either orange juice or apple juice
-> Within-subject: one participant is assigned to several experimental conditions of the IV (e.g. to
A -and- B), for instance, the participant is given both orange juice and apple juice
- Dependent variables (DV) that are measured
- Controlled extraneous factors - everything that is not an independent variable (manipulated)
must remain the same. The possible other variables are measured for statistical control. To
ensure proper control, the participants are -randomly assigned- to groups
- Context: online survey, laboratory study, field study, etc.
Btw: conditions are derived based on the IV. E.g. condition 1 is the orange juice group, condition 2
is the apple juice group. Conditions are the research groups characterized by which IV is applied
to them.
“Example:
Let's say you're studying the effect of different fertilizer types on plant growth.
-> Independent variable: Fertilizer type (Type A, Type B, No fertilizer)
-> Dependent variable: Plant height
-> Experimental Conditions:
- Plants receiving fertilizer Type A
- Plants receiving fertilizer Type B
- Control Condition: Plants receiving no fertilizer”
Experimental design - Example 1 (coupons):
-> Two or more conditions developed: coupon type 1 (condition 1), coupon type 2 (condition 2)
-> Conditions differ in one characteristic - the independent variable: coupon type 1 says “obtain a
25% discount”, coupon type 2 says “pay 75% of the price” (IV is the type of discount wording)
-> Everything else is held constant across the conditions - both coupon conditions are otherwise
the same
-> Participants are randomly assigned to one condition - 50% get coupon 1, 50% get coupon 2
-> Are there differences in terms of the DV across the conditions? - DV: percentage of participants
who made a purchase using the coupon in each group
Example 2 - nutritional labeling
- Experimental context - lab research, N = 200 boomers are given a bag of chips
- Independent variable (manipulated) - nutritional labeling
-> Control condition - no nutri-score on the bag (N = 100 participants)
-> Intervention condition - nutri-score on the bag (N = 100 participants)
- Dependent variable - the amount of chips eaten
- Research hypothesis:
H1: chips consumption will be lower in the intervention group (with nutri-score) than in the control
condition
Example 3 - default choice in organ donation
- Experimental context - online survey, N = 100 participants
- IV - default choice characteristic. Conditions: opt-in, opt-out, neutral
H1: the number of individuals consenting to organ donation will be higher in the opt-out than in
the opt-in condition
Developing experimental stimuli:
- A stimulus is the object / event to which a response is measured
- It can be visual / audio / textual, etc.
- The IV is conveyed through the stimulus. E.g. a stimulus is the presence/absence of the nutri-score, a stimulus is a high/low price, etc.
Confounding variables:
- External variables that influence the relationship between the IV and DV (i.e. they add effects on
top of the endogenous effects of the IV on the DV). They need to be controlled for (“z” variable);
sometimes they are not controlled for because the researchers do not realize they exist.
- Example of a confounder - in the experiment on nutri-score presence (IV) and chips consumption (DV), the
color of the packaging can be a confounding variable. As such, it needs to be controlled for - both packagings should be the same color; they may only differ in terms of whether they
display the nutri-score info or not.
Exercise - find a confounder in the experiment above. Solution: the confounder is the location
where each group spends their time. For instance, the treatment group might perform meditation
in a forest, while the control group might be located in the city center. This needs to be controlled
for, otherwise the possible influence on the DV might be due to the location (rather than the
intervention) itself.
Experimental validity:
- Internal validity - conclusions about the effects of IV on DV are valid within the sample. I.e. the
findings are correct (methodologically, logically, etc.)
-> Can be achieved with a correct research design, often accomplished in lab settings
- External validity - conclusions can be generalized outside the sample, e.g. lab participants ->
consumers.
-> Field studies are higher in external validity (although they suffer in internal validity) - the findings
are more generalizable, although the quality of the findings themselves might be poorer
(Continuation of Example 3 - default choice in organ donation)
- DV - binary, organ donation registration as yes / no.
- Hypothesis - H1 as stated in Example 3 above.
Randomized trials support internal validity:
-> RCT (randomized controlled trial)
- RCT in parallel groups -> It is a between-subject experiment by nature - the groups are parallel
(no interaction). Participants are randomly allocated into one of the conditions and stay within it
(e.g. drug vs placebo)
- RCT in crossover -> It is a within-subject experiment by nature. The participants receive several
interventions, the data is collected for each (and also data before and after the experiment).
Here, the randomization refers to the order of interventions (all interventions applied but
different participants subject to them in different orders).
- RCT in cluster - the random allocation into intervention groups is not applied to individual
participants but to clusters (e.g. villages, schools).
Incorrect designs with a detrimental influence on internal validity:
- Quasi-Experiment - non-random allocation to the experimental conditions
- Pre-post - the DV is measured before and after an IV category is applied. It can be labelled as a
non-experimental observational study. The problem: no control group, the conditions are solely
drawn based on before vs after data rather than an intervention vs control data.
-> Hard to determine causality
Replication experiments - repeating an experiment that has already been conducted (e.g. to
recheck the effect, robustness across situations)
Field experiment - no laboratory setting
A/B testing (split testing) - two versions of a web page / app / product are compared to determine
the better one (e.g. a new Google sign-in page offered to some people in variation A, to some in
variation B)
Session 9 - Introduction to linear regression
Regression - an approach for modeling the relationship between a dependent variable (DV) and one
or more independent variables (IV)
- Linear regression - DV is quantitative (i.e. continuous or discrete)
- Linear probability model and logistic regression - DV is qualitative (binary)
Application of regression:
- Establishing patterns of association between IVs and DV - specifically, this is done for
observational data
- Establishing the effect of IVs on DV - this is done with experimental data (allows a causal interpretation)
- Predicting the value of DV based on available values of IV
Regression involves developing a model - an equation describing the relationship between the DV
and the IVs
Linear regression - an approach for modeling the relationship between a quantitative DV and one
or more independent variables (IVs)
-> IVs can be quantitative or qualitative
-> IVs can be measured (like in a survey) or manipulated (experiment)
-> Simple linear regression - there is only 1 IV
-> Multiple linear regression - there is more than one IV
A (simple / multiple) linear model describes the linear relationship between the DV and IVs.
1. Simple linear regression model
- Dependent variable (DV) - variable to be predicted / explained
-> denoted by Y
- Independent variable (IV) - variable used to predict / explain the DV. Can be manipulated or
measured.
-> Denoted by X
Example: nutritional labeling
-> X = {0, 1} -> binary IV, either no nutri-score or nutri-score present
-> Y = quantitative (continuous) DV
Example 2: chips consumption
-> Chips consumption is to be explained by personal discipline. Hence, the former is DV, latter IV.
-> X = Quantitative (discrete) IV
-> Y = Quantitative (discrete) DV
Simple linear regression model - equation:
Y = β0 + β1X + ϵ
- Y (DV) would be the chips consumption, X (IV) would be the presence or absence of nutri-score
- Beta 0 and Beta 1 are model coefficients. Beta 0 is the intercept, Beta 1 is the coefficient of the
independent variable
- Epsilon (ϵ) is the error term
Error term - a catch-all term for what the model misses. It represents the margin of error of the
model - the difference between the theoretical value of the model (Y value after substituting X)
and the actual observed results. As such, a non-zero error term means that the model does not
fully capture the data. R^2 (see Session 10) summarizes how much of the data the model does capture.
Conducting a simple linear regression:
1. Intuition
Example: Using the hotel rating (X) to predict the number of reviews it has (Y)
-> We make the observations into a table (like on Geogebra for regression), one column is for X,
second column is Y, each row is an observation
-> Since we are doing a simple linear regression, we must attempt to fit a straight line to the data.
We use the best fit line, the equation of which is shown above. Beta 0 is the intercept, Beta 1 is
the slope.
-> How to measure how well a line (the best fit line) fits the data?
1. Measure the distance from the line to the data (vertical distance between each point and the
line)
2. Square each distance (to get rid of the negative signs)
3. Add the squared distances up
Example:
- Notice point (x_i, y_i). It has a true value y_i.
- The value of this point given by the regression (with x_i as an input) is the y_i hat.
- The error term is the difference between the two, i.e. the true value minus the regression value.
In other words,
e_i = y_i - y_i hat
This can be expressed as follows:
- Prediction (the regression-based output): y_i hat = B0 hat + B1 hat * x_i
The regression-predicted y value is the B0 coefficient estimate plus the B1 coefficient estimate times the (observed) x value
- True value (in terms of the regression terms): y_i = B0 hat + B1 hat * x_i + e_i
The true value is equal to the relevant regression-computed value plus the relevant residual (error)
- Residual (i.e. the error term, but for a sample) -> can be negative: e_i = y_i - y_i hat
The residual is the difference between the true value and the regression-computed value. There is a
residual for each of the four data points.
2. Squaring the distance -> compute: e_i ^2 for each i
(i.e. square each of the residuals)
3. Residual sum of squares (RSS) = sum of the squared error terms
Objective of linear regression: the best fit line is the line for which the RSS (Residual Sum of
Squares) is minimized.
Btw: the beta hats in the regression above are coefficient estimates.
-> The above method of computing the residuals, squaring them, adding them up, and then
choosing the line with the smallest RSS is called Ordinary Least Squares (OLS)
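The OLS steps above can be sketched in R as follows. This is a minimal illustration, not the lecture's code: the four data points are hypothetical and rss() is a helper defined here only for the example.

```r
# Hypothetical toy data (not from the lecture): hotel rating (X), number of reviews (Y)
rating    <- c(3.0, 3.5, 4.0, 4.5)
n_reviews <- c(400, 800, 900, 1200)

# Illustrative helper: residual sum of squares for a line with intercept b0 and slope b1
rss <- function(b0, b1) {
  predicted <- b0 + b1 * rating        # y_i hat
  residuals <- n_reviews - predicted   # e_i = y_i - y_i hat
  sum(residuals^2)                     # square each residual and add them up
}

# lm() finds the intercept and slope that minimize the RSS (OLS)
model  <- lm(n_reviews ~ rating)
b0_hat <- unname(coef(model)[1])
b1_hat <- unname(coef(model)[2])

rss_ols   <- rss(b0_hat, b1_hat)       # RSS of the best fit line
rss_other <- rss(b0_hat, b1_hat + 50)  # any other line gives a larger RSS
```

Comparing rss_ols with the RSS of any other candidate line (such as rss_other) shows why the OLS line is called the best fit line.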
Important:
- The population regression model refers to the entire population. The Beta 0 and Beta 1 are the
model coefficients, which are the true parameters. Epsilon is the error term (population error
term).
-> It is not possible to obtain the true population parameters and the true error term (except for
some cases) - this requires the knowledge of, for instance, the wages of every single member of
the population
- In research, sample data is used. For sample data, the values obtained through regression are
estimates. So Beta 0 hat and Beta 1 hat are estimates of the true coefficients (i.e. they are coefficient
estimates). Furthermore, e_i is the residual, which is an estimate of the error term.
-> To obtain a particular true observed sample value (y_i) we use the coefficient
estimates (that apply to all sample values), the x_i term (the relevant true observed value of the IV),
and we add the relevant residual e_i.
-> The prediction does not contain the residual term. The residual term (if available) needs to be
added to obtain the true rather than estimated DV value for the sample data.
- If population coefficients / parameters are known, the sample estimate coefficients / parameters
can be compared with them - for well-designed research, there should be little difference.
As such, the regression is a sample regression.
Regression with R: predicting hotel demand (Y, number of reviews) using hotel rating (X)
-> To conduct regression, we create a regression model (below: model1)
model1 <- lm(n_reviews ~ rating, data = hotels)
Where ‘lm’ stands for linear model. Inside, we put (dependent variable ~ independent variable,
data = name of the source data set)
-> To see the regression, we summarize the model using: summary(model1)
The output:
We know that Y = number of reviews, X = hotel rating. Given that we have simple linear regression,
the following general form applies:
Theoretical model: n_reviews = β0 + β1 * rating + ϵ
Here we have substituted n_reviews for Y and rating for X.
The R Output corresponds to the estimated model (a sample model, no residuals accounted for):
n_reviews hat = -886.9 + 454.8 * rating
We know that -886.9 is the intercept. We know that 454.8 is the rating coefficient. This gives us
the regression equation for our problem. We can, for instance, investigate the influence of a unit
change in the X variable on the Y variable. This corresponds to the slope = 454.8
We can plot the regression line using the function below:
abline(model1, col = "red")
(model1 regression line plotted in red)
-> The slope (Beta 1) corresponds to the change in Y as a result of a unit change in X. In
experimental (and quasi-experimental) research, this can be interpreted as the causal effect of X
on Y. For observational research, this represents only an association / expected difference (no
causality).
- Multiple linear regression - a linear regression with more than one IV.
-> There is one DV, namely Y
-> There are many IVs, namely X_1, X_2, … X_k
Example: Predicting backpack sales
-> Y = backpack sales
-> X1 = brand of the bag, X2 = size of the bag, X3 = average rating of a bag, …
Mathematically, the multiple linear regression is expressed as follows:
Y = β0 + β1X1 + β2X2 + … + βkXk + ϵ
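As before, there is a distinction between population and sample data: for the population, the Betas are the true coefficients and ϵ is the error term; for sample data, the Beta hats are coefficient estimates and e_i is the residual.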
Example in R: multiple linear regression examining hotel popularity (number of reviews) as the
dependent variable; hotel rating and distance to the city center as independent variables
lm(formula = dependent variable ~ independent variable 1 + independent variable 2, data =
database)
-> This can be expanded to more IVs, e.g. IV1 + IV2 + IV3, etc.
Substituting the computed coefficients yields the regression model:
-> The coefficient of X1 corresponds to the change in Y as a result of a unit change in X1, holding
all other IV values (X2, X3, …) constant.
Extra workshop R code:
-> Creates a new variable (column in the main data set) called top_rated. It assesses whether an
observation has a rating above 4. If yes, then the top_rated variable is assigned a value of 1, else
it is assigned a value of 0.
-> It creates a new variable called hotels_only. The variable includes inputs from the database
“hotels” filtered by the variable accommodation type being a hotel (rather than a home, for
instance).
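The workshop code itself is not reproduced in these notes. A minimal base R sketch of the two steps described above might look as follows; the small data frame and the column names rating and accommodation_type are assumptions made for illustration, and the workshop may have used different functions.

```r
# Hypothetical stand-in for the workshop data set (column names are assumptions)
hotels <- data.frame(
  rating = c(4.5, 3.8, 4.0, 4.2),
  accommodation_type = c("Hotel", "Home", "Hotel", "Hotel"),
  stringsAsFactors = FALSE
)

# New column top_rated: 1 if the rating is above 4, otherwise 0
hotels$top_rated <- ifelse(hotels$rating > 4, 1, 0)

# New data set hotels_only: only the rows whose accommodation type is a hotel
hotels_only <- subset(hotels, accommodation_type == "Hotel")
```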
Session 10 - model assessment and inference
(Often independent variables that are manipulated are true IVs, those that are measured are
control variables; more importantly, this depends on the context of what we want to assess)
(When we have two binary IVs, there are four conditions):
………
- Assessing how well a model fits the data (remember, a residual refers to a single observation)
-> Goodness of fit (R^2): R^2 = (TSS - RSS) / TSS = 1 - RSS / TSS
Where:
- RSS (Residual Sum of Squares) - the variation of a DV that cannot be explained by the IVs
- TSS (Total Sum of Squares) - the variation of a DV
- TSS - RSS -> the variation of DV that can be explained by the IVs (logical)
- R^2 - the proportion of the variation of DV that is explained by the IVs (the higher, the better)
-> E.g. R^2 = 65% means that 65% of the variation of DV is explained by the IVs
In practice:
- To calculate the RSS, we do what we did before. I.e. we take our regression line and the points;
for each point we measure the vertical distance to the line (this is the residual); we square these
and add them up
- The total sum of squares is like variance but without dividing by the number of observations
(since we also do not divide the RSS).
We take a particular value minus the mean value (for each), square and add up.
- 0 <= R^2 <= 1; the higher, the better the model fits the data
-> If R^2 = 0, RSS = TSS -> zero explanatory power, all DV variation is an unexplained variation
-> If R^2 = 1, RSS = 0 -> all variation of the DV is explained by the IVs.
Problem: every time we add an extra IV to the model, R^2 increases (or at least never decreases). This happens even if the
new variable does not help in explaining the DV. This is because the TSS remains constant, while
RSS can only become smaller or stay the same (think of a quartic vs quadratic regression).
Solution - adjusted R^2 (as opposed to multiple R^2 above). We only consider this one in
statistical outputs!
Adjusted R^2 = 1 - [RSS / (n - k)] / [TSS / (n - 1)]
Where n is the number of observations (sample size), k is the number of independent variables
(but also including intercept, so k = 2 for a simple linear regression)
-> It takes into account the number of observations and the number of variables. It penalizes for
adding unnecessary independent variables.
-> It is never larger than the regular R^2 and can become negative when the regular R^2 is
close to 0.
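As a check on the definitions above, R^2 and adjusted R^2 can be computed by hand and compared with the values reported by summary(). This is a sketch on hypothetical toy data; k counts the intercept, following the convention used in these notes.

```r
# Hypothetical toy data (not from the lecture)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

model <- lm(y ~ x)

rss <- sum(resid(model)^2)      # unexplained variation of the DV
tss <- sum((y - mean(y))^2)     # total variation of the DV
n   <- length(y)                # number of observations
k   <- length(coef(model))      # number of coefficients incl. intercept (2 here)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / (n - k)) / (tss / (n - 1))

# Values reported by R, for comparison
r2_reported     <- summary(model)$r.squared
adj_r2_reported <- summary(model)$adj.r.squared
```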
Example - regression with a quantitative (discrete) IV, interpreting the output
IV - nutritional involvement (from 1 to 7), denoted as INVO
DV - chips consumption, denoted as CONS
H1: chips consumption will be smaller for participants with high nutritional involvement than with
low nutritional involvement
Interpretation:
- INVO (involvement) can be between 1 and 7. The coefficient of INVO is -0.9. As such, a unit
increase in INVO decreases (-0.9 < 0) the chip consumption by 0.9 grams.
- The residuals are summarized near the top of the output. They refer to the difference between the observed
(actual) and predicted values of the DV. Statistics about them are shown - for instance, the
minimum residual has a value of -12.391.
- Adjusted R^2 is also provided (the only important one). It is 0.078 - as such, 7.8% of the variation in
DV is explained by the model. The multiple R^2 is 0.089, this one is not important. Adding an
unnecessary variable to the model would increase the multiple R^2 but decrease the adjusted R^2.
- H0 (null hypothesis) for the regression is always that there is no significant relationship between
the DV and IVs, i.e. that the coefficients for all IVs in the model are zero.
-> At the very bottom, the F-statistic and p-value are provided. These refer to the H0 above. Since
the observed p < 0.05, we can reject H0 -> there is at least one coefficient that is significantly non-zero.
- The coefficient estimates are provided (they result from applying the OLS to sample data). For
instance, the coefficient of INVO is -0.9071.
- The standard error is also provided for each coefficient. It measures the precision of the
estimated coefficients (a lower standard error is better). The smaller the standard error, the
more precisely the sample estimate reflects the population value. Furthermore, the standard error
shrinks as the sample size grows (roughly in proportion to 1 / sqrt(n)) - a larger sample means a smaller standard error.
- The t value is obtained by dividing the estimate by the standard error, for instance 30.13 / 1.52
= 19.72.
- The p-value tells whether a coefficient is significant. One star means that observed p < 0.05,
two mean p < 0.01, three mean p < 0.001.
Example 2: regression with a binary IV
- IV - presence of nutri-score (NUTR = 0, control condition, no nutri-score, NUTR = 1,
intervention condition, nutri-score present)
- DV - chips consumption
- H1: the presence of nutri-score decreases chips consumption
lm(formula = CONS ~ NUTR, data = chips)
-> In regression with a binary IV, Beta 0 (intercept) is the mean chips consumption in the control
condition. Beta 0 + Beta 1 is the mean chips consumption in the intervention condition.
[Mathematically, X = 0 in the former, X = 1 in the latter]
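The link between the coefficients and the group means can be verified on a small example. The chips data frame below is hypothetical toy data (not the lecture data set); only the variable names NUTR and CONS are taken from the notes.

```r
# Hypothetical toy data: three participants per condition
chips <- data.frame(
  NUTR = c(0, 0, 0, 1, 1, 1),          # 0 = control, 1 = nutri-score present
  CONS = c(30, 32, 34, 20, 22, 24)     # chips consumption
)

model <- lm(CONS ~ NUTR, data = chips)
b0 <- unname(coef(model)["(Intercept)"])
b1 <- unname(coef(model)["NUTR"])

# Group means computed directly from the data
mean_control      <- mean(chips$CONS[chips$NUTR == 0])
mean_intervention <- mean(chips$CONS[chips$NUTR == 1])

# Beta 0 equals the control mean; Beta 0 + Beta 1 equals the intervention mean
c(b0, mean_control)
c(b0 + b1, mean_intervention)
```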
We can also combine the NUTR and INVO IVs, where the former is binary and the latter is
quantitative (discrete).
H1: The presence of the nutri-score label decreases chips consumption
Answer: The regression coefficient of NUTR is negative, hence when the nutri-score is present
(NUTR = 1), the consumption goes down by 8.51 grams, ceteris paribus (since more than 1 IV)!
Furthermore, it is significant. Hence, H1 confirmed.
- Example: categorical (rather than quantitative) IV with more than 2 categories (otherwise it is
binary)
Set-up:
- One IV with 3 categories
- …
We can visualize the situation using the following R function:
We aggregate the ORGA variable by the values of COND variable. Data comes from the organ
database. We display the mean value for each aggregation.
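The R function itself is missing from these notes. One way to write the aggregation just described, assuming base R's aggregate() was used, is sketched below; the organ data frame holds hypothetical values, and only the names ORGA and COND come from the notes.

```r
# Hypothetical toy data: two participants per condition
organ <- data.frame(
  COND = c(1, 1, 2, 2, 3, 3),
  ORGA = c(4, 5, 8, 9, 7, 8)
)

# Mean of ORGA for each value of COND
means_by_cond <- aggregate(ORGA ~ COND, data = organ, FUN = mean)
means_by_cond
```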
-> We have a nominal IV with n = 3 categories, which makes the regression harder to conduct. We need to
use dummy variables in the regression analysis. The number of dummy variables is always n - 1,
so: 3 - 1 = 2
-> So we have 3 categories of a single IV in total (one for each IV condition), 2 of these will be
used in the regression. The value of the third category can be concluded from the regression by
setting the other 2 variables to zero.
We generate symbols for the three categories: COND_1, COND_2, COND_3
Our estimation model looks as follows:
As above, only COND_2 and COND_3 are used in the regression. COND_1 is not used - it is now
considered the base category.
Manual dummy coding:
-> We create three dummy variables. To do so, let’s look at the example of COND_1. We create
this new variable COND_1. It is derived from the COND variable. Specifically, if a value in
COND is equal to “1”, COND_1 gets a value of 1. Else, it gets a value of 0.
For COND_2, if a value in COND is equal to 2, it is assigned as 1 in COND_2, else as 0.
-> This is all that needs to be done (assigning 1 and 0). Now, the two variables COND_2 and
COND_3 are used in the regression.
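The manual dummy coding described above might be written as follows. This is a sketch with hypothetical toy values for ORGA; the original workshop code is not shown in these notes and may differ.

```r
# Hypothetical toy data: two participants per condition
organ <- data.frame(
  COND = c(1, 1, 2, 2, 3, 3),
  ORGA = c(4, 5, 8, 9, 7, 8)
)

# One dummy per category: 1 if the observation is in that category, else 0
organ$COND_1 <- ifelse(organ$COND == 1, 1, 0)
organ$COND_2 <- ifelse(organ$COND == 2, 1, 0)
organ$COND_3 <- ifelse(organ$COND == 3, 1, 0)

# Only n - 1 = 2 dummies enter the regression; COND_1 is the base category
model_dummy <- lm(ORGA ~ COND_2 + COND_3, data = organ)
b <- coef(model_dummy)
b
```

With these toy values the intercept equals the mean of the base category, and the two dummy coefficients are the differences of the other category means from it.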
Alternatively (equivalently):
-> We use the factor (factor = category) function, which converts COND into a categorical variable
(as we did before). COND_f is the resulting categorized variable.
-> The relevel function is applied to COND_f (to the categorized variable) with an input of 1 (see
above). It updates COND_f and sets category 1 as the base category (since 1 is the input; if we put “dog”, then
the category named “dog” would become the base).
-> The input to the regression is now not two of the three dummy variables as IVs, but a single
categorized variable, COND_f.
-> The second (f2) and third (f3) categories are displayed with their coefficients. Since f1 is the
base, it is not displayed.
NOW THE IMPORTANT PART (applies to both the manual and alternative method):
-> When the dummy variables are created, one is not used in the regression. It is the base
category.
-> The coefficients of the second dummy and third dummy (f2 and f3) are relative to the base
dummy variable.
-> The coefficient of the base variable (f1) is equal to the intercept coefficient (when we set f2 and
f3 to 0). So B1 = 4.22
-> The displayed coefficients of f2 and f3 are once again relative to f1. To obtain the true
coefficients of these categories, we add the coefficient of f1 to the displayed values.
-> So B2 = 4.02 + 4.22 = 8.24, while B3 = 3.66 + 4.22 = 7.88
-> In other words, the displayed coefficients (4.02 and 3.66) are the difference of their respective
category coefficients compared to the base category.
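The factor / relevel route and the interpretation of the coefficients can be sketched as follows. The data are hypothetical toy values (so the numbers differ from the 4.22 / 4.02 / 3.66 in the lecture output), and the name COND_g for the re-based factor is an assumption introduced for this example.

```r
# Hypothetical toy data: two participants per condition
organ <- data.frame(
  COND = c(1, 1, 2, 2, 3, 3),
  ORGA = c(4, 5, 8, 9, 7, 8)
)

# Convert COND into a factor and make category "1" the base category
organ$COND_f <- factor(organ$COND)
organ$COND_f <- relevel(organ$COND_f, ref = "1")

model_f <- lm(ORGA ~ COND_f, data = organ)
b <- coef(model_f)

# Intercept = base category value; other categories = intercept + displayed coefficient
value_1 <- unname(b["(Intercept)"])
value_2 <- unname(b["(Intercept)"] + b["COND_f2"])
value_3 <- unname(b["(Intercept)"] + b["COND_f3"])

# Same model with category "2" as the base: coefficients are now differences from category 2
organ$COND_g <- relevel(organ$COND_f, ref = "2")
model_g <- lm(ORGA ~ COND_g, data = organ)
diff_3_vs_2 <- unname(coef(model_g)["COND_g3"])
```

Changing the base category changes which differences are displayed, but the implied value of each category stays the same.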
Alternatively, if we used f2 as the base category:
-> Same logic applies and the same results are obtained.
-> Remember, however, that the coefficients of the conditions are the difference relative to the
base category. For COND_3, the difference to the base category COND_2 is -0.36 (so COND_3 is
smaller by 0.36 than COND_2). Just as before.
-> This difference of -0.36 is not significant - see that the p-value is 0.347! As such, the two
coefficients are not significantly different (does not matter for the regression itself).
Workshop:
-> “genre” is a factor (categorized variable)
-> to recognize the base category, see which one is missing from the output. Classic rock is not
included among the coefficients - as such, it is the base category.
-> tickets$genre <- relevel(tickets$genre, "other")
The above changes the base category to “other”. Executing the same regression formula as
above would now lead to a different output with “other” (rather than classic rock) as the base.
-> predict is a function that returns the DV values predicted by a regression model, e.g. to fill in
missing values of the DV. The function is: predict(regression_model_name, newdata = data_with_the_IV_values)
-> a subset function displays only the rows that meet a certain condition.
E.g. subset(tickets, price > 100) (from the tickets data set, only the rows with price > 100 are displayed)
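A small sketch of predict() and subset() in use. The tickets data frame here is a hypothetical stand-in, and the column sold is an assumed DV name chosen for illustration (only price and genre are mentioned in the notes).

```r
# Hypothetical stand-in for the tickets data set
tickets <- data.frame(
  price = c(50, 80, 120, 150),
  sold  = c(200, 170, 130, 100)   # assumed DV, for illustration only
)

model <- lm(sold ~ price, data = tickets)

# Predicted DV values for new observations with known IV values
new_tickets    <- data.frame(price = c(100, 200))
predicted_sold <- predict(model, newdata = new_tickets)

# Only the rows that meet the condition price > 100
expensive <- subset(tickets, price > 100)
```

The newdata data frame must contain columns named like the IVs of the model (here price).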
-> How to assess if a new IV is useful? If the adjusted R^2 increases, it is likely useful. If it
decreases, it is likely redundant.
Session 11 - moderation and interaction terms in regressions
Moderation - the relationship between the DV and an IV changes when another variable / the
moderator variable changes.
-> The moderator variable can be measured (e.g. Likert scale)
-> The moderator variables can be manipulated (through different experimental conditions)
Example: packaging color as the moderating variable
-> The IV is the presence of nutri score, the moderator variable is packaging color
-> Note that both the IV and moderator are manipulated (rather than the moderator being
measured)
-> Hypothesis: The negative effect of nutri-score on chips consumption will be more pronounced
with the orange packaging (vs. the green one).
-> To visually represent the hypothesis, the y-axis always contains the dependent variable, while
the x-axis always contains the moderator. The bar chart / line chart itself is separated into the
control (here, no label) and intervention (nutri-score label present), black vs grey.
-> Let’s consider the base case situation (green packaging). The change from control to
intervention Δ1 < 0 (a decrease).
-> For the orange packaging, the change from control to intervention is Δ2 < 0 (also a decrease)
-> Furthermore, we can notice that the effect is stronger for the orange packaging than the green
one. To not bother with positive / negative signs in different situations, we consider the absolute
value. Specifically, |Δ1| < |Δ2|
-> Note: above we have 4 experimental conditions (as the moderator is manipulated). If the
moderator is measured instead, it does not constitute a new experimental condition (then we
would have just two). This is a 2 x 2 experiment (2 conditions in the first IV x 2 conditions in the
second IV / moderator, for a total of 4 conditions).
-> The measured moderating variable is put on the x-axis. Nutritional involvement (INVO) is a Likert-scale item from 1 to 7.
As such, it is represented as a line chart (quantitative data).
-> There are only two conditions (no label vs label). Nutritional involvement does not create additional
conditions as it is measured.
-> The change from control to intervention is negative for low nutritional involvement: Δ1 < 0.
Same for high. However, for high nutritional involvement the change is larger in absolute value.
-> Conclusion: the decrease in chips consumption as a result of introducing the nutri-score is
stronger when the nutritional involvement is high (rather than low).
-> Note the position of the lines. Since the hypothesis is that the DV will go down from control
to intervention, the control line is above the intervention line.
-> The above are conceptual hypothesis representations, no p-values considered.
- Interpretation of interaction coefficients - binary moderator
Example: the packaging color as a moderator in the nutri-score and chips consumption
experiment. So DV = CONS, IV = NUTR, moderator = ORAN (stands for orange package)
H1: The presence of the nutri-score label decreases chips consumption
H2: The negative effect of nutri-score on chips consumption will be more pronounced with the
orange packaging (vs. the green one)
model_X <- lm(CONS ~ NUTR + ORAN)
-> The above model is not appropriate! It allows us to see the overall effect of the nutri-score and of color
(separately) on chips consumption but it does not tell anything about the moderation effect. We
cannot distinguish the effect of NUTR for different levels of ORAN
-> As such, we need to use a different model. We cannot even use the above model to answer H1
as the aim is to always use a single regression model to answer all hypotheses.
The correct model when considering moderation effects:
We use: lm(DV ~ IV * MODERATOR)
(note that the multiplication alone is enough: in R, IV * MODERATOR already expands to
IV + MODERATOR + IV:MODERATOR, so adding the main effects separately is redundant)
-> NUTR (-7.9787) represents the effect of NUTR on the DV when ORAN = 0 (base case)
-> ORAN (2.4559) represents the effect of ORAN on the DV when NUTR = 0 (base case)
-> NUTR:ORAN is the interaction term. A significant non-zero value indicates a moderation
effect. The value of -3.21 means that if the moderator equals 1, then the effect of NUTR on the
DV is lower by 3.21 compared to its base effect of -7.97. If the moderator equals 0 (is absent),
then the base case value only applies. More generally, the value of the interaction term indicates
the change of the base effect of the IV (here: NUTR) as a result of a unit change in the
moderator. (Symmetrically, it is also the change in the effect of the moderator as a result of a unit change in the IV.)
So if the moderator (in a non-binary case) were to increase from 0 (absent) to 2, the effect of NUTR would
become -7.97 - 2 * 3.21
-> Note the general regression model above in the presence of a moderator. We have the
interaction e ect on top of the individual e ects, expressed as a multiplication. We basically treat
the moderator as a regular IV.
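For reference, the moderated regression can be written out as a formula. This is only a sketch in generic notation: the symbols beta_0 to beta_3 and epsilon are labels assumed here, and the numbers are the coefficient values quoted in these notes.

```latex
\begin{aligned}
\text{CONS} &= \beta_0 + \beta_1\,\text{NUTR} + \beta_2\,\text{ORAN}
              + \beta_3\,(\text{NUTR}\times\text{ORAN}) + \varepsilon \\
\text{effect of NUTR} &= \beta_1 + \beta_3\,\text{ORAN} \\
\text{ORAN}=0:&\quad -7.98 \\
\text{ORAN}=1:&\quad -7.98 + (-3.21) = -11.19
\end{aligned}
```

Setting ORAN = 0 makes the interaction term vanish, which is why the coefficient of NUTR alone is the base case effect.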
- We can use the above output to confirm / reject H1. Precisely speaking:
-> When considering a non-moderation situation (as in H1), we always use the base case value for
the other / moderator variable (equal to 0) to assess the sole effect of the IV of interest. We
basically want the interaction term (that has multiplication in it) to become zero.
- We can also use the output to evaluate H2
-> When we evaluate the moderation effect, we first consider the case with the moderator equal to
zero
-> Then we consider the case with the moderator equal to one
-> Finally, we compare the two cases
-> For the base case (moderator = 0), we are interested in the coefficient of the IV itself
-> For the moderation case (moderator = 1), we are interested in the effect of the IV that is now
influenced by the moderator. This effect is the sum of the base case effect and the interaction
effect (the sum of the coefficient of the IV and the coefficient of the interaction term; the coefficient of the
moderator / other IV itself is not relevant).
-> Finally, we investigate the difference between the two above effects. This difference is equal to
the coefficient of the interaction term. If the interaction coefficient is statistically significant (p <
0.05), then the difference is also significant (a moderation effect occurs). Furthermore, we also
compare the total coefficients (from point 1 and point 2), specifically their absolute values, to
confirm the result. Here, the effect is stronger (i.e. more negative) when the moderator is present.
The above function gives the coefficient of the IV NUTR when ORAN = 0 and
when ORAN = 1. We see that the effect of NUTR is -11.19 when ORAN = 1, just as we obtained
manually.
-> The function is contrast; "ORAN" can be replaced by any moderator variable. The other elements
stay the same.
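The three manual steps (base case, moderator present, difference) can be sketched in base R without the contrast function. The data below are simulated purely for illustration: the variable names follow the notes, but the sample size and the numbers used to generate CONS are assumptions, so the output will not match the course output.

```r
# Simulated 2 x 2 experiment; names follow the notes, values are made up
set.seed(1)
n <- 400
d <- data.frame(
  NUTR = rep(c(0, 1), times = n / 2),  # 0 = no label, 1 = nutri-score label
  ORAN = rep(c(0, 1), each = n / 2)    # 0 = green package, 1 = orange package
)
d$CONS <- 30 - 8 * d$NUTR + 2.5 * d$ORAN - 3 * d$NUTR * d$ORAN + rnorm(n, sd = 4)

fit <- lm(CONS ~ NUTR * ORAN, data = d)
b <- coef(fit)

# Step 1: effect of NUTR in the base case (ORAN = 0)
effect_green <- unname(b["NUTR"])
# Step 2: effect of NUTR when the moderator is present (ORAN = 1)
effect_orange <- unname(b["NUTR"] + b["NUTR:ORAN"])
# Step 3: the difference between the two is the interaction coefficient
difference <- effect_orange - effect_green

print(c(green = effect_green, orange = effect_orange, difference = difference))
# p-value of the interaction term, used to judge the moderation effect
print(summary(fit)$coefficients["NUTR:ORAN", "Pr(>|t|)"])
```

With two binary variables the model is saturated, so each effect equals the difference between the corresponding group means.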
-> The findings can be displayed graphically. For a green package, as we move from the control
to the intervention the change is -7.98. For an orange package, this becomes -11.2. The latter is 3.2
lower (hence stronger) than the former. We only consider the change, not the start and end values
of consumption (as these might differ between the moderator conditions).
-> A regression with a moderator term has more than one predictor, so it is a multiple linear regression
-> Check every single coefficient for significance
- Interpretation of interaction coefficients - quantitative moderator
-> Same case as before but now instead of the binary moderator (packaging color) we have a
quantitative moderator: nutritional involvement (1 - 7), denoted as INVO
Hypothesis: The negative effect of nutri-score on chips consumption will be more pronounced for
participants with high nutritional involvement (than with low nutritional involvement).
Results section - part 0 (regression outcome)!:
-> We run a regression with chips consumption as DV, the binary variable for nutri-score label (vs
no label) as one IV, the quantitative variable for nutritional involvement as another IV, and their
interaction term as another IV. We obtain the following result: CONS = 21.5 - 5.8 * NUTR ….
-> To answer H1 (whether NUTR influences consumption), we set INVO to zero (just as before). This is correct despite INVO having a range from 1 to 7.
To answer H2 (with moderation):
- We focus on the coefficients of NUTR and NUTR * INVO
- The hypothesis is basically:
So as before, with moderation the x-axis is the moderator and the y-axis is the value of the DV. Different
lines (grey vs black) represent different conditions of the IV.
-> We can investigate the slope for each IV condition (i.e. for different values of NUTR)
-> A non-zero slope (which must also be significant) indicates that a moderation effect occurs - with
increasing values of the moderator (on the x-axis), the y-value becomes significantly different.
-> In the output above, we can see that the moderation effect is significant for both conditions
(it is enough if it is significant for just the intervention slope). The moderation effect also shows up
more strongly in the intervention condition (the slope is steeper + more significant).
-> The function is: whatever <- emtrends(reg_model_name, ~ IV_name, var = "moderator_name")
And then: test(whatever)
This shows the presence of the moderation effect. Now, let's investigate the moderation effect
more closely. Let's investigate the regression model that was obtained again:
We investigate the effect of the IV on the DV for different values of the moderator. Remember that
the moderator can range from 1 to 7 in this particular case (Likert scale score for INVO).
- When INVO = 0, CONS = -5.8 * NUTR (simple effect of NUTR on CONS; however, here it is
not interesting as it lies outside the domain of INVO)
- When INVO = 1, CONS = -5.8 * NUTR - 2.4 * NUTR * 1 = -8.2 NUTR
- When INVO = 2, CONS = -5.8 * NUTR - 2.4 * NUTR * 2 = -10.6 NUTR
- And so on…
- Just as for a binary variable, we can automate this calculation with the contrast function
- contrast(probe, "pairwise", by = "moderator_name", scale = -1) [probe must be defined earlier]
- The same result is obtained for INVO = 1 and INVO = 2 as manually above. Further results are
presented for greater values of INVO. Also, the significance is stated for each value of the
moderator. All results are significant. The estimates are the total effects of the IV on the DV under a
specific value of the moderator.
- We can see that the negative effect of the IV on the DV becomes stronger with increasing values of
the moderator (e.g. -8.17 vs -22.42).
- This is as predicted by the hypothesis. This moderator-related increase in the effect of the IV
on the DV (when we go from control to intervention) is visible in the graph, e.g. -10.6 for INVO = 2
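The manual calculation for each value of INVO can also be sketched in a few lines of base R. The two coefficients are the rounded values quoted in these notes (-5.8 and -2.4); because of that rounding, the results differ slightly from the contrast output (e.g. -8.2 here vs -8.17 there).

```r
# Coefficients as quoted (rounded) in the notes
b_nutr <- -5.8         # coefficient of NUTR
b_interaction <- -2.4  # coefficient of the NUTR x INVO interaction

invo <- 1:7                                    # Likert range of the moderator
effect_of_nutr <- b_nutr + b_interaction * invo

# One row per moderator value: total effect of NUTR on CONS at that value
print(data.frame(INVO = invo, effect_of_NUTR = effect_of_nutr))
```

Each row is simply the coefficient of NUTR plus INVO times the interaction coefficient, which is the same sum the contrast function reports.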
A results section can now be written to answer the hypothesis.
- Similar to a binary moderator, we report the coefficient for the lowest value of the moderator
and for the highest value of the moderator. Then, we compare these coefficients. Finally, we
also report the coefficient of the interaction term (between IV and moderator) itself.
- As for binary, we only report the significance for stand-alone coefficients (so just beta 3). If this
coefficient is significant, everything else is significant.
Workshop:
Before the "probe" is inserted into the contrast function, it is first defined as above (values of the
moderator specified in step 1 (just two values for a binary moderator, more for quantitative), probe
defined in step 2 (we are interested in the changes of the NUTR coefficient due to INVO), contrast
execution in step 3).
Module 12 - advanced topics in linear regression
Example: regression with IV and a moderator, plus additional IVs (controls!)
-> Same setup as with the binary moderator (CONS, NUTR, ORAN) but now two additional
control IVs are introduced: GENDER (binary) and AGE (quantitative)
-> Control variables are held constant throughout the regression. They are like moderators but
they are not the subject of the research - they are never included in the hypotheses! However, they still
have an influence, so we need to control for them in the regression (by simply including them as
separate IVs; we do not care about interaction effects with the control variables).
When control variables are present, we use the statement “ceteris paribus” when presenting the
conclusions.
- The results section (part 0) is as before + we mention the controls:
- Otherwise, testing for H1 and H2 goes exactly as before (we ignore the controls). However, in
the results section we need to use the term “ceteris paribus”. We use this term whenever we
report the value of a coefficient / sum of coefficients. See below for an example:
— Homoscedasticity of error ε
-> The assumption that the variance (σ^2) of the error term is constant for all values of X (of the IV)
-> This is an assumption because we never observe the population error term.
-> Instead, in practice we check the variance of the residuals when assessing this
As a reminder, the residuals are the differences between the observed and the predicted values of the DV: e_i = y_i - ŷ_i
Example test: DV is food expenditure, IV is income level
-> We can make a plot with DV on the y axis and IV on the x axis
Since income is discrete in this case, we have several food expenditure values corresponding to
the same income score (from 0 to 12)
-> We can assess the spread of food expenditure as the income level changes.
-> For lower income levels, the spread is lower. For higher income levels, the spread is larger.
-> In other words, there is a greater variance of DV when the IV is larger
We can now run a regular regression: lm(expenditures ~ income_level, data = expenditures)
-> A regression line can be drawn on the plot above, where the intercept is the y-intercept and the
coefficient of the income level (IV) is the slope
-> Alternatively, we can use R:
-> We use the function abline(regression_model_name, col = color)
-> The regression plot is shown above.
-> The variation of the residuals increases as the income level increases (since there are larger
differences between the predicted and observed values)
-> In other words, the variance of the residuals is NOT constant / equal across the values of the
independent variable
-> This is known as heteroskedasticity (rather than homoscedasticity)
Another way to test this: plotting the fitted (i.e. estimated) values on the x-axis and the residuals on
the y-axis
-> we use the function plot(fitted(regression_name), residuals(regression_name), main = anything)
-> The spread of the residuals (i.e. their variance) increases as the fitted (estimated)
values increase.
To recognize heteroskedasticity, we can visually look for patterns in the plots
-> On the left, the residuals are scattered basically equally - homoscedasticity
-> On the right, there is a pattern (no apparent randomness) - more specifically, the residuals
become more scattered as we move along the fitted values. We have heteroscedasticity.
Alternative - Breusch-Pagan Test
Null hypothesis H0: the variance of the -error term- is constant
Alternative hypothesis H1: the variance of the error term is NOT constant
- bptest(regression_model_name)
- Make a conclusion based on the p-value (if less than 0.05, then null hypothesis rejected)
- Conclusion: there is heteroskedasticity in the -residuals- of our model
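To see what the test does, here is a minimal base-R sketch of one common form of the Breusch-Pagan statistic: regress the squared residuals on the IV and use n times the R^2 of that auxiliary regression. This is assumed here to correspond to the default (studentized) version of bptest; the data are simulated and the variable names are only illustrative.

```r
# Simulated data in which the error spread grows with income (assumption)
set.seed(42)
n <- 600
income_level <- sample(0:12, n, replace = TRUE)
expenditure <- 50 + 10 * income_level + rnorm(n, sd = 2 + 3 * income_level)

fit <- lm(expenditure ~ income_level)

# Auxiliary regression: squared residuals explained by the IV
aux <- lm(residuals(fit)^2 ~ income_level)

# Test statistic and p-value (chi-squared with 1 df, one IV in the model)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A p-value below 0.05 leads to rejecting H0 of constant variance, exactly as in the decision rule above.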
-> Heteroskedasticity distorts the standard error estimates provided by the regression model
-> We can obtain heteroskedasticity-robust standard errors as shown below
- We use the function coeftest(regression_model_name, vcov = vcovHC(regression_model_name))
-> The coefficient estimates remain unchanged
-> The standard error values change, and so do the t- and p-values (not visible here)
-> Indeed, the standard error influences the p-values. A greater standard error means greater p-values (and vice versa). This is because with more deviation / variance we are less certain of the
results.
-> Above, the standard error of the income level increases to 0.079985 after adjusting for
heteroscedasticity. The p-value nonetheless remains significant, which reinforces the robustness
of this variable's effect on the DV
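The idea behind robust standard errors can be sketched by hand in base R. The block below builds a "sandwich" covariance matrix with HC3 weights, which is assumed here to be what vcovHC uses by default; the data are simulated and the variable names are illustrative only.

```r
# Simulated heteroskedastic data (assumption, not the course dataset)
set.seed(42)
n <- 600
income_level <- sample(0:12, n, replace = TRUE)
expenditure <- 50 + 10 * income_level + rnorm(n, sd = 2 + 3 * income_level)
fit <- lm(expenditure ~ income_level)

X <- model.matrix(fit)   # design matrix (intercept + IV)
e <- residuals(fit)
h <- hatvalues(fit)      # leverage of each observation

bread <- solve(crossprod(X))            # (X'X)^-1
meat <- crossprod(X * (e / (1 - h)))    # X' diag(e^2 / (1 - h)^2) X
robust_vcov <- bread %*% meat %*% bread

classical_se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(robust_vcov))

# Same coefficients, different standard errors, hence different t and p
t_robust <- coef(fit) / robust_se
p_robust <- 2 * pt(abs(t_robust), df = df.residual(fit), lower.tail = FALSE)

print(cbind(estimate = coef(fit), classical_se, robust_se, p_robust))
```

The estimates column is untouched; only the uncertainty around them is recomputed.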
The results of this can be reported as follows:
———-> Collinearity
-> It refers to the situation when two (or more) -IVs- are closely related to one another
(This is basically a high correlation but not between an IV and the DV, but between an IV and
another IV)
Example: DV = credit card balance, IV = age, IV = limit, IV = rating
- The IV limit and the IV age are not collinear - there is no correlation between them
- The IV limit and the IV rating are highly collinear - these variables are highly correlated
Problems with collinearity:
- it poses issues for the regression as it can be difficult to separate out the individual effects of the
collinear IVs on the DV (hard to say which part of the effect on the DV comes specifically from
which of the two collinear IVs)
- Example: since limit and rating tend to increase/decrease together, it is hard to determine how
each of these is separately associated with the DV credit card balance
So: collinearity reduces the accuracy of the estimated coefficients and hence increases the
standard error -> this lowers the t-value -> this raises the p-value -> we are less likely to obtain a
significant coefficient
-> When the IVs used in the regression are not collinear (as in Age and Limit), statistical
significance of the coefficients is present
-> When the IVs used are collinear (such as Rating and Limit), statistical significance is absent - at
least for one of the two IVs (here limit is not significant, while rating is still significant, but less so)
—-> Multicollinearity
- It occurs when there is collinearity (correlation) between three or more independent variables,
even if no individual pair of these variables has a particularly high collinearity (correlation)
Perfect multicollinearity - example: organ donation model with an IV (COND) that has more than
two categories (dummy variable was used in the past)
The conditions are mutually exclusive (an observation cannot belong to more than one).
-> For any instance: COND_1 + COND_2 + COND_3 = 1
Where COND_1, …, … are the dummy variables we created out of the IV: COND.
If we include all three dummy variables (instead of just 2 as before), the software cannot estimate
one of them (its coefficient is reported as NA):
-> The reason is that perfect multicollinearity occurs
-> Specifically, the sum of the three dummy variables is constant and is always 1 (regardless of
the category allocation of a particular instance)
-> The intercept term is also a constant
-> Hence, the three dummy variables combined and the constant term in the regression model are
perfectly collinear (perfectly, constantly correlated regardless of the instance category allocation)
In other words, for every observation the intercept column equals COND_1 + COND_2 + COND_3, so the model cannot separate the intercept from the three dummy coefficients.
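A small base-R sketch of this "dummy variable trap". The category labels A, B, C and the simulated DV are assumptions made for illustration; only the dummy names COND_1 to COND_3 come from the notes.

```r
# Three mutually exclusive conditions (labels are hypothetical)
set.seed(7)
cond <- factor(sample(c("A", "B", "C"), 300, replace = TRUE))
COND_1 <- as.numeric(cond == "A")
COND_2 <- as.numeric(cond == "B")
COND_3 <- as.numeric(cond == "C")
y <- 1 + 0.5 * COND_2 + 1.5 * COND_3 + rnorm(300)

# The three dummies always sum to 1, just like the intercept column
print(all(COND_1 + COND_2 + COND_3 == 1))

# All three dummies plus intercept: one coefficient cannot be estimated
fit_all <- lm(y ~ COND_1 + COND_2 + COND_3)
print(coef(fit_all))

# Dropping one dummy (the reference category) resolves the problem
fit_ok <- lm(y ~ COND_2 + COND_3)
print(coef(fit_ok))
```

In the second model the intercept is the mean of the omitted category and the remaining coefficients are differences relative to it.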
— Assessing collinearity:
- compute the variance inflation factor (VIF) (should be used for both collinearity and
multicollinearity)
- The smallest possible value of VIF is 1. This indicates complete absence of collinearity
- There is no upper limit. As a rule of thumb, a VIF that exceeds 5 or 10 indicates a problematic
amount of collinearity
- It is computed as follows: vif(regression_model_name)
-> We run the vif function on the regression model. The output is presented.
-> The IVs Rating and Limit exceed the acceptable limit. They are implicated in collinearity
-> We can conclude that there is collinearity in the data
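What the VIF measures can be sketched by hand: regress each IV on all the other IVs and compute 1 / (1 - R^2). The data below are simulated so that Rating is tied closely to Limit; the numbers are assumptions and not the credit card dataset from the course, and vif_by_hand is a hypothetical helper name.

```r
# Simulated IVs: Rating depends strongly on Limit, Age is unrelated
set.seed(3)
n <- 400
Age <- round(runif(n, 20, 80))
Limit <- round(runif(n, 1000, 12000))
Rating <- 40 + 0.065 * Limit + rnorm(n, sd = 10)
d <- data.frame(Age, Limit, Rating)

# VIF of one IV = 1 / (1 - R^2) from regressing it on the other IVs
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(d)
print(round(vifs, 2))
```

Note that the DV plays no role in this calculation: the VIF only describes how well each IV is explained by the other IVs.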
A solution for collinearity: omitting one of the two collinear IVs from
the model
-> only appropriate when collinearity is very high or when dropping one of the variables does not cause
any loss of information (as with dropping one dummy variable)!
-> The collinear IVs are rating and limit. We can drop, for instance, Rating from the model
-> This results in an acceptable VIF
-> At the same time, the adjusted R^2 drops only slightly - no compromise in the fit of the model.
- If dropping one variable results in a loss of information (lower adjusted R^2), the results are
distorted. Then the solution might involve collecting significantly more data instead of dropping
the variable (having more data might potentially disentangle the different effects) or combining
the two problematic variables into one (for instance, taking the average of Limit and Rating)
-> We can create a combined variable by summing the two individual variables and dividing the
result by two. Given the different scales, the individual variables might first need to be rescaled. Then,
the combined variable can be used in the normal regression. As above, this is successful.
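A possible way to build such a combined variable is sketched below. Standardizing with scale() before averaging is one choice of rescaling assumed here, the data are simulated, and CREDIT_INDEX is a hypothetical name for the new IV.

```r
# Simulated Limit and Rating on very different scales (assumption)
set.seed(3)
Limit <- round(runif(200, 1000, 12000))
Rating <- 40 + 0.065 * Limit + rnorm(200, sd = 10)

# Put both variables on the same scale (mean 0, sd 1) before averaging
z_limit <- as.numeric(scale(Limit))
z_rating <- as.numeric(scale(Rating))

# Combined IV: the average of the two standardized variables
CREDIT_INDEX <- (z_limit + z_rating) / 2

print(summary(CREDIT_INDEX))
print(cor(CREDIT_INDEX, Limit))
```

The combined variable then replaces both Limit and Rating in the lm() formula.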
Workshop:
-> When an IV has more than 2 (mutually exclusive) categories, including a dummy variable for every category produces perfect
collinearity. If we do not remove one dummy variable, a variable will have NA in the regression
output.
-> Removing outliers should decrease (improve) the p-value
Module 13 - regression with a binary DV
-> So far the DV was quantitative
- A binary DV is often used to capture a choice / state. A dummy is used for this purpose. The
binary DV is 1 if the choice happens, 0 if it does not happen.
- Examples of binary DVs:
-> Choice: purchase / do not purchase, stay / leave (always mutually exclusive)
-> State: employed / unemployed, higher education / no higher education
-> Example of a binary DV. The IV (quantitative here) is plotted on the x-axis, the DV is on the y-axis. 0 and 1 are the only possible values of the DV for any X.
-> The Linear Probability model is a linear regression in the context of a binary DV
-> Another tool (not linear regression) is the Logit Model, it’s often better.
—- Linear Probability Model
-> The Linear Probability Model is a regular regression
-> In general, the regression outputs the expected value of Y given a certain value of X
-> When the DV is binary, the expected value of Y is the probability that Y = 1. So instead of an
expected value, we speak of an expected probability
-> The regression model (for a given set of inputs) outputs the expected probability of Y = 1.
-> Otherwise, everything works the same. For instance, the coefficient of X_1 corresponds to the
expected change in the probability of Y = 1 with a one-unit increase in the value of X_1.
Example:
-> IV is the allocation, DV is organ donation participation (Yes / No). In other words, we are
interested in how allocating people to different groups (opt in vs opt out) influences their decision
to register as organ donors.
-> Note that we are interested in the -probability- of Yes vs No for the different situations.
The database for the above situation looks as follows:
-> For each person, the IV is identified (opt in vs opt out) and the DV is identified (not organ donor
vs organ donor)
-> Non-regression summary is shown above. For instance, opt-in is associated with N = 136
organ donors and N = 864 non-donors. With opt-out, this is basically reverse. The relevant
proportions are shown above as well.
-> The regression is done as always. Only the interpretation of the output changes. The P above
the Y and DON stands for the probability.
-> Note that the IV is categorical but with only 2 categories. As such, no dummy variable is
needed (we proceed as normal, since IV is binary)
-> Note that the IV is defined as OPT_OUT, such that IV = 0 means the not opt out (i.e. opt in)
condition and IV = 1 means opt out.
-> If we set the IV = 0, we get that DON^p = 0.14 (i.e. equal to the intercept). So for not opt out,
the probability of a donation is 14%.
-> If we set the IV = 1, the probability increases by 0.68 relative to the base case (i.e. an increase
of 1 in the IV increases the donation probability by 0.68). So the probability is 0.14 + 0.68 = 0.82.
This corresponds to the proportions provided earlier.
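The link between the group proportions and the regression coefficients can be checked with a small reconstructed dataset. The opt-in counts (136 donors, 864 non-donors) are taken from the notes; the opt-out counts (820 / 180) are an assumption chosen to match the 82% proportion, since the notes only say the pattern is "basically reverse".

```r
# Reconstructed data: opt-in counts from the notes, opt-out counts assumed
d <- data.frame(
  OPT_OUT = rep(c(0, 1), each = 1000),
  DON = c(rep(c(1, 0), times = c(136, 864)),   # opt in
          rep(c(1, 0), times = c(820, 180)))   # opt out (assumed counts)
)

# Non-regression summary: counts and row proportions
print(table(d$OPT_OUT, d$DON))
print(prop.table(table(d$OPT_OUT, d$DON), margin = 1))

# Linear probability model: an ordinary lm() with a binary DV
lpm <- lm(DON ~ OPT_OUT, data = d)
print(coef(lpm))
```

The intercept reproduces the opt-in proportion and the slope is the difference between the two proportions, which is why the regression and the table agree.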
-> Example 2: same as above, adding a moderator variable. The moderator is: organ donation
endorsement, from 1 to 5 (quantitative).
H2: the positive effect of opt-out vs. opt-in on consenting rate will be stronger when endorsement
is low rather than high.
-> A non-regression output, similar to a hypothesized graph. The x-axis shows the moderator variable,
the y-axis shows the DV - specifically a proportion / probability since the DV is binary. The upper (blue) line
corresponds to the opt-out condition of the IV, the red line corresponds to opt-in
-> Note that when IV = 0, we are in the opt in group (red). The higher the value of the moderator
variable, the smaller the change from red (COND = 0) to blue (COND = 1). This would suggest
that H2 is confirmed.
- The regression is exactly the same for a binary DV as always.
-> When OPT_OUT = 0 and END = 0, DON^p = 0.03 (equal to the intercept). So there is a 3%
probability / proportion of organ donation in the opt in group with zero endorsement (base case).
However: note that p > 0.05. As such, the coefficient (probability) is not significantly different from
zero!
-> When we go from opt in to opt out (holding endorsement at zero), the probability of organ
donation significantly increases by 79.8 percentage points relative to the base case.
-> A one-unit increase in endorsement (for opt in, i.e. OPT_OUT = 0) increases the probability of a
donation by 4 percentage points.
-> Finally, let’s investigate the moderation effect. Note that the moderator (END) is quantitative:
If END = 1, DON^p = 0.8 * OPT_OUT - 0.05 * OPT_OUT * 1 = 0.75 * OPT_OUT
If END = 7, DON^p = 0.8 * OPT_OUT - 0.05 * OPT_OUT * 7 = 0.45 * OPT_OUT
Furthermore, the coefficient of the interaction term is significant and equal to -0.045.
Hence: the increase in the probability of a donation as we move from opt in to opt out becomes
smaller when the level of endorsement is higher. Note that the probability of a donation itself is not
necessarily lower (given the higher starting point with higher END) - if we account for the
standalone END term as well, the actual probability is fairly constant.
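The statements above can be collected into one worked equation. This is a sketch that uses the coefficients as quoted in these notes (0.03, 0.798, 0.04 and -0.045, some of them rounded), so the results differ slightly from the hand calculation with 0.8 and 0.05.

```latex
\begin{aligned}
\widehat{P}(\text{DON}=1) &= 0.03 + 0.798\,\text{OPT\_OUT} + 0.04\,\text{END}
                            - 0.045\,(\text{OPT\_OUT}\times\text{END}) \\
\text{effect of OPT\_OUT} &= 0.798 - 0.045\,\text{END} \\
\text{END}=1:&\quad 0.798 - 0.045 = 0.753 \\
\text{opt in } (\text{OPT\_OUT}=0):&\quad \widehat{P} = 0.03 + 0.04\,\text{END} \\
\text{opt out } (\text{OPT\_OUT}=1):&\quad \widehat{P} = 0.828 - 0.005\,\text{END}
\end{aligned}
```

The last line shows why the opt-out probability is fairly constant: the standalone END term (+0.04) and the interaction term (-0.045) almost cancel.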
Problems with the linear probability model:
- Predicted probabilities can be below 0 or above 1
- Heteroskedasticity is present by design
- It assumes linearity between variables (For example, it assumes that the change of
endorsement from 2 to 3 has the same effect as the change of endorsement from 5 to 6).
Solution - Logit Model:
- It is non-linear. As such, the coefficients cannot be interpreted directly as changes in probability
- For interpreting, we add an additional step: computing the marginal effects (not done in the
course, we only test the sign and significance of the coefficients)
-> The function for the logit model is the following:
glm(formula = DV ~ IV, family = binomial(link = "logit"), data = name_of_dataset)
Interpretation:
- The coefficient of OPT_OUT (i.e. of the IV) is 3.36. It is a) positive and b) statistically
significant. We can conclude that the opt out condition (i.e. OPT_OUT = 1) yields a significantly
higher probability of organ donation than the opt in condition.
-> This is all we can conclude. We cannot interpret the size of the coefficient (3.36) itself, just the fact that it is positive.
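To illustrate why the logit coefficient is not a probability, the sketch below fits the logit model on a reconstructed dataset. The opt-in counts (136 / 864) come from the notes; the opt-out counts (820 / 180) are an assumption matching the 82% proportion. On these data the coefficient comes out close to the 3.36 reported above. Converting back to probabilities with plogis goes one step beyond what the course requires.

```r
# Reconstructed data: opt-in counts from the notes, opt-out counts assumed
d <- data.frame(
  OPT_OUT = rep(c(0, 1), each = 1000),
  DON = c(rep(c(1, 0), times = c(136, 864)),   # opt in
          rep(c(1, 0), times = c(820, 180)))   # opt out (assumed counts)
)

logit <- glm(DON ~ OPT_OUT, family = binomial(link = "logit"), data = d)
b <- coef(logit)
print(b)   # coefficients are on the log-odds scale, not probabilities

# The OPT_OUT coefficient equals the log of the odds ratio of the two groups
log_odds_ratio <- log((820 / 180) / (136 / 864))

# Mapping the log-odds back to probabilities with the logistic function
p_opt_in <- unname(plogis(b["(Intercept)"]))
p_opt_out <- unname(plogis(b["(Intercept)"] + b["OPT_OUT"]))
print(c(opt_in = p_opt_in, opt_out = p_opt_out))
```

The sign and significance can be read directly from summary(logit); the size of the coefficient only becomes meaningful after the conversion step.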
It is also possible to test the moderation effect in the logit model.
-> Works the same as before.
-> We can again only draw conclusions about the sign and significance. All effects are significant. When
OPT_OUT = 1, the probability of organ donation increases (positive coefficient). Same for
endorsement. The interaction term is negative - a higher endorsement means a smaller change
from OPT_OUT = 0 to OPT_OUT = 1 (from control to intervention condition). However, we cannot
specify anything else.
-> In sum, the logit model is often preferred as it addresses the issues of the linear probability
model (predicted probabilities outside 0-1, heteroscedasticity by design, etc.)
-> Nonetheless, the linear probability model is straightforward and serves to examine the change
in probability that Y = 1 as X changes (as discussed before).
-> We can use the same tests for the linear probability model as before, such as the Breusch-Pagan test for heteroscedasticity of the error term.
Workshop:
- table function gives the absolute values, prop.table gives the proportions
- Logit model is also called logistic regression
- When testing for collinearity, the VIF function is used.
- Standard errors are computed with the assumption of homoscedasticity (in the regular lm
model) <- this refers to the previous module
- Example of a linear probability model prediction that yields negative probability:
The above outputs would not have this issue if the logit model was used instead.
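A tiny constructed example shows the difference between the two models. The ten data points below are made up for illustration (a quantitative IV x and a binary DV y); they are not the workshop data.

```r
# Made-up data: binary DV that switches from 0 to 1 as x grows
x <- 1:10
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)

# Linear probability model vs logit model on the same data
lpm <- lm(y ~ x)
logit <- glm(y ~ x, family = binomial(link = "logit"))

lpm_pred <- fitted(lpm)      # straight line, not bounded
logit_pred <- fitted(logit)  # S-shaped curve, bounded between 0 and 1

print(round(range(lpm_pred), 3))
print(round(range(logit_pred), 3))
```

The straight line of the linear probability model leaves the 0 to 1 range at both ends of x, whereas the logit predictions always stay strictly inside it.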
- With VIF output as below, we only care about the column GVIF
Module 14 - final session (course summary)
[…]