Review of Chapter 11

Team One: Leigh Anne, Javid, Kevin, Mantie and Hugh
• Causality is the relationship between cause and effect. In data science, it is the relationship between one event (the
cause) and a second event (the effect).
• Establishing a causal relationship between two variables is one of the most difficult things to learn.
• A good first step is to plot the variables. The chapter uses the example of ice cream sales and bathing suits. Once a
correlation is established, further investigation is needed to determine whether there is causality. Questions to ask
include: Is the desire to eat ice cream caused by wearing a bathing suit? Is wearing a bathing suit caused by eating
ice cream? Or is another variable the cause of both, such as the weather?
• Correlation doesn’t necessarily mean that there is causality. One way of phrasing the basic question is: What is the
effect of a on b? On the surface this appears relatively easy, but it is fundamental to the process: if you get the
question wrong, you may wind up with answers that do not address the question you intended to ask.
• A confounder is a variable that has an effect on both the treatment and the outcome.
• The best way to test causality is to run a randomized clinical trial. This is a trial in which participants are randomly
assigned to two groups: one group gets the treatment and the other doesn’t. The expected outcome needs to be
identified before the project begins.
• Another type of test is the A/B test. Here one group is given one version of the treatment and the other group is given
a variation of it, and metrics are used to determine the impact of the difference. An example is a website that
randomly serves visitors one of two home page experiences and uses metrics to determine the impact of the
differences between the two pages.
• Observational studies are the second-best type of study to perform if you are unable to run a randomized clinical
trial or an A/B test. An observational study is an empirical study whose goal is to explain cause-and-effect
relationships when a controlled experiment is not feasible.
• Simpson’s Paradox is when a trend that appears in different groups of data disappears or reverses when the groups
are combined. Equivalently, a trend seen in the combined data can disappear or reverse when the data is broken
down into groups.
• The Rubin Causal Model is a mathematical framework for making explicit what we know and what we don’t know in
observational studies. Although it is a good model for making inferences about a population, it can’t be used to
determine the effect on a specific individual.
• We can look at causal modeling using a causal graph. The causal graph in this example has three nodes, labeled “W”,
“A” and “Y”. W is the set of all possible confounders. A is the treatment, the one variable whose effect we are
interested in. In the case of Frank, this is the use of the word “beautiful”. A is binary: 0 means Frank didn’t use the
word “beautiful” in his email and 1 means he did. Y is the outcome, which is also binary and must be clearly defined:
Y is 1 if Frank was successful in getting the woman’s phone number and 0 if he was not.
[Figure: causal graph with nodes W, A and Y]
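The roles of W, A and Y can be made concrete with a small simulation. The sketch below is only an illustration: the probabilities are made-up assumptions (they do not come from the chapter or from Frank's data), W is reduced to a single binary confounder, and the function names `simulate` and `rate` are hypothetical. It shows that comparing outcomes by A alone overstates the effect when W influences both A and Y, while comparing within each value of W recovers something close to the effect that was built in (0.10).

```python
import random

def simulate(n=40000, seed=0):
    """Draw (w, a, y) triples from a toy graph: W -> A, W -> Y, A -> Y.
    All probabilities are invented for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0                  # confounder
        a = 1 if rng.random() < (0.8 if w else 0.2) else 0  # treatment depends on W
        p_y = 0.1 + 0.3 * w + 0.1 * a                       # outcome depends on W and A
        y = 1 if rng.random() < p_y else 0
        rows.append((w, a, y))
    return rows

def rate(rows, a, w=None):
    """Share of rows with y == 1 among rows with the given a (and w, if set)."""
    ys = [y for (wi, ai, y) in rows if ai == a and (w is None or wi == w)]
    return sum(ys) / len(ys)

rows = simulate()

# Naive comparison: ignores W entirely.
naive = rate(rows, 1) - rate(rows, 0)

# Adjusted comparison: effect within each W stratum, weighted by stratum size.
adjusted = 0.0
for w in (0, 1):
    share = sum(1 for r in rows if r[0] == w) / len(rows)
    adjusted += share * (rate(rows, 1, w) - rate(rows, 0, w))

print(f"naive difference:    {naive:.3f}")
print(f"adjusted difference: {adjusted:.3f}")
```

In real observational data the full set of confounders is rarely known or measured, so stratifying on one variable as done here is a simplification.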
• The causal effect: suppose 100 people are taking a drug, we screen them for cancer, and 30 of the 100 have cancer.
The cancer rate is then 0.30. If we could somehow determine that 20% of the same people would still have gotten
cancer without taking the drug, the causal effect would be 10 percentage points: the cancer rate among the people
who took the drug minus the cancer rate the same people would have had without it (0.30 - 0.20 = 0.10).
• Since we cannot observe that second rate, we instead compare this population to another one and use the other
population as if it were a control group. We have to assume that the proxy group has the same natural cancer rate as
the tested group, the only difference being that the control group didn’t take the drug and the tested group did. The
control (proxy) group should be made up of people with a similar makeup, or propensity: people as close as possible
to the tested group, who could have been in the treatment group but weren’t. To build it, look for people in the same
age group as the treatment group who also match on any other factors that can be identified in the treatment group,
such as number of siblings, number of family members with cancer, diet and exercise.
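The arithmetic and the idea of a matched control group can be sketched in code. This is a minimal illustration, not the chapter's method: it uses exact matching on a few factors, the records are invented, and the names `match_controls`, `effect_from_pairs` and the dictionary keys are hypothetical.

```python
# Causal effect from the example: observed cancer rate with the drug minus
# the (normally unobservable) rate the same people would have had without it.
rate_with_drug = 30 / 100
rate_without_drug = 0.20  # counterfactual rate assumed in the example
causal_effect = rate_with_drug - rate_without_drug

def match_controls(treated, untreated, keys):
    """Pair each treated person with one unused untreated person who has
    identical values on every key. Treated people with no match are dropped."""
    pool = {}
    for person in untreated:
        pool.setdefault(tuple(person[k] for k in keys), []).append(person)
    pairs = []
    for person in treated:
        candidates = pool.get(tuple(person[k] for k in keys), [])
        if candidates:
            pairs.append((person, candidates.pop()))
    return pairs

def effect_from_pairs(pairs):
    """Cancer rate of the matched treated group minus that of their controls."""
    treated_rate = sum(t["cancer"] for t, _ in pairs) / len(pairs)
    control_rate = sum(c["cancer"] for _, c in pairs) / len(pairs)
    return treated_rate - control_rate

# Invented records, for illustration only.
treated = [
    {"age_group": "50-59", "siblings": 2, "family_cancer": 1, "cancer": 1},
    {"age_group": "50-59", "siblings": 2, "family_cancer": 1, "cancer": 0},
    {"age_group": "60-69", "siblings": 0, "family_cancer": 0, "cancer": 1},
    {"age_group": "40-49", "siblings": 1, "family_cancer": 0, "cancer": 0},
]
untreated = [
    {"age_group": "50-59", "siblings": 2, "family_cancer": 1, "cancer": 0},
    {"age_group": "50-59", "siblings": 2, "family_cancer": 1, "cancer": 1},
    {"age_group": "60-69", "siblings": 0, "family_cancer": 0, "cancer": 0},
    {"age_group": "30-39", "siblings": 3, "family_cancer": 0, "cancer": 0},
]

keys = ["age_group", "siblings", "family_cancer"]
pairs = match_controls(treated, untreated, keys)
estimated_effect = effect_from_pairs(pairs)

print(f"causal effect from the example: {causal_effect:.2f}")
print(f"matched pairs: {len(pairs)}, estimated effect: {estimated_effect:.2f}")
```

Exact matching gets harder as more factors are added, because fewer untreated people share every value with a treated person. The estimate is also only as good as the assumption that the matched factors capture everything that differs between the groups.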