ppt - AN-DASH Research Group

Correlation
implies
Causation ?
Saad Saleh
Team Lead, Wisnet Lab, SEECS
saad.saleh@seecs.edu.pk
Contents
• Correlation
• Causality
• Examples
• Causal Research
• Causality Techniques:
  • Granger Causality
  • Zhang Causality
  • Peter Causality
  • LINGAM Causality
• Practical Applications
• Conclusion
2
Correlation
• Correlation means how closely related two sets of data are.
• In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.
  [wiki : http://en.wikipedia.org/wiki/Correlation_and_dependence]
• Relates to closeness, implying a relationship between objects, people, events, etc. For example, people often believe there are more bizarre behaviors exhibited when the moon is full.
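As a minimal sketch of how "how closely related" is usually quantified, the snippet below computes the Pearson correlation coefficient r. The data are synthetic and purely illustrative, and the helper name `pearson_r` is an assumption introduced here, not something from the slides.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()          # centre both samples
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical data: y rises with x, plus independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)
r = pearson_r(x, y)            # close to +1 means a strong positive linear relation
```

A value of r near +1 or -1 indicates a strong linear relationship; it says nothing about which variable, if either, is the cause.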
3
Causality
• Causality (also referred to as causation) is the relation
between an event (the cause) and a second event
(the effect), where the second event is understood as a
consequence of the first.
[Random House Unabridged Dictionary]
4
Correlation Examples
Drivers Age vs Sign Legibility distance
Driver’s age is negatively correlated with Sign Legibility Distance
5
Speed vs Fuel Consumption
6
Speed vs Fuel Consumption
Speed is correlated with the fuel consumption by the vehicle
7
Incentive vs Percentage Returned
Incentive is positively correlated with the Percentage Returned
8
Gun ownership vs Crime rate
Gun ownership and crime
r = .71
Gun Ownership correlates positively with crime rate
9
In a Gallup poll, surveyors asked,
“Do you believe correlation implies causation?”
• 64% of Americans answered “Yes”.
• 28% replied “No”.
• The other 8% were undecided.
10
See 10 simple questions to check
the influence of correlation over causality
11
Does ice cream consumption
lead to drowning??
Ice cream consumption is positively correlated
with the number of people drowning
12
Do Pirates Stop Global Warming??
The number of pirates is correlated
with Global Temperature
13
Does Shoe Size increase
Reading Ability??
Shoe Size is positively correlated
with Reading Ability
14
Do Firemen cause
Large Fire Damage??
The number of firemen is positively correlated
with the amount of damage
15
Does Nationality affect
SAT Score??
SAT scores are correlated
with nationality
16
Is Cholesterol level affected by
Facebook??
Cholesterol level is correlated
with the invention of Facebook
17
Are bad movies made because of
low sales of newspapers??
Production of bad Shyamalan movies is correlated
with newspaper sales
18
Can Internet Explorer
affect Murder Rate??
Use of Internet Explorer is correlated
with the murder rate
19
Can Mexican lemon imports
affect highway deaths??
Mexican lemon imports are correlated with highway deaths
20
Are Nobel Prizes won
by chocolate consumption??
The number of Nobel Prizes won by a country (adjusting for
population) correlates well with per capita chocolate consumption.
(New England Journal of Medicine)
21
Reality
Correlation vs. Causation
• “The correlation between workers’ education levels and wages is strongly positive”
• Does this mean education “causes” higher wages?
  • We don’t know for sure!
• Correlation tells us two variables are related BUT does not tell us why
22
Reality
Correlation vs. Causation
• Possibility 1
  • Education improves skills and skilled workers get better paying jobs
  • Education causes wages to rise
• Possibility 2
  • Individuals are born with quality A which is relevant for success in education and on the job
  • Quality A (NOT education) causes wages to rise
23
Correlation vs Causation
24
Without proper interpretation,
causation should not be
assumed, or even implied.
25
Causal Research
• If the objective is to determine which variable might be
causing a certain behavior (whether there is a cause and
effect relationship between variables) causal research
must be undertaken.
26
Causal discovery
What affects…
… the economy?
…your health?
…climate
changes?
Which actions will have beneficial effects?
27
Available data
• A lot of “observational” data.
  Correlation ≠ Causality!
• Experiments are often needed, but they can be:
  • Costly
  • Unethical
  • Infeasible
28
Establishing Causality
• To establish whether two variables are causally related, that is, whether a change in the independent variable X results in a change in the dependent variable Y, you must establish:
  • Time order: The cause must have occurred before the effect.
  • Co-variation (statistical association): Changes in the value of the independent variable must be accompanied by changes in the value of the dependent variable.
  • Rationale: There must be a logical and compelling explanation for why these two variables are related.
  • Non-spuriousness: It must be established that the independent variable X, and only X, was the cause of changes in the dependent variable Y; rival explanations must be ruled out.
29
Establishing Causality
• Note that it is never possible to prove causality, but only to show to what degree it is probable.
30
Causation Possibilities
• A causes B.
• B causes A.
• A and B both partly cause each other.
• A and B are both caused by a third factor, C.
• The observed correlation was due purely to chance.
31
Third or Missing Variable Problem


A relationship other than causal might
exist between the two variables.
It is possible that there is some other
variable or factor that is causing the
outcome.
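A minimal sketch of the third-variable problem, using synthetic data: A and B never influence each other, yet both depend on a hidden factor C, so they end up strongly correlated. All numbers and the helper name `residual` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
c = rng.normal(size=n)               # hidden common cause C
a = c + 0.5 * rng.normal(size=n)     # A depends only on C
b = c + 0.5 * rng.normal(size=n)     # B depends only on C

def residual(v, z):
    """Remove the part of v that is linearly explained by z."""
    zc = z - z.mean()
    slope = (zc @ (v - v.mean())) / (zc @ zc)
    return v - v.mean() - slope * zc

# Raw correlation between A and B is high although neither causes the other.
r_ab = np.corrcoef(a, b)[0, 1]
# After removing the influence of C from both, the correlation largely vanishes.
r_ab_given_c = np.corrcoef(residual(a, c), residual(b, c))[0, 1]
```

In practice C is often unobserved, which is exactly why the correlation between A and B cannot be taken as evidence of a causal link.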
32
Causal graph example
[Figure: causal graph with nodes Anxiety, Yellow Fingers, Smoking, Allergy, Born an Even Day, Peer Pressure, Genetics, Lung Cancer, Coughing, Attention Disorder, Fatigue and Car Accident]
33
A ? B
[Figure: scatter plot of A = log(Altitude) and B = Temperature]
A -> B
34
Best fit: A -> B
A -> B
A <- B
35
Linear case?
A <- B
A -> B
• Linear function
• Gaussian input
• Gaussian noise
36
Google Trends
&
Google Correlate
37
Approach 1:
Granger
Causality
Prof. Clive W.J. Granger,
recipient of the 2003 Nobel Prize in Economics
History
In the early 1960's, I was considering a
pair of related stochastic processes which
were clearly inter-related and I wanted to
know if this relationship could be broken
down into a pair of one way relationships. It
was suggested to me to look at a definition of
causality proposed by a very famous
mathematician, Norbert Wiener, so I adapted
this definition (Wiener 1956) into a practical
form and discussed it.
Applied economists found the definition
understandable and useable and applications
of it started to appear. However, several
writers stated that "of course, this is not real
causality, it is only Granger causality."
Clive W. J. Granger
42
Granger's Idea
“If variables are cointegrated, the
relationship among them can be
expressed as Error Correction
Mechanism (ECM)”.
43
Granger Causality
• Suppose that we have three series, $X_t$, $Y_t$ and $W_t$, and that we first attempt to forecast $X_{t+1}$ using past terms of $X_t$ and $W_t$ (without $Y_t$).
• We then try to forecast $X_{t+1}$ using past terms of $X_t$, $W_t$ and $Y_t$ (with $Y_t$).
• If the second forecast is found to be more successful, according to standard cost functions, then the past of Y appears to contain information helping in forecasting $X_{t+1}$ that is not in past $X_t$ or $W_t$.
• In short, $Y_t$ would "Granger cause" $X_{t+1}$ if
  • $Y_t$ occurs before $X_{t+1}$;
  • it contains information useful in forecasting $X_{t+1}$ that is not found in a group of other appropriate variables.
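The two-forecast comparison can be sketched as follows. This is an illustration on synthetic data, assuming one lag, ordinary least squares as the forecaster and mean squared error as the cost function; the generating process and the helper name `forecast_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 2000
y = rng.normal(size=T)
w = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    # Hypothetical process: x depends on its own past, past w and past y.
    x[t] = 0.5 * x[t - 1] + 0.3 * w[t - 1] + 0.8 * y[t - 1] + rng.normal()

def forecast_mse(target, predictors, split):
    """Fit OLS on the first `split` rows, return mean squared error on the rest."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    beta, *_ = np.linalg.lstsq(X[:split], target[:split], rcond=None)
    err = target[split:] - X[split:] @ beta
    return float(np.mean(err ** 2))

target = x[1:]                                   # X_{t+1}
past_x, past_w, past_y = x[:-1], w[:-1], y[:-1]  # values at time t
split = len(target) // 2

mse_without_y = forecast_mse(target, [past_x, past_w], split)
mse_with_y = forecast_mse(target, [past_x, past_w, past_y], split)
```

If `mse_with_y` is clearly smaller than `mse_without_y`, the past of Y carries forecasting information that the past of X and W lacks, which is the sense in which Y "Granger causes" X.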
44
Vector Autoregression (VAR)
Mathematical Definition
\[
Y_t = A\,Y_{t-1} + \dots + A'\,Y_{t-k} + e_t
\]
or, written out,
\[
\begin{bmatrix} Y_t^1 \\ Y_t^2 \\ Y_t^3 \\ \vdots \\ Y_t^p \end{bmatrix}
=
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & \cdots & A_{1p} \\
A_{21} & A_{22} & A_{23} & \cdots & A_{2p} \\
A_{31} & A_{32} & A_{33} & \cdots & A_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A_{p1} & A_{p2} & A_{p3} & \cdots & A_{pp}
\end{bmatrix}
\begin{bmatrix} Y_{t-1}^1 \\ Y_{t-1}^2 \\ Y_{t-1}^3 \\ \vdots \\ Y_{t-1}^p \end{bmatrix}
+ \dots +
\begin{bmatrix}
A'_{11} & A'_{12} & A'_{13} & \cdots & A'_{1p} \\
A'_{21} & A'_{22} & A'_{23} & \cdots & A'_{2p} \\
A'_{31} & A'_{32} & A'_{33} & \cdots & A'_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A'_{p1} & A'_{p2} & A'_{p3} & \cdots & A'_{pp}
\end{bmatrix}
\begin{bmatrix} Y_{t-k}^1 \\ Y_{t-k}^2 \\ Y_{t-k}^3 \\ \vdots \\ Y_{t-k}^p \end{bmatrix}
+
\begin{bmatrix} e_{1t} \\ e_{2t} \\ e_{3t} \\ \vdots \\ e_{pt} \end{bmatrix}
\]
where:
p = the number of variables to be considered in the system
k = the number of lags to be considered in the system
$Y_t, Y_{t-1}, \dots, Y_{t-k}$ = the p x 1 vectors of variables
$A, \dots, A'$ = the p x p matrices of coefficients to be estimated
$e_t$ = a p x 1 vector of innovations that may be contemporaneously
correlated but are uncorrelated with their own lagged values and
uncorrelated with all of the right-hand side variables.
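A minimal sketch of the definition with p = 2 variables and k = 1 lag: simulate a VAR and recover the coefficient matrix by least squares. The matrix `A_true` and the sample size are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A_true = np.array([[0.5, 0.2],
                   [0.0, 0.4]])      # hypothetical 2 x 2 coefficient matrix
T = 5000
Y = np.zeros((T, 2))
for t in range(1, T):
    # Y_t = A Y_{t-1} + e_t with independent standard normal innovations
    Y[t] = A_true @ Y[t - 1] + rng.normal(size=2)

# Stack observations as rows: present = past @ A.T + noise
past, present = Y[:-1], Y[1:]
coef, *_ = np.linalg.lstsq(past, present, rcond=None)
A_hat = coef.T                       # lstsq solves past @ coef = present
```

Here `A_hat[0, 1]` estimates how strongly the second variable's past feeds into the first, and `A_hat[1, 0]` the reverse; these off-diagonal entries are the ones Granger causality tests examine.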
45
Vector Autoregression (VAR)
Example
Consider a case in which the number of variables n is 2, the
number of lags p is 1 and the constant term is suppressed. For
concreteness, let the two variables be called money, $m_t$, and
output, $y_t$.
The structural equations will be:
\[ m_t = \beta_1 y_t + \gamma_{11} m_{t-1} + \gamma_{12} y_{t-1} + \varepsilon_{mt} \]
\[ y_t = \beta_2 m_t + \gamma_{21} m_{t-1} + \gamma_{22} y_{t-1} + \varepsilon_{yt} \]
46
Vector Autoregression (VAR)
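Example
• Then, the reduced form is
\[
m_t = \frac{\gamma_{11} + \beta_1 \gamma_{21}}{1 - \beta_1 \beta_2}\, m_{t-1}
    + \frac{\gamma_{12} + \beta_1 \gamma_{22}}{1 - \beta_1 \beta_2}\, y_{t-1}
    + \frac{1}{1 - \beta_1 \beta_2}\, \varepsilon_{mt}
    + \frac{\beta_1}{1 - \beta_1 \beta_2}\, \varepsilon_{yt}
    = \pi_{11} m_{t-1} + \pi_{12} y_{t-1} + \nu_{1t}
\]
\[
y_t = \frac{\gamma_{21} + \beta_2 \gamma_{11}}{1 - \beta_1 \beta_2}\, m_{t-1}
    + \frac{\gamma_{22} + \beta_2 \gamma_{12}}{1 - \beta_1 \beta_2}\, y_{t-1}
    + \frac{\beta_2}{1 - \beta_1 \beta_2}\, \varepsilon_{mt}
    + \frac{1}{1 - \beta_1 \beta_2}\, \varepsilon_{yt}
    = \pi_{21} m_{t-1} + \pi_{22} y_{t-1} + \nu_{2t}
\]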
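As a quick numerical check of the reduced form, the sketch below solves the two structural equations for (m_t, y_t) by matrix inversion and compares the result with the closed-form coefficients. The numeric values, and the variable names `beta1`, `beta2`, `Gamma`, `Pi`, are illustrative assumptions.

```python
import numpy as np

beta1, beta2 = 0.4, 0.3                   # hypothetical contemporaneous effects
Gamma = np.array([[0.5, 0.1],             # gamma_11, gamma_12
                  [0.2, 0.6]])            # gamma_21, gamma_22

# Structural form: B0 @ z_t = Gamma @ z_{t-1} + eps_t, with z_t = (m_t, y_t)
B0 = np.array([[1.0, -beta1],
               [-beta2, 1.0]])
Pi = np.linalg.solve(B0, Gamma)           # reduced-form coefficients

# Closed-form coefficients, term by term
d = 1.0 - beta1 * beta2
Pi_formula = np.array([
    [Gamma[0, 0] + beta1 * Gamma[1, 0], Gamma[0, 1] + beta1 * Gamma[1, 1]],
    [Gamma[1, 0] + beta2 * Gamma[0, 0], Gamma[1, 1] + beta2 * Gamma[0, 1]],
]) / d
```

The two matrices agree, and the same inverse of `B0` mixes the structural shocks into the reduced-form errors, which is why those errors can be contemporaneously correlated.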
47
Vector Autoregression (VAR)
Example
Several of the statistics computed from VARs are helpful in
assessing Granger causality.
• Granger causality tests, which have been interpreted as
testing, for example, the validity of the monetarist proposition
that autonomous variations in the money supply have been a
cause of output fluctuations.
48
Vector Autoregression (VAR)
Granger Causality
• In a regression analysis, we deal with the dependence of one variable on other variables, but it does not necessarily imply causation.
• In our GDP and M example, the often asked question is whether GDP → M or M → GDP. Since we have two variables, we are dealing with bilateral causality.
• Given the previous GDP and M VAR equations:
\[ m_t = \beta_1 y_t + \gamma_{11} m_{t-1} + \gamma_{12} y_{t-1} + \varepsilon_{mt} \]
\[ y_t = \beta_2 m_t + \gamma_{21} m_{t-1} + \gamma_{22} y_{t-1} + \varepsilon_{yt} \]
49
Vector Autoregression (VAR)
Granger Causality
• We can distinguish four cases:
  • Unidirectional causality from M to GDP
  • Unidirectional causality from GDP to M
  • Feedback or bilateral causality
  • Independence
• Assumptions:
  • Stationary variables for GDP and M
  • Number of lag terms (the test outcome can be sensitive to this choice)
  • Error terms are uncorrelated; if they are correlated, an appropriate transformation is necessary
50
Vector Autoregression (VAR)
Granger Causality – Estimation (t-test)
\[ m_t = \pi_{11} m_{t-1} + \pi_{12} y_{t-1} + \nu_{1t} \]
\[ y_t = \pi_{21} m_{t-1} + \pi_{22} y_{t-1} + \nu_{2t} \]
A variable, say $m_t$, is said to fail to Granger cause another variable,
say $y_t$, relative to an information set consisting of past m’s and y’s if
\[ E[y_t \mid y_{t-1}, m_{t-1}, y_{t-2}, m_{t-2}, \dots] = E[y_t \mid y_{t-1}, y_{t-2}, \dots]. \]
• $m_t$ does not Granger cause $y_t$ relative to an information set consisting of past m’s and y’s iff $\pi_{21} = 0$.
• $y_t$ does not Granger cause $m_t$ relative to an information set consisting of past m’s and y’s iff $\pi_{12} = 0$.
• In a bivariate case, as in our example, a t-test can be used to test the null hypothesis that one variable does not Granger cause another variable. In higher order systems, an F-test is used.
51
Vector Autoregression (VAR)
Granger Causality – Estimation (F-test)
1. Regress current GDP on all lagged GDP terms but do not
include the lagged M variables (restricted regression). From this,
obtain the restricted residual sum of squares, $RSS_R$.
2. Run the regression including the lagged M terms (unrestricted
regression) and obtain the unrestricted residual sum of squares, $RSS_{UR}$.
3. The null hypothesis is $H_0: \alpha_i = 0$ for every coefficient $\alpha_i$ on a
lagged M term, that is, the lagged M terms do not belong in the regression.
4. Compute the test statistic
\[ F = \frac{(RSS_R - RSS_{UR})/m}{RSS_{UR}/(n - k)} \]
where m is the number of lagged M terms, n is the number of observations
and k is the number of parameters estimated in the unrestricted regression.
5. If the computed F exceeds the critical F value at the chosen level of
significance, we reject the null, in which case the lagged M terms
belong in the regression. This is another way of saying that M
Granger-causes GDP.
52
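The restricted/unrestricted procedure can be sketched in a few lines. This is an illustration on synthetic series rather than GDP and M data; the function name `granger_f_stat` and the generating process are assumptions. Only the F statistic is computed, and comparing it with a critical value from an F table with (m, n - k) degrees of freedom is left to the reader.

```python
import numpy as np

def granger_f_stat(y, m, lags):
    """F statistic for H0: lagged m terms do not belong in the regression of y."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(y) - lags                          # usable observations
    target = y[lags:]
    # Column j holds the series lagged by j + 1 steps.
    y_lags = np.column_stack([y[lags - j - 1: len(y) - j - 1] for j in range(lags)])
    m_lags = np.column_stack([m[lags - j - 1: len(m) - j - 1] for j in range(lags)])
    const = np.ones((n, 1))

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return float(resid @ resid)

    X_r = np.hstack([const, y_lags])           # restricted: own lags only
    X_ur = np.hstack([const, y_lags, m_lags])  # unrestricted: plus lagged m
    rss_r, rss_ur = rss(X_r), rss(X_ur)
    k = X_ur.shape[1]                          # parameters in unrestricted model
    return ((rss_r - rss_ur) / lags) / (rss_ur / (n - k))

# Hypothetical data in which m drives y but not the other way round.
rng = np.random.default_rng(4)
T = 1000
m = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 0.6 * m[t - 1] + rng.normal()

f_m_to_y = granger_f_stat(y, m, lags=2)   # tests "m does not Granger cause y"
f_y_to_m = granger_f_stat(m, y, lags=2)   # tests "y does not Granger cause m"
```

Running the test in both directions is what separates the four cases (unidirectional either way, feedback, independence).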
Criticisms of Causality Tests
The Granger causality test, although much used in VAR
modelling, does not explain some
aspects of the VAR:
• It does not give the sign of the effect: we
do not know if it is positive or negative.
• It does not show how long the effect lasts.
• It does not provide evidence of whether
this effect is direct or indirect.
53
54
Max Planck at centre, 1931
Prof. Dr. Bernhard Schölkopf
Kun Zhang
55
Approach 2
“Distinguishing Causes from Effects using
Nonlinear Acyclic Causal Models”
Kun Zhang, Aapo Hyvärinen
Background
• Model-based causal discovery assumes a generative model to explain the data generating process.
• If the assumed model is close to the true one, such methods could not only detect the causal relations, but also discover the form in which each variable is influenced by others.
• For example,
  • Granger causality assumes that effects must follow causes and that the causal effects are linear (Granger, 1980).
  • If the data are generated by a linear acyclic causal model and at most one of the disturbances is Gaussian, independent component analysis (ICA) (Hyvärinen et al., 2001) can be exploited to discover the causal relations in a convenient way (Shimizu et al., 2006).
57
Shortcomings
• Previous models were too restrictive for real-life
problems.
If the assumed model is wrong, model-based causal
discovery may give misleading results.
58
Zhang Approach
In a large class of real-life problems, the following three
effects usually exist.
1. The effect of the causes is usually nonlinear.
2. The final effect received by the target variable from all
its causes contains some noise which is independent
from the causes.
3. Sensors or measurements may introduce nonlinear
distortions into the observed
values of the variables.
Assumption: Involved nonlinearities are invertible.
59
Proposed Solution:
• Each observed variable is a nonlinear function of its parents with additive noise, followed by a nonlinear distortion.
• If all nonlinearities are invertible, conditions are given under which the causal relationship can be identified.
• Two-step method: constrained nonlinear ICA followed by statistical independence tests, to distinguish the cause from the effect in the two-variable case.
60
Proposed Causal Model:
\[ x_i = f_{i,2}\bigl( f_{i,1}(pa_i) + e_i \bigr) \]
• $f_{i,1}$: nonlinear transformation (not necessarily invertible)
• $e_i$: noise effect in transmission from $pa_i$ to $x_i$
• $f_{i,2}$: nonlinear distortion (continuous and invertible)
First stage: a nonlinear transformation of its parents $pa_i$,
denoted by $f_{i,1}(pa_i)$, plus some noise (or disturbance) $e_i$
(which is independent from $pa_i$).
Second stage: a nonlinear distortion $f_{i,2}$ is applied to the
output of the first stage to produce $x_i$.
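A small sketch of the two-stage model as a data generator. The particular functions (a square for the first stage, a cube for the invertible distortion) and the noise scale are hypothetical choices made only to illustrate the structure.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
pa = rng.uniform(-2, 2, size=n)      # parent variable pa_i
e = 0.3 * rng.normal(size=n)         # disturbance e_i, independent of pa_i

def f1(v):
    """First stage: nonlinear transformation, not invertible."""
    return v ** 2

def f2(z):
    """Second stage: continuous and invertible distortion."""
    return z ** 3

def f2_inv(x):
    """Inverse of the second-stage distortion."""
    return np.cbrt(x)

x = f2(f1(pa) + e)                   # observed variable x_i

# Undoing the distortion and subtracting the first stage recovers the noise.
e_recovered = f2_inv(x) - f1(pa)
```

Because the distortion is invertible, the disturbance can be recovered exactly when the true functions are known; the estimation problem is to find such functions from data alone.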
61
Zhang Approach
• Suppose the causal relation under examination is $x_1 \to x_2$. If this causal relation holds, there exist nonlinear functions $f_{2,2}$ and $f_{2,1}$ such that $e_2 = f_{2,2}^{-1}(x_2) - f_{2,1}(x_1)$ is independent from $x_1$.
• Constrained nonlinear ICA therefore seeks outputs of the form $y_1 = x_1$, $y_2 = g_2(x_2) - g_1(x_1)$.
• Use multi-layer perceptrons (MLPs) to model the nonlinearities $g_1$ and $g_2$.
• Parameters in $g_1$ and $g_2$ are learned by making $y_1$ and $y_2$ as independent as possible.
62
Multilayer Perceptron (MLP)
• A multilayer perceptron (MLP) is a feedforward artificial
neural network model that maps sets of input data onto a
set of appropriate outputs.
63
Zhang Analysis
• y1 and y2 produced by the first step are the assumed
cause and the estimated corresponding disturbance,
respectively.
• In the second step, one needs to verify if they are
independent.
• If $y_1$ and $y_2$ are independent, it implies $x_1$ causes $x_2$, and
that $g_1$ and $g_2$ provide an estimate of $f_{2,1}$ and $f_{2,2}^{-1}$,
respectively.
64
Success !!
• The Zhang approach solved the “CauseEffectPairs” problem
in the Pot-luck challenge, successfully distinguishing
causes from effects
• Reward earned: $200
65
Approach 3
“Nonlinear causal discovery
with additive noise models”
Patrik O. Hoyer, Dominik Janzing, Joris Mooij,
Jonas Peters, Bernhard Schölkopf
Claim:
“Non-linearities are a blessing rather than a curse” -- Hoyer
Idea:
In reality, many causal relationships are non-linear.
How about generalizing the basic linear framework to
non-linear models?
67
Hoyer Approach
When causal relationships are nonlinear it typically helps
break the symmetry between the observed variables and allows
the identification of causal directions.
As Friedman and Nachman have pointed out, non-invertible
functional relationships between the observed variables can
provide clues to the generating causal model.
We show that the phenomenon is much more general; for
nonlinear models with additive noise almost any nonlinearities
(invertible or not) will typically yield identifiable models.
68
Hoyer Approach
Model:
\[ x_i := f_i\bigl(x_{pa(i)}\bigr) + n_i \]
where
• $f_i$ is an arbitrary function (possibly different for each i),
• $x_{pa(i)}$ is a vector containing the elements $x_j$ such that there is an edge from j to i in the DAG G,
• the noise variables $n_i$ may have arbitrary probability densities $p_{n_i}(n_i)$.
69
Hoyer Model Estimation
Test whether x and y are statistically independent.
If not: test whether a model
y := f(x) + n
is consistent with the data, simply by doing a nonlinear regression of y on x (to
get an estimate f’ of f), calculating the corresponding residuals n’ = y - f’(x),
and testing whether n’ is independent of x. If so, accept the model
y := f(x) + n;
if not, reject it.
Similarly, test whether the reverse model x := g(y) + n fits the data.
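The regress-then-test procedure can be sketched as below. Two ingredients are assumptions made for this illustration and are not stated on the slide: polynomial least squares stands in for the nonlinear regression, and a biased HSIC statistic with Gaussian kernels stands in for the independence test (no p-value is computed, only the two directions are compared). The data are synthetic.

```python
import numpy as np

def hsic(u, v):
    """Biased HSIC estimate with Gaussian kernels (median-distance bandwidth).
    Values near zero suggest independence; larger values suggest dependence."""
    def gram(z):
        d2 = (z[:, None] - z[None, :]) ** 2
        sigma2 = np.median(d2[d2 > 0])
        return np.exp(-d2 / (2.0 * sigma2))
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    K, L = gram(u), gram(v)
    return float(np.trace(K @ H @ L @ H)) / n ** 2

def anm_residual_dependence(cause, effect, degree=5):
    """Regress effect on cause with a polynomial, return HSIC(residual, cause)."""
    coeffs = np.polyfit(cause, effect, degree)
    resid = effect - np.polyval(coeffs, cause)
    return hsic(resid, cause)

# Hypothetical ground truth: x -> y with additive, non-Gaussian noise.
rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=300)
y = x ** 3 + rng.uniform(-1, 1, size=300)

forward = anm_residual_dependence(x, y)    # model y := f(x) + n
backward = anm_residual_dependence(y, x)   # model x := g(y) + n
```

The direction whose residuals look more independent of the input (the smaller statistic) is the one the additive noise model favours.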
70
Hoyer Test Results
the “Old Faithful” dataset
• Obtains a p-value of 0.5 for the (forward) model “current duration
causes next interval length” and
• a p-value of $4.4 \times 10^{-9}$ for the (backward) model “next interval length
causes current duration”
71
Hoyer Test Results
the “Abalone” dataset from the UCI ML repository
• The correct model “age causes length” leads to a p-value of 0.19,
• The reverse model “length causes age” comes with $p < 10^{-15}$
72
Hoyer Test Results
Temperature Altitude Statistics
• The correct model “altitude causes temperature” leads to p = 0.017,
• “Temperature causes altitude” can clearly be rejected ($p = 8 \times 10^{-15}$)
73
Approach 4
“A Linear Non-Gaussian Acyclic
Model for Causal Discovery (LINGAM)”
Shohei Shimizu, Patrik O. Hoyer,
Aapo Hyvärinen, Antti Kerminen
Approach:
Use of Independent Component Analysis (ICA), called Linear
Non-Gaussian Acyclic Model (LINGAM) analysis
“when working with continuous-valued data, a significant advantage
can be achieved by departing from the Gaussianity assumption”
Assumptions
1. Data generating process is linear
2. No unobserved confounders
3. Disturbance variables have non-Gaussian distributions with
non-zero variances
75
LINGAM Model
• Linear Non-Gaussian Acyclic Model
• Data Generating process:
76
LINGAM Idea
• Key to Solution :
Observed variables are linear functions of the
disturbance variables, and the disturbance variables are
mutually independent and non-Gaussian.
\[ x = Bx + e, \]
\[ x = Ae, \quad \text{where } A = (I - B)^{-1}. \]
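A short numerical sketch of the two equivalent forms. The connection matrix `B` (a chain x1 -> x2 -> x3) and the Laplace-distributed disturbances are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
# Strictly lower triangular B encodes an acyclic model: x1 -> x2 -> x3.
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.5, 0.0]])
e = rng.laplace(size=(3, 1000))        # independent non-Gaussian disturbances

A = np.linalg.inv(np.eye(3) - B)       # mixing matrix A = (I - B)^{-1}
x = A @ e                              # observed data, one sample per column

# The same x satisfies the structural form x = Bx + e.
structural_ok = np.allclose(x, B @ x + e)
```

Seen this way, the observed data are a linear mixture of independent non-Gaussian sources, which is exactly the setting ICA is designed for.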
77
LINGAM Algorithm
LINGAM can be briefly summarized as follows:
• First, use a standard ICA algorithm (e.g., the FastICA algorithm) to obtain an estimate of the mixing matrix A (or equivalently of W),
• subsequently permute it and normalize it appropriately before using it to compute B containing the sought connection strengths $b_{ij}$.
78
LINGAM Algorithm
(1) Given: an m x n data matrix X (m << n), where each column contains one
sample vector x.
(a) Subtract the mean from each row of X.
(b) Apply ICA to get X = AS, where S contains the independent
components in its rows.
(c) Note: $W = A^{-1}$.
(2) Find $\tilde{W}$, which contains no zeros on the main diagonal and is
obtained by permuting the rows of W.
(3) Divide each row of $\tilde{W}$ by its corresponding diagonal element to get $\tilde{W}'$
with all 1’s on the main diagonal.
79
LINGAM Algorithm
(4) Compute $\hat{B} = I - \tilde{W}'$.
(5) To find the causal order, find the permutation matrix P of $\hat{B}$ which yields
$\tilde{B} = P \hat{B} P^T$
as close as possible to strictly lower triangular. Closeness can be measured using
$\sum_{i \le j} \tilde{B}_{ij}^2$.
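Steps (2) to (5) can be sketched by brute force for a handful of variables, assuming the unmixing matrix W from step (1) is already available (the ICA step itself is not shown). The function name `lingam_from_w` is hypothetical, and the row permutation is chosen here by minimising the sum of 1/|diagonal entry|, which is one reasonable way to keep the diagonal away from zero.

```python
import itertools
import numpy as np

def lingam_from_w(W):
    """Steps (2)-(5) for a small number of variables, by exhaustive search."""
    p = W.shape[0]
    perms = list(itertools.permutations(range(p)))
    # (2) Row permutation whose diagonal is as far from zero as possible.
    with np.errstate(divide="ignore"):
        rows = min(perms,
                   key=lambda r: sum(1.0 / abs(W[r[i], i]) for i in range(p)))
    W_perm = W[list(rows), :]
    # (3) Scale each row so that the diagonal becomes 1.
    W_norm = W_perm / np.diag(W_perm)[:, None]
    # (4) Estimated connection strengths.
    B_hat = np.eye(p) - W_norm

    # (5) Simultaneous row/column permutation closest to strictly lower triangular.
    def upper_mass(order):
        Bp = B_hat[np.ix_(order, order)]
        return float(np.sum(np.triu(Bp) ** 2))   # sum over entries with i <= j

    order = min(perms, key=lambda o: upper_mass(list(o)))
    return B_hat, list(order)

# Hypothetical check: ICA returns W only up to row order and row scaling.
B_true = np.array([[0.0,  0.0, 0.0],
                   [0.8,  0.0, 0.0],
                   [0.0, -0.5, 0.0]])
W_true = np.eye(3) - B_true
W_scrambled = W_true[[2, 0, 1], :] * np.array([[2.0], [-1.0], [0.5]])

B_hat, causal_order = lingam_from_w(W_scrambled)
```

Exhaustive search over permutations grows factorially with the number of variables, so this form is only practical for very small systems.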
80
Practical Experiments
Project
Detecting Covert Links in Instant Messaging
(IM) Networks Using Flow Level Log Data
81
Introduction
• Users send Instant Messages (IM) to a relay server
• The relay server forwards messages to the corresponding users
• All packets contain only the source and destination IP addresses of the user and the server
[Figure: Scenario # 1]
82
Introduction
• Users may be
communicating behind a
proxy server
• Users behind proxy servers
are visible in scenario#2.
Scenario # 2
83
Data Set
• Yahoo! Messenger IM network.
• Data Set Details:
  • Area: New York City area.
  • Time: 12am to 12am
• Data Set Files:
  • Input Data File: user-to-server traffic traces.
  • Ground Truth Data File: record of the actual user-to-user connections.
84
Data Set Statistics
Time       Duration   Users     Messages   Sessions
8-8:10a    10 mins     3,420     15,370     1,968
8-8:20a    20 mins     5,405     33,192     3,265
8-8:30a    30 mins     7,438     53,649     4,661
8-8:40a    40 mins     9,513     75,810     6,179
8-8:50a    50 mins    11,684     99,721     7,669
8-9a       60 mins    13,953    126,694     9,264
85
Granger Causality
F-test statistics for Granger Causality test
86
Zhang Approach Results
Zhang results for talking and non-talking
pairs for IM networks in Yahoo!
87
Just for Knowledge
• Classifier Tool
WEKA (Waikato Environment for Knowledge Analysis) -> popular suite
of machine learning software written in Java, developed at the University of
Waikato, New Zealand
WEKA Bird : Found in New Zealand,
Vulnerable Species.
88
WEKA
89
Conclusion
90
Given: A causes B.
To prove: Must A and B be
correlated?
Result: YES or NO?
Why? Can you show it?
91
92