Path analysis - Plant Sciences

106750029
Revised: 3/6/2016
Chapter 12. Path Analysis
12:1 Why use path analysis?
What is path analysis?
1. Application of multiple linear regression (MLR).
2. Special case of Structural Equation Modeling.
3. Method to summarize and display information about relationships among variables.
12:2 Situations in which one would use path analysis.
Path analysis is a good presentation tool for results of multiple linear regression where there are
intermediate variables and indirect effects because the causal variables are correlated.
Path analysis reflects part of the collinearity among explanatory variables.
Path analysis can be used to test how well a priori models are supported by the data. It cannot be used to
derive the form of the relationships or of the diagram.
12:3 Model and path diagram.
The model and the diagram are imposed on
the analysis, not derived from it.
A simple path diagram has a single response variable and a series of explanatory variables that may be
correlated.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
In this example from the TB book, an experiment was conducted in which students were randomly assigned
to one of two teaching treatments, traditional and novel, and their performance and motivation were measured.
Whereas ANOVA and ANCOVA focus on the direct path from the treatment group to the Y variable (exam
score), path analysis shows that motivation is a potential mediating variable that can be affecting scores while it
is affected by treatments. This diagram clearly shows that treatment can have both direct and indirect effects on
scores. If the direct effect of motivation is significant, it may be possible to improve scores by increasing
motivation by means other than the treatments tested. More importantly, a negative effect of the novel
teaching method on motivation can mask its direct effect if motivation has a positive effect on scores. Path
analysis would uncover this situation and reveal that the novel teaching method would be successful if its
negative impact on motivation could be avoided. This use of path analysis is the same as the application of
ANCOVA in which the effect of a treatment on a primary response variable (exam scores) is studied after
removing the effects through a secondary response variable (motivation).
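The masking of a direct effect by an opposing indirect effect can be sketched with simulated data. Everything in the block below is an assumption made for illustration: the sample size, the effect sizes (-1.0, 0.6, 0.6), and the seed are hypothetical and are not taken from the textbook example. The coefficients are unstandardized regression slopes, not standardized path coefficients.

```python
import random

random.seed(12)  # arbitrary seed so the sketch is reproducible

# Simulate a randomized experiment: treatment 0 = traditional, 1 = novel.
n_per_group = 2000
treatment, motivation, score = [], [], []
for t in (0, 1):
    for _ in range(n_per_group):
        m = -1.0 * t + random.gauss(0, 1)           # novel method lowers motivation
        s = 0.6 * t + 0.6 * m + random.gauss(0, 1)  # treatment and motivation both raise scores
        treatment.append(t)
        motivation.append(m)
        score.append(s)

def centered(values):
    """Return the values expressed as deviations from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    """Sum of cross-products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

t_c, m_c, s_c = centered(treatment), centered(motivation), centered(score)
stt, smm, stm = dot(t_c, t_c), dot(m_c, m_c), dot(t_c, m_c)
sts, sms = dot(t_c, s_c), dot(m_c, s_c)

# Total effect: regression of score on treatment alone (difference between group means).
total_effect = sts / stt

# Direct effects: regression of score on treatment and motivation,
# solving the 2 x 2 normal equations by Cramer's rule.
det = stt * smm - stm ** 2
direct_treatment = (sts * smm - stm * sms) / det
direct_motivation = (stt * sms - stm * sts) / det

# Effect of treatment on the mediating variable.
treatment_on_motivation = stm / stt
indirect_effect = treatment_on_motivation * direct_motivation

print(f"total effect of treatment:  {total_effect:.3f}")
print(f"direct effect of treatment: {direct_treatment:.3f}")
print(f"indirect effect (via motivation): {indirect_effect:.3f}")
```

In this sketch the total effect is close to zero even though the simulated direct effect of the novel method is positive, because the indirect effect through motivation is of similar size and opposite sign. The total always equals the direct effect plus the indirect effect.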
Figure 12-1. Example of path analysis where two responses have a common cause.
More complex models involve more than one MLR equation. Variables are classified as exogenous if the
path diagram includes no predictors for them, or endogenous if the diagram includes explanatory variables for
them.
Consider a second example, typical of the applications of path analysis. Yield or fitness of individual plants
can be partitioned into components such as number of flowers, seeds per flower, and seed size. One may be
interested in determining which component is the most important and what environmental factors such as
fertility of the soil, water availability, and density of competitors affect it. A possible path diagram for that
situation would be the following:
Figure 12-2. A two-layered path diagram relating yield (or fitness) to yield components and to
environmental factors.
The variables in black (fertility, water, and competitor density) are exogenous; no other variables have
unidirectional arrows pointing at these variables. The purple (seeds per flower, number of flowers and seed
size) and the red (yield) variables are endogenous. The variance of endogenous variables is explained by the
other variables in the diagram. This path analysis allows one to quantify which yield component is more
important, if any, in explaining the variation in yield. The importance of each environmental factor in determining
each yield component and yield is also quantified. The total effect of each environmental factor is partitioned
into direct effects and indirect effects through covariance with other environmental factors. In addition, the total
pattern of covariance among yield components is partitioned into a component that is explained by
environmental factors, and a component that is due to the covariances among errors.
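The exogenous/endogenous classification can be derived mechanically from the list of single-headed arrows. The sketch below assumes a particular set of arrows for the two-layered diagram (every environmental factor points to every yield component, and every yield component points to yield); the original figure may contain a different set.

```python
# Single-headed arrows written as (cause, effect).
# The exact set of arrows is an assumption made for illustration.
environment = ["fertility", "water", "competitor density"]
components = ["number of flowers", "seeds per flower", "seed size"]

arrows = [(env, comp) for env in environment for comp in components]
arrows += [(comp, "yield") for comp in components]

# A variable is endogenous if at least one single-headed arrow points at it.
variables = sorted({name for arrow in arrows for name in arrow})
has_predictor = {effect for _cause, effect in arrows}

endogenous = [v for v in variables if v in has_predictor]
exogenous = [v for v in variables if v not in has_predictor]

# One multiple linear regression is needed per endogenous variable.
n_regressions = len(endogenous)

print("exogenous: ", exogenous)
print("endogenous:", endogenous)
print("regressions needed:", n_regressions)
```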
Note that in the previous diagram, the exogenous variables could have been controlled in a manipulative
experiment. If the experiment is balanced and has all treatment combinations, then the correlations among the
exogenous variables would be zero and the diagram would be more powerful and easier to interpret.
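A quick check of the claim about balanced experiments: in a full factorial design with equal replication, every pairwise correlation among the treatment factors is zero. The three levels per factor used below are hypothetical codes (low, medium, high).

```python
from itertools import product
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

levels = [0, 1, 2]                          # hypothetical codes: low, medium, high
design = list(product(levels, repeat=3))    # all 27 treatment combinations, one plot each
fertility, water, density = (list(column) for column in zip(*design))

r_fertility_water = pearson(fertility, water)
r_fertility_density = pearson(fertility, density)
r_water_density = pearson(water, density)

print(r_fertility_water, r_fertility_density, r_water_density)
```

With zero correlations among the exogenous variables there are no indirect paths through two-headed arrows, so each correlation between a factor and a response equals the corresponding direct path.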
12:3.1 Types of questions addressed by path analysis.
In the yield components example, path analysis can address questions such as: What will happen if we try
to modify yield by changing fertility? What plant traits should be targeted for improvement through genetic
manipulation? For example, if seed size and seed number both have important direct effects on yield and at the
same time are highly negatively correlated through the error terms, selection for either trait will probably not
result in any improvement, because the correlation probably has a strong genetic (non-environmental)
component.
The diagram allows us to detect direct effects that could be masked by indirect ones when inspected without
the vantage point of path analysis. Suppose that in the previous example there is no overall relationship
between seed weight and water. Path analysis may reveal that water in fact has a strong positive direct effect on
seed weight that is masked by a strong positive correlation between water and competitor density and a strong
negative direct effect of competitor density on seed weight.
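A small numeric sketch of this masking, considering only water and competitor density as causes of seed weight. The three values are hypothetical, chosen so that the two paths cancel exactly; they are not estimates from data.

```python
# Hypothetical standardized values (assumed for illustration only).
p_water = 0.40           # direct path: water -> seed weight
p_density = -0.80        # direct path: competitor density -> seed weight
r_water_density = 0.50   # correlation between water and competitor density

# Each correlation with seed weight is the direct path plus the path
# that passes through the correlated cause.
r_water_seed = p_water + r_water_density * p_density
r_density_seed = p_density + r_water_density * p_water

print(f"r(water, seed weight)   = {r_water_seed:.2f}")
print(f"r(density, seed weight) = {r_density_seed:.2f}")
```

Here the overall correlation between water and seed weight is zero, although the assumed direct effect of water is 0.40.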
12:3.2 Partition of correlations.
Path analysis partitions the correlation between variables into direct and indirect effects. In the previous
diagram, the total correlation between seed size and yield is partitioned into a direct effect of seed size on yield,
represented by the unidirectional arrow from seed size to yield, and indirect effects through the correlation of
common causes of seed size and the other yield components. For example, the path from seed size to fertility
to seed number to yield is an indirect effect of seed size on yield. A simpler example is the correlation between
fertility and seed weight, which can be partitioned into the direct path from fertility to seed weight, an indirect
path from fertility through its correlation with water to seed weight, and a final indirect component from fertility to
competitor density to seed weight.
As indicated in the lectures on linear regression, the partition of correlations is based on the normal
equations. For the following explanation we use the following statistical fact:
“The correlation between two variables is the same as the correlation between the standardized variables”.
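The statistical fact can be checked numerically. The sketch below uses made-up values for two variables; it also confirms the two properties of standardized variables used in the derivation: their sum is zero, and their sum of squares divided by (n-1) is one.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical raw data (not from the course data sets).
fertility = [12.0, 15.5, 9.8, 20.1, 17.3, 11.4, 14.9, 18.6]
seed_weight = [31.0, 35.2, 30.1, 41.7, 36.0, 29.5, 36.8, 38.9]

x, y = standardize(fertility), standardize(seed_weight)
n = len(x)

r_raw = pearson(fertility, seed_weight)
r_standardized = pearson(x, y)
r_cross_products = sum(a * b for a, b in zip(x, y)) / (n - 1)
sum_x = sum(x)
variance_x = sum(a * a for a in x) / (n - 1)

print(r_raw, r_standardized, r_cross_products)
print(sum_x, variance_x)
```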
The example assumes that there are three exogenous variables and one dependent variable, but the
equations can be extended to any number of variables. Apply normal equations to y and x’s where these are the
standardized versions of the original X and Y variables :
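$$
\begin{aligned}
b'_0 n + b'_1 \sum x_1 + b'_2 \sum x_2 + b'_3 \sum x_3 &= \sum y \\
b'_0 \sum x_1 + b'_1 \sum x_1^2 + b'_2 \sum x_1 x_2 + b'_3 \sum x_1 x_3 &= \sum x_1 y \\
b'_0 \sum x_2 + b'_1 \sum x_1 x_2 + b'_2 \sum x_2^2 + b'_3 \sum x_2 x_3 &= \sum x_2 y \\
b'_0 \sum x_3 + b'_1 \sum x_1 x_3 + b'_2 \sum x_2 x_3 + b'_3 \sum x_3^2 &= \sum x_3 y
\end{aligned}
$$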
Since Σxi = 0, the first row and column drop out. (By the way, the first row demonstrates that for standardized
variables b'0 must be zero.) Dividing both sides of each equation by (n-1) and using p's instead of b's for the
standardized regression coefficients, we obtain:
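$$
\begin{aligned}
p_1 \frac{\sum x_1^2}{n-1} + p_2 \frac{\sum x_1 x_2}{n-1} + p_3 \frac{\sum x_1 x_3}{n-1} &= \frac{\sum x_1 y}{n-1} \\
p_1 \frac{\sum x_1 x_2}{n-1} + p_2 \frac{\sum x_2^2}{n-1} + p_3 \frac{\sum x_2 x_3}{n-1} &= \frac{\sum x_2 y}{n-1} \\
p_1 \frac{\sum x_1 x_3}{n-1} + p_2 \frac{\sum x_2 x_3}{n-1} + p_3 \frac{\sum x_3^2}{n-1} &= \frac{\sum x_3 y}{n-1}
\end{aligned}
$$
but
$$
\frac{\sum x_i^2}{n-1} = S_x^2 = 1 \quad \text{and} \quad \frac{\sum x_i x_j}{n-1} = r_{ij}
$$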
so the equations can be written as follows:
p1 + p2 r12 + p3 r13 = r1y
p1 r12 + p2 + p3 r23 = r2y
p1 r13 + p2 r23 + p3 = r3y
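Read in the other direction, these three equations let one obtain the path coefficients from the correlations alone. The sketch below solves the system with a small Gauss-Jordan routine; the correlation values are hypothetical, and the helper name `solve` is introduced here for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical correlations among three explanatory variables (r12, r13, r23)...
r_xx = [
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
# ...and between each explanatory variable and y (r1y, r2y, r3y).
r_xy = [0.5, 0.6, 0.4]

# Path coefficients p1, p2, p3.
p = solve(r_xx, r_xy)

# Substituting the p's back into the left-hand sides reproduces r1y, r2y, r3y.
reconstructed = [sum(r_xx[i][j] * p[j] for j in range(3)) for i in range(3)]

print("path coefficients:", [round(v, 4) for v in p])
print("reconstructed r_iy:", [round(v, 4) for v in reconstructed])
```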
Usually, to avoid confusion in more complex cases, the path coefficients specify which dependent variable
they address:
py1, py2, py3, py4
(see figure 119 from Li, 1975)
Figure 12-3. A simple path diagram showing the direct paths and the correlations among the explanatory
variables (Li, 1975).
For more complex diagrams the same procedure is applied to each dependent variable (DV) as a function of all
variables with direct paths to it. In the pollination example (see figure 10.1 of Mitchell 93), the path
diagram has the following implicit models:
Approaches = b1’ No. flowers + b2’ nectar p.r. + b3’ n. neighbor d.
fruit set = c1’ approaches + c2’ probes + c3’ n. neighbor d.
probes = d1’ appr. + d2’ No. flowers + d3’ nectar p.r. + d4’ n. neighbor d.
12:4 Procedure.
12:4.1 Assumptions.
Because path analysis is an application of multiple linear regression, the same assumptions apply. In
addition, it is more important than in MLR to have multivariate normal distribution of all the variables. This
assumption is particularly important for the more general version of path analysis: structural equation modeling.
12:4.2 Calculation of parameters.
1. Path analysis will require as many multiple linear regression analyses as the number of endogenous
variables in the diagram.
2. Linearity. Check with partial regression scatterplots. Transform as necessary.
3. Multivariate normality. Check univariate distributions for normality with Shapiro-Wilk or other test. Use a
χ² test of Mahalanobis distances for multivariate normality. Use transformations as necessary.
4. Outliers. Examine scatterplots. Use Mahalanobis distance. Eliminate outliers, with precautions.
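The Mahalanobis-distance check listed above can be sketched for the two-variable case without any statistical package. With two variables, the squared distances are compared against a χ² distribution with 2 degrees of freedom, whose upper-tail probability is exactly exp(-d²/2). The data and the 0.01 cutoff are assumptions for illustration; JMP reports these distances directly.

```python
from math import exp

def mahalanobis_squared(x, y):
    """Squared Mahalanobis distance of each (x, y) point from the bivariate mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the covariance matrix
    distances = []
    for a, b in zip(x, y):
        dx, dy = a - mean_x, b - mean_y
        # Quadratic form with the inverse of the 2 x 2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: eleven points near the line y = x and one unusual point at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 6]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1, 9.9, 11.2, 12.0]

d2 = mahalanobis_squared(x, y)

# Upper-tail probability of a chi-square with 2 degrees of freedom.
p_values = [exp(-q / 2) for q in d2]

alpha = 0.01  # assumed cutoff for flagging a point as a potential outlier
flagged = [i for i, pv in enumerate(p_values) if pv < alpha]

for i, (q, pv) in enumerate(zip(d2, p_values)):
    print(f"point {i:2d}: d2 = {q:6.3f}  p = {pv:.4f}")
print("flagged points:", flagged)
```

Flagged points are candidates for closer inspection, not for automatic deletion. Note that the last point is not extreme on either variable alone; it stands out only because it departs from the joint pattern.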
Parameters are calculated by performing separate MLR analyses for each response variable, and by
obtaining a complete matrix of correlations among variables. For the regressions, use the Fit Model platform of
JMP. Right-click on the table of Parameter Estimates and request the standardized betas. Copy the Parameter
Estimates table and paste it into an Excel spreadsheet to facilitate further calculations and preparation of tables.
For the matrix of correlations, use the Multivariate platform, then copy and paste the matrix into the same Excel
spreadsheet.
Direct effects (arrows with a single direction) are the standardized partial regression coefficients of MLR.
Correlations are represented by two-headed arrows. All parameters can be represented such that the diagram
reflects their size and significance.
12:4.2.1 Example of procedure in JMP.
Three varieties of wheat were grown in Yugoslavia for 10 years. Each plot was scored for disease
resistance (DisRes) and lodging resistance (LodRes). Other variables measured were leaf area index (LAI), leaf
area duration (LAD), density of spikes per square meter (SPM), number of kernels per spike (KSP), and kernel
weight (KWT). The complete data is in the xmpl_PATH.jmp file.
In the example we restrict our analysis to the effects of spike density, kernels per spike and kernel weight,
which are the components of yield. The total effect of each component on yield will be partitioned into direct and
indirect effects.
MLR. First, Yield is regressed on SPM, KSP, and KWT. All assumptions of MLR would have to be checked
at this point. We proceed as if all assumptions are met. The standardized partial regression coefficients for the
predictors are the values of the arrows that join each one with Yield. The path between U for yield and yield is
the square root of (1 - Rsquare), which equals 0.6873.
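[Path diagram. Direct paths to Yield: SPM 0.2248, KSP 0.7126, KWT 0.4291. Correlations (double-headed
arrows): SPM-KSP 0.2867, SPM-KWT -0.0519, KSP-KWT -0.4851. Path from Other (resid.) to Yield: 0.687.]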
Path diagram showing three exogenous variables that reflect a factorial experiment of three wheat varieties
by ten years. Spikes per square meter (SPM), kernels per spike (KSP), and kernel weight (KWT) were
measured as yield components. Yield is analyzed as a function of the three measured components. The
residual is attributed to other components of yield and errors not captured by SPM, KSP or KWT. In this
diagram, the total correlation between SPM and yield is partitioned into 3 paths: direct from SPM to yield,
SPM-KSP-yield, and SPM-KWT-yield. The numerical calculation of the parts is presented later.
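The residual path can be checked from the values in the diagram. For standardized variables, Rsquare equals the sum over all pairs of predictors of p_i p_j r_ij (with r_ii = 1). The assignment of each number to an arrow is inferred from the text; in particular, -0.4851 is taken to be the KSP-KWT correlation.

```python
from math import sqrt

# Direct paths (standardized partial regression coefficients) to yield.
p = {"SPM": 0.2248, "KSP": 0.7126, "KWT": 0.4291}

# Correlations among the yield components (assignment to arrows is inferred).
r = {("SPM", "KSP"): 0.2867, ("SPM", "KWT"): -0.0519, ("KSP", "KWT"): -0.4851}

def corr(a, b):
    """Correlation between two components, in either order; 1 on the diagonal."""
    if a == b:
        return 1.0
    return r[(a, b)] if (a, b) in r else r[(b, a)]

# Rsquare as a quadratic form in the path coefficients.
r_squared = sum(p[a] * p[b] * corr(a, b) for a in p for b in p)

# Path from the unexplained part (U, "Other") to yield.
residual_path = sqrt(1 - r_squared)

print(f"Rsquare       = {r_squared:.4f}")
print(f"residual path = {residual_path:.4f}")
```

The result agrees with the 0.6873 reported for the path from U to yield.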
Correlations. The correlation matrix and the level of significance of each pairwise correlation can be
obtained through the Multivariate platform of JMP. To get the pairwise correlations, select the option in the
pop-down menu by clicking on the red triangle to the left of “Multivariate” at the top of the results window. To
complete the path diagram, we need all correlations among yield and yield components.
Path diagram. Now we have all elements to complete the path diagram and then proceed to interpreting it.
Values for the double-headed arrows are obtained from the correlations table above. Direct paths are obtained
from the MLR. The values inside boxes with borders are significant at the 0.01 probability level.
12:5 Interpretation.
12:5.1 Path Analysis Rules
1. The absence of a valid path between two variables means they are uncorrelated. Corollary: if two variables
are correlated they must have some valid connecting path. The diagrams must be complete.
2. A path with more than one leg is not valid if it goes first forward (with arrow) and then backward (against
arrow). A path that goes backward and then forward is valid. For example, the path Y ← X → Y is valid, but
U → Y ← X is not valid.
[Diagram with variables X, Y, and U: X → Y ← U.]
3. If any two variables are hypothesized to have a correlation, some path must connect them. The diagram
must be complete. In the figure above, because no valid path between X and U is possible, we interpret the
diagram to mean they are not correlated. The typical case is that errors and effects of a model should not
have any possible paths that connect them, because they are defined as orthogonal by the minimization of
SSE. In the yield components example above, no valid path connects any of the errors with any of the
exogenous variables.
4. Paths on double-headed arrows are permitted in both directions.
5. The correlation between two variables is the sum of all valid paths joining them. Each leg of the path is a
factor in the calculation of total path value. For example, the total correlation between SPM and yield is
0.4068. This comes from a direct effect (0.2248), an indirect effect through KSP (0.2867*0.7126=0.2043),
and an indirect effect through KWT (-0.0519*0.4291=-0.0223). The sum of all paths is equal to 0.4068.
6. Correlation is not a transitive relation; thus, paths with more than one correlation step are not valid.
7. In balanced, designed experiments, controlled treatment factors should not be correlated. The total
correlation between two dependent variables can be partitioned among the different experimental treatments
(see Pantone et al. 1989, Weed Sci. 37:778). In their figure 1 (top), density of fiddleneck affected
inflorescences/plant and flowers/inflor. in the same direction (negative), generating a positive association
between the two yield components. The converse was true for the second year.
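Rule 5 can be applied to all three yield components of the wheat example, not only to SPM. The sketch below traces, for each component, the direct path plus the two indirect paths that pass through one correlation step (rule 6 excludes longer ones). The values are those of the wheat path diagram; the assignment of -0.4851 to the KSP-KWT correlation is inferred from the text.

```python
# Direct paths (standardized partial regression coefficients) to yield.
p = {"SPM": 0.2248, "KSP": 0.7126, "KWT": 0.4291}

# Correlations among the yield components (assignment to arrows is inferred).
r = {("SPM", "KSP"): 0.2867, ("SPM", "KWT"): -0.0519, ("KSP", "KWT"): -0.4851}

def corr(a, b):
    """Correlation between two different components, in either order."""
    return r[(a, b)] if (a, b) in r else r[(b, a)]

totals = {}
for component in p:
    direct = p[component]
    # One indirect path per other component: correlation step, then direct path.
    indirect = {other: corr(component, other) * p[other]
                for other in p if other != component}
    totals[component] = direct + sum(indirect.values())
    parts = ", ".join(f"via {name}: {value:+.4f}" for name, value in indirect.items())
    print(f"{component}: direct {direct:+.4f}, {parts}, total {totals[component]:+.4f}")
```

For SPM the total reproduces the 0.4068 given in rule 5. For KSP, the largest direct path is partly offset by the negative path through KWT.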
In the wheat example fully developed, we can see some interesting results. Kernels-per-spike was the
variable that had the largest direct effect on yield, but its overall contribution to yield was diminished because
KSP had a strong negative correlation with kernel weight. This is a typical “compensatory” relationship between
two yield components, and may be happening because a plant has a limited amount of carbohydrates to
contribute to seed growth. When more seeds are present in the spike, the average seed weight will be smaller.
This points to a situation where production of wheat was not limited by the availability of sinks (storage and
seeds) but by the availability of sources (carbohydrates). The large unexplained component of yield indicates
that there was a large deviation between the yield as calculated by the yield components (note that yield
measured by yield components is the product of the three variables SPM, KSP and KWT) and the yield finally
achieved. This difference can be attributed to large spatial variability and inaccuracy of the method to measure
yield, as well as losses due to lodging. Lodging can keep grain below the reach of the combine, so grain may
have been produced and detected by the yield components, but it was not accounted for in the final harvest.
12:5.2 Pollination example
Question: How is pollination, and the factors that determine pollination, related to plant reproduction
(fitness)?
Ho: plant → visitation → pollination → fitness traits
The species in the example, Scarlet gilia, must cross-pollinate, so seed production depends on insects
transporting the pollen among flowers of different plants. The path diagram is created by theoretical and
practical consideration of the pollination process. One can arrive at Figure 10.1 by the following rationale.
Based on foraging theory, we expect bees to be attracted to areas where there are many plants that have many
flowers. If they experience plants that have a high rate of nectar production, bees will learn to visit them more
frequently. Flowers in areas with more plants and flowers will be probed more frequently. In addition, plants
that are approached more frequently, regardless of the reason for the approach, will also tend to be probed
more frequently. Plants that are approached more frequently and probed more times will produce more fruits,
but they will have to compete with neighbors for resources and pollen. If the arrow from N neighbor to fruit set
is not included and there is a strong negative effect of competition on fruit set, U for fruit set would be inflated.
For simplicity, consider the factors that affect probes/flower/hour. The diagram has two components:
correlations and direct effects.
Correlation matrix
Standardized partial regression coefficients:
Approaches=0.21*No. flowers + 0.24*Nectar p. rate + residual
The total correlation between number of flowers and approaches is 0.25 and it is decomposed into a direct
effect (0.21) and an indirect effect through the correlation between number of flowers and nectar production rate
(0.16*0.24=0.04).