106750029 Revised: 3/6/2016

Chapter 12. Path Analysis

12:1 Why use path analysis?

What is path analysis?
1. Application of MLR.
2. Special case of Structural Equation Modeling.
3. Method to summarize and display information about relationships among variables.

12:2 Situations in which one would use path analysis.

Path analysis is a good presentation tool for results of multiple linear regression where there are intermediate variables and indirect effects because the causal variables are correlated. Path analysis reflects part of the collinearity among explanatory variables. Path analysis can be used to test how well a priori models are supported by the data. It cannot be used to derive the form of the relationships or of the diagram.

12:3 Model and path diagram.

The model and the diagram are imposed on the analysis, not derived from it. A simple path diagram has a single response variable and a series of explanatory variables that may be correlated.

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \]

In this example from the TB book, an experiment was conducted in which students were randomly assigned to one of two teaching treatments, traditional and novel, and their performance and motivation were measured. Whereas ANOVA and ANCOVA focus on the direct path from the treatment group to the Y variable (exam score), path analysis shows that motivation is a potential mediating variable that can affect scores while it is itself affected by the treatments. The diagram shows that treatment can have both direct and indirect effects on scores. If the direct effect of motivation is significant, it is possible to think of improving scores by increasing motivation by means other than the treatments tested. More importantly, a negative effect of the novel teaching method on motivation can mask its direct effect on scores, if motivation has a positive effect on scores.
Path analysis would uncover this situation and reveal that the novel teaching method would be successful if its negative impact on motivation could be avoided. This use of path analysis is the same as the application of ANCOVA in which the effect of a treatment on a primary response variable (exam scores) is studied after removing the effects that act through a secondary response variable (motivation).

Figure 12-1. Example of path analysis where two responses have a common cause.

More complex models involve more than one MLR equation. Variables are classified as exogenous if the path diagram includes no predictors for them, or endogenous if the diagram includes explanatory variables for them.

Consider a second example, typical of the applications of path analysis. Yield or fitness of individual plants can be partitioned into components such as number of flowers, seeds per flower, and seed size. One may be interested in determining which component is the most important and what environmental factors, such as fertility of the soil, water availability, and density of competitors, affect it. A possible path diagram for that situation would be the following:

Figure 12-2. A two-layered path diagram relating yield (or fitness) to yield components and to environmental factors.

The variables in black (fertility, water, and competitor density) are exogenous; no other variables have unidirectional arrows pointing at them. The purple (seeds per flower, number of flowers and seed size) and the red (yield) variables are endogenous. The variance of endogenous variables is explained by the other variables in the diagram.

This path analysis allows one to quantify which yield component, if any, is more important in explaining the variation in yield. The importance of each environmental factor in determining each yield component and yield is also quantified.
The total effect of each environmental factor is partitioned into direct effects and indirect effects through covariance with other environmental factors. In addition, the total pattern of covariance among yield components is partitioned into a component that is explained by environmental factors, and a component that is due to the covariances among errors.

Note that in the previous diagram, the exogenous variables could have been controlled in a manipulative experiment. If the experiment is balanced and has all treatment combinations, then the correlations among the exogenous variables would be zero and the diagram would be more powerful and easier to interpret.

12:3.1 Types of questions addressed by path analysis.

In the yield components example, path analysis can address questions such as: What will happen if we try to modify yield by changing fertility? What plant traits should be targeted for improvement through genetic manipulation? For example, if seed size and seed number both have important direct effects on yield and at the same time are highly negatively correlated through the error terms, selection for either trait will probably not result in any improvement, because the correlation probably has a strong genetic (non-environmental) component.

The diagram allows us to detect direct effects that could be masked by indirect ones when inspected without the vantage point of path analysis. Suppose that in the previous example there is no overall relationship between seed weight and water. Path analysis may reveal that water in fact has a strong positive direct effect on seed weight that is masked by a strong positive correlation between water and competitor density and a strong negative direct effect of competitor density on seed weight.

12:3.2 Partition of correlations.

Path analysis partitions the correlation between variables into direct and indirect effects.
In the previous diagram, the total correlation between seed size and yield is partitioned into a direct effect of seed size on yield, represented by the unidirectional arrow from seed size to yield, and indirect effects through the correlation of common causes of seed size and the other yield components. For example, the path from seed size to fertility to seed number to yield is an indirect effect of seed size on yield. A simpler example is the correlation between fertility and seed weight, which can be partitioned into the direct path from fertility to seed weight, an indirect path from fertility through its correlation with water to seed weight, and a final indirect component from fertility to competitor density to seed weight.

As indicated in the lectures on linear regression, the partition of correlations is based on the normal equations. The explanation below uses the following statistical fact: "The correlation between two variables is the same as the correlation between the standardized variables". The example assumes that there are three exogenous variables and one dependent variable, but the equations can be extended to any number of variables.

Apply the normal equations to y and the x's, where these are the standardized versions of the original X and Y variables:

\[
\begin{aligned}
b'_0 n + b'_1 \sum x_1 + b'_2 \sum x_2 + b'_3 \sum x_3 &= \sum y \\
b'_0 \sum x_1 + b'_1 \sum x_1^2 + b'_2 \sum x_1 x_2 + b'_3 \sum x_1 x_3 &= \sum x_1 y \\
b'_0 \sum x_2 + b'_1 \sum x_1 x_2 + b'_2 \sum x_2^2 + b'_3 \sum x_2 x_3 &= \sum x_2 y \\
b'_0 \sum x_3 + b'_1 \sum x_1 x_3 + b'_2 \sum x_2 x_3 + b'_3 \sum x_3^2 &= \sum x_3 y
\end{aligned}
\]

Since \( \sum x_i = 0 \), the first row and column drop out. (By the way, the first row demonstrates that for standardized variables \( b'_0 \) must be zero.)
Dividing both sides of each equation by (n-1) and using p's instead of b's for the standardized regression coefficients we obtain:

\[
\begin{aligned}
p_1 \frac{\sum x_1^2}{n-1} + p_2 \frac{\sum x_1 x_2}{n-1} + p_3 \frac{\sum x_1 x_3}{n-1} &= \frac{\sum x_1 y}{n-1} \\
p_1 \frac{\sum x_1 x_2}{n-1} + p_2 \frac{\sum x_2^2}{n-1} + p_3 \frac{\sum x_2 x_3}{n-1} &= \frac{\sum x_2 y}{n-1} \\
p_1 \frac{\sum x_1 x_3}{n-1} + p_2 \frac{\sum x_2 x_3}{n-1} + p_3 \frac{\sum x_3^2}{n-1} &= \frac{\sum x_3 y}{n-1}
\end{aligned}
\]

but

\[ \frac{\sum x_i^2}{n-1} = S_x^2 = 1 \quad \text{and} \quad \frac{\sum x_i x_j}{n-1} = r_{ij} \]

so the equations can be written as follows:

\[
\begin{aligned}
p_1 + p_2 r_{12} + p_3 r_{13} &= r_{1y} \\
p_1 r_{12} + p_2 + p_3 r_{23} &= r_{2y} \\
p_1 r_{13} + p_2 r_{23} + p_3 &= r_{3y}
\end{aligned}
\]

Usually, to avoid confusion in more complex cases, the path coefficients specify which dependent variable they address: \( p_{y1}, p_{y2}, p_{y3}, p_{y4} \) (see figure 119 from Li, 1975).

Figure 12-3. A simple path diagram showing the direct paths and the correlations among the explanatory variables (Li, 1975).

For more complex diagrams the same procedure is applied to each dependent variable (DV) as a function of all variables whose paths reach the DV directly. In the pollination example (see figure 10.1 of Mitchell 93), the path diagram has the following implicit models:

Approaches = b1' No. flowers + b2' nectar p.r. + b3' n. neighbor d.
fruit set = c1' approaches + c2' probes + c3' n. neighbor d.
probes = d1' appr. + d2' No. flowers + d3' nectar p.r. + d4' n. neighbor d.

12:4 Procedure.

12:4.1 Assumptions.

Because path analysis is an application of multiple linear regression, the same assumptions apply. In addition, it is more important than in MLR that all the variables have a multivariate normal distribution. This assumption is particularly important for the more general version of path analysis: structural equation modeling.

12:4.2
1. Path analysis will require as many multiple linear regression analyses as the number of endogenous variables in the diagram.
2. Linearity. Check with partial regression scatterplots. Transform as necessary.
3. Multivariate normality. Check univariate distributions for normality with Shapiro-Wilk or other test.
Use the χ² test of Mahalanobis distances for multivariate normality. Use transformations as necessary.
4. Outliers. Examine scatterplots. Use Mahalanobis distance. Eliminate outliers, with precautions.

Calculation of parameters. Parameters are calculated by performing a separate MLR analysis for each response variable, and by obtaining a complete matrix of correlations among variables. For the regressions, use the Fit Model platform of JMP. Right-click on the table of Parameter Estimates and request the standardized betas. Copy the Parameter Estimates table and paste it into an Excel spreadsheet to facilitate further calculations and preparation of tables. For the matrix of correlations use the Multivariate platform, then copy and paste the matrix into the same Excel spreadsheet.

Direct effects (arrows with a single direction) are the standardized partial regression coefficients of MLR. Correlations are represented by two-headed arrows. All parameters can be represented such that the diagram reflects their size and significance.

12:4.2.1 Example of procedure in JMP.

Three varieties of wheat were grown in Yugoslavia during 10 years. Each plot was scored for disease resistance (DisRes) and lodging resistance (LodRes). Other variables measured were leaf area index (LAI), leaf area duration (LAD), density of spikes per square meter (SPM), number of kernels per spike (KSP), and kernel weight (KWT). The complete data set is in the xmpl_PATH.jmp file. In the example we restrict our analysis to the effects of spike density, kernels per spike and kernel weight, which are the components of yield. The total effect of each component on yield will be partitioned into direct and indirect effects.

MLR. First, Yield is regressed on SPM, KSP, and KWT. All assumptions of MLR would have to be checked at this point. We proceed as if all assumptions are met. The standardized partial regression coefficients for the predictors are the values of the arrows that join each one with Yield.
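For readers working outside JMP, the regression step can be sketched in Python. This is only an illustration under stated assumptions, not the procedure used in these notes: the data are simulated (they are not the contents of xmpl_PATH.jmp), and the helper names `pearson`, `solve` and `standardized_betas` are hypothetical. The sketch obtains the standardized partial regression coefficients by solving the normal equations written in terms of correlations, using only the standard library.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

def standardized_betas(predictors, response):
    """Solve R_xx p = r_xy for the standardized partial regression coefficients."""
    k = len(predictors)
    r_xx = [[pearson(predictors[i], predictors[j]) for j in range(k)]
            for i in range(k)]
    r_xy = [pearson(x, response) for x in predictors]
    return solve(r_xx, r_xy), r_xx, r_xy

# Simulated data for illustration only (NOT the wheat data set).
random.seed(1)
n = 300
spm = [random.gauss(0, 1) for _ in range(n)]
ksp = [0.3 * s + random.gauss(0, 1) for s in spm]
kwt = [-0.5 * k + random.gauss(0, 1) for k in ksp]
yld = [0.2 * s + 0.7 * k + 0.4 * w + random.gauss(0, 1)
       for s, k, w in zip(spm, ksp, kwt)]

paths, r_xx, r_xy = standardized_betas([spm, ksp, kwt], yld)
for name, p in zip(("SPM", "KSP", "KWT"), paths):
    print(name, round(p, 4))
```

Working from the correlation matrix rather than from the raw data mirrors the spreadsheet workflow described above, where the correlations and the standardized betas are kept side by side.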
The path between U for yield and yield is the square root of 1-Rsquare=0.6873.

[Figure: path diagram for the wheat example. Direct paths to Yield: SPM 0.2248, KSP 0.7126, KWT 0.4291. Correlations among components: SPM-KSP 0.2867, SPM-KWT -0.0519, KSP-KWT -0.4851. Path from Other (resid.) to Yield: 0.687.]

Path diagram showing three exogenous variables that reflect a factorial experiment of three wheat varieties by ten years. Spikes per square meter (SPM), kernels per spike (KSP), and kernel weight (KWT) were measured as yield components. Yield is analyzed as a function of the three measured components. The residual is attributed to other components of yield and errors not captured by SPM, KSP or KWT.

In this diagram, the total correlation between SPM and yield is partitioned into 3 paths: direct from SPM to yield, SPM-KSP-yield, and SPM-KWT-yield. The numerical calculation of the parts is presented later.

Correlations. The correlation matrix and the level of significance of each pairwise correlation can be obtained through the Multivariate platform of JMP. To get the pairwise correlations, select the option in the popdown menu by clicking on the red triangle to the left of "Multivariate" at the top of the results window. To complete the path diagram, we need all correlations among yield and yield components.

Path diagram. Now we have all the elements to complete the path diagram and then proceed to interpret it. Values for the double-headed arrows are obtained from the correlations table above. Direct paths are obtained from the MLR. The values inside boxes with borders are significant at the 0.01 probability level.

For easier interpretation,

12:5 Interpretation.

12:5.1 Path Analysis Rules
1. The absence of a valid path between two variables means they are uncorrelated. Corollary: if two variables are correlated they must have some valid connecting path. The diagrams must be complete.
2. A path with more than one leg is not valid if it goes first forward (with arrow) and then backward (against arrow).
A path that goes backward and then forward is valid. For example, the path Y-X-Y is valid, but U-Y-X is not valid.

[Figure: path diagram with variables X, Y and U.]

3. If any two variables are hypothesized to have a correlation, some path must connect them. The diagram must be complete. In the figure above, because no valid path between X and U is possible, we interpret the diagram to mean they are not correlated. The typical case is that errors and effects of a model should not have any possible paths that connect them, because they are defined as orthogonal by the minimization of SSE. In the yield components example above, no valid path connects any of the errors with any of the exogenous variables.
4. Paths on double-headed arrows are permitted in both directions.
5. The correlation between two variables is the sum of all valid paths joining them. Each leg of a path is a factor in the calculation of the total path value. For example, the total correlation between SPM and yield is 0.4068. This comes from a direct effect (0.2248), an indirect effect through KSP (0.2867*0.7126=0.2043), and an indirect effect through KWT (-0.0519*0.4291=-0.0223). The sum of all paths is equal to 0.4068.
6. Correlation is not a transitive relation; thus, paths with more than one correlation step are not valid.
7. In balanced, designed experiments, controlled treatment factors should not be correlated. The total correlation between two dependent variables can be partitioned among the different experimental treatments. (See Pantone et al. 1989, Weed Sci. 37:778). In their figure 1 (top), density of fiddleneck affected inflorescences/plant and flowers/inflor. in the same direction (negative), generating a positive association between the two yield components. The converse was true for the second year.

The fully developed wheat example shows some interesting results.
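The arithmetic of rule 5 can be repeated for all three yield components with a short Python sketch. It uses only the coefficients quoted in the wheat example and assumes that the three correlations in the diagram pair up as SPM-KSP 0.2867, SPM-KWT -0.0519 and KSP-KWT -0.4851. The totals for KSP and KWT are derived here from the tracing rules and are not values reported in these notes, and the function name `decompose` is hypothetical.

```python
from math import sqrt

# Direct paths to Yield (standardized partial regression coefficients).
direct = {"SPM": 0.2248, "KSP": 0.7126, "KWT": 0.4291}

# Two-headed arrows (correlations among yield components); the pairing of
# -0.4851 with KSP-KWT is an assumption read from the diagram.
corr = {
    frozenset(("SPM", "KSP")): 0.2867,
    frozenset(("SPM", "KWT")): -0.0519,
    frozenset(("KSP", "KWT")): -0.4851,
}

def decompose(component):
    """Split r(component, Yield) into a direct path and indirect paths (rule 5)."""
    parts = {"direct": direct[component]}
    for other, path in direct.items():
        if other != component:
            # One correlation step, then the other component's direct path.
            parts["via " + other] = corr[frozenset((component, other))] * path
    return parts

totals = {}
for comp in direct:
    parts = decompose(comp)
    totals[comp] = sum(parts.values())
    print(comp, {k: round(v, 4) for k, v in parts.items()},
          "total", round(totals[comp], 4))

# R-square is the sum of (direct path * total correlation) over predictors;
# the residual path to Yield is sqrt(1 - R-square).
r_square = sum(direct[c] * totals[c] for c in direct)
residual_path = sqrt(1 - r_square)
print("R-square", round(r_square, 4), "residual path", round(residual_path, 4))
```

Under these assumptions the sketch reproduces the total of 0.4068 for SPM and a residual path of about 0.687, which agrees with the values given for the diagram.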
Kernels per spike was the variable that had the largest direct effect on yield, but its overall contribution to yield was diminished because KSP had a strong negative correlation with kernel weight. This is a typical "compensatory" relationship between two yield components, and may be happening because a plant has a limited amount of carbohydrates to contribute to seed growth. When more seeds are present in the spike, the average seed weight will be smaller. This points to a situation where production of wheat was not limited by the availability of sinks (storage and seeds) but by the availability of sources (carbohydrates).

The large unexplained component of yield indicates that there was a large deviation between the yield as calculated from the yield components (note that yield measured by yield components is the product of the three variables SPM, KSP and KWT) and the yield finally achieved. This difference can be attributed to large spatial variability and inaccuracy of the method used to measure yield, as well as to losses due to lodging. Lodging can keep grain below the reach of the combine, so grain may have been produced and detected by the yield components, but it was not accounted for in the final harvest.

12:5.2 Pollination example

Question: How is pollination, and the factors that determine pollination, related to plant reproduction (fitness)?

Ho: plant traits → visitation → pollination → fitness

The species in the example, scarlet gilia, must cross-pollinate, so seed production depends on insects transporting the pollen among flowers of different plants. The path diagram is created by theoretical and practical consideration of the pollination process. One can arrive at Figure 10.1 by the following rationale. Based on foraging theory, we expect bees to be attracted to areas where there are many plants that have many flowers. If they experience plants that have a high rate of nectar production, bees will learn to visit them more frequently.
Flowers in areas with more plants and flowers will be probed more frequently. In addition, plants that are approached more frequently, regardless of the reason for the approach, will also tend to be probed more frequently. Plants that are approached more frequently and probed more times will produce more fruits, but they will have to compete with neighbors for resources and pollen. If the arrow from N neighbor to fruit set is not included and there is a strong negative effect of competition on fruit set, U for fruit set would be inflated.

For simplicity, consider the factors that affect probes/flower/hour. The diagram has two components: correlations and direct effects.

Correlation matrix

Standardized partial regression coefficients:

Approaches = 0.21*No. flowers + 0.24*Nectar p. rate + residual

The total correlation between number of flowers and approaches is 0.25 and it is decomposed into a direct effect (0.21) and an indirect effect through the correlation between number of flowers and nectar production rate (0.16*0.24=0.04).
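As a check on the arithmetic, rule 5 applied to this two-predictor equation can be written out with unrounded products. This is only a worked restatement of the coefficients quoted above. The subscripts F (number of flowers), N (nectar production rate) and A (approaches) are introduced here for illustration and do not come from the original figure.

```latex
\[
r_{FA} = p_{AF} + r_{FN}\, p_{AN}
       = 0.21 + (0.16)(0.24)
       = 0.21 + 0.0384
       = 0.2484 \approx 0.25
\]
```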