CHAPTER 6: LOG LINEAR ANALYSIS

Purpose:
Log linear analysis is an advanced technique that can be used instead of independence tests. There are two main advantages to using the log linear technique: 1) you can test more than two variables at a time with as many levels as you like, and 2) you can also test for main effects. A main effect occurs when the frequencies for a variable differ among levels of that variable. For example, let’s assume that one of your variables was blue-eye color with two levels, blue and not-blue. Let’s also assume that the percentages from your sample were 44% blue and 56% not-blue. Remember that a sample provides estimates but not the true results. The log linear analysis for the main effect “blue-eye color” tests whether the actual proportions of blue to not-blue were really 50-50.

Background: Main effects in a 2X2 table
Let’s illustrate the main effect in a simple problem. You have noticed that there are two variations of a shrub species, one with smooth leaves and one with hairy leaves. You think that the presence of smooth leaves might be related to the presence of serpentine in the soil. You sample 120 random locations in Santa Clara County where the shrubs are found and you record the leaf type (hairy or smooth) and whether serpentine is present in the soil. You obtained the following data:

Table 6-1: Serpentine soil versus leaf type data for example of log-linear analysis

                     Serpentine Soil
Leaf type            Yes     No      Leaf Type Total
Hairy                12      36      48
Smooth               22      50      72
Serp. Soil Total     34      86      120

Analyzed as a 2X2 Test of Independence
Ho: Leaf type is independent of the presence or absence of Serpentine soil.
Alpha (α) = 0.025 (Why?)
Results: Because the p-value (Prob = 0.506) > α (0.025), you would accept Ho and conclude that leaf type is independent of the presence or absence of serpentine soil (Table 6-2).

Table 6-2: Results from Two-Way Crosstabs (Systat™ 10.0) for data in Table 6-1.
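As a rough cross-check of the probability quoted above, the test of independence for Table 6-1 can be worked by hand. The sketch below assumes the reported value comes from a likelihood-ratio (G) statistic with 1 degree of freedom; the function name g_test_2x2 is our own and is not a Systat command. It also assumes that no cell count is zero.

```python
import math

def g_test_2x2(table):
    """Likelihood-ratio (G) test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables are independent
            expected = row_totals[i] * col_totals[j] / n
            g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # With 1 degree of freedom the chi-square upper tail is erfc(sqrt(G/2))
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

# Table 6-1: rows = Hairy, Smooth; columns = Serpentine Yes, No
g, p = g_test_2x2([[12, 36], [22, 50]])
print(f"G = {g:.3f}, p = {p:.3f}")
```

Under these assumptions the sketch gives a probability of about 0.506, well above alpha = 0.025.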
Analyzed as a 2x2 Log linear
For this log linear example, there are three Ho’s instead of one:
1. Ho #1: There is no Leaf type*Serpentine soil interaction. This is the same thing as saying that Leaf type is independent of the presence or absence of Serpentine soil.
In addition, the following hypotheses can be tested if Ho #1 is accepted:
2. Ho #2: The proportion of sites with serpentine soil is equal to the proportion of sites without serpentine soil. Remember we are dealing with a sample here.
3. Ho #3: The proportion of sites with hairy leaves is equal to the proportion of sites with smooth leaves.

The Log linear test results:
1. Accept Ho for Ho #1: There is no interaction (p=0.506) between leaf type and the presence or absence of Serpentine soil (Figure 6-1). This means the leaf type is independent of the presence or absence of Serpentine soil, which is the same conclusion as for the 2x2 Test of Independence.
   Figure 6-1: Test Hypothesis that there is no interaction between leaf type and the presence or absence of Serpentine soil
2. Because we accepted Ho for the interaction, we go on to test the other Hos.
3. Reject Ho for Ho #2: The proportion of sites with serpentine soil is significantly different (p<0.001) from the proportion of sites without serpentine soil (Figure 6-2). From the table percentages (not shown here), it can be determined that the proportion of sites with serpentine soil (28.3%) is significantly (p<0.001) less than the proportion of sites without serpentine soil (71.7%).
   Figure 6-2: Tests for significance of Serpentine Soil type and Leaf type.
4. Accept Ho for Ho #3: The proportion of sites with hairy leaves is not significantly different (p=0.0279 with α=0.025) from the proportion of sites with smooth leaves (Figure 6-2). This means that you have no evidence that one leaf type is more abundant than the other (i.e., the difference in frequencies could have been a function of chance given your sample size).

How does Log Linear analysis work?
Log linear analysis involves testing a series of models. Log linear models are equations that contain terms made up of combinations of the variables used in the analysis plus a constant. The most complex model (i.e. the one that contains the most terms) is called the Fully Saturated Model. The terms are arranged in a hierarchy starting with all variables singly, then all combinations of two variables at a time, then all combinations of three variables at a time, etc., until all combinations have been exhausted.

In the preceding example there were two variables, Serpentine Soil (Yes or No) and Leaf Type (Hairy or Smooth). In this example, we will use “S” for the Serpentine Soil variable and “L” for the Leaf Type variable. The fully saturated model for the problem above is: CONSTANT+S+L+(S*L).

The models consist of terms, with each term representing a particular effect. In the model above there are three effects (terms) and a constant (included in all models for statistical reasons). The three effects are the Serpentine main effect (S), the Leaf type main effect (L) and the interaction between Serpentine and Leaf type (S*L). With the exception of CONSTANT, each effect or term in the model refers to a specific Ho.

The fully saturated model is used as a standard of comparison for all tests of that system because it explains all of the possible variation in the data. The idea behind Log linear analysis is to find the simplest model that does the same job as the fully saturated model. Simplified models are those with fewer effects or terms. To find the most simplified model, you do a series of Goodness of Fit type comparisons with the fully saturated model. If you throw out a term and the two models match, the term or effect was not important. If you throw out a term and the two models do not match, that term is important and should be included in the final simplified model.
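To make the idea of “doing the same job as the fully saturated model” concrete, here is a minimal sketch for the serpentine example. It assumes the comparison is a G (likelihood-ratio) statistic between the observed counts, which the fully saturated model reproduces exactly, and the frequencies expected under each simplified model. The helper names (expected, g_stat, chi2_sf) are ours and are not SPSS or Systat functions.

```python
import math

# Table 6-1 counts: rows = Leaf type (Hairy, Smooth), columns = Serpentine (Yes, No)
observed = [[12, 36], [22, 50]]
n = sum(map(sum, observed))
leaf_totals = [sum(row) for row in observed]
soil_totals = [sum(col) for col in zip(*observed)]

def expected(use_s, use_l):
    """Expected counts for a model without S*L; a main effect that is left
    out forces equal (50-50) proportions for that variable."""
    cells = []
    for i in range(2):
        p_leaf = leaf_totals[i] / n if use_l else 0.5
        cells.append([n * p_leaf * (soil_totals[j] / n if use_s else 0.5)
                      for j in range(2)])
    return cells

def g_stat(exp):
    """G statistic comparing the observed counts (saturated fit) with a model."""
    return 2.0 * sum(o * math.log(o / e)
                     for o_row, e_row in zip(observed, exp)
                     for o, e in zip(o_row, e_row))

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (standard library only)."""
    y = max(x, 0.0) / 2.0
    a, q = (1.0, math.exp(-y)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(y)))
    while 2 * a < df:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

# model name -> (include S?, include L?, df = 4 cells minus fitted parameters)
models = {"CONSTANT+S+L": (True, True, 1),
          "CONSTANT+S": (True, False, 2),
          "CONSTANT+L": (False, True, 2)}
fits = {}
for name, (use_s, use_l, df) in models.items():
    g = g_stat(expected(use_s, use_l))
    fits[name] = (g, df, chi2_sf(g, df))
    print(f"{name:13s} G = {g:7.3f}  df = {df}  p = {fits[name][2]:.4f}")

# Change in G when one main effect is deleted from CONSTANT+S+L (1 df each)
g_full = fits["CONSTANT+S+L"][0]
p_delete_s = chi2_sf(fits["CONSTANT+L"][0] - g_full, 1)
p_delete_l = chi2_sf(fits["CONSTANT+S"][0] - g_full, 1)
print(f"delete S: p = {p_delete_s:.4f}   delete L: p = {p_delete_l:.4f}")
```

Under these assumptions, the 1 df changes for deleting S*L, S and L line up with the probabilities reported earlier for Ho #1 to Ho #3 (0.506, <0.001 and 0.0279), which suggests the software reports the change in fit when a single term is removed.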
The order in which you throw out terms is important because, if an interaction is important, the main effects that make up the interaction are not tested on their own. This is because a significant interaction implies that the main effects affect each other, so you can’t make a simple statement about one main effect without dealing with the others. Therefore, you start the process by first throwing out the interactions. In our simple example, we would first throw out the S*L interaction. This would create the simplified model CONSTANT+S+L. We would then see if the simplified model does the same job as the fully saturated model. If it does, the term we threw out wasn’t important and we would accept Ho for that term. If it doesn’t, we would reject Ho for that term and keep it in the model.

IMPORTANT: If you do end up rejecting Ho for an interaction, when you put the term back in, you also must get rid of all lower order terms that could be made from the terms in the interaction. In this example, if we reject Ho for S*L, that term would be put back in the model but S and L would be taken out. When lower order terms are taken out in this way, you are NOT testing them; they are simply irrelevant. Why?

When there are no more terms to test, the remaining model is called the FINAL MODEL and is the simplest model that will do the same job as the fully saturated model.

For our specific example, when we threw out S*L and tested the simplified model CONSTANT+S+L, the results of the G-test indicated that we accept Ho. Therefore the interaction was not important (i.e. not significant). The next step would be to throw out one of the single (main) effects; it doesn’t matter which. Let’s start with the S term. If we throw out the S effect and compare the model CONSTANT+L to the fully saturated model, we will find that the two models do not match (we reject Ho). Therefore, the S effect is important and needs to be included in the final model.
Next, we would put the S effect back in and take out the L effect. We find that the model CONSTANT+S does NOT differ significantly from the fully saturated model. Therefore, the L effect is not important and we accept Ho for that term. If you had thrown out the L term first, you would have found a match with the fully saturated model and concluded that Leaf type was not important. Since there are no more terms to throw out, we have the final model: CONSTANT+S. We can then interpret the meaning of the final terms by looking at the totals for the levels. We would then reach the conclusions listed in the Analyzed as a 2x2 Log linear section.

Computing the test – Basic Steps
1) Determine what you are going to test.
2) Design the experiment.
   a. What are the variables?
   b. What are the levels for the variables?
   c. What analysis should you use?
   d. What planned comparisons do I want to make among the levels?
   e. What terms are there in the fully saturated model?
   f. What are all of the Hos and Has?
   g. What would it mean if you accept Ho?
   h. What would it mean if you reject Ho?
   i. How would you conduct the experiment?
   j. What statistical error should you avoid?
3) Collect data.
4) In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
5) For any significant effects, plot the percentages.
6) Conduct planned comparisons if you reject Ho for step 5. See pages 5-12 to 5-14 RxC Test of Independence.
7) Conduct any unplanned comparisons if you reject Ho for step 5. See pages 5-14 to 5-15 RxC Test of Independence.
8) Draw conclusion.
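Step 2e asks for the terms in the fully saturated model. The hierarchy described earlier (all variables singly, then every combination of two, then three, and so on) can be listed mechanically. The following is a small sketch; saturated_terms is a hypothetical helper name, and the variable names A, B and C are placeholders.

```python
from itertools import combinations

def saturated_terms(variables):
    """List every effect in the fully saturated model, lowest order first."""
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append("*".join(combo))
    return terms

two_way = saturated_terms(["S", "L"])
three_way = saturated_terms(["A", "B", "C"])
print("CONSTANT + " + " + ".join(two_way))    # CONSTANT + S + L + S*L
print("CONSTANT + " + " + ".join(three_way))
```

With k variables the list always contains 2^k - 1 effects in addition to the constant, so three variables give seven effects and seven Hos.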
EXAMPLE 1: 2x2 Log Linear analysis
We will use the same experiment and data as for the Banded and Unbanded snake patterns versus the presence or absence of brush (see EXAMPLE 1: 2x2 Test of Independence on Page 6-3).

1) Determine what you are going to test.
   We want to determine if the snake pattern is related to the presence or absence of brush.
2) Design the experiment.
   a. What are the variables?
      Snake and Brush
   b. What are the levels for the variables?
      Snake: Banded or Unbanded
      Brush: Present or Absent
   c. What analysis should you use?
      You are going to use a stepwise backward hierarchical 2x2 Log linear analysis.
   d. What planned comparisons do I want to make among the levels? See page 5-10 RxC Test of Independence.
      Because none of the variables have more than 2 levels, planned comparisons cannot be done.
   e. What terms are there in the fully saturated model?
      Constant + Snake + Brush + Snake*Brush
   f. What are all of the Hos and Has?
      i. Ho #1: whether a snake is banded or unbanded is independent of the presence or absence of brush (Snake*Brush interaction). Ha #1 is that whether a snake is banded or unbanded depends on the presence or absence of brush.
      ii. Ho #2: The proportion of sites with banded snakes is equal to the proportion of sites with unbanded snakes. Ha #2 is that the proportion of sites with banded snakes is NOT equal to the proportion of sites with unbanded snakes.
      iii. Ho #3: The proportion of sites with brush is equal to the proportion of sites without brush. Ha #3 is that the proportion of sites with brush is NOT equal to the proportion of sites without brush.
   g. What would it mean if you accept Ho?
      i. Accept Ho #1 would mean that there is no relationship or interaction between the presence or absence of brush and whether snakes are banded or unbanded. Also, it is ok to test Hypotheses 2 and 3.
      ii. Accept Ho #2 would mean that the proportion of banded snakes is not different from the proportion of unbanded snakes.
      iii. Accept Ho #3 would mean that the proportion of sites with brush is not different from the proportion of sites without brush.
   h. What would it mean if you reject Ho?
      i. Reject Ho #1 would mean that the presence or absence of brush does have some relationship to the presence of banded or unbanded snakes. Also, do NOT test hypotheses 2 and 3.
      ii. Reject Ho #2 would mean that the proportion of banded snakes is different from the proportion of unbanded snakes.
      iii. Reject Ho #3 would mean that the proportion of sites with brush is different from the proportion of sites without brush.
   i. How would you conduct the experiment?
      You will randomly sample sites until you find 180 sites with snakes. For each site, you will record whether or not brush was present and whether the snake was banded or unbanded.
   j. What statistical error should you avoid?
      Conclude that the worse error is Type I so alpha will equal 0.025.
3) Collect data

   Table 6-3: Frequency of banded/unbanded snakes and presence/absence of brush for 180 sites with snakes.

                  SNAKE
   BRUSH          Banded    Unbanded
   Absent         32        43
   Present        46        59

4) In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
   Use SPSS™ 10.0 to compute a stepwise backward elimination hierarchical log linear analysis (see page 7-11 for SPSS instructions).
   Figure 6-3: Final model for 2x2 log linear analysis - Example 1
   Figure 6-4: Test Hos for 2x2 log linear analysis - Example 1. Note that the output has been compressed here; normally there are items in between the interaction and the main effects.
   o Accept Ho #1. There is no relationship or interaction (p=0.879) between the presence or absence of brush and whether snakes are banded or unbanded.
     Also, it is ok to test Hypotheses 2 and 3.
   o Accept Ho #2. The proportion of banded snakes is not different (p=0.0732) from the proportion of unbanded snakes.
   o Accept Ho #3. The proportion of sites with brush is not different (p=0.025 with α=0.025) from the proportion of sites without brush.
5) For any significant effects, plot the percentages.
   No graphs needed.
6) Conduct planned comparisons if you reject Ho for step 5.
   None of the variables have more than 2 levels so there can be no planned comparisons.
7) Conduct any unplanned comparisons if you reject Ho for step 5.
   None of the variables have more than 2 levels so there can be no unplanned comparisons.
8) Draw conclusion
   The banding pattern doesn’t appear to have anything to do with the presence of brush in the environment.

EXAMPLE 2: 2x2x2 Log Linear analysis
We will now learn how to do a Log linear analysis with 3 variables. You are exploring the relationship between a color morph of lizard (light and dark), the type of ground (sand or dirt) and the presence or absence of shade.

1) Determine what you are going to test.
   We want to determine if the color morph of the lizard is related to the type of ground and/or the presence or absence of shade.
2) Design the experiment.
   a. What are the variables?
      Morph, Ground and Shade
   b. What are the levels for the variables?
      Morph: Light or Dark
      Ground: Sand or Dirt
      Shade: Present or Absent
   c. What analysis should you use?
      You are going to use a 2x2x2 stepwise backward hierarchical log linear analysis.
   d. What terms are there in the fully saturated model?
      Constant
      + Morph + Ground + Shade (main effects)
      + Morph*Ground + Morph*Shade + Ground*Shade (two-way interactions)
      + Morph*Ground*Shade (three-way interaction)
   e. What are all of the Hos and Has? (The Has are not included here.)
      i. Ho #1: there is no interaction between lizard morph, ground and shade.
      ii. Ho #2: there is no interaction between lizard morph and ground.
      iii. Ho #3: there is no interaction between lizard morph and shade.
      iv. Ho #4: there is no interaction between ground and shade.
      v. Ho #5: the proportion of dark lizard morphs is equal to the proportion of light lizard morphs.
      vi. Ho #6: the proportion of dirt sites is equal to the proportion of sand sites.
      vii. Ho #7: the proportion of shaded sites is equal to the proportion of unshaded sites.
   f. What would it mean if you accept Ho?
      i. Accept Ho #1 would mean that there is no relationship or interaction between the three variables. Also, you can test the two-way interactions.
      ii. Accept Ho #2 would mean that there is no relationship between lizard morph and ground. Also, you can test the lizard morph and ground main effects.
      iii. Accept Ho #3 would mean there is no relationship between lizard morph and shade. Also, you can test the lizard morph and shade main effects.
      iv. Accept Ho #4 would mean there is no relationship between ground and shade. Also, you can test the ground and shade main effects.
      v. Accept Ho #5 would mean that the proportions of dark and light lizard morphs are equal.
      vi. Accept Ho #6 would mean that the proportions of dirt and sand sites are equal.
      vii. Accept Ho #7 would mean that the proportions of shaded and unshaded sites are equal.
   g. What would it mean if you reject Ho? (not included here).
   h. How would you conduct the experiment?
      You will randomly sample sites until you find 885 sites with lizards. For each site, you will record whether the lizard was dark or light, whether the ground was dirt or sand and whether or not shade was present.
   i. What statistical error should you avoid?
      Assume that you concluded that the worse error is Type II so alpha will equal 0.050.
3) Collect data

   Table 6-4: Frequency of lizard morphs, ground types and shade for 885 sites with lizards. Number value of variable in ().
                  LIZARD COLOR:  Light (0)   Light (0)   Dark (1)   Dark (1)
   SHADE          GROUND:        Sand (0)    Dirt (1)    Sand (0)   Dirt (1)
   Absent (0)                    231         81          57         177
   Present (1)                   136         37          47         119

4) In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
   Use SPSS 10.0 to compute a stepwise backward elimination hierarchical log linear analysis (see page 7-11 for instructions).
   Figure 6-5: Final model for 2x2x2 log linear analysis - Example 2
   Figure 6-6: Test Hos for 2x2x2 log linear analysis - Example 2. Note that the output has been compressed here; normally there are items in between the interaction and the main effects.
   o Accept Ho #1. There is no relationship or interaction (p=0.877) between lizard morph (dark or light), ground type (dirt or sand) and shade (present or absent). Also, it is ok to test two-way interactions.
   o Reject Ho #2. There is a significant (p<0.001) interaction between lizard morph (dark or light) and ground type (dirt or sand). Need to graph this interaction. Cannot test lizard morph or ground type main effects.
   o Accept Ho #3. There is no relationship or interaction (p=0.076) between lizard morph (dark or light) and shade (present or absent). Also, it is ok to test shade main effect.
   o Accept Ho #4. There is no relationship or interaction (p=0.153) between ground type (dirt or sand) and shade (present or absent). Also, it is ok to test shade main effect.
   o Ho #5 can’t be tested.
   o Ho #6 can’t be tested.
   o Reject Ho #7. The proportion of shaded sites is significantly different (p<0.001) from the proportion of unshaded sites.
5) For any significant effects, plot the percentages.
   a. Lizard morph*Ground type interaction.
      [Figure 6-7: Significant Lizard morph (dark or light)*Ground type (sand or dirt) interaction. Bar chart of Percent (0% to 80%) by Lizard Morph (Light, Dark), with separate bars for Sand and Dirt.]
   b. Shade main effect.
      [Figure 6-8: Significant Shade main effect. Bar chart of Percent (0% to 70%) for Shade and No shade.]
6) Conduct planned comparisons if you reject Ho for step 5.
   None of the variables have more than 2 levels so there can be no planned comparisons.
7) Conduct any unplanned comparisons if you reject Ho for step 5.
   None of the variables have more than 2 levels so there can be no unplanned comparisons.
8) Draw Conclusions
   o There is no relationship (p=0.877) between lizard morph (dark or light), ground type (dirt or sand) and shade (present or absent).
   o There is a significant (p<0.001) interaction between lizard morph (dark or light) and ground type (dirt or sand). The proportion of light morphs is greater in sandy areas and the proportion of dark morphs is greater in dirt areas (Figure 6-7).
   o There is no relationship (p=0.076) between lizard morph (dark or light) and shade (present or absent).
   o There is no relationship (p=0.153) between ground type (dirt or sand) and shade (present or absent).
   o The proportion of shaded sites is significantly less (p<0.001) than the proportion of unshaded sites.

Using SPSS 10.0: 2x2x2 Log linear Analysis
You are exploring the relationship between a color morph of lizard (light and dark), the type of ground (sand or dirt) and the presence or absence of shade. You have collected the following data:

Enter The Data
1) When you start the SPSS program, you will be faced with the window depicted in Figure 6-10.
2) Select “Type in data” and click on “OK”.
3) At the bottom of the next window (Figure 6-9), click on the “Variable View” tab.
   Figure 6-10: First Window in SPSS™ 10.0
   Figure 6-9: Tabs at bottom of editor window
4) Enter the first variable name, “morph”, in the NAME column and enter “Lizard Morph” in the LABEL column (Figure 6-11).
   Then click on the VALUES column and then on the gray box in the VALUES column.
   Figure 6-11: Creating a variable called Morph.
9) The next window (Value Labels) allows you to specify labels for numerical values. We are going to specify that 0 = “Light” and 1 = “Dark”. Enter 0 for the VALUE and “Light” for the VALUE LABEL; then click on ADD. Enter 1 for VALUE and “Dark” for the VALUE LABEL; then click on ADD. Finally click on OK.
   Figure 6-12: Value labels window.
10) Add another variable “ground” with a label of “Ground Type” and values (0=Sandy and 1=Dirt).
11) Add another variable “shade” with a label of “Shade” and values (0=Absent, 1=Present).
12) Add another variable “freq” with a label of “Frequency”.
13) Click on the DATA VIEW tab (Figure 6-9).
14) Table 6-5: Frequency of lizard morphs, ground types and shade for 885 sites with lizards. Number value of variable in ().

                                   SHADE
    LIZARD COLOR   GROUND          Absent (0)   Present (1)
    Light (0)      Sand (0)        231          136
    Light (0)      Dirt (1)        81           37
    Dark (1)       Sand (0)        57           47
    Dark (1)       Dirt (1)        177          119

15) Select VALUE LABELS from the VIEW pull-down menu.
16) Enter 0 for the “morph” variable, 0 for the “ground” variable, 0 for the “shade” variable and 231 for the frequency.
17) Continue entering all of the data. Your data should look like Figure 6-13.
    Figure 6-13: Data for 2x2x2 stepwise backward hierarchical log linear analysis

Let SPSS know that you have entered frequency data
1) Select WEIGHT CASES from the DATA pull-down menu.
2) Click on Frequency in the left-hand box (Figure 6-14). Click on the WEIGHT CASES BY option and then click on the arrow. “Frequency” should appear in the FREQUENCY VARIABLE box.
   Figure 6-14: Frequency variable window.

Create a new draft output window
1) From the FILE pull-down menu, select NEW and then select DRAFT OUTPUT.

Run the log linear analysis
1) From the ANALYZE pull-down menu, select LOGLINEAR and then select MODEL SELECTION.
2) You will see the window illustrated in Figure 6-15.
   Click on “Lizard Morph” and then on the arrow to select Lizard Morph as a FACTOR. Select “Ground Type” and “Shade” as FACTORS.
   Figure 6-15: Loglinear analysis window
3) Next, click on “morph” in the FACTOR(S) box and then click on the DEFINE RANGE button. Specify 0 as the MINIMUM and 1 as the MAXIMUM. Click on CONTINUE.
4) Do the same for “ground” and “shade”.
5) Finally click on OK to run the analysis.

Output
1) First examine the total number of observations. It should match your total number.
2) Next, examine the frequencies to make sure you entered your data correctly.
3) K-way tests. These tests tell you the levels of terms that are likely to be significant. In this case, there are some 1-way (main) effects (p<0.0001) and some 2-way interactions (p<0.0001) but no 3-way interactions (p=0.8776) (Figure 6-16).
   Figure 6-16: Test for K-way effects.
5) Next go almost to the end of the output to find the final model (Figure 6-17). In this case there are two terms in the final model: the Morph*Ground 2-way interaction and the Shade main effect.
   Figure 6-17: Final model for 2x2x2 stepwise backward hierarchical log linear analysis
6) Fill in the following table by examining the output from the bottom to the top.

   If Deleted Simple Effect is    Df     L.R. Chisq Change    Prob
   MORPH*GROUND*SHADE             1      0.024                0.8774
   MORPH*GROUND                   1      226.544              0.0000
   MORPH*SHADE                    1      3.149                0.0760
   GROUND*SHADE                   1      2.044                0.1528
   MORPH                          N/A
   GROUND                         N/A
   SHADE                          1      48.868               0.0000

On Your Own
Problem: You are interested in determining if there are relationships between the class of cars people buy (economy, standard, sport, luxury), gender and whether or not they are retired. You especially want to know if people’s preferences for low cost (economy and standard) versus expensive (sport and luxury) cars are dependent upon gender and/or whether or not they are retired. If you find differences, you will invest more time and energy in trying to find out why.
Data: For this problem you will use the dataset: Car Data.SAV

1) Determine what you are going to test.
2) Design the experiment.
   a. What are the variables?
   b. What are the levels for the variables?
   c. What analysis should you use?
   d. What planned comparisons do I want to make among the levels? (Note: make as many as possible.)
   e. What terms are there in the fully saturated model?
   f. What are all of the Hos and Has?
   g. What would it mean if you accept Ho?
   h. What would it mean if you reject Ho?
   i. How would you conduct the experiment?
   j. What statistical error should you avoid?
3) Use the Car Data.SAV file.
4) In sequence, test most complex (more terms) model to least complex (simplified, with fewer terms) model for goodness of fit to the fully saturated model. If a simplified model fits the fully saturated model, it is doing the same job as the fully saturated model and all terms not included in the simplified model are not statistically significant.
5) For any significant effects, plot the percentages.
6) Conduct planned comparisons if you reject Ho for step 5. See pages 5-12 to 5-14 RxC Test of Independence.
7) Conduct any unplanned comparisons if you reject Ho for step 5. See pages 5-14 to 5-15 RxC Test of Independence.
8) Draw conclusion.