Scatter Diagrams

A scatter diagram (also known as a scatter plot) is a graphic representation of the relationship between two variables. It helps us visualize the apparent relationship between two variables that are plotted in pairs. In the Six Sigma quality improvement DMAIC methodology, scatter diagrams are usually used to explore relationships in the Analyze Phase. They help verify potential root causes, on the premise that a change in the cause (the X) will produce a change in the effect (the Y). Although we would like to claim causation, a scatter diagram alone only supports a claim of correlation.

The Analyze Phase in DMAIC is essentially a fact-based search for cause-effect relationships based on the ideas formulated in the Measure Phase. We start with the symptom of a problem, the measurable “effect” (the Y). Next, through the use of the cause-effect diagram, we theorize about the possible “causes” (the Xs). Then we collect data and search for those possible causes that have the strongest influence on the effect. If we can eliminate or control these causes, we will eliminate or control the effect; the symptom and the problem will be gone.

While the cause-effect diagram helps a team develop theories about possible causes, the scatter diagram helps them analyze data to verify or disprove those theories. The scatter diagram is an ideal way to display data when an improvement team is trying to evaluate a cause-effect relationship of paired Y and X data. Scatter diagrams are best suited to paired data in which both the Y and the X are continuous. [Note: Scatter diagrams can also be used with ranked data and certain discrete Xs but we’ll discuss that another time.]

Because the data on cause-effect relationships almost always display variation, the scatter diagram is better than a simple table of numbers for summarizing information. The graphic nature of the scatter diagram helps a team to “see” the relationships between the variables.
To be successful in constructing and analyzing scatter diagrams you will need a good theory, correctly paired data, accuracy, complete information, and representative data. You must also be aware of the potential pitfalls, including stratification, range of the data, range of operation, effect of scale, numerical summaries, confounding factors, correlation without physical understanding, and data problems.

Visual interpretation of scatter diagrams provides a useful, but sometimes limited, analysis of the relationship between two variables. If a team is examining many cause-effect relationships simultaneously, they may find it difficult to determine which has the strongest correlation. Calculating the correlation coefficients provides a useful enhancement to the scatter diagrams in these situations. This correlation coefficient is known as Pearson’s r. In other cases, a team may need a more precise, mathematical description of the relationship between the variables (i.e., finding the descriptive equation for the “cause” variable to produce a desired “effect”). In these situations, a regression analysis must be performed to enhance the scatter diagram.

All Rights Reserved, Juran Institute, Inc.

Typical Patterns of Correlation are shown and described below:

Strong Positive: If one variable increases at the same time the other variable increases, they are said to be positively correlated.

Strong Negative: If one variable decreases at the same time the other variable increases, or vice versa, they are said to be negatively correlated.

Complex: The data points are scattered in a curved pattern. The shape may look like a rainbow or an arch. The two variables are correlated, though not linearly. As X increases, Y first increases, then it decreases (or vice versa).

Weak Relationships: A weak correlation does not necessarily mean that the factor being studied is not a cause.
It may simply be a weak cause or a cause that requires the presence of another contributing factor to bring about the effect. In this latter case, both the factor under study and the contributing factor are perfectly good causes; you just need them both to be active simultaneously to get the effect.

No Relationship: The data points are scattered in a shapeless pattern. You can conclude that the two variables are not correlated over the ranges for which the data were collected.

Example: A financial services company that serves the “middle market” of investors had a team improve service to its customers in order to increase its market share of assets under management. The team had already observed that there was wide variation among their account executives in the amount of new business. Now what do you think of the competing theories? What else should the team do?

For hands-on practice, the reader can copy and paste this data set into MINITAB®:

New Business   Number   Size
       10282       78    132
       12279      147     84
       16702      217     77
       10277      106     97
       10844      138     79
        9387      127     74
       15593      121    129
       12792       91    141
       13977      158     88
       10954      121     91
        8074       57    142
        6433      129     50
       16856      149    113
       17962      122    147
        7008       25    280
       15804      125    126
       14157       89    159
        9589       38    252
        6688      107     63
        9380       77    122
       16174      178     91
       11382       62    184
       17190      168    102
        5248       97     54
       13140      145     91
       18102      138    131
       13609      189     72
        6466       40    162
       12740      150     85
       12492      153     82
        8790       80    110
       11736       86    136
        8598      120     72
        8707       97     90
       14957      201     74
       10262       82    125
       12042       56    215
       11362      127     89
        6927      105     66
       12653      107    118
       13331       94    142
       12250      150     82
        9825      102     96
       13953      118    118
        4163       39    107
       12104       64    189
       13740       80    172
       12588       75    168
       12088      181     67
       10578       89    119
       12039      104    116

Using MINITAB®:

Select: Graph > Scatterplot > Simple
Y Variables: New Business
X Variables: Number

Select: Graph > Scatterplot > Simple
Y Variables: New Business
X Variables: Size
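For readers without MINITAB®, the scatterplot step can be roughly approximated in plain Python. The sketch below is only an illustration and not part of the original example: the function name text_scatter, the grid dimensions, and the choice of a character grid instead of a plotting library are all assumptions, and only the first ten rows of the data set are used to keep the listing short.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a crude character-grid scatter diagram of paired (x, y) data."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Guard against a zero range so the scaling below never divides by zero.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        # The first grid row is printed first, so flip the index to put large Y on top.
        grid[height - 1 - row][col] = "*"
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


# First ten rows of the example data set (X = Number, Y = New Business).
number = [78, 147, 217, 106, 138, 127, 121, 91, 158, 121]
new_business = [10282, 12279, 16702, 10277, 10844,
                9387, 15593, 12792, 13977, 10954]

print("New Business (vertical) versus Number (horizontal)")
print(text_scatter(number, new_business))
```

A character grid loses resolution, since points that fall in the same cell overlap, so it is suitable only for a quick look at the overall pattern; a proper plotting tool is preferable for analysis.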
There appears to be a strong positive relationship (positive correlation) between New Business and Number. There does NOT appear to be any relationship between New Business and Size. We can confirm this numerically next.

The most common way to measure association is using the Correlation Coefficient. MINITAB® uses the Pearson Product Moment Correlation Coefficient.

The Correlation Coefficient r:
Always falls between -1 and +1.
It is positive if the value of one variable tends to increase as the other increases.
It is negative if the value of one variable tends to increase as the other decreases.

A Positive Correlation: Occurs when the values of both variables move in the same direction. As one goes up, so does the other. As one goes down, so does the other.

A Negative Correlation: Occurs when the value of one variable increases while the other decreases.

We can test the statistical significance of the correlation. The Correlation Test is based on the hypotheses:

H0: There is no relationship between X and Y
Ha: There is a relationship between X and Y

As such, the p-value may be used to evaluate the significance of the relationship: If p-value ≤ α, reject the null (in other words, the relationship is significant).

In the example, select Stat > Basic Statistics > Correlation, then select all 3 variables.

The printout in the Session window displays:

Correlations: New Business, Number, Size

          New Business   Number
Number           0.578
                 0.000

Size             0.033   -0.698
                 0.818    0.000

Cell Contents: Pearson correlation
               P-Value

We can conclude that there is a statistically significant correlation between New Business and Number (p-value = 0.000, reject the null hypothesis), with a positive correlation coefficient of 0.578. There is no statistically significant correlation between New Business and Size (r = 0.033, p-value = 0.818). Both of these results confirm the scatter plots shown earlier.
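The correlation step can also be sketched outside MINITAB®. The following Python listing is a minimal illustration, not the method MINITAB® uses internally: it computes Pearson's r from its textbook definition, r = Sxy / sqrt(Sxx * Syy), and the usual test statistic t = r * sqrt((n - 2) / (1 - r^2)), which is compared with a t distribution on n - 2 degrees of freedom under the null hypothesis. The function names pearson_r and t_statistic are hypothetical, and the listing assumes the three data columns pair up row by row. It uses only the standard library, so it stops at the t statistic; turning t into a p-value requires the t distribution, which a statistics package (for example scipy.stats.pearsonr) can supply.

```python
import math

# Example data; pairing the three columns row by row is an assumption.
new_business = [
    10282, 12279, 16702, 10277, 10844, 9387, 15593, 12792, 13977, 10954,
    8074, 6433, 16856, 17962, 7008, 15804, 14157, 9589, 6688, 9380,
    16174, 11382, 17190, 5248, 13140, 18102, 13609, 6466, 12740, 12492,
    8790, 11736, 8598, 8707, 14957, 10262, 12042, 11362, 6927, 12653,
    13331, 12250, 9825, 13953, 4163, 12104, 13740, 12588, 12088, 10578,
    12039,
]
number = [
    78, 147, 217, 106, 138, 127, 121, 91, 158, 121,
    57, 129, 149, 122, 25, 125, 89, 38, 107, 77,
    178, 62, 168, 97, 145, 138, 189, 40, 150, 153,
    80, 86, 120, 97, 201, 82, 56, 127, 105, 107,
    94, 150, 102, 118, 39, 64, 80, 75, 181, 89,
    104,
]
size = [
    132, 84, 77, 97, 79, 74, 129, 141, 88, 91,
    142, 50, 113, 147, 280, 126, 159, 252, 63, 122,
    91, 184, 102, 54, 91, 131, 72, 162, 85, 82,
    110, 136, 72, 90, 74, 125, 215, 89, 66, 118,
    142, 82, 96, 118, 107, 189, 172, 168, 67, 119,
    116,
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0 (no relationship); refer to t with n - 2 d.f."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("requires n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


pairs = {
    "New Business vs Number": (new_business, number),
    "New Business vs Size": (new_business, size),
    "Number vs Size": (number, size),
}
for label, (xs, ys) in pairs.items():
    r = pearson_r(xs, ys)
    print(f"{label}: r = {r:.3f}, t = {t_statistic(r, len(xs)):.2f}")
```

If the row-by-row pairing assumption holds, the r values printed here should be comparable to the Pearson correlations in the Session window printout; a large absolute t corresponds to a small p-value.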