An Elephant Never Forgets – Effective Analogies for Teaching Statistical Modeling

Michael A. Martin

Research School of Finance, Actuarial Studies and Applied Statistics, Australian National University, Canberra, ACT 0200, AUSTRALIA

Abstract

Analogies are useful and potent tools for introducing new topics in statistics to students. Martin (2003, 2008) considered the case for teaching with analogies in introductory statistics courses, and also gave many examples of particular analogies that had been successfully used to make difficult statistical concepts more accessible to students. In this article, we explore more deeply analogies for statistical concepts from more advanced topics such as regression modeling and high-dimensional data.

Introduction

Many students approach their statistics classes with trepidation, perhaps because many of the concepts they encounter seem so foreign. Yet, despite a lexicon steeped in jargon and technical expressions, much statistical thinking has its basis in ideas with which most students will already be familiar – the trick for statistics educators, it seems, is to bridge the gap between existing, familiar ideas and new, forbidding ones. Analogy is an effective tool for bridging this gap, with some particularly evocative uses including the alignment between statistical hypothesis testing and the process of a criminal trial (Feinberg, 1971; Bangdiwala, 1989; Brewer, 1989; among many others), and the idea of a sample mean as a "balance point" for the data (Moore, 1997, p.262, as well as on the cover of the text).
These famous examples leverage the key features of analogical thinking:

access – the relevant source idea must be retrieved from memory

mapping – alignment between elements (both objects and relationships) in the source and the target must be identified

evaluation and adaptation – the appropriateness of the mappings needs to be assessed and adapted where necessary to account for critical features of the target

learning – the target is understood, and new knowledge and relevant items and relationships are added to memory. The "transfer" from old to new domains is completed, and the new situation can be accessed without reference to the source domain.

These elements are described in detail in the monograph by Holyoak and Thagard (1995), which presents a comprehensive, modern overview of analogical thinking. Martin (2003) explored the use of analogies in teaching statistics and offered many examples of analogies that had been effectively used in his statistics classes, including the legal analogy for hypothesis testing and the balance-point analogy for the average. Martin later presented this work at the OZCOTS 2008 conference (Martin, 2008). In the original 2003 paper and in the OZCOTS presentation, Martin focused on analogies and examples useful in a first course in statistics – the critical time when students first encounter our "mysterious" discipline. In this article, we consider examples and analogies specific to statistical concepts from more advanced topics in regression modeling and high-dimensional data analysis. We explore in more detail the mappings – for both items and relationships – that exist between the source and target ideas, critique the strengths and weaknesses of the analogies, and offer some new ideas that have been found useful in describing these more advanced topics.
In describing and critiquing the examples below, we utilize the "teaching-with-analogies" framework developed by Glynn (1991) (see also Glynn and Law, 1993; Glynn, 1994, 1995, 1996; and Glynn, Duit and Thiele, 1995, for further discussion and refinements). This framework identifies six steps: introduce the target concept; access the source analog; identify relevant features of both source and target; map elements and relationships between source and target; assess breakdowns of the analogy; adapt and conclude. These six steps essentially give form to the four features (access, mapping, evaluation, learning) listed above, and allow the construction of powerful analogs for thinking and learning. This article is designed to be read in combination with the earlier article by Martin (2003), in which a formal argument is made supporting the use of analogies in teaching statistics, so the focus of this article is principally descriptive. The focus of that paper was largely on analogies for teaching a first course in statistics, while this article gives more consideration to, and provides more detail for, topics covered in a later course on statistical modeling.

Analogies for Describing Regression Modeling

Martin (2003) introduced several analogies useful in the context of describing regression models. We explore some of these examples in greater detail here, including a couple of analogies not included in the 2003 article or the OZCOTS presentation (Martin, 2008).

Analogy 1: Signal-to-noise and F-ratios

Most students become familiar with hypothesis testing by considering tests for means and proportions, and so come to associate testing with the location-scale structure of Z and t tests. Similarly, in regression contexts, tests for coefficients also work in this familiar way. So, when F tests are introduced, the immediate reaction is that this test is somehow different, as it is now based on a ratio rather than a scaled difference.
Worse still, that ratio is "tampered with" through degrees of freedom adjustments! To motivate the use of a ratio-based test statistic, the analogous concept of a "signal-to-noise" ratio is a useful one. Almost without exception, students now use Wi-Fi technology every day, so the idea that signals emanate from some central server and are then degraded by noise as the wireless device moves further from the source is a very familiar one. Most devices measure "signal strength" using bars – a rudimentary graphical display. The idea of a signal-to-noise ratio is thus a natural one, as is the further idea that as the signal-to-noise ratio drops, the ability of the receiver to satisfactorily recover the true signal drops with it. In this analogy, the correspondence between objects is strong (signal/model; noise/error), and a key relationship (the use of a multiplicative measure of distance) also holds. As a result, the analogy has strong appeal and good memorability. On the other hand, there are some unmapped elements: in the Wi-Fi example, the notion of distance from the server is not represented in the target domain, and the role played by degrees of freedom in the F test has no direct map back to the source domain. As a result, the map is incomplete, but good enough to serve to motivate further discussion. The way that degrees of freedom impact the definition of the F test statistic is often a difficult one for students to understand. To elucidate this idea, one approach that has been successful is the notion of "a fair fight".
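Concretely, for simple linear regression the "signal" and "noise" are the model and error sums of squares, each scaled by its degrees of freedom before the ratio is taken. The short sketch below uses invented data purely for illustration:

```python
import numpy as np

# Invented data: a straight line plus fixed "noise".
x = np.arange(10.0)
y = 2.0 * x + 1.0 + np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5])
n = len(x)

# Least-squares fit of the line.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x

ss_model = np.sum((fitted - y.mean()) ** 2)  # the "signal"
ss_error = np.sum((y - fitted) ** 2)         # the "noise"

# Scale each sum of squares by its degrees of freedom (1 and n - 2),
# then take the signal-to-noise ratio.
F = (ss_model / 1.0) / (ss_error / (n - 2))
```

With a strong signal and little noise, as here, the ratio is huge; shrink the slope or inflate the noise and F collapses toward 1, just as a weakening Wi-Fi signal loses its bars.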
In comparing the signal with the noise, we wish to make this comparison as "fair" as possible, but the numerator in the F statistic is based on only one piece of information (the location of the line), while the denominator is based, essentially, on n – 2 pieces of information (this having been established when degrees of freedom were discussed), and so in order to make the comparison "fair" we must scale each of the numerator and denominator by the number of pieces of information on which each is based. This argument seems to resonate with some students, though the idea attempts to use knowledge about degrees of freedom that may be too "new" for students to readily access initially. This problem leads to inevitable questions: why is the line based on only one piece of information? Why n – 2? Why isn't the regular ratio (unadjusted by degrees of freedom) good enough? These are tough questions – and the analogy is not strong enough to provide accessible answers. Of course, the questions have reasonable answers, but the answers lie outside the map implied by the analogy. The double-edged sword of analogies remains that while they can produce in students that "eureka" moment, when the map is incomplete they can instead produce frustration.

Analogy 2: The undiscovered island and partitioning variability with sequential sums of squares

Martin (2003) introduced the analogy of the "undiscovered island" to explain how the order in which variables are fit in a model changes their sequential sums of squares. Here, the analogy is explored more deeply, with a view to more clearly incorporating the notion of multicollinearity and its effect on the sequential breakdown of explained variation in the analysis of variance. The analogy describes an uninhabited, unexplored island in the days of the great exploration of the oceans by colonial powers. The source idea is that the exploration and the claiming of territory depended critically on which explorer arrived first.
So, as explorers arrive at the island one after another, they are only able to explore and claim territory that has not already been claimed. Further, some parts of the island are impenetrable jungle, so some territory cannot be explored (remembering that the colonial powers did not have access to Google Earth!). Mapped objects (source/target) exist in both domains (explorers/covariates; explored territory/explained variation; impenetrable jungle/unexplained variation; sequence of arrival of explorers/sequence of fit in model). The map is fairly strong, and the story sufficiently engaging that students can readily transfer the idea from the source domain to the target domain. Further, other notions such as multicollinearity and marginal explanatory power can also be integrated into the analogy with strongly mapped elements. Panels A and B, below, show how two great explorers coming from the same direction can each look “marginal” if they happen to arrive after the other one. This situation is an analog of two “good” variables that are roughly collinear – i.e. they are both carrying much the same information – so, the order in which they are fit determines which of them seems most important in terms of explaining variation in the response. Panel C shows the situation when variables (explorers) are roughly independent (coming from completely different directions) – in this case, the order of fit (arrival) doesn’t matter as the way the variation is explained (island is partitioned) does not change. In either case, the total amount of explored island (variation explained) is the same irrespective of the order of arrival (order of fit), so the fitted model itself does not change, only the way that the territory has been divided up (explained variation has been partitioned). 
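The island's bookkeeping can be verified directly. In the sketch below (invented data; X2 is nearly a copy of X1, i.e. arriving from the same direction), whichever variable is fit first claims almost all of the shared territory, yet the total explained variation is identical under both orders:

```python
import numpy as np

def explained_ss(y, *columns):
    """Explained (regression) sum of squares for y on an intercept plus columns."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((fitted - y.mean()) ** 2)

x1 = np.arange(12.0)
x2 = x1 + np.tile([0.1, -0.1], 6)                  # nearly collinear with x1
y = x1 + x2 + np.tile([0.5, -0.5, 0.2, -0.2], 3)   # fixed "noise"

# Sequential sums of squares under the two orders of arrival.
first_x1 = (explained_ss(y, x1), explained_ss(y, x1, x2) - explained_ss(y, x1))
first_x2 = (explained_ss(y, x2), explained_ss(y, x1, x2) - explained_ss(y, x2))
```

Whichever explorer lands second claims only a sliver of territory, but the sum of the two claims – the explored part of the island – is the same either way.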
Panel A: X1 and X2 come from roughly the same direction (collinear); X1 arrives first (gains the largest sequential sum of squares), leaving little for X2.

Panel B: X1 and X2 come from roughly the same direction (collinear); X2 arrives first (gains the largest sequential sum of squares), leaving little for X1.

Panel C: X1 and X2 come from opposite sides (roughly independent); order of arrival does not materially affect territory claimed (sums of squares).

Analogy 3: Symptoms versus disease and leverage versus influence

Perhaps the most frustrating experience when teaching regression is that of having students confuse high leverage with influence. Over many years, this one concept seems to have been the hardest of all to reliably communicate in my regression courses. Why this should be is hard to pinpoint, as the distinctions could hardly have been made clearer, with many facts and examples used as evidence of the difference between the two concepts. For instance, leverage is a function only of the covariates, not the response, so it simply cannot be the same as influence, which must, necessarily, involve some consideration of the response variable. Yet in almost every assessment item, students routinely (and cheerfully) declared points with high leverage to be influential. For many years, despite my intense efforts to clarify this issue, the confusion between these two concepts continued – until I began using a simple but powerful analogy that has radically addressed this confusion. In the analogy, high influence is aligned with a disease, while high leverage is aligned with a symptom of that disease. The analogy is strong because along with the strong map between objects in the analogy, there is also a strong map of structural relationships from the source domain to the target domain. In general, in human health, symptoms are fairly readily detected, just as leverage is easily calculated.
Disease, on the other hand, may be hard to detect directly, and so it often is with high influence. Diseases are often signaled by symptoms, just as high influence is often signaled by high leverage, but, as everybody knows, a sneeze may be a sign of a cold; to actually detect the virus causing a cold would require a visit to a laboratory, and, in any event, not every sneeze is associated with a cold. In this way, the tendency I noticed for students to declare a point as influential in many ways resembled the tendency for people to declare that they had a cold when, in fact, they were simply sneezing! Even more complex concepts such as masking are well accommodated within the analogy – the presence of a disease may well be masked by the presence of additional symptoms beyond those classically associated with the disease. Since I began using this analogy, the tendency for points to be routinely assessed as having high influence simply because they have high leverage has dropped markedly. Just as the realization that symptoms and diseases are associated but not synonymous is part of what people commonly understand, now this realization has been transplanted to the context of leverage and influence. Even more powerfully, the context of both the source domain and the target domain is diagnostic, the former medical and the latter statistical, making this a very appealing analogy.

Analogy 4: Competition between sporting teams and combining p-values

A very common sight within journals in just about any field is a large table with columns labeled "variable", "estimate", "SE", "t-value" and "p-value", under which sits row after row of figures, typically festooned down the right-hand edge with an array of daggers and stars representing significance at 10%, 5% and 1% for each of the listed variables.
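Returning to Analogy 3 for a moment, the symptom/disease distinction can be checked numerically. In the sketch below (invented data; the helper name is mine), the same far-out x-value shows the "symptom" (high leverage) in both scenarios, but the "disease" (high influence, measured here by Cook's distance) appears only when the point also falls away from the trend:

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Leverage (hat) values and Cook's distances for simple linear regression."""
    X = np.column_stack([np.ones_like(x), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix: depends on x only
    h = np.diag(H)
    e = y - H @ y                             # residuals
    p = X.shape[1]
    s2 = np.sum(e ** 2) / (len(x) - p)        # residual mean square
    D = e ** 2 / (p * s2) * h / (1 - h) ** 2  # Cook's distance
    return h, D

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 20])   # last x is far from the others
wiggle = np.array([0.2, -0.1, 0.1, -0.2, 0.3, -0.3, 0.1, -0.1, 0.2, 0.0])

y_sneeze = 2 * x + wiggle       # outlying x, but the point follows the trend
y_cold = y_sneeze.copy()
y_cold[-1] -= 10.0              # same x, but the point is pulled off the trend

h, D_sneeze = leverage_and_cooks(x, y_sneeze)
_, D_cold = leverage_and_cooks(x, y_cold)
```

In both cases the leverage of the last point is large (about 0.79, against an average of p/n = 0.2), but its Cook's distance is negligible in the first scenario and dominant in the second: a sneeze is not always a cold.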
This table is typically the result of a model fitting exercise, the ultimate intention being to make a judgment – and choice – of which explanatory variables are important in describing the response variable, and in many instances, this choice is made by simply retaining those variables "starred" in the table and removing the others as extraneous. What this exercise does not explicitly take into account, of course, is that the multiple tests on which this aggregate judgment is made cannot be simply combined to produce this outcome, because each individual test is marginal – that is, each test is based on an assumption that is incompatible with every other test. At first blush, most students find this situation utterly confusing: if the first line in the table suggests that b1 is plausibly zero, and the second line in the table suggests that b2 is plausibly zero, why can't I just take both of the corresponding variables out of the model? The answer lies, of course, in considering the actual models being compared in each of the tests being conducted (this discussion leaving aside for the moment the vexed problem of multiple testing). For simplicity, suppose there are only two covariates, X1 and X2. The test summarized by the first line in the table (corresponding to X1 and b1) reflects a comparison between the two models E(Yi) = b0 + b1X1i + b2X2i and E(Yi) = b0 + b2X2i, while the test summarized by the second line in the table (corresponding to X2 and b2) reflects a completely different comparison, between E(Yi) = b0 + b1X1i + b2X2i and E(Yi) = b0 + b1X1i. Meanwhile, the proposed action suggested by combining the two tests is that of removing both variables from the model, which amounts to making a comparison between yet another pair of models: between E(Yi) = b0 + b1X1i + b2X2i and E(Yi) = b0; that is, between the full model and a model featuring only an intercept.
You cannot combine the tests summarized by the first two lines – the separate tests for b1 and b2 – as the underlying comparisons are inherently incompatible. Further, you certainly cannot infer the result of the final comparison you wish to make by considering the first two tests. By now, most students' heads are spinning. Compare what model with what model? How can you say b1 is plausibly zero and b2 is plausibly zero, but they are not both plausibly zero? Huh?? One of the difficulties rests with the way in which null hypotheses are expressed, typically only explicitly referencing one parameter under an implicit assumption that all other parameters are present. But students often interpret these statements as absolute statements and ignore the implicit assumptions and underlying models, leading to the misunderstanding related above. Here, a simple analogy can help. Think of each of the competing models in the above description as sporting teams engaged in a round-robin contest. Call the teams Sydney (E(Yi) = b0 + b1X1i + b2X2i), Melbourne (E(Yi) = b0 + b2X2i), Brisbane (E(Yi) = b0 + b1X1i) and Canberra (E(Yi) = b0). Then run the student through the following sequence of games for Sydney:

Game 1: Melbourne plays Sydney, and Melbourne wins.

Game 2: Brisbane plays Sydney, and Brisbane wins.

Game 3: Sydney plays Canberra, and Sydney wins.

Then ask: is there anything about this set of results that is inherently contradictory? The answer is invariably no; sets of results like this are commonplace in sports. Even if stronger teams always beat weaker teams, this set of results is completely unsurprising, indeed expected, if Melbourne and Brisbane are strong teams, Sydney is a medium-ranking team, and Canberra is weak. But this system of games is analogous to the sequence of tests described above, a series of hypothesis tests for which students typically assume that the results of the first two tests imply the result for the third.
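The round-robin can even be staged with numbers. In the simulation sketched below (invented data with two nearly collinear covariates), Sydney loses both head-to-head games – neither t-statistic comes close to significance – yet soundly defeats Canberra in the full-versus-intercept-only F test:

```python
import numpy as np

n = 20
i = np.arange(n)
x1 = np.linspace(0.0, 1.0, n)
x2 = x1 + 0.01 * (-1.0) ** i                    # nearly collinear with x1
y = x1 + x2 + 0.1 * np.cos(2 * np.pi * i / n)   # deterministic "noise"

# Fit the full model (Sydney): E(Y) = b0 + b1*X1 + b2*X2.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 3)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t1, t2 = b[1] / se[1], b[2] / se[2]   # Games 1 and 2: Sydney loses both

# Game 3: Sydney versus Canberra (intercept only).
sse_full = resid @ resid
sse_null = np.sum((y - y.mean()) ** 2)
F = ((sse_null - sse_full) / 2) / (sse_full / (n - 3))
```

Each t is tiny because X1 and X2 carry almost the same information, so neither is needed once the other is present; the F statistic is enormous because together they explain nearly everything.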
The strong identification between teams and models is a useful mapping, and the key to understanding in the target domain rests in the realization that the set of results in the source domain is also unsurprising because the three contests (tests) are not as related as the null hypothesis statements make them seem.

Analogy 5: Choosing the "right" meal from the menu and model selection

Model selection is a process many students find difficult to understand, particularly when there are a large number of covariates. Having been warned off selecting combinations of variables based on large tables of marginal p-values, they know they cannot proceed that way, and in the presence of many covariates, the sheer number of available models is formidable. Automatic procedures such as stepwise procedures are a seductive alternative, but remembering the algorithm has proven difficult for many students (particularly when the process is completely automatic – a "black box" – in software). To motivate the algorithm, the following analogy has proven useful. Imagine a restaurant with a large, diverse menu. Obviously you want the "optimal" meal. So, begin by selecting from the menu the food you most like. Having eaten that morsel, you gaze again at the menu, at the next step choosing the food you like next best – provided, of course, it goes well with what you have already eaten. The process continues until either there is nothing on the menu that complements what you have already eaten or you are full. This process is like the forward selection method of model selection – each step is conditional on the previous step, and the process cannot step backwards (since you eat the courses as you progress). In the model selection process, the same sequence is followed, with variables chosen at each step depending on their contribution to the model given what has already been added.
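A minimal sketch of the menu-reading algorithm, written as plain forward selection on invented data (the stopping threshold and helper names are mine):

```python
import numpy as np

def sse(y, columns):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

def forward_select(y, candidates, tol=1e-6):
    """Greedily add the variable that most reduces SSE; stop when gains are tiny."""
    chosen, remaining = [], list(range(len(candidates)))
    current = sse(y, [])
    while remaining:
        best_j, best = None, None
        for j in remaining:                # taste-test each remaining dish
            trial = sse(y, [candidates[k] for k in chosen] + [candidates[j]])
            if best is None or trial < best:
                best_j, best = j, trial
        if current - best < tol:           # nothing left complements the meal
            break
        chosen.append(best_j)
        remaining.remove(best_j)
        current = best
    return chosen

x0 = np.arange(12.0)
x1 = np.tile([1.0, -1.0], 6)
x2 = np.tile([0.3, 1.2, -0.7, 0.1], 3)   # unrelated to the response
y = 5.0 + 3.0 * x0 + 2.0 * x1
order = forward_select(y, [x0, x1, x2])  # the strongest dish is ordered first
```

Here the procedure orders x0, then x1, and leaves x2 on the menu: each choice is conditional on what is already on the table, and the algorithm never sends a course back.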
The meal (model) is built one item at a time until the contributions from additional menu items (variables) have diminished below some acceptable threshold. Refinements such as moving to a forward stepwise procedure that incorporates successive add-delete variable phases can be accommodated by the analogy by simply removing the requirement that courses are eaten as they arrive at the table – instead, the order is built sequentially, with menu items added – and potentially deleted – as their suitability is assessed in the context of what else has already been ordered at the preceding step. The analogy is very simple, but in my experience students find it very motivating. The experience of ordering food and thinking of pleasant combinations of food is both a common experience and, generally, a positive and pleasant one. These factors, plus the strength of the maps between objects and relationships in the source and target, create a positive environment for understanding the new algorithm, and my experience has been that this analogy is a particularly effective way to describe stepwise regression procedures.

Analogy 6: The blind men and the elephant and understanding high-dimensional data

Visualization is an incredibly useful tool in statistical modeling. Every student of statistical modeling will have seen Anscombe's quartet (Anscombe, 1973), the collection of four data sets that all yield identical numerical regression output but which could scarcely be more different when plotted. This powerful example immediately convinces all students of the wisdom of visualizing data, but visualization is a seriously difficult problem when data are high-dimensional. Explaining why visualization in high dimensions is so problematic can be difficult – many graphical displays, for instance scatterplot matrices and trellis displays, offer a glimpse at high-dimensional data, but the truth behind the data can remain well hidden.
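Anscombe's quartet itself is small enough to check in a few lines. The four data sets below are the published quartet (Anscombe, 1973); each produces, to two decimal places, the same fitted line (y = 3.0 + 0.5x) and essentially the same correlation, despite wildly different plots:

```python
import numpy as np

x123 = [10.0, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8.0, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

summaries = []
for xs, ys in quartet:
    xa, ya = np.asarray(xs), np.asarray(ys)
    slope = np.sum((xa - xa.mean()) * (ya - ya.mean())) / np.sum((xa - xa.mean()) ** 2)
    intercept = ya.mean() - slope * xa.mean()
    r = np.corrcoef(xa, ya)[0, 1]
    summaries.append((round(slope, 2), round(intercept, 2), round(r, 2)))
```

All four numerical summaries agree; only a plot reveals the line, the curve, the outlier, and the single high-leverage point.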
One approach to demonstrating this truism is to carefully construct a multivariate data set that effectively defeats all attempts to discover its real structure by looking in the obvious directions. This approach can work well, but it has a considerable downside – it casts the teacher as an illusionist, a trickster, even a huckster. Yet here an analogy – the brilliant fable of the blind men and the elephant – illustrates the situation wonderfully. The history of this story is long, and it has been used to teach a wide range of lessons, from the need for effective communication to the idea of tolerance for those who have different perspectives. Perhaps the best-known rendering of the tale is the poem by John Godfrey Saxe (1816-1887), a work now in the public domain as Saxe has been deceased for over a century:

It was six men of Indostan,
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.

The First approach'd the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
"God bless me! but the Elephant
Is very like a wall!"

The Second, feeling of the tusk,
Cried, "Ho! what have we here
So very round and smooth and sharp?
To me 'tis mighty clear,
This wonder of an Elephant
Is very like a spear!"

The Third approach'd the animal,
And happening to take
The squirming trunk within his hands,
Thus boldly up and spake:
"I see," quoth he, "the Elephant
Is very like a snake!"

The Fourth reached out an eager hand,
And felt about the knee:
"What most this wondrous beast is like
Is mighty plain," quoth he,
"'Tis clear enough the Elephant
Is very like a tree!"

The Fifth, who chanced to touch the ear,
Said "E'en the blindest man
Can tell what this resembles most;
Deny the fact who can,
This marvel of an Elephant
Is very like a fan!"

The Sixth no sooner had begun
About the beast to grope,
Then, seizing on the swinging tail
That fell within his scope,
"I see," quoth he, "the Elephant
Is very like a rope!"

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

MORAL

So, oft in theologic wars
The disputants, I ween,
Rail on in utter ignorance
Of what each other mean;
And prate about an Elephant
Not one of them has seen!

Apart from the teacher having the delightful experience of reciting a poem in a statistics class (so you already have everyone's attention), the reward is that the final line of the poem states exactly the critical problem with high-dimensional data – it simply cannot be seen, at least not in the low-dimensional space in which humans live. The time to consider the scatterplot matrix for that trick data set is right after the poem has been read. Even despite the enormous advantages conferred by the use of small multiples allowing so many directions in the data to be assessed at once, the students realize very quickly that they are no better off than the committee of blind men standing before the elephant.
It is then that, as a class, the journey to understand high-dimensional data begins, acknowledging that we all begin with the same basic problem – we are all essentially blind. It is interesting to note also that visualization of data and relationships within data – a basic tool for statisticians – is itself a classic example of analogical thinking, one that is so embedded that it is now a completely automatic process. Statistical graphics all embed a very simple metaphor – the size of a visual element (e.g. length, area, angle) must be proportional to the number it represents. As long as this metaphor is satisfied – and, remarkably, this rule is broken very frequently – the simple analogy allows our visual comparisons of size to transfer seamlessly and quickly to an understanding of the difference between the underlying numbers. The metaphor is extraordinarily powerful, and the effects when the metaphor fails can be catastrophic. Edward Tufte even has a name for the effect when the metaphor breaks – he calls it the "lie factor" (Tufte, 2001, p.57). The effect on decision-making when graphics misrepresent the numbers they are supposed to communicate further demonstrates the power of analogical thinking – when the relationship map behind the analogy fails, the whole house of cards can come tumbling down. Analogies are a potent bridge between what is familiar and comfortable and what is new, uncharted territory. Analogical structure – mappings from the old to the new, along with the preservation of critical relationship maps – can be used to acquire new knowledge, and thus explore new vistas. Once the new knowledge is transferred from the source to the target domain, it becomes itself accessible. Analogies are also evocative, so their use promotes students remembering concepts far better than rote memorizing of formulas ever could – as the folklore says of elephants, they never forget.¹
As a postscript, it is also prudent to remind students of that other lesson from the fable of the blind men and the elephant: the value of considering differing perspectives. In that vein, I close with the following tale… Six blind elephants gathered together and the discussion turned to what humans were like. After a gentle discussion (elephants dislike heated argument), it was decided that they should each feel a human and then they could meet again to discuss their findings. After a careful examination of a human, the first blind elephant returned to the group. One by one the elephants went and made their own assessments, and when the group assembled again, the first blind elephant announced that she had determined what humans were like. A brief discussion ensued, with each elephant describing its findings. The verdict was unanimous. Humans are flat.

¹ http://www.scientificamerican.com/article.cfm?id=elephants-never-forget

References

Anscombe, F. J. (1973), Graphs in statistical analysis, The American Statistician, 27(1), pp. 17-21.

Bangdiwala, S. I. (1989), The teaching of the concepts of statistical tests of hypotheses to nonstatisticians, Journal of Applied Statistics, 16, pp. 355-361.

Brewer, J. K. (1989), Analogies and parables in the teaching of statistics, Teaching Statistics, 11, pp. 21-23.

Feinberg, W. E. (1971), Teaching the type I and II errors: the judicial process, The American Statistician, 25, pp. 30-32.

Glynn, S. M. (1991), Explaining science concepts: A teaching-with-analogies model, in The Psychology of Learning Science, eds. S. M. Glynn, R. H. Yeany, and B. K. Britton, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 219-240.

Glynn, S. M. (1994), Teaching science with analogies: A strategy for teachers and textbook authors, Research Report No. 15, Athens, GA: University of Georgia and College Park, MD: University of Maryland, National Reading Research Center.

Glynn, S. M. (1995), Conceptual bridges: Using analogies to explain scientific concepts, The Science Teacher, December 1995, pp. 24-27.

Glynn, S. M. (1996), Teaching with analogies: Building on the science textbook, The Reading Teacher, 49(6), pp. 490-492.

Glynn, S. M., Duit, R., and Thiele, R. (1995), Teaching science with analogies: a strategy for transferring knowledge, in Learning Science in the Schools: Research Reforming Practice, eds. S. M. Glynn and R. Duit, Mahwah, NJ: Lawrence Erlbaum Associates, pp. 247-273.

Glynn, S. M., and Law, M. (1993), Teaching science with analogies: Building on the book [Video], Athens, GA: University of Georgia and College Park, MD: University of Maryland, National Reading Research Center.

Holyoak, K. J., and Thagard, P. (1995), Mental Leaps: Analogy in Creative Thought, Cambridge, MA: MIT Press.

Martin, M. A. (2003), It's like, you know – the use of analogies and heuristics in teaching introductory statistics, Journal of Statistics Education, 11(2), online: http://www.amstat.org/publications/jse/v11n2/martin.html. See also the Letter to the Editor at http://www.amstat.org/publications/jse/v11n3/lesser_letter.html and the response at http://www.amstat.org/publications/jse/v11n3/martin_letter_response.html.

Martin, M. A. (2008), What lies beneath: inventing new wheels from old, in OZCOTS 2008, Proceedings of the 6th Australian Conference on Teaching Statistics, eds. H. L. MacGillivray and M. A. Martin, pp. 35-52.

Moore, D. S. (1997), Statistics: Concepts and Controversies (4th ed.), New York: W. H. Freeman and Co.

Saxe, J. G., Six Men of Indostan. Poem in the public domain; the work is no longer in copyright.

Tufte, E. R. (2001), The Visual Display of Quantitative Information, Cheshire, CT: Graphics Press.