MATH 20 Project Paper
Joshua Lee

Shannon Index – Its Development, Implication, and Effectiveness

As an example of how probability is used in practical applications, this paper focuses on the Shannon index, which is used in ecology and environmental studies to quantify species diversity. The paper covers how the index was conceived and developed, what it implies mathematically about species diversity, and how effectively it represents that diversity.

The Shannon index was created by Claude E. Shannon, an MIT mathematician who paved the way for the field of mathematics called information theory. During World War II, Shannon worked on cryptography at Bell Labs, where he also met the mathematician Alan M. Turing. Analyzing encrypted messages relies on linguistic patterns, such as the probability of specific letters coming after certain characters. This knowledge also made Shannon realize that one could use probability to compress information as much as possible (Gleick, 2011). For example, if the letter ‘q’ appears, the probability of a ‘u’ following right after is high, which means that the ‘u’ is redundant and is not necessary when one needs to send a compact message. Thus, a passage like “u cn rd ths” delivers the same information as “you can read this,” even though the first passage has fewer characters.

In developing information theory, Shannon realized that information can be quantified through uncertainty (Gleick, 2011). Uncertainty is measured by the number of possible messages: if the uncertainty is low, fewer messages are possible and thus less information is delivered. A message can therefore be represented as the outcome of a process that generates events with specific probabilities $p_1, p_2, \ldots$, which serve as a measure of uncertainty. This quantification of information also needed to reflect the fact that certain events can be broken down into sub-events whose probabilities are additive (Gleick, 2011).
“For example, the probability of a particular digram should be a weighted sum of the probabilities of the individual symbols” (Gleick, 2011). When these probabilities are equal, the amount of information conveyed by each symbol is the logarithm of the number of possible symbols, so a message of $n$ symbols drawn from $s$ possible symbols carries

$$H = n \log s$$

Based on this rule, Shannon created a formula that measures information as a function of probabilities, with $p_i$ being the probability of each message:

$$H = -\sum_{i=1}^{n} p_i \ln p_i$$

(Shannon, 1948, p. 389)

This formula was later adopted to express the diversity of species as well, with $p_i$ representing the fraction of individuals that belong to the $i$-th species. The logic of using this formula to indicate species diversity is as follows. If each species in an ecosystem were represented by a code, we could use short codes for more abundant species and long codes for rarer species. As we survey an ecosystem and observe individual organisms, we record the code of each one. The average length of the recorded codes will then resemble the Shannon index. In ecology, the Shannon index is written as follows:

$$H = -\sum_{i=1}^{S} p_i \ln p_i = -\sum_{i=1}^{S} \frac{n_i}{N} \ln \frac{n_i}{N}$$

(Magurran, 2004, p. 107)

In this formula, $S$ is the total number of species, $N$ is the total number of individuals, and $n_i$ is the number of individuals of the $i$-th species. The ratio $n_i/N$ is equivalent to $p_i$, the probability that a randomly chosen individual belongs to the $i$-th species. A greater $H$ value implies greater species diversity. There is no mathematical reason why the base of the logarithm has to be $e$ rather than 10 or 2, but there is a movement to standardize on the natural logarithm so that the scientific community can compare values directly (Magurran, 2004, p. 107). As part of the project, a MATLAB program (shannon.m) was created to calculate the Shannon index from the number of animals in each species.
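The ecological formula above can be sketched in a few lines of code. The block below is a minimal Python illustration, not the shannon.m attachment; the function name `shannon_index` and the sample counts are assumptions made for this example. Species with a count of zero are skipped, following the usual convention that $p \ln p$ is taken as 0 when $p = 0$.

```python
import math


def shannon_index(counts):
    """Return H = -sum(p_i * ln p_i) for a list of per-species counts n_i.

    Hypothetical helper for illustration; not the paper's shannon.m.
    """
    total = sum(counts)  # N, the total number of individuals
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    h = 0.0
    for n_i in counts:
        if n_i < 0:
            raise ValueError("counts must be non-negative")
        if n_i == 0:
            continue  # a species with no individuals contributes nothing
        p_i = n_i / total  # fraction of individuals in species i
        h -= p_i * math.log(p_i)  # natural logarithm, as in the paper
    return h


if __name__ == "__main__":
    # Made-up communities with the same N = 40 and S = 4.
    print("even:  ", shannon_index([10, 10, 10, 10]))
    print("uneven:", shannon_index([37, 1, 1, 1]))
```

Running the script prints a larger value for the even community than for the community dominated by one species, which matches the interpretation that a greater $H$ indicates greater diversity.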
The premise of the Shannon index is that the higher the diversity, the lower the likelihood that two individuals chosen at random are of the same species. But species diversity can be measured in two different ways. One is species richness, which is the number of species in a given ecosystem. The other is species evenness, which is the distribution of individuals across the species present, also known as relative abundance. Both measures are important: in a high-diversity community, one would expect to see many different types of organisms, but if one abundant species dominates the community, we would not call that ecosystem diverse.

The Shannon index is useful in the sense that it takes into account both the richness and the evenness of a given ecosystem. All else being equal, the greater the number of species ($S$), the larger the Shannon index ($H$) becomes, since there are more terms $-\frac{n_i}{N} \ln \frac{n_i}{N}$ to add. A MATLAB program (shannon_graph.m) was created to show how the Shannon index changes with $S$ when $n_1 = n_2 = \cdots = n_S$ and $N = n_i \times S$. The results are shown in Figure 1.

[Figure 1: Shannon Index According to Species Richness, generated by shannon_graph(1000, 5, 100). Horizontal axis: Number of Species (0 to 1000). Vertical axis: Shannon Index (0 to 7). The blue line is the Shannon index when n = 5, and the dotted line is when n = 100.]

One thing to notice is that the value of $n_i$ does not matter as long as $n_1 = n_2 = \cdots = n_S$. If all $n$ values are the same, $p_i = n_i/N = 1/S$ remains constant regardless of the actual $n$ value. The MATLAB program (shannon_graph.m) shows this: it is designed to plot two separate curves, one for each value of $n$, yet only a single curve is visible because the two coincide, which means that $H$ is the same in both cases.
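The equal-abundance case described above can also be checked numerically. The following is a minimal Python sketch, not the shannon_graph.m attachment; the helper names `shannon_index` and `equal_abundance_index` are assumptions made for this example, and the values n = 5 and n = 100 are borrowed from the Figure 1 caption.

```python
import math


def shannon_index(counts):
    """H = -sum(p_i * ln p_i) for per-species counts (illustrative helper)."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def equal_abundance_index(num_species, n_per_species):
    """Shannon index when every one of S species has the same count n."""
    return shannon_index([n_per_species] * num_species)


if __name__ == "__main__":
    # Compare n = 5 and n = 100 for several values of S, next to ln S.
    for s in (1, 10, 100, 1000):
        h_small = equal_abundance_index(s, 5)
        h_large = equal_abundance_index(s, 100)
        print(s, round(h_small, 4), round(h_large, 4), round(math.log(s), 4))
```

Under these assumptions the two columns for n = 5 and n = 100 agree with each other and with ln S up to floating-point rounding, which is the numerical counterpart of the two coinciding curves.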
An additional graph generated by the same program (Figure 2) further illustrates that $H$ does not depend on $n$ when all species are equally abundant.

[Figure 2: Shannon Index According to Number of Animals, generated by shannon_graph(1000, 5, 100). Horizontal axis: Number of Animals in Each Species (0 to 1000). Vertical axis: Shannon Index (1.5 to 5). The blue line is the Shannon index when S = 5, and the dotted line is when S = 100.]

In Figure 2, one can see that the Shannon index remains constant even as the number of animals in each species increases. Each program is described in further detail in the attachments shannon.m and shannon_graph.m.

One can also prove that maximum evenness maximizes the index, given that $S$ and $N$ remain constant:

$$H = -\sum_{i=1}^{S} \frac{n_i}{N} \ln \frac{n_i}{N}$$

$$N \times H = -\sum_{i=1}^{S} n_i (\ln n_i - \ln N) = -\sum_{i=1}^{S} n_i \ln n_i + \sum_{i=1}^{S} n_i \ln N$$

$$\sum_{i=1}^{S} n_i \ln N = \ln N \times \sum_{i=1}^{S} n_i = N \ln N$$

$$N \times H - N \ln N = -\sum_{i=1}^{S} n_i \ln n_i$$

Since $N$ and $N \ln N$ are constants, maximizing $-\sum_{i=1}^{S} n_i \ln n_i$ maximizes $H$. Because the function $x \ln x$ is strictly convex, Jensen's inequality gives $\sum_{i=1}^{S} n_i \ln n_i \geq S \times \frac{N}{S} \ln \frac{N}{S}$, with equality only if $n_1 = n_2 = \cdots = n_S$, so the maximum is reached only when all species are equally abundant. ∎

One can also see this from the MATLAB program (shannon.m) by plugging in different values of $n_i$ while keeping $N$ and $S$ constant. The only way to maximize $H$ is to keep all values even. This also shows that the maximum $H$ value is $\ln S$. When the total population is divided evenly among the species,

$$n_i = \frac{N}{S}, \qquad p_i = \frac{N}{S} \div N = \frac{1}{S}$$

$$H_{max} = -\sum_{i=1}^{S} \frac{1}{S} \ln \frac{1}{S} = -S \times \frac{1}{S} \times (-\ln S) = \ln S$$

∎

Typically, the Shannon index in real ecosystems ranges between 1.5 and 3.5 (MacDonald, 2003, p. 409). The value rarely surpasses 4 (Margalef, 1972), and it grows slowly because of the logarithm in the function. Since the range is small, the index by itself tells little about the actual species diversity (for example, it is difficult to tell whether an ecosystem with $H_1 = 2.56$ and an ecosystem with $H_2 = 2.68$ are substantially different or similar).
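The evenness argument in this section can be checked numerically. The block below is a minimal Python sketch under stated assumptions: the helper names and the four sample communities (each with N = 100 and S = 4) are made up for illustration and are not taken from the shannon.m attachment. It verifies the rearranged identity $N \times H - N \ln N = -\sum n_i \ln n_i$ and compares each $H$ with $\ln S$.

```python
import math


def shannon_index(counts):
    """H = -sum(p_i * ln p_i) for per-species counts (illustrative helper)."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def rearranged_rhs(counts):
    """-sum(n_i * ln n_i), the right-hand side of N*H - N*ln N = -sum(n_i ln n_i)."""
    return -sum(n * math.log(n) for n in counts if n > 0)


# Hypothetical communities: same N = 100 and S = 4, decreasing evenness.
communities = [
    [25, 25, 25, 25],
    [40, 30, 20, 10],
    [70, 10, 10, 10],
    [97, 1, 1, 1],
]

if __name__ == "__main__":
    for counts in communities:
        n_total = sum(counts)
        h = shannon_index(counts)
        lhs = n_total * h - n_total * math.log(n_total)
        # lhs and rearranged_rhs(counts) should agree up to rounding error.
        print(counts, round(h, 4), round(lhs, 4), round(rearranged_rhs(counts), 4))
    print("ln S =", round(math.log(4), 4))
```

Only the perfectly even community reaches $\ln S$; every other community in the list falls below it, in line with the derivation.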
Because the index is influenced by both species richness and evenness, it is also difficult to tell which factor contributed more simply by looking at the index (Magurran, 2004, p. 109). The index is particularly useful when comparing two ecosystems that are similar overall but might differ substantially in one of the two elements (either richness or evenness). Despite its shortcomings in reflecting detailed species diversity, the index is still commonly used.

The Shannon index, adapted from information theory, summarizes the species richness and evenness of a given ecosystem in a single number. The index implies that as the number of species increases, or as the distribution of individuals across species becomes more even, biological diversity increases, which is indicated by a larger index value. The small range caused by the logarithm in the function makes it difficult to identify species diversity precisely, but the index is still an effective way to see whether the diversity of similar ecosystems is affected by either species richness or evenness.

References

Gleick, J. (2011). Information theory. In The Information: A History, a Theory, a Flood (ch. 7). New York, NY: Pantheon Books. Retrieved from Amazon Kindle.

MacDonald, G. M. (2003). Biogeography: Space, Time, and Life. New York, NY: John Wiley & Sons, Inc.

Magurran, A. E. (2004). Measuring Biological Diversity. Malden, MA: Blackwell Publishing.

Margalef, R. (1972). Homage to Evelyn Hutchinson, or why is there an upper limit to diversity. Transactions of the Connecticut Academy of Arts and Sciences, 44, 211-235.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379-423.