Shannon Index

MATH 20
Project Paper
Joshua Lee
1
Shannon Index – Its Development, Implication, and Effectiveness
As an example of how probability is used in practical applications, this paper will focus
on the Shannon index, an index that is used in ecology and environmental studies to quantify
species diversity. The paper will cover how the index was conceived and developed, what the
index implies mathematically to quantify species diversity, and how effective it is in representing
species diversity.
The Shannon index originates with Claude E. Shannon, an MIT mathematician who
paved the way for a field in mathematics called information theory. During World War II,
Shannon worked on cryptography at Bell Labs, where he also met mathematician Alan M. Turing.
Studying linguistic patterns, such as the probability of specific letters coming after
certain characters, made Shannon realize that one could use probability to
compress information as much as possible (Gleick, 2011). For example, if a letter ‘q’ appears,
the probability of a ‘u’ following right after is high, but this also means that the letter ‘u’ is
redundant and is not necessary when one needs to send a compacted message. Thus, a passage
like “u cn rd ths” would deliver the same information as “you can read this,” even though the
first passage has fewer characters.
In this development of information theory, Shannon realized that information can be
quantified through uncertainty (Gleick, 2011). This is because uncertainty is measured by the
number of possible messages, and if the uncertainty is low, it means that fewer messages are
possible and thus less information is delivered. Thus, a message can be represented as an
outcome of a process that generates events with specific probabilities p1, p2, etc. (a measure of
uncertainty). This quantification of information also needed to represent the fact that certain
events can be broken down into sub-events whose probabilities are additive (Gleick, 2011).
“For example, the probability of a particular digram should be a weighted sum of the
probabilities of the individual symbols” (Gleick, 2011), and when these probabilities are equal,
the amount of information conveyed by each symbol is the logarithm of the number of possible
symbols, generating the formula:
𝐻 = 𝑛 log 𝑠
Based on this rule, Shannon created a formula that measured information as a function of
probabilities, with pi being the probability of each message:
$H = -\sum_{i=1}^{n} p_i \ln p_i$ (Shannon, 1948, p. 389)
This formula was later used to express species diversity as well, with p_i
representing the fraction of individuals that belong to the i-th species. The logic of using this
formula to indicate species diversity is as follows. If each species in an ecosystem were
represented by a code, we could use short codes for more abundant species and long codes for rarer
species. As we survey an ecosystem and observe individual organisms, we record the code for each
one. The average code length per individual then resembles the Shannon index.
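The coding argument above can be checked numerically. The following Python sketch is an illustration only and is not part of the project's MATLAB attachments: it assumes a Huffman code as the concrete "short codes for abundant species" scheme, and the function names and the example counts are hypothetical. It measures the index in bits (log base 2); multiplying by ln 2 converts the result to the natural-log form used in this paper.

```python
import heapq
import math

def shannon_bits(counts):
    """Shannon index in bits (log base 2) for a list of per-species counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def huffman_code_lengths(counts):
    """Code length (in bits) that a Huffman code assigns to each species."""
    # Heap entries: (weight, unique tie-breaker, species indices in the subtree)
    heap = [(c, i, [i]) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    lengths = [0] * len(counts)
    tie_breaker = len(counts)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code word inside them.
        for i in group1 + group2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tie_breaker, group1 + group2))
        tie_breaker += 1
    return lengths

# Hypothetical survey: four species with 8, 4, 2 and 2 individuals.
counts = [8, 4, 2, 2]
lengths = huffman_code_lengths(counts)
average_length = sum(c * l for c, l in zip(counts, lengths)) / sum(counts)
entropy = shannon_bits(counts)
print(lengths, average_length, entropy)
```

For these counts the proportions are powers of one half, so the average code length equals the index exactly. For other counts the average Huffman length is at least the index and less than one bit above it.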
In ecology, the Shannon index is as follows:
$H = -\sum_{i=1}^{S} p_i \ln p_i = -\sum_{i=1}^{S} \frac{n_i}{N} \ln \frac{n_i}{N}$ (Magurran, 2004, p. 107)
In this formula, S is the total number of species, N is the total number of individuals, and n_i is the
number of individuals of the i-th species. $\frac{n_i}{N}$ is equivalent to p_i, the probability of finding the i-th
species. A greater H value implies greater species diversity. There is no mathematical reason
why the logarithm has to be the natural log rather than log10 or log2, since changing the base
only rescales H by a constant factor, but there is a movement to standardize on ln so that the
scientific community can communicate effectively (Magurran, 2004, p. 107). As part of the project,
a MATLAB program (shannon.m) was created to calculate the Shannon index given the number of
animals in each species.
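As a rough illustration of the calculation, here is a minimal Python sketch of the formula above. It is not the attached shannon.m; the function name `shannon_index` and the example counts are assumptions made for this illustration. Species with zero individuals are skipped, following the usual convention that 0 × ln 0 is treated as 0.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i), with p_i = n_i / N."""
    total = sum(counts)  # N, the total number of individuals
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    h = 0.0
    for n_i in counts:
        if n_i < 0:
            raise ValueError("counts cannot be negative")
        if n_i > 0:  # a species with no individuals contributes nothing
            p_i = n_i / total
            h -= p_i * math.log(p_i)
    return h

# Hypothetical community: four species with 40, 30, 20 and 10 animals.
example_counts = [40, 30, 20, 10]
print(round(shannon_index(example_counts), 4))
```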
The premise of the Shannon index is that the higher the diversity, the lower the likelihood
that two individuals chosen at random are of the same species. But species diversity can be
measured in two different ways. One is species richness, which is the number of species in a
given ecosystem. The other is species evenness, which is the distribution of individuals across
the species present, also known as relative abundance. Both measures are important; in a high
diversity community, one would expect to see different types of organisms. At the same time, if
one abundant species dominates the community, we would not call that ecosystem diverse. The
Shannon index is useful in the sense that the function takes into account both the richness and
evenness of a given ecosystem.
When individuals are spread evenly across species, the greater the number of species (S), the
larger the Shannon index (H) becomes, since there are more terms $-\frac{n_i}{N} \ln \frac{n_i}{N}$
to add. A MATLAB program (shannon_graph.m) was created to show how the Shannon index changes
depending on S while $n_1 = n_2 = \cdots = n_S$ and $N = n_i \times S$. The results are shown in Figure 1:
[Figure 1. Title: Shannon Index According to Species Richness. x-axis: Number of Species (0 to 1000); y-axis: Shannon Index (0 to 7).]
Figure 1: shannon_graph (1000, 5, 100). The blue line is the Shannon index when n = 5, and the
dotted line is when n = 100.
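The trend in Figure 1 can be reproduced in outline with a short Python sketch. It is not the attached shannon_graph.m; the helper names and the sampled values of S are assumptions chosen for illustration. It computes H for evenly distributed communities as S grows, once with 5 animals per species and once with 100.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) for per-species counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def even_community_index(num_species, n_per_species):
    """H for a community where every species has the same number of animals."""
    return shannon_index([n_per_species] * num_species)

# Sample a few richness values instead of plotting every S up to 1000.
richness_values = [1, 10, 100, 1000]
for s in richness_values:
    h_small = even_community_index(s, 5)    # 5 animals per species
    h_large = even_community_index(s, 100)  # 100 animals per species
    print(f"S={s:5d}  H(n=5)={h_small:.4f}  H(n=100)={h_large:.4f}")
```

The two columns agree for every S, and H rises quickly at first and then flattens, which is the logarithmic shape seen in the figure.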
One thing to notice is that the value of n_i does not matter as long as $n_1 = n_2 = \cdots = n_S$.
This is because if all n values are the same, $p_i = \frac{n_i}{N} = \frac{1}{S}$ remains constant,
regardless of the actual n
value. This can be seen in the MATLAB program (shannon_graph.m), which is designed to plot two
curves, one for each value of n. The fact that only one curve is visible means that the two
coincide, so H is the same for both values of n. An additional graph generated by the same
program (Figure 2) further supports this:
[Figure 2. Title: Shannon Index According to Number of Animals. x-axis: Number of Animals in Each Species (0 to 1000); y-axis: Shannon Index (1.5 to 5).]
Figure 2: shannon_graph (1000, 5, 100). The blue line is the Shannon index when S = 5, and the
dotted line is when S = 100.
In Figure 2, one can see that the Shannon index remains constant even though the number of
animals in each species increases. Each program is described in further detail in the
attachments: shannon.m and shannon_graph.m.
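The flat lines of Figure 2 can be checked with a similar Python sketch, again an illustration rather than the attached MATLAB code, with hypothetical names. It holds S fixed at 5 and at 100 and varies only the number of animals per species.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) for per-species counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def index_by_abundance(num_species, abundances):
    """H of an even community for each per-species abundance in `abundances`."""
    return [shannon_index([n] * num_species) for n in abundances]

abundances = [1, 10, 100, 1000]
for s in (5, 100):
    values = index_by_abundance(s, abundances)
    # Every entry is the same up to floating-point rounding.
    print(f"S={s}: " + ", ".join(f"{h:.4f}" for h in values))
```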
One can also prove that maximum evenness maximizes the index, given that S and N
remain constant:
$H = -\sum_{i=1}^{S} \frac{n_i}{N} \ln \frac{n_i}{N}$
$N \times H = -\sum_{i=1}^{S} n_i (\ln n_i - \ln N) = -\sum_{i=1}^{S} n_i \ln n_i + \sum_{i=1}^{S} n_i \ln N$
$\sum_{i=1}^{S} n_i \ln N = \ln N \times \sum_{i=1}^{S} n_i = (\ln N) \times N$
$N \times H - N \times \ln N = -\sum_{i=1}^{S} n_i \ln n_i$
Since N is a positive constant, and so $N \times \ln N$ is also constant, maximizing
$-\sum_{i=1}^{S} n_i \ln n_i$ will maximize the H value, which can only be done if
$n_1 = n_2 = \cdots = n_S$ ∎
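The last step of the proof is stated without justification. One way to fill it in, offered here as a suggestion rather than as the argument the paper originally intended, is Jensen's inequality applied to the convex function f(x) = x ln x, assuming every species is present (n_i > 0):

```latex
\begin{align*}
f(x) &= x \ln x, \qquad f''(x) = \frac{1}{x} > 0 \quad \text{for } x > 0 \\
\frac{1}{S} \sum_{i=1}^{S} n_i \ln n_i
  &\ge f\!\left( \frac{1}{S} \sum_{i=1}^{S} n_i \right)
   = \frac{N}{S} \ln \frac{N}{S} \\
-\sum_{i=1}^{S} n_i \ln n_i &\le -N \ln \frac{N}{S} \\
N \times H = N \ln N - \sum_{i=1}^{S} n_i \ln n_i
  &\le N \ln N - N \ln \frac{N}{S} = N \ln S
\end{align*}
```

Because f is strictly convex, equality holds only when all n_i are equal, so H reaches its largest value exactly when n_1 = n_2 = ⋯ = n_S.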
One can also see this from the MATLAB program (shannon.m) by plugging in different values
of ni while keeping N and S constant. The only way to maximize H is to keep all values even.
This also shows that the maximum H value will be ln S:
$n_i = \frac{N}{S}$ (the total population is divided evenly by the number of species)
$p_i = \frac{N}{S} \div N = \frac{1}{S}$
$H_{max} = -\sum_{i=1}^{S} \frac{1}{S} \ln \frac{1}{S} = -S \times \frac{1}{S} \times (-\ln S) = \ln S$ ∎
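The experiment of plugging in different values of n_i while keeping N and S constant can be sketched in Python as follows. This is an illustration, not the attached shannon.m, and the four example communities (each with N = 100 and S = 4) are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) for per-species counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# Four hypothetical communities, all with N = 100 individuals and S = 4 species.
communities = {
    "even": [25, 25, 25, 25],
    "mildly uneven": [40, 30, 20, 10],
    "one dominant": [70, 10, 10, 10],
    "extreme": [97, 1, 1, 1],
}

results = {name: shannon_index(counts) for name, counts in communities.items()}
h_max = math.log(4)  # ln S, the bound derived above

for name, h in results.items():
    print(f"{name:14s} H = {h:.4f}  (ln S = {h_max:.4f})")
```

Only the even community reaches ln S; each step toward dominance by one species lowers H.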
Typically, the Shannon index in real ecosystems ranges between 1.5 and 3.5 (MacDonald,
2003, p. 409). The value rarely surpasses 4 (Margalef, 1972), and it grows slowly because of
the logarithm in the function. Since the range is narrow, the index value alone tells little about
the actual species diversity (for example, it is difficult to tell whether an ecosystem with
H1 = 2.56 and an ecosystem with H2 = 2.68 are substantially different or similar).
Also, because the index is influenced by both species richness and evenness, it is difficult
to tell which factor contributed more simply by looking at the index (Magurran, 2004, p. 109).
The index is particularly useful when comparing two ecosystems that are similar but might have
a substantial difference in one of the elements (either richness or evenness). Despite its
shortcomings in reflecting species diversity in detail, the index is still commonly used.
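The difficulty of separating richness from evenness can be illustrated with two hypothetical communities, sketched here in Python. The counts are made up for this example and do not come from any of the cited sources.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) for per-species counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# Community A: low richness (2 species), perfectly even.
community_a = [50, 50]
# Community B: higher richness (4 species), dominated by one species.
community_b = [81, 9, 5, 5]

h_a = shannon_index(community_a)
h_b = shannon_index(community_b)
print(f"A: S={len(community_a)}, H={h_a:.3f}")
print(f"B: S={len(community_b)}, H={h_b:.3f}")
```

Both values come out near 0.69, even though B has twice as many species as A. The index alone does not reveal that A earns its value through evenness and B through richness.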
The Shannon index, which originated in information theory, is used to summarize species
richness and evenness in a given ecosystem. The index implies that as the number of species
increases, or as the distribution of individuals across species becomes more even, biological
diversity is greater (indicated by a larger value). The narrow range caused by the logarithm in
the function makes it difficult to distinguish levels of species diversity, but the index is still
an effective measure for seeing whether the diversity of similar ecosystems is affected by
species richness or evenness.
References
Gleick, J. (2011). Information theory. In The Information: A History, a Theory, a Flood (7). New
York, NY: Pantheon Books. Retrieved from Amazon Kindle.
Magurran, A. E. (2004). Measuring Biological Diversity. Malden, MA: Blackwell Publishing.
MacDonald, G. M. (2003). Biogeography: Space, Time, and Life. New York, NY: John Wiley &
Sons, Inc.
Margalef, R. (1972). Homage to Evelyn Hutchinson, or why is there an upper limit to diversity.
Transactions of the Connecticut Academy of Arts and Sciences, 44, 211-235.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27, 379-423.