Section 2.7

advertisement
Section 2.7: The Friedman and Kasiski Tests
Practice HW (not to hand in)
From Barr Text
p. 1-4, 8
Using the probability techniques discussed in the last section, in this section we will
develop a probability based test that will be used to provide an estimate of the keyword
length used to encipher a message with the Vigene
test designed to estimate the keyword length that is based on the coincidental alignment
of letter groups in the plaintext with the keyword. We first develop some facts concerning
probability of letters occurring in standard English.
Probability of Selecting Multiple Letters in Standard English
In the standard English frequency table for letters, the probability of selecting a single
letter list is the relative frequency converted to decimal, that is
 selecing a single letter  relative frequency of the letter found in the table
 
P
.
100
 in standard English 
Example 1: Using the standard English frequency table, what is the probability of
selecting an E? A X?
Solution:
█
Example 2: In a large sample of English text, estimate the probability of selecting two
E’s. Two A’s.
Solution:
█
For convenience, we will assign the variables p0 , p1, p 2 , p3 , p 4 , p 24 , p 25 to represent
the probabilities of selecting the letters A, B, C, D, E, …, Y, Z from the standard English
alphabet. The subscripts of the variables correspond to the MOD 26 alphabet assignment
number of the corresponding alphabet letter. We will use this variable assignment in the
next example.
Example 3: What is the probability that two randomly selected English letters are the
same?
Solution: Using the standard English frequency table, we see that
 two letters 
  P(2 A' s)  P(2 B' s)  P(2 C' s)    P(2 Y' s)  P(2 Z' s)
P
 are the same 
2
2
 p02  p12  p 22    p 24
 p 25
 (0.08167) 2  (0.01492) 2  (0.02782) 2  (0.04253) 2  (0.12702) 2
 (0.02228) 2  (0.02015) 2  (0.06094) 2  (0.06966) 2  (0.00153) 2
 (0.00772) 2  (0.04025) 2  (0.02406) 2  (0.06749) 2  (0.07507) 2
 (0.01929) 2  (0.00095) 2  (0.05987) 2  (0.06327) 2  (0.09056) 2
 (0.02758) 2  (0.00978) 2  (0.02360) 2  (0.00150) 2  (0.01974) 2
 (0.00074) 2
 0.065
█
Friedman Test
The Friedman Test is a probabilistic test that can be used to determine the likelihood that
the ciphertext message produced comes from a monoalphabetic or polyalphabetic cipher.
This technique of cryptanalysis was developed in 1925 by William Friedman.
William Friedman
approximating the length of the keyword used. To show how this works, we start with the
following definition:
Definition: Index of Coincidence. Denoted by I, the index of coincidence represents the
probability that two randomly selected letters are identical.
Index of Coincidence for Monoalphabetic Ciphers
In monoalphabetic ciphers, the frequencies of letters in standard English are preserved
when converting from plaintext to ciphertext. The following example illustrates why this is
true.
Example 4: Illustrate why the Caesar shift cipher y  ( x  3) MOD 26 preserves
frequencies when converting from plain the ciphertext.
Solution:
█
Recall the index of coincidence represents the probability that two randomly selected
letters are identical. Since monoalphabetic ciphers preserves frequencies, the index of
coincidence of the plaintext alphabet of standard English will be exactly the same as the
index of coincidence of the ciphertext alphabet for a monoalphabetic cipher. Using the
result from Example 3, this fact results in the following statement:
Index of Coincidence for Monoalphabetic Ciphers
Index of Coincidenc e
 Two randomly selected 
  0.065
 I  P
For Monoalphab etic Ciphers
 letters are the same 
Index of Coincidence for Polyalphabetic Ciphers
In a polyalphabetic cipher, the goal is to distribute the letter frequencies so that each letters
has the same likelihood of occurring in the ciphertext. The next example determines what
the index of coincidence is for a polyalphabetic cipher for a large collection of letters.
Example 5: Determine the probability that two randomly selected letters are identical of
the ciphertext of a message enciphered with a polyalphabetic cipher, assuming there are a
very large number of letters in the ciphertext.
Solution:
█
Since the index of coincidence represents the probability that two randomly selected letters
are identical, Example 5 allows us to make the following statement:
Index of Coincidence for Polyalphabetic Ciphers
Index of Coincidenc e
 Two randomly selected 
  0.0385
 I  P
For Polyalphab etic Ciphers
 letters are the same 
The index of coincidence values for monoalphabetic (0.065) and polyalphabetic ciphers
(0.0385) were derived assuming that the plaintext message has a very large number of
letters. When messages are enciphered and deciphered, these messages are normally
much shorter. Hence, the index of coincidence for a typical enciphered message enciphered
will be bounded somewhere between 0.0385 and 0.065. This leads to the following
statement:
Index of Coincidence Bound
For a typical ciphertext message, the index of coincidence I satisfies:
0.0385  I  0.065
Fact: If I is close to 0.0385, then the cipher is likely to have been obtained
from a polyalphabetic cipher. If I is closer to 0.065, the cipher is
likely to be monoalphabetic.
Knowing what the value for the index of coincidence tells us, we now need to derive a
formula for calculating it. Before doing this, we need to recall the following fact
concerning summation notation:
Fact: Summation notation is a shorthand notation in mathematics for indicating the sum of
many terms. We say that
k
 ai  a1  a2    ak
i 1
represents the sum of k terms a1, a2 , , ak , where the index i starts at the first term (i = 1)
and we sum until we reach the upper index k of the summation symbol.
5
Example 6: Compute
i2 .
i 1
Solution:
█
Derivation of Formula for the Index of Coincidence for a Given Ciphertext Message
Suppose a ciphertext message is received. Let n0 , n1, n2 , , n24 , n25 be the counts of the
number of occurrences of the alphabet letters A, B, C, …, Y, Z that occur in the ciphertext
(note that the subscript of each variable corresponds to the MOD 26 alphabet assignment
number of the corresponding letter). Suppose
Total number of letters  n  n0  n1  n2    n24  n25
represents the total sum of all of the letters in the ciphertext. Recall that the index of
coincidence represents the probability that two randomly selected letters are identical.
Using the multiplication principle of probability for n total letters, we can compute
the following probabilities for each individual letter:
 2 A' s are randomly
P
selected

 2 B' s are randomly
P
selected


 


 

 2 C's are randomly 
 
P
selected



 2 Y' s are randomly 
 
P
selected


 2 Z' s are randomly
P
selected

n0 (n0  1) n0 (n0  1)


n (n  1)
n(n  1)
n1 (n1  1) n1(n1  1)


n (n  1)
n(n  1)
n2 (n2  1) n2 (n2  1)


n (n  1)
n(n  1)
n24 (n24  1) n24 (n24  1)


n
(n  1)
n(n  1)
 n25 (n25  1) n25 (n25  1)
 


n
(n  1)
n(n  1)

Since these probabilities are mutually exclusive, we can sum the individual probabilities for
each letters to find the index on coincidence.
 2 letters randomly 

P
 selected are the same 
n (n  1) n1 (n1  1) n2 (n2  1)
n (n  1) n25 (n25  1)
 0 0


  24 24

n(n  1)
n(n  1)
n(n  1)
n(n  1)
n(n  1)
\
1
n0 (n0  1)  n1(n1  1)  n2 (n2  1)    n24 (n24  1)  n25 (n25  1)

n(n  1)
Index of

Coincidenc e

25
1
 ni (ni  1)
n(n  1) i 0
We summarize this result.
Formula for the Index of Coincidence
The index of coincidence I is given by the formula:
25
1
I
 ni (ni  1)
n(n  1) i 0
1
n0 (n0  1)  n1(n1  1)  n2 (n2  1)    n24 (n24  1)  n25 (n25  1)
n(n  1)
where n represents the total number of letters in the ciphertext and ni , i  0, 1, , 25
represents the number of letters corresponding to each individual letters in the ciphertext
with n0  the number of A' s , n1  the number of B' s , n2  the number of C's , etc.

The next example illustrates an example of how this formula is used.
Example 7: Suppose we receive the message
"HLUBN WFSFK IGIHM GBSIM MBSEJ MAFUT QECII LJSUB BAXMA JCWXC
MBSGZ GGSMK BHUQB ETVUS MLMER CFDTW UBASW ERFIE LOMVY
SIMMY YEDDM MSGZA NCOFY YTIHL JRYOH KLOFH IEFKQ OFAAI ZGIEJ
HAKNZ JSQRU QXDKW HSNNF AOUMO ROFAA IZPIQ YHQFY SWEFK ILDPQ
GIXUE ADFWN NFVYO TRXRG QKRUS HVYHA GYONT TZISI EPUOF XAZRN
ZTSQK BGIIS MIMII SMIMX HAHUF ZNMFG WIMMB QWQLT SMZTU XRBSA
EMFGW IHUAM WQFFV CKNSM TIJYY RCOJR ASBOE YHQAI KYPAK YJKUX
VUFIG GBCFY HQKIJ QDFVU LBOGZ XTQOI MIMWH QOXUQ EMBIX KYAIB
SAEFC UKPYA ILKJL RRIQT URSYD QUOYS OJLXR IQTUB IHC"
Use the index of coincidence to decide if this ciphertext was produced by a
monoalphabetic or polyalphabetic cipher.
Solution: Using the Friedman Maplet, we generate the frequency table for the letters in
this message:
A
B
C
D
E
F
G
H
I
J
K
L
M
n0  23
n1  18
n2  10
n3  8
n4  16
n5  25
n6  16
n7  18
n8  37
n9  12
n10  17
n11  12
n12  29
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
n13  11
n14  18
n15  5
n16  22
n17  16
n18  25
n19  14
n20  22
n21  7
n22  12
n23  13
n24  21
n25  11
Letter Frequency Table
This gives n  n0  n1  n2    n24  n25  438 total letters. Thus the index of
coincidence is:
25
1
 ni (ni  1)
n(n  1) i 0
1
=
[n0 (n0  1)  n1 (n1  1)  n2 (n2  1)   n24 (n24  1)  n25 (n25  1)]
n(n  1)
I=
=
23  22  18  17  10  9  8  7  16  15  25  24  16  15  18  17  37  36  12  11  17  16 
1
 12  11  29  28  11  10  18  17  5  4  22  21  16  15  25  24  14  13  22  21  7  6

438  437 
 12  11  13  12  21  20  11  10

=
1 506  306  90  56  240  600  240  306  1332  132  272  132  812  110  306

191406  20  462  240  600  182  462  42  132  156  420  110

=
1
8266
4133
(8266) 

 0.043186
191406
191406 95703
Since 0.043186 is much closer to 0.0385 than 0.065, the cipher is likely polyalphabetic. █
Cipher
So far we have used the index of coincidence to determine whether a cipher is
polyalphabetic or monoalphabetic. Once this is determined, it can be used to estimate the
keyword length. Knowing the keyword length is the first essential step when attempting to
length:
k
0.0265n
where
0.065  I  n( I  0.0385)
n = the number of letters in the ciphertext message.
I = index of coincidence.
k = keyword length.
Example 8: For the ciphertext message given in the previous example, use the index of
coincidence to estimate the keyword length.
Solution:
█
The Kasiski Test
The Kasiski test is another method that can be used to approximate the keyword length in
he cipher was first published by a retired Prussian Army officer
named Friedrich Wilhelm Kasiski in 1863. The Kasiski test had been independently
discovered almost a decade earlier, in 1854, by the English inventor Charles Babbage .
Charles Babbage
The Kasiski test relies on the occasional coincidental alignment of letter groups in the
plaintext with the keyword to give a keyword length estimate. The test says if a string of
characters appears repeatedly in a polyalphabetic ciphertext message, it is possible (though
not certain), that the distance between the occurrences is a multiple of the length of the
keyword. To demonstrate ho
encipher the message “THE CHILD IS FATHER OF THE MAN” using the keyword
“POETRY” to produce the following ciphertext.
Plaintext
T H E C H I L D I S F A T H E R O F T H E M A N
Keyword
P O E T R Y P O E T R Y P O E T R Y P O E T R Y
Ciphertext I V E V Y G A R M L M Y I V E K F D I V E F R L
The keyword “POETRY” is six letters long. Note that the trigraph “IVE” occurs three
times in the ciphertext. The second occurrence of “IVE” occurs 12 character positions
after the first. The third occurrence of “IVE” occurs 6 character positions after the
second. This leads to the assertion that the separations of common letter occurrences
stand a good chance of being multiple of the keyword. This observation leads to the
following fact concerning the Kasiski test.
Fact: The greatest common divisor or divisor of it of the separations of common
good chance of being equal or at least some multiple of the keyword.
Since “IVE” was separated by 12 characters and then 6 characters, then by observing that
gcd(6, 12) = 6, we see that we have hit exactly the number of letters that occurred in the
keyword. We conclude with one more example illustrating the Kasiski test.
Example 9: Using the Kasiki Maplet, estimate the keyword length of the ciphertext given
in Example 7 applying the principles of the Kasiski test.
Solution: Will demonstrate using the Kasiki Maplet in class.
█
Download