Decipherment with a Million Random Restarts

Taylor Berg-Kirkpatrick  Dan Klein
Computer Science Division
University of California, Berkeley
{tberg,klein}@cs.berkeley.edu

Abstract

This paper investigates the utility and effect of running numerous random restarts when using EM to attack decipherment problems. We find that simple decipherment models are able to crack homophonic substitution ciphers with high accuracy if a large number of random restarts are used, but almost completely fail with only a few random restarts. For particularly difficult homophonic ciphers, we find that big gains in accuracy are to be had by running upwards of 100K random restarts, which we accomplish efficiently using a GPU-based parallel implementation. We run a series of experiments using millions of random restarts in order to investigate other empirical properties of decipherment problems, including the famously uncracked Zodiac 340.

1 Introduction

What can a million restarts do for decipherment? EM frequently gets stuck in local optima, so running between ten and a hundred random restarts is common practice (Knight et al., 2006; Ravi and Knight, 2011; Berg-Kirkpatrick and Klein, 2011). But how important are random restarts, and how many does it take to saturate gains in accuracy? We find that the answer depends on the cipher. We look at both Zodiac 408, a famous homophonic substitution cipher, and a more difficult homophonic cipher constructed to match properties of the famously unsolved Zodiac 340. Gains in accuracy saturate after only a hundred random restarts for Zodiac 408, but for the constructed cipher we see large gains in accuracy even as we scale the number of random restarts up into the hundreds of thousands. In both cases the difference between few and many random restarts is the difference between almost complete failure and successful decipherment.

We also find that millions of random restarts can be helpful for performing exploratory analysis. We look at some empirical properties of decipherment problems, visualizing the distribution of local optima encountered by EM both in a successful decipherment of a homophonic cipher and in an unsuccessful attempt to decipher Zodiac 340. Finally, we attack a series of ciphers generated to match properties of Zodiac 340 and use the results to argue that Zodiac 340 is likely not a homophonic cipher under the commonly assumed linearization order.

2 Decipherment Model

Various types of ciphers have been tackled by the NLP community with great success (Knight et al., 2006; Snyder et al., 2010; Ravi and Knight, 2011). Many of these approaches learn an encryption key by maximizing the score of the decrypted message under a language model. We focus on homophonic substitution ciphers, where the encryption key is a 1-to-many mapping from a plaintext alphabet to a cipher alphabet. We use a simple method introduced by Knight et al. (2006): the EM algorithm (Dempster et al., 1977) is used to learn the emission parameters of an HMM that has a character trigram language model as a backbone and the ciphertext as the observed sequence of emissions. This means that we learn a multinomial over cipher symbols for each plaintext character, but do not learn transition parameters, which are fixed by the language model. We predict the deciphered text using posterior decoding in the learned HMM.
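To make the model concrete, the following is a minimal sketch of this emission-only EM procedure, not the paper's implementation. For brevity it uses a character bigram language model, so each HMM state is a single plaintext character; the paper uses a trigram model, where states would be character pairs. The function name em_decipher and its signature are illustrative assumptions.

```python
import numpy as np

def em_decipher(cipher, trans, n_sym, iters=200, smooth=0.1, seed=0):
    """One EM run for a homophonic-cipher HMM (illustrative sketch).

    `cipher` is a sequence of integer cipher-symbol ids and trans[i, j] is
    the fixed LM probability P(char j | char i).  Only the emission table
    emit[i, k] = P(symbol k | char i) is learned; transitions are never
    re-estimated, which is what lets the language model steer decipherment.
    """
    rng = np.random.default_rng(seed)
    n_plain = trans.shape[0]
    T = len(cipher)
    # Random restart: draw each emission weight uniformly from [0, 1],
    # then normalize each row into a multinomial (see Section 3.1).
    emit = rng.random((n_plain, n_sym))
    emit /= emit.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: forward-backward over the fixed-transition HMM,
        # rescaling each step for numerical stability.
        alpha = np.empty((T, n_plain))
        beta = np.empty((T, n_plain))
        loglik = 0.0
        alpha[0] = emit[:, cipher[0]] / n_plain
        for t in range(T):
            if t > 0:
                alpha[t] = (alpha[t - 1] @ trans) * emit[:, cipher[t]]
            z = alpha[t].sum()
            alpha[t] /= z
            loglik += np.log(z)
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = trans @ (emit[:, cipher[t + 1]] * beta[t + 1])
            beta[t] /= beta[t].sum()
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)  # P(state_t | ciphertext)
        # M-step: re-estimate emissions only, adding 0.1 to the expected
        # counts before normalizing (the smoothing of Section 3.1).
        counts = np.full((n_plain, n_sym), smooth)
        np.add.at(counts.T, cipher, gamma)  # counts[:, c_t] += gamma[t]
        emit = counts / counts.sum(axis=1, keepdims=True)
    # Posterior decoding: most probable plaintext character per position.
    return gamma.argmax(axis=1), loglik
```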
2.1 Implementation

Running multiple random restarts means running EM to convergence multiple times, which can be computationally intensive; luckily, restarts can be run in parallel. This kind of parallelism is a good fit for the Single Instruction, Multiple Thread (SIMT) hardware paradigm implemented by modern GPUs. We implemented EM with parallel random restarts using the CUDA API (Nickolls et al., 2008). With a GPU workstation,[1] we can complete a million random restarts roughly a thousand times more quickly than we can complete the same computation with a serial implementation on a CPU.

[1] We used a single workstation with three NVIDIA GTX 580 GPUs. These are consumer graphics cards introduced in 2011.

3 Experiments

We ran experiments on several homophonic substitution ciphers: some produced by the infamous Zodiac killer, and others automatically generated to be similar to the Zodiac ciphers. In each of these experiments we ran numerous random restarts, and in all cases we chose the random restart that attained the highest model score in order to produce the final decode.

3.1 Experimental Setup

The specifics of how random restarts are produced are usually considered a detail; in this work, however, it is important to describe the process precisely. In order to generate random restarts, we sampled emission parameters by drawing uniformly at random from the interval [0, 1] and then normalizing. The corresponding distribution on the multinomial emission parameters is mildly concentrated at the center of the simplex.[2]

[2] We also ran experiments where emission parameters were drawn from Dirichlet distributions with various concentration parameter settings. We noticed little effect so long as the distribution did not favor the corners of the simplex; if it did, decipherment results deteriorated sharply.

For each random restart, we ran EM for 200 iterations.[3] We found that smoothing EM was important for good performance: we added a smoothing constant of 0.1 to the expected emission counts before each M-step, tuning this value on a small held-out set of automatically generated ciphers. In all experiments we used a trigram character language model that was linearly interpolated from character unigram, bigram, and trigram counts extracted from both the Google N-gram dataset (Brants and Franz, 2006) and a small corpus (about 2K words) of plaintext messages authored by the Zodiac killer.[4]

[3] While this does not guarantee convergence, in practice 200 iterations seem to be sufficient for the problems we looked at.

[4] The interpolation between n-gram orders is uniform, and the interpolation between corpora favors the Zodiac corpus with weight 0.9.
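Because each restart is independent, the restart loop itself is embarrassingly parallel; the paper maps it onto GPU threads via CUDA. As a serial stand-in, here is a hedged sketch of a driver that keeps the decode with the highest model score, reusing the illustrative em_decipher from Section 2; best_of_restarts is likewise a hypothetical name.

```python
import numpy as np

def best_of_restarts(cipher, trans, n_sym, n_restarts=100):
    """Run many independently seeded EM restarts and keep the decode whose
    final model score (log-likelihood) is highest, as described in
    Section 3.  Restarts are independent, so a real implementation can
    farm them out to parallel workers (or, as in the paper, GPU threads).
    """
    best_score, best_decode = -np.inf, None
    for seed in range(n_restarts):
        decode, loglik = em_decipher(cipher, trans, n_sym, seed=seed)
        if loglik > best_score:
            best_score, best_decode = loglik, decode
    return best_decode, best_score
```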
3.2 An Easy Cipher: Zodiac 408

Zodiac 408 is a homophonic cipher that is 408 characters long and contains 54 different cipher symbols. Produced by the Zodiac killer, this cipher was solved, manually, by two amateur code-breakers a week after its release to the public in 1969. Ravi and Knight (2011) were the first to crack Zodiac 408 using completely automatic methods.

In our first experiment, we compare a decode of Zodiac 408 using one random restart to a decode using 100 random restarts. Random restarts have high variance, so when we present the accuracy corresponding to a given number of restarts, we present an average over many bootstrap samples drawn from a set of one million random restarts. If we attack Zodiac 408 with a single random restart, on average we achieve an accuracy of 18%. If we instead use 100 random restarts, we achieve a much better average accuracy of 90%. The accuracies for various numbers of random restarts are plotted in Figure 1. Based on these results, we expect accuracy to increase by about 72% when using 100 random restarts instead of a single random restart; however, using more than 100 random restarts for this particular cipher does not appear to be useful.

Figure 1: Zodiac 408 cipher. Accuracy by best model score and best model score vs. number of random restarts. Bootstrapped from 1M random restarts.

Also in Figure 1, we plot a related graph, this time showing the effect that random restarts have on the achieved model score. By construction, the (maximum) model score must increase as we increase the number of random restarts. We see that it quickly saturates in the same way that accuracy did. This raises the question: have we actually achieved the globally optimal model score, or have we only saturated the usefulness of random restarts? We can't prove that we have achieved the global optimum,[5] but we can at least check that we have surpassed the model score achieved by EM when it is initialized with the gold encryption key. On Zodiac 408, if we initialize with the gold key, EM finds a local optimum with a model score of −1467.4. The best model score over 1M random restarts is −1466.5, which means we have surpassed the gold initialization. The accuracy after gold initialization was 92%, while the accuracy of the best local optimum was only 89%. This suggests that the global optimum may not be worth finding if we haven't already found it. From Figure 1, it appears that large increases in likelihood are correlated with increases in accuracy, but small improvements to high likelihoods (e.g. the best local optimum versus the gold initialization) may not be.

[5] ILP solvers can be used to globally optimize objectives corresponding to short 1-to-1 substitution ciphers (Ravi and Knight, 2008), though these objectives are slightly different from the likelihood objectives faced by EM; we find that ILP encodings for even the shortest homophonic ciphers cannot be optimized in any reasonable amount of time.
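The bootstrap procedure above is described only in prose; one plausible reading is sketched below. Here scores and accs hold the final log-likelihood and accuracy of each completed restart in the large pool (1M runs in the paper), and the function name is hypothetical.

```python
import numpy as np

def bootstrap_accuracy(scores, accs, n_restarts, n_boot=1000, seed=0):
    """Estimate the expected accuracy of `n_restarts` random restarts from
    a pool of completed runs: each bootstrap sample draws n_restarts runs
    with replacement and keeps the accuracy of the run with the best
    model score, then we average over samples."""
    rng = np.random.default_rng(seed)
    scores, accs = np.asarray(scores), np.asarray(accs)
    total = 0.0
    for _ in range(n_boot):
        picks = rng.integers(0, len(scores), size=n_restarts)
        total += accs[picks[scores[picks].argmax()]]
    return total / n_boot
```

Sweeping n_restarts over powers of ten with this estimator would reproduce the kind of accuracy-vs-restarts curve shown in Figure 1.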
3.3 A Hard Cipher: Synth 340

What do these graphs look like for a harder cipher? Zodiac 340 is the second cipher released by the Zodiac killer, and it remains unsolved to this day. However, it is unknown whether Zodiac 340 is actually a homophonic cipher. If it were a homophonic cipher, we would certainly expect it to be harder than Zodiac 408: Zodiac 340 is shorter (only 340 characters long) and at the same time has more cipher symbols, 63. For our next experiment we generate a cipher, which we call Synth 340, to match properties of Zodiac 340; later we will generate multiple such ciphers.

We sample a random consecutive sequence of 340 characters from our small Zodiac corpus and use this as our message (and, of course, remove this sequence from our language model training data). We then generate an encryption key by assigning each of 63 cipher symbols to a single plaintext character so that the number of cipher symbols mapped to each plaintext character is proportional to the frequency of that character in the message (this balancing makes the cipher more difficult). Finally, we generate the actual ciphertext by sampling a cipher token for each plaintext token uniformly at random from the cipher symbols allowed for that token under our generated key.

In Figure 2, we display the same type of plot, this time for Synth 340.

Figure 2: Synth 340 cipher. Accuracy by best model score and best model score vs. number of random restarts. Bootstrapped from 1M random restarts.

For this cipher, there is an absolute gain in accuracy of about 9% between 100 random restarts and 100K random restarts. A similarly large gain is seen for model score as we scale up the number of restarts. This means that, even after tens of thousands of random restarts, EM is still finding new local optima with better likelihoods. It also appears that, even for a short cipher like Synth 340, likelihood and accuracy are reasonably coupled.

We can visualize the distribution of local optima encountered by EM across 1M random restarts by plotting a histogram. Figure 3 shows, for each range of likelihood, the number of random restarts that led to a local optimum with a model score in that range.

Figure 3: Synth 340 cipher. Histogram of the likelihoods of the local optima encountered by EM across 1M random restarts. Several peaks are labeled with their average accuracy and a snippet of a decode (42% "ntyouldli", 59% "veautital", 74% "veautiful"). The gold snippet is "beautiful."

It is quickly visible that a few model scores are substantially more likely than all the rest. This kind of sparsity might be expected if there were a small number of local optima that EM was extremely likely to find. We can check whether the peaks of this histogram each correspond to a single local optimum or whether each is composed of multiple local optima that happen to have the same likelihood. For the histogram bucket corresponding to a particular peak, we compute the average relative difference between each multinomial parameter and its mean. The average relative difference for the highest peak in Figure 3 is 0.8%, and for the second highest peak it is 0.3%. These values are much smaller than the average relative difference between the means of these two peaks, 40%, indicating that the peaks do correspond to single local optima or to collections of extremely similar local optima.

There are several very small peaks that have the highest model scores (the peak with the highest model score has a frequency of 90, which is too small to be visible in Figure 3). The fact that these model scores are both high and rare is the reason we continue to see improvements to both accuracy and model score as we run numerous random restarts. The two tallest peaks and the peak with the highest model score are labeled with their average accuracy and a small snippet of a decode in Figure 3.
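The within-peak similarity statistic is also described only in prose; one plausible implementation is sketched below. Here emits stacks the learned emission tables of all runs whose likelihood fell into a given histogram bucket, and the epsilon guard is our own assumption.

```python
import numpy as np

def avg_relative_difference(emits):
    """Average relative difference between each multinomial parameter and
    its mean, over all EM runs in one histogram bucket.  `emits` has shape
    (n_runs, n_plain, n_sym).  Small values suggest the bucket is a single
    local optimum (or a collection of extremely similar ones)."""
    emits = np.asarray(emits)
    mean = emits.mean(axis=0)
    eps = 1e-12  # guard against division by zero-probability cells
    return np.mean(np.abs(emits - mean) / (mean + eps))
```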
The gold snippet is the word "beautiful."

3.4 An Unsolved Cipher: Zodiac 340

In a final experiment, we look at the Zodiac 340 cipher. As mentioned, this cipher has never been cracked and may not be a homophonic cipher, or even a valid cipher of any kind. The reading order of the cipher, which consists of a grid of symbols, is unknown. We make two arguments supporting the claim that Zodiac 340 is not a homophonic cipher with row-major reading order: the first is statistical, based on the success rate of attempts to crack similar synthetic ciphers; the second is qualitative, comparing distributions of local optimum likelihoods.

If Zodiac 340 is a homophonic cipher, should we expect to crack it? In order to answer this question we generate 100 more ciphers in the same way we generated Synth 340. We use 10K random restarts to attack each cipher, and compute accuracies by best model score. The average accuracy across these 100 ciphers was 75% and the minimum accuracy was 36%. All but two of the ciphers were deciphered with more than 51% accuracy, which is usually sufficient for a human to identify a decode as partially correct.

We attempted to crack Zodiac 340 using a row-major reading order and 1M random restarts, but the decode with the best model score was nonsensical. This outcome would be unlikely if Zodiac 340 were like our synthetic ciphers, so Zodiac 340 is probably not a homophonic cipher with a row-major order. Of course, it could be a homophonic cipher with a different reading order. It could also be the case that a large number of salt tokens were inserted, or that some other assumption is incorrect.

In Figure 4, we show the histogram of model scores for the attempt to crack Zodiac 340.

Figure 4: Zodiac 340 cipher. Histogram of the likelihoods of the local optima encountered by EM across 1M random restarts.

We note that this histogram is strikingly different from the histogram for Synth 340. Zodiac 340's histogram is not as sparse, and the range of model scores is much smaller. The sparsity of Synth 340's histogram (but not Zodiac 340's histogram) is typical of the histograms corresponding to our set of 100 generated ciphers.
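For reference, here is a hedged sketch of the synthetic-cipher generation procedure described in Section 3.3 and reused for the 100 ciphers above. The paper does not specify how rounding is handled when allotting symbols to characters, so that step is flagged as an assumption, as are all names.

```python
import numpy as np

def make_synth_cipher(corpus, alphabet, n_chars=340, n_syms=63, seed=0):
    """Generate a Synth-340-style homophonic cipher (illustrative sketch).

    A consecutive n_chars-character message is sampled from `corpus`, the
    n_syms cipher symbols are allotted to plaintext characters roughly in
    proportion to each character's frequency in the message, and each
    plaintext token is encrypted by a uniform draw among the symbols
    allotted to it.
    """
    rng = np.random.default_rng(seed)
    start = rng.integers(0, len(corpus) - n_chars)
    message = corpus[start:start + n_chars]
    # Allot symbols proportionally to character frequency in the message.
    freqs = np.array([message.count(c) for c in alphabet], dtype=float)
    shares = np.round(n_syms * freqs / freqs.sum()).astype(int)
    shares[freqs > 0] = np.maximum(shares[freqs > 0], 1)
    # NOTE (assumption): rounding may leave sum(shares) != n_syms; a real
    # implementation would repair the allotment so the totals match.
    key, next_sym = {}, 0
    for c, k in zip(alphabet, shares):
        key[c] = list(range(next_sym, next_sym + k))
        next_sym += k
    # Encrypt each plaintext token by a uniform draw among its symbols.
    ciphertext = [int(rng.choice(key[c])) for c in message]
    return message, key, ciphertext
```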
4 Conclusion

Random restarts, often considered a footnote of experimental design, can indeed be useful at scales beyond those generally used in past work. In particular, we found that the initializations that lead to the local optima with the highest likelihoods are sometimes very rare, but finding them can be worthwhile: for the problems we looked at, local optima with high likelihoods also achieved high accuracies. While the present experiments are on a very specific unsupervised learning problem, it is certainly reasonable to think that large-scale random restarts have potential more broadly. In addition to improving search, large-scale restarts can also provide a novel perspective when performing exploratory analysis, here letting us argue in support of the hypothesis that Zodiac 340 is not a row-major homophonic cipher.

References

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1. Linguistic Data Consortium, Catalog Number LDC2009T25.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the 2006 Annual Meeting of the Association for Computational Linguistics.

John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable parallel programming with CUDA. Queue.

Sujith Ravi and Kevin Knight. 2008. Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.

Sujith Ravi and Kevin Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 2011 Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proceedings of the 2010 Annual Meeting of the Association for Computational Linguistics.