The Cracking Code Book. Simon Singh
then this is probably a substitute for t, and so on. Al-Kindī’s technique, known as frequency analysis, shows that it is unnecessary to check each of the billions of potential keys. Instead, it is possible to reveal the contents of a scrambled message simply by analyzing the frequency of the characters in the ciphertext.
Table 1. This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Beker and F. Piper, and originally published in Cipher Systems: The Protection of Communication.
However, it is not possible to apply al-Kindī’s recipe for cryptanalysis unconditionally, because the standard list of frequencies in Table 1 is only an average, and it will not correspond exactly to the frequencies of every text. For example, a brief message discussing the effect of the atmosphere on the movement of striped quadrupeds in Africa (“From Zanzibar to Zambia and Zaire, ozone zones make zebras run zany zigzags”) would not, if encrypted, yield to straightforward frequency analysis. In general, short texts are likely to deviate significantly from the standard frequencies, and if there are fewer than a hundred letters, then decipherment will be very difficult.
On the other hand, longer texts are more likely to follow the standard frequencies, although this is not always the case. In 1969, the French author Georges Perec wrote La Disparition, a 200-page novel that did not use words that contain the letter e. Doubly remarkable is the fact that the English novelist and critic Gilbert Adair succeeded in translating La Disparition into English while still following Perec’s avoidance of the letter e. Entitled A Void, Adair’s translation is surprisingly readable (see Appendix A). If the entire book were encrypted via a monoalphabetic substitution cipher, then a naive attempt to decipher it might be thwarted by the complete lack of the most frequently occurring letter in the English alphabet.
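To see how far a short message can stray from the averages, the letters of the zebra sentence can simply be counted. What follows is a minimal Python sketch rather than anything from the original text; the helper name `letter_counts` is an illustrative choice.

```python
from collections import Counter

def letter_counts(text):
    """Count alphabetic characters only, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

sentence = ("From Zanzibar to Zambia and Zaire, "
            "ozone zones make zebras run zany zigzags")

counts = letter_counts(sentence)
total = sum(counts.values())

# List the five commonest letters with their share of the sentence.
for letter, n in counts.most_common(5):
    print(f"{letter}: {n} ({100 * n / total:.1f}%)")
```

In this sentence z turns up ten times, twice as often as e, which is why a codebreaker who trusted the standard frequencies would be led astray.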
Having described the first tool of cryptanalysis, I shall continue by giving an example of how frequency analysis is used to decipher a ciphertext. I have avoided littering the whole book with examples of cryptanalysis, but with frequency analysis I make an exception. This is partly because frequency analysis is not as difficult as it sounds, and partly because it is the primary cryptanalytical tool. Furthermore, the example that follows provides insight into the method of the cryptanalyst. Although frequency analysis requires logical thinking, you will see that it also demands cunning, intuition, flexibility and guesswork.
CRYPTANALYZING A CIPHERTEXT
PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X IXNCMJ CI UCMJ SXGOKLU?”
OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK
Imagine that we have intercepted this scrambled message. The challenge is to decipher it. We know that the text is in English, and that it has been scrambled according to a monoalphabetic substitution cipher, but we have no idea of the key. Searching all possible keys is impractical, so we must apply frequency analysis. What follows is a step-by-step guide to cryptanalyzing the ciphertext, but if you feel confident, then you might prefer to ignore this and attempt your own independent cryptanalysis.
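The scale of the problem can be made concrete with a short calculation. The sketch below assumes that a key is any rearrangement of the 26-letter alphabet (including the trivial one that changes nothing), and the checking speed of one billion keys per second is a hypothetical figure chosen purely for illustration.

```python
import math

# A monoalphabetic key is a rearrangement of the 26-letter alphabet.
key_count = math.factorial(26)

# Hypothetical attacker speed, an assumption made only for this example.
KEYS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

years_needed = key_count / KEYS_PER_SECOND / SECONDS_PER_YEAR

print(f"possible keys: {key_count:,}")
print(f"years to try them all: {years_needed:.2e}")
```

Even at that generous speed, an exhaustive search would take more than ten billion years, which is why the cryptanalyst turns to frequency analysis instead.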
The immediate reaction of any cryptanalyst upon seeing such a ciphertext is to analyze the frequency of all the letters, which results in Table 2. Not surprisingly, the letters vary in their frequency. The question is, can we identify what any of them represent, based on their frequencies? The ciphertext is relatively short, so we cannot rely wholly on frequency analysis. It would be naive to assume that the commonest letter in the ciphertext, O, represents the commonest letter in English, e, or that the eighth most frequent letter in the ciphertext, Y, represents the eighth most frequent letter in English, h. An unquestioning application of frequency analysis would lead to gibberish. For example, the first word, PCQ, would be deciphered as aov.
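The unquestioning approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of Table 1 or Table 2: `ENGLISH_ORDER` is a commonly quoted ranking of English letters, and the function names `rank_letters` and `naive_decipher` are invented for this example.

```python
from collections import Counter

# Commonly quoted English letter order, most to least frequent (an assumption,
# not copied from Table 1; letters with near-equal frequencies could be swapped).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def rank_letters(ciphertext):
    """Return the ciphertext letters ordered from most to least frequent."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def naive_decipher(ciphertext):
    """Map the n-th commonest cipher letter to the n-th commonest English letter."""
    ranked = rank_letters(ciphertext)
    mapping = dict(zip(ranked, ENGLISH_ORDER))
    # Characters that are not cipher letters (spaces, punctuation) pass through.
    return "".join(mapping.get(ch, ch) for ch in ciphertext.upper())

# First sentence of the intercepted message, used here as a short sample.
sample = "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK."
print(rank_letters(sample))
print(naive_decipher(sample))
```

Because the sample is only a fragment, its ranking will not match the counts for the full ciphertext, but the lesson is the same: a strict rank-for-rank substitution produces nonsense, so the frequencies have to be treated as clues rather than as a lookup table.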
Table 2. Frequency analysis of the enciphered message.
However, we can begin by focusing attention on the only three letters that appear more than thirty times in the ciphertext, namely O, X and P. Let us assume that the commonest letters in the ciphertext probably represent the commonest letters in the English alphabet, but not necessarily in the right order. In other words, we cannot be sure that O = e, X = t and P = a, but we can make the tentative assumption that
O = e, t or a
X = e, t or a
P = e, t or a
In order to proceed with confidence and pin down the identity of the three most common letters, O, X and P, we need a more subtle form of frequency analysis. Instead of simply counting the frequency of the three letters, we can focus on how often they appear next to all the other letters. For example, does the letter O appear before or after several other letters, or does it tend to neighbour just a few special letters? Answering this question will be a good indication of whether O represents a vowel or a consonant. If O represents a vowel, it should appear before and after most of the other letters, whereas if it represents a consonant, it will tend to avoid many of the other letters. For example, the vowel e can appear before and after virtually every other letter, but the consonant t is rarely seen before or after b, d, g, j, k, m, q or v.
The table below takes the three most common letters in the ciphertext, O, X and P, and lists how frequently each appears before or after every letter. For example, O appears before A on one occasion but never appears immediately after it, giving a total of one in the first box. The letter O neighbours the majority of letters, and there are only seven that it avoids completely, represented by the seven zeroes in the O row. The letter X is equally sociable, because it too neighbours most of the letters and avoids only eight of them. However, the letter P is much less friendly. It tends to lurk around just a few letters and avoids fifteen of them. This evidence suggests that O and X represent vowels, while P represents a consonant.
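This neighbour count is easy to mechanize. The following Python sketch is one possible way to do it, assuming that only letters within the same word count as neighbours (so spaces and punctuation break adjacency); the names `neighbour_counts` and `avoided_letters` are hypothetical helpers introduced for this example.

```python
import re
from collections import Counter
from string import ascii_uppercase

def neighbour_counts(ciphertext, target):
    """Count how often `target` sits immediately before or after each letter.

    Assumption: only letters inside the same word count as neighbours.
    A doubled target (such as OO) is counted once as 'before' and once as 'after'.
    """
    target = target.upper()
    counts = Counter()
    for word in re.findall(r"[A-Z]+", ciphertext.upper()):
        for left, right in zip(word, word[1:]):
            if left == target:
                counts[right] += 1
            if right == target:
                counts[left] += 1
    return counts

def avoided_letters(ciphertext, target):
    """Return the letters that never appear next to `target`."""
    counts = neighbour_counts(ciphertext, target)
    return [ch for ch in ascii_uppercase if counts[ch] == 0]

# First sentence of the intercepted message, used here as a short sample.
sample = "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK."
for letter in "OXP":
    print(letter, "avoids", len(avoided_letters(sample, letter)), "letters")
```

A letter that avoids few others behaves like a vowel, while one that avoids many behaves like a consonant. On a single sentence the counts are too small to be conclusive, so the test is best run on the whole ciphertext.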
Now we must ask ourselves which vowels are represented by O and X. They are probably e and a, the two most popular vowels in the English language, but does O = e and X = a, or does O = a and X = e? An interesting feature in the ciphertext is that the combination OO appears twice, whereas XX does not appear at all. Since the letters ee appear far more often than aa in plaintext English, it is likely that O = e and X = a.
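The doubled-letter check can be expressed as a small function. This is a minimal sketch, with `count_double` as an illustrative name; the test words are the two places in the ciphertext where OO occurs.

```python
def count_double(ciphertext, letter):
    """Count places where `letter` is immediately followed by itself."""
    text = ciphertext.upper()
    pair = letter.upper() * 2
    # Slide over the text one position at a time so overlapping pairs count too.
    return sum(1 for i in range(len(text) - 1) if text[i:i + 2] == pair)

# The two ciphertext words that contain a doubled O.
words = "LBJOO ZOOP"
print("OO:", count_double(words, "O"))
print("XX:", count_double(words, "X"))
```

Run over the full ciphertext, the same function gives the comparison described above: a letter that doubles up is a better candidate for e than for a.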
At this point, we have confidently identified two of the letters in the ciphertext. Our conclusion that X = a is supported by the fact that X appears on its own in the ciphertext, and a is one of only two English words that consist of a single letter. The only other letter that appears on its own in the ciphertext