1 #! perl -lw
2 # aid to solving Chu-Carrol's cryptogram statistically
3
I was annoyed that a mathematician's introduction to Crytanalysis
http://scienceblogs.com/goodmath/2008/08/introducing_cryptanalysis.php
was in the worked example largely wordsmithing and inadequately statistical. While the crypto puzzle fans have tools that run on MSWindows, I don't by choice. So I hacked together the following. I added Getopt to suppress later enhancements by default for presenting this code to Boston.PM.
May not be used, most if not all debug is #'d instead.
This is for Simple Substitution which no one uses but simpleton schoolboys and pedantic scholars like us. The -W
words option as coded requires word breaks, and the -X
pairs table option has most value if neither word breaks nor doubled letters are suppressed. This serves as a simple demonstration as to why the final generations of professional pencil and paper codes and ciphers required obscuring punctuation, word breaks, Capitalization, and doubled letters, all of which would be restored on a best effort basis by the decipering code clerk using common sense and the intrinsic redundancy of natural language.
41 42 use warnings; 43 use strict; 44 45 use List::Util qw/max/; 46 use Getopt::Long; 47 my ($verbose, $pairs, $file, $words, $cheat,); 48 GetOptions( 49 "X" => \$pairs, # digraph or contact frequency 50 "file=s" => \$file, # solve a different example 51 "verbose" => \$verbose, # flag 52 "W" => \$words, # highlight words indicated 53 "xyzzy" => \$cheat, # you wouldn't !! 54 ); 55 56 57
since lower case is easier to read as natural language and upper case is easier to see non-liguistic patterns, it is conventional to write the enciphered text in upper case, eg WSSHU FDVH and clear text in all hower case as in this paragraph. (in military circles, red pencil or ink is used for cleartext, if only to ensure worksheets with clear is burned.)
Hash %Xlate is keyed by single uppercase cipher letters with values as the matching lower case cleartext letter.
The current hash is printed CRYPT over / clear under in both XYZ and abc orders at the top of the output. $XYZ and $abc are lists of the solved letters, in similar sequence. $XYZ2, $abc2 are likewise, but the other sequence. Then print the un-matched letters.
67 68 my %Xlate = split //, shift @ARGV; 69 70 if ($cheat) 71 { 72 @Xlate{split //, "XVCTYHKQMZLGBUJI"} = 73 (split //, scalar reverse "rhsiaoefbwtgydum"); 74 75 } 76 77 # printf "%s=>'%s', ",$_,$Xlate{$_} for sort keys %Xlate ; 78 my ($XYZ, $abc) = ( 79 ( 80 join q{}, sort { $Xlate{$a} cmp $Xlate{$b} } 81 keys %Xlate 82 ), 83 ( 84 join q{}, 85 @Xlate{ 86 sort { $Xlate{$a} cmp $Xlate{$b} } 87 keys %Xlate 88 } 89 ) 90 ); 91 my ($XYZ2, $abc2) = ( 92 ( 93 join q{}, sort 94 keys %Xlate 95 ), 96 ( 97 join q{}, 98 @Xlate{ 99 sort 100 keys %Xlate 101 } 102 ) 103 ); 104 105 printf "\n %s\t%s =>\n=>%s\t%s \n", $XYZ, $XYZ2, $abc, $abc2; 106 107 my $alphabet = join q{}, 'a' .. 'z'; 108 my $ALPHABET = join q{}, 'A' .. 'Z'; 109 110 if (keys %Xlate) 111 { 112 $alphabet =~ s/ @{[ q{[} . join( q{}, values %Xlate ) . q{]} ]} //xg; 113 $ALPHABET =~ s/ @{[ q{[} . join( q{}, keys %Xlate ) . q{]} ]} //xg; 114 print "\n unsolved: $ALPHABET; not found: $alphabet \n "; 115 } 116
The Chu-Carroll cryptogram is appended to the source so read from data.
122 123 # slurp ... must be in {} or after dict load 124 125 my $all = do { local $/; <DATA>; }; 126 127 if (@ARGV) 128 { 129 local $/; 130 $all = <>; 131 } 132 133 # enforce convention 134 $all =~ y/a-z/A-Z/; 135 136 # print $all; # -- will do this later 137 138 my %Count; # used repeatedly 139
ordercannon()
Cryptographers classify words by positions of same or different letters.
Words ``the'' and ``car'' are both of type 123, since no letter is reused, while ``aardvark'' is 11234125. The ordercannon()
function allows for words more than 9 unique letters, so it interpolates dots to distinguish .1.1. from .11. . The %WordsByShape hash indexes words in a dictionary by these ordercannon()
strings and holds array of matching words. So
'1.2.2.3.4' =
[ 'upper', ...] >
as would $WordsByShape{ ordercannon('HCCRE') } # rot13
Uri and Bob Rogers optimized ordercannon()
with s///ge
instead of s///g for split
.
if -W is specified, any words that are sufficiently deciphered that the unmatched clear letters and ordercannon()
find only one or a handful in the dictionary, they are output next.
If there a lot of strange, useless words suggested and others with no matches it's a hint that some of the guessed letters are wrong. Try
-W Ui
to see this.
This scheme is amenable to optimization. This isn't, except the dictionary is only built when it will be used (-W).
167 168 169 my %WordsByShape; 170 171 if ($words) 172 { 173 my $fn = q(/usr/share/dict/words); 174 my $dict; 175 open $dict, $fn or die $!; 176 while (<$dict>) 177 { 178 chomp; 179 next if /\W|[A-Z]/; # skip compounds and Proper 180 push @{$WordsByShape{ordercannon($_)}}, $_; 181 } 182 183 #use Data::Dumper; 184 #print Dumper(%WordsByShape); 185 #exit 186 } 187 188 sub ordercannon 189 { 190 my $w = shift; 191 my (%C, $n); 192 $w =~ s/(\w)/($C{$1} ||= ++$n) . '.'/ge; # BobR 193 194 # my $i = 1; 195 # $i += !! ( $w =~ s/$_/$i./g ) 196 # for (split //, $w); 197 $w; 198 } 199 200 sub wordsLike 201 { 202 my $partial = shift; 203 my $cannon = ordercannon($partial); 204 205 # print "$partial => $cannon"; 206 207 $partial =~ s/[A-Z]/./g if !$abc; 208 $partial =~ s/[A-Z]/[^$abc]/g if $abc; ## letters not yet matched 209 # print "$partial => $cannon"; 210 211 local $" = q(,); 212 my @W = @{$WordsByShape{$cannon} || []}; 213 214 # print "<<@W>>"; 215 216 @W = grep /^$partial$/i, @W; 217 218 if ($partial =~ s/s$/s?/) 219 { 220 $cannon =~ s/\d+[.]$//; 221 222 # plurals is optional 223 224 push @W, $_ for grep /^$partial$/i, @{$WordsByShape{$cannon}}; 225 226 # unshift @W, $partial; 227 #unshift @W, $cannon; 228 } 229 230 # print "<<<@W>>>"; 231 # unshift %Xlate@W, $partial; 232 #unshift @W, $cannon; 233 return wantarray ? @W : "@W"; 234 } 235 236 if ($words) 237 { 238 print "\n -W : show candidate words"; 239 240 my @Words = split /[\W]/, $all; 241 my $n = 10; 242 $Count{$_}++ for @Words; 243 244 for my $w ( 245 sort { +1 * ($Count{$b} <=> $Count{$a}) } 246 grep { 247 /\w/ 248 249 # ... and $Count{$_}>1 250 } keys %Count 251 ) 252 { 253 my $tmp = $w; 254 eval "\$tmp =~ tr/$XYZ/$abc/"; 255 die $@ if $@; 256 next unless $tmp =~ /[A-Z]/; ## still in play? ignore solved 257 next unless $tmp =~ /[a-z]/ ## any info? 258 or length($tmp) <= 2; # or trivial 259 my @W = wordsLike($tmp); 260 next if @W > 30; 261 262 local $" = q','; 263 264 printf "%12s '%12s' %-6d : %-40s\n", $w, $tmp, $Count{$w}, 265 $tmp =~ /[A-Z]/ ? substr("@W", 0, 40) : ""; 266 267 last if $n-- == 0; 268 } 269 } 270
The frequency in decreasing order should approximate
ETAOIN SHRDLU
which are famous as columns one and two on the Linotype(tm)
keyboard.
But note *approximately*.
And this and other statistical comments assume a cleartext in unstilted English.
282 283 %Count = (); 284 285 my @Letters = split //, $all; 286 $Count{$_}++ for @Letters; 287 my $cut = 0; # (max values %Count)/5; # no cut if wrapping 288 289 my $i = 0; 290 printf "%s%s %s %-3d", ($i++ % 8 ? ' ' : "\n "), $_, $Xlate{$_} || q{ }, 291 $Count{$_}, 292 for sort { 1 * ($Count{$b} <=> $Count{$a}) } 293 grep { /\w/ and $Count{$_} > $cut } keys %Count; 294 printf "\n"; 295
First letters of pairs are down the left; second letters are along the top. Guessed cheartext as usual is lower case and makes a broken copy of the edge keys.
Zero counts are suppressed for clarity. Zeroes on the main Diagonal are plotted as '/'. Double letters in the ciphertext have a number on the main diagonal, so look for gaps in the line of /s for the telltale doubles.
Pairs like '_X' and 'X_' are cipher X as respectively first letter (follows a space or other nonword) and last letter of a word (precedes a space or punctuation). Clear 't' should be popular both as '_t' and 't_' and 'th' but not 'ht'. Initial '_e' is less common than terminal 'e_'. Both 'tt' and 'ee' may occur.
306 307 %Count = (); 308 my $temp = " $all "; 309 310 if ($pairs) 311 { 312 print "\n-X Digraph or pairs or 'contact' frequency table"; 313 $temp =~ tr/A-Z/_/cs; 314 315 # print $temp; 316 # print length($temp); 317 $Count{$_}++ for ($temp =~ m/(..)/g); 318 $temp =~ s/^.//; 319 $Count{$_}++ for ($temp =~ m/(..)/g); 320 my $nn; 321 $nn += $Count{$_} for keys %Count; 322 323 # print $nn; 324 my @L = ('_', 'A' .. 'Z'); 325 print 'v> ', join(' ', @L); 326 print ' ', join(' ', map { $Xlate{$_} || q[ ] } @L); 327 for my $c (@L) 328 { 329 printf "%s%s|", $c, ($Xlate{$c} || q[ ]); 330 printf "%2s ", 331 ( 332 $Count{"$c$_"} 333 ? (($Count{"$c$_"} > 99) ? '**' : $Count{"$c$_"}) 334 : ($c eq $_ ? q{ \\} : q{ }) 335 ) for @L; 336 printf "\n"; 337 } 338 printf "\n"; 339 } 340
Use the guesses to partially decode.
This uses the approved exception to the ``never use eval of a string, always of a block ... unless there's no other way''. TR does not do doublequote evaluation of its args as m//, s/// do.
348 349 printf "\n"; 350 351 eval " tr/$XYZ/$abc/ " for ($all); 352 die $@ if $@; 353 354 if (0 && $cheat) 355 { 356 # simple, but -X, -W don't see it ... so moved earlier. 357 $all =~ y(HZLGBUJIPFXVCTYKQDMW) 358 (teoaishrncmudygwbpfk); 359 } 360 print $all; 361
365 366 __END__ 367 b czfbczc bh bu gqlvh hbxz hl uhgih vd xt lkp qrly. lpz hjbpy hjgh b jgaz 368 plhbfzc bp xt hbxz qilkubpy hjz qrlyludjziz bu hjgh hjziz giz g kjlrz 369 rlh lm hziibmbf ufbzpfz qrlyu lvh hjziz: ligf, djgitpyvrg, gzhblrlyt, 370 evuh hl pgxz g mzk. qvh hjziz bu plh pzgirt ul xvfj lvh hjziz czcbfghzc 371 hl xghj - gpc bp dgihbfvrgi hl hjz xbuvuz lm xghj. b hjbpw hjgh hjgh bu 372 g cgxp ujgxz, qzfgvuz bp xt zodzibzpfz, lpz lm hjz xluh frzgi kgtu lm 373 bczphbmtbpy g figfwdlh bu hjilvyj xghj. pl xghhzi hjz udzfbmbf uvqezfh, 374 hjz figfwdlhu grkgtu zbhjzi galbc li ufizk vd hjz xghj. kjzhjzi bh'u 375 hjz "xzifvit fgvuzu gvhbux" mlrwu, hjz azrbwlauwbgpu, fizghblpbuhu, 376 grh-xzcbfbpz svgfwu, izdvqrbfgp dlrruhziu, li ufbzphlrlybuhu - tlv fgp 377 grkgtu izflypbnz hjz figfwdlhu qt hjzbi xghj. ul b gx ylbpy hl cl xt 378 qzuh hl dilabcz g albfz lm xghjzxghbfgr ugpbht - qlhj qt 379 ujlkbpy kjgh'u kilpy kbhj hjz qgc xghj urld dvxdzc lvh qt hjz rllpbzu, 380 gpc qt ujlkbpy jlk yllc xghj kliwu. 381