chu.pl

Code Index:


   1 #! perl -lw
   2 # aid to solving Chu-Carrol's cryptogram statistically
   3 


aid to solving cryptograms statistically

I was annoyed that a mathematician's introduction to Crytanalysis

http://scienceblogs.com/goodmath/2008/08/introducing_cryptanalysis.php

was in the worked example largely wordsmithing and inadequately statistical. While the crypto puzzle fans have tools that run on MSWindows, I don't by choice. So I hacked together the following. I added Getopt to suppress later enhancements by default for presenting this code to Boston.PM.

Options and args

XaYbZc -- guess cipher X represents cheartext 'a' etc.
-X -- adds a table of digraph or contact frequency
-file=filename -- solve a different cryptogram
-verbose

May not be used, most if not all debug is #'d instead.

-W -- list words partly solved with unique or few dictionary matches.

Limitations

This is for Simple Substitution which no one uses but simpleton schoolboys and pedantic scholars like us. The -W words option as coded requires word breaks, and the -X pairs table option has most value if neither word breaks nor doubled letters are suppressed. This serves as a simple demonstration as to why the final generations of professional pencil and paper codes and ciphers required obscuring punctuation, word breaks, Capitalization, and doubled letters, all of which would be restored on a best effort basis by the decipering code clerk using common sense and the intrinsic redundancy of natural language.

  41 
  42 use warnings;
  43 use strict;
  44 
  45 use List::Util qw/max/;
  46 use Getopt::Long;
  47 my ($verbose, $pairs, $file, $words, $cheat,);
  48 GetOptions(
  49            "X"       => \$pairs,      # digraph or contact frequency
  50            "file=s"  => \$file,       # solve a different example
  51            "verbose" => \$verbose,    # flag
  52            "W"       => \$words,      # highlight words indicated
  53            "xyzzy"   => \$cheat,      # you wouldn't !!
  54           );
  55 
  56 
  57 

%Xlate and convention; show guesses in normal form

since lower case is easier to read as natural language and upper case is easier to see non-liguistic patterns, it is conventional to write the enciphered text in upper case, eg WSSHU FDVH and clear text in all hower case as in this paragraph. (in military circles, red pencil or ink is used for cleartext, if only to ensure worksheets with clear is burned.)

Hash %Xlate is keyed by single uppercase cipher letters with values as the matching lower case cleartext letter.

The current hash is printed CRYPT over / clear under in both XYZ and abc orders at the top of the output. $XYZ and $abc are lists of the solved letters, in similar sequence. $XYZ2, $abc2 are likewise, but the other sequence. Then print the un-matched letters.

  67 
  68 my %Xlate = split //, shift @ARGV;
  69 
  70 if ($cheat)
  71 {
  72     @Xlate{split //, "XVCTYHKQMZLGBUJI"} =
  73       (split //, scalar reverse "rhsiaoefbwtgydum");
  74 
  75 }
  76 
  77 # printf "%s=>'%s', ",$_,$Xlate{$_} for sort keys %Xlate ;
  78 my ($XYZ, $abc) = (
  79                    (
  80                     join q{}, sort { $Xlate{$a} cmp $Xlate{$b} }
  81                       keys %Xlate
  82                    ),
  83                    (
  84                     join q{},
  85                     @Xlate{
  86                         sort { $Xlate{$a} cmp $Xlate{$b} }
  87                           keys %Xlate
  88                       }
  89                    )
  90                   );
  91 my ($XYZ2, $abc2) = (
  92                      (
  93                       join q{}, sort
  94                         keys %Xlate
  95                      ),
  96                      (
  97                       join q{},
  98                       @Xlate{
  99                           sort
 100                             keys %Xlate
 101                         }
 102                      )
 103                     );
 104 
 105 printf "\n  %s\t%s =>\n=>%s\t%s \n", $XYZ, $XYZ2, $abc, $abc2;
 106 
 107 my $alphabet = join q{}, 'a' .. 'z';
 108 my $ALPHABET = join q{}, 'A' .. 'Z';
 109 
 110 if (keys %Xlate)
 111 {
 112     $alphabet =~ s/ @{[ q{[} . join( q{}, values %Xlate ) . q{]} ]} //xg;
 113     $ALPHABET =~ s/ @{[ q{[} . join( q{},  keys  %Xlate ) . q{]} ]} //xg;
 114     print "\n unsolved: $ALPHABET; not found: $alphabet \n ";
 115 }
 116 

load...

The Chu-Carroll cryptogram is appended to the source so read from data.

 122 
 123 # slurp ... must be in {} or after dict load
 124 
 125 my $all = do { local $/; <DATA>; };
 126 
 127 if (@ARGV)
 128 {
 129     local $/;
 130     $all = <>;
 131 }
 132 
 133 # enforce convention
 134 $all =~ y/a-z/A-Z/;
 135 
 136 # print $all;  # -- will do this later
 137 
 138 my %Count;    # used repeatedly
 139 

-W wordsLike(), shape dictionary %WordsByShape, ordercannon()

Cryptographers classify words by positions of same or different letters. Words ``the'' and ``car'' are both of type 123, since no letter is reused, while ``aardvark'' is 11234125. The ordercannon() function allows for words more than 9 unique letters, so it interpolates dots to distinguish .1.1. from .11. . The %WordsByShape hash indexes words in a dictionary by these ordercannon() strings and holds array of matching words. So

'1.2.2.3.4' = [ 'upper', ...] >

as would $WordsByShape{ ordercannon('HCCRE') } # rot13

Uri and Bob Rogers optimized ordercannon() with s///ge instead of s///g for split.

if -W is specified, any words that are sufficiently deciphered that the unmatched clear letters and ordercannon() find only one or a handful in the dictionary, they are output next.

If there a lot of strange, useless words suggested and others with no matches it's a hint that some of the guessed letters are wrong. Try

-W Ui

to see this.

This scheme is amenable to optimization. This isn't, except the dictionary is only built when it will be used (-W).

 167 
 168 
 169 my %WordsByShape;
 170 
 171 if ($words)
 172 {
 173     my $fn = q(/usr/share/dict/words);
 174     my $dict;
 175     open $dict, $fn or die $!;
 176     while (<$dict>)
 177     {
 178         chomp;
 179         next if /\W|[A-Z]/;    # skip compounds and Proper
 180         push @{$WordsByShape{ordercannon($_)}}, $_;
 181     }
 182 
 183     #use Data::Dumper;
 184     #print Dumper(%WordsByShape);
 185     #exit
 186 }
 187 
 188 sub ordercannon
 189 {
 190     my $w = shift;
 191     my (%C, $n);
 192     $w =~ s/(\w)/($C{$1} ||= ++$n) . '.'/ge;    # BobR
 193 
 194     #    my $i = 1;
 195     #       $i += !! ( $w =~ s/$_/$i./g )
 196     #                for (split //, $w);
 197     $w;
 198 }
 199 
 200 sub wordsLike
 201 {
 202     my $partial = shift;
 203     my $cannon  = ordercannon($partial);
 204 
 205     # print "$partial => $cannon";
 206 
 207     $partial =~ s/[A-Z]/./g       if !$abc;
 208     $partial =~ s/[A-Z]/[^$abc]/g if $abc;    ## letters not yet matched
 209                                               # print "$partial => $cannon";
 210 
 211     local $" = q(,);
 212     my @W = @{$WordsByShape{$cannon} || []};
 213 
 214     # print "<<@W>>";
 215 
 216     @W = grep /^$partial$/i, @W;
 217 
 218     if ($partial =~ s/s$/s?/)
 219     {
 220         $cannon =~ s/\d+[.]$//;
 221 
 222         # plurals is optional
 223 
 224         push @W, $_ for grep /^$partial$/i, @{$WordsByShape{$cannon}};
 225 
 226         # unshift @W, $partial;
 227         #unshift  @W, $cannon;
 228     }
 229 
 230     # print "<<<@W>>>";
 231     # unshift %Xlate@W, $partial;
 232     #unshift  @W, $cannon;
 233     return wantarray ? @W : "@W";
 234 }
 235 
 236 if ($words)
 237 {
 238     print "\n -W : show candidate words";
 239 
 240     my @Words = split /[\W]/, $all;
 241     my $n = 10;
 242     $Count{$_}++ for @Words;
 243 
 244     for my $w (
 245         sort { +1 * ($Count{$b} <=> $Count{$a}) }
 246         grep {
 247             /\w/
 248 
 249               #  ... and $Count{$_}>1
 250         } keys %Count
 251       )
 252     {
 253         my $tmp = $w;
 254         eval "\$tmp =~ tr/$XYZ/$abc/";
 255         die $@ if $@;
 256         next unless $tmp =~ /[A-Z]/;    ## still in play? ignore solved
 257         next unless $tmp =~ /[a-z]/     ## any info?
 258           or length($tmp) <= 2;         # or trivial
 259         my @W = wordsLike($tmp);
 260         next if @W > 30;
 261 
 262         local $" = q',';
 263 
 264         printf "%12s '%12s' %-6d : %-40s\n", $w, $tmp, $Count{$w},
 265           $tmp =~ /[A-Z]/ ? substr("@W", 0, 40) : "";
 266 
 267         last if $n-- == 0;
 268     }
 269 }
 270 

Simple frequency counts

The frequency in decreasing order should approximate

   ETAOIN SHRDLU

which are famous as columns one and two on the Linotype(tm) keyboard. But note *approximately*. And this and other statistical comments assume a cleartext in unstilted English.

 282 
 283 %Count = ();
 284 
 285 my @Letters = split //, $all;
 286 $Count{$_}++ for @Letters;
 287 my $cut = 0;    # (max values %Count)/5;  # no cut if wrapping
 288 
 289 my $i = 0;
 290 printf "%s%s %s %-3d", ($i++ % 8 ? ' ' : "\n "), $_, $Xlate{$_} || q{ },
 291   $Count{$_},
 292   for sort { 1 * ($Count{$b} <=> $Count{$a}) }
 293   grep { /\w/ and $Count{$_} > $cut } keys %Count;
 294 printf "\n";
 295 

-X Digraph or pairs or 'contact' frequency table

First letters of pairs are down the left; second letters are along the top. Guessed cheartext as usual is lower case and makes a broken copy of the edge keys.

Zero counts are suppressed for clarity. Zeroes on the main Diagonal are plotted as '/'. Double letters in the ciphertext have a number on the main diagonal, so look for gaps in the line of /s for the telltale doubles.

Pairs like '_X' and 'X_' are cipher X as respectively first letter (follows a space or other nonword) and last letter of a word (precedes a space or punctuation). Clear 't' should be popular both as '_t' and 't_' and 'th' but not 'ht'. Initial '_e' is less common than terminal 'e_'. Both 'tt' and 'ee' may occur.

 306 
 307 %Count = ();
 308 my $temp = " $all ";
 309 
 310 if ($pairs)
 311 {
 312     print "\n-X Digraph or pairs or 'contact' frequency table";
 313     $temp =~ tr/A-Z/_/cs;
 314 
 315     # print $temp;
 316     # print length($temp);
 317     $Count{$_}++ for ($temp =~ m/(..)/g);
 318     $temp =~ s/^.//;
 319     $Count{$_}++ for ($temp =~ m/(..)/g);
 320     my $nn;
 321     $nn += $Count{$_} for keys %Count;
 322 
 323     # print $nn;
 324     my @L = ('_', 'A' .. 'Z');
 325     print 'v> ', join('  ', @L);
 326     print '   ', join('  ', map { $Xlate{$_} || q[ ] } @L);
 327     for my $c (@L)
 328     {
 329         printf "%s%s|", $c, ($Xlate{$c} || q[ ]);
 330         printf "%2s ",
 331           (
 332             $Count{"$c$_"}
 333             ? (($Count{"$c$_"} > 99) ? '**' : $Count{"$c$_"})
 334             : ($c eq $_ ? q{ \\} : q{  })
 335           ) for @L;
 336         printf "\n";
 337     }
 338     printf "\n";
 339 }
 340 

print progress so far

Use the guesses to partially decode.

This uses the approved exception to the ``never use eval of a string, always of a block ... unless there's no other way''. TR does not do doublequote evaluation of its args as m//, s/// do.

 348 
 349 printf "\n";
 350 
 351 eval " tr/$XYZ/$abc/ " for ($all);
 352 die $@ if $@;
 353 
 354 if (0 && $cheat)
 355 {
 356     # simple, but -X, -W don't see it ... so moved earlier.
 357     $all =~ y(HZLGBUJIPFXVCTYKQDMW)
 358                           (teoaishrncmudygwbpfk);
 359 }
 360 print $all;
 361 

cryptogram

 365 
 366 __END__
 367 b czfbczc bh bu gqlvh hbxz hl uhgih vd xt lkp qrly. lpz hjbpy hjgh b jgaz
 368 plhbfzc bp xt hbxz qilkubpy hjz qrlyludjziz bu hjgh hjziz giz g kjlrz
 369 rlh lm hziibmbf ufbzpfz qrlyu lvh hjziz: ligf, djgitpyvrg, gzhblrlyt,
 370 evuh hl pgxz g mzk. qvh hjziz bu plh pzgirt ul xvfj lvh hjziz czcbfghzc
 371 hl xghj - gpc bp dgihbfvrgi hl hjz xbuvuz lm xghj. b hjbpw hjgh hjgh bu
 372 g cgxp ujgxz, qzfgvuz bp xt zodzibzpfz, lpz lm hjz xluh frzgi kgtu lm
 373 bczphbmtbpy g figfwdlh bu hjilvyj xghj. pl xghhzi hjz udzfbmbf uvqezfh,
 374 hjz figfwdlhu grkgtu zbhjzi galbc li ufizk vd hjz xghj. kjzhjzi bh'u
 375 hjz "xzifvit fgvuzu gvhbux" mlrwu, hjz azrbwlauwbgpu, fizghblpbuhu,
 376 grh-xzcbfbpz svgfwu, izdvqrbfgp dlrruhziu, li ufbzphlrlybuhu - tlv fgp
 377 grkgtu izflypbnz hjz figfwdlhu qt hjzbi xghj. ul b gx ylbpy hl cl xt
 378 qzuh hl dilabcz g albfz lm xghjzxghbfgr ugpbht - qlhj qt
 379 ujlkbpy kjgh'u kilpy kbhj hjz qgc xghj urld dvxdzc lvh qt hjz rllpbzu,
 380 gpc qt ujlkbpy jlk yllc xghj kliwu.
 381