words-count.pl

Code Index:

   1 #! perl -lw
   2 # prep aid to solving Chu-Carrol's cryptogram statistically
   3 
   4 

aid to solving cryptograms statistically

I was annoyed that a mathematician's introduction to Crytanalysis

http://scienceblogs.com/goodmath/2008/08/introducing_cryptanalysis.php

was in the worked example largely wordsmithing and inadequately statistical. While the crypto puzzle fans have tools that run on MSWindows, I don't by choice. So I hacked together the following. I added Getopt to suppress later enhancements by default for presenting this code to Boston.PM.

This script does word counts only, to see the common words.

Options and args

  30 
  31 use warnings;
  32 use strict;
  33 
  34 use List::Util qw/max/;
  35 use Getopt::Long;
  36 my ($verbose,  $file,  $dump,);
  37 GetOptions(
  38            "file=s"  => \$file,       # solve a different example
  39            "verbose" => \$verbose,    # flag
  40            "dump"    => \$dump,    # flag
  41           );
  42 
  43 
  44 

convention

since lower case is easier to read as natural language and upper case is easier to see non-liguistic patterns, it is conventional to write the enciphered text in upper case, eg WSSHU FDVH and clear text in all hower case as in this paragraph. (in military circles, red pencil or ink is used for cleartext, if only to ensure worksheets with clear is burned.)

  51 
  52 
  53 
  54 # printf "%s=>'%s', ",$_,$Xlate{$_} for sort keys %Xlate ;
  55 
  56 
  57 
  58 

load...

The Chu-Carroll cryptogram is appended to the source so read from data.

  64 
  65 # slurp ... must be in {} or after dict load
  66 
  67 my $all = do { local $/; <DATA>; };
  68 
  69 if (@ARGV)
  70 {
  71     local $/;
  72     $all = <>;
  73 }
  74 
  75 # enforce convention
  76 $all =~ y/a-z/A-Z/;
  77 
  78 
  79 if ($dump){
  80     print $all;
  81     exit;
  82 }
  83 
  84 
  85 
  86 my %Count;    # used repeatedly
  87 
  88 
  89 
  90 
  91 
  92 
  93 
  94 
  95     my @Words = split /[\W]/, $all;
  96     $Count{$_}++ for @Words;
  97 
  98 
  99 
 100     my $n=30;
 101     my $seen=0;
 102 
 103     # _score is length * count, except 20*count if length==1
 104     # because the words A, I, and plural 'S are very important
 105     sub _score {my $w = shift; 
 106                 return $Count{$w}*(1==length($w) ? 20 : length($w)); };
 107 
 108     for my $w (
 109         sort { +1 * ( _score($b) <=> _score($a) ) }
 110         grep {
 111             /\w/
 112 
 113               #  ... and $Count{$_}>1
 114         } keys %Count
 115       )
 116     {
 117         
 118 
 119         print sprintf("%-12s\t%3s\t%3s", 'word', 'Count', 'length' )
 120             unless $seen++;
 121         print sprintf("%-12s\t%3d\t%3d", $w, $Count{$w}, length($w) );
 122 
 123 
 124 
 125         last if $n-- == 0;
 126     }
 127 
 128 
 129 
 130 
 131 
 132 
 133 
 134 
 135 
 136 

cryptogram

 140 
 141 __END__
 142 b czfbczc bh bu gqlvh hbxz hl uhgih vd xt lkp qrly. lpz hjbpy hjgh b jgaz
 143 plhbfzc bp xt hbxz qilkubpy hjz qrlyludjziz bu hjgh hjziz giz g kjlrz
 144 rlh lm hziibmbf ufbzpfz qrlyu lvh hjziz: ligf, djgitpyvrg, gzhblrlyt,
 145 evuh hl pgxz g mzk. qvh hjziz bu plh pzgirt ul xvfj lvh hjziz czcbfghzc
 146 hl xghj - gpc bp dgihbfvrgi hl hjz xbuvuz lm xghj. b hjbpw hjgh hjgh bu
 147 g cgxp ujgxz, qzfgvuz bp xt zodzibzpfz, lpz lm hjz xluh frzgi kgtu lm
 148 bczphbmtbpy g figfwdlh bu hjilvyj xghj. pl xghhzi hjz udzfbmbf uvqezfh,
 149 hjz figfwdlhu grkgtu zbhjzi galbc li ufizk vd hjz xghj. kjzhjzi bh'u
 150 hjz "xzifvit fgvuzu gvhbux" mlrwu, hjz azrbwlauwbgpu, fizghblpbuhu,
 151 grh-xzcbfbpz svgfwu, izdvqrbfgp dlrruhziu, li ufbzphlrlybuhu - tlv fgp
 152 grkgtu izflypbnz hjz figfwdlhu qt hjzbi xghj. ul b gx ylbpy hl cl xt
 153 qzuh hl dilabcz g albfz lm xghjzxghbfgr ugpbht - qlhj qt
 154 ujlkbpy kjgh'u kilpy kbhj hjz qgc xghj urld dvxdzc lvh qt hjz rllpbzu,
 155 gpc qt ujlkbpy jlk yllc xghj kliwu.
 156