Chapter 4

Midterm Projects

4.1  Project #1: Cryptanalyst's Tool Kit

/Users/tld/email/book/intro/exercises/midterm/ciphers
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/check
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eymessage
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eyscript
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eytools
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/pretty
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/score
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/toolkit
/Users/tld/email/book/intro/exercises/midterm/ciphers/README

Build a tool kit (a collection of commands implemented as shell
scripts) for automating the process of finding secret codes.  Your
commands should partially automate the process of decrypting an
encrypted message given a corpus of unencrypted entries from Pepys
Diary.  The more automation you can manage the better.  Your commands
should exploit frequency analysis coupled with word matching using the
online list of words from the 2nd Edition of Webster's
Dictionary.  Note that not all words in Pepys Diary will be found in
Webster's Dictionary: some of the words are proper nouns, and Pepys
often used spellings that
are archaic or idiosyncratic.  In general, spelling was not as
standardized in the 17th century as it is today and authors would
frequently take poetic license in spelling a word; the same word often
appears with multiple spellings within the same text.  Even so, you can
often gain crucial hints by searching for commonly appearing words in
a text. For example, suppose you have a good idea of how the letters
't' and 'e' map in a substitution code; you might search for strings
that match the regular expression 't[a-z]e'.  You will be graded on
the code you write, how you document it, and how it works.  A TA will
sit down with you and watch you demonstrate your code on a target
corpus of plain text entries from Pepys Diary and a mystery entry
encrypted using a substitution code.
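As a starting point for the frequency-analysis step, a letter histogram can be computed with a short pipeline; the sketch below is one way to do it ('charhist' is a hypothetical helper name, and the input is assumed to have been reduced to lowercase letters):

```shell
# charhist - print a letter histogram of a file, most frequent first;
# assumes the interesting characters are the lowercase letters
charhist() {
  tr -cd 'a-z' < "$1" |   # keep only lowercase letters
  fold -w 1 |             # one character per line
  sort | uniq -c |        # count occurrences of each letter
  sort -rn                # most frequent first
}
```

Comparing this histogram against letter frequencies computed over the plain-text corpus is the classical first move against a substitution code.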

The exercises related to Ciphers and Secrets are available online

http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2

and also in the form of supplementary tips and sample scripts

 ../../04/09/20/exercises.txt
 ../../04/09/20/bin/README

If you're working on CS department machines, either in the lab or
remotely using 'ssh', you can set your 'path' variable to include the
Ciphers and Secrets 'bin' directory as follows:

% set path = ( ~tld/tmp/exercises/04/09/20/bin $path )
% which wordhist
/u/tld/tmp/exercises/04/09/20/bin/wordhist

You might also want to set a variable to provide a shortcut to 
the directory containing the excerpts from the Pepys diary:

% set pepys = ~tld/tmp/exercises/04/09/20/bin/pepys
% ls $pepys
1660-01-05.txt  1660-02-04.txt  1661-06-03.txt  1662-01-22.txt  README
1660-01-09.txt  1660-02-12.txt  1662-01-06.txt  1662-02-18.txt
1660-02-01.txt  1660-06-22.txt  1662-01-19.txt  1663-09-04.txt

If you're using a shell on your personal computer and not logged on
remotely, you can simply copy the files to your computer.  However, I
recommend that you use department machines whenever possible both to
ensure that your files are backed up and to avoid discrepancies
between the environment on your computer and the environment on the
department machines.

In building on the Ciphers and Secrets exercise you'll probably want
to exploit the information available in the online dictionary
/usr/share/dict/words.  I think you'll find that you can make lots of
useful inferences about the relative value of various attempts to
decrypt a message by using 'grep' and 'sed' along with the dictionary
data.  For example,

% set dictionary = "/usr/share/dict/words"
% grep "\<[a-z]\{3\}\>" $dictionary | wc -l
1134

counts the number of three-letter words in the dictionary.  The
special characters '\<' and '\>' match the beginning and the end of a
word; without these, 'grep' would return all words that have
at least three characters.  We can reduce the number of matches
by restricting the characters for each position:

% grep "\<[at][a-z]e\>" $dictionary | wc -l
21
% grep "\<a[a-z]e\>" $dictionary | wc -l
14

The first counts the number of three-letter words that begin with the
letters 'a' or 't' and end with the letter 'e'; the second counts the
number of three-letter words that begin with the letter 'a' and end
with the letter 'e'.  Note that you can build a regular expression from
bits and pieces of other regular expressions and then "feed" the result 
to 'grep' using a variable:

% set beg = 'a'
% set end = 'e'
% set num = 1
% set regexp = "\<${beg}[a-z]\{$num\}$end\>"
% grep "$regexp" $dictionary | wc -l
14
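Since each line of the dictionary file is a single word, anchoring a pattern with '^' and '$' works like the word-boundary markers above.  Building on that, one might loop over several candidate letters for a position and compare the match counts; the sketch below assumes a hypothetical helper named 'count_candidates' whose first argument is the dictionary file:

```shell
# count_candidates - for each candidate first letter, count the
# three-letter dictionary words that end in 'e'
count_candidates() {
  dictionary=$1; shift
  for beg in "$@"; do
    # '^' and '$' anchor the whole line, i.e. the whole word
    printf '%s %s\n' "$beg" "$(grep -c "^${beg}[a-z]e$" "$dictionary")"
  done
}
```

Running 'count_candidates /usr/share/dict/words a t s' would print one count per candidate letter, and the letter with the fewest matches constrains the decryption the most.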

Note that we had to be careful with variable substitutions.  Without
the curly brackets around 'beg', the shell would have tried to treat
'beg' as an array (list) and look up the 'a-z' elements.  But we 
didn't intend $beg[a-z] as an array reference.  The curly brackets
shield the variable reference from the surrounding text. 

% set regexp = "\<$beg[a-z]\{$num\}$end\>"
tcsh: Missing -.
% set regexp = "\<${beg}[a-z]\{$num\}$end\>"
% echo "$regexp"
\<a[a-z]\{1\}e\>

The fewer the matches, the less ambiguous the partially specified
character sequence.  If there are no matches, however, then presumably
the decrypted sequence is either incorrect or it is correct but the
word is simply not present in the dictionary.

If you assume that the original text has all punctuation and other
non-alphabetic characters removed and all remaining characters
converted to lowercase, then there is a simple method for keeping
track of partial substitutions: you simply make your substitutions in
uppercase characters, and so, at any point, the characters remaining
to be decrypted are exactly the lowercase letters.  To evaluate a
partially specified substitution code, one technique would be to
reward those partially decoded strings that match exactly one word in
the dictionary and penalize those that match many words or that match
none.


Implementation Tips

Suppose that you want to grab the most frequently occurring word in a
text file.  You know that it will be present on the first line of a
word histogram, but you need to somehow select the first line of the
histogram and then select the word from that line.  The 'head' command,
with an argument '-n' where 'n' is a positive integer, prints out the
first 'n' lines in a file ('tail -n' prints out the last 'n' lines in a
file).  We can use the 'head' command to select the first line in the
word histogram as follows:

% set text = $pepys/1663-09-04.txt
% wordhist $text | head -1 
  48 and

To get the 'n'th line of a file, we could use 'head -n | tail -1':

% set n = 8
% wordhist $text | head -$n | tail -1
  10 with

But the following short 'sed' script is more efficient.  Note that in
the 'sed' command the '-n' is a literal option flag, not a reference to
our variable 'n'; it turns off the default behavior of echoing every
line to the standard output.  We then use the 'p' directive to print
just the 'n'th line.

% wordhist $text | sed -n $n'p' 
  10 with

Now we can use 'sed' to grab just the word from a histogram ignoring
its frequency:

% wordhist $text | sed -n $n'p' | sed 's/[^a-z]*\([a-z]*\)/\1/'
with

Defining a convenient alias will save some typing:

% alias grabnth "sed -n \!^'p' | sed 's/[^a-z]*\([a-z]*\)/\1/'"
% wordhist $text | grabnth 8
with


Checking Substitution Codes

Suppose that the most commonly appearing word in a partially decrypted
text is 'TzE' and the most commonly appearing word in the entire
corpus is 'the'; it would be useful to have a script to determine if a
match is possible, and, if so, return the partial substitution
that would make it so.  A match is possible just in case the uppercase
characters in the partially decrypted word match, ignoring case, the
corresponding characters in the word from the unencrypted
corpus.  Given 'TzE' and 'the', this script would determine that a
match is possible and return the substitution pair "z" "H".  Here's
the description of one possible algorithm for accomplishing this:

let 'crypt' be the list of characters in a word from the encrypted text 
let 'plain' be the list of characters in a word from the unencrypted text
let 'in' and 'out' be strings that are initially empty 
set 'flag' to be 1
if 'crypt' and 'plain' are not the same length then 
  set 'flag' to be 0
if they are the same length then 
  set 'i' to be 1
  while 'i' is not greater than the length of 'crypt'
    if the 'i'th character of 'crypt' is uppercase then
      if it doesn't match, ignoring case, the 'i'th character of 'plain' then
        set 'flag' to be 0
    else (the 'i'th character of 'crypt' is lowercase)
      add the 'i'th character of 'crypt' to 'in'
      add the uppercase version of the 'i'th character of 'plain' to 'out'
    set 'i' to be 'i' + 1
if the 'flag' is 1 then
  return 'in' and 'out' as the result
else 
  return 'flag' as the result
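The algorithm above can be sketched as a small shell command built around 'awk' ('check' here is a hypothetical name, and note that duplicate or conflicting mappings within a single word are not detected):

```shell
# check - given a partially decrypted word and a candidate plaintext
# word, print the substitution pair that would make them match, or
# print 0 if no match is possible
check() {
  awk -v crypt="$1" -v plain="$2" 'BEGIN {
    if (length(crypt) != length(plain)) { print 0; exit }
    ins = ""; outs = ""
    for (i = 1; i <= length(crypt); i++) {
      c = substr(crypt, i, 1); p = substr(plain, i, 1)
      if (c ~ /[A-Z]/) {
        # already-decrypted character: must agree, ignoring case
        if (tolower(c) != p) { print 0; exit }
      } else {
        # undecrypted character: record the implied substitution
        ins = ins c; outs = outs toupper(p)
      }
    }
    print ins, outs
  }'
}
```

With this sketch, 'check TzE the' prints "z H", while 'check TzE and' prints 0.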


Scoring Substitution Codes

In order to guide the search for a substitution code that correctly
decodes the encrypted file, we need to measure the quality of a
partial substitution.  We can then use this measure to extend a given
substitution code that handles a subset of characters to handle
additional characters.  Here we use the technique described earlier of
rewarding those partially decoded strings that match exactly one word
in the dictionary and penalizing those that match many words or that
match none.

Here we assume that each partially decrypted word consists of
lowercase characters corresponding to characters that have yet to be
translated and uppercase characters corresponding to those that have
already been translated (though perhaps incorrectly).

let 'sample' be a list of words taken from a partially decrypted file
let 'dictionary' be a list of the words in a suitable dictionary
let 'reward' be the reward for matching exactly one word
let 'penalty' be the penalty for failing to match any word
let 'threshold' be the threshold for penalizing ambiguous matches
let 'threshold_penalty' be the penalty for ambiguous matches
set 'score' to be 0
for each 'word' in the 'sample'
  transform the 'word' into a regular expression (RE) as follows:
  replace each lowercase character with the regular expression '[a-z]'
  replace each uppercase character with its lowercase counterpart
  set 'count' to be the number of words in 'dictionary' that match the RE
  if 'count' is equal to zero then
    set 'score' to be 'score' minus 'penalty'
  else if 'count' is greater than 'threshold' then
    set 'score' to be 'score' minus 'threshold_penalty'
  else if 'count' is equal to one then
    set 'score' to be 'score' plus 'reward'
return 'score' as the result
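A rough shell rendering of this loop uses the regular-expression trick described later and 'grep -c' against the dictionary; the name 'scoresample' and the weights below are placeholders you would tune:

```shell
# scoresample - score a list of partially decrypted words against a
# dictionary file (first argument); the weights below are arbitrary
scoresample() {
  dict=$1; shift
  reward=10 penalty=5 threshold=20 threshold_penalty=2
  score=0
  for word in "$@"; do
    # lowercase the decoded (uppercase) characters and turn each
    # undecoded (lowercase) character into the wildcard [a-z]
    re=$(echo "$word" | sed 's/[a-z]/[a-z]/g' | tr 'A-Z' 'a-z')
    count=$(grep -c "^$re$" "$dict")
    if [ "$count" -eq 0 ]; then
      score=$((score - penalty))
    elif [ "$count" -gt "$threshold" ]; then
      score=$((score - threshold_penalty))
    elif [ "$count" -eq 1 ]; then
      score=$((score + reward))
    fi
  done
  echo $score
}
```

A word matching exactly one dictionary entry raises the score; a word matching nothing, or too many things, lowers it.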


Controlling Filename Substitution 

As we build a regular expression from a partially decrypted word to
use in matching against words in the dictionary, we have to foil the
shell's attempts to match regular expressions against files.  If a
word contains any of the characters '*', '?', '[' or '{' or begins
with the character '~', then that word is a candidate for 'filename
substitution', also known as 'globbing'.  In this case the word is
regarded as a pattern, and replaced with an alphabetically sorted list
of file names that match the pattern.

% ls
decrypt 
% echo de[a-z]* 
decrypt

If the shell fails to find a match, normally it signals an error. 

% echo dc[a-z]* 
tcsh: echo: No match.

We can inhibit filename substitution by enclosing the regular
expression in double quotes.

% echo "dc[a-z]*"
dc[a-z]*

We could use this technique in implementing our scoring algorithm as
follows:

% set word = "THeRe"
% set dictionary = "/usr/share/dict/words"
% set word = "`echo "$word" | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`"
% echo "$word"
th[a-z]r[a-z]

The additional layer of quotes is a bit inelegant, however.  If all we
want to do is prevent the shell from signaling an error and aborting
the execution of the current command, we can set the C-shell variable
'nonomatch' as follows:

% set nonomatch
% set word = THeRe
% set word = `echo $word | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`
% echo $word
th[a-z]r[a-z]

This works in the present circumstances because there are no matching
filenames and hence no substitutions.  And sometimes this is exactly
what we want:

% touch file.tmp
% if ( -e *.tmp ) echo yes
yes
% rm -f file.tmp
% if ( -e *.tmp ) echo yes
tcsh: *.tmp: No match.
% set nonomatch
% if ( -e *.tmp ) echo yes
%

For our present purposes, however, we want to prevent filename
substitutions altogether and this can be accomplished by setting
the C-shell variable 'noglob' as shown below:

% set noglob
% set word = THeRe
% set word = `echo $word | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`
% echo $word
th[a-z]r[a-z]


Managing a Queue of Substitutions

Note that even if we choose the best extension for a partial
substitution code according to our scoring method, it is possible that
we end up with an incorrect substitution code.  If we discover this,
say by finding that we can make no further progress in decoding the
encrypted file, then we may want to 'backtrack' to a previous partial
substitution code.  To facilitate incremental substitutions and
backtracking, we might maintain a queue of partial substitutions
called 'qsub' implemented as a list in which the first item on the
list is the last partial substitution made and the last item on the
list is the first partial substitution made; the queue is kept in
LIFO (last in, first out) order, so strictly speaking it behaves as a
stack.  When we extend a partial
substitution, we use 'tr' to make the relevant substitutions and then
'push' the extension on the front of the queue.  To backtrack, we
'pop' the first extension off the queue and use 'tr' to reverse the
effect of making the substitutions.  Since a partial substitution is
represented as a pair of strings to be used as arguments to 'tr',
we're actually pushing and popping pairs of strings.

We'll initialize the queue to be empty and set 'word' to be a 
partially decoded word.  

% set qsub = ( )
% set word = THzxz

Suppose that we have a candidate extension: "z" translates to "E";
perform the substitution and then push the extension on the queue:

% set wsub = ( z E )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THExE

Now suppose we have another extension: "x" translates to "S":

% set wsub = ( x S )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THESE

Then suppose we determine that the last extension was wrong and we
need to backtrack; pop the last extension off the queue and then
reverse the translation to restore the partially decrypted text to
what it was prior to the last extension:

% set wsub = ( $qsub[1-2] ) ; shift qsub ; shift qsub
% set word = `echo $word | tr "$wsub[2]" "$wsub[1]"`
% echo $word
THExE

We can then try an alternative extension; pushing it on the queue and
performing the translation:

% set wsub = ( x R )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THERE

This technique accomplishes our objective of keeping track of partial
substitutions, but it's tedious and hence prone to human error.
However, we can define a few aliases to reduce the amount of typing
and avoid the most obvious sources of error.

alias voidsub 'set qsub = ( )'
alias pushsub 'set qsub = ( \!* $qsub ) ; \
               set text = `echo $text | tr "$qsub[1]" "$qsub[2]"`'
alias popsub  'set text = `echo $text | tr "$qsub[2]" "$qsub[1]"` ; \
               shift qsub ; shift qsub' 
alias showsub 'echo $text'

You can put these in your .cshrc file in your home directory if you
want them to be available whenever you start up a shell, or you can put
them in a file called, say, 'aliases' in an appropriate directory and
load them whenever you like by typing 'source aliases' to the shell.
Here is a simple demonstration making use of the above aliases:

% voidsub
% set text = "THzxz Axz zxxOxS"
% pushsub z E
% showsub
THExE AxE ExxOxS
% pushsub x R
% showsub
THERE ARE ERRORS
% popsub
% showsub
THExE AxE ExxOxS
% popsub
% showsub
THzxz Axz zxxOxS
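For comparison, here is a rough POSIX-shell rendering of the same push/pop discipline.  The names mirror the aliases above, but these are ordinary shell functions and this is only a sketch: the queue is a flat, space-separated string of pairs, and there is no underflow check on 'popsub':

```shell
# the most recent substitution is always the first pair in 'qsub'
qsub=""
text="THzxz Axz zxxOxS"

pushsub() {
  qsub="$1 $2 $qsub"                 # push the pair on the front
  text=$(echo "$text" | tr "$1" "$2")
}

popsub() {
  # split the queue back into positional parameters
  set -- $qsub
  text=$(echo "$text" | tr "$2" "$1")  # reverse the translation
  shift 2
  qsub="$*"
}
```

After 'pushsub z E' and 'pushsub x R', the variable 'text' holds "THERE ARE ERRORS"; two calls to 'popsub' restore the original string.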


Using the Cryptanalyst's Tool Kit

One way of building a Cryptanalyst's Tool Kit is to write a set of
scripts to handle key algorithms like the scoring and testing
procedures that I described above and then customize your shell
environment by defining variables to keep track of basic information
(like the queue of partial substitutions) and aliases to handle
commonly performed functions (like pushing and popping partial
substitutions).  The result might look something like:

% source toolkit            <<< load the definitions for the tool kit
% mysecret                  <<< randomly select and encrypt a file
% secret_charhist 3         <<< top 3 lines of the character histogram
 554 z
 285 e
 246 i
% secret_charfreq ' '       <<< check on the frequency of spaces
     114
% pushsub 'z ' ' @'         <<< guess identity of the space character
% secret_charfreq @         <<< sanity check on the substitution
     114
% secret_wordhist 5         <<< top 5 lines of secret word histogram 
  27 iye
  26 phb
  23 ic
  19 cs
  19 iypi
% corpus_wordhist 5         <<< top 5 lines of corpus word histogram 
 319 and
 231 to
 200 the
 131 i
  99 of
% check iye and             <<< check to see if the match is possible
iye AND
% pushsub iye AND           <<< go ahead and try the most obvious 
% scoresub                  <<< this score isn't very promising 
-34
% pushsub phb THE           <<< we'll go with it a bit further
% scoresub                  <<< the score is getting worse not better
-52
% showtext 4                <<< let's take a look at the partial decryption

     scl ANDtD A@c cl ANlDD ETdt j NTkD uDDH mwnN AlcwuaDE @jAN ANcwgNAt Nc@ Ac 
     gDA mcHDd Ac rTd ANDm ANTA j NTkD ucllc@DE mcHDd cs ud lDTtcH cs md mcHDd 
     uDjHg jH md wHnaDt NTHEt j lctD DTlad ANjt mclHjHg THE accoDE ckDl THE 

% popsub                    <<< this isn't working out; let's bail
% popsub
% pushsub phbiye ANDTHE     <<< reverse the codes for AND and THE
% scoresub                  <<< ahhh! this is looking much better
-22
% showtext 4                <<< let's take another look at the decryption

     scl THEtE T@c cl THlEE DAdt j HAkE uEEN mwnH TlcwuaED @jTH THcwgHTt Hc@ Tc 
     gET mcNEd Tc rAd THEm THAT j HAkE ucllc@ED mcNEd cs ud lEAtcN cs md mcNEd 
     uEjNg jN md wNnaEt HANDt j lctE EAlad THjt mclNjNg AND accoED ckEl AND 

% pushsub ics TOF           <<< let's go with the next two obvious matches
% scoresub                  <<< the score is looking more promising
21
% showtext 4

     FOl THEtE T@O Ol THlEE DAdt j HAkE uEEN mwnH TlOwuaED @jTH THOwgHTt HO@ TO 
     gET mONEd TO rAd THEm THAT j HAkE uOllO@ED mONEd OF ud lEAtON OF md mONEd 
     uEjNg jN md wNnaEt HANDt j lOtE EAlad THjt mOlNjNg AND aOOoED OkEl AND 

% pushsub jtlku ISRVB       <<< now it's getting pretty obvious what to do
% showtext 4

     FOR THESE T@O OR THREE DAdS I HAVE BEEN mwnH TROwBaED @ITH THOwgHTS HO@ TO 
     gET mONEd TO rAd THEm THAT I HAVE BORRO@ED mONEd OF Bd REASON OF md mONEd 
     BEINg IN md wNnaES HANDS I ROSE EARad THIS mORNINg AND aOOoED OVER AND 

4.2  Project #2: Dynamic Web Content

/Users/tld/email/book/intro/exercises/midterm/dynamic
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/allthumbs
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/hclean
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/rosebud
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/thumbnail
/Users/tld/email/book/intro/exercises/midterm/dynamic/images
/Users/tld/email/book/intro/exercises/midterm/dynamic/images/adirondack.gif
/Users/tld/email/book/intro/exercises/midterm/dynamic/index.html
/Users/tld/email/book/intro/exercises/midterm/dynamic/README

Write a script that takes two arguments: a string specifying a
directory (it can be an absolute or relative path name) and an integer
specifying the number of columns in a table that you will construct.
Where appropriate create separate commands for handling subtasks of
the full task.  The directory should contain a set of images in JPEG
format with the extension 'jpg'.  Your script will construct an HTML
document (essentially a web-based photo album) containing a table in
which thumbnail versions of the images in the specified directory are
displayed.  You are to use the ImageMagick library to create the
thumbnail-sized images so that the width of the table is no wider than
the width of a typical browser window.  The thumbnails will serve as
anchor images for hyperlinks that will allow a user to click on a
thumbnail to view the corresponding full-sized image.  You'll be given
an example of what the HTML output should look like; no prior
experience working with HTML is required.  As a variant, you can
assume that the images are
album covers or pictures of recording artists and link the thumbnails
to short audio clips extracted from MP3 files using the SoundExchange
library.  You will be graded on the code, how you document it and how
it works.  You will be required to demonstrate your code to a TA and
explain how it works.

For an example of the sort of code that your script will be
producing in this exercise, take a look at:

  http://www.cs.brown.edu/~tld/note/prospect/index.htm

  which is linked off of:

  http://www.cs.brown.edu/~tld/note/ri.htm

  Note that the first web page is more complicated than your 
  program need produce.  In particular, you don't have to 
  reproduce the navigation arrows that appear at the top of 
  these pages, unless, of course, you want to focus on this 
  aspect of the project.  See ./index.html for a somewhat 
  cleaned up version of a web page that is similar to what 
  you might provide in your project.  See below for a
  description of the HTML tags that are found in these files.


New HTML tags that you'll be using in the Photo Album Project:

Image tag 'img':

  An image tag is used to specify a JPEG or GIF file
  to display.  The 'src' is used to specify the location
  of the image file either relative to the containing 
  page, as is done here, or as a full URL:

  <img src="thumbnails/Prospect-01.jpg" border="0">

  Note that the image tag does not have a closing tag.
  The 'border' option is similar to the option by the
  same name for the 'table' tag.

Anchor tag 'a':

  The anchor tag is used to set off the anchor image or text for 
  a hyperlink.  The 'href' (hypertext reference) option is used to
  specify a destination URL such that if a user clicks the image 
  or text, the browser attempts to load the destination URL.  The 
  'href' can be specified relative to the containing page as in:

  <a href="pages/Prospect-01.htm">

  or it can be specified as a full URL as in:

  <a href="http://www.cs.brown.edu/~tld/note/tour.html">

  An image can be used to anchor the link as in:

  <a href="pages/Prospect-01.htm">
  <img src="thumbnails/Prospect-01.jpg" border="0">
  </a>

  or alternatively text can be used to anchor the link:

  <a href="pages/Prospect-01.htm">Prospect Street</a> 


Comment tag '!--':

  Comments are not displayed by the browser; they are 
  present only to help document the HTML code:

  <!-- Thumbnails with hyperlinks -->

When you put these all together to create a table representing
a photo album you get an HTML document that looks like:

  % cat index.html
  <html>
  <body>
  <table>
  <tr>
    <td>
       <a href="pages/page.1.1.jpg">
       <img src="thumbnails/thumb.1.1.jpg">
       </a>
    </td>
    <td>
       <a href="pages/page.1.2.jpg">
       <img src="thumbnails/thumb.1.2.jpg">
       </a>
    </td>
  </tr>
  <tr>
    <td>
       <a href="pages/page.2.1.jpg">
       <img src="thumbnails/thumb.2.1.jpg">
       </a>
    </td>
    <td>
       <a href="pages/page.2.2.jpg">
       <img src="thumbnails/thumb.2.2.jpg">
       </a>
    </td>
   </tr>
  </table>
  </body>
  </html>

  for a 2 X 2 table, where the thumbnails are just smaller 
  versions of the original photos and the pages are simple
  HTML documents designed to display a single image, e.g. 

  % cat pages/page.1.1.jpg
  <html>
  <body>
  <center>
   <img src="../originals/photo.1.1.jpg">
  </center>
  </body>
  </html>

  and the relevant documents are organized as follows:

  ./index.html
  ./originals/photo.1.1.jpg
              photo.1.2.jpg 
              ...
  ./thumbnails/thumb.1.1.jpg
               thumb.1.2.jpg 
              ...
  ./pages/page.1.1.jpg
          page.1.2.jpg 
          ...

  Your shell script starts with just the photos in ./originals/.
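The overall structure of the table generator might look like the sketch below.  The name 'makealbum' is hypothetical; it assumes the thumbnails already exist in ./thumbnails/ and that a page pages/<name>.htm will be generated separately for each photo (here the pages are given an '.htm' extension for illustration):

```shell
# makealbum - write an HTML photo-album table to standard output;
# the first argument is the number of columns
makealbum() {
  cols=$1
  i=0
  echo "<html>"
  echo "<body>"
  echo "<table>"
  for thumb in thumbnails/*.jpg; do
    # open a new row at the start of each group of 'cols' cells
    if [ $((i % cols)) -eq 0 ]; then echo "<tr>"; fi
    base=$(basename "$thumb" .jpg)
    echo "  <td><a href=\"pages/$base.htm\"><img src=\"$thumb\"></a></td>"
    i=$((i + 1))
    # close the row once it has 'cols' cells
    if [ $((i % cols)) -eq 0 ]; then echo "</tr>"; fi
  done
  # close a final, partially filled row
  if [ $((i % cols)) -ne 0 ]; then echo "</tr>"; fi
  echo "</table>"
  echo "</body>"
  echo "</html>"
}
```

You would invoke it as 'makealbum 2 > index.html'; the modular-arithmetic test is what turns a flat list of images into rows of the requested width.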


Creating Thumbnail Images

Here's a shell script that creates a thumbnail image using ImageMagick.

% cat ./bin/thumbnail
#!/bin/csh
# thumbnail - width image name 

# 'width' is the width of the requested thumbnail image in pixels
set thumb_width = $1

# 'image' is the name of the original image file in any standard format
set image = $2

# get rid of any old temporary files that happen to be around
if ( -e tmp.miff ) rm -f tmp.miff

# convert the input image to ImageMagick internal format 
convert $image tmp.miff

# 'name' is the basename (without any extension) of the thumbnail
set thumb_name = $3

# determine the size of the original image file 
set size = ( `identify -format "%w %h" tmp.miff` )
set image_width = $size[1]
set image_height = $size[2]

# given that 'thumb_width' = 'image_width' * 'scale_factor'
# then we have 'scale_factor' = 'thumb_width' / 'image_width'
set scale_factor = `echo "$thumb_width / $image_width" | bc -l`

# and therefore 'thumb_height' = 'scale_factor' * 'image_height';
# dividing by 1 with 'scale = 0' truncates the result to an integer
set thumb_height = `echo "scale = 0 ; $scale_factor * $image_height / 1" | bc -l`

# make imagemagick carry out the necessary conversion; note that the
# geometry argument must be a single word with no spaces around the 'x'
convert -geometry ${thumb_width}x${thumb_height} tmp.miff $thumb_name.jpg

# finally clean up any temporary files left around
if ( -e tmp.miff ) rm -f tmp.miff


And here's an example demonstrating its use:

% ls
tom.jpg
% thumbnail 100 tom.jpg thumb
% ls 
thumb.jpg tom.jpg

4.3  Project #3: Account Management System

/Users/tld/email/book/intro/exercises/midterm/account
/Users/tld/email/book/intro/exercises/midterm/account/bin
/Users/tld/email/book/intro/exercises/midterm/account/bin/insert
/Users/tld/email/book/intro/exercises/midterm/account/bin/pretty
/Users/tld/email/book/intro/exercises/midterm/account/bin/restore
/Users/tld/email/book/intro/exercises/midterm/account/bin/select
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/myjoin
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/mypretty
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/myselect
/Users/tld/email/book/intro/exercises/midterm/account/databases
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/sale
/Users/tld/email/book/intro/exercises/midterm/account/README

Build an account-management system by implementing basic database
operations using the shell commands 'awk', 'join', 'grep' (or
'egrep'), 'sed', and 'sort'.  In your database, there will be an
'account' record for each customer with fields for a unique
identifier, customer name, billing address, etc.  There should also be
a 'sale' record for each sales transaction with fields for a unique
identifier, account to bill, amount of sale, date of transaction and
description of goods purchased.  Finally, there will be a 'payment'
record for each payment received with the amount received, account to
credit, date of receipt, etc.  Each record should have a unique
identifier which is commonly referred to as the 'primary key' for the
record type.  You should implement a set of commands that mimic basic
SQL database commands and simplify inserting, extracting and
formatting financial data.  Use your system to generate an
end-of-the-month financial report that shows the balance (sales minus
payments) for each customer, total sales, total payments and the
end-of-the-month balance.  You will be graded on how the code is
organized and written,
how you document it and how well it works.  You will be required to
demonstrate your code to a TA and explain how it works.
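For the report itself, the per-customer balance step can be sketched with a single 'awk' pass over the two record files.  The field positions here are assumptions (the account in the second ';'-separated field, the amount in the third), and 'balances' is a hypothetical name; adjust both to your own tables:

```shell
# balances - print 'account;balance' (sales minus payments) for each
# account that appears in the sale table; arguments are the sale
# records file and the payment records file
balances() {
  awk -F';' '
    NR == FNR { sales[$2] += $3; next }   # first file: sale records
    { payments[$2] += $3 }                # second file: payment records
    END {
      for (a in sales)
        print a ";" sales[a] - payments[a]
    }
  ' "$1" "$2" | sort
}
```

Note that accounts with payments but no sales are not reported by this sketch; a fuller version would merge the key sets of both arrays.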


Specifying Fields and Tables by Name

In augmenting the database system that we looked at in 

http://www.cs.brown.edu/~tld/talk/home-Z-H-5.html#node_sec_3.1.2

you'll probably want to make it possible to refer to the individual
fields of records by name.  Here are three different techniques for
managing this:

  i. keep field names and record data in a single file and assume that
     all commands are aware that the first line of a file representing
     a database table consists of the names of the fields:

     % cat employee
     first;last;age;gender
     Freddie;Jones;35;M
     Sally;Smith;27;F

 ii. keep two separate files, one for each type of information:

     % cat employee.fields
     first;last;age;gender

     % cat employee.records
     Freddie;Jones;35;M
     Sally;Smith;27;F

iii. similar to (ii) but create a directory with the same name
     as the record table that contains files for each type of
     information:

     % cat employee/fields
     first;last;age;gender

     % cat employee/records
     Freddie;Jones;35;M
     Sally;Smith;27;F
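
Whichever technique you choose, your commands will need to get at the
two pieces separately.  Under technique (i) that means peeling off the
first line; a sketch using 'head' and 'tail' (note that modern GNU
'tail' spells the start-at-line-2 option '-n +2'):

```shell
# Rebuild the technique-(i) employee table from above.
printf 'first;last;age;gender\nFreddie;Jones;35;M\nSally;Smith;27;F\n' > employee

head -1 employee        # just the field names
tail -n +2 employee     # just the records (older systems: 'tail +2')
```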


Using Field Names to Interactively Insert Data

Here's a script illustrating how you might interactively add a record
to a database table.  This version assumes that each field name is a
single word.  If you wish to have multi-word field names, separate
words with the underscore character as in 'business_address'.  You may
want to rewrite this script to display the completed record and then
prompt the user to review the record and respond (yes or no) whether
or not the record is actually added to the data file for the
corresponding table.  This script assumes the second technique (above)
for keeping track of field names.

% ls account
fields records
% cat account/fields
first;last;age;gender
% cat account/records
Sally;Smith;26;F

Here's the script:

% cat insert
#!/bin/csh
# the first and only argument specifies a table
set table = $argv[1]
# grab the field names for the specified table
set fields = ( `cat $table/fields | sed 's/;/ /g'` )
# create an empty record 
set record = ""
# build the record field by field 
foreach field ( $fields )
  # prompt the user for the value of a field
  printf "enter $field = "
  # append the entered value to the record
  set record = "$record;$<"
end
# strip off the leading semicolon and append 
# the completed record to the table 
echo $record | sed 's/;//' >> $table/records


Here is an example illustrating its use; in this exchange the 
script prints the prompt, e.g., 'enter first = ', and the user 
types the response, e.g., 'Freddy':

% insert account
enter first = Freddy
enter last = Jones
enter age = 21
enter gender = M
% cat account/records
Sally;Smith;26;F
Freddy;Jones;21;M
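
The suggested review step might look like the following sketch,
written in POSIX sh rather than csh; the script name 'insert.sh' and
the canned test input are made up.  The completed record is echoed
back and appended only if the user answers 'yes':

```shell
cat > insert.sh <<'EOF'
#!/bin/sh
# usage: insert.sh table -- prompt for each field, confirm, then append
table=$1
record=""
for field in $(sed 's/;/ /g' "$table"/fields); do
  printf "enter %s = " "$field"
  read value
  record="$record;$value"
done
record=${record#;}                       # strip the leading semicolon
printf "add record '%s'? (yes or no) " "$record"
read answer
[ "$answer" = yes ] && echo "$record" >> "$table"/records
EOF

# Drive it non-interactively with the same exchange as above:
mkdir -p account
printf 'first;last;age;gender\n' > account/fields
printf 'Sally;Smith;26;F\n'      > account/records
printf 'Freddy\nJones\n21\nM\nyes\n' | sh insert.sh account
```

The field values are read from standard input, so the script works
both at the keyboard and, as here, with piped-in answers.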


Using Field Names in Queries

If we want to use the names of fields instead of their integer offsets
in queries, we need to map names to offsets.  We can simplify this
mapping process by associating an additional file with each table.  
For example:

% cat acme/sale/offsets
1;id
2;account
3;date
4;amount
5;product
% cat acme/sale/offsets | grep date | sed 's/\([0-9]*\);date/\1/'
3
% set database = acme
% set table = account
% alias index "cat $database/$table/offsets | grep \!^ | sed 's/\([0-9]*\);'\!^'/\1/'"
% index id
1
% index address
3
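
The same lookup can be done with a single 'awk' inside a Bourne-shell
function; one advantage over the grep/sed pipeline is that the match
is against the whole field name, so looking up 'id' can never
accidentally match a longer name that merely contains 'id'.  The
function name 'index' and the files just mirror the example above:

```shell
# Recreate the offsets file from the example.
mkdir -p acme/sale
printf '1;id\n2;account\n3;date\n4;amount\n5;product\n' > acme/sale/offsets

# index TABLE FIELD -- print the offset of FIELD in TABLE's offsets file.
index() { awk -F';' -v name="$2" '$2 == name { print $1 }' "$1"/offsets; }

index acme/sale date      # prints 3
index acme/sale product   # prints 5
```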

Here is my version of the 'select' command using the 'index' alias 
to map field names to offsets.  My implementation uses the SQL-style
keywords 'FROM' and 'SORT BY' (standard SQL spells the latter
'ORDER BY'): 

% SELECT first last address FROM account SORT BY last

+----------+-------------+-----------+
|    Frodo |     Baggins |  Hobbiton |
+----------+-------------+-----------+
|    Bilbo |     Baggins |  Hobbiton |
+----------+-------------+-----------+
| Hermione |     Granger |  Hogworts |
+----------+-------------+-----------+
|    Harry |      Potter |  Hogworts |
+----------+-------------+-----------+
|    Wally |  Ringwraith |    Mordor |
+----------+-------------+-----------+

You can use the 'awk' script './../../doc/pretty.awk' in your script
to 'pretty print' the output of your select command.


Implementing Joins

If you decide to use the first line of a file of records to specify
the field names for the corresponding table, you'll want to strip off
the first line for performing selects and joins.  The 'head' and
'tail' commands with the '-n' option where 'n' is an integer allow you
to print, respectively, the first 'n' lines or the last 'n' lines of a
file.  The 'tail' command with the '+n' option where 'n' is an integer
prints the file starting at the 'n'th line; modern GNU versions of
'tail' spell this option '-n +n'.  For our immediate purposes, 'head
-1 file' prints the first line of 'file' and 'tail +2 file' prints all
but the first line of 'file'.

Here's how you might use 'sort' and 'join' with 'head' and 'tail' to
manage the accounts of Merlin's Magic Shop.  I'll start by just
listing the contents of the two table files that we'll be using:

% cat merlins/account
id;First;Last;Address
1;Frodo;Baggins;Hobbiton
2;Bilbo;Baggins;Hobbiton
3;Harry;Potter;Hogworts
4;Hermione;Granger;Hogworts
5;Wally;Ringwraith;Mordor

% cat merlins/sale
id;account;item;amount
1;3;Potions for Dummies;19
2;2;Magic Ring Tricks;12
3;4;One Hundred Nifty Spells;15
4;5;Planning a Second Career;29
5;1;Big Book of Spiders;17
6;2;Introductory Elvish;24

In the following, we join the 'account' and 'sale' records using the
'id' field of the 'account' table and the 'account' field of the sale
table.  Recall that the files corresponding to the arguments of 'join'
must be sorted according to their respective join fields.

% tail +2 merlins/account | sort -t ";" -k 1 > account.tmp
% tail +2 merlins/sale | sort -t ";" -k 2 > sale.tmp
% join -t ";" -1 1 -2 2 account.tmp sale.tmp 
1;Frodo;Baggins;Hobbiton;5;Big Book of Spiders;17
2;Bilbo;Baggins;Hobbiton;6;Introductory Elvish;24
2;Bilbo;Baggins;Hobbiton;2;Magic Ring Tricks;12
3;Harry;Potter;Hogworts;1;Potions for Dummies;19
4;Hermione;Granger;Hogworts;3;One Hundred Nifty Spells;15
5;Wally;Ringwraith;Mordor;4;Planning a Second Career;29

Note that the join field is only listed once.  When the records are
listed, the records of the second file are appended to the end of the
records of the first file with matching join fields, but the join
field doesn't appear a second time.  If we reverse the order of the
files and, accordingly, the join fields, we get the following:

% join -t ";" -1 2 -2 1 sale.tmp account.tmp 
1;5;Big Book of Spiders;17;Frodo;Baggins;Hobbiton
2;6;Introductory Elvish;24;Bilbo;Baggins;Hobbiton
2;2;Magic Ring Tricks;12;Bilbo;Baggins;Hobbiton
3;1;Potions for Dummies;19;Harry;Potter;Hogworts
4;3;One Hundred Nifty Spells;15;Hermione;Granger;Hogworts
5;4;Planning a Second Career;29;Wally;Ringwraith;Mordor

Recall that we can use the '-o' option to print exactly the fields
that we're interested in:

% join -t ';' -1 2 -2 1 -o 2.2,2.3,1.3 sale.tmp account.tmp 
Frodo;Baggins;Big Book of Spiders
Bilbo;Baggins;Introductory Elvish
Bilbo;Baggins;Magic Ring Tricks
Harry;Potter;Potions for Dummies
Hermione;Granger;One Hundred Nifty Spells
Wally;Ringwraith;Planning a Second Career


Adding Simple Joins 

You might want to implement database joins as a separate command;
you'll probably want to call it something other than 'join', however,
so that it doesn't conflict with the standard 'join' command (above),
which you'll want to use in implementing your database join.
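
As a sketch of such a command (the name 'dbjoin' and its argument
convention are made up), the wrapper just hides the header-stripping
and sorting that the standard 'join' requires; it is written in POSIX
sh, so 'tail -n +2' appears in place of 'tail +2':

```shell
# Recreate Merlin's two tables from above.
mkdir -p merlins
printf '%s\n' 'id;First;Last;Address' '1;Frodo;Baggins;Hobbiton' \
  '2;Bilbo;Baggins;Hobbiton' '3;Harry;Potter;Hogworts' \
  '4;Hermione;Granger;Hogworts' '5;Wally;Ringwraith;Mordor' > merlins/account
printf '%s\n' 'id;account;item;amount' '1;3;Potions for Dummies;19' \
  '2;2;Magic Ring Tricks;12' '3;4;One Hundred Nifty Spells;15' \
  '4;5;Planning a Second Career;29' '5;1;Big Book of Spiders;17' \
  '6;2;Introductory Elvish;24' > merlins/sale

# dbjoin FILE1 FIELD1 FILE2 FIELD2 -- strip headers, sort, join.
dbjoin() {
  tail -n +2 "$1" | sort -t ';' -k "$2,$2" > a.tmp
  tail -n +2 "$3" | sort -t ';' -k "$4,$4" > b.tmp
  join -t ';' -1 "$2" -2 "$4" a.tmp b.tmp
  rm -f a.tmp b.tmp
}

dbjoin merlins/account 1 merlins/sale 2
```

The output contains the same six joined records shown earlier, though
ties on the join field may come out in a different order, since 'sort'
falls back to whole-line comparison.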

You can also add simple two-table joins to the 'select' command by
extending the 'select' syntax.  Referring to the fields in a join is
complicated by the fact that there are two tables involved.  I suggest
that you use some variant of the standard SQL solution and refer to
fields by their 'long' names, appending the table name to field name
as in 'employee.address'.  This doesn't allow you to join a table with
itself (so-called 'self joins'), but is otherwise satisfactory:

% SELECT account.last sale.product FROM account sale \
                                   SORT BY account.last \
                                   WHERE account.id = sale.account
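
Parsing the 'long' names is straightforward with shell parameter
expansion: split on the dot, then feed each half to your table and
field lookups.  A sketch (the variable names are illustrative):

```shell
long=account.last
table=${long%%.*}     # everything before the first dot
field=${long#*.}      # everything after the first dot
echo "$table $field"  # prints: account last
```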

If you're really ambitious you can implement SQL aliases and support
self-joins:

% SELECT one.first two.first FROM account AS one account AS two \
                             WHERE one.first = two.last

This doesn't make much sense for the tables you'll need for the
midterm project, but you get the basic idea.