/Users/tld/email/book/intro/exercises/midterm/ciphers
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/check
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eymessage
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eyscript
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/kurshan/eytools
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/pretty
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/score
/Users/tld/email/book/intro/exercises/midterm/ciphers/bin/toolkit
/Users/tld/email/book/intro/exercises/midterm/ciphers/README

Build a tool kit (a collection of commands implemented as shell scripts) for
automating the process of finding secret codes. Your commands should
partially automate the process of decrypting an encrypted message given a
corpus of unencrypted entries from Pepys Diary. The more automation you can
manage the better. Your commands should exploit frequency analysis coupled
with word matching using the online list of words from the 2nd Edition of
Webster's Dictionary.

Note that not all words in Pepys Diary will be found in Webster's Dictionary.
Some of the words are proper nouns that aren't found in the dictionary. Also,
Pepys often used spellings that are archaic or idiosyncratic. In general,
spelling was not as standardized in the 17th century as it is today, and
authors would frequently take poetic license in spelling a word; the same
word often appears with multiple spellings within the same text. Even so, you
can often gain crucial hints by searching for commonly appearing words in a
text. For example, suppose you have a good idea of how the letters 't' and
'e' map in a substitution code; you might search for strings that match the
regular expression 't[a-z]e'.

You will be graded on the code you write, how you document it, and how it
works. A TA will sit down with you and watch you demonstrate your code on a
target corpus of plain text entries from Pepys Diary and a mystery entry
encrypted using a substitution code.

The exercises related to Ciphers and Secrets are available online

http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2

and also in the form of supplementary tips and sample scripts

../../04/09/20/exercises.txt
../../04/09/20/bin/README

If you're working on CS department machines, either in the lab or remotely
using 'ssh', you can set your 'path' variable to include the Ciphers and
Secrets 'bin' directory as follows:

% set path = ( ~tld/tmp/exercises/04/09/20/bin $path )
% which wordhist
/u/tld/tmp/exercises/04/09/20/bin/wordhist

You might also want to set a variable to provide a shortcut to the directory
containing the excerpts from the Pepys diary:

% set pepys = ~tld/tmp/exercises/04/09/20/bin/pepys
% ls $pepys
1660-01-05.txt  1660-02-04.txt  1661-06-03.txt  1662-01-22.txt  README
1660-01-09.txt  1660-02-12.txt  1662-01-06.txt  1662-02-18.txt
1660-02-01.txt  1660-06-22.txt  1662-01-19.txt  1663-09-04.txt

If you're using a shell on your personal computer and not logged on remotely,
you can simply copy the files to your computer. However, I recommend that you
use department machines whenever possible, both to ensure that your files are
backed up and to avoid discrepancies between the environment on your computer
and the environment on the department machines.
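The 'wordhist' command used throughout these notes is provided in the
exercise 'bin' directory. If you want a rough idea of what such a command
does, here is a minimal sketch of one possible implementation (the provided
version may well be written differently); it treats anything that isn't a
letter as a word separator:

#!/bin/csh
# wordhist - file
# a minimal sketch of a word-histogram command (the 'wordhist' provided
# in the exercise bin directory may be implemented differently): print
# each word in the file preceded by the number of times it appears,
# most frequent words first; anything that isn't a letter is treated
# as a word separator
tr 'A-Z' 'a-z' < $1 | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn

For example, 'wordhist $pepys/1663-09-04.txt | head -5' would list the five
most frequent words in that diary entry along with their counts.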
In building on the Ciphers and Secrets exercise you'll probably want to
exploit the information available in the online dictionary
/usr/share/dict/words. I think you'll find that you can make lots of useful
inferences about the relative value of various attempts to decrypt a message
by using 'grep' and 'sed' along with the dictionary data. For example,

% set dictionary = "/usr/share/dict/words"
% grep "\<[a-z]\{3\}\>" $dictionary | wc -l
1134

counts the number of three-letter words in the dictionary. The special
characters '\<' and '\>' refer to the beginning and end word boundaries;
without these, 'grep' would return all words that have at least three
characters. We can reduce the number of matches by restricting the characters
allowed at each position:

% grep "\<[at][a-z]e\>" $dictionary | wc -l
21
% grep "\<a[a-z]e\>" $dictionary | wc -l
14

The first counts the number of three-letter words that begin with the letter
'a' or 't' and end with the letter 'e'; the second counts the number of
three-letter words that begin with the letter 'a' and end with the letter
'e'. Note that you can build a regular expression from bits and pieces of
other regular expressions and then "feed" the result to 'grep' using a
variable:

% set beg = 'a'
% set end = 'e'
% set num = 1
% set regexp = "\<${beg}[a-z]\{$num\}$end\>"
% grep "$regexp" $dictionary | wc -l
14

Note that we had to be careful with variable substitutions. Without the curly
brackets around 'beg', the shell would have tried to treat 'beg' as an array
(list) and look up the 'a-z' elements. But we didn't intend $beg[a-z] as an
array reference. The curly brackets shield the variable reference from the
surrounding text.

% set regexp = "\<$beg[a-z]\{$num\}$end\>"
tcsh: Missing -.
% set regexp = "\<${beg}[a-z]\{$num\}$end\>"
% echo "$regexp"
\<a[a-z]\{1\}e\>

The fewer the matches, the less ambiguous a partially specified character
sequence is. If there are no matches, however, then presumably the decrypted
sequence is either incorrect, or it is correct and the word is simply not
present in the dictionary.

If you assume that the original text has all punctuation and other
non-alphabetic characters removed and all remaining characters converted to
lowercase, then there is a simple method for keeping track of partial
substitutions: you simply make your substitutions in uppercase characters,
and so, at any point, the characters remaining to be decrypted are exactly
the lowercase letters. To evaluate a partially specified substitution code,
one technique would be to reward those partially decoded strings that match
exactly one word in the dictionary and penalize those that match many words
or that match none.

Implementation Tips

Suppose that you want to grab the most frequently occurring word in a text
file. You know that it will be present on the first line of a word histogram,
but you need to somehow select the first line of the histogram and then
select the word from that line. The 'head' command with an argument '-n',
where 'n' is a positive integer, prints out the first 'n' lines in a file
('tail -n' prints out the last 'n' lines in a file). We can use the 'head'
command to select the first line in the word histogram as follows:

% set text = $pepys/1663-09-04.txt
% wordhist $text | head -1
48 and

To get the 'n'th line of a file, we could use 'head -n | tail -1':

% set n = 8
% wordhist $text | head -$n | tail -1
10 with

But the following short 'sed' script is more efficient. Note that in the case
of the 'sed' script, the 'n' in '-n' is meant literally (it is the option
named 'n', not a number).
In the 'sed' command, the '-n' option turns off the default behavior in which
every line is echoed to the standard output. We then use the 'p' directive to
cause the 'n'th line to be printed.

% wordhist $text | sed -n $n'p'
10 with

Now we can use 'sed' to grab just the word from a histogram line, ignoring
its frequency:

% wordhist $text | sed -n $n'p' | sed 's/[^a-z]*\([a-z]*\)/\1/'
with

Defining a convenient alias will save some typing:

% alias grabnth "sed -n \!^'p' | sed 's/[^a-z]*\([a-z]*\)/\1/'"
% wordhist $text | grabnth 8
with

Checking Substitution Codes

Suppose that the most commonly appearing word in a partially decrypted text
is 'TzE' and the most commonly appearing word in the entire corpus is 'the';
it would be useful to have a script to determine if a match is possible, and,
if so, return the partial substitution that would make it so. A match is
possible just in case the uppercase characters in the partially decrypted
word match, insensitive to case, the corresponding characters in the word
from the unencrypted corpus. Given 'TzE' and 'the', this script would
determine that a match is possible and return the substitution pair "z" "H".
Here's a description of one possible algorithm for accomplishing this:

let 'crypt' be the list of characters in a word from the encrypted text
let 'plain' be the list of characters in a word from the unencrypted text
let 'in' and 'out' be strings that are initially empty
set 'flag' to be 1
if 'crypt' and 'plain' are not the same length then set 'flag' to be 0
if they are the same length then
    set 'i' to be 1
    while 'i' is not greater than the length of 'crypt'
        if the 'i'th character of 'crypt' is uppercase then
            if it doesn't match the 'i'th character of 'plain' then
                set 'flag' to be 0
        else if the 'i'th character of 'crypt' is not uppercase then
            add the 'i'th character of 'crypt' to 'in'
            add the uppercase version of the 'i'th character of 'plain' to 'out'
        set 'i' to be 'i' + 1
if 'flag' is 1 then return 'in' and 'out' as the result
else return 'flag' as the result
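One way to turn this algorithm into a shell script is sketched below. This is
only a sketch of one possible implementation, not the 'check' script provided
in the exercise 'bin' directory; it walks over the characters of the two
words with 'cut' and builds up the 'in' and 'out' strings described above:

#!/bin/csh
# check - crypt plain
# a minimal sketch: determine whether the partially decrypted word
# 'crypt' can match the plain-text word 'plain'; if so, print the pair
# of substitution strings that would make the match, otherwise print 0
set crypt = $1
set plain = $2
set in = ""
set out = ""
set flag = 1
# the two words must be the same length
if ( `echo $crypt | wc -c` != `echo $plain | wc -c` ) then
    set flag = 0
else
    set len = `echo $crypt | wc -c`
    @ len = $len - 1
    set i = 1
    while ( $i <= $len )
        set c = `echo $crypt | cut -c$i`
        set p = `echo $plain | cut -c$i`
        if ( `echo $c | grep -c '[A-Z]'` ) then
            # already-decided (uppercase) characters must agree
            if ( `echo $c | tr 'A-Z' 'a-z'` != $p ) set flag = 0
        else
            # undecided (lowercase) characters propose a substitution
            set in = "$in$c"
            set out = "$out`echo $p | tr 'a-z' 'A-Z'`"
        endif
        @ i = $i + 1
    end
endif
if ( $flag ) then
    echo "$in" "$out"
else
    echo $flag
endif

With this sketch, 'check TzE the' would print 'z H', while 'check TzE she'
would print '0' because the already-decided 'T' conflicts with the 's'.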
Scoring Substitution Codes

In order to guide the search for a substitution code that correctly decodes
the encrypted file, we need to measure the quality of a partial substitution.
We can then use this measure to extend a given substitution code that handles
a subset of characters to handle additional characters. Here we use the
technique described earlier of rewarding those partially decoded strings that
match exactly one word in the dictionary and penalizing those that match many
words or that match none. Here we assume that each partially decrypted word
consists of lowercase characters corresponding to characters that have yet to
be translated and uppercase characters corresponding to those that have
already been translated (though perhaps incorrectly).

let 'sample' be a list of words taken from a partially decrypted file
let 'dictionary' be a list of the words in a suitable dictionary
let 'reward' be the reward for matching exactly one word
let 'penalty' be the penalty for failing to match any word
let 'threshold' be the threshold for penalizing ambiguous matches
let 'threshold_penalty' be the penalty for ambiguous matches
set 'score' to be 0
for each 'word' in the 'sample'
    transform the 'word' into a regular expression (RE) as follows:
        replace each lowercase character with the regular expression '[a-z]'
        replace each uppercase character with its lowercase counterpart
    set 'count' to be the number of words in 'dictionary' that match the RE
    if 'count' is equal to zero then
        set 'score' to be 'score' minus 'penalty'
    else if 'count' is greater than 'threshold' then
        set 'score' to be 'score' minus 'threshold_penalty'
    else if 'count' is equal to one then
        set 'score' to be 'score' plus 'reward'
return 'score' as the result

Controlling Filename Substitution

As we build a regular expression from a partially decrypted word to use in
matching against words in the dictionary, we have to foil the shell's
attempts to match regular expressions against files. If a word contains any
of the characters '*', '?', '[' or '{', or begins with the character '~',
then that word is a candidate for 'filename substitution', also known as
'globbing'. In this case the word is regarded as a pattern and replaced with
an alphabetically sorted list of file names that match the pattern.

% ls
decrypt
% echo de[a-z]*
decrypt

If the shell fails to find a match, normally it signals an error.

% echo dc[a-z]*
tcsh: echo: No match.

We can inhibit filename substitution by enclosing the regular expression in
double quotes.

% echo "dc[a-z]*"
dc[a-z]*

We could use this technique in implementing our scoring algorithm as follows:

% set word = "THeRe"
% set dictionary = "/usr/share/dict/words"
% set word = "`echo "$word" | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`"
% echo "$word"
th[a-z]r[a-z]

The additional layer of quotes is a bit inelegant, however. If all we want to
do is prevent the shell from signaling an error and aborting the execution of
the current command, we can set the C-shell variable 'nonomatch' as follows:

% set nonomatch
% set word = THeRe
% set word = `echo $word | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`
% echo $word
th[a-z]r[a-z]

This works in the present circumstances because there are no matching
filenames and hence no substitutions. And sometimes this is exactly what we
want:

% touch file.tmp
% if ( -e *.tmp ) echo yes
yes
% rm -f file.tmp
% if ( -e *.tmp ) echo yes
tcsh: *.tmp: No match.
% set nonomatch
% if ( -e *.tmp ) echo yes
%

For our present purposes, however, we want to prevent filename substitutions
altogether, and this can be accomplished by setting the C-shell variable
'noglob' as shown below:

% set noglob
% set word = THeRe
% set word = `echo $word | sed 's/[a-z]/[a-z]/g' | tr "A-Z" "a-z"`
% echo $word
th[a-z]r[a-z]
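Putting the scoring algorithm and the 'noglob' trick together, a scoring
script might look something like the sketch below. This is just one possible
implementation (not the 'score' script provided in the exercise 'bin'
directory), and the particular reward, penalty and threshold values are
arbitrary:

#!/bin/csh
# score - file
# a minimal sketch of the scoring algorithm described above; 'file' is
# a partially decrypted text in which uppercase characters have been
# decided and lowercase characters have not
set noglob                          # keep the shell from globbing our REs
set dictionary = "/usr/share/dict/words"
set reward = 2                      # these values are arbitrary
set penalty = 2
set threshold = 10
set threshold_penalty = 1
set score = 0
foreach word ( `cat $1` )
    # lowercase characters can match anything; uppercase characters
    # have been decided and must match literally (as lowercase)
    set regexp = `echo $word | sed 's/[a-z]/[a-z]/g' | tr 'A-Z' 'a-z'`
    set count = `grep -c "\<$regexp\>" $dictionary`
    if ( $count == 0 ) then
        @ score = $score - $penalty
    else if ( $count > $threshold ) then
        @ score = $score - $threshold_penalty
    else if ( $count == 1 ) then
        @ score = $score + $reward
    endif
end
echo $score

A higher score suggests a more promising partial substitution. The 'scoresub'
command used later in these notes could be a thin wrapper that writes the
current partially decrypted text to a temporary file and runs a script like
this one on it.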
Managing a Queue of Substitutions

Note that even if we choose the best extension for a partial substitution
code according to our scoring method, it is possible that we end up with an
incorrect substitution code. If we discover this, say by finding that we can
make no further progress in decoding the encrypted file, then we may want to
'backtrack' to a previous partial substitution code.

To facilitate incremental substitutions and backtracking, we might maintain a
queue of partial substitutions called 'qsub', implemented as a list in which
the first item on the list is the last partial substitution made and the last
item on the list is the first partial substitution made; the queue is
maintained in LIFO (last in, first out) order, which is to say it behaves
like a stack. When we extend a partial substitution, we use 'tr' to make the
relevant substitutions and then 'push' the extension onto the front of the
queue. To backtrack, we 'pop' the first extension off the queue and use 'tr'
to reverse the effect of making the substitutions. Since a partial
substitution is represented as a pair of strings to be used as arguments to
'tr', we're actually pushing and popping pairs of strings.

We'll initialize the queue to be empty and set 'word' to be a partially
decoded word.

% set qsub = ( )
% set word = THzxz

Suppose that we have a candidate extension: "z" translates to "E"; perform
the substitution and then push the extension on the queue:

% set wsub = ( z E )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THExE

Now suppose we have another extension: "x" translates to "S":

% set wsub = ( x S )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THESE

Then suppose we determine that the last extension was wrong and we need to
backtrack; pop the last extension off the queue and then reverse the
translation to restore the partially decrypted text to what it was prior to
the last extension:

% set wsub = ( $qsub[1-2] ) ; shift qsub ; shift qsub
% set word = `echo $word | tr "$wsub[2]" "$wsub[1]"`
% echo $word
THExE

We can then try an alternative extension, pushing it on the queue and
performing the translation:

% set wsub = ( x R )
% set word = `echo $word | tr "$wsub[1]" "$wsub[2]"`
% set qsub = ( $wsub $qsub )
% echo $word
THERE

This technique accomplishes our objective of keeping track of partial
substitutions, but it's tedious and hence prone to human error. However, we
can define a few aliases to reduce the amount of typing and avoid the most
obvious sources of error.

alias voidsub 'set qsub = ( )'
alias pushsub 'set qsub = ( \!* $qsub ) ; \
               set text = `echo $text | tr "$qsub[1]" "$qsub[2]"`'
alias popsub  'set text = `echo $text | tr "$qsub[2]" "$qsub[1]"` ; \
               shift qsub ; shift qsub'
alias showsub 'echo $text'

You can put these in your .cshrc file in your home directory if you want them
to be available whenever you start up a shell, or you can put them in a file
called, say, 'aliases' in an appropriate directory and load them whenever you
like by typing 'source aliases' to the shell. Here is a simple demonstration
making use of the above aliases:

% voidsub
% set text = "THzxz Axz zxxOxS"
% pushsub z E
% showsub
THExE AxE ExxOxS
% pushsub x R
% showsub
THERE ARE ERRORS
% popsub
% showsub
THExE AxE ExxOxS
% popsub
% showsub
THzxz Axz zxxOxS

Using the Cryptanalyst's Tool Kit

One way of building a Cryptanalyst's Tool Kit is to write a set of scripts to
handle key algorithms, like the scoring and testing procedures that I
described above, and then customize your shell environment by defining
variables to keep track of basic information (like the queue of partial
substitutions) and aliases to handle commonly performed functions (like
pushing and popping partial substitutions).
The result might look something like:

% source toolkit            <<< load the definitions for the tool kit
% mysecret                  <<< randomly select and encrypt a file
% secret_charhist 3         <<< top 3 lines of the character histogram
554 z
285 e
246 i
% secret_charfreq ' '       <<< check on the frequency of spaces
114
% pushsub 'z ' ' @'         <<< guess identity of the space character
% secret_charfreq @         <<< sanity check on the substitution
114
% secret_wordhist 5         <<< top 5 lines of secret word histogram
27 iye
26 phb
23 ic
19 cs
19 iypi
% corpus_wordhist 5         <<< top 5 lines of corpus word histogram
319 and
231 to
200 the
131 i
99 of
% check iye and             <<< check to see if the match is possible
iye AND
% pushsub iye AND           <<< go ahead and try the most obvious
% scoresub                  <<< this score isn't very promising
-34
% pushsub phb THE           <<< we'll go with it a bit further
% scoresub                  <<< the score is getting worse not better
-52
% showtext 4                <<< let's take a look at the partial decryption
scl ANDtD A@c cl ANlDD ETdt j NTkD uDDH mwnN AlcwuaDE @jAN ANcwgNAt Nc@ Ac gDA mcHDd Ac rTd ANDm ANTA j NTkD ucllc@DE mcHDd cs ud lDTtcH cs md mcHDd uDjHg jH md wHnaDt NTHEt j lctD DTlad ANjt mclHjHg THE accoDE ckDl THE
% popsub                    <<< this isn't working out; let's bail
% popsub
% pushsub phbiye ANDTHE     <<< reverse the codes for AND and THE
% scoresub                  <<< ahhh! this is looking much better
-22
% showtext 4                <<< let's take another look at the decryption
scl THEtE T@c cl THlEE DAdt j HAkE uEEN mwnH TlcwuaED @jTH THcwgHTt Hc@ Tc gET mcNEd Tc rAd THEm THAT j HAkE ucllc@ED mcNEd cs ud lEAtcN cs md mcNEd uEjNg jN md wNnaEt HANDt j lctE EAlad THjt mclNjNg AND accoED ckEl AND
% pushsub ics TOF           <<< let's go with the next two obvious matches
% scoresub                  <<< the score is looking more promising
21
% showtext 4
FOl THEtE T@O Ol THlEE DAdt j HAkE uEEN mwnH TlOwuaED @jTH THOwgHTt HO@ TO gET mONEd TO rAd THEm THAT j HAkE uOllO@ED mONEd OF ud lEAtON OF md mONEd uEjNg jN md wNnaEt HANDt j lOtE EAlad THjt mOlNjNg AND aOOoED OkEl AND
% pushsub jtlku ISRVB       <<< now it's getting pretty obvious what to do
% showtext 4
FOR THESE T@O OR THREE DAdS I HAVE BEEN mwnH TROwBaED @ITH THOwgHTS HO@ TO gET mONEd TO rAd THEm THAT I HAVE BORRO@ED mONEd OF Bd REASON OF md mONEd BEINg IN md wNnaES HANDS I ROSE EARad THIS mORNINg AND aOOoED OVER AND
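The toolkit commands used above ('mysecret', 'secret_charhist', 'scoresub'
and so on) are for you to write. As one example of how little code some of
them require, here is a minimal sketch of a character-histogram command along
the lines of 'secret_charhist'; the file name 'secret.txt' is just a
placeholder for wherever your tool kit keeps the encrypted text, and the
provided toolkit may do this differently:

#!/bin/csh
# secret_charhist - n
# a minimal sketch: print the top 'n' lines of a character histogram
# for the encrypted text; 'secret.txt' is a placeholder for wherever
# your tool kit keeps the encrypted file
set secret = secret.txt
fold -w 1 $secret | sort | uniq -c | sort -rn | head -$1

A 'secret_charfreq' command could be built in much the same way, replacing
the sort and head stages with a 'grep -c' for the character in question.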
/Users/tld/email/book/intro/exercises/midterm/dynamic
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/allthumbs
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/hclean
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/rosebud
/Users/tld/email/book/intro/exercises/midterm/dynamic/bin/thumbnail
/Users/tld/email/book/intro/exercises/midterm/dynamic/images
/Users/tld/email/book/intro/exercises/midterm/dynamic/images/adirondack.gif
/Users/tld/email/book/intro/exercises/midterm/dynamic/index.html
/Users/tld/email/book/intro/exercises/midterm/dynamic/README

Write a script that takes two arguments: a string specifying a directory (it
can be an absolute or relative path name) and an integer specifying the
number of columns in a table that you will construct. Where appropriate,
create separate commands for handling subtasks of the full task. The
directory should contain a set of images in JPEG format with the extension
'jpg'. Your script will construct an HTML document (essentially a web-based
photo album) containing a table in which thumbnail versions of the images in
the specified directory are displayed. You are to use the ImageMagick library
to create the thumbnail-sized images so that the width of the table is no
wider than the width of a typical browser window. The thumbnails will serve
as anchor images for hyperlinks that will allow a user to click on the
thumbnail to view the full-sized images. You'll be given an example of what
the HTML output should look like; no prior experience working with HTML is
required.

As a variant, you can assume that the images are album covers or pictures of
recording artists and link the thumbnails to short audio clips extracted from
MP3 files using the SoundExchange library. You will be graded on the code,
how you document it and how it works. You will be required to demonstrate
your code to a TA and explain how it works.

For an example of the sort of code that your script will be producing in this
exercise, take a look at:

http://www.cs.brown.edu/~tld/note/prospect/index.htm

which is linked off of:

http://www.cs.brown.edu/~tld/note/ri.htm

Note that the first web page is more complicated than your program need
produce. In particular, you don't have to reproduce the navigation arrows
that appear at the top of these pages, unless, of course, you want to focus
on this aspect of the project. See ./index.html for a somewhat cleaned up
version of a web page that is similar to what you might provide in your
project. See below for a description of the HTML tags that are found in these
files.

New HTML tags that you'll be using in the Photo Album Project:

Image tag 'img': An image tag is used to specify a JPEG or GIF file to
display. The 'src' attribute is used to specify the location of the image
file, either relative to the containing page, as is done here, or as a full
URL:

<img src="thumbnails/Prospect-01.jpg" border="0">

Note that the image tag does not have a closing tag. The 'border' option is
similar to the option by the same name for the 'table' tag.

Anchor tag 'a': The anchor tag is used to set off the anchor image or text
for a hyperlink. The 'href' (hypertext reference) option is used to specify a
destination URL such that if a user clicks the image or text, the browser
attempts to load the destination URL.
The 'href' can be specified relative to the containing page as in:

<a href="pages/Prospect-01.htm">

or it can be specified as a full URL as in:

<a href="http://www.cs.brown.edu/~tld/note/tour.html">

An image can be used to anchor the link as in:

<a href="pages/Prospect-01.htm">
<img src="thumbnails/Prospect-01.jpg" border="0">
</a>

or alternatively text can be used to anchor the link:

<a href="pages/Prospect-01.htm">Prospect Street</a>

Comment tag '!--': Comments are not displayed by the browser; they are
present only to help document the HTML code:

<!-- Thumbnails with hyperlinks -->

When you put these all together to create a table representing a photo album,
you get an HTML document that looks like:

% cat index.html
<html>
<body>
<table>
<tr>
<td>
<a href="pages/page.1.1.jpg">
<img src="thumbnails/thumb.1.1.jpg">
</a>
</td>
<td>
<a href="pages/page.1.2.jpg">
<img src="thumbnails/thumb.1.2.jpg">
</a>
</td>
</tr>
<tr>
<td>
<a href="pages/page.2.1.jpg">
<img src="thumbnails/thumb.2.1.jpg">
</a>
</td>
<td>
<a href="pages/page.2.2.jpg">
<img src="thumbnails/thumb.2.2.jpg">
</a>
</td>
</tr>
</table>
</body>
</html>

for a 2 X 2 table, where the thumbnails are just smaller versions of the
original photos and the pages are simple HTML documents designed to display a
single image, e.g.

% cat pages/page.1.1.jpg
<html>
<body>
<center>
<img src="../originals/photo.1.1.jpg">
</center>
</body>
</html>

and the relevant documents are organized as follows:

./index.html
./originals/photo.1.1.jpg photo.1.2.jpg ...
./thumbnails/thumb.1.1.jpg thumb.1.2.jpg ...
./pages/page.1.1.jpg page.1.2.jpg ...

Your shell script starts with just the photos in ./originals/.

Creating Thumbnail Images

Here's a shell script that creates a thumbnail image using ImageMagick.

% cat ./bin/thumbnail
#!/bin/csh
# thumbnail - width image name
# 'width' is the width of the requested thumbnail image in pixels
set thumb_width = $1
# 'image' is the name of the original image file in any standard format
# get rid of any old temporary files that happen to be around
if ( -e tmp.miff ) rm -f tmp.miff
# convert the input image to ImageMagick internal format
convert $2 tmp.miff
# 'name' is the basename (without any extension) of the thumbnail
set thumb_name = $3
# determine the size of the original image file
set size = ( `identify -format "%w %h" tmp.miff` )
set image_width = $size[1]
set image_height = $size[2]
# given that 'thumb_width' = 'image_width' * 'scale_factor'
# then we have 'scale_factor' = 'thumb_width' / 'image_width'
set scale_factor = `echo "$thumb_width / $image_width" | bc -l`
# and therefore 'thumb_height' = 'scale_factor' * 'image_height';
# adding 0.5 and dropping the fraction rounds to a whole number of pixels
set thumb_height = `echo "$scale_factor * $image_height + 0.5" | bc -l | sed 's/\..*//'`
# make imagemagick carry out the necessary conversion
convert -geometry ${thumb_width}x${thumb_height} tmp.miff $thumb_name.jpg
# finally clean up any temporary files left around
if ( -e tmp.miff ) rm -f tmp.miff

And here's an example demonstrating its use:

% ls
tom.jpg
% thumbnail 100 tom.jpg thumb
% ls
thumb.jpg tom.jpg
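To give a sense of how the pieces fit together, here is a minimal sketch of a
top-level script that builds the photo album using the 'thumbnail' script
above. It is only a sketch of one possible approach, not the 'allthumbs'
script included in the exercise 'bin' directory; the directory names
'originals', 'thumbnails' and 'pages', the page extension '.html' and the
thumbnail width are just the conventions used in this example, and it assumes
image file names without spaces:

#!/bin/csh
# album - directory columns
# a minimal sketch: build thumbnails, one page per photo and an
# index.html with 'columns' thumbnails per table row; assumes the
# 'thumbnail' script shown above is on your path and the original
# JPEG images are in the 'originals' subdirectory of 'directory'
set dir = $1
set columns = $2
set width = 150                       # thumbnail width in pixels
cd $dir
mkdir -p thumbnails pages

echo "<html>" > index.html
echo "<body>" >> index.html
echo "<table>" >> index.html

set col = 0
foreach image ( originals/*.jpg )
    set name = `basename $image .jpg`
    # create the thumbnail and the single-image page for this photo
    thumbnail $width $image thumbnails/$name
    echo "<html> <body> <center>" > pages/$name.html
    echo '<img src="../'$image'">' >> pages/$name.html
    echo "</center> </body> </html>" >> pages/$name.html
    # start a new table row whenever the previous one is full
    if ( $col == 0 ) echo "<tr>" >> index.html
    echo '<td> <a href="pages/'$name'.html">' >> index.html
    echo '<img src="thumbnails/'$name'.jpg" border="0"> </a> </td>' >> index.html
    @ col = $col + 1
    if ( $col == $columns ) then
        echo "</tr>" >> index.html
        set col = 0
    endif
end
# close the last row if it was left open
if ( $col != 0 ) echo "</tr>" >> index.html

echo "</table>" >> index.html
echo "</body>" >> index.html
echo "</html>" >> index.html

Saved as, say, 'album' and made executable, running 'album photos 2' on a
directory 'photos' whose 'originals' subdirectory holds four JPEG images
would produce an index.html much like the 2 X 2 example above.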
/Users/tld/email/book/intro/exercises/midterm/account
/Users/tld/email/book/intro/exercises/midterm/account/bin
/Users/tld/email/book/intro/exercises/midterm/account/bin/insert
/Users/tld/email/book/intro/exercises/midterm/account/bin/pretty
/Users/tld/email/book/intro/exercises/midterm/account/bin/restore
/Users/tld/email/book/intro/exercises/midterm/account/bin/select
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/myjoin
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/mypretty
/Users/tld/email/book/intro/exercises/midterm/account/bin/simple/myselect
/Users/tld/email/book/intro/exercises/midterm/account/databases
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/account/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/payment/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/account/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/payment/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/restore/sale/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/fields
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/id_auto_seq
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/offsets
/Users/tld/email/book/intro/exercises/midterm/account/databases/acme/sale/records
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore/account
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/restore/sale
/Users/tld/email/book/intro/exercises/midterm/account/databases/merlins/sale
/Users/tld/email/book/intro/exercises/midterm/account/README

Build an account-management system by implementing basic database operations
using the shell commands 'awk', 'join', 'grep' (or 'egrep'), 'sed', and
'sort'. In your database, there will be an 'account' record for each customer
with fields for a unique identifier, customer name, billing address, etc.
There should also be a 'sale' record for each sales transaction with fields
for a unique identifier, account to bill, amount of sale, date of transaction
and description of goods purchased. Finally, there will be a 'payment' record
for each payment received with the amount received, account to credit, date
of receipt, etc. Each record should have a unique identifier, which is
commonly referred to as the 'primary key' for the record type.

You should implement a set of commands that mimic basic SQL database commands
and simplify inserting, extracting and formatting financial data. Use your
system to generate an end-of-the-month financial report that shows the
balance (sales minus payments) for each customer, total sales, total payments
and the end-of-the-month balance. You will be graded on how the code is
organized and written, how you document it and how well it works. You will be
required to demonstrate your code to a TA and explain how it works.

Specifying Fields and Tables by Name

In augmenting the database system that we looked at in

http://www.cs.brown.edu/~tld/talk/home-Z-H-5.html#node_sec_3.1.2

you'll probably want to make it possible to refer to the individual fields of
records by name. Here are three different techniques for managing this:

i. keep field names and record data in a single file and assume that all
commands are aware that the first line of a file representing a database
table consists of the names of the fields:

% cat employee
first;last;age;gender
Freddie;Jones;35;M
Sally;Smith;27;F

ii. keep two separate files, one for each type of information:

% cat employee.fields
first;last;age;gender
% cat employee.records
Freddie;Jones;35;M
Sally;Smith;27;F

iii. similar to (ii) but create a directory with the same name as the record
table that contains files for each type of information:

% cat employee/fields
first;last;age;gender
% cat employee/records
Freddie;Jones;35;M
Sally;Smith;27;F

Using Field Names to Interactively Insert Data

Here's a script illustrating how you might interactively add a record to a
database table. This version assumes that each field name is a single word.
If you wish to have multi-word field names, separate the words with the
underscore character, as in 'business_address'. You may want to rewrite this
script to display the completed record and then prompt the user to review the
record and respond (yes or no) whether or not the record is actually added to
the data file for the corresponding table. This script assumes the second
technique (above) for keeping track of field names.
% ls account
fields records
% cat account/fields
first;last;age;gender
% cat account/records
Sally;Smith;26;F

Here's the script:

% cat insert
#!/bin/csh
# the first and only argument specifies a table
set table = $argv
# grab the field names for the specified table
set fields = ( `cat $table/fields | sed 's/;/ /g'` )
# create an empty record
set record = ""
# build the record field by field
foreach field ( $fields )
    # prompt the user for the value of a field
    printf "enter $field = "
    # append the entered value to the record
    set record = "$record;$<"
end
# strip off the leading semicolon and append
# the completed record to the table
echo $record | sed 's/;//' >> $table/records

Here is an example illustrating its use; in this exchange the script prints
the prompt, e.g., 'enter first = ', and the user types the response, e.g.,
'Freddy':

% insert account
enter first = Freddy
enter last = Jones
enter age = 21
enter gender = M
% cat account/records
Sally;Smith;26;F
Freddy;Jones;21;M

Using Field Names in Queries

If we want to use the names of fields instead of their integer offsets in
queries, we need to map names to offsets. We can simplify this mapping
process by associating an additional file with each table. For example:

% cat acme/sale/offsets
1;id
2;account
3;date
4;amount
5;product
% cat acme/sale/offsets | grep date | sed 's/\([0-9]*\);date/\1/'
3
% set database = acme
% set table = account
% alias index "cat $database/$table/offsets | grep \!^ | sed 's/\([0-9]*\);'\!^'/\1/'"
% index id
1
% index address
3

Here is my version of the 'select' command using the 'index' alias to map
field names to offsets. My implementation uses the SQL-style keywords 'FROM'
and 'SORT BY':

% SELECT first last address FROM account SORT BY last
+----------+-------------+-----------+
| Frodo    | Baggins     | Hobbiton  |
+----------+-------------+-----------+
| Bilbo    | Baggins     | Hobbiton  |
+----------+-------------+-----------+
| Hermione | Granger     | Hogworts  |
+----------+-------------+-----------+
| Harry    | Potter      | Hogworts  |
+----------+-------------+-----------+
| Wally    | Ringwraith  | Mordor    |
+----------+-------------+-----------+

You can use the 'awk' script './../../doc/pretty.awk' in your script to
'pretty print' the output of your select command.

Implementing Joins

If you decide to use the first line of a file of records to specify the field
names for the corresponding table, you'll want to strip off the first line
when performing selects and joins. The 'head' and 'tail' commands with the
'-n' option, where 'n' is an integer, allow you to print, respectively, the
first 'n' lines or the last 'n' lines of a file. The 'tail' command with the
'+n' option, where 'n' is an integer, prints the file starting at the 'n'th
line. For our immediate purposes, 'head -1 file' prints the first line of
'file' and 'tail +2 file' prints all but the first line of 'file'.
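As an illustration of how 'tail +2' and field offsets fit into a select-like
command, here is a minimal sketch that assumes technique (i) above, where the
field names sit on the first line of the table file. It is only a sketch of
one possible approach (it is not the 'myselect' script in bin/simple, and it
supports nothing like the full SELECT syntax shown earlier); the command name
'pick' is purely hypothetical:

#!/bin/csh
# pick - table field  ('pick' is just a hypothetical name)
# a minimal sketch of a select-like command assuming technique (i):
# print the values of 'field' for every record in 'table'
set table = $1
set field = $2
# find the offset of the named field on the header line
set offset = `head -1 $table | tr ';' '\n' | grep -n "^$field\$" | sed 's/:.*//'`
# print that field from every record, skipping the header line
tail +2 $table | awk -F ';' '{ print $'$offset' }'

With the 'employee' table shown earlier, 'pick employee last' would print
'Jones' and 'Smith', one per line.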
Here's how you might use 'sort' and 'join' with 'head' and 'tail' to manage
the accounts of Merlin's Magic Shop. I'll start by just listing the contents
of the two table files that we'll be using:

% cat merlins/account
id;First;Last;Address
1;Frodo;Baggins;Hobbiton
2;Bilbo;Baggins;Hobbiton
3;Harry;Potter;Hogworts
4;Hermione;Granger;Hogworts
5;Wally;Ringwraith;Mordor
% cat merlins/sale
id;account;item;amount
1;3;Potions for Dummies;19
2;2;Magic Ring Tricks;12
3;4;One Hundred Nifty Spells;15
4;5;Planning a Second Career;29
5;1;Big Book of Spiders;17
6;2;Introductory Elvish;24

In the following, we join the 'account' and 'sale' records using the 'id'
field of the 'account' table and the 'account' field of the 'sale' table.
Recall that the files corresponding to the arguments of 'join' must be sorted
according to their respective join fields.

% tail +2 merlins/account | sort -t ";" -k 1 > account.tmp
% tail +2 merlins/sale | sort -t ";" -k 2 > sale.tmp
% join -t ";" -1 1 -2 2 account.tmp sale.tmp
1;Frodo;Baggins;Hobbiton;5;Big Book of Spiders;17
2;Bilbo;Baggins;Hobbiton;6;Introductory Elvish;24
2;Bilbo;Baggins;Hobbiton;2;Magic Ring Tricks;12
3;Harry;Potter;Hogworts;1;Potions for Dummies;19
4;Hermione;Granger;Hogworts;3;One Hundred Nifty Spells;15
5;Wally;Ringwraith;Mordor;4;Planning a Second Career;29

Note that the join field is only listed once. When the records are listed,
the fields of the second file are appended to the end of the matching record
from the first file, but the join field doesn't appear a second time. If we
reverse the order of the files and, accordingly, the join fields, we get the
following:

% join -t ";" -1 2 -2 1 sale.tmp account.tmp
1;5;Big Book of Spiders;17;Frodo;Baggins;Hobbiton
2;6;Introductory Elvish;24;Bilbo;Baggins;Hobbiton
2;2;Magic Ring Tricks;12;Bilbo;Baggins;Hobbiton
3;1;Potions for Dummies;19;Harry;Potter;Hogworts
4;3;One Hundred Nifty Spells;15;Hermione;Granger;Hogworts
5;4;Planning a Second Career;29;Wally;Ringwraith;Mordor

Recall that we can use the '-o' option to print exactly the fields that we're
interested in:

% join -t ';' -1 2 -2 1 -o 2.2,2.3,1.3 sale.tmp account.tmp
Frodo;Baggins;Big Book of Spiders
Bilbo;Baggins;Introductory Elvish
Bilbo;Baggins;Magic Ring Tricks
Harry;Potter;Potions for Dummies
Hermione;Granger;One Hundred Nifty Spells
Wally;Ringwraith;Planning a Second Career

Adding Simple Joins

You might want to implement database joins as a separate command; you'll
probably want to call it something other than 'join', however, so that it
doesn't conflict with the standard 'join' command (above), which you'll want
to use in implementing your database join. You can also add simple two-table
joins to the 'select' command by extending the 'select' syntax. Referring to
the fields in a join is complicated by the fact that there are two tables
involved. I suggest that you use some variant of the standard SQL solution
and refer to fields by their 'long' names, prefixing each field name with the
name of its table, as in 'employee.address'. This doesn't allow you to join a
table with itself (a so-called 'self join'), but is otherwise satisfactory:

% SELECT account.last sale.product FROM account sale \
         SORT BY account.last \
         WHERE account.id = sale.account

If you're really ambitious you can implement SQL aliases and support
self-joins:

% SELECT one.first two.first FROM account AS one account AS two \
         WHERE one.first = two.last

This doesn't make much sense in the case of the tables you'll need for the
midterm project, but you get the basic idea.
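Finally, here is a minimal sketch of the sort of end-of-the-month report the
assignment asks for, built from the same 'sort', 'join' and 'awk'
ingredients. It assumes that, as in the 'acme' sale table, both sale and
payment records keep the account id in field 2 and the amount in field 4;
adjust the field numbers and paths to match your own tables, and note that
accounts appearing in only one of the two files would need extra handling:

#!/bin/csh
# report - database
# a minimal sketch of an end-of-the-month report: per-customer balance
# (sales minus payments), total sales, total payments and overall balance;
# assumes sale and payment records keep the account id in field 2 and
# the amount in field 4
set db = $1

# total the sales and the payments for each account
awk -F ';' '{ s[$2] += $4 } END { for (a in s) print a ";" s[a] }' $db/sale/records | sort -t ';' -k 1 > sales.tmp
awk -F ';' '{ p[$2] += $4 } END { for (a in p) print a ";" p[a] }' $db/payment/records | sort -t ';' -k 1 > payments.tmp

# join the per-account totals on the account id and print the report;
# accounts that appear in only one of the two files are dropped by 'join'
join -t ';' -1 1 -2 1 sales.tmp payments.tmp | \
    awk -F ';' '{ print "account " $1 ": sales " $2 ", payments " $3 ", balance " $2 - $3 ; \
                  sales += $2 ; payments += $3 } \
                END { print "total sales: " sales ; \
                      print "total payments: " payments ; \
                      print "balance: " sales - payments }'

rm -f sales.tmp payments.tmp

A prettier report could be produced by piping the per-account lines through
the same pretty-printing awk script suggested for the 'select' command.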