Chapter 2

Selected Exercises

2.1  Exercises for Wednesday, September 15, 2004

Exercises using shell variables, conditionals and foreach loops

# you could store all your information in files
% echo 1 > n
% cat n
1
% rm n
% echo 2 > n
% cat n
2

# in many cases however shell variables are a simpler solution

% set str = one
# use echo to look at the value of the variable
% echo $str
one
% set n = 1
% echo $n
1

# shell variables can be assigned lists (or vectors or arrays)

% set str = "one two three four"
# looks sort of like a list to me
% echo $str
one two three four
# even acts a little like a list
% echo $str[1]
one two three four
# unfortunately not quite enough like a list
% echo $str[2]
str: Subscript out of range.

# creating real lists requires some special syntax

% set lst = ( one two three four )
# at first blush the variable behaves as above
% echo $lst
one two three four
# but now we can pick out the individual elements
% echo $lst[1]
one
% echo $lst[2]
two
% echo $lst[4]
four
# however we can't pick out elements that don't exist
% echo $lst[5] 
lst: Subscript out of range.
# we can also change the elements to have new values
% set lst[4] = five
% echo $lst[4]
five
# and we can even shift the items in a list to the left
% shift lst
% echo $lst[1]
two
# in shifting left we eliminated one item in the list
% echo $lst[4] 
lst: Subscript out of range.
# and the 4th item in the list became the 3rd item
% echo $lst[3]
five

# creating a list of files using the backtick operator (`)

% ls *.mp3
bo_diddley.mp3 james_brown.mp3
% set files = ( `ls` )
% echo $files
bo_diddley.mp3 james_brown.mp3
% echo $files[1]
bo_diddley.mp3
% echo $files[2]
james_brown.mp3

# doing simple math calculations in a shell script 

# setting variables and doing simple integer math
% set x = 1
% echo $x
1
% @ x = $x + 1
% echo $x
2
% @ x++
% echo $x
3

# dealing with real numbers (floats) requires other tools

# the shell can't handle real numbers
% @ x = 7.5 * 3.14 
@: Badly formed number
# using 'awk' to do math with real numbers
% echo 7.5 | awk '{ print ( 3.14 * $1 ) }'
23.55
# using 'bc' an arbitrary precision calculator
% set pi = 3.14
% set diameter = 7.5
% echo "$pi * $diameter" | bc
23.55
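
# a note on 'bc': by default division truncates to an integer; set
# the 'scale' variable to keep digits after the decimal point
% echo "10 / 3" | bc
3
% echo "scale = 4; 10 / 3" | bc
3.3333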

# conditional statements allow shell scripts to make choices

% ls *.mp3
bo_diddley.mp3  james_brown.mp3
% if ( -e bo_diddley.mp3 ) echo "Hey Bo Diddley!"
Hey Bo Diddley!

# here's a qualified conditional statement: ! = negation

% if ( ! -e wayne_newton.mp3 ) echo "Yo! No Wayne."
"Yo! No Wayne."

# a multiple-line conditional statement with an else clause

% if ( ! -e james_brown.mp3 ) then
if? echo "Get up off of that thing and dance."
if? else
if? echo "Papa's got a brand new bag!"
if? endif
Papa's got a brand new bag!

# the antecedent can be any command that you like

% if ( -e bo_short.wav ) esdplay bo_short.wav

# a compound conditional statement: && = conjunction

% if ( -e james_brown.mp3 && -e bo_diddley.mp3 ) then
if? echo "Dance! You'll feel better."
if? endif
Dance! You'll feel better.

# 'foreach' statements are for looping over lists 

% cd bo_tracks
% ls
bo_backward.wav   bo_forward.wav    bo_slowly.wav
% foreach file ( `ls` )
foreach? echo $file
foreach? esdplay $file
foreach? end
bo_backward.wav
bo_forward.wav
bo_slowly.wav
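
# 'foreach' and 'if' combine naturally; for example (a small sketch
# reusing the files above) print only the names containing 'forward'

% foreach file ( `ls` )
foreach? if ( $file =~ *forward* ) echo $file
foreach? end
bo_forward.wav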


In case you're interested, here's some info on working with sounds.

The command 'mpg123' plays audio files in MPEG 1.0/2.0 format (layers
1, 2 and 3).  MPEG files have to be decoded before they can be played
and 'mpg123' enables you to convert an audio file to a format that can
be handled by other command line tools.  The -s option pipes the
output to the standard output as raw (headerless) linear PCM audio
data, 16 bit, stereo, host byte order.  The -w option converts the
file to WAV format (used by Microsoft) and sends it to the standard
output (use - instead of a file name) or writes it to a specified
file.  This command also permits you to speed up the sound (-d n) by
only playing every n'th sample or slow it down (-h n) by playing each
sample n times.  You can learn more about 'mpg123' at its web site
(http://www.mpg123.de/).  If you really want to manipulate sound files
using fancy filters and synths, learn about 'sox' (for Sound Exchange)
(http://sox.sourceforge.net/), a command line tool that provides lots
of filters, special effects, mixers and synthesizers to play with.
You can use 'sox' to reverse the samples in a sound file which is
particularly useful in looking for satanic verses hidden in music
files.  Another useful sound library, called Enlightened Sound Daemon
(http://www.tux.org/~ricdude/overview.html) provides tools for playing
sounds.  All of these commands are available on the department Linux
machines, but note that while these machines have soundcards (which
are necessary for generating sound) they don't have speakers and so
you'll have to bring headphones if you want to experiment with sound.

Here are some short tips. Use 'info' and 'man' to learn more:

mpg123 - MPEG decoder

mpg123 -w file (--wav)       # write file in wav format (- for standard output)

mpg123 -s file (--stdout)    # write file in headerless PCM, 16bits, host byte order

mpg123 -d n (--doublespeed)  # play every nth sample

mpg123 -h n (--halfspeed)    # play every sample n times

mpg123 -h 4 bo_diddley.mp3   # here's how I made some of the tracks

mpg123 --wav bo_slowly.wav --halfspeed 8 bo_diddley.mp3 


SoX - sound exchange - universal sound sample translator

sox in.wav out.wav trim 0:00:00.5 0:00:05.5  # grab a 5 second sample

sox in.wav out.wav reverse                   # reverse the file 

sox bo_diddley.wav bo_short.wav trim 0:00:01.5 0:00:08.5


Enlightened Sound Daemon - esd, esdcat, esdfilt, esdplay, esdsample

esdplay bo_short.wav       # plays WAV files and many other formats 

2.2  Exercises for Monday, September 20, 2004

Supplementary tips and questions for 'Ciphers and Secrets' 

http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2

Shell scripts with lists of commands provide one way of stringing
together a sequence of computations.  The different computations /
commands are connected together using files and variables - the only
memory that shell scripts have.  Variables and files are the means
whereby different computations communicate with one another.  In this
exercise, we'll explore another method for computations to
communicate, Unix pipes, which you were introduced to in Chapter 1.
As demonstrated in Chapter 2, pipes allow you to do things that you
might accomplish using intermediate files, but without the clutter and
overhead of saving, reading and deleting files.  We'll start by
looking at some commands that work particularly well with pipes.


1. Here are some commands that will come in handy in this exercise.

   # 'translate' all uppercase characters to lowercase characters
   % cat text.txt | tr "A-Z" "a-z" 

   What does 'tr "0-9" "a-z"' do? 
   % echo "3.14" | tr "0-9" "a-z"

   What does 'tr "a-z" "0-9"' do? 
   % cat text.txt | tr "a-z" "0-9" 

   # delete all occurrences of the numeric characters 0-9
   % echo "Pi is approximately 3.14" | tr -d "0-9" 

   What does 'tr -dc "a-z \n"' do? 
   % cat text.txt | tr -dc "a-z \n"
 
   # 'squeeze' consecutive spaces into a single space using -s
   % echo "compresses long   sequences  of  spaces" | tr -s " " 

   What does 'tr -s " " "\n"' do?
   % cat text.txt | tr -s " " "\n"

   # eliminate repeated lines in a sorted list
   % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq 

   What does 'uniq -c' do?
   % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq -c 

   # 'explode' a file of lowercase words into characters
   % cat text.txt | sed 's/\([a-z ]\)/\1@/g' | tr -s "@" "\n"

   Explain how this script works - \(RE\) 'saves' the match in \1.
   Why not use sed 's/\([a-z ]\)/\1\n/g'?  (The -s option squeezes
   runs of '@' into one before translating them to newlines.)

   # 'implode' a file - basically the inverse of 'explode'
   % cat text.txt | tr "\n" " " | sed 's/\([A-Za-z]\) /\1/g' | tr -s " "

   Explain how this script works. 

   # 'sort' in reverse numerical order using the first field in a
   # file, then resolve any ties by sorting on the second field
   % cat text.txt | sort -k1,1nr -k2

   This is the sort of special-purpose application of 'sort' 
   that you have to figure out every time you need to use it.

   What does the following one-line shell script do? 

   % paste abc.txt 123.txt

   where abc.txt and 123.txt are text files.  Can you guess
   what the 'cut' command does? 

   % paste abc.txt 123.txt | cut -f 2
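
   Just to see how these pieces compose, here is a sketch of a
   pipeline built entirely from the commands above: it 'explodes' a
   file into one character per line and then tallies how often each
   character occurs - a character histogram, most frequent first.

   % cat text.txt | tr -dc "a-z" | sed 's/\([a-z]\)/\1@/g' | \
     tr -s "@" "\n" | sort | uniq -c | sort -k1,1nr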

OK. Now we're going to apply what we learned to the task of creating
ciphers or coded messages.  I want you to think in terms of sending
text messages through pipes, encrypting a plain text message using a
secret code at one end of the pipe, sending it through the pipe, and
then decrypting it back into plain text at the other end.  We're going
to experiment with 'cracking' a particular class of coding schemes
called 'substitution codes.'

Samuel Pepys was a high-level government bureaucrat in the mid 17th
Century (he was more or less equivalent to the US Secretary of the
Navy).  He kept a diary for ten years beginning in 1660.  While often
cited for its accounts of Pepys's extramarital exploits, it provides a
unique glimpse of 17th Century life in London.  In his diary, Pepys
wrote a detailed account of the great fire of London and many other
important historical events that would be otherwise lost to us. He
also gossiped about anyone and everyone and revealed his most intimate
thoughts.  It's no wonder that he took pains to render his diaries
secret.

2. Here's what I did to create the mystery file. 

   % set code = `cat alpha.txt | explode | shuffle | implode`
   % set file = `ls pepys/*.txt | shuffle | head -1`
   % cat $file | \
     tr "A-Z" "a-z" | \
     tr -dc "a-z \n" | \
     tr "a-z" "$code" > secret.tmp

   This is equivalent to running the 'secret' command as follows:

   % secret pepys

   In the following steps, I create a bunch of temporary files of the
   form *.hst (histograms) and *.tmp.  You can delete the lot of them
   using the 'cleanup' command.

   You might want to scan the 'README' file in this directory to learn 
   a little more about the commands (shell scripts) used in this exercise.

3. Create one big file from all the diary entries in the ./pepys/ directory.

   % cat ./pepys/*.txt > pepys.tmp


4. Create a character histogram of the combined entries in ./pepys/.

   % charhist pepys.tmp > pepys.chars.hst
 

5. Create a character histogram of the mystery text.

   % charhist secret.tmp > secret.chars.hst


6. Put the histograms side-by-side for making comparisons.

   % paste pepys.chars.hst secret.chars.hst > compare.chars.tmp


7. Figure out which character was substituted for the space and
   then use 'tr' to put the spaces back in the right places. 

   % cat secret.tmp | tr "#" "#" > secret.space.tmp

   This isn't quite right!  Figure out what's wrong and fix it.
   Think about using uppercase letters to keep track of the
   substitutions you've made so far.


8. If you correctly identified the space character, you have a 
   chance to exploit statistical regularities in word lengths. 

   % wordhist pepys.tmp > pepys.words.hst 

   % wordhist secret.space.tmp > secret.words.hst


9. As before, compare the two histograms side-by-side.

   % paste pepys.words.hst secret.words.hst > compare.words.tmp


*. From here on out, you're on your own, but here are some suggestions.

   Use a combination of word and character histograms to try to guess
   character substitutions.  Make selected substitutions using 'tr' to
   experiment with various decryption alternatives. 

   There's a list of all the words in the 2nd Edition of the Webster
   Dictionary in /usr/share/dict/. You can use 'egrep' to find words
   that match a specified pattern (called a regular expression).  For
   example, the invocation

   % egrep "\<thr[aeiou][a-z]{1,2}l\>" /usr/share/dict/words
   thrail
   thrall
   thrill
   thripel
   thronal

   matches all words in the dictionary that begin with 'thr' followed
   by a vowel, followed by at least one but not more than two letters,
   followed by an 'l'. 

   It's tricky making individual substitutions; if you replace each
   occurrence of 's' with 'g' then you'll confuse an inserted 'g' with
   a 'g' that was already present in the encrypted text.  Figure out a
   way of handling substitutions that won't confuse things.
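
   For example (the guess here is purely hypothetical): if you suspect
   that the cipher letter 'g' stands for plaintext 's', substitute an
   uppercase 'S' so that decoded characters can never be mistaken for
   not-yet-decoded lowercase ones.

   % cat secret.tmp | tr "g" "S" > secret.s.tmp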

   First person who decodes the file mystery.txt gets a perfect grade 
   on their choice of take-home quizzes.


Expanding on this exercise will be one of the projects you can select
for your midterm project. There is a lot you can do to build better
code-cracking tools.  Simple substitution codes are particularly
vulnerable to statistical techniques.  Think of some of the many
regularities in English words that you might exploit.  Look for common
sequences of letters like 'qu', 'ing', 'ed', and 'ie', and common words
like 'the', 'a', 'I', and 'you'.  Sometimes you want to quickly test a
hypothesis, e.g., do the letters 'jy' map to the letters 'ed'?  You
might have several such hypotheses.  How would you choose between
them?  How would you write a script to automate simple hypothesis
generation and evaluation?  You could rank substitutions by the number
of words that they reveal in the text.  How could you turn this
general rule of thumb into a procedure for evaluating substitutions?

2.3  Exercises for Wednesday, September 22, 2004

Some in-class exercises in writing short shell scripts. 

Exercises 1 and 2 are pretty easy and you should be able to handle
them without much trouble.  Exercise 3 is more challenging.
If you're having trouble with any of them, please ask for help.

1. Write a script that takes two or more numbers specified as
arguments on the command line and uses a 'while' loop to print out
each number that is evenly divisible by the first number in the list.

% divides 3 5 7 9 11 12
3
9
12

2. Write a script that takes the name of an existing directory
(which we'll refer to as the 'target' directory) as its only argument,
creates a new directory named 'images' in the target directory if a
directory of that same name doesn't already exist, and then uses a
'foreach' loop to copy every file with extension ".gif" or ".jpg" in
the target directory to the 'images' directory.  To test your script,
create a directory called 'test' and then use the 'touch' command to
create a bunch of files with extension ".gif" and ".jpg".

3. Write a script that takes the names of two text files as its only
arguments.  The first file is just an ordinary text file that might
contain, for example, the text of a mail message or a term paper.  The
second file is of a special form: each line consists of a commonly
misspelled word followed by a space followed by the correct spelling of
the word, for example

% cat mspl.txt
tey the
teh the
arguement argument
lawer lawyer
plaintif plaintiff

Your script should correct the spelling in the ordinary text file using
the corrections in the file named by the second argument.  For example, 

% cat msg.txt
teh lawer's arguement was pretty weak
but tey plaintif still got a big settlement
% spellfix msg.txt mspl.txt
the lawyer's argument was pretty weak
but the plaintiff still got a big settlement

Here are some hints that will help in writing your script

% set fix = ( `cat mspl.txt` )
% echo "$fix[1] => $fix[2]"
tey => the

Note that the following straightforward approach to making the 
substitutions doesn't work:

% echo "tey door closed" | sed 's/$foo[1]/$foo[2]/g'
tey door closed

The problem is that 'sed' has its own interpretation of $ (it is used
in a regular expression to match the end of a line) which is different
from the shell's interpretation.  Somehow we want to force the shell
to 'interpolate' shell variables by looking up their values and
inserting the values into the 'sed' command.  One solution involves
the careful use of quotes.  Note that 

% echo "sed 's/$fix[1]/$fix[2]/g'"
sed 's/tey/the/g'

The outer double quotes render the inner single quotes as literals -
that is to say they don't behave like single quotes usually do in
suppressing the interpolation of shell variables.  This means that the
shell variables do get interpolated.  Now we can expand on this and
use 'eval' to force evaluation of the quoted string as follows:

% eval "echo 'tey door closed' | sed 's/$fix[1]/$fix[2]/g'"
the door closed

The shell command 'eval' is often used in exactly this fashion. There
are many possible solutions to this exercise; I want you to solve this
problem using a 'while' loop that repeatedly uses 'sed' to rewrite the
input file by making a different substitution on each iteration. Note
that you'll probably want to use one or more temporary files to manage
this iterative solution.
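
Here is one possible skeleton (a sketch only - the temporary file
names are my own choices and there are other ways to structure the
loop):

#!/bin/csh
# spellfix - fix misspellings in $1 using the pairs listed in $2
set fix = ( `cat $2` )
cp $1 spellfix.tmp
while ( $#fix >= 2 )
  # force the shell to interpolate the current pair into 'sed'
  eval "cat spellfix.tmp | sed 's/$fix[1]/$fix[2]/g' > spellfix.new.tmp"
  mv spellfix.new.tmp spellfix.tmp
  # discard the pair we just used
  shift fix
  shift fix
end
cat spellfix.tmp
rm spellfix.tmp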

2.4  Exercises for Friday, September 24, 2004

Quiz for Monday September 27, 2004 and tips on debugging

I can only assume that most of you are doing just fine in writing
shell scripts; I held more than nine open office hours this past week
and only a few of you came for help.  You're going to need to know how
to write shell scripts in order to complete your midterm projects.

I want to remind you that this course does NOT assume any prior
knowledge of programming or computer science.  You will learn what you
need to know about programming.  Right now we're studying shell
scripts because they provide the best means for you to leverage
powerful libraries.  Writing programs in a suitable scripting language
is the most common way that professionals in many disciplines carry
out routine computing tasks in their daily work whether their work
involves bioinformatics, physics, mathematics, business, engineering
and the social sciences.  But you won't learn to write shell scripts
if you don't practice.

There is no reading for Monday, but THERE IS A QUIZ; you're to do
the following two exercises.  Print out your solutions and bring 
them to class on Monday morning.  

Here are the exercises:  

1. Exercise 1 that we did in class on Monday required you to use a
   'while' loop.  I want you to redo the exercise but this time use a
   'foreach' loop.  Here's the statement of the original exercise:

   Write a script called 'divides' that takes two or more numbers
   specified as arguments on the command line and prints out each
   number that is evenly divisible by the first number in the list.

   % divides 3 5 7 9 11 12
   3
   9
   12

   If you want something a little more challenging, print out only
   those numbers that are equal to the first number raised to some
   (integer) power.

   % powers 3 5 7 9 11 12
   3
   9

   And if that isn't challenging enough, write a sieve for primes: 

   % primes 2 3 4 5 6 7 
   2
   3
   5
   7 

2. Write a script called 'sumrows' that takes two arguments
   corresponding to files containing columns of numbers, and then
   prints out the sum of rows in the two columns.  Here's an example
   illustrating the script's behavior:

   % cat one
   3
   4
   % cat two
   5
   7
   % sumrows one two
   8
   11

   Your solution should exit with an appropriate message if the
   lengths of the two columns are different.  The keyword 'exit' will
   force a shell script to terminate.


In debugging your scripts, here are some common errors to check. Make
sure you DON'T put a dollar sign before a variable when you're setting
it either using 'set' or '@'.  Make sure you DO put a dollar sign
before a variable anywhere else you want to refer to its value.  Make
sure that you use '@' and not 'set' when you want to do some
arithmetic operations when setting the value of a variable.  Don't
confuse the different ways of quoting expressions.  In particular don't
confuse the backtick operator ` (which is used to force the evaluation
of an expression as in "set x = `ls`" or "foreach file ( `ls` )")
with the single quote which can be used when you want to prevent the
evaluation of variables.  Remember to end every script with a final
carriage return and remember to end every 'while' or 'foreach' loop
with the 'end' keyword. The 'end' keyword should appear on a line all
by itself with no additional spaces.  You can always use 'info csh'
to learn more about the C shell.

If you've carefully checked for all the above problems and your script
is still buggy, the next thing to do is to strategically insert some
diagnostic print statements so you can get some idea of what's going
on in your program as it executes.  For example, if you have a script
that uses the variable 'foo' and the program crashes at some point
where it refers to '$foo', insert 'echo $foo' right before the
statement that you suspect is causing problems.  You can put this sort
of diagnostic print statement anywhere in you program, including
the body of a loop. 
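
For example, if a 'while' loop over the variable 'i' is misbehaving,
a diagnostic line like the following placed inside the body of the
loop will show you the value of 'i' on every iteration:

echo "DEBUG: i = $i"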

2.5  Exercises for Monday, September 27, 2004

More practice writing shell scripts

Here are some additional hints for debugging shell scripts:

 In previous exercises, I suggested that you add print statements of
 the form 'echo $variable' to examine the value of key variables at
 strategic points in your code.  I also gave you tips about things to
 check when you encounter errors, e.g., make sure that you DON'T use
 the dollar sign when setting a variable but that you DO use it when
 you want to know the value of a variable.

 The 'csh' command takes a number of options but two of them are
 particularly useful in debugging.  The 'echo' option specified by
 '-x' displays each line of the script after variable substitution but
 before execution.  The 'verbose' option specified by '-v' displays
 each line of the script just as you typed it.  You can use these
 options individually or in combination.  You can apply these options
 in a number of ways.  You can add them to the directive that appears
 at the beginning of a shell script as in:

#!/bin/csh -xv

 You can also call 'csh' with these options and the name of the file
 containing the script you're trying to debug:

% csh -xv script

 You can use 'set' and 'unset' to turn the corresponding shell
 variables 'echo' and 'verbose' on and off selectively, for example, 
 in a particular block of code where you think you have a problem:

% cat script
#!/bin/csh
...
...
set verbose 
set echo 
...
...            <<< suspected problem area
...
unset verbose
unset echo
...
...

 Try these two options out so you'll be prepared when you run into a 
 really nasty bug.
 

Now let's quickly review the major flow-of-control constructs.


'foreach' - do something for each item in a list of items

foreach variable ( list )
 something
 something
end

% set r = 0
% foreach n ( 1 2 3 4 5 6 7 8 9 )
foreach?  @ r += $n
foreach? end
% echo $r
45

What do each of the following do?  Search forward to the section
on 'command substitution' to review the use of the backtick (also
called a backquote) notation. 

foreach file ( `ls` )
  echo $file
end

foreach word ( `cat file` )
  echo $word
end

foreach word ( `grep pattern file` )
  echo $word
end


'if' - do something on the condition that 

if ( condition ) something 

or 

if ( condition ) then
  something
  something
endif

or

if ( condition ) then
  something
  something
else if ( condition ) then
  something
  something
endif 

% set x = 0
% if ( $x == 0 ) echo "Big fat zero!"
Big fat zero!
% if ( ! ( $x > 0 ) && ! ( $x < 0 ) ) echo "Big fat zero!"
Big fat zero!
% set x = 1
% if ( $x != 0 ) echo "Ain't zero!"
Ain't zero!
% if ( ! ( $x == 0 ) ) echo "Ain't zero!"
Ain't zero!
% if ( ! $x ) echo "Big fat zero!"
% set x = 0
% if ( ! $x ) echo "Big fat zero!"
Big fat zero!

What's going on in the last two conditionals? 


1. Write a script 'sorthat' that sorts the words in a specified file
   into three new files: a file named 'nums' containing all the words
   corresponding to numbers or words beginning with numbers, a file
   named 'alphs' containing all the words that don't begin with a
   number and either begin with the letter 'L' (whether uppercase or
   lowercase) or some letter that appears earlier in the alphabet than
   'L', and finally a file named 'bets' containing all the words that
   begin with the letter 'M' or some letter that appears later in the
   alphabet than 'M'.  Hopefully my instructions are easier to
   understand than they were to compose.  You can use the 'tr'
   (translate) command to convert the characters in the file to
   lowercase and eliminate any characters other than numbers or
   lowercase letters.

   Comparing alphanumeric strings turns out to be a little awkward
   (the pun is definitely intended) in the C-shell.  The operators
   '<', '>', '>=' and '<=' only work on numbers, and, while '==' and
   '=~' work with strings, they don't allow us to establish an
   ordering among strings.  The 'awk' utility comes to the rescue:

% set u = aardvark
% set v = albatross
% echo $u $v | awk '{ print ( $1 < $2 ? $1 : $2 ) }'
aardvark

   It looks like we're just using '<' which we had available in the
   C-shell, but the 'awk' '<' is more powerful than the 'csh' '<'. 
   The above invocation is equivalent to the following which employs
   slightly more conventional syntax:

% echo $u $v | awk '{ if ( $1 < $2 ) print $1 ; else print $2 }'
aardvark

   We can actually simplify the 'awk' script since all we really care
   about is whether or not the first argument 'u' is less than the
   second argument 'v'.  For this all we need is the following:

% echo $u $v | awk '{ print ( $1 < $2 ) }'
1
% echo $v $u | awk '{ print ( $1 < $2 ) }'
0

   The 'awk' command is pretty complicated (it has its own specialized
   scripting language) and I don't recommend that you spend a lot of
   time learning about its intricacies apart from picking up the few
   idioms that I've shown you in various scripts.  

   You can save this as a shell script and give it a mnemonic name to
   make it a little more convenient to use.  Note that numbers are
   always lexicographically 'less' than words beginning with
   alphabetic characters.

% cat stringlessthan
#!/bin/csh
echo $1 $2 | awk '{ print ( $1 < $2 ) }'
% stringlessthan aardvark albatross
1
% stringlessthan aardvark 17 
0
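
   Once 'stringlessthan' is saved and made executable, it drops into
   conditionals via the backtick operator; for example (made-up words):

% if ( `stringlessthan apple m` ) echo "sorts before m"
sorts before m
% if ( ! `stringlessthan pear m` ) echo "sorts after m"
sorts after m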

   Now armed with the 'stringlessthan' command this exercise ought to
   be relatively straightforward.  Go back and look at the ciphers and
   secrets exercise (09-20-04.txt) to recall how to use 'tr' to
   preprocess the input file.


'while' - do something as long as some condition is met

while ( condition )
  something
  something
end

% set n = 10
% set i = 1
% set r = 0
% while ( $i < $n )
while?   @ r += $i
while?   @ i ++
while? end
% echo $r
45

You can 'nest' loops; you can have 'foreach' loops inside of 'while'
loops or vice versa.  The 'break' command allows you to terminate the
inner-most loop in which the 'break' appears.  Here's a simple example
illustrating the use of the 'break' statement in checking a list of
numbers for primes:

% cat primes
#!/bin/csh
foreach n ( $argv ) 
  @ d = $n / 2
  while ( $d > 1 )
    @ r = $n % $d
    if ( $r == 0 ) break
    @ d --
  end 
  if ( $d == 1 ) echo $n
end

2. Write a command 'diagonal' that takes a single integer argument and
   outputs a diagonal matrix with ones along the diagonal and zeros
   off the diagonal.  You'll need one new command to write this
   script.  The command 'printf' allows you to print formatted text.
   While 'echo 1' prints a '1' it also prints a carriage return
   (newline). The command 'printf "1 "' prints a 1 followed by a space
   but no newline.  If you want to print a new line you would include
   a newline character ('\n') in the string argument to 'printf' as in
   'printf "0 \n"'.  You can read more about 'printf' using 'info' but
   the above description will suffice for implementing the 'diagonal'
   script.

% diagonal 3
1 0 0 
0 1 0 
0 0 1 
% diagonal 5
1 0 0 0 0 
0 1 0 0 0 
0 0 1 0 0 
0 0 0 1 0 
0 0 0 0 1 
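
Here is one possible skeleton for 'diagonal' (a sketch in the style
of the skeletons appearing later in these exercises - filling in the
print statements is the interesting part):

#!/bin/csh
set n = $argv[1]
set i = 1
while ( $i <= $n )
  set j = 1
  while ( $j <= $n )
    ...
    >> print "1 " or "0 " depending on whether $i equals $j <<
    ...
    @ j ++
  end
  printf "\n"
  @ i ++
end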


Command substitution

The backtick or backquote enables you to assign a string or variable
the output of a command.

% echo "`date +%h` is the `date +%m`th month."
Sep is the 09th month.

When assigned to a variable the output of a command results in the 
variable being assigned a list (or array) of words

% ls
09-15-04.txt    09-20-04.txt    09-22-04.txt    09-24-04.txt    09-27-04.txt
% set exercises = ( `ls` )
% echo $exercises
09-15-04.txt 09-20-04.txt 09-22-04.txt 09-24-04.txt 09-27-04.txt

As with any list you can determine how many items it contains or refer to
any particular item in the list by its index.

% echo $#exercises
5
% echo $exercises[1]
09-15-04.txt
% echo $exercises[2]
09-20-04.txt

You can also refer to a subsequence of the words in the list by specifying a
beginning and ending index.

% echo $exercises[2-4]
09-20-04.txt 09-22-04.txt 09-24-04.txt


Some advanced topics for the determined and intrepid student.


Using subroutines in shell scripts - advanced topic #1

You've already been using subroutines in shells; every time you invoke
a command inside of a script you're executing a subroutine.  In some
cases, we execute a command for its 'side effects'.  For example, you
might execute 'mkdir images' to create a directory.  The creation of
the new directory is called a 'side effect'.  In other cases, e.g.,
command substitution (above), we execute a command for the output 
returned to the shell.

Unless you take pains to avoid it, the output of commands is directed
to the standard output and you see the output scrolling across your
screen.  It doesn't matter whether you execute a command directly in
the shell or indirectly by executing a script which then executes
other commands:

% cat testout
#!/bin/csh
echo "Starting 'testout'"
subtest
echo "Finishing 'testout'"
% cat subtest
#!/bin/csh
echo "Executing 'subtest'"
% testout
Starting 'testout'
Executing 'subtest'
Finishing 'testout'

You can, of course, redirect output to files using '>' or '>>' and you
can pipe the output from one command into another command as in 'ls |
wc -l' (which counts the number of files and directories in the
current working directory).  You can also use the backtick operator as
just mentioned to set a variable as in 'set var = `ls`' or specify the
list in a 'foreach' loop as in 'foreach var ( `ls` ) ... end'. In the
following, we illustrate how the output generated by one command
can determine the flow of control of another command by using
conditional statements.  First we define a command that checks to see
if its only argument is prime.

% cat isprime
#!/bin/csh
set n = $argv
set p = 1
@ d = $n / 2
while ( $d > 1 && $p )
  @ r = $n % $d
  if ( $r == 0 ) set p = 0
  @ d --
end 
if ( $p ) then
  echo 1
else 
  echo 0
endif

The echo commands are placed so that 'isprime' outputs a one (1) if its
sole argument is prime and a zero (0) otherwise.  The 'isprime' command is
now a general utility that we can use in any number of contexts.  We
can use the 'isprime' command in a conditional statement:

% if ( `isprime 3` ) echo "Prime!"
Prime!
% if ( ! `isprime 4` ) echo "Composite!"
Composite!

We can also use 'isprime' to write a more compact version of the
'primes' command:

% cat primes.sub
#!/bin/csh
foreach n ( $argv ) 
  if ( `isprime $n` ) echo $n
end


Exiting from a shell command using 'exit' - advanced topic #2

The 'exit' command allows you to exit from a script at any point.
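
For example, here's a tiny made-up script that quits as soon as it
sees the number 3; nothing after the 'exit' ever runs:

% cat stopat3
#!/bin/csh
foreach n ( $argv )
  if ( $n == 3 ) exit
  echo $n
end
% stopat3 5 4 3 2 1
5
4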

3. Rewrite the 'isprime' command without the 'p' variable using 'exit'
   to terminate the loop when a divisor other than one is found.


Getting user input in a shell script - advanced topic #3

The pseudo variable '$<' substitutes a line read from the standard
input, with no further interpretation.  It can be used to read from
the keyboard in a shell script.  See if you can figure out what the
following script does:

% cat guess 
#!/bin/csh
set n = `date +"%M"`
while ( 1 )
  set m = $<
  if ( $m > $n ) then
    echo "Too high!"
  else if ( $m < $n ) then
    echo "Too low!"
  else 
    echo "You got it!"
    exit
  endif
end

If you're ambitious you might write a shell script that plays a
game of tic-tac-toe with a user. 


Using the exit status of a command - advanced topic #4

Every Unix command returns an exit status.  By convention, if the
command is successful, it returns an exit status of zero.  If the
command fails, then it returns an exit status greater than zero with
the exact value depending on how it failed; for example, grep returns
one (1) if it can't find the pattern it's looking for and two (2) if
it can't find the file supplied as its second argument.

In the C-shell, the 'status' variable is set to the exit status of
the last command executed:

% grep while isprime 
while ( $d > 1 && $p )
% echo $status
0
% grep foreach isprime 
% echo $status
1

In writing a C-shell script you can use the keyword 'exit' with an
argument to make the script return a particular exit status based on
whatever you define as success or failure. 


4. Rewrite the 'isprime' command yet again, this time write it to have
   an exit status of zero (success) if its only argument is prime and
   one (failure) otherwise.  Your solution should contain the
   invocations 'exit 0' and 'exit 1' and no 'echo' statements.  Here
   is what we might expect of your new version of 'isprime':

% isprime 3
% echo $status
0
% isprime 4
% echo $status
1

There is a special syntax that allows you to use the exit status of 
a command in an 'if' statement; you use curly braces instead of 
the usual parentheses.  The logic of using commands as conditional
is a little twisted however.  Zero is false in a conditional while 
one is true. 

% if ( 0 ) echo "No!"
% if ( ! 0 ) echo "No Way!"
No Way!
% if ( 1 ) echo "Way!"
Way!

In evaluating a command, if the command is successful (that is to say
it returns an exit status of zero), then the curly braces tell the
shell to return a one.  If the command fails (it returns an exit
status greater than zero), then the curly braces tell the shell to
return a zero.

% if { isprime 3 } echo "Prime!"
Prime!

Now we can rewrite the 'primes' command as follows:

% cat primes
#!/bin/csh
foreach n ( $argv ) 
  if { isprime $n } echo $n
end

2.6  Exercises for Friday, October 1, 2004

1. Here's a simple web page written in HTML:

   <html> 
   <head>
      <title>A Web page</title>
   </head>
   <body>
   <h1>Heading One</h1>
   <p>
   This is a paragraph.
   </p>
   </body>
   </html>

   Using this as a model create a home page with some basic contact and 
   biographical information and then open it using your favorite browser.


2. Consult the on-line HTML help to figure out how to add italics, bold 
   faced fonts and itemized lists.  Learn about 'anchor' tags and add a 
   link to the course home page from your home page. 
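
   In case you get stuck, the tags in question look roughly like this
   (a minimal illustration - substitute your own text and link):

   <i>italicized text</i>
   <b>bold faced text</b>
   <ul>
     <li>first item in an itemized list</li>
     <li>second item</li>
   </ul>
   <a href="http://www.cs.brown.edu/~tld/talk/home.html">course notes</a>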


3. Define the following concepts: client-server model, communications
   protocol. 
   

4. Learn about the network that allows you to access the web from your
   dormroom.  Explain the idea of network topologies and networks of
   networks.


5. Here are the protocols that make the world-wide web possible

   HTTP (hypertext transfer protocol)
   FTP (file transfer protocol)
   TCP (Transfer Control Protocol)
   IP (Internet Protocol)

   Explain how each of these protocols plays a role in the sequence of
   events that follows when you click on the anchor text of a link on
   a web page displayed in a browser window.


6. HTML (hypertext markup language) is now being supplanted by XHTML. Use the
   web to learn about XML and XHTML. What's the difference? 




2.7  Exercises for Monday, October 4, 2004

Last class we discussed the client-server model, protocols, and the
world-wide web.  Your browser (Explorer, FireFox, Mozilla, Safari) is
an HTTP client which enables it to interact with HTTP servers such as
the Apache web server that serves up web pages for the computer
science department.  There are other simpler HTTP clients such as 
'curl' (for client URL) that supply more limited functionality:

% curl "http://www.cs.brown.edu/~tld/talk/home.html"

Most browsers do a good deal more than simply request and fetch web
pages; a browser is said to 'parse' HTML documents and then 'render'
them for display in a browser window.  In this exercise, we'll write
an HTML parser and simple text-based renderer.  There's a very handy
text-based browser called 'lynx' that we'll use as a model for our
code.  This first invocation causes 'lynx' to take over a terminal
window running a shell in much the same way as 'info' does when
invoked on the name of a shell command.

% lynx "http://www.cs.brown.edu/~tld/talk/home.html"

The next invocation causes 'lynx' to fetch a specified page, parse the
HTML file, remove the HTML tags and then 'dump' the result to the
standard output.  You can redirect the output to a file if you want to
save the result.  The -nolist option directs 'lynx' not to list the
hypertext links at the end of the listing as it normally would do.

% lynx -dump -nolist "http://www.cs.brown.edu/~tld/talk/home.html"

Parsing involves revealing the structure of a text.  When you parse a
sentence, something that you may have done in grade school, you
identify the subject, verb, direct object, prepositional phrases,
etc. and how they are related to one another.  In parsing an HTML
document we'll focus on identifying the HTML tags.  Technically this
is referred to as 'lexical analysis' but it is generally considered as
a part of the parsing process.  Its primary purpose for us is to
highlight the tags and thereby simplify subsequent processing.

We're going to use a 'sed' program to implement our parser.  A 'sed'
program is basically a list of 'sed' commands in a file.  Recall the
syntax of the 'sed' substitution command.

% echo "Damien has TA hours tonight." | sed 's/Damien/Vanessia/g'
Vanessia has TA hours tonight.

The usage 's/PATTERN/REPLACEMENT/g' is one of the most commonly used
'sed' commands; the 's' refers to substitution, the PATTERN is a
regular expression with much the same syntax as 'grep', and the 'g' is
the 'global' option which makes 'sed' make the substitution for every
occurrence of the pattern in a line of text.

We can run a batch of 'sed' commands at the same time by putting them
in a file, instructing 'sed' to load the file using the -f option and
then applying the commands, in the order they appear in the file, to
each line in the standard input.

% cat change.sed
s/Damien/Vanessia/g
s/he/she/g
s/his/her/g
% echo "Damien left town; he will miss his class." | sed -f change.sed
Vanessia left town; she will miss her class.

1. Modify the file 'change.sed' so that it correctly handles both the
   above example and the following one:

   % echo "Damien left town. He will miss his class." | sed -f change.sed
   Vanessia left town. She will miss her class.


2. Create a 'sed' program called 'viewpoint.sed' that produces the 
   following behavior.  This one is a little tricky. 

   % echo "you make me feel like I am tall." | sed -f viewpoint.sed
   I make you feel like you are tall.


 In parsing an HTML file, we want to convert each HTML tag to a
 canonical identifier called a 'token'.  Our parser should ignore the
 case of tags (though strictly speaking tags should be in lower case
 to comply with XHTML syntax) and it will ignore most of the
 options and formatting directives that are embedded in tags. Here is
 a simple command that converts the tag that signals the beginning of
 an HTML document; the 'g' option isn't really needed here as we don't
 expect multiple occurrences of the 'html' tag, but we will need it
 most of the time and it doesn't hurt.  The replacement string adds
 space before and after the token that we use to mark the occurrence
 of the 'html' tag.

s/<html>/ @BEGIN_HTML@ /g

We'll need another command to handle an uppercase tag:

s/<HTML>/ @BEGIN_HTML@ /g


3. How would you handle both lowercase and uppercase tags with a
   single 'sed' command?


4. That was pretty easy, but how would you manage a 'table' HTML tag 
   which takes options as in: 

   '<table align="center">'

   In this case we need a little more complicated regular expression.
   Suppose we want our parser to convert this tag into two tokens:

   ' @BEGIN_TABLE@ @CENTER@ '  

   Similarly, '<table align="left">' would be converted into 

   ' @BEGIN_TABLE@ @LEFT@ '  

   What would the 'sed' command look like?


5. Suppose we want to completely ignore the content of an anchor tag so that

   '<a href="http://www.cs.brown.edu/~tld/my_talk/home.html#name">text</a>'

   would be converted to 

   ' @BEGIN_ANCHOR@ text @END_ANCHOR@ '

   Write the sequence of 'sed' commands to implement this transformation.


6. What would happen if the '<' of a tag is on one line and the '>' is on
   another line?  How might you avoid this problem?


7. With the exception of a 'verbatim' mode (implemented using the
   'pre' tag), HTML pretty much ignores white space.  It recognizes
   separate words but doesn't distinguish among space, tab and
   linefeed characters or strings of such characters.  All formatting
   is specified by using appropriate HTML tags.  For example, the 'p'
   tag is used to format paragraphs; <p> is used to signal the
   beginning of a paragraph and </p> is used to signal the end of a
   paragraph.  Given that spaces, tabs and linefeeds are irrelevant
   except to distinguish separate words, without losing information 
   we can convert an HTML document into a list of words using the
   backtick operator and then iterate through the list of words.

   Write a script 'format' that takes as its only argument the name of
   a file consisting of words with HTML 'p' tags and formats the
   output with an indentation of two spaces and a maximum line width
   of 72 characters.  The text alignment should be ragged right and
   there should be a blank line separating paragraphs.

   Here's the basic skeleton for your script:

   #!/bin/csh
   set max_width = 72
   set line_width = 0
   foreach word ( `cat $argv` )
     if ( "$word" == "<p>" ) then
       ...
       >> print statements <<
       ...
     else ... >> control logic << ...
       ...
       >> print statements <<
       ...
     endif
   end

   You'll need additional 'if', 'then' and 'else' keywords to
   implement the control logic.  Recall that 'printf' is more general
   than 'echo' in that it doesn't automatically add a linefeed, but
   you can always add line feeds by explicitly inserting them into the
   format string as in 'printf "\nText\n\n"'.

   You'll want to count the length of each word to stay within the
   maximum line width.  The wordcount command 'wc' counts words (-w),
   lines (-l), characters (-m), and bytes (-c).  What do the following
   incantations accomplish: 'ls *.jpg | wc -l', 'cat file.txt | wc -w' 
   and 'grep "<p>" file.html | wc -l'?  The following script will 
   come in handy in implementing your 'format' command:

   @ width += `echo $word | wc -m`

   By the way, you can define an alias for this short script by using
   the 'alias' command in the C-shell.  Many programmers use 'alias'
   to provide names to commonly used scripts or to redefine commands
   as in 'alias rm "/bin/rm -i"' which always uses the interactive
   (safe) option when running 'rm' to delete a file.  Typically these
   aliases are defined in a file called ".alias" which appears in the
   user's home directory and which is loaded by the user's ".cshrc"
   file by including the line "source .alias". In defining an alias,
   '\!^' refers to the arguments to the alias.

   alias length "echo \!^ | wc -m"

   Now you can refer to the alias as you would any other command and
   the name of the alias, if chosen judiciously, provides the reader
   with a better idea what's going on.

   @ width += `length $word`

   Note that in implementing word length in this way we compute the
   length of a word as the number of characters plus one.  (Why is
   this so and how might we avoid it?)  For this application, however,
   this is exactly what we want. (Why?)
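
   For example, 'echo' appends a newline character and 'wc -m' counts
   it along with the four letters:

   % echo four | wc -m
   5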

   Appendix A below contains an example of 'format' input and output.


8. A case statement can come in handy when your flow-of-control logic
   has to deal with several alternatives.  Use 'info csh' to read
   about the syntax for using 'switch' together with 'case' and then
   reimplement 'format' using a case statement.  The basic skeleton
   for your script should look something like the following:

   #!/bin/csh
   set width = 0
   set max_width = 72
   foreach word ( `cat $argv` )
     switch ( $word )
       case '<p>':
         ...
         breaksw
       case '...':
         ...
         breaksw
       default: 
         ...
     endsw
   end

9. Now put it all together.  Write a 'sed' program that handles
   'html', 'head', 'title', 'body', 'h1' (heading), 'h2', 'h3', 'p'
   and 'a' (anchor) tags and ignores the rest. Then write a renderer
   that can format appropriate output for these tags.  Your renderer
   should work like 'format' except that it uses the tokens generated
   by your parser instead of raw HTML tags.

   If you're ambitious, I'll show you how to get input from the user
   in a shell script, and then using 'curl' you can actually write a
   simple browser.  A suitably expanded version of this exercise would
   serve as a midterm project.


Appendix A: An example of 'format' input and output

% cat paragraphs.html
<p>
The csh is a command language interpreter incorporating 
a history mechanism, job control facilities, interactive file name 
and user name completion, and a C-like syntax. 
It is used both as an interactive
login shell and a shell script command processor.
</p>
<p>
The break flag forces a halt to option processing, causing 
any further shell arguments to be treated as 
non-option arguments.  
The remaining arguments will not be interpreted as shell 
options.
This may be used to pass options to a shell script without 
confusion or possible subterfuge.
</p>
<p>
The shell repeatedly performs the following sequence of actions: 
a line of command input is read and broken into words.
This sequence of words is placed on the command history list 
and parsed.
Finally each command in the current line is executed.
</p>

% format paragraphs.html
 
  The csh is a command language interpreter incorporating a history 
  mechanism, job control facilities, interactive file name and user 
  name completion, and a C-like syntax. It is used both as an 
  interactive login shell and a shell script command processor. 

  The break flag forces a halt to option processing, causing any 
  further shell arguments to be treated as non-option arguments. The 
  remaining arguments will not be interpreted as shell options. This 
  may be used to pass options to a shell script without confusion or 
  possible subterfuge. 

  The shell repeatedly performs the following sequence of actions: a 
  line of command input is read and broken into words. This sequence of 
  words is placed on the command history list and parsed. Finally each 
  command in the current line is executed. 

2.8  Exercises for Friday, October 8, 2004

In class last Friday, I told you that most web pages on web servers
are written in HTML.  This statement was once true but is no longer.
Today, most web pages are generated by running scripts on the web
server; these scripts construct HTML (or XHTML) files (referred to as
'dynamic content') on the fly and then send them on to the client.

The scripts are written in any number of languages including the shell
scripting language that we've been working in, Perl - a language that
is somewhere between our shell language and C, Python - a recently
introduced language designed for much the same range of applications
as Perl, PHP - a language designed from the ground up for dynamically
generating web content, and even Scheme - a dialect of Lisp that 
was introduced in Chapter 1 and that we'll see again later in the course.

In this exercise, you'll write a script that takes as input a table
consisting of database records generated by an SQL query and produces
as output an HTML file featuring an HTML table in which the rows are
the records and the columns are the fields in the database table.
We'll assume that the database table is formatted as we did in our
database exercise: lines are records and the fields are separated
using semicolons as delimiters.

You'll need to know a little more HTML to carry out this exercise.
Save the following HTML code to a file named 'table.html' and then 
open it using a browser.

<html>
<head>
 <title>Title</title>
</head>
<body>
<table align="center" border="1">
 <caption>Table Caption</caption>
 <tr>
  <th></th>
  <th>COLUMN 1</th>
  <th>COLUMN 2</th>
 </tr>
 <tr>
  <th>ROW 1</th>
  <td>DATUM 1,1</td>
  <td>DATUM 1,2</td>
 </tr>
 <tr>
  <th>ROW 2</th>
  <td>DATUM 2,1</td>
  <td>DATUM 2,2</td>
 </tr> 
 <tr>
  <th>ROW 3</th>
  <td>DATUM 3,1</td>
  <td>DATUM 3,2</td>
 </tr> 
</table>
</body>
</html>

The 'table' tag takes many more options than shown but the two shown
are pretty common: 'align' can be 'right', 'left' or 'center' and
'border' is an integer greater than or equal to zero.  The 'tr' tag
inside a 'table' environment is used to specify a row of data, the
'td' and 'th' tags are used to display the data for a cell (or box) in the
table with the 'th' form typically reserved for row or column labels
(in most browsers the 'th' form is displayed in a bold face type).
Experiment with 'table.html' by adding rows and columns and altering
the alignment.  You can set the alignment in rows or cells using the
same syntax as shown for the 'table' tag; alignment specifications in
a 'table' tag refer to the alignment of the table on the page while
alignment specifications in 'tr', 'th' and 'td' tags determine the
arrangement of text in cells.


1. Create a table with columns 'First', 'Last', 'Age' and 'Gender'.
   Populate the table with 3 rows of made-up data.  The column headers
   should be in bold faced type and centered in their cells, the first
   and last names should be left justified, ages should be right
   justified and gender identifiers ('M' or 'F') should be centered in
   the gender column.


2. Write a script that reads from the standard input the result of a
   SQL query and produces as output an HTML document that features an
   HTML table displaying the query result.  For this exercise, the
   result of the SQL query is assumed to have one record per line with
   fields delimited by semicolons.  For this exercise, you need not
   label the columns or rows.  Here's an example of the sort of input 
   your script should be able to handle:

   % cat employee
   Carla;Carbuncle;21;F
   Nathan;Normal;33;M
   Alice;Altavista;23;F
   Ernest;Sequitur;26;M
   
   In writing this script, it would be nice if you could go through
   the file line by line just like the 'awk', 'grep' and 'sed'
   utilities do.  Fortunately, the C-shell scripting language provides
   just such an option: '$<' substitutes a line from the standard
   input, with no further interpretation.  ('$<' is also useful to
   read from the keyboard in a shell script; check out 'Advanced
   Topics' (topic #3) in ./../../../doc/scripts.txt for an example 
   of an interactive script that implements a simple guessing game.)

   % cat number
   #!/bin/csh
   set line = $<
   set n = 1
   while ( "$line" != "" )
     foreach word ( $line )
       printf "$n - $word\n"
     end
     set line = $<
     @ n ++
   end
   % cat employee | number
   1 - Carla;Carbuncle;21;F
   2 - Nathan;Normal;33;M
   3 - Alice;Altavista;23;F
   4 - Ernest;Sequitur;26;M

   That's a start, but how do you read the individual fields in each
   record?  As a hint, try replacing 'foreach word ( $line )' with
   'foreach word ( `echo $line | tr ";" " "` )'.  This doesn't quite
   work because, as you might recall, we used the semicolon as a
   delimiter precisely because if we used the space character as a
   field separator we would not be able to distinguish spaces
   separating words within an individual field from those separating
   distinct fields.  Judicious use of 'sed' will fix this however.
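
   One possibility (a sketch - it assumes the '@' character doesn't
   otherwise occur in the data): protect the spaces inside fields
   before splitting on semicolons, then restore them inside the loop.

   % set line = "Sally Ann;Smith;26;F"
   % echo $line | sed 's/ /@/g' | tr ";" " "
   Sally@Ann Smith 26 F

   Each '$word' can then be restored with 'echo $word | tr "@" " "'.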

   Once you understand how to break up the input into records (lines)
   and fields (sequences of words separated by semicolons), the only
   tricky part left involves writing one loop nested inside a second
   loop to construct the table.  The outer loop reads in a line from
   the standard input and produces the required 'tr' (row) tags, while
   the inner loop breaks up the line into words and produces the
   requisite 'th' (column) tags and associated text.


3. Rewrite your solution to Exercise 2 but this time assume that the 
   first two lines of the input file include the names of the fields
   and alignment type for displaying the data in each column.  The
   resulting HTML file should display the field names at the top of 
   each column and format the data accordingly. 

   % cat employee
   First;Last;Age;Gender
   left;left;right;center
   Carla;Carbuncle;21;F
   Nathan;Normal;33;M
   Alice;Altavista;23;F
   Ernest;Sequitur;26;M


4. Rewrite your solution to Exercise 3 to take two optional arguments.
   If a second argument appears on the command line it is assumed to
   be a string intended for use as the title of the HTML page.  If a
   third argument appears on the command line, it is assumed to be a
   string intended for use as the caption of the HTML table.


5. Rewrite your solution to Exercise 4 to support two command-line
   options.  Command-line options can be handled in shell scripts just
   as you would handle any argument; typically the script scans 'argv'
   to extract options from the rest of the command line arguments,
   assigning variables and performing additional computations as
   appropriate to carry out the options as specified.

   Use this alternative directive as the first line of your script:

   #!/bin/csh -b

   The '-b' option to the C-shell forces a 'break' from option
   processing, causing any further shell arguments to be treated as
   non-option arguments.  The remaining arguments will not be
   interpreted as shell options.  This is used to pass options to
   a shell script without employing convoluted text processing. 

   % cat options
   #!/bin/csh -b
   foreach word ( $argv )
     if ( "$word" == "-u" ) echo "dash ewe"
     if ( "$word" == "-l" ) echo "dash ell"
   end
   % options -l misc -u misc misc 
   dash ell
   dash ewe

   Use the '-d' option to specify an alternative delimiter.  Your
   script should use the space character as the default delimiter if
   no '-d' option is specified.  Use the '-a' option to indicate that
   the line encoding the alignment directives is present.  If the '-a'
   option is not present on the command line, then you can assume that
   the line specifying alignment directives is not present in the
   input.


6. Figure out a way to use the script that you wrote for Exercise 5 in
   conjunction with the scripts that we wrote in the database exercise
   to produce nicely formatted output. (Think about how you might
   invoke a browser in a script to display query results.)

2.9  Exercises for Wednesday, October 13, 2004

1. In the following program, each line marked with a "<<<" has a bug.
   Briefly describe each bug and the problem it would cause when the
   program is run.

#!/bin/csh
set line = $<
echo "<table align="center" border="1">"            <<< 
while ( "$line" != " " )                            <<<
  printf <tr>\n                                    <<<
  foreach word ( `echo $line | sed 's/;/ ;/g'` )    <<<
    printf "<td>"                                   <<<
    if ( "$word" == " ; " ) then                    <<<
      printf "</td>\n"
    else 
      printf $word
    endif
  end
  printf "</tr>\n"                                  <<<
end                                                 <<<
echo "</table>"

2. There are multiple ways of doing just about everything.  For 
   the exercise of building an HTML table from a file containing
   semicolon-delimited database records, we suggested that you use
   'sed' to expand each record into a list of separate words and 
   then use a 'foreach' loop to build the rows of the table field 
   by field. An alternative approach would be to use 'sed' to 
   iteratively dissect each record into fields.  Describe what 
   each line of the following program does:

#!/bin/csh
set line = "Sally Ann;Smith;26;F"
printf "0 - field = [] ; line = [$line]\n"
set line_number = 1
while ( "$line" != "" )
  set field = `echo $line | sed 's/\([^;]*\);.*/\1/'`
  set line = `echo $line | sed 's/'"$field"'[;]*\(.*\)/\1/'`
  printf "$line_number - field = [$field] ; line = [$line] ; \n"
  @ line_number ++
end

3. You can also build records field by field.  Here's an example
   illustrating how you might do this in a script that interactively
   adds a record to a database table.  This version assumes that each
   field name is a single word.  Rewrite the script to support two
   additional features: (i) add code so that the script can handle
   multi-word field names and (ii) near the end of the script add code
   to display the completed record and prompt the user for a response
   (yes or no) that determines whether or not the record is actually
   added to the data file for the corresponding table.

% ls account.*
account.fields account.data
% cat account.fields
first;last;age;gender
% cat account.data
Sally;Smith;26;F
% cat insert
#!/bin/csh

# the first and only argument specifies a table
set table = $argv

# grab the field names for the specified table
set fields = ( `cat $table.fields | sed 's/;/ /g'` )

# create an empty record 
set record = ""

# build the record field by field 
foreach field ( $fields )
  # prompt the user for the value of a field
  printf "enter $field = "
  # append the entered value to the record
  set record = "$record;$<"
end

# strip off the leading semicolon and append 
# the completed record to the table data 
echo $record | sed 's/;//' >> $table.data

% insert account
enter first = Freddy
enter last = Jones
enter age = 21
enter gender = M
% cat account.data
Sally;Smith;26;F
Freddy;Jones;21;M

2.10  Exercises for Wednesday, October 20, 2004

The advantages and shortcomings of shell scripting languages. 

As an interactive shell, many programmers prefer the C-shell.  The
C-shell was written by Bill Joy, and the interactive extension 'tcsh'
provides an environment well suited to program development.  As a
scripting language, however, some programmers prefer that of the
Bourne-again shell 'bash'.

Bash has several advantages over the C-shell: it is distributed under
the GNU General Public License, it is the default shell under most
Linux distributions, it has a more expressive scripting language and
avoids some of the quirks of the C-shell, and it combines many of the
best features of the C-shell, the Korn Shell 'ksh' written by David
Korn, and the original Bourne shell 'sh' written by Steve Bourne.

Bash has a few disadvantages as well, the primary one for me being
that 'bash' is big and complex.  I like the spare elegance of the
C-shell while admitting to its shortcomings as a scripting language.
My philosophy, however, is that if the job is too complicated to
program elegantly in the C-shell, then I'd prefer to write it in a
scripting language like Perl or Python or in a general-purpose
language like C, Java or Scheme.

I like to use shell scripts to customize my working environment.
Anything that I have to do more than a few times is a candidate for a
shell script.  In addition to storing a few scripts in my '~/bin'
directory, I also have 'bin' directories sprinkled throughout my home
directory tree that are used for particular purposes.  Shell scripts
allow me to combine commands built into the shell with commands that
others have written and with my own personal commands.  The directory
tree in
which I wrote Talking With Computers has over a thousand individual
files and more than two hundred directories:

% find ~/book -type f -print | wc -l
    1369
% find ~/book -type d -print | wc -l
     220

I have shell scripts that put the book together out of hundreds of
pieces using the same technologies (version control systems and
'make' files) that programmers use to build large software systems.
Other scripts recompile the book web site and then install the web
pages on the department web server.  These scripts save me effort by
performing repetitive tasks and they also prevent errors - all I have
to do is make sure that I'm very careful when writing the scripts.

In the remainder of this course, the in-class exercises will turn to
the use of Scheme, a dialect of Lisp, which was one of the very first
general-purpose programming languages developed.  John McCarthy is
credited with the invention of Lisp, though he never really intended
it to be used for practical programming; for McCarthy it was an
elegant mathematical language for proving theorems about
computability.

But just because we're adopting another programming language doesn't mean
that we can't use that language to write shell scripts.  Indeed it is as
easy to write and use Scheme-based shell scripts as it is to write and 
use C-shell, bash, Perl or Python-based shell scripts; easier I would say
given the elegance and power of Scheme.  Here are some simple examples of 
Scheme-based shell scripts: 

% ls
args    hello   pipes
% cat hello
#! /usr/local/plt/bin/mzscheme -qr

;; This script implements the classic 'hello world' program.

(printf "Hello world!\n")

% hello
Hello world!
% cat args
#! /usr/local/plt/bin/mzscheme -qr

;; This Scheme shell script prints out its arguments to the
;; standard output, numbered as they appear on the command line.

(let ((n 1))
  (map (lambda (arg)
         (printf "~a - ~a~%" n arg)
         (set! n (+ n 1)))
       (vector->list (current-command-line-arguments))))

% args 1 2 3 four
1 - 1
2 - 2
3 - 3
4 - four
% cat pipes
#! /usr/local/plt/bin/mzscheme -qr

;; Here is a scheme shell script that reads lines from the 
;; standard input and then prints each line preceded with
;; a number indicating its order in the input sequence.

(let loop ((n 1))
  (let ((line (read-line)))
    (if (eof-object? line)
        null
        (begin (printf "~a - ~a~%" n line) (loop (+ n 1))))))

% ls | pipes 
1 - args
2 - hello
3 - pipes

Note that Scheme-based shell scripts take command-line arguments and
can be used with Unix pipes to read from the standard input and write
to the standard output.  I expect the syntax looks pretty foreign to
you, but I think you'll find that Scheme syntax is pretty simple (for
one thing there isn't a lot of it) and you'll find Scheme programming
environments more powerful and easier to use in debugging programs.

Here are some things that we'll gain by moving to Scheme as a
programming language:

 - a more expressive and more powerful programming language 
 - a language designed for developing larger applications 
 - a powerful set of libraries for building new applications
 - a graphical programming environment that ports easily to 
   all major platforms including Windows, Linux and Mac OS X. 
 - a rapid-prototyping language of proven power and value

It is true that programming in C, C++, C#, Java and even Perl is a
more marketable skill, but we're not trying to get you a job; we're
trying to provide you with the skills to make computers work for you,
and Scheme has few peers when it comes to ease of learning and the
ability to support rapid software development.  Whether you're a
biologist working in genomics, an analyst working in the financial
markets or an engineer sorting through experimental data, you'll find
Scheme a fine tool for practical programming.  And besides that, it is
an ideal first general-purpose programming language to learn.

You can write, debug and use Scheme programs much as you write, debug
and use C-shell programs by using Emacs and an interactive shell.
Here's a short transcript illustrating how you might interact with the
Scheme 'interpreter', a program that, much like the C-shell, issues a
prompt, reads expressions typed by the user, evaluates them and prints
any output: 

% which mzscheme
/usr/local/plt/bin/mzscheme
% mzscheme
Welcome to MzScheme version 208, Copyright (c) 2004 PLT Scheme, Inc.
> (printf "This is my first line of input.~%")
This is my first line of input.
> (+ 3.14 2.71 1/3 5)
11.183333333333334
> (* 7 (expt 3.14 2.71))
155.51682682201624
> (map + '(1 2 3 4) '(1 2 3 4))
(2 4 6 8)
> (map expt '(2 2 2 2) '(1 2 3 4))
(2 4 8 16)
> (regexp-replace "Damien" "Damien has hours tonight." "Vanessia")
"Vanessia has hours tonight."

There are a number of things to notice about this short session.  Some
of what I typed must look familiar: 'printf' looks like it works
pretty much the same as the shell variant. Of course, there are lots
of parentheses sprinkled about and indeed parentheses are ubiquitous
in Scheme; they constitute the most common method of defining syntax
in Lisp and most of its dialects.  You'll also notice that
mathematical operators are written 'prefix' instead of 'infix', a
convention referred to as Polish notation.  But what is more
interesting is that Scheme is pretty capable and accommodating when it
comes to math; I can add integers, rationals and real numbers without
invoking any special programs and operators like exponentiation 'expt'
come standard.  The two examples involving 'map' illustrate a method
of iteration that appears different from a 'foreach' loop, but is
actually related though much more powerful.  Finally, there are
built-in tools for pattern matching that are every bit as powerful as
those available in the C-shell.

The program that I invoked from the shell was called 'mzscheme' (as in
"Ms. Scheme"); it is one of a family of Scheme-related open-source
programs developed by a group of programmers and programming-language
experts affiliated with PLT Scheme Inc.  PLT also provides a graphical
programming environment called DrScheme that runs on all the major
platforms (Windows, Linux, Mac OS X, Unix); we recommend that you use
it when working in the lab and that you download and install it on
your personal computer.  DrScheme is available at
http://www.drscheme.org/.

The sophisticated procedures for doing math, pattern matching and 
the like provide the basis for writing powerful tools that can be
loaded into Scheme in the form of libraries.  These libraries extend
the syntax and the power of Scheme to suit particular purposes. 
There are Scheme libraries for working with databases, building 
spiders that search the web, creating fancy graphical front ends
for programs, and a host of other applications. 

2.11  Exercises for Friday, October 22, 2004

The Midterm Projects are due by class time this coming Monday, October
25.  You don't actually have to hand in anything.  We expect that all
of your project materials - code, documentation, sample data, etc. -
are in a directory called 'midterm' in your top-level directory on the
department machines.  DON'T MODIFY ANY FILES OR SUBDIRECTORIES WITHIN
THIS DIRECTORY AFTER 10:00AM ON MONDAY, OCTOBER 25, 2004.  We will use
the last modification dates of the files and directories in 'midterm'
to validate that you 'turned in' your project on time.

There have been more than 14 hours of help available to you this week
and Damien will have hours Sunday evening.  As always, you can send
email to cs009-2tas with course-related questions.  I've made sure
that any help I give to one of you becomes available to all of you by
updating the project README files.  The latest versions of these
README files are in the usual place:

  Account Management - ~/tmp/exercises/projects/accounts/README

  Cryptanalyst's Tool Kit - ~/tmp/exercises/projects/ciphers/README

  Dynamic Web Content - ~/tmp/exercises/projects/photos/README

These files have tips on how to structure your projects and solve
specific problems, relevant commands you might want to use, and
suggestions on how to 'spiff up' your projects.  For example, my
implementation of the 'select' command for the 'Account Management'
project uses an 'awk' script to 'pretty print' the database tables.
Feel free to use this for any of your projects:

% select first address last from account sort by last
----------+-------------+--------------
    Frodo |     Baggins |  Hobbit Town
----------+-------------+--------------
    Bilbo |     Baggins |  Hobbit Town
----------+-------------+--------------
 Hermione |     Granger |     Hogworts
----------+-------------+--------------
    Harry |      Potter |     Hogworts
----------+-------------+--------------
    Wally |  Ringwraith |       Mordor
----------+-------------+--------------

We've provided lots of help and allowed you to collaborate with one 
another, but we expect that your final project is your work alone. 
In particular, don't copy code from your classmates or take code from
sources other than those we provide.  Recall that you will have to 
demonstrate your code to a TA and they will ask you questions about
how your code works.  

In addition to having working code, we expect your 'midterm' directory
to be coherently organized and documented with a README file just like
the files and directories in ~/tmp/exercises/ are structured.  All of
your scripts should be in a 'bin' directory inside of 'midterm' and
this 'bin' directory should have a README file that provides a
high-level description of each command, e.g., see
'./../../../04/09/20/bin/README' for an example of what we're looking
for.  Each of your shell scripts should be carefully documented, e.g.,
see './../../../projects/accounts/bin/select' for an example.
Finally, you should have a 'midterm/doc' directory with documentation
for the project; this will depend on the project and doesn't have to
be lengthy, but it should describe the project in your own words and
discuss any extensions, simplifications or modifications you might
have made.

Next week we will start working on Scheme.  You should read Chapter 5
'Computational Muddles' for Monday; there won't be a quiz for Monday
since you have the midterm to turn in, but we will have a quiz on this
chapter next Wednesday.  In the remainder of this class we'll talk a
bit about different programming languages, what shell scripts are good
for and why you might want to learn to program in a general-purpose
language like C, Java or Scheme; see the discussion of the advantages
and shortcomings of shell scripting languages in the previous section.


2.12  Exercises for Monday, October 25, 2004

You can implement a powerful database system using just a handful of
standard shell commands.  Most of the work is involved in setting up
the calls to 'sort', 'cut' and 'join' by mapping symbolic field names
to integer offsets into the text records.  It took around 400 lines of
code and less than a day to write this and more than 350 lines were
required to parse the input and map the field names.  This attests to
the power of the C-shell scripting language.  Add the many libraries
available for manipulating images, sounds, HTML, etc and you can see
why scripting languages are so useful in everyday computing.

Here's a simple demonstration of the 'select' and 'insert' commands:

% cd /u/tld/tmp/exercises/projects/accounts/bin
% ls
insert  pretty  restore select  simple

% select account
----+--------+-------+----------
 id |  first |  last |  address
----+--------+-------+----------

% select all from account
---+-----------+-------------+------------------------------
 1 |     Frodo |     Baggins |      Hobbiton Village, Shire
---+-----------+-------------+------------------------------
 2 |     Bilbo |     Baggins |      Hobbiton Village, Shire
---+-----------+-------------+------------------------------
 3 |     Harry |      Potter |  Hogworts School of Wizardry
---+-----------+-------------+------------------------------
 4 |  Hermione |     Granger |  Hogworts School of Wizardry
---+-----------+-------------+------------------------------
 5 |     Wally |  Ringwraith |    Minas Morgul, Morgul Vale
---+-----------+-------------+------------------------------

% insert account
enter first = Freddy 
enter last = Jones
enter address = Orient, North Carolina
---+---------+--------+-------------------------
 7 |  Freddy |  Jones |  Orient, North Carolina
---+---------+--------+-------------------------
insert: add to 'account' (yes or no): yes
The record was added to the database.

% select first address from account sort by first
----------+------------------------------
    Bilbo |      Hobbiton Village, Shire
----------+------------------------------
   Freddy |       Orient, North Carolina
----------+------------------------------
    Frodo |      Hobbiton Village, Shire
----------+------------------------------
    Harry |  Hogworts School of Wizardry
----------+------------------------------
 Hermione |  Hogworts School of Wizardry
----------+------------------------------
    Wally |    Minas Morgul, Morgul Vale
----------+------------------------------

% select first last sale.product from account cross sale on id = account
----------+-------------+---------------------------
    Frodo |     Baggins |       Big Book of Spiders
----------+-------------+---------------------------
    Bilbo |     Baggins |         Magic Ring Tricks
    Bilbo |     Baggins |       Introductory Elvish
----------+-------------+---------------------------
    Harry |      Potter |       Potions for Dummies
----------+-------------+---------------------------
 Hermione |     Granger |  One Hundred Nifty Spells
----------+-------------+---------------------------
    Wally |  Ringwraith |  Planning a Second Career
----------+-------------+---------------------------

% select -p first last sale.product from account cross sale on id = account
Frodo;Baggins;Big Book of Spiders
Bilbo;Baggins;Magic Ring Tricks
Bilbo;Baggins;Introductory Elvish
Harry;Potter;Potions for Dummies
Hermione;Granger;One Hundred Nifty Spells
Wally;Ringwraith;Planning a Second Career

% select -p first last address from account | grep "Hob[a-z]*n" | pretty
-------+----------+--------------------------
 Frodo |  Baggins |  Hobbiton Village, Shire
-------+----------+--------------------------
 Bilbo |  Baggins |  Hobbiton Village, Shire
-------+----------+--------------------------


Whether your project was the Account Management System, the
Cryptanalyst's Tool Kit or Dynamic Web Content, you likely learned a
lot about how to make a shell do your bidding.  You probably
encountered several of the limitations of basic scripting languages:
the variety of syntax used in different tools like 'awk', 'sed' and
'grep', and the contortions you have to go through in some cases to
pass information back and forth between the parent shell and the
subshells in which commands are run.  Did any of you notice how
variables in a parent shell are isolated from those in a subshell?
Because the variable-passing conventions in most shells are relatively
primitive, files and pipes are often the best way of passing
information from one part of the computation to another.  Even simple
data structures like
arrays don't always behave as one might expect.  These are just some
of the reasons why programmers turn to general-purpose languages like
C, C++, C#, Java and Scheme for implementing large software projects.
Here are some things that are easy in these general-purpose languages
but difficult in shell programming.


           >>>> mapping field names to record offsets <<<<

Let's say you have a list of field names: 

(define fields '( id first last address ))

Here's how you would create a 'field map'.

(define field_map (let ((i 0)) 
                    (map (lambda (field) 
                           (set! i (+ i 1)) 
                           (list field i)) 
                         fields)))

Here's how you define a general-purpose index function:

(define (index field map) (cadr (assoc field map)))

And here's how you would use the index function with the field map:

(index 'first field_map)
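
Since the counter 'i' is incremented before each pair is built, 'id'
is assigned offset 1, 'first' offset 2, 'last' offset 3 and 'address'
offset 4, so the call above returns 2.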

In the second half of this course, we'll show you how to use Scheme to
build abstractions similar to 'index' for all sorts of tasks from
information retrieval to building web pages.

2.13  Exercises for Wednesday, October 27, 2004

Make sure that your DrScheme language is set to 'Pretty Big (includes
MrEd and Advanced)'.  To do so click on the 'Language' menu in
DrScheme and you'll find 'Pretty Big (includes MrEd and Advanced)'
under 'Professional Languages' (you'll have to click on the 'PLT' tab
to show the 'PLT' family of languages).  The 'Student Languages'
impose too many restrictions and cause more confusion than they are
worth.  Once you make this change you'll find that 'first' and 'rest'
are not defined.  So, either use 'car' and 'cdr' or define 'first' and
'rest'.  To do the latter, open a file in DrScheme, type in the
following two definitions and then 'Execute' the file.

(define first car) 
(define rest cdr)

Here are the notes that were handed out in class along with some
additional help and hints to make the assignment easier.  Please
follow the instructions; there really is method in my seeming madness
- I'm not just assigning you busywork.  I want you to actually type
the expressions to the DrScheme interpreter; don't just read the text.
Typing the expressions to the interpreter is a very effective learning
strategy; it is just like typing sentences in an unfamiliar foreign
language - since you can't rely on filling in the details from your
knowledge of your native language, you have to attend to the syntax
carefully.  Trust me, it works.  If you make a typing mistake, Scheme
will likely complain and you'll have to retype the expression thereby
learning something in the process.

These exercises are designed with this learning strategy in mind. Look
carefully at the examples.  If you understand the examples, it should
be easy to generalize enough to solve the exercise problems.  If you
really want to know more about the commands, then use the 'Help Desk'
under the DrScheme 'Help' menu, but only do that as a last resort.
These exercises do assume that you've read Chapter 5 in the text.


Basic (Primitive) Data Types

Symbols: any sequence of characters including at least one character
other than a digit.  There are some exceptions: you can't include the
single quote, double quote, back quote or any parentheses {},[],().
In addition, the first character of a symbol can't be a sharp sign #.
To define a symbol 'name' to have the value of some 'expression', type
(define name expression).  Try variations on the following:

> (define foo 1)
> foo
1
> (define foo_bar 2)
> foo_bar
2
> (define 1_foo 3)
> 1_foo
3
> (define n (+ 1 2 3))
> n
6

If you type a defined symbol to Scheme, it will respond by printing the
'value' of the symbol.  If you type a symbol that hasn't been defined, 
Scheme will complain:

> bar
reference to undefined identifier: bar

So far we've been defining symbols to have values corresponding to
integers.  There are other sorts of numbers available in Scheme
including rationals, real numbers in decimal notation and various
kinds of floating point numbers:

> (define third 1/3)
> third
1/3

Another primitive data type is the string; in Scheme, strings are
always enclosed in double quotes:

> (define my_name "Tom Dean")
> my_name
"Tom Dean"

Unlike symbols, which have values assigned to them by 'define',
strings and numbers always evaluate to themselves:

> "Vanessia Wu"
"Vanessia Wu"
> 2.712
2.712

The single quote is shorthand for the 'quote' function. The quote
function suppresses evaluation.  The following two invocations are 
equivalent:

> 'foo
foo
> (quote foo)
foo

Note that by quoting a symbol you can refer to the symbol rather
than any value that it might have; you can also refer to a symbol
that doesn't have a defined value.  Every symbol has a name but
not all symbols have values. 
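
For example, even though we never defined 'bar' above, quoting it lets
us refer to the symbol itself without complaint:

> 'bar
bar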


List Processing in Scheme

Finally, one of the most important data types in Scheme is the list;
lists are used to store information, create more complex data
structures and even represent programs.  The 'list' function takes any
number of arguments and constructs a list whose elements correspond to
the values of its arguments:

> (list 1 2 3)
(1 2 3)
> (define ones (list "one" 1 'one ))
> ones
("one" 1 one)
> (define twos (list (+ 1 1) (/ 6 3) (/ (* 2 3) 2)))
> twos
(2 2 2)

Reflect on the last two definitions; the 'list' function evaluates its
arguments (arguments are like the words to the right of a shell
command).  The strings and the numbers evaluated to themselves, and
we used the quote to suppress the evaluation of the symbol 'one'.  You
can 'nest' lists simply by calling the 'list' function as one of the
arguments to 'list':

> (define nest (list 1 2 (list 3 4) 5 6))
> nest
(1 2 (3 4) 5 6)

You can also create lists using quote; in this case, the resulting 
list is specified literally and no evaluation takes place:

> (define nums '(1 2 3 4))

Notice what happens when we incorrectly try to create a nested list
inside a list specified with quote:

> (define nest '(1 2 (list 3 4) 5 6))
> nest
(1 2 (list 3 4) 5 6)

To achieve the intended effect we just specify what we want literally:

> (define nest '(1 2 (3 4) 5 6))
> nest
(1 2 (3 4) 5 6)

It doesn't do much good to create lists if you can't subsequently take
them apart and access their contents.  The most basic functions for
accessing lists are called 'car' and 'cdr'.  These names refer to
locations called 'registers' in the memory of a now-defunct computer.
You can use these names if you like, but you can also give them more
mnemonic names.  The 'car' of a list is the first element of the list
and the 'cdr' of a list is the list consisting of all the elements of
the list but the first.  As mentioned in class, programmers often
define 'first' and 'rest' as aliases for 'car' and 'cdr':

> (define first car)
> (define rest cdr)

What do you think the value of 'car' is? 

> car
#<primitive:car>
> cdr
#<primitive:cdr>

This indicates that 'car' and 'cdr' are defined as primitive
functions.  And now we've defined 'first' and 'rest' to have the same
definitions as 'car' and 'cdr' respectively.

> first
#<primitive:car>
> rest
#<primitive:cdr>

Now we can use these functions to access the contents of lists:

> (define nums '(1 2 (3 4) 5 6))
> nums
(1 2 (3 4) 5 6)
> (first nums)
1
> (rest nums)
(2 (3 4) 5 6)

We can use various combinations of 'first' and 'rest' to access any 
element or sublist of given list.

For example, suppose that you want to grab the 4 in '(0 (1 (3 4)) 5).

First we'll do it in several steps:

> (rest '(0 (1 (3 4)) 5))      <<< step 1
((1 (3 4)) 5)
> (first '((1 (3 4)) 5))       <<< step 2
(1 (3 4))
> (rest '(1 (3 4)))            <<< step 3
((3 4))
> (first '((3 4)))             <<< step 4
(3 4)
> (rest '(3 4))                <<< step 5
(4)
> (first '(4))                 <<< step 6
4

Now we put this all together and apply these steps in the order of 
evaluation which actually looks as if we're doing it backwards or at 
least inside out, but you should convince yourself that this is 
actually in the appropriate evaluation order.

  step 6 step 5 step 4 step 3 step 2 step 1 

> (first (rest  (first (rest  (first (rest '(0 (1 (3 4)) 5)))))))
4


1. Given the following list structure:

   > (define nums '(1 2 (3 4) 5 6))

   What do the following expressions return? 

   > (first (rest (rest nums)))

   > (rest (first nums))


2. Now use 'first' and 'rest' (or 'car' and 'cdr') to return each of
   the components 2, (3 4), 5 and 4,  given 'nums' as defined above.  
   Your solutions should look like 

   (first (rest (rest ... '(1 2 (3 4) 5 6))))

   with different sequences of 'first' and 'rest' invocations. 


Predicates for Testing and Comparing Primitive Types

Each of the primitive data types has a corresponding predicate for
testing membership.  A predicate is just a function that returns true
or false.  In Scheme, '#f' means 'false' and '#t' means 'true'.  The
'?' is often tacked on the end of predicates to identify them as such,
but the '?' is just a convention - it's not required when defining
your own predicates.

> (number? 1)
#t
> (string? "string")
#t
> (list? '(1 2 3))
#t
> (list? "string")
#f
> (symbol? 'a)
#t

The 'string<?' predicate is a single symbol consisting of eight
characters; we might have called it 'string-less-than?' instead of
'string<?'.  'string<?' is the string analog of '<' for numbers.

You can also check for special cases:

> (zero? 0)
#t
> (null? '())
#t

There are also variants of equality and inequality for every primitive 
type (substitute '>', '=', '<=', '>=' for '<' in the following to get 
the other variants). 

> (< 1 2)
#t
> (string<? "a" "b")
#t
> 

The '<' predicate is an example of a predicate that doesn't have the
'?'  tacked on the end.  Note that 'string=?', 'string<?', 'string>?',
'string<=?'  and 'string>=?' all use lexicographic order for making
comparisons whereas '=', '<', '>', '<=', '>=' use numerical order.

In Scheme '=' means numerical equality in contrast to the C-shell
where '=' is used for assignment (in 'set' and '@').  There is also an
equality predicate defined on more general types (note that for
symbols, lowercase/uppercase doesn't matter - case does matter for
strings however - Scheme can be configured so that case does matter
but the default is for it to be insensitive to case):

> (equal? 'a 'A)
#t
> (equal? 'a "a")
#f
> (equal? 1 (first (rest '(0 1 2 3))))
#t
> (equal? "a" "A")
#f
> (equal? 1 (- 2 1))
#t


Building Boolean Expressions Using Boolean Operators

We can combine predicate invocations with the boolean operators:
'and', 'or', 'not'.  The first, the so-called conjunctive operator
'and', takes any number of arguments, including zero; it returns true
#t if all of its arguments evaluate to #t and false #f otherwise:

> (and #t #t)
#t
> (and #t #t #f)
#f
> (and)
#t

We typically use boolean operators with predicates, e.g., 

> (and (list? arg) (not (null? arg)) (number? (first arg)))

returns #t if arg is a list, has at least one element, and its first
element is a number.
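
For example (recall that 'and' stops evaluating at the first #f, so
the later conjuncts never get a chance to complain when 'arg' isn't a
list):

> (define arg '(1 2 3))
> (and (list? arg) (not (null? arg)) (number? (first arg)))
#t
> (define arg "not a list")
> (and (list? arg) (not (null? arg)) (number? (first arg)))
#f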

The disjunctive operator 'or' works similarly; it returns #t if at
least one of its arguments evaluates to #t and false #f otherwise:

> (or #f #t #t)
#t
> (or #f)
#f
> (or)
#f

Again, the arguments typically involve predicates or nested boolean
expressions, e.g., 

> (or (and (number? arg) (> arg 0)) (symbol? arg))

returns #t if arg is a number greater than zero or a symbol.
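
Again, with a concrete argument:

> (define arg 'foo)
> (or (and (number? arg) (> arg 0)) (symbol? arg))
#t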

See if you can explain why '(or) => #f' and '(and) => #t'. 

The negation operator 'not' accepts exactly one argument and returns
#t if its only argument evaluates to #f and returns #f if its only
argument evaluates to #t.

> (not #t)
#f
> (not #f)
#t
> (not (and (number? 1) (or (string? "") (null? '(1 2 3)))))
#f

Note that functions defined on strings, numbers, non-empty lists, etc., 
will complain if you call them with the wrong type of inputs:

> (rest '())
cdr: expects argument of type <pair>; given ()
> (string=? 1 "one")
string=?: expects type <string> as 1st argument, given: 1; 
          other arguments were: "one"

For this reason, it is often important to check the type of arguments
before evaluating them.  Note that in evaluating an 'and', the Scheme 
interpreter won't evaluate the second conjunct if the first conjunct 
returns #f; we can use this fact to check arguments and avoid errors of
the sort illustrated above. 

> (define lst '())
> (and (not (null? lst)) (not (null? (rest lst))) (= 1 (first (rest lst))))
#f


3. Write a boolean expression for the body (the '...') of the following 
   function:

   (define (special-list-type? arg)
      ... )

   such that it returns #t for the following type of data structure:

   (list number number (list string) symbol) 

   and #f otherwise; so, for example: 
   
   > (special-list-type? '(1 2 ("foo") a))
   #t
   > (special-list-type? (list 1 4 (list "foo") 'a))
   #t
   > (special-list-type? (list 1 7 '("foo" "bar") 'a))
   #f

   To help you out, here's a solution for a somewhat simpler example:

   (list symbol number (list))

   (define (my-special-list-type? arg)
      (and (list? arg) 
           (not (null? arg))
           (symbol? (first arg))
           (not (null? (rest arg)))
           (number? (first (rest arg)))
           (not (null? (rest (rest arg))))
           (null? (first (rest (rest arg))))))

   Note that the first two conjuncts (list? arg) and (not (null? arg))
   have to be there to make sure that the third conjunct (symbol?
   (first arg)) won't cause an error on an input like '().  The fourth
   conjunct (not (null? (rest arg))) has to be there to make sure that
   the fifth conjunct (number? (first (rest arg))) won't cause an
   error on an input like '(1).  Similarly, the sixth conjunct
   prevents the seventh conjunct from causing an error.  Here are some
   examples showing my-special-list-type? in action:

   > (my-special-list-type? '(a 1 ()))
   #t
   > (my-special-list-type? (list 'foo 1.2 (list)))
   #t
   > (my-special-list-type? (list 'foo 1.2))
   #f
   > (my-special-list-type? 3.14)
   #f


Working With Conditional Statements   

Scheme conditionals are simpler and easier to use than conditionals in
the C-shell.  The 'else' clause is optional; however, if the condition
of a conditional statement lacking an 'else' clause evaluates to '#f'
then the result of the statement is undefined.

   > (if (symbol? 'a) 1 2)
   1
   > (if (null? '(1 2)) 1)
   > (if (and (number? 1) (string? "a")) 1 2)
   1

There is also a generalized conditional statement 'cond' which works 
like nested if-then-else statements and is very convenient for all 
sorts of complicated tests.  Here's the general format:

   (cond ( test_1 expression_1 )
         ( test_2 expression_2 )
         ( test_3 expression_3 )
         ( test_4 expression_4 )
         ( else   expression_5 ))

   =>

   (if test_1 
       expression_1 
       (if test_2 
           expression_2 
           (if test_3 
               expression_3 
               (if test_4 
                   expression_4 
                   (if #t   
                       expression_5 )))))

   Each ( test_n expression_n ) is referred to as a 'clause'.  Note
   that the first test to evaluate to #t will cause its corresponding
   expression to be evaluated.  The keyword 'else' evaluates to '#t'
   in the context of a conditional statement.  It should only be used
   as the test for the last clause - usually referred to as the
   'default' clause.  If no test evaluates to #t then the result of
   the 'cond' statement is undefined.

   > (define exp 1.2)
   > (cond ((list? exp) (printf "It's a list!~%"))
           ((symbol? exp) (printf "It's a symbol!~%"))
           ((number? exp) (printf "It's a number!~%"))
           (else (printf "I haven't the foggiest!~%")))
   It's a number!

   Note that 'printf' behaves pretty much like the C-shell 'printf'
   except that we use '~%' for '\n' and '~a' to interpolate a variable
   into the format string; so instead of
  
   printf "the $last number is $num\n"

   you would write 

   (printf "the ~a number is ~a~%" last num)

   For each appearance of '~a' in the format string there must be
   one additional argument to 'printf'. 

   > (define one 1)
   > (define two 2)
   > (define three 3)
   > (printf "~a is less than ~a is less than ~a~%" one two three)
   1 is less than 2 is less than 3
   > (printf "~a is less than ~a~%" one)
   printf: format string requires 2 arguments, given 1; 
           arguments were: "~a is less than ~a~%" 1


4. Now rewrite your solution to Exercise 3 using 'cond' so that it
   prints out useful informative statements about whether or not the
   input subscribes to the expected format:

   > (special-list-type-cond? '(1 2 ("foo") a))
   Looks good to me.
   > (special-list-type-cond? (list 1 "" (list "foo") 'a))
   I was hoping to see a number as the second element of your list.
   > (special-list-type-cond? (list 1 7 '("foo" "bar") 'a))
   Expecting a list consisting of a single item but got (foo bar).
   

A Word About Computer Literal-Mindedness

I thought it might be useful to reflect on the differences between
learning to program (where I am assuming that this involves actually
writing programs and trying to make them run on a computer) and
learning about some branch of history or philosophy or for that matter
learning mathematics.  The computer is a stern task master; it allows no
lapse in precision.  Some of the requirements for precision may seem
picayune - why should it matter if you leave out a semicolon!  But the
computer isn't smart enough to resolve ambiguities on its own; nor
would we want it to in many cases.  Most programmers learn to like the
fact that the computer takes us to task when we are less than
perfectly precise.

In discussing philosophy with a professor, usually the professor gives
you, the student, the benefit of the doubt, and you probably give the
professor more credit than he or she deserves.  It's easy to be lax in
talking with another human.  Even in learning, say, calculus, you can
make mistakes on a test and often get partial credit; the human grader
fills in what you meant.  But late at night trying to get your
programming assignment finished, the computer is obstinate and
unforgiving.  Somewhere in your code there is a missing semicolon but
you can't find it and every time you try to run your code, it crashes
spitting out cryptic comments.

We're not used to such unflagging and obstinate adherence to rigid
convention and we find it annoying - an affront to our intelligence.
But it is worth keeping in mind that such precision is a boon to a
skillful programmer.  It assures that the machine will carry out his
or her commands with perfect precision and repeatability.  It's quite
possible to write a program that doesn't do what the programmer
intends, but at least we are assured that the computer will execute a
piece of code the same today as tomorrow, the same whether you wrote
it or someone else wrote it and the same whether it is run on one kind
of a machine rather than another.

2.14  Exercises for Friday, October 29, 2004

Calling and Creating Functions

We're going to experiment with the DrScheme Stepper today.  The
Stepper only works in the Student Languages Beginner through
Intermediate; so go to the Languages menu and set the Language to
'Intermediate Student'.  Now type the following lines to the upper
pane of a DrScheme window.

(define (square num) (* num num))

(define (area radius) (* pi (square radius)))

(area 7)

The constant 'pi' is already defined in Intermediate Student.  Now
save the file as 'circle.scm' and click on the 'Step' button.  If you
repeatedly click on 'Step >' button in the stepper pane, you'll see
the steps that Scheme takes in evaluating the expression '(area 7)'.

In the stepper you'll see how Scheme replaces the invocation of a
function, e.g., (area 7), with the definition of the function, e.g.,
(* pi (square radius)), using the calling pattern, (area radius),
specified in the definition.  The values of the function arguments are
then substituted for the formal parameters in the function definition.
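
For the example above, the substitutions unfold along these lines (the
final value shown is approximate):

(area 7)
=> (* pi (square 7))
=> (* pi (* 7 7))
=> (* pi 49)
=> 153.93804...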

1. Write a function that computes the average of four test scores. 

   (define (average score_1 score_2 score_3 score_4)
     ... )

   > (average 90 86 97 75)
   87 


2. Write a function 'distance' that computes the Euclidean distance
   between two points in the plane.  If you recall from high school
   geometry, the distance between two points in the plane is the
   square root of the sum of the squared differences of the x and y
   components, i.e., for points p1 = (x1 y1) and p2 = (x2 y2) the
   distance from p1 to p2 is sqrt of ( (x1 - x2)^2 + (y1 - y2)^2 ). 
   Now translate this expression into Scheme:
  
   (define (distance x1 y1 x2 y2)
     ... )

   > (distance 0 1 1 0)
   1.4142135623730951

   Use the 'square' function above and the built-in 'sqrt' function. 
   Then use the stepper to step through the execution of an example 
   invocation of 'distance'. 

3. In this exercise, you'll define and use a 'point' data structure:

   (define (make-point x y) (list x y))
   (define (x-coord pt) (car pt))
   (define (y-coord pt) (car (cdr pt)))

   Rewrite the 'distance' function to work with the 'point' data structure:

   (define (distance pt1 pt2)
     ... )

   Create a couple of points: 

   (define point_1 (make-point 1 0))
   (define point_2 (make-point 0 1))
   
   And try out your function (distance pt1 pt2) with the stepper.

   > (distance point_1 point_2)
   1.4142135623730951   

Now type in the factorial function and save it in 'factorial.scm':

(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))

(factorial 3)

Again watch what happens in the stepper. 
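
You should see the recursive calls expand and then collapse along
these lines:

(factorial 3)
=> (* 3 (factorial 2))
=> (* 3 (* 2 (factorial 1)))
=> (* 3 (* 2 1))
=> (* 3 2)
=> 6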

Now experiment with the 'last' function.  'last' only works if it is
given a list consisting of at least one element, so we first check to
see if 'last' is given appropriate input.

(define (last arg)
  (if (and (list? arg) 
           (not (null? arg)))
      (aux-last arg)
      "expecting a non-empty list"))

We then define an auxiliary function that does all the real work. 

(define (aux-last lst)
  (if (null? (rest lst))
      (first lst)
      (aux-last (rest lst))))

> (last '(1 2 3))
3
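
Tracing the recursion makes clear how 'aux-last' walks down the list:
(aux-last '(1 2 3)) calls (aux-last '(2 3)), which calls
(aux-last '(3)); at that point (rest '(3)) is the empty list, so the
function returns (first '(3)), namely 3.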

4. Write a function 'count' that counts the number of elements in a
   list.  Use two functions 'count' and 'aux-count' as we did for
   'last'.  Your 'aux-count' should take a second argument that keeps
   track of how many elements have been encountered so far.

   > (count '(1 2 3 4 5))
   5

2.15  Exercises for Monday, November 1, 2004

Scheme Resources:

  Basic Introduction to Scheme - ~tld/tmp/exercises/doc/scheme.txt 

    This introduction is tailored to the material we'll cover in
    the exercises and supplements the descriptions in the book.

  Learn to use the DrScheme Help Desk

    You can search for information on any command or any concept.


More on List Processing

Recall that the 'list' function takes any number of arguments and
constructs a list whose elements correspond to the values of its
arguments:

> (list 1 2 3)
(1 2 3)

There's also a more primitive list constructor called 'cons' that
constructs lists in exactly the same way that 'car' ('first') and
'cdr' ('rest') take them apart.

> (define lst (cons 0 '(1 2 3)))
> lst
(0 1 2 3)
> (car lst)
0
> (cdr lst)
(1 2 3)
> (cons 0 (cons 1 (cons 2 (cons 3 '()))))
(0 1 2 3)
> (car (cdr (cdr (cdr '(0 1 2 3)))))
3


1. Create the list structure: (1 2 (3 (4) 5) 6) using just 'cons', 
   the numbers 1, 2, 3, 4, 5 and 6 and the empty list '()'. 


The 'append' function takes zero or more arguments that must evaluate
to lists and combines them into one big list:

> (append '(0 1 2) '(3 4 5) '(6))
(0 1 2 3 4 5 6)


2. What happens when you call 'append' with no arguments? 


Note that it is possible to define functions in Scheme (like 'append',
'list' and '+') that take a variable number of arguments in a manner
similar to the way shell commands use 'argv' to grab all the words on
the command line.  However, in many cases where you might think you
want a variable number of arguments what you really want is just a
single argument which is a list.

The 'cons' and 'append' functions will come in handy a little later
when we want to build lists from various pieces under the control of
a program.


Destructive Modification of Lists

The 'define' command is used to define the value of symbols.  When you
want to redefine a symbol you can use 'define' to do so, but once a
symbol is initialized to a particular value, subsequent changes are
typically made using another command called 'set!'.  The
'!' is often added to commands that are said to 'destructively modify'
their arguments.  The 'set!' command is like 'set' in the C-shell
except instead of writing 'set var = expression' you write '(set! var
expression)'.  For our purposes, the first argument to 'set!' is
expected to be a symbol but, unlike symbols appearing as arguments to
other functions, this first argument is not evaluated; 'set!' works
like 'define' does in setting a symbol to have a particular value.

> (define counter 0)
> counter
0
> (set! counter (+ counter 1))
> counter 
1 
> (set! counter (+ counter 1))
> counter 
2 

There are also commands that allow you to destructively alter the
structure of lists.  

> (define nums (list 1 2 3))
> (set-car! nums 7)
> nums
(7 2 3)
> (set-cdr! (cdr (cdr nums)) (list 4 5 6))
> nums
(7 2 3 4 5 6)


3. Explain the behavior of the following sequence of commands:
  
  > (define lst_1 (list 1 2 3))
  > (define lst_2 (cons 0 lst_1))
  > lst_1
  (1 2 3)
  > lst_2
  (0 1 2 3)
  > (set-car! (cdr lst_1) "two")
  > lst_1
  (1 "two" 3)
  > lst_2
  (0 1 "two" 3)


Iteration and Recursion

Scheme offers a variety of methods for implementing iterative algorithms.
We're going to begin by emphasizing recursion, but later we'll provide
iterative constructs that work like 'foreach' and 'while'.  Last week
we implemented a program that extracted the last element of a list:

(define (last lst)
  (if (null? (cdr lst))
      (car lst)
      (last (cdr lst))))

Let's morph 'last' into a program that uses 'cons' to copy a list:

(define (copy lst) 
  (if (null? lst)
      lst 
      (cons (car lst) 
            (copy (cdr lst)))))
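
A quick check that 'copy' really builds a new list - destructively
modifying the copy leaves the original untouched:

> (define orig (list 1 2 3))
> (define dup (copy orig))
> (set-car! dup 7)
> dup
(7 2 3)
> orig
(1 2 3)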
  

4. Now write a recursive function 'numbers' based on 'last' that 
   takes a single argument corresponding to an integer 'n' and 
   returns a list of the numbers 1 through n.

  > (numbers 5)
  (1 2 3 4 5)


5. Write a recursive function 'reverse' that takes a list and uses
   'append' to produce a new list in which the elements appear in
   reverse order.

  > (reverse '(1 2 3))
  (3 2 1)


6. Write another version of 'reverse' that uses 'cons' instead of 
   'append'.  In this case, define 'reverse' with two arguments:
   an input and an output variable: 

  > (define (reverse input output) 
      ... )

  > (reverse '(1 2 3) '())
  (3 2 1)


7. The 'assoc' function allows us to use so-called association lists.
   An association list is a list of pairs each pair consisting of a
   key and an item associated with that key.  For example, the
   following association list maps the names of fields to their
   offsets in a list representing a database record.

  > (define employee_fields '((id 1) (first 2) (last 3) (address 4)))

   We can use 'assoc' to find the offset for a field by specifying the
   field name as the key.

  > (assoc 'last employee_fields)
  (last 3)
  > (car (cdr (assoc 'last employee_fields)))
  3

   If the key cannot be found, 'assoc' returns false #f.

  > (assoc 'middle employee_fields)
  #f

   Write a 'my_assoc' function that duplicates the functionality of
   'assoc'.  You'll need some way of comparing keys to write this
   function and the 'equal?' predicate does the trick:

  > (equal? 'first  (car (car (cdr employee_fields))))
  #t


Global and Local Variables

When you define a variable at the top level (i.e., not inside the body
of a function), that variable is said to have 'global scope'; you can
reference it anywhere including inside the body of other functions as
long as they don't have a formal parameter of the same name, e.g.,

> (define num 2.2)
> (define (scale base) (* num base))
> (define (square num) (* num num))

The function 'scale' has access to the global variable 'num' while
'square' has a formal parameter of the same name that masks the global
'num'.  Within the 'square' function 'num' refers to the local
variable introduced as its single formal parameter.
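
For example:

> (scale 10)
22.0
> (square 10)
100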

There are times however when you want to introduce a 'local' variable
(besides a formal parameter) to keep track of some information in the
midst of carrying out a computation.  This is easily accomplished by
introducing variables by way of 'let'.

> (define num 1)
> (printf "num = ~a~%" num)
num = 1
> (let ((num 0))
    (printf "num = ~a~%" num)
    (set! num 17)
    (printf "num = ~a~%" num))
num = 0
num = 17
> (printf "num = ~a~%" num)
num = 1

Inside the body of the 'let' the local variable 'num' masks the 'num'
introduced by the 'define'. Moreover, changes to the local 'num' do
not affect the global value of 'num'.  The scope of 'num' is said to
be 'lexical' or 'static' in the sense that the scope of the local
'num' is restricted to the code within body of the 'let', i.e., (let
((num 0)) ... body ... ); what happens outside of the body of the let
doesn't affect the local value of 'num'.  If we call a function within
the body of the 'let', the definition of that function is outside the
scope of the 'let' and, unless it has a formal parameter with the
same name or introduces a local value with the same name using 'let',
changes made to the variable in the body of the function will be
reflected in the global variable:

> (define num 1)
> (printf "num = ~a~%" num)
num = 1
> (define (test arg)
    (set! num arg))
> (let ((num 0))
    (test 43)
    (printf "num = ~a~%" num)
    (set! num 17)
    (printf "num = ~a~%" num))
num = 0
num = 17
> (printf "num = ~a~%" num)
num = 43

Here's an example of a procedure that is difficult to write 
without the use of a local variable introduced using 'let'.

> (define pair (list 1 2))
> (define (swap pair)
    (let ((tmp (car pair)))
      (set-car! pair (car (cdr pair)))
      (set-car! (cdr pair) tmp)))
> pair
(1 2)
> (swap pair)
> pair
(2 1)

2.16  Exercises for Wednesday, November 3, 2004

Functional Programming

Functional programming is a style of programming that often makes for
very simple (and elegant) solutions to complicated problems.  Suppose
that you have three columns of numbers

> (define uno  '(1 2 3))
> (define dos  '(4 5 6))
> (define tres '(7 8 9))

and you want to add the rows, i.e., the sum of the first number in
each column, the sum of the second number in each column, etc. 
Here's how you do this using a 'map' operator:

> (map + uno dos tres)
(12 15 18)

The 'map' operator takes a function (in this case '+') and one or more
lists that will serve as the arguments to the function.  'map' takes
each 'cross section' of the lists, e.g., the first cross section in
the example above would consist of the numbers 1, 4 and 7, and then
applies the function to each cross section; it then returns a list of
the results.  This next example turns rows into columns:

> (map list uno dos tres)
((1 4 7) (2 5 8) (3 6 9))

If we don't have a function that does exactly what we want we can
define an anonymous function using 'lambda'.  When you specify a
function definition using 'define' the next two expressions produce
identical results:

(define (function_name arg_1 arg_2 ... arg_n) 
  body )

(define function_name
  (lambda (arg_1 arg_2 ... arg_n) 
    body ))
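
For instance, here are two equivalent ways to define a simple 'cube'
function:

> (define (cube x) (* x x x))
> (cube 3)
27
> (define cube (lambda (x) (* x x x)))
> (cube 3)
27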

However, in some cases, you want to define a special-purpose function
that will only be used in one place in your program.  Defining
single-use functions is useful in functional programming.  Here's how
you would cube a list of numbers using 'map' and 'lambda':

> (map (lambda (x) (* x x x)) '(1 2 3 4 5 6))
(1 8 27 64 125 216)


1. As a somewhat more practical example, suppose that you have three
   columns of numbers where the first column lists the prices of
   products that you are contemplating purchasing, the second column
   corresponds to the associated shipping costs, and the third the
   relevant state sales taxes

  > (define prices '(61 17 31))
  > (define shipping '(7 3 5))
  > (define taxes '(0.5 0.4 0.6))

   and you want to compute the total cost as the sum of price, shipping
   costs and price times tax rate.  Use 'map' and 'lambda' to define a 
   function 'total_cost' that performs this calculation: 

  > (total_cost prices shipping taxes)
  (98.50 26.80 54.60)


The 'apply' function takes a function and a list of items that 
can serve as arguments to the function:

> (apply + '(1 2 3 4))
10

Suppose you have a list of numbers corresponding to the prices of the
parts for a product you're thinking of manufacturing and you want to
know the percentage that each part contributes to the total cost (this 
is also referred to as normalizing a set of numbers).  Assuming, say, 
part costs of 14, 19, 41 and 26 dollars:

> (define costs '(14 19 41 26))
> (map (lambda (cost) (/ cost 1.0 (apply + costs))) costs)
(0.14 0.19 0.41 0.26)


2. In this exercise, you'll implement some of the basic functions used
   in linear algebra, a branch of mathematics important in computer
   graphics.  Use 'map' and 'apply' to compute the 'dot' or 'inner'
   product of two vectors represented as lists. The inner product is
   notated: u.v = ((u_1 * v_1) + ... + (u_n * v_n)) where u_i (v_i) is
   the ith component of the vector u (v):

  > (define u '(1 7 3)) 
  > (define v '(3 4 6))

  > (inner u v)
  49

   Now implement the 'magnitude' of a vector which is defined as the 
   square root of the sum of squares of the vector components.  The 
   magnitude is notated: |u| = sqrt ((u_1 * u_1) + ... + (u_n * u_n))

  > (magnitude u) 
  7.681145747868608

   Finally, implement a function that computes the cosine of the angle
   between two vectors calculated as the dot product of the vectors 
   divided by the product of their magnitudes:  u.v / ( |u|* |v| )

  > (cos-angle u v)
  0.8167801162284652


Iteration and Loops in Scheme

I certainly don't want to force you to implement iterative programs
using recursion if you find it difficult or somehow 'unnatural'.  Some
people find recursion more intuitive than 'foreach' or 'while' loops.
Perhaps it is just a matter of taste.  Scheme allows a variety of
iterative techniques besides recursion or the sort of iteration
implicit in the mapping functions common in functional programming.
One such very general iterative construct is the 'do' loop whose
general form is:

 (do (variable-initial-step-triples)
     (loop-exit-condition terminal-expressions)
   body )

where variable-initial-step-triples take the form

 ( variable initial-value step-value )

the loop-exit-condition is a test which is evaluated on each
iteration such that if the test returns true (#t) then the 
loop is terminated and the terminal-expressions are evaluated;
the loop returns the value of the last terminal expression. 

It's easier to understand if you just look at a simple example.
Here's an iterative version of factorial:

> (define (factorial n)
    (do ((i n (- i 1))
         (r 1 (* r i)))
        ((= i 1) r)))
> (factorial 4)
24

Note that 'factorial' has an empty 'body'.  The next example -
a test for primality - has one expression in its body: 

> (define (prime? n)
    (do ((factor (if (even? n) (/ n 2) (/ (- n 1) 2)) (- factor 1))
         (flag #t))
        ((or (= factor 1) (not flag)) flag)
      (if (= 0 (remainder n factor)) (set! flag #f))))
> (prime? 18)
#f
> (prime? 37)
#t


3. Write a function 'quality_check' using a 'do' loop that takes three
   arguments: a list of numbers corresponding to measurements taken of
   some critical parameter of a manufactured product, a second number
   corresponding to the optimal value for the measurement and a third
   number corresponding to the maximum deviation (+/-) from the
   optimal value that will be tolerated in a product to be shipped.
   Your 'quality_check' function should return the empty list if all
   of the measurements are within tolerance and otherwise return a
   list of all those measurements that were found to exceed the
   maximum allowed deviation.

  > (quality_check '(4.13 4.07 3.99 3.76) 4.00 0.10)
  (4.13 3.76)


Defining Your Own Syntax

If you're really pining for your old 'foreach' and 'while' loops we
could simply map the 'foreach' syntax onto the 'do' loop syntax.
Here's what the mapping would look like for the 'foreach' loop:

;; foreach var ( arg )     (do ((lst arg (cdr lst)) (var #f))
;;   command_1                 ((null? lst))
;;   ...               =>    (set! var (car lst))
;;   ...                     expression_1
;;   command_n               ...
;; end                       expression_n )


To get Scheme to do the mapping for you, you would use the functions
'define-syntax' and 'syntax-rules' to define a 'macro' so that
expressions beginning with 'foreach' would expand into the 'do' 
format shown above. Here's how you'd define the 'foreach' macro:

(define-syntax foreach
  (syntax-rules ()
    ((foreach var ( arg ) expr ... )
     (do ((_lst_ arg (cdr _lst_)) (var #f))
         ((null? _lst_))
       (set! var (car _lst_))
       expr ... ))))

Now we can write programs with 'foreach' loops pretty much as we would
in C-shell scripts; there might be a few extra parens, but presumably
that's a small price to pay if you really miss your 'foreach' loops:

> (foreach num ( '(1 2 3 4) )
     (set! num (* num num num ))
     (printf "~a ~%" num))
1 
8 
27 
64 

We could do the same sort of thing for 'while' loops:

;; while ( test )       (do () 
;;   command_1              ((not test))
;;   command_2      =>    expression_1
;;   ...                  expression_2 
;;   command_n            ...
;; end                    expression_n )

Here's how we could get Scheme to perform the mapping for us:

(define-syntax while
  (syntax-rules ()
    ((while ( test ) expr ... )
     (do () ((not test)) expr ... ))))
         
> (define num 4)
> (while ( (> num 0) )
    (printf "~a ~%" num)
    (set! num (- num 1)))
4 
3 
2 
1 

We could use 'define-syntax' and 'syntax-rules' to define a whole 
new language if we like.  Perhaps you'd like to design your own
idiosyncratic language; then you wouldn't have to put up with 
conforming to someone else's idea of how to write code. 


4. See if you can figure out from these examples how to use
   'define-syntax' and 'syntax-rules' to define a control structure
   'repeat' that mirrors the unix command of the same name.

  % repeat 64 printf "+"
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

  > (repeat 64 (printf "+"))
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2.17  Exercises for Monday, November 8, 2004

Today you'll get a taste for an area of computer science called the
analysis of algorithms.  Analysis of algorithms is considered a
branch of theoretical computer science and deals with algorithms and
data structures as mathematical objects using formal logic, graph
theory, combinatorics, statistics, probability and other areas of
discrete and continuous mathematics for purposes of analysis.


Association Lists and Hash Tables for Implementing Dictionaries

We want to associate words referred to as 'keys' with items in a
'dictionary'.  The items need not be definitions as in the case of a
conventional dictionary; in general, items can be any data objects
that we want to associate with the keys.  The dictionary itself is
implemented as a data structure and typically supports several
operations, e.g., insertion, deletion, and extraction or 'lookup'.
The implementation of these operations depends on the implementation
of the data structure.

In this exercise, we'll look at implementing dictionaries using
association lists and hash tables.  Recall that an association list is
a list of pairs whose first element is a key and whose second element
is the item that we wish to associate with the key.  For example, the
following association list associates the symbols corresponding to the
three-letter abbreviations for the months of the year with the strings
corresponding to their two-digit integer representation:

> (define months '((Jan "01") (Feb "02") (Mar "03") (Apr "04") 
                   (May "05") (Jun "06") (Jul "07") (Aug "08")
                   (Sep "09") (Oct "10") (Nov "11") (Dec "12")))

We can then use 'assoc' to lookup the two-digit representation when
given the three-letter symbol as a key:

> (printf "Today is 2004-~a-08.~%" (car (cdr (assoc 'Nov months))))
Today is 2004-11-08.

Here's a recursive definition for 'assoc' implementing a 'linear
search' algorithm: 

(define (assoc key alst)
  (cond ((null? alst) #f)
        ((equal? key (car (car alst)))  (car alst))
        (else (assoc key (cdr alst)))))

How long will 'assoc' take to associate a key with a given pair?  What
do we mean by 'how long'?  And, once we answer that, are we really
interested in the answer for a particular key?  How about a randomly
selected key?  Regarding 'how long' we probably don't mean elapsed
time on the clock; the elapsed time will vary depending on what sort
of computer we're using, what operating system we're running, what
programming language and compiler, etc.  Typically in analyzing
algorithms we focus on the most expensive operations and how often
they are performed.  In the case of 'assoc' the most expensive
operation involves comparing the key with the first item of a pair in
the association list using the 'equal?' function.

Clearly if we choose a key not in the association list or choose a key
corresponding to the last pair in the list, we'll have to check every
pair in the list.  Let's say that there are n pairs in the list; in
the case in which the key is not present, a failed lookup will require
n comparisons, i.e., n applications of the 'equal?' predicate.  It
doesn't matter if we implement 'assoc' as a recursive procedure or a
'while' loop; in either case we'll have to perform n comparisons.  For
this reason, we say that dictionary lookup using linear search has a
worst-case performance of 'order n' notated O(n) in the 'big-oh'
notation common in theoretical computer science.
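
If you're curious, here's the same linear search written iteratively
with a 'do' loop; the name 'assoc-do' is made up so as not to clobber
the built-in 'assoc', and it performs exactly the same sequence of
'equal?' comparisons as the recursive version:

(define (assoc-do key alst)
  (do ((rest alst (cdr rest)))
      ((or (null? rest) (equal? key (car (car rest))))
       (if (null? rest) #f (car rest)))))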

What about the case in which there is a pair (not necessarily the
last) in the association list with a first element matching the key?
In this case, it will depend on where the pair is located in the list.
For the purpose of analysis it will help if we take a statistical
perspective.  Let's say that we choose a key whose location in the
list is chosen uniformly at random from the numbers 1 through n.  This
implies that if we choose many keys selected in this random fashion,
on average, the number of comparisons will be n/2 which is still O(n)
since 'big-oh' notation ignores constant factors (the 1/2 in this
case).  Rather than prove this to you, I'll try to convince you
experimentally.  The following procedure computes the average of m
samples drawn uniformly from the integers 0 through n - 1.  The
invocation (random n) selects a random integer between 0 (inclusive)
and n (exclusive).

> (define (sample-mean n m)
    (do ((i 0 (+ i 1)) 
         (s 0 (+ s (random n))))
        ((= i m) (/ s m 1.0))))

The law of large numbers tells us that if we draw a large enough
sample, the sample mean (the average of the m samples) should be 
close to n/2 with high probability.  Go ahead and try it.

> (sample-mean 100 100)
55.24
> (sample-mean 100 100)
49.42
> (sample-mean 100 1000)
49.556
> (sample-mean 100 1000)
49.207
> (sample-mean 100 10000)
49.8387
> (sample-mean 100 10000)
49.57

So, if we choose our keys uniformly at random and we choose enough of
them, lookup using linear search for large enough m should require 
n/2 comparisons on average with high probability.  Since the number 
of comparisons on average is n/2, we say that dictionary lookup based 
on linear search has an expected performance (measured in terms of
comparisons) linear in n.  An obvious question is 'Can we do any
better?'  And the answer is 'yes'.

Suppose that we could take each key and map it into a unique location
in memory.  How might we do this?  Suppose that we set aside 2^m
(that's 2 raised to the power m) locations in memory for our dictionary.
Now let's take an arbitrary key such as 'aardvark'.  In memory, the
string of characters in 'aardvark' corresponds to a sequence of binary
bits, 1s and 0s.  For example, the characters in 'aardvark' might be
represented in ASCII (ASCII stands for ``American Standard Code for
Information Interchange''): one of several standards for encoding
characters.  ASCII maps the binary numbers 0-127 to 128 characters.
The letter 'a' in ASCII is represented by the decimal number 97 and the
binary number 1100001 (where the most significant bit is the left
most).  The rest of the letters in 'aardvark' have the following ASCII
binary representations: d = 1100100, v = 1110110, r = 1110010, and 
k = 1101011; hence, in computer memory, 'aardvark' might look like:

  aardvark = 11000011100001111001011001001110110110000111100101101011

We could take the first m bits, or the last m bits, or every other
bit from the first to the 2m'th bit in the binary representation of
'aardvark' to construct a number between 0 and 2^m - 1 and this number
could correspond to the location in memory where we would store the
item associated with 'aardvark'.
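
To make this concrete, here's a sketch of a simple hashing function;
rather than selecting bits it combines the character codes of the key
arithmetically, but the effect is the same: every key is mapped to a
number between 0 and 2^m - 1.  The name 'toy-hash' is invented for
illustration:

(define (toy-hash key m)
  (do ((chars (string->list key) (cdr chars))
       (h 0 (modulo (+ (* 31 h) (char->integer (car chars)))
                    (expt 2 m))))
      ((null? chars) h)))

> (toy-hash "aardvark" 10)
886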

Now suppose we want to lookup the item associated with 'aardvark'.
All we have to do is use our procedure for selecting m bits from the
binary representation of 'aardvark' and then jump to the corresponding
location in memory to retrieve the associated item stored there.  This
operation takes a constant amount of time, i.e., it is independent of
the number of items stored in the dictionary.  The method of
converting a key into a location in memory is called 'hashing' and the
data structure used to store the items associated with keys is called
a 'hash table'.

It appears that operations based on hash tables are much more
efficient than those based on association lists.  But there is a
potential problem that we have to address.  What if two keys are
mapped to the same location in memory?  This is called a 'collision'
and there are a number of ways to deal with collisions.  One approach
is simply to store all items that map to the same memory location in
an association list.  If we map a key to a location in memory and we
find a list, then all we have to do is compare the key against those
pairs that got mapped to this location - hopefully a much smaller
number of pairs than the total number of keys.  In the worst case, if
we choose a lousy method for hashing and every key gets mapped to the
same location in memory, this approach takes n comparisons, the same as
linear search.  But in the best case, there are no collisions and
lookups take a constant amount of time, i.e., the time required to
map a key to a location in memory.  Hashing has O(n)
worst-case performance and O(1) (constant) expected time for an
optimal hashing strategy.
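
Here's a minimal sketch of the chaining approach; the names
'bucket-index', 'table-put!' and 'table-get' are invented, and real
hash-table implementations (including MzScheme's, used below) are
considerably more sophisticated:

;; map a key to a slot by summing its character codes
(define (bucket-index key size)
  (modulo (apply + (map char->integer (string->list key))) size))

;; cons a (key item) pair onto the association list stored in the slot
(define (table-put! table key item)
  (let ((i (bucket-index key (vector-length table))))
    (vector-set! table i (cons (list key item) (vector-ref table i)))))

;; use 'assoc' to search the (hopefully short) list stored in the slot
(define (table-get table key)
  (let ((i (bucket-index key (vector-length table))))
    (assoc key (vector-ref table i))))

> (define table (make-vector 8 '()))
> (table-put! table "aardvark" 'burrower)
> (table-get table "aardvark")
("aardvark" burrower)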

There are lots of clever ways of 'hashing' and implementing 'hash
tables' but to determine if they work for your application, you often
have to run experiments.  This is called 'profiling' or
'benchmarking' and we'll give you an example of how we might do this
for comparing the MzScheme implementations of 'assoc' and
'hash-table'.  The file ./bin/benchmark.scm contains code that builds
an association list or hash table of size m and then retrieves m
random keys from the corresponding data structure.  We'll try each
data structure for m = 100, 1,000, 10,000 and 100,000 starting with
the association list implementation, and use the 'time' function to
keep track of different measures of performance in millisecond units:

> (time-assoc 100)
Inserting 100 items: cpu time: 0 real time: 1 gc time: 0
Performing 100 lookups: cpu time: 0 real time: 1 gc time: 0
> (time-assoc 1000)
Inserting 1,000 items: cpu time: 0 real time: 3 gc time: 0
Performing 1,000 lookups: cpu time: 120 real time: 195 gc time: 0
> (time-assoc 10000)
Inserting 10,000 items: cpu time: 30 real time: 66 gc time: 0
Performing 10,000 lookups: cpu time: 18,060 real time: 19,052 gc time: 0
> (time-assoc 100000)
Inserting 100,000 items: cpu time: 520 real time: 661 gc time: 210
Performing 100,000 lookups: cpu time: 2,123,480 real time: 2,847,301 gc time: 0

It appears that the insertions don't depend on the size of the
association list; inserting 10 times the number of items results in 10
times the amount of time.  But the lookups are somewhat disturbing:
performing 10 times the number of lookups results in roughly 10^2
times the time; the longer the list, the longer it takes to access
items.  Now let's
take a look at the hash table implementation:

> (time-hash 100)
Inserting 100 items: cpu time: 0 real time: 0 gc time: 0
Performing 100 lookups: cpu time: 0 real time: 1 gc time: 0
> (time-hash 1000)
Inserting 1,000 items: cpu time: 10 real time: 4 gc time: 0
Performing 1,000 lookups: cpu time: 0 real time: 3 gc time: 0
> (time-hash 10000)
Inserting 10,000 items: cpu time: 60 real time: 90 gc time: 20
Performing 10,000 lookups: cpu time: 40 real time: 49 gc time: 0
> (time-hash 100000)
Inserting 100,000 items: cpu time: 470 real time: 534 gc time: 110
Performing 100,000 lookups: cpu time: 400 real time: 447 gc time: 0
> (time-hash 1000000)
Inserting 1,000,000 items: cpu time: 4,000 real time: 4276 gc time: 370
Performing 1,000,000 lookups: cpu time: 4,500 real time: 4969 gc time: 0

Note that in this case an increase of 10 in the number of lookups
results in a 10-fold increase in the amount of time.  Hash tables,
even with the potential for collisions, seem to scale much more
reasonably than association lists.  Of course we used very simple keys
and we might encounter a very different situation if we were using
keys corresponding to, say, words from a text document.  To get a
better idea of how well hash tables work, in the following experiments
we take random samples of the words in /usr/share/dict/words and try
the hash table implementation once again, duplicating some of the
larger experiments to get some idea of the variance due to sampling.

> (time-hash-words 100)
Inserting 100 items: cpu time: 0 real time: 0 gc time: 0
Performing 100 lookups: cpu time: 0 real time: 1 gc time: 0
> (time-hash-words 1000)
Inserting 1,000 items: cpu time: 0 real time: 1 gc time: 0
Performing 1,000 lookups: cpu time: 10 real time: 7 gc time: 0
> (time-hash-words 10000)
Inserting 10,000 items: cpu time: 10 real time: 12 gc time: 0
Performing 10,000 lookups: cpu time: 530 real time: 487 gc time: 0
> (time-hash-words 10000)
Inserting 10,000 items: cpu time: 10 real time: 12 gc time: 0
Performing 10,000 lookups: cpu time: 460 real time: 476 gc time: 0
> (time-hash-words 100000)
Inserting 100,000 items: cpu time: 150 real time: 534 gc time: 0
Performing 100,000 lookups: cpu time: 94,460 real time: 94,448 gc time: 0
> (time-hash-words 100000)
Inserting 100,000 items: cpu time: 130 real time: 204 gc time: 0
Performing 100,000 lookups: cpu time: 96,530 real time: 96561 gc time: 0

Well, the hash-table implementation slowed down considerably but 
it's still performing much better than the association-list
implementation.  It should also be noted that the benchmarking
function 'time-hash-words' has to randomly select words from a list
and this takes additional time which is reflected in the above 
statistics.  To give the hash-table implementation a fair shake
we'd have to design a better evaluation scheme. 

There are other algorithms for implementing dictionary lookup; one of
the more popular is based on a data structure called a balanced binary
tree.  This algorithm has a worst-case performance of O(log n) which
is better than O(n) but worse than O(1).  If you'd like to do a final
project with a mix of theory and practice, you can implement the
standard algorithms for linear search, hash tables and balanced binary
trees and compare them both analytically and experimentally.

2.18  Exercises for Wednesday, November 10, 2004

Semantic Networks - Graphs with Labeled Edges and Nodes 

In a semantic network the nodes correspond to words and the
arcs correspond to relationships between words.   Here are some
of the most obvious relationships found in semantic networks.

 synonyms - one word is-roughly-the-same-as another word, 

 antonyms - one word has-roughly-the-opposite-meaning of another word,

 hypernyms - one word is-more-general/less-specific-than another word,
 hyponyms - one word is-less-general/more-specific-than another word, and

 meronyms - one word denotes an object that is part of an object denoted 
            by another word

Taxonomies can be represented as semantic networks.  Carolus Linnaeus,
an 18th-century Swedish botanist, devised a taxonomic categorization
of the biological world in which each species belongs to a genus, each
genus belongs to a family, each family to an order, each order to a
class, each class to a phylum and each phylum to a kingdom.  A similar
sort of analysis has been done for genes relating individual genes to
their known functions and functional classes.  In 1852 Peter Mark
Roget published his thesaurus of synonyms and antonyms, much of which
can also be captured in a semantic network.  You can download the raw
text of the 1911 edition of Roget's Thesaurus from Project Gutenberg at

  http://www.gutenberg.org/dirs/etext91/roget15a.txt

WordNet is a project of the Cognitive Science Department at Princeton
University.  It is basically a big semantic network that supports a
rich set of relationships represented as links.  Visit their web site:

  http://www.cogsci.princeton.edu/cgi-bin/webwn/

and try a simple query like 'giraffe' and have WordNet search for
hypernyms and hyponyms.  Here is a longish chain of hyponyms:

  bronco, mustang, pony, horse, equine, odd-toed ungulate, placental
  mammal, mammal, vertebrate, chordate, animal, organism, entity

A pairwise relationship ~ can be reflexive - for all x, x ~ x,
symmetric - for all x and y, if x ~ y then y ~ x, and transitive - for
all x, y and z, if x ~ y and y ~ z then x ~ z.  An equivalence
relation, e.g., '=', is reflexive, symmetric, and transitive. Some of
the relationships typically captured in semantic networks are
symmetric (e.g., synonyms and antonyms) and some are transitive but
not symmetric (e.g., hypernyms and hyponyms).  In the case of
taxonomies, we are often interested in whether one word is related to
another through a sequence of taxonomic relationships.  For example, a giraffe
is an ungulate (animals with hoofs) and all ungulates are mammals from
which we can conclude that a giraffe is a mammal.  In searching the
web for information about animals with hoofs, you might want to narrow
your search to focus on the giraffe or rhinoceros or you might want to
broaden your search to cover all mammals.  Google allows the use of
the tilde operator '~' to widen a search to include synonymous
keywords.

To implement semantic networks we'll use a basic Scheme utility for
creating specialized data structures.  For example, here's how you'd
create a data structure for representing points in the plane.

> (define-struct point (x-coord y-coord))

This one statement causes several additional functions to be defined:
'make-point' for creating points, 'point?' for testing whether or not
an object is a point, 'point-x-coord' and 'point-y-coord' to access
the x and y coordinates of a point respectively, and
'set-point-x-coord!' and 'set-point-y-coord!' to set the x and y
coordinates of a point respectively.

> (define pt (make-point 1 0))
> (point? pt)
#t
> (point-x-coord pt)
1
> (point-y-coord pt)
0
> (set-point-y-coord! pt 7)
> (point-y-coord pt)
7


1. Write a function 'distance' that takes two points created by
   'make-point' and computes the Euclidean distance between the two
   points.  Write a function 'tour' that takes a list of points
   created by 'make-point' and computes the total distance traveled if
   you start at the first point in the list, travel in a straight line
   to the second point, from there travel on to the third point and so
   on ending up at the last point in the list.


2. Learn about the graphics.ss library in DrScheme and implement a
   data structure and supporting functions for representing and
   drawing two-dimensional polygonal objects.  This is an advanced
   exercise and will take some time and effort to understand the
   syntax for drawing using the graphics.ss library.  Search for
   'Viewport Graphics' in the Help Desk and add the following 
   'require' directive to your programs:

   (require (lib "graphics.ss" "graphics"))

   You should start with a set of primitive geometric objects, e.g.,
   rectangle, circle, isosceles triangle.  Your program should be able
   to compose a more complicated geometric object from a specification
   involving the simpler, primitive objects.


Now, we'll define the data structures for implementing semantic
networks.  A node in a semantic network is identified by a 'word' 
and it has links that lead 'out' from the node and links that lead 
'in' to the node.

> (define-struct node (word out in))

A link has a specified 'type' and it has an origin node that it leads
'out' of and a destination node that it leads 'in' to.

> (define-struct link (type out in))


3. The in-degree of a node in a directed graph is defined as the
   number of links that lead into the node.  The out-degree is defined
   analogously.  Write the functions 'in-degree' and 'out-degree'
   implementing these definitions.  Write a function 'max-degree'
   which computes the maximum in- and out-degrees over all nodes in
   the graph corresponding to a semantic network.


We use a hash table to implement a dictionary that will serve to index
the nodes in our semantic network. 

> (define dictionary (make-hash-table))

Insert a word into the dictionary associating it with a node in the 
network.  If a node is already associated with the word then return
the node. 

> (define (insert word) 
    (or (hash-table-get dictionary word (lambda () #f))
        (let ((node (make-node word '() '())))
          (hash-table-put! dictionary word node))))

Return the node associated with a specified word if it exists and 
otherwise signal an error. 

> (define (lookup word)
    (or (hash-table-get dictionary word (lambda () #f))
        (error "lookup: unknown node" word)))
  
Add an 'isa' link starting from 'out' and leading to 'in'.

> (define (isa out in)
    (let* ((out (lookup out)) (in (lookup in))
           (link (make-link 'isa out in)))
      (set-node-out! out (cons link (node-out out)))
      (set-node-in! in (cons link (node-in in)))))

Now we can build a little semantic network as follows:

> (insert 'giraffe)
> (insert 'ungulate)
> (insert 'mammal)
> (isa 'giraffe 'ungulate)
> (isa 'ungulate 'mammal)


This next procedure just dumps the entire contents of a semantic
network by using the 'hash-table-for-each' operator which works on
hash-tables similarly to how 'map' and 'for-each' work on lists.

> (define (dump)
    (hash-table-for-each dictionary 
     (lambda (word node) 
       (printf "~a - out: ~a - in: ~a~%" word
         (map node-word (map link-in (node-out node)))
         (map node-word (map link-out (node-in node)))))))
> (dump) 

  ungulate - out: (mammal)   - in: (giraffe)
    mammal - out: ()         - in: (ungulate)
   giraffe - out: (ungulate) - in: ()

All of the above code can be found in ./bin/semantic.scm.


5. Write functions 'more-general' and 'more-specific' which given a
   word return a list of words sorted by, respectively, the 'more
   general' and 'more-specific' relationships where we assume that
   both relationships are transitive.  


6. Write predicates 'is-more-general?' and 'is-more-specific?' that
   take two words in the dictionary and return #t if the first is,
   respectively, more general or more specific than the second and
   that return #f otherwise.


7. Build a semantic network for the parts of a car.  Create link types
   for 'part-of' (a carburetor is part of the engine), 'is-a' (an
   internal combustion engine is a kind of engine which is a kind of
   motive force), and 'connected-to' (the drive train is connected to
   the rear axle).  Alternatively build a semantic network describing
   functional and structural dependencies among the parts/players of
   an athletic team, a company, a building, or a spring-wound clock.


8. Use the 'dot' graph language and graphing utility from the Graph
   Viz collection of graph drawing tools to draw and label a semantic
   network.  Your program should take as input a hash table
   implementing a semantic net with nodes and links built using
   'make-node' and 'make-link' and return a digraph with labeled nodes
   and links specified using the 'dot' graph language.  You can learn
   more about Graph Viz and 'dot' at the Graph Viz documentation web
   page http://www.graphviz.org/Documentation.php and the exercise on
   drawing graphs on the Talking With Computers web site
   http://www.cs.brown.edu/~tld/talk/home-Z-H-15.html#node_sec_13.2.1

2.19  Exercises for Friday, November 19, 2004

How does a search engine represent and store documents to support
keyword queries?  Here are the basic procedures involved in
summarizing documents to support keyword queries:

document -> bag of words

For the most part, word order is ignored. Search engines like Google 
typically also store the location of each word in a document to support
proximity queries, e.g., search for documents including 'search' and 
'engine' such that these two terms are within one word of each other. 
Google is also starting to store short phrases, but storing phrases 
has to be managed carefully to avoid an explosion in terms.  

1. Find a utility for converting Microsoft Word documents to plain
   text and then use it to convert a Word document to a bag of words.
   There are various utilities available including 'antiword' and
   'msview' that you might find on the system or by searching the
   web and downloading and installing an application. 

bag of words -> bag of terms 

Typically the words in a document are preprocessed by removing 'stop
words' such as 'a', 'the', 'it', 'and', 'that', etc., words that add
little meaning given that the document is represented as a bag of
words with little additional semantic analysis.  In addition to
removing stop words, usually the remaining words are 'stemmed' to map
morphological variants of a given word to the same root, e.g., 'run',
'runs' and 'running' all map to 'run'.  The set of all terms in the
corpus is called the 'dictionary'.  In most cases, stemming reduces
the size of the dictionary and the length of term vectors, but can in
some cases conflate terms that we would like to remain separate, e.g.,
'universe', 'university' and 'universal' all map to the root 'univers'.

2. Find a Perl or Python script that implements the 'Porter stemmer'.
   Use it to stem a sample of words drawn from a familiar text and
   note what terms are unambiguously stemmed and what terms conflate
   meanings. 

bag of terms -> inverted index 

In order to retrieve documents that contain specific terms we build 
an inverted index which points from each term to the documents that 
contain the associated word. Usually the inverted index also records
the position of each term/word in the document to support proximity 
searches. 

3. How would you implement an inverted index in Scheme?  Create 
   the data structures and functions implementing the operations 
   of 'index' and 'fetch'. 

bag of terms -> term-frequency vector

The bag of terms is then processed to determine the number of times
each term appears in the document - this number is called the
frequency of the term - and the resulting numbers are used to
construct a term frequency vector which maps each term to its
frequency in the document.  You can think of this as a vector of
numbers where the 'i'th number represents the frequency of the 'i'th
term in the dictionary, but in fact term frequency vectors are never
actually implemented in this fashion.  A more efficient way of
implementing this mapping would be as an association list or hash
table.
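
For instance, here's one way you might build such an association list
from a bag of terms; the name 'term-frequencies' is invented for
illustration:

(define (term-frequencies terms)
  (do ((terms terms (cdr terms))
       (freqs '()))
      ((null? terms) freqs)
    (let ((pair (assoc (car terms) freqs)))
      (if pair
          (set-car! (cdr pair) (+ 1 (car (cdr pair))))
          (set! freqs (cons (list (car terms) 1) freqs))))))

> (term-frequencies '(run jump run swim run))
((swim 1) (jump 1) (run 3))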

4. Implement term-frequency vectors as association lists and then
   write code for two important operations on vectors from linear
   algebra: the magnitude of a vector and the inner or dot product 
   of two vectors (see ../../../04/11/03/exercises.txt).

Now that we have a representation for each document and an index
mapping terms/words to documents, keyword queries are handled as
follows:

keywords -> keyterms

Eliminate stop words and stem the remaining keywords.  

keyterms -> documents

To retrieve a set of candidate documents we use the inverted index to
map the keywords to documents containing them.  The search engine
compiles a list of all documents that contain some (or all) of the
keywords in preparation for ranking.  

keyterms -> keyterm-frequency vector

To rank the documents, we can construct a term-frequency vector from
the search keywords which we'll call the keyterm-frequency vector.  At
this point we can add additional terms corresponding to synonyms of
the keywords.  We can also specialize the query by including hyponyms
(x 'is a kind of' y) or generalize it by using hypernyms (y 'is a kind
of' x).

5. Implement the cosine angle of two term-frequency vectors using
   the 'magnitude' and 'inner' product operations you defined earlier.

documents + keyterm-frequency vector -> document ranking

The degree of similarity between the keyterm-frequency vector and each
of the document term-frequency vectors is then calculated and the
documents are sorted using the calculated similarities and presented
to the user in the resulting order.

2.20  Exercises for Monday, November 29, 2004

1. Suppose that you want to compute the minimum and the maximum of a
   list of items. Here's a simple algorithm for computing the minimum:

   simple-min of a list of items
     set min to be the first item in items
     foreach item in the rest of the items
       if item < min then 
         set min to be item
       endif
     end
     return min

   And here's an algorithm for computing the maximum of a list of items:

   simple-max of a list of items
     set max to be the first item in items
     foreach item in the rest of the items
       if item > max then 
         set max to be item
       endif
     end
     return max

   Here's the most obvious way of computing both the minimum and the
   maximum:

   simple-min-and-max of a list of items
     set min to be simple-min applied to items
     set max to be simple-max applied to items
     return ( min, max )

   We can be a little more clever and only iterate over the list of
   items once:

   combined-min-and-max of a list of items
     set min to be the first item in items
     set max to be the first item in items
     foreach item in the rest of the items
       if item > max then
         set max to be item
       endif 
       if item < min then 
         set min to be item
       endif 
     end
     return ( min, max )

   Or we can be downright devious and do a little preprocessing of the
   list:

   tricky-min-and-max of a list of items
     set mins to be the empty list
     set maxs to be the empty list
     foreach consecutive pair of items in the list of items
       if first item in the pair > second item in the pair then 
         add the first item in the pair to maxs and 
         add the second item in the pair to mins
       else 
         add the first item in the pair to mins and 
         add the second item in the pair to maxs
       endif
     end
     set min to be simple-min applied to mins
     set max to be simple-max applied to maxs
     return ( min, max )

   Note that if the list of items is 

     (item_1 item_2 item_3 item_4)

   then the list of consecutive pairs is 

     ((item_1 item_2) (item_3 item_4)) 

   Don't worry about the case in which the length of the list of 
   items is odd.

   For each of the three algorithms, 'simple-min-and-max',
   'combined-min-and-max' and 'tricky-min-and-max' determine how many
   comparisons (applications of < or >) are required to compute the
   min and max of a list of items as a function of n, the length of
   the list. Note that 2 * n, n/4, n^2, and log n are all functions 
   of n as are any combinations of these such as 2 * log n/4.

   Show your work; by this I mean write out the steps that you took
   to arrive at your solutions, e.g., the for-each loop performs ...
   comparisons and the call to 'simple-min' applied to the list of 
   mins performs ... comparisons ... 
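
   In case you'd like to experiment, here's one possible Scheme
   transcription of 'combined-min-and-max'; note that the analysis
   you're asked for is independent of any particular implementation:

   (define (combined-min-and-max items)
     (do ((rest (cdr items) (cdr rest))
          (lo (car items))
          (hi (car items)))
         ((null? rest) (list lo hi))
       (if (> (car rest) hi) (set! hi (car rest)))
       (if (< (car rest) lo) (set! lo (car rest)))))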


2. Suppose you want to insert a new item into a list of items that 
   is already sorted so that the new list with the inserted item is 
   also sorted; how many comparisons would this require for a list 
   of length n?  


3. Use the idea of inserting items in an already sorted list to
   define a sorting algorithm called 'insertion-sort' and suggest 
   an upper bound on the number of comparisons it performs.

2.21  Exercises for Wednesday, December 1, 2004

Here's what I'd like you to get out of the first part of this lecture:

   a. a basic understanding of how you measure the performance
      of algorithms 

   b. an understanding of how we measure the difficulty of 
      problems and assign problems to classes 


Computational Complexity

An algorithm is an abstract description of a computation: a recipe in
a reasonably precise mathematical language that describes the steps
required to carry out a given computation.  Algorithms differ from
particular implementations of an algorithm, e.g., in Scheme, C, Java
or C++, only in the level of detail and, indeed, these languages are 
perfectly suited to precisely specifying algorithms. 

There are many ways to measure the performance of algorithms.  For
example, you might want to count the number of required time-consuming
operations, e.g., the number of string comparisons or arithmetic
operations.  Alternatively, you might want to figure out how much
space (or memory) an algorithm requires.  And you might want to
understand how an algorithm performs in the worst case, the best case,
or the average case.

We typically describe the performance of algorithms as a function of n
where n is a measure of the size of the problem, such as the number of
items in a list of items to be searched or sorted, or the number of
nodes in a graph in which we want to find the shortest path or the
shortest tour (in the case of a traveling salesman problem).

For the reasons I gave on Monday, we group together (or classify)
problems that are 'asymptotically' (in the limit) equivalent, e.g.,
the linear-time algorithms - O(n), read 'order of n' using so-called
big-oh notation, logarithmic - O(log n), quadratic - O(n^2), cubic -
O(n^3), or, more generally, polynomial - O(n^k) for some fixed k,
exponential - O(2^n), doubly exponential - O(2^(2^n)), etc.

By contrast, the difficulty of a 'problem' is typically associated
with the class of the 'most efficient' algorithm for solving the
problem.  Note that this is not the most efficient algorithm we know
of, but rather the most efficient algorithm possible.  For example,
computer scientists have shown that the problem of sorting a list of
items using comparisons (we have to be specific given that there are
other approaches to sorting - look up 'radix sort' using Google if 
you're curious about this) is in the class of problems that can be
computed in O(n log n).  This bound is referred to as the 'asymptotic
worst-case complexity' of the problem.

I also mentioned that there are some problems that we can't even find
solutions for.  These are the so-called 'undecidable problems' and one of
the best known such problems is the 'halting problem'.  We'll
look at the halting problem in more detail in just a few minutes. 

For some problems, like sorting, we actually know the performance of
the most efficient algorithm.  For others we don't know for sure.  In some
cases, computer scientists have identified abstract classes that
characterize computational problems.  The class of NP-complete
problems is one of the most famous such classes.

A 'nondeterministic polynomial time algorithm' is one that tries to
guess a solution to a problem and can determine whether or not a
proposed solution is indeed a solution to a specified problem in
polynomial time.  The class of NP-complete problems (the long version
is 'that class of problems complete for nondeterministic polynomial
time') is such that if you could solve any one of these problems in
polynomial time you could solve them all in polynomial time.

A 'decision' variant of the traveling salesman problem is in the class
of NP-complete problems.  In this particular variant, the task is to determine
for a given graph whether or not there exists a tour that is shorter
than a specified threshold distance.  Other problems involve solving
various kinds of puzzles, configuring computer networks, scheduling
assembly lines, etc.


1. Argue that you can check a proposed solution to an instance of the 
   traveling salesman problem in polynomial time.  


So far the best known algorithms for these problems all run in time
exponential in the size of their input for some inputs.  Computer
scientists conjecture, but don't know for sure, that P != NP, that is
to say that the class of NP-complete problems is distinct from the
class of polynomial-time problems.  A few computer scientists believe
that P = NP and that we just haven't been clever enough in devising
algorithms, and sooner or later we'll find a polynomial-time algorithm
for some problem in the class of NP-complete problems and this problem
will cause the class of NP-complete problems to 'collapse' like a house of
cards into the class of polynomial-time problems.

The problem of whether or not P = NP produces all sorts of cranks:
people who think they have a proof that either P != NP or P = NP.  One
of the most common challenges involves the invention of alternative
forms of computation.  For example, DNA computing, quantum computing,
or even computing using ant colonies.  There are proposals for using
biological processes and in particular DNA and cellular processes to
compute the solutions to problems.  A European research group claims
that it can solve instances of TSP using a colony of ants.  Perhaps
the most interesting and credible challenge concerns the application
of quantum mechanics to building a so-called quantum computer.  It has
been conjectured that quantum computing will enable us to solve a
particular problem relating to security and code breaking faster than
on a conventional computer.

Nearly seventy years ago a mathematician by the name of Alan Turing
addressed the problem of there being different models of computing
by proving that all sufficiently powerful computing machines are
equivalent.  He did this as part of proving that the halting problem
is undecidable. 


The Halting Problem

What might you expect from the following program:

while ( n != 1 ) {
   if n is even  
      then n = n / 2 ;
      else n = 3n + 1 ;
}

Or, if you prefer, in Scheme:

(define (steps n)
  (if (not (= n 1))
      (steps (if (even? n)
                 (/ n 2)
                 (+ (* 3 n) 1)))))
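
And here's a variant, under the invented name 'count-steps', that
returns the number of steps performed, in case you'd like to
experiment:

(define (count-steps n)
  (do ((n n (if (even? n) (/ n 2) (+ (* 3 n) 1)))
       (k 0 (+ k 1)))
      ((= n 1) k)))

> (count-steps 1024)
10
> (count-steps 27)
111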

Does it halt or not?  No one actually knows.  For a power of two, it
clearly halts (for n = 1024 it halts in 10 steps).  But for some other
inputs it bounces around wildly.  Even for relatively simple programs
involving integer inputs it's difficult to predict their behavior.
But that doesn't mean it's impossible.  How can we show that there is
no program that can predict the behavior of an arbitrary program on an
arbitrary input?  We're going to present a proof by contradiction:
we'll assume that such a program exists and then we'll show that this
assumption leads us to a contradiction.


2. Convince yourself of the following proof by contradiction.  In the
   following, P, Q and S are programs and X represents a possible
   input to a program.  Since programs are just strings of symbols
   they can be used as inputs to programs.  To begin the proof by
   contradiction, assume that there exists a program Q defined as
   follows:

   Define Q(P,X) as follows:
     If P(X) halts then 
       return YES
     else return NO

   Now we define another program S that uses Q as a subroutine and
   that definitely does not halt on some inputs:

   Define S(X) as follows:
     If Q(X,X) = YES then 
       loop forever
     else terminate

   Suppose that we apply S to itself as in S(S).  First, suppose that
   S(S) terminates; but this can't be since the only way for S(S) to
   terminate is for Q(S,S) to return NO which indicates that S(S)
   doesn't terminate.  Then suppose that S(S) doesn't terminate; but
   this can't be either, since if S(S) doesn't terminate that means
   that Q(S,S) must have returned YES which indicates that S(S) does
   terminate.  Since in either case we arrive at a contradiction, this
   leads us to reject our initial assumption, namely that there
   actually is a program that can determine for any program and any
   input whether the program will halt on that input.


In the process of filling in the details of this proof, Turing devised
an abstract computing device which now bears his name. 


Turing Machines

Alan Turing proved that there is no mechanical method that can tell
for any program whether or not it will halt.  To complete his proof he
imagined a mathematical abstraction that reduces computation to a
simple, almost elemental form.  Here are the main features of a Turing
machine:
                                                    
     <-- ... | # | 0 | 1 | 0 | 0 | 0 | 0 | 0 | # | ... --> TAPE
                               ^ READ-WRITE HEAD
                               |
                         | CONTROLLER |
                         |   RULES    |
                         |  STATE: 1  |

  * a finite alphabet of symbols; in this case 0, 1 and #,

  * a finite set of states that the machine might be in; for example,
    a machine might have the set of three states: MOVING FORWARD
    LOOKING FOR A #, MOVING BACKWARD LOOKING FOR A 1, and DONE,

  * an infinite TAPE arranged as a sequence of locations each capable
    of recording a single symbol,

  * a READ-WRITE HEAD capable of reading the symbol written on the
    tape at the location under the head or writing a symbol at the
    location under the head, and

  * a CONTROLLER consisting of a finite set of rules each of which
    can, depending on the current state and the symbol written on the
    tape under the read-write head, change the current state to be a
    new state, write a new symbol at the location currently under the
    head, and move the tape forward or backward under the head.


We can represent the rules used to define a controller for a particular
machine as a table with the following columns. 

      CurrentState CurrentSymbol NewState NewSymbol Move

Here's the controller for the 'NO2ONES' machine which doesn't like
consecutive ones in a string; whenever NO2ONES finds two consecutive
ones, it replaces the second one with a zero.  When it gets to the end
of a string (encounters a blank space indicated by a #) then it stops.

The rules assume that the head starts at the beginning (the leftmost
symbol) of the string that we want to work on.  The plus symbol '+'
indicates move the head forward, the minus '-' move the head back, and
the asterisk '*' stay where you are, don't move the head in either
direction.

      CurrentState CurrentSymbol NewState NewSymbol Move
      1            1             2        1         +
      1            0             1        0         +
      1            #             HALT     #         *
      2            1             1        0         +
      2            0             1        0         +
      2            #             HALT     #         *


Another exercise in writing Turing machines is to define a machine that
can recognize palindromes (ignore the spaces and punctuation): 

     "A man, a plan, a canal: Panama" 
     "A nut for a jar of tuna"
     "A daffodil slid off Ada"

You can find many more palindromes at http://www.palindromes.org/

Here's the table of rules for recognizing even-length palindromes: 

      CurrentState CurrentSymbol NewState NewSymbol Move
      1            0             4        #         +
      1            1             2        #         +
      1            #             HALT     *         *
      2            0             *        *         +
      2            1             *        *         +
      2            #             3        *         -
      3            1             6        #         -
      3            #             HALT     *         *
      3            0             HALT     *         *
      4            0             *        *         +
      4            1             *        *         +
      4            #             5        *         -
      5            0             6        #         -
      5            #             HALT     *         *
      5            1             HALT     *         *
      6            0             *        *         -
      6            1             *        *         -
      6            #             1        *         +

An asterisk '*' in the NewState or NewSymbol column means 'no change'. 


3. Here's the graphical representation of the controller shown as
   a so-called finite-state transition diagram:

             /-----------------> 6 <-----------------\
            /                    |                    \
           3 <------- 2 <------- 1 -------> 4 -------> 5

   Fill in the transitions by labeling each arc with the current
   symbol on the tape, the new symbol on the tape, and direction to
   move the tape.  Note that there are some arcs corresponding to self
   transitions that you'll have to add to the diagram. 


4. Devise a Turing machine to erase every third 1 in a sequence of
   1s and 0s:

   #01011011010# => #01010011000#


5. If you found the palindrome machine easy, try the copy machine:

   #0101###### => #0101#0101#


6. For a more ambitious challenge, implement a rule set for the 
   adding machine:

   #0101#0110###### => #0101#0110#1011#

   This is significantly more difficult than exercises (4) or (5).
   Here we are assuming numbers are represented 'big endian' with 
   the least significant bit on the right.


What can you do with Turing machines?  Arithmetic? Logic?  The answer
is that you can do anything that you can do with a traditional
computer; more in fact, if you assume that you have an infinite tape.
In particular, you can build a Turing machine called a Universal
Turing Machine that can simulate any other Turing machine.


Experimenting with Turing machines:

We've provided a Turing machine simulator for you to experiment
with (take a look at ./bin/turing.scm).  In the Scheme implementation
of this simulator, we used '_'s for '#'s since '#' has a special
significance in Scheme.  The 'execute' function takes four arguments:
a set of rules, a list of symbols, the starting state and the starting
position on the tape (positions are zero-based).

> (load "turing.scm")
> (define NO2ONES '((1 1 2    1 +)
                    (1 0 1    0 +)
                    (1 _ HALT _ *)
                    (2 1 1    0 +)
                    (2 0 1    0 +)
                    (2 _ HALT _ *)))
> 
> (execute NO2ONES '(_ 1 1 0 1 1 0 1 _) 1 1)
(_ 1 1 0 1 1 0 1 _)
(_ 1 0 0 1 0 0 1 _)


There's also a version of 'execute' called 'verbose' that spits out 
lots of diagnostic information that might prove useful in doing 
exercises (4), (5) and (6).

> (verbose NO2ONES '(_ 1 1 0 1 1 0 1 _) 1 1)
(_ 1 1 0 1 1 0 1 _)
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = 1, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 0, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = 1, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 0, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = _, Next state = halt, New symbol = _, Direction = *
(_ 1 0 0 1 0 0 1 _)

The 'execute' function also allows the use of wildcards which can
reduce the number of rules in some cases.  If you want to use
wildcards, read the comments in the code concerning their use.

Here's the palindrome machine without using wildcards: 

(define PALINDROMES '((1 0 4 _ +)
                      (1 1 2 _ +)
                      (1 _ HALT * *)
                      (2 0 2 0 +)
                      (2 1 2 1 +)
                      (2 _ 3 _ -)
                      (3 1 6 _ -)
                      (3 _ HALT _ *)
                      (3 0 HALT 0 *)
                      (4 0 4 0 +)
                      (4 1 4 1 +)
                      (4 _ 5 _ -)
                      (5 0 6 _ -)
                      (5 _ HALT _ *)
                      (5 1 HALT 1 *)
                      (6 0 6 0 -)
                      (6 1 6 1 -)
                      (6 _ 1 _ +)))


And here's the palindrome machine using five fewer rules with wildcards:

(define palindromes '((1 0 4 _ +)
                      (1 1 2 _ +)
                      (1 _ HALT * *)
                      (2 * * * +)
                      (2 _ 3 * -)
                      (3 1 6 _ -)
                      (3 * HALT * *)
                      (4 * * * +)
                      (4 _ 5 * -)
                      (5 0 6 _ -)
                      (5 * HALT * *)
                      (6 * * * -)
                      (6 _ 1 * +)))


7. Implement the Turing machines that you described in exercises (4), 
   (5) and (6). 


8. For Friday, December 3, 2004 you are to read Chapter 16: 'Ain't
   Nobody Here But Us Machines' and also read the on-line exercise
   http://www.cs.brown.edu/~tld/talk/home-Z-H-18.html#node_sec_16.1.1
   titled 'Pet Rocks and Things That Think'.  You're to do the exercise
   that requires you to specify two measures of complexity and then
   rank the six systems listed: an ant, an elm tree, etc. Come to class
   prepared to discuss the questions raised in this piece. 

2.22  Exercises for Friday, December 3, 2004

I started this course by talking about the writings of Vannevar Bush
who, right after World War II, set forth a vision of a vast
information store that anticipated the World Wide Web and suggested
that computers would amplify our ability to communicate, calculate and
perceive in much the same way that the internal combustion engine and
other technologies amplified our physical abilities.

Computer scientists, some of them at least, think they can explain
not only decision making and perception in computational terms but
also the very stuff that makes us human (or so we believe):
consciousness, emotions, altruism, and intelligence in all its
manifestations.  In the 17th century, scientists tried to explain
human behavior in terms of clockwork mechanisms, and later in terms
of steam engines, electrical signals, and a host of other
technological advancements.  So it's worth asking: Is the current
optimism about unraveling the mystery of the mind warranted, or are
the current computational explanations just another fad?

Some computer scientists such as Hans Moravec and Ray Kurzweil believe
that one day relatively soon (at least by 2100) we'll be able to
manufacture the stuff of life and replace not only our organs but our
brains, codon-by-codon, cell-by-cell, neuron-by-neuron, thereby
providing us with unlimited life spans and unlimited opportunities to
amplify ourselves.

These are truly wondrous possibilities.  Technologists such as
Kurzweil are banking on exponential improvements in computing
technology, biology, and nanotechnology, which are closely
intertwined.  This exponential trend is fueled by Moore's Law, which
basically predicts that the number of transistors (elementary
computing components) on a chip will double approximately every two
years and therefore that the computing power available on a microchip
will double every two years.  O(2^n) gets very big very fast.  Moravec
and Kurzweil imagine riding this exponential curve to immortality and
cognitive amplification without bounds.

But critics of this sort of technological exuberance challenge these
goals on the grounds that they have little or nothing to do with
being human.  The core idea is that we are machines and, from
Moravec's and Kurzweil's perspective, really just information and
computational processes that can be transferred from one machine,
such as the human body, to another, such as the silicon chips that
comprise a computer.  If we are just machines, what does that say
about being human?  And if you believe we're 'just' machines, and
thereby subject to a set of rules as all machines are, then what does
that say about free will and self-determination?  Are we different
from any other animal except by some measure of how fast our
processors run, how much memory we have, or how many rules we can
learn from experience?


1. You were asked to come up with two measures of 'complexity' so as to 
   situate six systems along a continuum.  What are your measures and 
   what do they imply about our relationship to 'others'?  


2. What do you think about Moravec's and Kurzweil's vision of the future?
   How might the availability of cognitive amplification affect jobs,
   admission to college, etc?  Assuming we can deal with any resulting
   population pressures, how do you feel about the prospect of living
   indefinitely long? 


Closing Remarks

This course is likely different from anything you experienced in high
school, and it may be different from many if not most of the courses
you'll take in college.  For one thing there were no exams, the
quizzes were meant as gentle persuasion to read the material for
class, and the exercises were really meant for your edification, not
to keep track of what you were learning.

The projects were what you made of them.  They were designed to
reinforce the things you learned in class and in the exercises, but
they were offered as an opportunity for you to apply what you learned
and stretch yourself.  Some of you took advantage of this opportunity
and some of you did the minimum to get by.  You are rapidly
approaching a time of your life in which there won't be any formal
structure for guiding your performance: no teachers, no coaches, no
explicit instructions regarding what to do, when to do it, or how much
is enough.  Well, perhaps that's an exaggeration; there are jobs where
someone is always telling you what to do and when to do it, but I'd
argue that these jobs aren't worth having.

This class offered you ideas, resources, lots of help and the
opportunity to explore ideas and come up with some of your own.  The
final projects offer yet another such opportunity; for those of you
who came to office hours on Tuesday, I hope I convinced you that all
three of the projects offer interesting opportunities to explore.
You can make of them what you will.  I can't imagine any of you
getting less than a C in this class, but some of you will get less
than an A.  But really that should be much less important than
learning to take advantage of what life offers by applying yourself
energetically to getting the most out of every experience, and
especially the next four years.

If you learn this ability now, so many other things in life will fall
into line.  Take advantage of these last two weeks to get something
substantial out of completing your final project - don't worry about
grades, because sooner than you think they won't make any difference.
What will make a difference is your attitude and your ability to learn
and exploit the opportunities that come your way.

2.23  Exercises for Monday, December 6, 2004

Help session for the 'Dueling Chatterbots' project

The first four exercises lead up to getting 'albert' 
and 'eliza' talking with one another:

1. Write two functions 'one' and 'two' such that 
   each takes a single argument corresponding to an
   integer, adds 1 to its argument, prints the 
   result on a line preceded by 'i -> ' where i = 1
   in the case of 'one' and i = 2 in the case of 
   'two', and then returns the result. 

   > (one 1)
   1 -> 2
   2
   > (two 2)
   2 -> 3
   3

   Note the behavior when the two functions are alternated:

   > (one (two (one (two (one (two 1))))))
   2 -> 2
   1 -> 3
   2 -> 4
   1 -> 5
   2 -> 6
   1 -> 7
   7
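
   In case you're stuck, here's a minimal sketch of one way to write
   'one', using 'printf' as in the transcripts later in this handout;
   'two' is analogous:

   (define (one n)
     (let ((result (+ n 1)))
       (printf "1 -> ~a~%" result)   ; print 'i -> result' with i = 1
       result))                      ; then return the result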


2. Now write a loop using 'do' that first calls 'one' 
   with 0, then calls 'two' with the result returned 
   by 'one', then calls 'one' with the result returned
   by 'two', and so on indefinitely.  There are lots of 
   ways to do this, but you might consider using an integer
   counter 'i' and applying 'one' in the cases where 'i'
   is odd and 'two' in the cases where 'i' is even. The
   predicate 'even?' returns #t if its single integer
   argument is even and #f otherwise. 
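
   Here's a hedged sketch of one such loop, assuming 'one' and 'two'
   from exercise (1); the test is never true, so it loops
   indefinitely:

   (do ((i 1 (+ i 1))                          ; integer counter
        (n 0 (if (odd? i) (one n) (two n))))   ; alternate 'one'/'two'
       (#f))                                   ; never terminates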


3. Now produce the same behavior by redefining 'one'
   and 'two' to be mutually recursive, i.e., 'one'
   calls 'two' and 'two' calls 'one'. 
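
   A sketch of the mutually recursive version: each function now
   calls the other instead of returning.

   (define (one n)
     (let ((result (+ n 1)))
       (printf "1 -> ~a~%" result)
       (two result)))

   (define (two n)
     (let ((result (+ n 1)))
       (printf "2 -> ~a~%" result)
       (one result)))

   Evaluating (one 0) then prints 1 -> 1, 2 -> 2, 1 -> 3, and so on
   indefinitely.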


4. Rename 'one' and 'two' as 'eliza' and 'albert'
   and rewrite the body using 'aux-chatter' as a model 
   so that each function calls 'apply-rules' with an 
   appropriate set of rules ('eliza-rules' or 
   'albert-rules'), prints out the output and then 
   sends it along as appropriate.


This next set of exercises involves writing rules.  Recall that 
rules are of the form (cons pattern list-of-responses) so that

   '((I went to the ?x)
     (Did you enjoy going to the ?x)
     (What did you find when you got to the ?x))

matches 

   I went to the museum

and would produce either the response

   Did you enjoy going to the museum

or the response

   What did you find when you got to the museum

with equal probability. 
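
In Scheme, that rule might be packaged and invoked like this (a
sketch: 'apply-rules' is called with the same (rules input) signature
that appears in the exercise (8) transcript below, and one of the two
randomly chosen responses is shown):

> (define my-rules
    '(((I went to the ?x)
       (Did you enjoy going to the ?x)
       (What did you find when you got to the ?x))))
> (apply-rules my-rules '(I went to the museum))
(Did you enjoy going to the museum)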


5. As a warmup, create a rule that matches any of the following

   I think you are crazy
   I think you are schizophrenic
   I think you are psychopathic

   but not 

   I think you are going too far 
   I think you are an idiot

   and produces responses like

   I think you are [crazy | schizophrenic | psychopathic] too
   I certainly am not [crazy | schizophrenic | psychopathic] 
   Why do you think I am [crazy | schizophrenic | psychopathic] 


6. Create a rule that will match all of the following 

   When I get home I want to sleep for a month
   After exams I want to watch stupid TV for a week
   During the break I want to go skiing in Taos

   but not either of the next two sentences

   I want another cup of coffee and a donut
   I'm fed up and I want some peace and quiet

   and produce responses like 

   That would be nice to be able to sleep for a month
   I too would like to watch stupid TV for a week


The next set of exercises involves writing special-purpose matching
functions to augment the pattern-matcher.  The first exercise 
tests your understanding of how 'special's are handled in the 
pattern matcher and allows the application of special-purpose 
matching functions to be a little more compact; instead of 
writing (special my-match-fun ...) you use (my-match-fun ...).


7. Modify the mechanism for handling 'special' matching functions 
   so that it doesn't need the 'special' keyword; any list whose 
   'car' is a procedure (i.e., answers #t to the Scheme predicate 
   'procedure?') is defined as a special-purpose matching function.
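
   As a hint, the test you add might look something like the
   following sketch ('special-pattern?' is a hypothetical name; where
   it gets called depends on how the matcher dispatches):

   (define (special-pattern? pat)
     (and (pair? pat)                ; a list ...
          (procedure? (car pat))))   ; ... whose car is a procedure

   Note that in a quoted rule a name like 'my-match-fun' is just a
   symbol, not a procedure, so rules would have to be built with
   quasiquote, e.g., `((,my-match-fun ?y) ...).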


You can write special-purpose matching functions that are like
'variable-match' and 'segment-match' in that they consume words 
in the input, bind variables if the match succeeds, and return 
the empty set if they fail.  Writing a special-purpose matcher 
like 'synonym-match' involves just this sort of matching.

Note that there's a bug in the pattern matcher: it can't handle a
segment variable right before a 'special' match function.  See the
README file for more detail.  A pattern like 

     (?x (special ... ) ... ) 

will work just fine, but a pattern like 

     ((?* ?x) (special ... ) ... ) 

will not work correctly.

Generalizing the special-purpose matching functions illustrated 
in the project README file (./exercises/final/chatter/README), 
there are several basic types of matching function:

 a. Functions that test for some condition and, on the basis
    of the result of the test, either fail or call the pattern 
    matcher recursively, consuming both input and pattern.

    (define (some-match-function pattern input bindings)
      (if (not (some-input-predicate (car input)))
          fail
          (pattern-match (cdr pattern) 
                         (cdr input) bindings)))

    pattern = ((special some-match-function) ...)

 b. Functions like (a) except that they include a variable as an
    additional argument and, if they succeed, extend the bindings 
    to include an additional binding.

    (define (some-match-function pattern input bindings var)
      (if (not (some-input-predicate (car input)))
          fail
          (pattern-match (cdr pattern) 
                         (cdr input) 
                         (extend-bindings var (car input) bindings))))

    pattern = ((special some-match-function var) ...)

 c. Functions like (a) except that they consume no input and only the
    part of the pattern corresponding to themselves.  These functions are
    present only for their side effects and generally appear at the end of
    the pattern only after all the input has been consumed. 

    (define (some-match-function pattern input bindings)
      (begin
        (some-side-effect bindings)   ; e.g., print some remark
        (pattern-match (cdr pattern) input bindings)))

8. Write a special-purpose matcher that keys off emotional words
   like 'sad' or 'depressed' and outputs either a positive remark like
   'Would you like a handkerchief?' or a negative remark like 'Oh, boo 
   hoo! I'm all broken up.'  This remark will be output in addition to 
   the response for the first rule whose pattern matches the input. 

   > (pattern-match '((?* ?x) I feel (special mean-match ?y) (?* ?z))
                    '(Ever since I turned eighty I feel depressed))
   BAD ALBERT - Oh, boo hoo! Who cares if you're depressed 
   GOOD ALBERT - 
   ((?z) (?y . depressed) (?x Ever since I turned eighty))

   > (let ((input '(Ever since I turned eighty I feel depressed))
           (rules '((((?* ?x) I feel (special mean-match ?y) (?* ?z))
                     (Do you feel ?y often?)
                     (That's interesting that you feel ?y)))))
       (printf "~%ELIZA - ")
       (map (lambda (word) (printf "~a " word)) input)
       (printf "~%ALBERT - ")
       (map (lambda (word) (printf "~a " word)) 
            (apply-rules rules input))
       (printf "~%ELIZA - ...~%") (void))
   ELIZA - Ever since I turned eighty I feel depressed 
   ALBERT - 
   BAD ALBERT - Oh, boo hoo! Who cares if you're depressed 
   GOOD ALBERT - Do you feel depressed often? 
   ELIZA - ...

   Note that not all of the processing has to be embedded in the 
   rules. For example, you could augment the 'albert' function to 
   keep track of how long 'eliza' takes to respond and if it takes 
   longer than, say, two or three seconds, then have 'albert' make 
   some impatient remark.  
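
   Here's a minimal sketch of 'mean-match' in the style of template
   (b) above; 'emotion-words' is an assumed synonym list, and choosing
   at random between the positive and negative remarks is left to you:

   (define emotion-words '(sad depressed miserable))

   (define (mean-match pattern input bindings var)
     (if (not (memq (car input) emotion-words))
         fail                        ; no emotional word: no match
         (begin                      ; emit the remark as a side effect
           (printf "BAD ALBERT - Oh, boo hoo! Who cares if you're ~a~%"
                   (car input))
           (pattern-match (cdr pattern)
                          (cdr input)
                          (extend-bindings var (car input) bindings)))))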


Examples of more advanced natural language processing techniques:


9. Write a single rule whose pattern matches any of the
   following sentences

   I went to the grocery store.
   I walked to a library. 
   I rode to the convenience store.
   I ran to the nearest phone. 

   and generates a response of the form 

   Do you think it was wise to go to [the grocery store | a 
   library | the convenience store | the nearest phone]

   You can think of the verbs 'went', 'walked', 'rode' and 
   'ran' as being all of the same sort: involving the physical 
   transfer of a person from one location to another.  You 
   might identify other sorts of generalized sentence forms 
   and create rules that work for all instances of a given 
   generalized sentence (see Inside Computer Understanding 
   [Schank and Riesbeck 1981] for more on this approach). 
   A simple version of this can be implemented by an extension
   of your special-pattern matcher used to handle synonyms.
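
   For instance, a 'synonym-match' special in the style of template
   (b) above might normalize all the motion verbs to 'go'
   ('motion-verbs' is an assumed synonym list):

   (define motion-verbs '(went walked rode ran))

   (define (synonym-match pattern input bindings var)
     (if (not (memq (car input) motion-verbs))
         fail
         (pattern-match (cdr pattern)
                        (cdr input)
                        (extend-bindings var 'go bindings))))

   so that a pattern like (I (special synonym-match ?v) to (?* ?x))
   matches all four sentences above and binds ?v to 'go'.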


10. How might you write a special-purpose matching function 
   that figures out the object of the second appearance of 
   the preposition 'to' in the following sentences:

   I want to go to a restaurant that serves sushi. 
   I really want to go to bed when I get back to the dorm.
   I want to go to the baseball game this afternoon.

   How would you deal with indefinite articles - 'a' and 'an', the
   definite article - 'the', adjectives like 'some', 'all', 'any' 
   and demonstrative pronouns such as 'this', 'that', 'these', and
   'those'?  In the fields of natural language processing,
   computational linguistics, and information retrieval, researchers
   use programs called 'part-of-speech taggers' to identify the parts
   of speech of words appearing in sentences.  

   You might think about incorporating a part-of-speech tagger into
   your project - it's a bit ambitious, but I'll provide you with 
   some extra help to get you started.  Here's a part-of-speech 
   tagger from Helmut Schmid, University of Stuttgart, Germany, that 
   runs under both Mac OS X and Linux.

   % cd ~tld/bin/tagger/
   % ls
   FILES        bin     install-hints.txt
   LICENSE      cmd     sentence.txt
   README       doc     english.par
   % cat sentence.txt
   The
   boy
   went
   to
   the
   store
   for
   a
   loaf
   of
   bread
   % bin/tree-tagger -quiet english.par sentence.txt
   DT
   NN
   VVD
   TO
   DT
   NN
   IN
   DT
   NN
   IN
   NN
   % bin/tree-tagger -quiet english.par sentence.txt | paste - sentence.txt
   DT   The
   NN   boy
   VVD  went
   TO   to
   DT   the
   NN   store
   IN   for
   DT   a
   NN   loaf
   IN   of 
   NN   bread

   It did a pretty good job! If you want to learn what the tags mean,
   search the web for the Penn Treebank Tagset.  Here are the tags used
   in the example: DT - determiner, NN - noun (singular or mass), IN -
   preposition or subordinating conjunction, TO - the word 'to', and
   VVD - past-tense verb.

   The file ~tld/tmp/exercises/04/12/06/bin/tagger.scm provides 
   convenient access to the tagger in Scheme:

   > (load "~tld/tmp/exercises/04/12/06/bin/tagger.scm")
   > (tagger '(The boy went to the store for a loaf of bread))
   ("DT" "NN" "VVD" "TO" "DT" "NN" "IN" "DT" "NN" "IN" "NN")