Exercises using shell variables, conditionals and foreach loops

# you could store all your information in files
% echo 1 > n
% cat n
1
% rm n
% echo 2 > n
% cat n
2

# in many cases however shell variables are a simpler solution
% set str = one

# use echo to look at the value of the variable
% echo $str
one
% set n = 1
% echo $n
1

# shell variables can be assigned lists (or vectors or arrays)
% set str = "one two three four"

# looks sort of like a list to me
% echo $str
one two three four

# even acts a little like a list
% echo $str[1]
one two three four

# unfortunately not enough like a real list
% echo $str[2]
str: Subscript out of range.

# creating real lists requires some special syntax
% set lst = ( one two three four )

# at first blush the variable behaves as above
% echo $lst
one two three four

# but now we can pick out the individual elements
% echo $lst[1]
one
% echo $lst[2]
two
% echo $lst[4]
four

# however we can't pick out elements that don't exist
% echo $lst[5]
lst: Subscript out of range.

# we can also change the elements to have new values
% set lst[4] = five
% echo $lst[4]
five

# and we can even shift the items in a list to the left
% shift lst
% echo $lst[1]
two

# in shifting left we eliminated one item in the list
% echo $lst[4]
lst: Subscript out of range.

# and the 4th item in the list became the 3rd item
% echo $lst[3]
five

# creating a list of files using the backtick operator (`)
% ls *.mp3
bo_diddley.mp3 james_brown.mp3
% set files = ( `ls` )
% echo $files
bo_diddley.mp3 james_brown.mp3
% echo $files[1]
bo_diddley.mp3
% echo $files[2]
james_brown.mp3

# doing simple math calculations in a shell script
# setting variables and doing simple integer math
% set x = 1
% echo $x
1
% @ x = $x + 1
% echo $x
2
% @ x++
% echo $x
3

# dealing with real numbers (floats) requires other tools
# the shell can't handle real numbers
% @ x = 7.5 * 3.14
@: Badly formed number.

# using 'awk' to do math with real numbers
% echo 7.5 | awk '{ print ( 3.14 * $1 ) }'
23.55

# using 'bc' an arbitrary precision calculator
% set pi = 3.14
% set diameter = 7.5
% echo "$pi * $diameter" | bc
23.55

# conditional statements allow shell scripts to make choices
% ls *.mp3
bo_diddley.mp3 james_brown.mp3
% if ( -e bo_diddley.mp3 ) echo "Hey Bo Diddley!"
Hey Bo Diddley!

# here's a negated conditional statement: ! = negation
% if ( ! -e wayne_newton.mp3 ) echo "Yo! No Wayne."
Yo! No Wayne.

# a multiple-line conditional statement with an else clause
% if ( ! -e james_brown.mp3 ) then
if? echo "Get up off of that thing and dance."
if? else
if? echo "Papa's got a brand new bag!"
if? endif
Papa's got a brand new bag!

# the antecedent can be any command that you like
% if ( -e bo_short.wav ) esdplay bo_short.wav

# a compound conditional statement: && = conjunction
% if ( -e james_brown.mp3 && -e bo_diddley.mp3 ) echo "Dance! You'll feel better."
Dance! You'll feel better.

# 'foreach' statements are for looping over lists
% cd bo_tracks
% ls
bo_backward.wav bo_forward.wav bo_slowly.wav
% foreach file ( `ls` )
foreach? echo $file
foreach? esdplay $file
foreach? end
bo_backward.wav
bo_forward.wav
bo_slowly.wav

In case you're interested, here's some info on working with sounds. The
command 'mpg123' plays audio files in MPEG 1.0/2.0 format (layers 1, 2
and 3). MPEG files have to be decoded before they can be played, and
'mpg123' enables you to convert an audio file to a format that can be
handled by other command line tools.
The -s option pipes the output to the standard output as raw
(headerless) linear PCM audio data, 16 bit, stereo, host byte order. The
-w option converts the file to WAV format (used by Microsoft) and sends
it to the standard output (use - instead of a file name) or writes it to
a specified file. This command also permits you to speed up the sound
(-d n) by only playing every n'th sample or slow it down (-h n) by
playing each sample n times. You can learn more about 'mpg123' at its
web site (http://www.mpg123.de/).

If you really want to manipulate sound files using fancy filters and
synths, learn about 'sox' (for Sound eXchange)
(http://sox.sourceforge.net/), a command line tool that provides lots of
filters, special effects, mixers and synthesizers to play with. You can
use 'sox' to reverse the samples in a sound file, which is particularly
useful in looking for satanic verses hidden in music files. Another
useful sound library, called Enlightened Sound Daemon
(http://www.tux.org/~ricdude/overview.html), provides tools for playing
sounds.

All of these commands are available on the department Linux machines,
but note that while these machines have soundcards (which are necessary
for generating sound) they don't have speakers, so you'll have to bring
headphones if you want to experiment with sound. Here are some short
tips. Use 'info' and 'man' to learn more:

mpg123 - MPEG decoder

mpg123 -w file (--wav)        # write file in wav format (- for standard output)
mpg123 -s file (--stdout)     # write file in headerless PCM, 16 bits, host byte order
mpg123 -d n (--doublespeed)   # play every nth sample
mpg123 -h n (--halfspeed)     # play every sample n times

mpg123 -h 4 bo_diddley.mp3

# here's how I made some of the tracks
mpg123 --wav bo_slowly.wav --halfspeed 8 bo_diddley.mp3

SoX - Sound eXchange - universal sound sample translator

sox in.wav out.wav trim 0:00:00.5 0:00:05.5   # grab a 5 second sample
sox in.wav out.wav reverse                    # reverse the file

sox bo_diddley.wav bo_short.wav trim 0:00:01.5 0:00:08.5

Enlightened Sound Daemon - esd, esdcat, esdfilt, esdplay, esdsample

esdplay bo_short.wav   # plays WAV files and many other formats
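Putting the pieces together, here's a minimal sketch of a script that
combines a 'foreach' loop with 'mpg123' to write a half-speed WAV
version of every MP3 in the current directory. The script name and the
'slow_' output prefix are made up for illustration, and the script
assumes at least one .mp3 file exists.

#!/bin/csh
# slowall: make a half-speed WAV copy of each MP3 in this directory
foreach track ( *.mp3 )
  # the :r modifier strips the .mp3 extension from the file name
  set base = $track:r
  mpg123 --wav slow_$base.wav --halfspeed 2 $track
end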
Supplementary tips and questions for 'Ciphers and Secrets'
http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2

Shell scripts with lists of commands provide one way of stringing
together a sequence of computations. The different computations /
commands are connected together using files and variables - the only
memory that shell scripts have. Variables and files are the means
whereby different computations communicate with one another. In this
exercise, we'll explore another method for computations to communicate,
Unix pipes, which you were introduced to in Chapter 1. As demonstrated
in Chapter 2, pipes allow you to do things that you might accomplish
using intermediate files, but without the clutter and overhead of
saving, reading and deleting files. We'll start by looking at some
commands that work particularly well with pipes.

1. Here are some commands that will come in handy in this exercise.

# 'translate' all uppercase characters to lowercase characters
% cat text.txt | tr "A-Z" "a-z"

What does 'tr "0-9" "a-z"' do?

% echo "3.14" | tr "0-9" "a-z"

What does 'tr "a-z" "0-9"' do?

% cat text.txt | tr "a-z" "0-9"

# delete all occurrences of the numeric characters 0-9
% echo "Pi is approximately 3.14" | tr -d "0-9"

What does 'tr -dc "a-z \n"' do?

% cat text.txt | tr -dc "a-z \n"

# 'squeeze' consecutive spaces into a single space using -s
% echo "compresses    long    sequences    of spaces" | tr -s " "

What does 'tr -s " " "\n"' do?

% cat text.txt | tr -s " " "\n"

# eliminate repeated lines in a sorted list
% echo "a b b c c c z" | tr " " "\n" | uniq

What does 'uniq -c' do?

% echo "a b b c c c z" | tr " " "\n" | uniq -c

# 'explode' a file of lowercase words into characters
% cat text.txt | sed 's/\([a-z ]\)/\1@/g' | tr -s "@" "\n"

Explain how this script works - \(RE\) 'saves' the match in \1. Why not
use sed 's/\([a-z ]\)/\1\n/g'? Note that the -s option squeezes each run
of '@' characters into a single newline.

# 'implode' a file - basically the inverse of 'explode'
% cat text.txt | tr "\n" " " | sed 's/\([A-Za-z]\) /\1/g' | tr -s " "

Explain how this script works.

# 'sort' in reverse numerical order using the first field in a
# file, then resolve any ties by sorting on the second field
% cat text.txt | sort -k1,1nr -k2

This is the sort of special-purpose application of 'sort' that you have
to figure out every time you need to use it.

What does the following one-line shell script do?

% paste abc.txt 123.txt

where abc.txt and 123.txt are text files. Can you guess what the 'cut'
command does?

% paste abc.txt 123.txt | cut -f 2

OK. Now we're going to apply what we learned to the task of creating
ciphers or coded messages. I want you to think in terms of sending text
messages through pipes: encrypting a plain text message using a secret
code at one end of the pipe, sending it through the pipe, and then
decrypting it back into plain text at the other end. We're going to
experiment with 'cracking' a particular class of coding schemes called
'substitution codes.'

Samuel Pepys was a high-level government bureaucrat in the mid 17th
century (he was more or less equivalent to the US Secretary of the
Navy). He kept a diary for ten years beginning in 1660. While often
cited for its accounts of Pepys's extramarital exploits, it provides a
unique glimpse of 17th-century life in London. In his diary, Pepys wrote
a detailed account of the great fire of London and many other important
historical events that would otherwise be lost to us. He also gossiped
about anyone and everyone and revealed his most intimate thoughts.
It's no wonder that he took pains to render his diaries secret.

2. Here's what I did to create the mystery file.

% set code = `cat alpha.txt | explode | shuffle | implode`
% set file = `ls pepys/*.txt | shuffle | head -1`
% cat $file | \
  tr "A-Z" "a-z" | \
  tr -dc "a-z \n" | \
  tr "a-z" "$code" > secret.tmp

This is equivalent to running the 'secret' command as follows:

% secret pepys

In the following steps, I create a bunch of temporary files of the form
*.hst (histograms) and *.tmp. You can delete the lot of them using the
'cleanup' command. You might want to scan the 'README' file in this
directory to learn a little more about the commands (shell scripts) used
in this exercise.

3. Create one big file from all the diary entries in the ./pepys/
directory.

% cat ./pepys/*.txt > pepys.tmp

4. Create a character histogram of the combined entries in ./pepys/.

% charhist pepys.tmp > pepys.chars.hst

5. Create a character histogram of the mystery text.

% charhist secret.tmp > secret.chars.hst

6. Put the histograms side-by-side for making comparisons.

% paste pepys.chars.hst secret.chars.hst > compare.chars.tmp

7. Figure out which character was substituted for the space and then use
'tr' to put the spaces back in the right places.

% cat secret.tmp | tr "#" "#" > secret.space.tmp

This isn't quite right! Figure out what's wrong and fix it. Think about
using uppercase letters to keep track of the substitutions you've made
so far.

8. If you correctly identified the space character, you have a chance to
exploit statistical regularities in word lengths.

% wordhist pepys.tmp > pepys.words.hst
% wordhist secret.space.tmp > secret.words.hst

9. As before, compare the two histograms side-by-side.

% paste pepys.words.hst secret.words.hst > compare.words.tmp

*. From here on out, you're on your own, but here are some suggestions.
Use a combination of word and character histograms to try to guess
character substitutions. Make selected substitutions using 'tr' to
experiment with various decryption alternatives.

There's a list of all the words in the 2nd edition of the Webster
Dictionary in /usr/share/dict/. You can use 'egrep' to find words that
match a specified pattern (called a regular expression). For example,
the invocation

% egrep "\<thr[aeiou][a-z]{1,2}l\>" /usr/share/dict/words
thrail
thrall
thrill
thripel
thronal

matches all words in the dictionary that begin with 'thr' followed by a
vowel, followed by at least one but not more than two letters, followed
by an 'l'.

It's tricky making individual substitutions; if you replace each
occurrence of 's' with 'g' then you'll confuse an inserted 'g' with a
'g' that was already present in the encrypted text. Figure out a way of
handling substitutions that won't confuse things.

The first person who decodes the file mystery.txt gets a perfect grade
on their choice of take-home quizzes. Expanding on this exercise will be
one of the projects you can select for your midterm project. There is a
lot you can do to build better code-cracking tools. Simple substitution
codes are particularly vulnerable to statistical techniques. Think of
some of the many regularities in English words that you might exploit.
Look for common sequences of letters like 'qu', 'ing', 'ed' and 'ie',
and common words like 'the', 'a', 'I' and 'you'. Sometimes you want to
easily test a hypothesis, e.g., do the letters 'jy' map to the letters
'ed'? You might have several such hypotheses. How would you choose
between them? How would you write a script to automate simple hypothesis
generation and evaluation?
You could rank substitutions by the number of words that they reveal in the text. How could you turn this general rule of thumb into a procedure for evaluating substitutions?
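Here's one way to make that concrete - a minimal sketch only (the script
name 'trysub', the temporary file names and the scoring rule are all
made up for illustration). Guessed plaintext letters are given in
UPPERCASE so they can't be confused with not-yet-decoded lowercase
ciphertext, and a hypothesis is scored by counting how many distinct
dictionary words it reveals.

#!/bin/csh
# trysub FROM TO FILE -- e.g., trysub jy ED secret.space.tmp
# TO should be UPPERCASE so decoded letters stand apart from
# ciphertext letters that haven't been decoded yet
set from = $argv[1]
set to = $argv[2]
set file = $argv[3]
cat $file | tr "$from" "$to" > guess.tmp
# score the hypothesis: lowercase the guess, split it into one
# word per line, and count the words found in the dictionary
cat guess.tmp | tr "A-Z" "a-z" | tr -s " " "\n" | sort | uniq > words.tmp
grep -c -x -f words.tmp /usr/share/dict/words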
Some in-class exercises in writing short shell scripts. Exercises 1 and
2 are pretty easy; you should be able to handle each of them with
relative ease. Exercise 3 is more challenging. If you're having trouble
with any of them, please ask for help.

1. Write a script that takes two or more numbers specified as arguments
on the command line and uses a 'while' loop to print out each number
that is evenly divisible by the first number in the list.

% divides 3 5 7 9 11 12
3
9
12

2. Write a script that takes the name of an existing directory (which
we'll refer to as the 'target' directory) as its only argument, creates
a new directory named 'images' in the target directory if a directory of
that name doesn't already exist, and then uses a 'foreach' loop to copy
every file with extension ".gif" or ".jpg" in the target directory to
the 'images' directory. To test your script, create a directory called
'test' and then use the 'touch' command to create a bunch of files with
extension ".gif" and ".jpg".

3. Write a script that takes the names of two text files as its only
arguments. The first file is just an ordinary text file that might
contain, for example, the text of a mail message or a term paper. The
second file is of a special form: each line consists of a commonly
misspelled word followed by a space followed by the correct spelling of
the word, for example

% cat mspl.txt
tey the
teh the
arguement argument
lawer lawyer
plaintif plaintiff

Your script should correct the spelling in the ordinary text file using
the corrections in the file named by the second argument. For example,

% cat msg.txt
teh lawer's arguement was pretty weak
but tey plaintif still got a big settlement
% spellfix msg.txt mspl.txt
the lawyer's argument was pretty weak
but the plaintiff still got a big settlement

Here are some hints that will help in writing your script.

% set fix = ( `cat mspl.txt` )
% echo "$fix[1] => $fix[2]"
tey => the

Note that the following straightforward approach to making the
substitutions doesn't work:

% echo "tey door closed" | sed 's/$fix[1]/$fix[2]/g'
tey door closed

The problem is that 'sed' has its own interpretation of $ (it is used in
a regular expression to match the end of a line) which is different from
the shell's interpretation. Somehow we want to force the shell to
'interpolate' shell variables by looking up their values and inserting
the values into the 'sed' command. One solution involves the careful use
of quotes. Note that

% echo "sed 's/$fix[1]/$fix[2]/g'"
sed 's/tey/the/g'

The outer double quotes render the inner single quotes as literals -
that is to say they don't behave like single quotes usually do in
suppressing the interpolation of shell variables. This means that the
shell variables do get interpolated. Now we can expand on this and use
'eval' to force evaluation of the quoted string as follows:

% eval "echo 'tey door closed' | sed 's/$fix[1]/$fix[2]/g'"
the door closed

The shell command 'eval' is often used in exactly this fashion. There
are many possible solutions to this exercise; I want you to solve this
problem using a 'while' loop that repeatedly uses 'sed' to rewrite the
input file by making a different substitution on each iteration. Note
that you'll probably want to use one or more temporary files to manage
this iterative solution.
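To tie the hints together, here's a rough skeleton of one possible
solution (not the only one; the temporary file names are arbitrary): a
'while' loop that walks the correction list two words at a time and uses
'eval' to apply one 'sed' substitution per pass.

#!/bin/csh
# spellfix TEXT CORRECTIONS -- a sketch along the lines above
set fix = ( `cat $argv[2]` )   # flat list: wrong1 right1 wrong2 right2 ...
cp $argv[1] work.tmp
set i = 1
while ( $i < $#fix )
  @ j = $i + 1
  # interpolate this pair into the sed command, then evaluate it
  eval "sed 's/$fix[$i]/$fix[$j]/g' work.tmp > next.tmp"
  mv next.tmp work.tmp
  @ i += 2
end
cat work.tmp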
Quiz for Monday September 27, 2004 and tips on debugging

I can only assume that most of you are doing just fine in writing shell
scripts; I held more than nine open office hours this past week and only
a few of you came for help. You're going to need to know how to write
shell scripts in order to complete your midterm projects.

I want to remind you that this course does NOT assume any prior
knowledge of programming or computer science. You will learn what you
need to know about programming. Right now we're studying shell scripts
because they provide the best means for you to leverage powerful
libraries. Writing programs in a suitable scripting language is the most
common way that professionals in many disciplines carry out routine
computing tasks in their daily work, whether their work involves
bioinformatics, physics, mathematics, business, engineering or the
social sciences. But you won't learn to write shell scripts if you don't
practice.

There is no reading for Monday, but THERE IS A QUIZ; you're to do the
following two exercises. Print out your solutions and bring them to
class on Monday morning. Here are the exercises:

1. Exercise 1 that we did in class on Monday required you to use a
'while' loop. I want you to redo the exercise but this time use a
'foreach' loop. Here's the statement of the original exercise:

Write a script called 'divides' that takes two or more numbers specified
as arguments on the command line and prints out each number that is
evenly divisible by the first number in the list.

% divides 3 5 7 9 11 12
3
9
12

If you want something a little more challenging, print out only those
numbers that are equal to the first number raised to some (integer)
power.

% powers 3 5 7 9 11 12
3
9

And if that isn't challenging enough, write a sieve for primes:

% primes 2 3 4 5 6 7
2
3
5
7

2. Write a script called 'sumrows' that takes two arguments
corresponding to files containing columns of numbers, and then prints
out the sum of the rows in the two columns. Here's an example
illustrating the script's behavior:

% cat one
3
4
% cat two
5
7
% sumrows one two
8
11

Your solution should exit with an appropriate message if the lengths of
the two columns are different. The keyword 'exit' will force a shell
script to terminate.

In debugging your scripts, here are some common errors to check. Make
sure you DON'T put a dollar sign before a variable when you're setting
it either using 'set' or '@'. Make sure you DO put a dollar sign before
a variable anywhere else you want to refer to its value. Make sure that
you use '@' and not 'set' when you want to do some arithmetic operations
when setting the value of a variable. Don't confuse the different ways
of quoting expressions. In particular don't confuse the backtick
operator ` (which is used to force the evaluation of an expression as in
"set x = `ls`" or "foreach file ( `ls` )") with the single quote, which
can be used when you want to prevent the evaluation of variables.
Remember to end every script with a final carriage return and remember
to end every 'while' or 'foreach' loop with the 'end' keyword. The 'end'
keyword should appear on a line all by itself with no additional spaces.
You can always use 'info csh' to learn more about the C shell.

If you've carefully checked for all the above problems and your script
is still buggy, the next thing to do is to strategically insert some
diagnostic print statements so you can get some idea of what's going on
in your program as it executes.
For example, if you have a script that uses the variable 'foo' and the
program crashes at some point where it refers to '$foo', insert 'echo
$foo' right before the statement that you suspect is causing problems.
You can put this sort of diagnostic print statement anywhere in your
program, including the body of a loop.
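For instance, here's a hypothetical script instrumented with a
diagnostic print inside the loop body; the variable names are made up,
and you'd remove the DEBUG line once the script behaves.

#!/bin/csh
set total = 0
foreach n ( $argv )
  # diagnostic print: show the loop variable and the running
  # total on every iteration
  echo "DEBUG: n = $n ; total = $total"
  @ total += $n
end
echo $total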
More practice writing shell scripts

Here are some additional hints for debugging shell scripts. In previous
exercises, I suggested that you add print statements of the form 'echo
$variable' to examine the value of key variables at strategic points in
your code. I also gave you tips about things to check when you encounter
errors, e.g., make sure that you DON'T use the dollar sign when setting
a variable but that you DO use it when you want to know the value of a
variable.

The 'csh' command takes a number of options but two of them are
particularly useful in debugging. The 'echo' option specified by '-x'
displays each line of the script after variable substitution but before
execution. The 'verbose' option specified by '-v' displays each line of
the script just as you typed it. You can use these options individually
or in combination.

You can apply these options in a number of ways. You can add them to the
directive that appears at the beginning of a shell script as in:

#!/bin/csh -xv

You can also call 'csh' with these options and the name of the file
containing the script you're trying to debug:

% csh -xv script

You can use 'set' and 'unset' to turn the corresponding shell variables
'echo' and 'verbose' on and off selectively, for example, in a
particular block of code where you think you have a problem:

% cat script
#!/bin/csh
...
set verbose
set echo
...              <<< suspected problem area
unset verbose
unset echo
...

Try these two options out so you'll be prepared when you run into a
really nasty bug.

Now let's take a quick review of the major flow-of-control constructs.

'foreach' - do something for each item in a list of items

foreach variable ( list )
  something
  something
end

% set r = 0
% foreach n ( 1 2 3 4 5 6 7 8 9 )
foreach? @ r += $n
foreach? end
% echo $r
45

What do each of the following do? Search forward to the section on
'command substitution' to review the use of the backtick (also called a
backquote) notation.

foreach file ( `ls` )
  echo $file
end

foreach word ( `cat file` )
  echo $word
end

foreach word ( `grep pattern file` )
  echo $word
end

'if' - do something on the condition that

if ( condition ) something

or

if ( condition ) then
  something
  something
endif

or

if ( condition ) then
  something
  something
else if ( condition ) then
  something
  something
endif

% set x = 0
% if ( $x == 0 ) echo "Big fat zero!"
Big fat zero!
% if ( ! ( $x > 0 ) && ! ( $x < 0 ) ) echo "Big fat zero!"
Big fat zero!
% set x = 1
% if ( $x != 0 ) echo "Ain't zero!"
Ain't zero!
% if ( ! ( $x == 0 ) ) echo "Ain't zero!"
Ain't zero!
% if ( ! $x ) echo "Big fat zero!"
% set x = 0
% if ( ! $x ) echo "Big fat zero!"
Big fat zero!

What's going on in the last two conditionals?

1. Write a script 'sorthat' that sorts the words in a specified file
into three new files: a file named 'nums' containing all the words
corresponding to numbers or words beginning with numbers, a file named
'alphs' containing all the words that don't begin with a number and
either begin with the letter 'L' (whether uppercase or lowercase) or
some letter that appears earlier in the alphabet than 'L', and finally a
file named 'bets' containing all the words that begin with the letter
'M' or some letter that appears later in the alphabet than 'M'.
Hopefully my instructions are easier to understand than they were to
compose. You can use the 'tr' (translate) command to convert the
characters in the file to lowercase and eliminate any characters other
than numbers or lowercase letters.
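For example, borrowing from the ciphers exercise, a preprocessing
pipeline along these lines leaves one cleaned-up word per line (the
exact character classes are up to you, and 'file.txt' stands for your
input file):

% cat file.txt | tr "A-Z" "a-z" | tr -dc "0-9a-z \n" | tr -s " " "\n"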
Comparing alphanumeric strings turns out to be a little awkward (the pun
is definitely intended) in the C-shell. The operators '<', '>', '>=' and
'<=' only work on numbers, and, while '==' and '=~' work with strings,
they don't allow us to establish an ordering among strings. The 'awk'
utility comes to the rescue:

% set u = aardvark
% set v = albatross
% echo $u $v | awk '{ print ( $1 < $2 ? $1 : $2 ) }'
aardvark

It looks like we're just using '<' which we had available in the
C-shell, but the 'awk' '<' is more powerful than the 'csh' '<'. The
above invocation is equivalent to the following, which employs slightly
more conventional syntax:

% echo $u $v | awk '{ if ( $1 < $2 ) print $1 ; else print $2 }'
aardvark

We can actually simplify the 'awk' script since all we really care about
is whether or not the first argument 'u' is less than the second
argument 'v'. For this all we need is the following:

% echo $u $v | awk '{ print ( $1 < $2 ) }'
1
% echo $v $u | awk '{ print ( $1 < $2 ) }'
0

The 'awk' command is pretty complicated (it has its own specialized
scripting language) and I don't recommend that you spend a lot of time
learning about its intricacies apart from picking up the few idioms that
I've shown you in various scripts. You can save this as a shell script
and give it a mnemonic name to make it a little more convenient to use.
Note that numbers are always lexicographically 'less' than words
beginning with alphabetic characters.

% cat stringlessthan
#!/bin/csh
echo $1 $2 | awk '{ print ( $1 < $2 ) }'
% stringlessthan aardvark albatross
1
% stringlessthan aardvark 17
0

Now armed with the 'stringlessthan' command this exercise ought to be
relatively straightforward. Go back and look at the ciphers and secrets
exercise (09-20-04.txt) to recall how to use 'tr' to preprocess the
input file.

'while' - do something as long as some condition is met

while ( condition )
  something
  something
end

% set n = 10
% set i = 1
% set r = 0
% while ( $i < $n )
while? @ r += $i
while? @ i ++
while? end
% echo $r
45

You can 'nest' loops; you can have 'foreach' loops inside of 'while'
loops or vice versa. The 'break' command allows you to terminate the
innermost loop in which the 'break' appears. Here's a simple example
illustrating the use of the 'break' statement in checking a list of
numbers for primes:

% cat primes
#!/bin/csh
foreach n ( $argv )
  @ d = $n / 2
  while ( $d > 1 )
    @ r = $n % $d
    if ( $r == 0 ) break
    @ d --
  end
  if ( $d == 1 ) echo $n
end

2. Write a command 'diagonal' that takes a single integer argument and
outputs a diagonal matrix with ones along the diagonal and zeros off the
diagonal. You'll need one new command to write this script. The command
'printf' allows you to print formatted text. While 'echo 1' prints a '1'
it also prints a carriage return (newline). The command 'printf "1 "'
prints a 1 followed by a space but no newline. If you want to print a
new line you would include a newline character ('\n') in the string
argument to 'printf' as in 'printf "0 \n"'. You can read more about
'printf' using 'info' but the above description will suffice for
implementing the 'diagonal' script.

% diagonal 3
1 0 0
0 1 0
0 0 1
% diagonal 5
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1

Command substitution

The backtick or backquote enables you to assign to a string or variable
the output of a command.

% echo "`date +%h` is the `date +%m`th month."
Sep is the 09th month.
When assigned to a variable, the output of a command results in the
variable being assigned a list (or array) of words.

% ls
09-15-04.txt 09-20-04.txt 09-22-04.txt 09-24-04.txt 09-27-04.txt
% set exercises = ( `ls` )
% echo $exercises
09-15-04.txt 09-20-04.txt 09-22-04.txt 09-24-04.txt 09-27-04.txt

As with any list you can determine how many items it contains or refer
to any particular item in the list by its index.

% echo $#exercises
5
% echo $exercises[1]
09-15-04.txt
% echo $exercises[2]
09-20-04.txt

You can also refer to a subsequence of the words in the list by
specifying a beginning and ending index.

% echo $exercises[2-4]
09-20-04.txt 09-22-04.txt 09-24-04.txt

Some advanced topics for the determined and intrepid student.

Using subroutines in shell scripts - advanced topic #1

You've already been using subroutines in shell scripts; every time you
invoke a command inside of a script you're executing a subroutine. In
some cases, we execute a command for its 'side effects'. For example,
you might execute 'mkdir images' to create a directory. The creation of
the new directory is called a 'side effect'. In other cases, e.g.,
command substitution (above), we execute a command for the output
returned to the shell. Unless you take pains to avoid it, the output of
commands is directed to the standard output and you see the output
scrolling across your screen. It doesn't matter whether you execute a
command directly in the shell or indirectly by executing a script which
then executes other commands:

% cat testout
#!/bin/csh
echo "Starting 'testout'"
subtest
echo "Finishing 'testout'"
% cat subtest
#!/bin/csh
echo "Executing 'subtest'"
% testout
Starting 'testout'
Executing 'subtest'
Finishing 'testout'

You can, of course, redirect output to files using '>' or '>>' and you
can pipe the output from one command into another command as in
'ls | wc -l' (which counts the number of files and directories in the
current working directory). You can also use the backtick operator as
just mentioned to set a variable as in 'set var = `ls`' or specify the
list in a 'foreach' loop as in 'foreach var ( `ls` ) ... end'. In the
following, we illustrate how the output generated by one command can
determine the flow of control of another command by using conditional
statements. First we define a command that checks to see if its only
argument is prime.

% cat isprime
#!/bin/csh
set n = $argv
set p = 1
@ d = $n / 2
while ( $d > 1 && $p )
  @ r = $n % $d
  if ( $r == 0 ) set p = 0
  @ d --
end
if ( $p ) then
  echo 1
else
  echo 0
endif

The echo commands are placed so that 'isprime' outputs a one (1) if its
sole argument is prime and a zero (0) otherwise. The 'isprime' command
is now a general utility that we can use in any number of contexts. We
can use the 'isprime' command in a conditional statement:

% if ( `isprime 3` ) echo "Prime!"
Prime!
% if ( ! `isprime 4` ) echo "Composite!"
Composite!

We can also use 'isprime' to write a more compact version of the
'primes' command:

% cat primes.sub
#!/bin/csh
foreach n ( $argv )
  if ( `isprime $n` ) echo $n
end

Exiting from a shell command using 'exit' - advanced topic #2

The 'exit' command allows you to exit from a script at any point.

3. Rewrite the 'isprime' command without the 'p' variable, using 'exit'
to terminate the loop when a divisor other than one is found.

Getting user input in a shell script - advanced topic #3

The pseudo variable '$<' substitutes a line read from the standard
input, with no further interpretation. It can be used to read from the
keyboard in a shell script.
See if you can figure out what the following script does:

% cat guess
#!/bin/csh
set n = `date +"%M"`
while ( 1 )
  set m = $<
  if ( $m > $n ) then
    echo "Too high!"
  else if ( $m < $n ) then
    echo "Too low!"
  else
    echo "You got it!"
    exit
  endif
end

If you're ambitious you might write a shell script that plays a game of
tic-tac-toe with a user.

Using the exit status of a command - advanced topic #4

Every Unix command returns an exit status. By convention, if the command
is successful, it returns an exit status of zero. If the command fails,
then it returns an exit status greater than zero with the exact value
depending on how it failed; for example, 'grep' returns one (1) if it
can't find the pattern it's looking for and two (2) if it can't find the
file supplied as its second argument. In the C-shell the 'status'
variable is set to the exit status of the last command executed:

% grep while isprime
while ( $d > 1 && $p )
% echo $status
0
% grep foreach isprime
% echo $status
1

In writing a C-shell script you can use the keyword 'exit' with an
argument to make the script return a particular exit status based on
whatever you define as success or failure.

4. Rewrite the 'isprime' command yet again; this time write it to have
an exit status of zero (success) if its only argument is prime and one
(failure) otherwise. Your solution should contain the invocations
'exit 0' and 'exit 1' and no 'echo' statements. Here is what we might
expect of your new version of 'isprime':

% isprime 3
% echo $status
0
% isprime 4
% echo $status
1

There is a special syntax that allows you to use the exit status of a
command in an 'if' statement; you use curly braces instead of the usual
parentheses. The logic of using commands as conditionals is a little
twisted however. Zero is false in a conditional while one is true.

% if ( 0 ) echo "No!"
% if ( ! 0 ) echo "No Way!"
No Way!
% if ( 1 ) echo "Way!"
Way!

In evaluating a command, if the command is successful (that is to say it
returns an exit status of zero), then the curly braces tell the shell to
return a one. If the command fails (it returns an exit status greater
than zero), then the curly braces tell the shell to return a zero.

% if { isprime 3 } echo "Prime!"
Prime!

Now we can rewrite the 'primes' command as follows:

% cat primes
#!/bin/csh
foreach n ( $argv )
  if { isprime $n } echo $n
end
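Assuming you've completed exercise 4, the curly-brace syntax also works
with the multi-line 'then'/'else' form. Here's a minimal sketch of a
hypothetical 'report' script (the name is made up) built on the
exit-status version of 'isprime':

#!/bin/csh
# report: classify each argument using isprime's exit status
foreach n ( $argv )
  if { isprime $n } then
    echo "$n is prime"
  else
    echo "$n is composite"
  endif
end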
1. Here's a simple web page written in HTML:

<html>
<head>
<title>A Web page</title>
</head>
<body>
<h1>Heading One</h1>
<p>
This is a paragraph.
</p>
</body>
</html>

Using this as a model, create a home page with some basic contact and
biographical information and then open it using your favorite browser.

2. Consult the on-line HTML help to figure out how to add italics, bold
faced fonts and itemized lists. Learn about 'anchor' tags and add a link
to the course home page from your home page.

3. Define the following concepts: client-server model, communications
protocol.

4. Learn about the network that allows you to access the web from your
dorm room. Explain the idea of network topologies and networks of
networks.

5. Here are the protocols that make the world-wide web possible:

HTTP (hypertext transfer protocol)
FTP (file transfer protocol)
TCP (transmission control protocol)
IP (internet protocol)

Explain how each of these protocols plays a role in the sequence of
events that follows when you click on the anchor text of a link on a web
page displayed in a browser window.

6. HTML (hypertext markup language) is now being supplanted by XHTML.
Use the web to learn about XML and XHTML. What's the difference?
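To get you started on exercise 2 above, here's roughly what the relevant
markup looks like (check the on-line help for the details): <i> for
italics, <b> for bold face, <ul> with <li> for itemized lists, and the
anchor tag <a> for links. The URL below is just an illustration.

<p>
Some <i>italic</i> text and some <b>bold</b> text.
</p>
<ul>
<li>first item</li>
<li>second item</li>
</ul>
<a href="http://www.cs.brown.edu/~tld/talk/home.html">an example link</a>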
Last class we discussed the client-server model, protocols, and the
world-wide web. Your browser (Explorer, Firefox, Mozilla, Safari) is an
HTTP client, which enables it to interact with HTTP servers such as the
Apache web server that serves up web pages for the computer science
department. There are other, simpler HTTP clients such as 'curl' (for
'client URL') that supply more limited functionality:

% curl "http://www.cs.brown.edu/~tld/talk/home.html"

Most browsers do a good deal more than simply request and fetch web
pages; a browser is said to 'parse' HTML documents and then 'render'
them for display in a browser window. In this exercise, we'll write an
HTML parser and simple text-based renderer.

There's a very handy text-based browser called 'lynx' that we'll use as
a model for our code. This first invocation causes 'lynx' to take over a
terminal window running a shell in much the same way as 'info' does when
invoked on the name of a shell command.

% lynx "http://www.cs.brown.edu/~tld/talk/home.html"

The next invocation causes 'lynx' to fetch a specified page, parse the
HTML file, remove the HTML tags and then 'dump' the result to the
standard output. You can redirect the output to a file if you want to
save the result. The -nolist option directs 'lynx' not to list the
hypertext links at the end of the listing as it normally would do.

% lynx -dump -nolist "http://www.cs.brown.edu/~tld/talk/home.html"

Parsing involves revealing the structure of a text. When you parse a
sentence, something that you may have done in grade school, you identify
the subject, verb, direct object, prepositional phrases, etc. and how
they are related to one another. In parsing an HTML document we'll focus
on identifying the HTML tags. Technically this is referred to as
'lexical analysis' but it is generally considered a part of the parsing
process. Its primary purpose for us is to highlight the tags and thereby
simplify subsequent processing.

We're going to use a 'sed' program to implement our parser. A 'sed'
program is basically a list of 'sed' commands in a file. Recall the
syntax of the 'sed' substitution command.

% echo "Damien has TA hours tonight." | sed 's/Damien/Vanessia/g'
Vanessia has TA hours tonight.

The usage 's/PATTERN/REPLACEMENT/g' is one of the most commonly used
'sed' commands; the 's' refers to substitution, the PATTERN is a regular
expression with much the same syntax as in 'grep', and the 'g' is the
'global' option which makes 'sed' perform the substitution for every
occurrence of the pattern in a line of text. We can run a batch of 'sed'
commands at the same time by putting them in a file, instructing 'sed'
to load the file using the -f option, and then applying the commands in
the order they appear in the file to each line in the standard input.

% cat change.sed
s/Damien/Vanessia/g
s/he/she/g
s/his/her/g
% echo "Damien left town; he will miss his class." | sed -f change.sed
Vanessia left town; she will miss her class.

1. Modify the file 'change.sed' so that it correctly handles both the
above example and the following one:

% echo "Damien left town. He will miss his class." | sed -f change.sed
Vanessia left town. She will miss her class.

2. Create a 'sed' program called 'viewpoint.sed' that produces the
following behavior. This one is a little tricky.

% echo "you make me feel like I am tall." | sed -f viewpoint.sed
I make you feel like you are tall.

In parsing an HTML file, we want to convert each HTML tag to a canonical
identifier called a 'token'.
Our parser should ignore the case of tags (though strictly speaking tags
should be in lower case to comply with HTML 4.0 syntax) and it will
ignore most of the options and formatting directives that are embedded
in tags. Here is a simple command that converts the tag that signals the
beginning of an HTML document; the 'g' option isn't really needed here,
as we don't expect multiple occurrences of the 'html' tag, but we will
need it most of the time and it doesn't hurt. The replacement string
adds space before and after the token that we use to mark the occurrence
of the 'html' tag.

s/<html>/ @BEGIN_HTML@ /g

We'll need another command to handle an uppercase tag:

s/<HTML>/ @BEGIN_HTML@ /g

3. How would you handle both lowercase and uppercase tags with a single
'sed' command?

4. That was pretty easy, but how would you manage a 'table' HTML tag
which takes options as in '<table align="center">'? In this case we need
a little more complicated regular expression. Suppose we want our parser
to convert this tag into two tokens:

' @BEGIN_TABLE@ @CENTER@ '

Similarly, '<table align="left">' would be converted into

' @BEGIN_TABLE@ @LEFT@ '

What would the 'sed' command look like?

5. Suppose we want to completely ignore the content of an anchor tag so
that

'<a href="http://www.cs.brown.edu/~tld/my_talk/home.html#name">text</a>'

would be converted to

' @BEGIN_ANCHOR@ text @END_ANCHOR@ '

Write the sequence of 'sed' commands to implement this transformation.

6. What would happen if the '<' of a tag is on one line and the '>' is
on another line? How might you avoid this problem?

7. With the exception of a 'verbatim' mode (implemented using the 'pre'
tag), HTML pretty much ignores white space. It recognizes separate words
but doesn't distinguish among space, tab and linefeed characters or
strings of such characters. All formatting is specified by using
appropriate HTML tags. For example, the 'p' tag is used to format
paragraphs; <p> is used to signal the beginning of a paragraph and </p>
is used to signal the end of a paragraph. Given that spaces, tabs and
linefeeds are irrelevant except to distinguish separate words, we can,
without losing information, convert an HTML document into a list of
words using the backtick operator and then iterate through the list of
words.

Write a script 'format' that takes as its only argument the name of a
file consisting of words with HTML 'p' tags and formats the output with
an indentation of two spaces and a maximum line width of 72 characters.
The text alignment should be ragged right and there should be a blank
line separating paragraphs. Here's the basic skeleton for your script:

#!/bin/csh
set max_width = 72
set line_width = 0
foreach word ( `cat $argv` )
  if ( "$word" == "<p>" ) then
    ... >> print statements <<
  else
    ... >> control logic <<
    ... >> print statements <<
  endif
end

You'll need additional 'if', 'then' and 'else' keywords to implement the
control logic. Recall that 'printf' is more general than 'echo' in that
it doesn't automatically add a linefeed, but you can always add line
feeds by explicitly inserting them into the format string as in
'printf "\nText\n\n"'. You'll want to count the length of each word to
stay within the maximum line width. The word-count command 'wc' counts
words (-w), lines (-l), characters (-m) and bytes (-c). What do the
following incantations accomplish: 'ls *.jpg | wc -l',
'cat file.txt | wc -w' and 'grep "<p>" file.html | wc -l'?
The following script will come in handy in implementing your 'format' command: @ width += `echo $word | wc -m` By the way, you can define an alias for this short script by using the 'alias' command in the C-shell. Many programmers use 'alias' to provide names to commonly used scripts or to redefine commands as in 'alias rm "/bin/rm -i"' which always uses the interactive (safe) option when running 'rm' to delete a file. Typically these aliases are defined in a file called ".alias" which appears in the user's home directory and which is loaded by the user's ".cshrc" file by including the line "source .alias". In defining an alias, '\!^' refers to the arguments to the alias. alias length "echo \!^ | wc -m" Now you can refer to the alias as you would any other command and the name of the alias, if chosen judiciously, provides the reader with a better idea what's going on. @ width += `length $word` Note that in implementing word length in this way we compute the length of a word as the number of characters plus one. (Why is this so and how might we avoid it?) For this application, however, this is exactly what we want. (Why?) Appendix A below contains an example of 'format' input and output. 8. A case statement can come in handy when your flow-of-control logic has to deal with several alternatives. Use 'info csh' to read about the syntax for using 'switch' together with 'case' and then reimplement 'format' using a case statement. The basic skeleton for your script should look something like the following: #!/bin/csh set width = 0 set max_width = 72 foreach word ( `cat $argv` ) switch ( $word ) case '<p>': ... breaksw case '...': ... breaksw default: ... endsw end 9. Now put it all together. Write a 'sed' program that handles 'html', 'head', 'title', 'body', 'h1' (heading), 'h2', 'h3', 'p' and 'a' (anchor) tags and ignores the rest. Then write a renderer that can format appropriate output for these tags. Your renderer should work like 'format' except that it uses the tokens generated by your parser instead of raw HTML tags. If you're ambitious, I'll show you how to get input from the user in a shell script, and then using 'curl' you can actually write a simple browser. A suitably expanded version of this exercise would serve as a midterm project. Appendix A: An example of 'format' input and output % cat paragraphs.html <p> The csh is a command language interpreter incorporating a history mechanism, job control facilities, interactive file name and user name completion, and a C-like syntax. It is used both as an interactive login shell and a shell script command processor. </p> <p> The break flag forces a halt to option processing, causing any further shell arguments to be treated as non-option arguments. The remaining arguments will not be interpreted as shell options. This may be used to pass options to a shell script without confusion or possible subterfuge. </p> <p> The shell repeatedly performs the following sequence of actions: a line of command input is read and broken into words. This sequence of words is placed on the command history list and parsed. Finally each command in the current line is executed. </p> % format paragraphs.html The csh is a command language interpreter incorporating a history mechanism, job control facilities, interactive file name and user name completion, and a C-like syntax. It is used both as an interactive login shell and a shell script command processor. 
  The break flag forces a halt to option processing, causing any
further shell arguments to be treated as non-option arguments. The
remaining arguments will not be interpreted as shell options. This may
be used to pass options to a shell script without confusion or possible
subterfuge.

  The shell repeatedly performs the following sequence of actions: a
line of command input is read and broken into words. This sequence of
words is placed on the command history list and parsed. Finally each
command in the current line is executed.
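Returning to question 3 of this exercise, one common trick is a bracket
expression for each letter, which matches a tag in any mix of cases with
a single 'sed' command. A minimal sketch of the start of such a parser
program might look like this (the token names other than @BEGIN_HTML@
are arbitrary inventions for illustration):

s/<[hH][tT][mM][lL]>/ @BEGIN_HTML@ /g
s/<\/[hH][tT][mM][lL]>/ @END_HTML@ /g
s/<[pP]>/ @BEGIN_PARAGRAPH@ /g
s/<\/[pP]>/ @END_PARAGRAPH@ /g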
In class last Friday, I told you that most web pages on web servers are
written in HTML. This statement was once true but is no longer. Today,
most web pages are generated by running scripts on the web server; these
scripts construct HTML (or XHTML) files (referred to as 'dynamic
content') on the fly and then send them on to the client. The scripts
are written in any number of languages including the shell scripting
language that we've been working in; Perl - a language that is somewhere
between our shell language and C; Python - a more recently introduced
language designed for much the same range of applications as Perl; PHP -
a language designed from the ground up for dynamically generating web
content; and even Scheme - a dialect of Lisp that was introduced in
Chapter 1 and that we'll see later in the course.

In this exercise, you'll write a script that takes as input a table
consisting of database records generated by an SQL query and produces as
output an HTML file featuring an HTML table in which the rows are the
records and the columns are the fields in the database table. We'll
assume that the database table is formatted as it was in our database
exercise: lines are records and the fields are separated using
semicolons as delimiters.

You'll need to know a little more HTML to carry out this exercise. Save
the following HTML code to a file named 'table.html' and then open it
using a browser.

<html>
<head>
<title>Title</title>
</head>
<body>
<table align="center" border="1">
<caption>Table Caption</caption>
<tr> <th></th> <th>COLUMN 1</th> <th>COLUMN 2</th> </tr>
<tr> <th>ROW 1</th> <td>DATUM 1,1</td> <td>DATUM 1,2</td> </tr>
<tr> <th>ROW 2</th> <td>DATUM 2,1</td> <td>DATUM 2,2</td> </tr>
<tr> <th>ROW 3</th> <td>DATUM 3,1</td> <td>DATUM 3,2</td> </tr>
</table>
</body>
</html>

The 'table' tag takes many more options than shown, but the two shown
are pretty common: 'align' can be 'right', 'left' or 'center' and
'border' is an integer greater than or equal to zero. The 'tr' tag
inside a 'table' environment is used to specify a row of data, and the
'td' and 'th' tags are used to display the data for a cell (or box) in
the table, with the 'th' form typically reserved for row or column
labels (in most browsers the 'th' form is displayed in a bold face
type). Experiment with 'table.html' by adding rows and columns and
altering the alignment. You can set the alignment in rows or cells using
the same syntax as shown for the 'table' tag; alignment specifications
in a 'table' tag refer to the alignment of the table on the page while
alignment specifications in 'tr', 'th' and 'td' tags determine the
arrangement of text in cells.

1. Create a table with columns 'First', 'Last', 'Age' and 'Gender'.
Populate the table with 3 rows of made-up data. The column headers
should be in bold faced type and centered in their cells, the first and
last names should be left justified, ages should be right justified and
gender identifiers ('M' or 'F') should be centered in the gender column.

2. Write a script that reads from the standard input the result of an
SQL query and produces as output an HTML document that features an HTML
table displaying that result. For this exercise, the result of the SQL
query is assumed to have one record per line with fields delimited by
semicolons, and you need not label the columns or rows.
Here's an example of the sort of input your script should be able to
handle:

% cat employee
Carla;Carbuncle;21;F
Nathan;Normal;33;M
Alice;Altavista;23;F
Ernest;Sequitur;26;M

In writing this script, it would be nice if you could go through the
file line by line just like the 'awk', 'grep' and 'sed' utilities do.
Fortunately, the C-shell scripting language provides just such an
option: '$<' substitutes a line from the standard input, with no further
interpretation. ('$<' is also useful for reading from the keyboard in a
shell script; check out 'Advanced Topics' (topic #3) in
./../../../doc/scripts.txt for an example of an interactive script that
implements a simple guessing game.)

% cat number
#!/bin/csh
set line = $<
set n = 1
while ( "$line" != "" )
  foreach word ( $line )
    printf "$n - $word\n"
  end
  set line = $<
  @ n ++
end
% cat employee | number
1 - Carla;Carbuncle;21;F
2 - Nathan;Normal;33;M
3 - Alice;Altavista;23;F
4 - Ernest;Sequitur;26;M

That's a start, but how do you read the individual fields in each
record? As a hint, try replacing 'foreach word ( $line )' with
'foreach word ( `echo $line | tr ";" " "` )'. This doesn't quite work
because, as you might recall, we used the semicolon as a delimiter
precisely because if we used the space character as a field separator we
would not be able to distinguish spaces separating words within an
individual field from those separating distinct fields. Judicious use of
'sed' will fix this however.

Once you understand how to break up the input into records (lines) and
fields (sequences of words separated by semicolons), the only tricky
part left involves writing one loop nested inside a second loop to
construct the table. The outer loop reads in a line from the standard
input and produces the required 'tr' (row) tags, while the inner loop
breaks up the line into fields and produces the requisite 'td' (data)
tags and associated text.

3. Rewrite your solution to Exercise 2, but this time assume that the
first two lines of the input file give the names of the fields and the
alignment type for displaying the data in each column. The resulting
HTML file should display the field names at the top of each column and
format the data accordingly.

% cat employee
First;Last;Age;Gender
left;left;right;center
Carla;Carbuncle;21;F
Nathan;Normal;33;M
Alice;Altavista;23;F
Ernest;Sequitur;26;M

4. Rewrite your solution to Exercise 3 to take two optional arguments.
If a second argument appears on the command line, it is assumed to be a
string intended for use as the title of the HTML page. If a third
argument appears on the command line, it is assumed to be a string
intended for use as the caption of the HTML table.

5. Rewrite your solution to Exercise 4 to support two command-line
options. Command-line options can be handled in shell scripts just as
you would handle any argument; typically the script scans 'argv' to
extract options from the rest of the command line arguments, assigning
variables and performing additional computations as appropriate to carry
out the options as specified. Use this alternative directive as the
first line of your script:

#!/bin/csh -b

The '-b' option to the C-shell forces a 'break' from option processing,
causing any further shell arguments to be treated as non-option
arguments. The remaining arguments will not be interpreted as shell
options. This is used to pass options to a shell script without
employing convoluted text processing.
% cat options
#!/bin/csh -b
foreach word ( $argv )
  if ( "$word" == "-u" ) echo "dash ewe"
  if ( "$word" == "-l" ) echo "dash ell"
end
% options -l misc -u misc misc
dash ell
dash ewe

Use the '-d' option to specify an alternative delimiter. Your script
should use the space character as the default delimiter if no '-d'
option is specified. Use the '-a' option to indicate that the line
encoding the alignment directives is present. If the '-a' option is not
present on the command line, then you can assume that the line
specifying alignment directives is not present in the input.

6. Figure out a way to use the script that you wrote for Exercise 5 in
conjunction with the scripts that we wrote in the database exercise to
produce nicely formatted output. (Think about how you might invoke a
browser in a script to display query results.)
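Pulling the hints from this exercise together, here's a rough skeleton
for Exercise 2 - a sketch only: the 'tr'-based field splitting below
still mishandles fields containing spaces, which is where the 'sed' hint
comes in.

#!/bin/csh
echo '<html><body>'
echo '<table border="1">'
set line = $<
while ( "$line" != "" )
  printf "<tr>"
  # naive field splitting; see the 'sed' hint for fields with spaces
  foreach field ( `echo $line | tr ";" " "` )
    printf "<td>$field</td>"
  end
  printf "</tr>\n"
  set line = $<
end
echo '</table>'
echo '</body></html>'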
1. In the following program, each line marked with a "<<<" has a bug.
Briefly describe each bug and the problem it would cause when the
program is run.

#!/bin/csh
set line = $<
echo "<table align="center" border="1">"           <<<
while ( "$line" != " " )                           <<<
printf <tr>\n                                      <<<
foreach word ( `echo $line | sed 's/;/ ;/g'` )     <<<
printf "<td>"                                      <<<
if ( "$word" == " ; " ) then                       <<<
printf "</td>\n"
else
printf $word
endif
end
printf "</tr>\n"                                   <<<
end                                                <<<
echo "</table>"

2. There are multiple ways of doing just about everything. For the
exercise of building an HTML table from a file containing
semicolon-delimited database records, we suggested that you use 'sed' to
expand each record into a list of separate words and then use a
'foreach' loop to build the rows of the table field by field. An
alternative approach would be to use 'sed' to iteratively dissect each
record into fields. Describe what each line of the following program
does:

#!/bin/csh
set line = "Sally Ann;Smith;26;F"
printf "0 - field = [] ; line = [$line]\n"
set line_number = 1
while ( "$line" != "" )
  set field = `echo $line | sed 's/\([^;]*\);.*/\1/'`
  set line = `echo $line | sed 's/'"$field"'[;]*\(.*\)/\1/'`
  printf "$line_number - field = [$field] ; line = [$line] ; \n"
  @ line_number ++
end

3. You can also build records field by field. Here's an example
illustrating how you might do this in a script that interactively adds a
record to a database table. This version assumes that each field name is
a single word. Rewrite the script to support two additional features:
(i) add code so that the script can handle multi-word field names and
(ii) near the end of the script add code to display the completed record
and prompt the user for a response (yes or no) that determines whether
or not the record is actually added to the data file for the
corresponding table.

% ls account.*
account.fields account.data
% cat account.fields
first;last;age;gender
% cat account.data
Sally;Smith;26;F
% cat insert
#!/bin/csh
# the first and only argument specifies a table
set table = $argv
# grab the field names for the specified table
set fields = ( `cat $table.fields | sed 's/;/ /g'` )
# create an empty record
set record = ""
# build the record field by field
foreach field ( $fields )
  # prompt the user for the value of a field
  printf "enter $field = "
  # append the entered value to the record
  set record = "$record;$<"
end
# strip off the leading semicolon and append
# the completed record to the table data
echo $record | sed 's/;//' >> $table.data
% insert account
enter first = Freddy
enter last = Jones
enter age = 21
enter gender = M
% cat account.data
Sally;Smith;26;F
Freddy;Jones;21;M
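As a hint for feature (ii) of exercise 3, the last line of 'insert'
might be replaced by something along these lines (a sketch; the prompt
wording and variable name are arbitrary):

# show the completed record and ask before saving it
echo "record: $record"
printf "add this record (yes/no)? "
set answer = $<
if ( "$answer" == "yes" ) then
  echo $record | sed 's/;//' >> $table.data
endif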
The advantages and shortcomings of shell scripting languages.

As a shell, many programmers prefer the C-shell. The C-shell was written
by Bill Joy, and the interactive extension 'tcsh' provides an
environment well suited to program development. As a scripting language,
some programmers prefer the scripting language provided in the Bourne
again shell 'bash'. Bash has several advantages over the C-shell: it is
distributed under the GNU General Public License, it is the default
shell under most Linux distributions, it has a more expressive scripting
language and avoids some of the quirks of the C-shell, and it combines
many of the best features of the C-shell, the Korn shell 'ksh' written
by David Korn, and the original Bourne shell 'sh' written by Steve
Bourne. Bash has a few disadvantages as well, the primary one for me
being that 'bash' is big and complex. I like the spare elegance of the
C-shell while admitting to its shortcomings as a scripting language. My
philosophy, however, is that if the job is too complicated to program
elegantly in the C-shell, then I'd prefer to write it in a scripting
language like Perl or Python or in a general-purpose language like C,
Java or Scheme.

I like to use shell scripts to customize my working environment.
Anything that I have to do more than a few times is a candidate for a
shell script. In addition to storing a few scripts in my '~/bin'
directory, I also have 'bin' directories sprinkled throughout my home
directory tree that are used for particular purposes. Shell scripts
allow me to combine commands built into the shell with commands that
others write and my own personal commands. The directory tree in which I
wrote Talking With Computers has over a thousand individual files and
more than two hundred directories:

% find ~/book -type f -print | wc -l
1369
% find ~/book -type d -print | wc -l
220

I have shell scripts that put the book together out of hundreds of
pieces using the same technologies (version control systems and 'make'
files) that programmers use to build large software systems. Other
scripts recompile the book web site and then install the web pages on
the department web server. These scripts save me effort by performing
repetitive tasks and they also prevent errors - all I have to do is make
sure that I'm very careful when writing the scripts.

In the remainder of this course, the in-class exercises will turn to the
use of Scheme, a dialect of Lisp, which was one of the very first
general-purpose programming languages developed. John McCarthy is
credited with the invention of Lisp, though he never really intended it
to be used for practical programming; for McCarthy it was an elegant
mathematical language for proving theorems about computability. But just
because we're adopting another programming language doesn't mean that we
can't use that language to write shell scripts. Indeed it is as easy to
write and use Scheme-based shell scripts as it is to write and use
C-shell, bash, Perl or Python-based shell scripts; easier, I would say,
given the elegance and power of Scheme. Here are some simple examples of
Scheme-based shell scripts:

% ls
args hello pipes
% cat hello
#! /usr/local/plt/bin/mzscheme -qr
;; This script implements the classic 'hello world' program.
(printf "Hello world!\n")
% hello
Hello world!
% cat args
#! /usr/local/plt/bin/mzscheme -qr
;; This scheme shell script prints out its arguments to the
;; standard output numbered as they appear on the command line.
(let ((n 1))
  (map (lambda (arg)
         (printf "~a - ~a~%" n arg)
         (set! n (+ n 1)))
       (vector->list (current-command-line-arguments))))
% args 1 2 3 four
1 - 1
2 - 2
3 - 3
4 - four
% cat pipes
#! /usr/local/plt/bin/mzscheme -qr
;; Here is a scheme shell script that reads lines from the
;; standard input and then prints each line preceded with
;; a number indicating its order in the input sequence.
(let loop ((n 1))
  (let ((line (read-line)))
    (if (eof-object? line)
        null
        (begin
          (printf "~a - ~a~%" n line)
          (loop (+ n 1))))))
% ls | pipes
1 - args
2 - hello
3 - pipes

Note that Scheme-based shell scripts take command-line arguments and can
be used with Unix pipes to read from the standard input and write to the
standard output. I expect the syntax looks pretty foreign to you, but I
think you'll find that Scheme syntax is pretty simple (for one thing
there isn't a lot of it) and you'll find Scheme programming environments
more powerful and easier to use in debugging programs. Here are some
things that we'll gain by moving to Scheme as a programming language:

- a more expressive and more powerful programming language
- a language designed for developing larger applications
- a powerful set of libraries for building new applications
- a graphical programming environment that ports easily to all major
  platforms including Windows, Linux and Mac OS X
- a rapid-prototyping language of proven power and value

It is true that programming in C, C++, C#, Java and even Perl is a more
marketable skill, but we're not trying to get you a job; we're trying to
provide you with the skills to make computers work for you, and Scheme
has few peers when it comes to ease of learning and the ability to
support rapid software development. Whether you're a biologist working
in genomics, an analyst working in the financial markets or an engineer
sorting through experimental data, Scheme is a fine tool for practical
programming. And besides that, it is an ideal first general-purpose
programming language to learn.

You can write, debug and use Scheme programs much as you write, debug
and use C-shell programs by using Emacs and an interactive shell. Here's
a short transcript illustrating how you might interact with the Scheme
'interpreter', a program that, much like the C-shell, issues a prompt,
reads expressions typed by the user, evaluates them and prints any
output:

% which mzscheme
/usr/local/plt/bin/mzscheme
% mzscheme
Welcome to MzScheme version 208, Copyright (c) 2004 PLT Scheme, Inc.
> (printf "This is my first line of input.~%")
This is my first line of input.
> (+ 3.14 2.71 1/3 5)
11.183333333333334
> (* 7 (expt 3.14 2.71))
155.51682682201624
> (map + '(1 2 3 4) '(1 2 3 4))
(2 4 6 8)
> (map expt '(2 2 2 2) '(1 2 3 4))
(2 4 8 16)
> (regexp-replace "Damien" "Damien has hours tonight." "Vanessia")
"Vanessia has hours tonight."

There are a number of things to notice about this short session. Some of
what I typed must look familiar: 'printf' looks like it works pretty
much the same as the shell variant. Of course, there are lots of
parentheses sprinkled about, and indeed parentheses are ubiquitous in
Scheme; they constitute the most common method of defining syntax in
Lisp and most of its dialects. You'll also notice that mathematical
operators are written 'prefix' instead of 'infix', a convention referred
to as Polish notation. But what is more interesting is that Scheme is
pretty capable and accommodating when it comes to math; I can add
integers, rationals and real numbers without invoking any special
programs, and operators like exponentiation 'expt' come standard.
The two examples involving 'map' illustrate a method of iteration that appears different from a 'foreach' loop, but is actually related though much more powerful. Finally, there are built-in tools for pattern matching that are every bit as powerful as those available in the C-shell. The program that I invoked from the shell was called 'mzscheme' (as in "Ms. Scheme") and is one of a family of Scheme related open-source programs developed by a group of programmers and programming language experts affiliated with PLT Scheme Inc. PLT also provides a graphical programming environment called DrScheme that runs on all the major platforms (Windows, Linux, Apple, Unix) and that we recommend you use when working in the lab and that you download and install on your personal computer. DrScheme is available at http://www.drscheme.org/. The sophisticated procedures for doing math, pattern matching and the like provide the basis for writing powerful tools that can be loaded into Scheme in the form of libraries. These libraries extend the syntax and the power of Scheme to suit particular purposes. There are Scheme libraries for working with databases, building spiders that search the web, creating fancy graphical front ends for programs, and a host of other applications.
The Midterm Projects are due by class time this coming Monday, October 22. You don't actually have to hand in anything. We expect that all of your project materials - code, documentation, sample data, etc - are in a directory called 'midterm' in your top-level directory on the department machines. DON'T MODIFY ANY FILES OR SUBDIRECTORIES WITHIN THIS DIRECTORY AFTER 10:00AM ON MONDAY, OCTOBER 22, 2004. We will use the last modification dates of the files and directories in 'midterm' to validate that you 'turned in' your project on time. There have been more than 14 hours of help available to you this week and Damien will have hours Sunday evening. As always, you can send email to cs009-2tas with course-related questions. I've made sure that any help I give to one of you becomes available to all of you by my updating the project README files. The latest versions of the project README files are in the usual place:

Account Management - ~/tmp/exercises/projects/accounts/README
Cryptanalyst's Tool Kit - ~/tmp/exercises/projects/ciphers/README
Dynamic Web Content - ~/tmp/exercises/projects/photos/README

These files have tips on how to structure your projects, solve specific problems, relevant commands you might want to use, and suggestions on how to 'spiff' up your projects. For example, my implementation of the 'select' command for the 'Account Management' project uses an 'awk' script to 'pretty print' the database tables. Feel free to use this for any of your projects:

% select first last address from account sort by last
----------+-------------+--------------
Frodo     | Baggins     | Hobbit Town
----------+-------------+--------------
Bilbo     | Baggins     | Hobbit Town
----------+-------------+--------------
Hermione  | Granger     | Hogworts
----------+-------------+--------------
Harry     | Potter      | Hogworts
----------+-------------+--------------
Wally     | Ringwraith  | Mordor
----------+-------------+--------------

We've provided lots of help and allowed you to collaborate with one another, but we expect that your final project is your work alone. In particular, don't copy code from your classmates or take code from sources other than those we provide. Recall that you will have to demonstrate your code to a TA and they will ask you questions about how your code works. In addition to having working code, we expect your 'midterm' directory to be coherently organized and documented with a README file just like the files and directories in ~/tmp/exercises/ are structured. All of your scripts should be in a 'bin' directory inside of 'midterm' and this 'bin' directory should have a README file that provides a high-level description of each command, e.g., see './../../../04/09/20/bin/README' for an example of what we're looking for. Each of your shell scripts should be carefully documented, e.g., see './../../../projects/accounts/bin/select' for an example. Finally, you should have a 'midterm/doc' directory with documentation for the project; this will depend on the project and doesn't have to be lengthy, but it should describe the project in your own words and discuss any extensions, simplifications or modifications you might have made.

Next week we will start working on Scheme. You should read Chapter 5 'Computational Muddles' for Monday; there won't be a quiz for Monday since you have the midterm to turn in but we will have a quiz on this chapter for next Wednesday.
In the remainder of this class we'll talk a bit about different programming languages, what shell scripts are good for and why you might want to learn to program in a general-purpose language like C, Java or Scheme.

The advantages and shortcomings of shell scripting languages. As a shell, many programmers prefer the C-shell. The C-shell was written by Bill Joy and the interactive extension 'tcsh' provides an environment well suited to program development. As a scripting language, some programmers prefer the scripting language provided in the Bourne again shell 'bash'. Bash has several advantages over the C-shell: it is distributed under the GNU General Public License, it is the default shell under most Linux distributions, it has a more expressive scripting language and avoids some of the quirks of the C-shell, and it combines many of the best features of the C-shell, the Korn Shell 'ksh' written by David Korn, and the original Bourne shell 'sh' written by Steve Bourne. Bash has a few disadvantages as well, the primary one for me being that 'bash' is big and complex. I like the spare elegance of the C-shell while admitting to its shortcomings as a scripting language. My philosophy, however, is that if the job is too complicated to program elegantly in the C-shell, then I'd prefer to write it in a scripting language like Perl or Python or in a general-purpose language like C, Java or Scheme.

I like to use shell scripts to customize my working environment. Anything that I have to do more than a few times is a candidate for a shell script. In addition to storing a few scripts in my '~/bin' directory, I also have 'bin' directories sprinkled throughout my home directory tree that are used for particular purposes. Shell scripts allow me to combine commands built into the shell with commands that others write and my own personal commands. The directory tree in which I wrote Talking With Computers has over a thousand individual files and more than two hundred directories:

% find ~/book -type f -print | wc -l
1369
% find ~/book -type d -print | wc -l
220

I have shell scripts that put the book together out of hundreds of pieces using the same technologies (version control systems and 'make' files) that programmers use to build large software systems. Other scripts recompile the book web site and then install the web pages on the department web server. These scripts save me effort by performing repetitive tasks and they also prevent errors - all I have to do is make sure that I'm very careful when writing the scripts.

In the remainder of this course, the in-class exercises will turn to the use of Scheme, a dialect of Lisp, which was one of the very first general-purpose programming languages developed. John McCarthy is credited with the invention of Lisp though he never really intended it to be used for practical programming; for McCarthy it was an elegant mathematical language for proving theorems about computability. But just because we're adopting another programming language doesn't mean that we can't use that language to write shell scripts. Indeed it is as easy to write and use Scheme-based shell scripts as it is to write and use C-shell, bash, Perl or Python-based shell scripts; easier I would say given the elegance and power of Scheme. Here are some simple examples of Scheme-based shell scripts:

% ls
args hello pipes
% cat hello
#! /usr/local/plt/bin/mzscheme -qr
;; This script implements the classic 'hello world' program.
(printf "Hello world!\n")
% hello
Hello world!
% cat args
#! /usr/local/plt/bin/mzscheme -qr
;; This scheme shell script prints out its arguments to the
;; standard output numbered as they appear on the command line.
(let ((n 1))
  (map (lambda (arg) (printf "~a - ~a~%" n arg) (set! n (+ n 1)))
       (vector->list (current-command-line-arguments))))
% args 1 2 3 four
1 - 1
2 - 2
3 - 3
4 - four
% cat pipes
#! /usr/local/plt/bin/mzscheme -qr
;; Here is a scheme shell script that reads lines from the
;; standard input and then prints each line preceded with
;; a number indicating its order in the input sequence.
(let loop ((n 1))
  (let ((line (read-line)))
    (if (eof-object? line)
        null
        (begin (printf "~a - ~a~%" n line) (loop (+ n 1))))))
% ls | pipes
1 - args
2 - hello
3 - pipes

Note that Scheme-based shell scripts take command-line arguments and can be used with Unix pipes to read from the standard input and write to the standard output. I expect the syntax looks pretty foreign to you, but I think you'll find that Scheme syntax is pretty simple (for one thing there isn't a lot of it) and you'll find Scheme programming environments more powerful and easier to use in debugging programs. Here are some things that we'll gain by moving to Scheme as a programming language:

- a more expressive and more powerful programming language
- a language designed for developing larger applications
- a powerful set of libraries for building new applications
- a graphical programming environment that ports easily to all major platforms including Windows, Linux and Mac OS X
- a rapid-prototyping language of proven power and value

It is true that C, C++, C#, Java and even Perl programming are more marketable skills, but we're not trying to get you a job, we're trying to provide you with the skills to make computers work for you and Scheme has few peers when it comes to ease of learning and the ability to support rapid software development. Whether you're a biologist working in genomics, an analyst working in the financial markets or an engineer sorting through experimental data, Scheme is a fine tool for practical programming. And besides that it is an ideal first general-purpose programming language to learn. You can write, debug and use Scheme programs much as you write, debug and use C-shell programs by using Emacs and an interactive shell. Here's a short transcript illustrating how you might interact with the Scheme 'interpreter', a program that, much like the C-shell, issues a prompt, reads expressions typed by the user, evaluates them and prints any output:

% which mzscheme
/usr/local/plt/bin/mzscheme
% mzscheme
Welcome to MzScheme version 208, Copyright (c) 2004 PLT Scheme, Inc.
> (printf "This is my first line of input.~%")
This is my first line of input.
> (+ 3.14 2.71 1/3 5)
11.183333333333334
> (* 7 (expt 3.14 2.71))
155.51682682201624
> (map + '(1 2 3 4) '(1 2 3 4))
(2 4 6 8)
> (map expt '(2 2 2 2) '(1 2 3 4))
(2 4 8 16)
> (regexp-replace "Damien" "Damien has hours tonight." "Vanessia")
"Vanessia has hours tonight."

There are a number of things to notice about this short session. Some of what I typed must look familiar: 'printf' looks like it works pretty much the same as the shell variant. Of course, there are lots of parentheses sprinkled about and indeed parentheses are ubiquitous in Scheme; they constitute the most common method of defining syntax in Lisp and most of its dialects. You'll also notice that mathematical operators are written 'prefix' instead of 'infix', a convention referred to as Polish notation.
But what is more interesting is that Scheme is pretty capable and accommodating when it comes to math; I can add integers, rationals and real numbers without invoking any special programs and operators like exponentiation 'expt' come standard. The two examples involving 'map' illustrate a method of iteration that appears different from a 'foreach' loop, but is actually related though much more powerful. Finally, there are built-in tools for pattern matching that are every bit as powerful as those available in the C-shell.

The program that I invoked from the shell was called 'mzscheme' (as in "Ms. Scheme") and is one of a family of Scheme-related open-source programs developed by a group of programmers and programming language experts affiliated with PLT Scheme Inc. PLT also provides a graphical programming environment called DrScheme that runs on all the major platforms (Windows, Linux, Apple, Unix); we recommend that you use it when working in the lab and that you download and install it on your personal computer. DrScheme is available at http://www.drscheme.org/.

The sophisticated procedures for doing math, pattern matching and the like provide the basis for writing powerful tools that can be loaded into Scheme in the form of libraries. These libraries extend the syntax and the power of Scheme to suit particular purposes. There are Scheme libraries for working with databases, building spiders that search the web, creating fancy graphical front ends for programs, and a host of other applications.
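To make the 'map'/'foreach' connection above a little more concrete, here is a small sketch of my own (not part of the original session): applying a one-argument 'lambda' with 'map' visits each element of a list in turn, much as a 'foreach' loop visits each word in a list.

;; the Scheme expression ...
(map (lambda (x) (* x 10)) '(1 2 3))
;; => (10 20 30)
;; ... plays roughly the role of the C-shell loop:
;;   foreach x ( 1 2 3 )
;;     @ y = $x * 10
;;   end
;; except that 'map' also collects the results into a list.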
You can implement a powerful database system using just a handful of standard shell commands. Most of the work is involved in setting up the calls to 'sort', 'cut' and 'join' by mapping symbolic field names to integer offsets into the text records. It took around 400 lines of code and less than a day to write this, and more than 350 of those lines were required to parse the input and map the field names. This attests to the power of the C-shell scripting language. Add the many libraries available for manipulating images, sounds, HTML, etc and you can see why scripting languages are so useful in everyday computing. Here's a simple demonstration of the 'select' and 'insert' commands:

% cd /u/tld/tmp/exercises/projects/accounts/bin
% ls
insert pretty restore select simple
% select account
----+--------+-------+----------
 id | first  | last  | address
----+--------+-------+----------
% select all from account
---+-----------+-------------+------------------------------
 1 | Frodo     | Baggins     | Hobbiton Village, Shire
---+-----------+-------------+------------------------------
 2 | Bilbo     | Baggins     | Hobbiton Village, Shire
---+-----------+-------------+------------------------------
 3 | Harry     | Potter      | Hogworts School of Wizardry
---+-----------+-------------+------------------------------
 4 | Hermione  | Granger     | Hogworts School of Wizardry
---+-----------+-------------+------------------------------
 5 | Wally     | Ringwraith  | Minas Morgul, Morgul Vale
---+-----------+-------------+------------------------------
% insert account
enter first = Freddy
enter last = Jones
enter address = Orient, North Carolina
---+---------+--------+-------------------------
 7 | Freddy  | Jones  | Orient, North Carolina
---+---------+--------+-------------------------
insert: add to 'account' (yes or no): yes
The record was added to the database.
% select first address from account sort by first
----------+------------------------------
Bilbo     | Hobbiton Village, Shire
----------+------------------------------
Freddy    | Orient, North Carolina
----------+------------------------------
Frodo     | Hobbiton Village, Shire
----------+------------------------------
Harry     | Hogworts School of Wizardry
----------+------------------------------
Hermione  | Hogworts School of Wizardry
----------+------------------------------
Wally     | Minas Morgul, Morgul Vale
----------+------------------------------
% select first last sale.product from account cross sale on id = account
----------+-------------+---------------------------
Frodo     | Baggins     | Big Book of Spiders
----------+-------------+---------------------------
Bilbo     | Baggins     | Magic Ring Tricks
Bilbo     | Baggins     | Introductory Elvish
----------+-------------+---------------------------
Harry     | Potter      | Potions for Dummies
----------+-------------+---------------------------
Hermione  | Granger     | One Hundred Nifty Spells
----------+-------------+---------------------------
Wally     | Ringwraith  | Planning a Second Career
----------+-------------+---------------------------
% select -p first last sale.product from account cross sale on id = account
Frodo;Baggins;Big Book of Spiders
Bilbo;Baggins;Magic Ring Tricks
Bilbo;Baggins;Introductory Elvish
Harry;Potter;Potions for Dummies
Hermione;Granger;One Hundred Nifty Spells
Wally;Ringwraith;Planning a Second Career
% select -p first last address from account | grep "Hob[a-z]*n" | pretty
-------+----------+--------------------------
Frodo  | Baggins  | Hobbiton Village, Shire
-------+----------+--------------------------
Bilbo  | Baggins  | Hobbiton Village, Shire
-------+----------+--------------------------

Whether your project was the Account Management System, the Cryptanalyst's Tool Kit or Dynamic Web Content, you likely learned a lot about how to make a shell do your bidding. You probably also encountered several of the limitations of basic scripting languages: the variety of syntax used in different tools like 'awk', 'sed' and 'grep', and the contortions you have to go through in some cases to pass information back and forth between the parent shell and the subshells in which commands are run. Did any of you notice how variables in a parent shell are isolated from those in a subshell? Because the variable-passing conventions in most shells are relatively primitive, files and pipes are often the best way of passing information from one part of the computation to another. Even simple data structures like arrays don't always behave as one might expect. These are just some of the reasons why programmers turn to general-purpose languages like C, C++, C#, Java and Scheme for implementing large software projects. Here are some things that are easy in these general-purpose languages but difficult in shell programming.

>>>> mapping field names to record offsets <<<<

Let's say you have a list of field names:

(define fields '( id first last address ))

Here's how you would create a 'field map':

(define field_map
  (let ((i 0))
    (map (lambda (field) (set! i (+ i 1)) (list field i))
         fields)))

Here's how you define a general-purpose index function:

(define (index field map) (cadr (assoc field map)))

And here's how you would use the index function with the field map:

(index 'first field_map)

In the second half of this course, we'll show you how to use Scheme to build abstractions similar to 'index' for all sorts of tasks from information retrieval to building web pages.
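If you type those definitions into mzscheme, here is roughly what you should see (a sketch of my own, not part of the original notes):

> field_map
((id 1) (first 2) (last 3) (address 4))
> (index 'first field_map)
2
> (index 'address field_map)
4

With a map like this in hand, a 'select' on the 'first' and 'address' fields of a record reduces to asking for offsets 2 and 4.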
Make sure that your DrScheme language is set to 'Pretty Big (includes MrEd and Advanced)'. To do so click on the 'Language' menu in DrScheme and you'll find 'Pretty Big (includes MrEd and Advanced)' under 'Professional Languages' (you'll have to click on the 'PLT' tab to show the 'PLT' family of languages). The 'Student Languages' impose too many restrictions and cause more confusion than they are worth. Once you make this change you'll find that 'first' and 'rest' are not defined. So, either use 'car' and 'cdr' or define 'first' and 'rest'. To do the latter, open a file in DrScheme, type in the following two definitions and then 'Execute' the file.

(define first car)
(define rest cdr)

Here are the notes that were handed out in class along with some additional help and hints to make the assignment easier. Please follow the instructions; there really is method in my seeming madness - I'm not just assigning you busywork. I want you to actually type the expressions to the DrScheme interpreter; don't just read the text. Typing the expressions to the interpreter is a very effective learning strategy; it is just like typing sentences in an unfamiliar foreign language - since you can't rely on filling in the details from your knowledge of your native language, you have to attend to the syntax carefully. Trust me it works. If you make a typing mistake, Scheme will likely complain and you'll have to retype the expression thereby learning something in the process. These exercises are designed with this learning strategy in mind. Look carefully at the examples. If you understand the examples, it should be easy to generalize enough to solve the exercise problems. If you really want to know more about the commands, then use the 'Help Desk' under the DrScheme 'Help' menu, but only do that as a last resort. These exercises do assume that you've read Chapter 5 in the text.

Basic (Primitive) Data Types

Symbols: any sequence of characters including at least one character other than a digit. There are some exceptions: you can't include the single quote, double quote, back quote or any parentheses {}, [], (). In addition, the first character of a symbol can't be a sharp sign #. To define a symbol 'name' to have the value of some 'expression', type (define name expression). Try variations on the following:

> (define foo 1)
> foo
1
> (define foo_bar 2)
> foo_bar
2
> (define 1_foo 3)
> 1_foo
3
> (define n (+ 1 2 3))
> n
6

If you type a defined symbol to Scheme, it will respond by printing the 'value' of the symbol. If you type a symbol that hasn't been defined, Scheme will complain:

> bar
reference to undefined identifier: bar

So far we've been defining symbols to have values corresponding to integers. There are other sorts of numbers available in Scheme including rationals, real numbers in decimal notation and various kinds of floating point numbers:

> (define third 1/3)
> third
1/3

Another primitive data type is the string; in Scheme, strings are always enclosed in double quotes:

> (define my_name "Tom Dean")
> my_name
"Tom Dean"

Unlike symbols, which have values assigned to them by 'define', strings and numbers always evaluate to themselves:

> "Vanessia Wu"
"Vanessia Wu"
> 2.712
2.712

The single quote is shorthand for the 'quote' function. The quote function suppresses evaluation.
The following two invocations are equivalent:

> 'foo
foo
> (quote foo)
foo

Note that by quoting a symbol you can refer to the symbol rather than any value that it might have; you can also refer to a symbol that doesn't have a defined value. Every symbol has a name but not all symbols have values.

List Processing in Scheme

Finally, one of the most important data types in Scheme is the list; lists are used to store information, create more complex data structures and even represent programs. The 'list' function takes any number of arguments and constructs a list whose elements correspond to the values of its arguments:

> (list 1 2 3)
(1 2 3)
> (define ones (list "one" 1 'one ))
> ones
("one" 1 one)
> (define twos (list (+ 1 1) (/ 6 3) (/ (* 2 3) 2)))
> twos
(2 2 2)

Reflect on the last two definitions; the 'list' function evaluates its arguments (arguments are like the words to the right of a shell command). The strings and the numbers evaluated to themselves and we used the quote to suppress the evaluation of the symbol 'one'. You can 'nest' lists simply by calling the 'list' function as one of the arguments to 'list':

> (define nest (list 1 2 (list 3 4) 5 6))
> nest
(1 2 (3 4) 5 6)

You can also create lists using quote; in this case, the resulting list is specified literally and no evaluation takes place:

> (define nums '(1 2 3 4))

Notice what happens when we incorrectly try to create a nested list inside a list specified with quote:

> (define nest '(1 2 (list 3 4) 5 6))
> nest
(1 2 (list 3 4) 5 6)

To achieve the intended effect we just specify what we want literally:

> (define nest '(1 2 (3 4) 5 6))
> nest
(1 2 (3 4) 5 6)

It doesn't do much good to create lists if you can't subsequently take them apart and access their contents. The most basic functions for accessing lists are called 'car' and 'cdr'. These names refer to locations called 'registers' in the memory of a now defunct computer. You can use these names if you like but you can also give them more mnemonic names. The 'car' of a list is the first element of the list and the 'cdr' of a list is the list consisting of all the elements of the list but the first one. As mentioned in class, programmers often define 'first' and 'rest' as aliases for 'car' and 'cdr':

> (define first car)
> (define rest cdr)

What do you think the value of 'car' is?

> car
#<primitive:car>
> cdr
#<primitive:cdr>

This indicates that 'car' and 'cdr' are defined as primitive functions. And now we've defined 'first' and 'rest' to have the same definitions as 'car' and 'cdr' respectively.

> first
#<primitive:car>
> rest
#<primitive:cdr>

Now we can use these functions to access the contents of lists:

> (define nums '(1 2 (3 4) 5 6))
> nums
(1 2 (3 4) 5 6)
> (first nums)
1
> (rest nums)
(2 (3 4) 5 6)

We can use various combinations of 'first' and 'rest' to access any element or sublist of a given list. For example, suppose that you want to grab the 4 in '(0 (1 (3 4)) 5). First we'll do it in several steps:

> (rest '(0 (1 (3 4)) 5))     <<< step 1
((1 (3 4)) 5)
> (first '((1 (3 4)) 5))      <<< step 2
(1 (3 4))
> (rest '(1 (3 4)))           <<< step 3
((3 4))
> (first '((3 4)))            <<< step 4
(3 4)
> (rest '(3 4))               <<< step 5
(4)
> (first '(4))                <<< step 6
4

Now we put this all together and apply these steps in the order of evaluation, which actually looks as if we're doing it backwards or at least inside out, but you should convince yourself that this is actually in the appropriate evaluation order.

   step 6 step 5  step 4 step 3  step 2 step 1
> (first (rest (first (rest (first (rest '(0 (1 (3 4)) 5)))))))
4
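A side note of my own (not from the original handout): once you find yourself writing the same 'first'/'rest' chains over and over, you can name the common combinations once and reuse them:

> (define (second lst) (first (rest lst)))
> (define (third lst) (first (rest (rest lst))))
> (second '(a b c d))
b
> (third '(a b c d))
c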
1. Given the following list structure:

> (define nums '(1 2 (3 4) 5 6))

What do the following expressions return?

> (first (rest (rest nums)))
> (rest (first nums))

2. Now use 'first' and 'rest' (or 'car' and 'cdr') to return each of the components 2, (3 4), 5 and 4, given 'nums' as defined above. Your solutions should look like (first (rest (rest ... '(1 2 (3 4) 5 6)))) with a different sequence of 'first' and 'rest' invocations.

Predicates for Testing and Comparing Primitive Types

Each of the primitive data types has a corresponding predicate for testing membership. A predicate is just a function that returns true or false. In Scheme, '#f' means 'false' and '#t' means 'true'. The '?' is often tacked on the end of predicates to identify them as such, but the '?' is just a convention - it's not required when defining your own predicates.

> (number? 1)
#t
> (string? "string")
#t
> (list? '(1 2 3))
#t
> (list? "string")
#f
> (symbol? 'a)
#t

You can also check for special cases:

> (zero? 0)
#t
> (null? '())
#t

There are also variants of equality and inequality for every primitive type (substitute '>', '=', '<=', '>=' for '<' in the following to get the other variants).

> (< 1 2)
#t
> (string<? "a" "b")
#t

The 'string<?' predicate is a single symbol consisting of eight characters; we might have called it 'string-less-than?' instead of 'string<?'. 'string<?' is the string analog of '<' for numbers. The '<' predicate is an example of a predicate that doesn't have the '?' tacked on the end. Note that 'string=?', 'string<?', 'string>?', 'string<=?' and 'string>=?' all use lexicographic order for making comparisons whereas '=', '<', '>', '<=', '>=' use numerical order. In Scheme '=' means numerical equality in contrast to the C-shell where '=' is used for assignment (in 'set' and '@'). There is also an equality predicate defined on more general types (note that for symbols, lowercase/uppercase doesn't matter - case does matter for strings however - Scheme can be configured so that case does matter but the default is for it to be insensitive to case):

> (equal? 'a 'A)
#t
> (equal? 'a "a")
#f
> (equal? 1 (first (rest '(0 1 2 3))))
#t
> (equal? "a" "A")
#f
> (equal? 1 (- 2 1))
#t

Building Boolean Expressions Using Boolean Operators

We can combine predicate invocations with the boolean operators: 'and', 'or', 'not'. The first, the so-called conjunctive operator 'and', takes any number of arguments including zero; it returns true #t if all of its arguments evaluate to #t and false #f otherwise:

> (and #t #t)
#t
> (and #t #t #f)
#f
> (and)
#t

We typically use boolean operators with predicates, e.g.,

> (and (list? arg) (not (null? (rest arg))) (number? (first arg)))

returns #t if arg is a list, has at least one element, and its first element is a number. The disjunctive operator 'or' works similarly; it returns #t if at least one of its arguments evaluates to #t and false #f otherwise:

> (or #f #t #t)
#t
> (or #f)
#f
> (or)
#f

Again, the arguments typically involve predicates or nested boolean expressions, e.g.,

> (or (and (number? arg) (> arg 0)) (symbol? arg))

returns #t if arg is a number greater than zero or a symbol. See if you can explain why '(or) => #f' and '(and) => #t'. The negation operator 'not' accepts exactly one argument and returns #t if its only argument evaluates to #f and returns #f if its only argument evaluates to #t.

> (not #t)
#f
> (not #f)
#t
> (not (and (number? 1) (or (string? "") (null? '(1 2 3)))))
#f
Note that functions defined on strings, numbers, non-empty lists, etc., will complain if you call them with the wrong type of inputs:

> (rest '())
cdr: expects argument of type <pair>; given ()
> (string=? 1 "one")
string=?: expects type <string> as 1st argument, given: 1; other arguments were: "one"

For this reason, it is often important to check the type of arguments before operating on them. Note that in evaluating an 'and', the Scheme interpreter won't evaluate the second conjunct if the first conjunct returns #f; we can use this fact to check arguments and avoid errors of the sort illustrated above.

> (define lst '())
> (and (not (null? lst)) (not (null? (rest lst))) (= 1 (first (rest lst))))
#f

3. Write a boolean expression for the body (the '...') of the following function:

(define (special-list-type? arg) ... )

such that it returns #t for the following type of data structure:

(list number number (list string) symbol)

and #f otherwise; so, for example:

> (special-list-type? '(1 2 ("foo") a))
#t
> (special-list-type? (list 1 4 (list "foo") 'a))
#t
> (special-list-type? (list 1 7 '("foo" "bar") 'a))
#f

To help you out, here's a solution for a somewhat simpler example: (list symbol number (list))

(define (my-special-list-type? arg)
  (and (list? arg)
       (not (null? arg))
       (symbol? (first arg))
       (not (null? (rest arg)))
       (number? (first (rest arg)))
       (not (null? (rest (rest arg))))
       (null? (first (rest (rest arg))))))

Note that the first two conjuncts (list? arg) and (not (null? arg)) have to be there to make sure that the third conjunct (symbol? (first arg)) won't cause an error on an input like '(). The fourth conjunct (not (null? (rest arg))) has to be there to make sure that the fifth conjunct (number? (first (rest arg))) won't cause an error on an input like '(1). Similarly, the sixth conjunct prevents the seventh conjunct from causing an error. Here are some examples showing my-special-list-type? in action:

> (my-special-list-type? '(a 1 ()))
#t
> (my-special-list-type? (list 'foo 1.2 (list)))
#t
> (my-special-list-type? (list 'foo 1.2))
#f
> (my-special-list-type? 3.14)
#f

Working With Conditional Statements

Scheme conditionals are simpler and easier to use than conditionals in the C-shell. The 'else' clause is optional; however, if the condition of a conditional statement lacking an 'else' clause evaluates to '#f' then the result of the statement is undefined.

> (if (symbol? 'a) 1 2)
1
> (if (null? '(1 2)) 1)
> (if (and (number? 1) (string? "a")) 1 2)
1

There is also a generalized conditional statement 'cond' which works like nested if-then-else statements and is very convenient for all sorts of complicated tests. Here's the general format:

(cond ( test_1 expression_1 )
      ( test_2 expression_2 )
      ( test_3 expression_3 )
      ( test_4 expression_4 )
      ( else expression_5 ))
=>
(if test_1
    expression_1
    (if test_2
        expression_2
        (if test_3
            expression_3
            (if test_4
                expression_4
                (if #t expression_5)))))

Each ( test_n expression_n ) is referred to as a 'clause'. Note that the first test to evaluate to #t will cause its corresponding expression to be evaluated. The keyword 'else' evaluates to '#t' in the context of a conditional statement. It should only be used as the test for the last clause - usually referred to as the 'default' clause. If no test evaluates to #t then the result of the 'cond' statement is undefined.
> (define exp 1.2)
> (cond ((list? exp) (printf "It's a list!~%"))
        ((symbol? exp) (printf "It's a symbol!~%"))
        ((number? exp) (printf "It's a number!~%"))
        (else (printf "I haven't the foggiest!~%")))
It's a number!

Note that 'printf' behaves pretty much like the C-shell 'printf' except that we use '~%' for '\n' and '~a' to interpolate a variable into the format string; so instead of

printf "the $last number is $num\n"

you would write

(printf "the ~a number is ~a~%" last num)

For each appearance of '~a' in the format string there must be one additional argument to 'printf'.

> (define one 1)
> (define two 2)
> (define three 3)
> (printf "~a is less than ~a is less than ~a~%" one two three)
1 is less than 2 is less than 3
> (printf "~a is less than ~a~%" one)
printf: format string requires 2 arguments, given 1; arguments were: "~a is less than ~a~%" 1

4. Now rewrite your solution to Exercise 3 using 'cond' so that it prints out useful informative statements about whether or not the input conforms to the expected format:

> (special-list-type-cond? '(1 2 ("foo") a))
Looks good to me.
> (special-list-type-cond? (list 1 "" (list "foo") 'a))
I was hoping to see a number as the second element of your list.
> (special-list-type-cond? (list 1 7 '("foo" "bar") 'a))
Expecting a list consisting of a single item but got (foo bar).

A Word About Computer Literal-Mindedness

I thought it might be useful to reflect on the differences between learning to program (where I am assuming that this involves actually writing programs and trying to make them run on a computer) and learning about some branch of history or philosophy or for that matter learning mathematics. The computer is a stern task master; it allows no lapse in precision. Some of the requirements for precision may seem picayune - why should it matter if you leave out a semicolon! But the computer isn't smart enough to resolve ambiguities on its own; nor would we want it to in many cases. Most programmers learn to like the fact that the computer calls us to task when we are less than perfectly precise. In discussing philosophy with a professor, usually the professor gives you, the student, the benefit of the doubt, and you probably give the professor more credit than he or she deserves. It's easy to be lax in talking with another human. Even in learning, say, calculus, you can make mistakes on a test and often get partial credit; the human grader fills in what you meant. But late at night trying to get your programming assignment finished, the computer is obstinate and unforgiving. Somewhere in your code there is a missing semicolon but you can't find it and every time you try to run your code, it crashes spitting out cryptic comments. We're not used to such unflagging and obstinate adherence to rigid convention and we find it annoying - an affront to our intelligence. But it is worth keeping in mind that such precision is a boon to a skillful programmer. It assures that the machine will carry out his or her commands with perfect precision and repeatability. It's quite possible to write a program that doesn't do what the programmer intends, but at least we are assured that the computer will execute a piece of code the same today as tomorrow, the same whether you wrote it or someone else wrote it and the same whether it is run on one kind of machine or another.
Calling and Creating Functions

We're going to experiment with the DrScheme Stepper today. The Stepper only works in the Student Languages Beginner through Intermediate; so go to the Languages menu and set the Language to 'Intermediate Student'. Now type the following lines to the upper pane of a DrScheme window.

(define (square num) (* num num))
(define (area radius) (* pi (square radius)))
(area 7)

The constant 'pi' is already defined in Intermediate Student. Now save the file as 'circle.scm' and click on the 'Step' button. If you repeatedly click on the 'Step >' button in the stepper pane, you'll see the steps that Scheme takes in evaluating the expression '(area 7)'. In the stepper you'll see how Scheme replaces the invocation of a function, e.g., (area 7), with the definition of the function, e.g., (* pi (square radius)), using the calling pattern, (area radius), specified in the definition. The values of the function arguments are then substituted for the formal parameters in the function definition.

1. Write a function that computes the average of four test scores.

(define (average score_1 score_2 score_3 score_4) ... )

> (average 90 86 97 75)
87

2. Write a function 'distance' that computes the Euclidean distance between two points in the plane. If you recall from high school geometry, the distance between two points in the plane is the square root of the sum of the squared differences of the x and y components, i.e., for points p1 = (x1 y1) and p2 = (x2 y2) the distance from p1 to p2 is the sqrt of ( (x1 - x2)^2 + (y1 - y2)^2 ). Now translate this expression into Scheme:

(define (distance x1 y1 x2 y2) ... )

> (distance 0 1 1 0)
1.4142135623730951

Use the 'square' function above and the built-in 'sqrt' function. Then use the stepper to step through the execution of an example invocation of 'distance'.

3. In this exercise, you'll define and use a 'point' data structure:

(define (make-point x y) (list x y))
(define (x-coord pt) (car pt))
(define (y-coord pt) (car (cdr pt)))

Rewrite the 'distance' function to work with the 'point' data structure:

(define (distance pt1 pt2) ... )

Create a couple of points:

(define point_1 (make-point 1 0))
(define point_2 (make-point 0 1))

And try out your function (distance pt1 pt2) with the stepper.

> (distance point_1 point_2)
1.4142135623730951

Now type in the factorial function and save it in 'factorial.scm':

(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))

(factorial 3)

Again watch what happens in the stepper. Now experiment with the 'last' function. 'last' only works if it is given a list consisting of at least one element. So we first check to see if 'last' is given appropriate input.

(define (last arg)
  (if (and (list? arg) (not (null? arg)))
      (aux-last arg)
      "expecting a non-empty list"))

We then define an auxiliary function that does all the real work.

(define (aux-last lst)
  (if (null? (rest lst))
      (first lst)
      (aux-last (rest lst))))

> (last '(1 2 3))
3

4. Write a function 'count' that counts the number of elements in a list. Use two functions 'count' and 'aux-count' as we did for 'last'. Your 'aux-count' should take a second argument that keeps track of how many elements have been encountered so far.

> (count '(1 2 3 4 5))
5
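To connect this back to the Stepper's substitution view, here is a hand sketch of my own (not from the handout; the Stepper's on-screen rendering differs in its details) of how the evaluation of (last '(1 2 3)) unfolds:

;; (last '(1 2 3))
;; => (aux-last '(1 2 3))   ; the argument is a non-empty list
;; => (aux-last '(2 3))     ; (rest '(1 2 3)) is not empty, so recur
;; => (aux-last '(3))       ; (rest '(2 3)) is not empty, so recur
;; => (first '(3))          ; (rest '(3)) is empty, so take the first element
;; => 3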
Scheme Resources:

Basic Introduction to Scheme - ~tld/tmp/exercises/doc/scheme.txt
This introduction is tailored to the material we'll cover in the exercises and supplements the descriptions in the book.

Learn to use the DrScheme Help Desk. You can search for information on any command or any concept.

More on List Processing

Recall that the 'list' function takes any number of arguments and constructs a list whose elements correspond to the values of its arguments:

> (list 1 2 3)
(1 2 3)

There's also a more primitive list constructor called 'cons' that constructs lists in exactly the same way that 'car' ('first') and 'cdr' ('rest') take them apart.

> (define lst (cons 0 '(1 2 3)))
> lst
(0 1 2 3)
> (car lst)
0
> (cdr lst)
(1 2 3)
> (cons 0 (cons 1 (cons 2 (cons 3 '()))))
(0 1 2 3)
> (car (cdr (cdr (cdr '(0 1 2 3)))))
3

1. Create the list structure: (1 2 (3 (4) 5) 6) using just 'cons', the numbers 1, 2, 3, 4, 5 and 6 and the empty list '().

The 'append' function takes zero or more arguments that must evaluate to lists and combines them into one big list:

> (append '(0 1 2) '(3 4 5) '(6))
(0 1 2 3 4 5 6)

2. What happens when you call 'append' with no arguments? Note that it is possible to define functions in Scheme (like 'append', 'list' and '+') that take a variable number of arguments in a manner similar to the way shell commands use 'argv' to grab all the words on the command line. However, in many cases where you might think you want a variable number of arguments what you really want is just a single argument which is a list. The 'cons' and 'append' functions will come in handy a little later when we want to build lists from various pieces under the control of a program.

Destructive Modification of Lists

The 'define' command is used to define the value of symbols. When you want to redefine a symbol you can use 'define' to do so, but once a symbol has been initialized to a particular value, subsequent changes are typically made using another command called 'set!'. The '!' is often added to commands that are said to 'destructively modify' their arguments. The 'set!' command is like 'set' in the C-shell except instead of writing 'set var = expression' you write '(set! var expression)'. For our purposes, the first argument to 'set!' is expected to be a symbol but, unlike symbols appearing as arguments to other functions, this first argument is not evaluated; 'set!' works like 'define' does in setting a symbol to have a particular value.

> (define counter 0)
> counter
0
> (set! counter (+ counter 1))
> counter
1
> (set! counter (+ counter 1))
> counter
2

There are also commands that allow you to destructively alter the structure of lists.

> (define nums (list 1 2 3))
> (set-car! nums 7)
> nums
(7 2 3)
> (set-cdr! (cdr (cdr nums)) (list 4 5 6))
> nums
(7 2 3 4 5 6)

3. Explain the behavior of the following sequence of commands:

> (define lst_1 (list 1 2 3))
> (define lst_2 (cons 0 lst_1))
> lst_1
(1 2 3)
> lst_2
(0 1 2 3)
> (set-car! (cdr lst_1) "two")
> lst_1
(1 "two" 3)
> lst_2
(0 1 "two" 3)

Iteration and Recursion

Scheme offers a variety of methods for implementing iterative algorithms. We're going to begin by emphasizing recursion, but later we'll provide iterative constructs that work like 'foreach' and 'while'. Last week we implemented a program that extracted the last element of a list:

(define (last lst)
  (if (null? (cdr lst))
      (car lst)
      (last (cdr lst))))
Let's morph 'last' into a program that uses 'cons' to copy a list:

(define (copy lst)
  (if (null? lst)
      lst
      (cons (car lst) (copy (cdr lst)))))

4. Now write a recursive function 'numbers' based on 'last' that takes a single argument corresponding to an integer 'n' and returns a list of the numbers 1 through n.

> (numbers 5)
(1 2 3 4 5)

5. Write a recursive function 'reverse' that takes a list and uses 'append' to produce a new list in which the elements appear in reverse order.

> (reverse '(1 2 3))
(3 2 1)

6. Write another version of 'reverse' that uses 'cons' instead of 'append'. In this case, define 'reverse' with two arguments: an input and an output variable:

> (define (reverse input output) ... )
> (reverse '(1 2 3) '())
(3 2 1)

7. The 'assoc' function allows us to use so-called association lists. An association list is a list of pairs, each pair consisting of a key and an item associated with that key. For example, the following association list maps the names of fields to their offsets in a list representing a database record.

> (define employee_fields '((id 1) (first 2) (last 3) (address 4)))

We can use 'assoc' to find the offset for a field by specifying the field name as the key.

> (assoc 'last employee_fields)
(last 3)
> (car (cdr (assoc 'last employee_fields)))
3

If the key cannot be found, 'assoc' returns false #f.

> (assoc 'middle employee_fields)
#f

Write a 'my_assoc' function that duplicates the functionality of 'assoc'. You'll need some way of comparing keys to write this function and the 'equal?' predicate does the trick:

> (equal? 'first (car (car (cdr employee_fields))))
#t

Global and Local Variables

When you define a variable at the top level (i.e., not inside the body of a function), that variable is said to have 'global scope'; you can reference it anywhere including inside the body of other functions as long as they don't have a formal parameter of the same name, e.g.,

> (define num 2.2)
> (define (scale base) (* num base))
> (define (square num) (* num num))

The function 'scale' has access to the global variable 'num' while 'square' has a formal parameter of the same name that masks the global 'num'. Within the 'square' function 'num' refers to the local variable introduced as its single formal parameter. There are times however when you want to introduce a 'local' variable (besides a formal parameter) to keep track of some information in the midst of carrying out a computation. This is easily accomplished by introducing variables by way of 'let'.

> (define num 1)
> (printf "num = ~a~%" num)
num = 1
> (let ((num 0))
    (printf "num = ~a~%" num)
    (set! num 17)
    (printf "num = ~a~%" num))
num = 0
num = 17
> (printf "num = ~a~%" num)
num = 1

Inside the body of the 'let' the local variable 'num' masks the 'num' introduced by the 'define'. Moreover, changes to the local 'num' do not affect the global value of 'num'. The scope of 'num' is said to be 'lexical' or 'static' in the sense that the scope of the local 'num' is restricted to the code within the body of the 'let', i.e., (let ((num 0)) ... body ... ); what happens outside of the body of the let doesn't affect the local value of 'num'.
If we call a function within the body of the 'let', the definition of that function is outside the scope of the 'let' and, unless it has a formal parameter with the same name or introduces a local variable with the same name using 'let', changes made to the variable in the body of the function will be reflected in the global variable:

> (define num 1)
> (printf "num = ~a~%" num)
num = 1
> (define (test arg) (set! num arg))
> (let ((num 0))
    (test 43)
    (printf "num = ~a~%" num)
    (set! num 17)
    (printf "num = ~a~%" num))
num = 0
num = 17
> (printf "num = ~a~%" num)
num = 43

Here's an example of a procedure that is difficult to write without the use of a local variable introduced using 'let'.

> (define pair (list 1 2))
> (define (swap pair)
    (let ((tmp (car pair)))
      (set-car! pair (car (cdr pair)))
      (set-car! (cdr pair) tmp)))
> pair
(1 2)
> (swap pair)
> pair
(2 1)
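For contrast, here is a variant of my own (not from the handout) that is non-destructive: instead of modifying the pair in place with 'set-car!', it builds and returns a new two-element list, leaving the original untouched.

> (define (swapped pair) (list (car (cdr pair)) (car pair)))
> (define p (list 1 2))
> (swapped p)
(2 1)
> p
(1 2)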
Functional Programming

Functional programming is a style of programming that often makes for very simple (and elegant) solutions to complicated problems. Suppose that you have three columns of numbers

> (define uno '(1 2 3))
> (define dos '(4 5 6))
> (define tres '(7 8 9))

and you want to add the rows, i.e., the sum of the first number in each column, the sum of the second number in each column, etc. Here's how you do this using a 'map' operator:

> (map + uno dos tres)
(12 15 18)

The 'map' operator takes a function (in this case '+') and one or more lists that will serve as the arguments to the function. 'map' takes each 'cross section' of the lists, e.g., the first cross section in the example above would consist of the numbers 1, 4 and 7, and then applies the function to each cross section; it then returns a list of the results. This next example turns rows into columns:

> (map list uno dos tres)
((1 4 7) (2 5 8) (3 6 9))

If we don't have a function that does exactly what we want we can define an anonymous function using 'lambda'. When you specify a function definition using 'define' the next two expressions produce identical results:

(define (function_name arg_1 arg_2 ... arg_n) body )
(define function_name (lambda (arg_1 arg_2 ... arg_n) body ))

However, in some cases, you want to define a special-purpose function that will only be used in one place in your program. Defining single-use functions is useful in functional programming. Here's how you would cube a list of numbers using 'map' and 'lambda':

> (map (lambda (x) (* x x x)) '(1 2 3 4 5 6))
(1 8 27 64 125 216)

1. As a somewhat more practical example, suppose that you have three columns of numbers where the first column lists the prices of products that you are contemplating purchasing, the second column corresponds to the associated shipping costs, and the third the relevant state sales taxes

> (define prices '(61 17 31))
> (define shipping '(7 3 5))
> (define taxes '(0.5 0.4 0.6))

and you want to compute the total cost as the sum of price, shipping costs and price times tax rate. Use 'map' and 'lambda' to define a function 'total_cost' that performs this calculation:

> (total_cost prices shipping taxes)
(98.50 26.80 54.60)

The 'apply' function takes a function and a list of items that can serve as arguments to the function:

> (apply + '(1 2 3 4))
10

Suppose you have a list of numbers, say

> (define costs '(14 19 41 26))

corresponding to the prices of the parts for a product you're thinking of manufacturing and you want to know the percentage that each part contributes to the total cost (this is also referred to as normalizing a set of numbers):

> (map (lambda (cost) (/ cost 1.0 (apply + costs))) costs)
(0.14 0.19 0.41 0.26)

2. In this exercise, you'll implement some of the basic functions used in linear algebra, a branch of mathematics important in computer graphics. Use 'map' and 'apply' to compute the 'dot' or 'inner' product of two vectors represented as lists. The inner product is notated: u.v = ((u_1 * v_1) + ... + (u_n * v_n)) where u_i (v_i) is the ith component of the vector u (v):

> (define u '(1 7 3))
> (define v '(3 4 6))
> (inner u v)
49

Now implement the 'magnitude' of a vector which is defined as the square root of the sum of squares of the vector components. The magnitude is notated: |u| = sqrt ((u_1 * u_1) + ... + (u_n * u_n))
> (magnitude u)
7.681145747868608

Finally, implement a function that computes the cosine of the angle between two vectors calculated as the dot product of the vectors divided by the product of their magnitudes: u.v / ( |u| * |v| )

> (cos-angle u v)
0.8167801162284652

Iteration and Loops in Scheme

I certainly don't want to force you to implement iterative programs using recursion if you find it difficult or somehow 'unnatural'. Some people find recursion more intuitive than 'foreach' or 'while' loops. Perhaps it is just a matter of taste. Scheme allows a variety of iterative techniques besides recursion or the sort of iteration implicit in the mapping functions common in functional programming. One such very general iterative construct is the 'do' loop whose general form is:

(do ( variable-initial-step-triples )
    ( loop-exit-condition terminal-expressions )
  body )

where variable-initial-step-triples take the form ( variable initial-value step-value ) and the loop-exit-condition is a test which is evaluated on each iteration such that if the test returns true (#t) then the loop is terminated and the terminal-expressions are evaluated; the loop returns the value of the last terminal expression. It's easier to understand if you just look at a simple example. Here's an iterative version of factorial:

> (define (factorial n)
    (do ((i n (- i 1))
         (r 1 (* r i)))
        ((= i 1) r)))
> (factorial 4)
24

Note that 'factorial' has an empty 'body'. The next example - a test for primality - has one expression in its body:

> (define (prime? n)
    (do ((factor (if (even? n) (/ n 2) (/ (- n 1) 2)) (- factor 1))
         (flag #t))
        ((or (= factor 1) (not flag)) flag)
      (if (= 0 (remainder n factor)) (set! flag #f))))
> (prime? 18)
#f
> (prime? 37)
#t

3. Write a function 'quality_check' using a 'do' loop that takes three arguments: a list of numbers corresponding to measurements taken of some critical parameter of a manufactured product, a second number corresponding to the optimal value for the measurement and a third number corresponding to the maximum deviation (+/-) from the optimal value that will be tolerated in a product to be shipped. Your 'quality_check' function should return the empty list if all of the measurements are within tolerance and otherwise return a list of all those measurements that were found to exceed the maximum allowed deviation.

> (quality_check '(4.13 4.07 3.99 3.76) 4.00 0.10)
(4.13 3.76)

Defining Your Own Syntax

If you're really pining for your old 'foreach' and 'while' loops we could simply map the 'foreach' syntax onto the 'do' loop syntax. Here's what the mapping would look like for the 'foreach' loop:

;; foreach var ( arg )          (do ((lst arg (cdr lst)) (var #f))
;;   command_1                      ((null? lst))
;;   ...                  =>        (set! var (car lst))
;;   command_n                      expression_1 ... expression_n
;; end                          )

To get Scheme to do the mapping for you, you would use the functions 'define-syntax' and 'syntax-rules' to define a 'macro' so that expressions beginning with 'foreach' would expand into the 'do' format shown above. Here's how you'd define the 'foreach' macro:

(define-syntax foreach
  (syntax-rules ()
    ((foreach var ( arg ) expr ... )
     (do ((_lst_ arg (cdr _lst_)) (var #f))
         ((null? _lst_))
       (set! var (car _lst_))
       expr ... ))))

Now we can write programs with 'foreach' loops pretty much as we would in C-shell scripts; there might be a few extra parens, but presumably that's a small price to pay if you really miss your 'foreach' loops:

> (foreach num ( '(1 2 3 4) )
    (set! num (* num num num))
    (printf "~a ~%" num))
1 
8 
27 
64 
We could do the same sort of thing for 'while' loops:

;; while ( test )        (do ()
;;   command_1               ((not test))
;;   command_2      =>       expression_1
;;   ...                     expression_2
;; end                       ...
;;                           expression_n )

Here's how we could get Scheme to perform the mapping for us.

(define-syntax while
  (syntax-rules ()
    ((while ( test ) expr ... )
     (do ()
         ((not test))
       expr ... ))))

> (define num 4)
> (while ( (> num 0) )
    (printf "~a ~%" num)
    (set! num (- num 1)))
4 
3 
2 
1 

We could use 'define-syntax' and 'syntax-rules' to define a whole new language if we like. Perhaps you'd like to design your own idiosyncratic language; then you wouldn't have to put up with conforming to someone else's idea of how to write code.

4. See if you can figure out from these examples how to use 'define-syntax' and 'syntax-rules' to define a control structure 'repeat' that mirrors the unix command of the same name.

% repeat 64 printf "+"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> (repeat 64 (printf "+"))
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
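In the same spirit, here is one more small macro of my own (a hypothetical example, not part of the exercises) to practice reading 'syntax-rules'; it runs its body only when the test is false. I've called it 'when-not' rather than 'unless' since, as far as I know, MzScheme already provides an 'unless'.

(define-syntax when-not
  (syntax-rules ()
    ((when-not ( test ) expr ... )
     (if (not test) (begin expr ... ) #f))))

> (define num 0)
> (when-not ( (> num 0) ) (printf "num is not positive~%"))
num is not positive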
Today you'll get a taste for an area of computer science called the analysis of algorithms. Analysis of algorithms is considered a branch of theoretical computer science and deals with algorithms and data structures as mathematical objects, using formal logic, graph theory, combinatorics, statistics, probability and other areas of discrete and continuous mathematics for purposes of analysis.

Association Lists and Hash Tables for Implementing Dictionaries

We want to associate words referred to as 'keys' with items in a 'dictionary'. The items need not be definitions as in the case of a conventional dictionary; in general, items can be any data objects that we want to associate with the keys. The dictionary itself is implemented as a data structure and typically supports several operations, e.g., insertion, deletion, and extraction or 'lookup'. The implementation of these operations depends on the implementation of the data structure. In this exercise, we'll look at implementing dictionaries using association lists and hash tables.

Recall that an association list is a list of pairs whose first element is a key and whose second element is the item that we wish to associate with the key. For example, the following association list associates the symbols corresponding to the three-letter abbreviations for the months of the year with the strings corresponding to their two-digit integer representation:

> (define months
    '((Jan "01") (Feb "02") (Mar "03") (Apr "04")
      (May "05") (Jun "06") (Jul "07") (Aug "08")
      (Sep "09") (Oct "10") (Nov "11") (Dec "12")))

We can then use 'assoc' to look up the two-digit representation when given the three-letter symbol as a key:

> (printf "Today is 2004-~a-08.~%" (car (cdr (assoc 'Nov months))))
Today is 2004-11-08.

Here's a recursive definition for 'assoc' implementing a 'linear search' algorithm:

(define (assoc key alst)
  (cond ((null? alst) #f)
        ((equal? key (car (car alst))) (car alst))
        (else (assoc key (cdr alst)))))

How long will 'assoc' take to associate a key with a given pair? What do we mean by 'how long'? And, once we answer that, are we really interested in the answer for a particular key? How about a randomly selected key? Regarding 'how long', we probably don't mean elapsed time on the clock; the elapsed time will vary depending on what sort of computer we're using, what operating system we're running, what programming language and compiler, etc. Typically in analyzing algorithms we focus on the most expensive operations and how often they are performed. In the case of 'assoc' the most expensive operation involves comparing the key with the first item of a pair in the association list using the 'equal?' function. Clearly if we choose a key not in the association list or choose a key corresponding to the last pair in the list, we'll have to check every pair in the list. Let's say that there are n pairs in the list; in the case in which the key is not present, a failed lookup will require n comparisons, i.e., n applications of the 'equal?' predicate. It doesn't matter if we implement 'assoc' as a recursive procedure or a 'while' loop; in either case we'll have to perform n comparisons. For this reason, we say that dictionary lookup using linear search has a worst-case performance of 'order n', notated O(n) in the 'big-oh' notation common in theoretical computer science. What about the case in which there is a pair (not necessarily the last) in the association list with a first element matching the key?
In this case, it will depend on where the pair is located in the list. For the purpose of analysis it will help if we take a statistical perspective. Let's say that we choose a key whose location in the list is chosen uniformly at random from the numbers 1 through n. This implies that if we choose many keys selected in this random fashion, on average, the number of comparisons will be n/2, which is still O(n) since 'big-oh' notation ignores constant factors (the 1/2 in this case). Rather than prove this to you, I'll try to convince you experimentally. The following procedure draws m samples uniformly from the integers 0 through n - 1. The invocation (random n) selects a random integer between 0 (inclusive) and n (exclusive).

> (define (sample-mean n m)
    (do ((i 0 (+ i 1))
         (s 0 (+ s (random n))))
        ((= i m) (/ s m 1.0))))

The law of large numbers tells us that if we draw a large enough sample, the sample mean (the average of the m samples) should be close to n/2 with high probability (more precisely, close to (n-1)/2, since the samples range from 0 to n-1). Go ahead and try it.

> (sample-mean 100 100)
55.24
> (sample-mean 100 100)
49.42
> (sample-mean 100 1000)
49.556
> (sample-mean 100 1000)
49.207
> (sample-mean 100 10000)
49.8387
> (sample-mean 100 10000)
49.57

So, if we choose our keys uniformly at random and we choose enough of them, lookup using linear search should require about n/2 comparisons on average with high probability. Since the number of comparisons on average is n/2, we say that dictionary lookup based on linear search has an expected performance (measured in terms of comparisons) linear in n.

An obvious question is 'Can we do any better?' And the answer is 'yes'. Suppose that we could take each key and map it into a unique location in memory. How might we do this? Suppose that we set aside 2^m (that's 2 raised to the m-th power) locations in memory for our dictionary. Now let's take an arbitrary key such as 'aardvark'. In memory, the string of characters in 'aardvark' corresponds to a sequence of binary bits, 1s and 0s. For example, the characters in 'aardvark' might be represented in ASCII (ASCII stands for 'American standard code for information interchange') code, one of several standards for encoding characters. ASCII maps the binary numbers 0-255 to 256 characters. The letter 'a' in ASCII is represented by the decimal number 97 and the binary number 1100001 (where the most significant bit is the leftmost). The rest of the letters in 'aardvark' have the following ASCII binary representations: d = 1100100, v = 1110110, r = 1110010, and k = 1101011; hence, in computer memory, 'aardvark' might look like:

aardvark = 11000011100001111001011001001110110110000111100101101011

We could take the first m bits, or the last m bits, or every other bit from the first to the 2m'th bit in the binary representation of 'aardvark' to construct a number between 0 and 2^m - 1, and this number could correspond to the location in memory where we would store the item associated with 'aardvark'. Now suppose we want to look up the item associated with 'aardvark'. All we have to do is use our procedure for selecting m bits from the binary representation of 'aardvark' and then jump to the corresponding location in memory to retrieve the associated item stored there. This operation takes a constant amount of time, i.e., it is independent of the number of items stored in the dictionary. The method of converting a key into a location in memory is called 'hashing' and the data structure used to store the items associated with keys is called a 'hash table'.
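To make the idea concrete, here's a minimal sketch of a (deliberately naive) hash function in Scheme; this is my own illustration, not the method MzScheme actually uses. It sums the ASCII codes of the characters in a key and takes the remainder modulo the table size:

;; A toy hash function: sum the character codes in the key and take
;; the remainder modulo the table size. Real hash functions are
;; chosen more carefully so that keys spread evenly over the table.
(define (hash key size)
  (do ((chars (string->list (symbol->string key)) (cdr chars))
       (sum 0 (+ sum (char->integer (car chars)))))
      ((null? chars) (remainder sum size))))

> (hash 'aardvark 1024)
844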
It appears that operations based on hash tables are much more efficient than those based on association lists. But there is a potential problem that we have to address. What if two keys are mapped to the same location in memory? This is called a 'collision' and there are a number of ways to deal with collisions. One approach is simply to store all items that map to the same memory location in an association list. If we map a key to a location in memory and we find a list, then all we have to do is compare the key against those pairs that got mapped to this location - hopefully a much smaller number of pairs than the total number of keys. In the worst case, if we choose a lousy method for hashing and every key gets mapped to the same location in memory, this approach takes n comparisons, the same as linear search. But in the best case, there are no collisions and lookups take a constant amount of time, i.e., the time required to map a key to a location in memory. Hashing has O(n) worst-case performance and O(1) (i.e., constant) expected time for an optimal hashing strategy.

There are lots of clever ways of hashing and implementing hash tables, but to determine if they work for your application, you often have to run experiments. This is called 'profiling' or 'benchmarking' and we'll give you an example of how we might do this for comparing the MzScheme implementations of 'assoc' and 'hash-table'. The file ./bin/benchmark.scm contains code that builds an association list or hash table of size m and then retrieves m random keys from the corresponding data structure. We'll try each data structure for m = 100, 1,000, 10,000 and 100,000, starting with the association list implementation, and use the 'time' function to keep track of different measures of performance in milliseconds:

> (time-assoc 100)
Inserting 100 items:
cpu time: 0 real time: 1 gc time: 0
Performing 100 lookups:
cpu time: 0 real time: 1 gc time: 0
> (time-assoc 1000)
Inserting 1,000 items:
cpu time: 0 real time: 3 gc time: 0
Performing 1,000 lookups:
cpu time: 120 real time: 195 gc time: 0
> (time-assoc 10000)
Inserting 10,000 items:
cpu time: 30 real time: 66 gc time: 0
Performing 10,000 lookups:
cpu time: 18,060 real time: 19,052 gc time: 0
> (time-assoc 100000)
Inserting 100,000 items:
cpu time: 520 real time: 661 gc time: 210
Performing 100,000 lookups:
cpu time: 2,123,480 real time: 2,847,301 gc time: 0

It appears that the time per insertion doesn't depend on the size of the association list: inserting 10 times the number of items takes about 10 times as long. But the lookups are somewhat disturbing. Performing 10 times the number of lookups takes about 10^2 times as long; the longer the list, the longer it takes to access items.
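In case you're curious, here's a minimal sketch of what a procedure like 'time-assoc' might look like; the actual code in ./bin/benchmark.scm may well differ. 'time' is MzScheme's built-in timing form, which prints the cpu, real and gc times for evaluating its expression:

;; Build an association list of m (key item) pairs with integer keys,
;; then look up m randomly chosen keys.
(define (time-assoc m)
  (let ((alst '()))
    (printf "Inserting ~a items:~%" m)
    (time (do ((i 0 (+ i 1)))
              ((= i m))
            (set! alst (cons (list i i) alst))))
    (printf "Performing ~a lookups:~%" m)
    (time (do ((i 0 (+ i 1)))
              ((= i m))
            (assoc (random m) alst)))))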
Now let's take a look at the hash table implementation:

> (time-hash 100)
Inserting 100 items:
cpu time: 0 real time: 0 gc time: 0
Performing 100 lookups:
cpu time: 0 real time: 1 gc time: 0
> (time-hash 1000)
Inserting 1,000 items:
cpu time: 10 real time: 4 gc time: 0
Performing 1,000 lookups:
cpu time: 0 real time: 3 gc time: 0
> (time-hash 10000)
Inserting 10,000 items:
cpu time: 60 real time: 90 gc time: 20
Performing 10,000 lookups:
cpu time: 40 real time: 49 gc time: 0
> (time-hash 100000)
Inserting 100,000 items:
cpu time: 470 real time: 534 gc time: 110
Performing 100,000 lookups:
cpu time: 400 real time: 447 gc time: 0
> (time-hash 1000000)
Inserting 1,000,000 items:
cpu time: 4,000 real time: 4,276 gc time: 370
Performing 1,000,000 lookups:
cpu time: 4,500 real time: 4,969 gc time: 0

Note that in this case an increase of 10 in the number of lookups results in only a 10-fold increase in the amount of time. Hash tables, even with the potential for collisions, seem to scale much more reasonably than association lists. Of course we used very simple keys, and we might encounter a very different situation if we were using keys corresponding to, say, words from a text document. To get a better idea of how well hash tables work, in the following experiments we take random samples of the words in /usr/share/dict/words and try the hash table implementation once again, duplicating some of the larger experiments to get some idea of the variance due to sampling.

> (time-hash-words 100)
Inserting 100 items:
cpu time: 0 real time: 0 gc time: 0
Performing 100 lookups:
cpu time: 0 real time: 1 gc time: 0
> (time-hash-words 1000)
Inserting 1,000 items:
cpu time: 0 real time: 1 gc time: 0
Performing 1,000 lookups:
cpu time: 10 real time: 7 gc time: 0
> (time-hash-words 10000)
Inserting 10,000 items:
cpu time: 10 real time: 12 gc time: 0
Performing 10,000 lookups:
cpu time: 530 real time: 487 gc time: 0
> (time-hash-words 10000)
Inserting 10,000 items:
cpu time: 10 real time: 12 gc time: 0
Performing 10,000 lookups:
cpu time: 460 real time: 476 gc time: 0
> (time-hash-words 100000)
Inserting 100,000 items:
cpu time: 150 real time: 534 gc time: 0
Performing 100,000 lookups:
cpu time: 94,460 real time: 94,448 gc time: 0
> (time-hash-words 100000)
Inserting 100,000 items:
cpu time: 130 real time: 204 gc time: 0
Performing 100,000 lookups:
cpu time: 96,530 real time: 96,561 gc time: 0

Well, the hash-table implementation slowed down considerably, but it's still performing far better than the association-list implementation. It should also be noted that the benchmarking function 'time-hash-words' has to randomly select words from a list, and this takes additional time which is reflected in the above statistics. To give the hash-table implementation a fair shake we'd have to design a better evaluation scheme. There are other algorithms for implementing dictionary lookup; one of the more popular is based on a data structure called a balanced binary tree. This algorithm has a worst-case performance of O(log n), which is better than O(n) but worse than O(1). If you'd like to do a final project with a mix of theory and practice, you can implement the standard algorithms for linear search, hash tables and balanced binary trees and compare them both analytically and experimentally.
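We'll be using MzScheme's built-in hash tables directly in the next section, so here, for reference, is the basic repertoire of operations. This is a typical interaction of my own, not a transcript from the benchmark code:

> (define table (make-hash-table))           ; create an empty hash table
> (hash-table-put! table 'Nov "11")          ; insert a key/item pair
> (hash-table-get table 'Nov)                ; look up an existing key
"11"
> (hash-table-get table 'Foo (lambda () #f)) ; a thunk supplies the value for missing keys
#f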
Semantic Networks - Graphs with Labeled Edges and Nodes

In a semantic network the nodes correspond to words and the arcs correspond to relationships between words. Here are some of the most obvious relationships found in semantic networks:

synonyms - one word is-roughly-the-same-as another word,
antonyms - one word has-roughly-the-opposite-meaning of another word,
hypernyms - one word is-more-general/less-specific-than another word,
hyponyms - one word is-less-general/more-specific-than another word, and
meronyms - one word denotes an object that is part of an object denoted by another word.

Taxonomies can be represented as semantic networks. Carolus Linnaeus, an 18th-century Swedish botanist, devised a taxonomic categorization of the biological world in which each species belongs to a genus, each genus to a family, each family to an order, each order to a class, each class to a phylum and each phylum to a kingdom. A similar sort of analysis has been done for genes, relating individual genes to their known functions and functional classes. Peter Mark Roget first published his thesaurus of synonyms and antonyms in 1852, and much of it can also be captured in a semantic network. You can download the raw text of the 1911 edition of Roget's Thesaurus from Project Gutenberg at http://www.gutenberg.org/dirs/etext91/roget15a.txt

WordNet is a project of the Cognitive Science Department at Princeton University. It is basically a big semantic network that supports a rich set of relationships represented as links. Visit their web site: http://www.cogsci.princeton.edu/cgi-bin/webwn/ and try a simple query like 'giraffe' and have WordNet search for hypernyms and hyponyms. Here is a longish chain in which each word is a hyponym of the one that follows:

bronco, mustang, pony, horse, equine, odd-toed ungulate, placental mammal, mammal, vertebrate, chordate, animal, organism, entity

A pairwise relationship ~ can be

reflexive - for all x, x ~ x,
symmetric - for all x and y, if x ~ y then y ~ x, and
transitive - for all x, y and z, if x ~ y and y ~ z then x ~ z.

An equivalence relation, e.g., '=', is reflexive, symmetric, and transitive. Some of the relationships typically captured in semantic networks are symmetric (e.g., synonyms and antonyms) and some are transitive but not symmetric (e.g., hypernyms and hyponyms). In the case of taxonomies, we are often interested in whether one word is related to another through a sequence of taxonomic relationships. For example, a giraffe is an ungulate (an animal with hoofs) and all ungulates are mammals, from which we can conclude that a giraffe is a mammal. In searching the web for information about animals with hoofs, you might want to narrow your search to focus on the giraffe or rhinoceros, or you might want to broaden your search to cover all mammals. Google allows the use of the tilde operator '~' to widen a search to include synonymous keywords.

To implement semantic networks we'll use a basic Scheme utility for creating specialized data structures. For example, here's how you'd create a data structure for representing points in the plane:

> (define-struct point (x-coord y-coord))

This one statement causes several additional functions to be defined: 'make-point' for creating points, 'point?' for testing whether or not an object is a point, 'point-x-coord' and 'point-y-coord' to access the x and y coordinates of a point respectively, and 'set-point-x-coord!' and 'set-point-y-coord!' to set the x and y coordinates of a point respectively.

> (define pt (make-point 1 0))
> (point? pt)
#t
> (point-x-coord pt)
1
> (point-y-coord pt)
0
> (set-point-y-coord! pt 7)
> (point-y-coord pt)
7

1. Write a function 'distance' that takes two points created by 'make-point' and computes the Euclidean distance between the two points. Write a function 'tour' that takes a list of points created by 'make-point' and computes the total distance traveled if you start at the first point in the list, travel in a straight line to the second point, from there travel on to the third point, and so on, ending up at the last point in the list.

2. Learn about the graphics.ss library in DrScheme and implement a data structure and supporting functions for representing and drawing two-dimensional polygonal objects. This is an advanced exercise and will take some time and effort to understand the syntax for drawing using the graphics.ss library. Search for 'Viewport Graphics' in the Help Desk and add the following 'require' directive to your programs:

(require (lib "graphics.ss" "graphics"))

You should start with a set of primitive geometric objects, e.g., rectangle, circle, isosceles triangle. Your program should be able to compose a more complicated geometric object from a specification involving the simpler, primitive objects.

Now, we'll define the data structures for implementing semantic networks. A node in a semantic network is identified by a 'word' and it has links that lead 'out' from the node and links that lead 'in' to the node.

> (define-struct node (word out in))

A link has a specified 'type' and it has an origin node that it leads 'out' of and a destination node that it leads 'in' to.

> (define-struct link (type out in))

3. The in-degree of a node in a directed graph is defined as the number of links that lead into the node. The out-degree is defined analogously. Write the functions 'in-degree' and 'out-degree' implementing these definitions. Write a function 'max-degree' which computes the maximum in- and out-degrees over all nodes in the graph corresponding to a semantic network.

We use a hash table to implement a dictionary that will serve to index the nodes in our semantic network.

> (define dictionary (make-hash-table))

Insert a word into the dictionary, associating it with a node in the network. If a node is already associated with the word then return the node.

> (define (insert word)
    (or (hash-table-get dictionary word (lambda () #f))
        (let ((node (make-node word '() '())))
          (hash-table-put! dictionary word node)
          node)))

Return the node associated with a specified word if it exists and otherwise signal an error.

> (define (lookup word)
    (or (hash-table-get dictionary word (lambda () #f))
        (error "lookup: unknown node" word)))

Add an 'isa' link starting from 'out' and leading to 'in'.

> (define (isa out in)
    (let* ((out (lookup out))
           (in (lookup in))
           (link (make-link 'isa out in)))
      (set-node-out! out (cons link (node-out out)))
      (set-node-in! in (cons link (node-in in)))))

Now we can build a little semantic network as follows:

> (insert 'giraffe)
> (insert 'ungulate)
> (insert 'mammal)
> (isa 'giraffe 'ungulate)
> (isa 'ungulate 'mammal)

This next procedure just dumps the entire contents of a semantic network by using the 'hash-table-for-each' operator, which works on hash tables similarly to how 'map' and 'for-each' work on lists.
> (define (dump)
    (hash-table-for-each
     dictionary
     (lambda (word node)
       (printf "~a - out: ~a - in: ~a~%"
               word
               (map node-word (map link-in (node-out node)))
               (map node-word (map link-out (node-in node)))))))
> (dump)
ungulate - out: (mammal) - in: (giraffe)
mammal - out: () - in: (ungulate)
giraffe - out: (ungulate) - in: ()

All of the above code can be found in ./bin/semantic.scm.

5. Write functions 'more-general' and 'more-specific' which, given a word, return a list of words sorted by, respectively, the 'more general' and 'more specific' relationships, where we assume that both relationships are transitive.

6. Write predicates 'is-more-general?' and 'is-more-specific?' that take two words in the dictionary and return #t if the first is, respectively, more general or more specific than the second and return #f otherwise.

7. Build a semantic network for the parts of a car. Create link types for 'part-of' (a carburetor is part of the engine), 'is-a' (an internal combustion engine is a kind of engine, which is a kind of motive force), and 'connected-to' (the drive train is connected to the rear axle). Alternatively, build a semantic network describing functional and structural dependencies among the parts/players of an athletic team, a company, a building, or a spring-wound clock.

8. Use the 'dot' graph language and graphing utility from the Graph Viz collection of graph drawing tools to draw and label a semantic network. Your program should take as input a hash table implementing a semantic net with nodes and links built using 'make-node' and 'make-link' and return a digraph with labeled nodes and links specified using the 'dot' graph language. You can learn more about Graph Viz and 'dot' at the Graph Viz documentation web page http://www.graphviz.org/Documentation.php and the exercise on drawing graphs on the Talking With Computers web site http://www.cs.brown.edu/~tld/talk/home-Z-H-15.html#node_sec_13.2.1
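As a starting point for exercises 5 and 6, here's a sketch of a helper that collects the words one 'isa' link away from a given word; extending it to handle transitivity (and sorting) is left to you. The name 'direct-more-general' is my own:

;; Follow each outgoing link from the node for 'word' and collect the
;; words at the other end, i.e., the directly more general words.
(define (direct-more-general word)
  (map node-word (map link-in (node-out (lookup word)))))

> (direct-more-general 'giraffe)
(ungulate)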
How does a search engine represent and store documents to support keyword queries? Here are the basic procedures involved in summarizing documents to support keyword queries:

document -> bag of words

For the most part, word order is ignored. Search engines like Google typically also store the location of each word in a document to support proximity queries, e.g., search for documents including 'search' and 'engine' such that these two terms are within one word of each other. Google is also starting to store short phrases, but storing phrases has to be managed carefully to avoid an explosion in terms.

1. Find a utility for converting Microsoft Word documents to plain text and then use it to convert a Word document to a bag of words. There are various utilities available, including 'antiword' and 'msview', that you might find on the system or by searching the web and downloading and installing an application.

bag of words -> bag of terms

Typically the words in a document are preprocessed by removing 'stop words' such as 'a', 'the', 'it', 'and', 'that', etc. - words that add little meaning given that the document is represented as a bag of words with little additional semantic analysis. In addition to removing stop words, usually the remaining words are 'stemmed' to map morphological variants of a given word to the same root, e.g., 'run', 'runs' and 'running' all map to 'run'. The set of all terms in the corpus is called the 'dictionary'. In most cases, stemming reduces the size of the dictionary and the length of term vectors, but it can in some cases conflate terms that we would like to remain separate, e.g., 'universe', 'university' and 'universal' all map to the root 'univers'.

2. Find a Perl or Python script that implements the 'Porter stemmer'. Use it to stem a sample of words drawn from a familiar text and note what terms are unambiguously stemmed and what terms conflate meanings.

bag of terms -> inverted index

In order to retrieve documents that contain specific terms we build an inverted index which points from each term to the documents that contain the associated word. Usually the inverted index also records the position of each term/word in the document to support proximity searches.

3. How would you implement an inverted index in Scheme? Create the data structures and functions implementing the operations of 'index' and 'fetch'.

bag of terms -> term-frequency vector

The bag of terms is then processed to determine the number of times each term appears in the document - this number is called the frequency of the term - and the resulting numbers are used to construct a term-frequency vector which maps each term to its frequency in the document. You can think of this as a vector of numbers where the i'th number represents the frequency of the i'th term in the dictionary, but in fact term-frequency vectors are never actually implemented in this fashion. A more efficient way of implementing this mapping would be as an association list or hash table.

4. Implement term-frequency vectors as association lists and then write code for two important operations on vectors from linear algebra: the magnitude of a vector and the inner or dot product of two vectors (see ../../../04/11/03/exercises.txt).

Now that we have a representation for each document and an index mapping terms/words to documents, keyword queries are handled as follows:

keywords -> keyterms

Eliminate stop words and stem the remaining keywords.
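Here's a minimal sketch of the stop-word step; '*stop-words*' is a made-up name and the list below is nowhere near complete, and stemming would additionally require something like the Porter stemmer from exercise 2:

;; A tiny, illustrative stop-word list - a real one would be much longer.
(define *stop-words* '(a the it and that of to in))

;; Return 'words' with the stop words removed.
(define (remove-stop-words words)
  (cond ((null? words) '())
        ((memq (car words) *stop-words*)
         (remove-stop-words (cdr words)))
        (else (cons (car words) (remove-stop-words (cdr words))))))

> (remove-stop-words '(the boy went to the store))
(boy went store)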
keyterms -> documents

To retrieve a set of candidate documents we use the inverted index to map the keyterms to documents containing them. The search engine compiles a list of all documents that contain some (or all) of the keyterms in preparation for ranking.

keyterms -> keyterm-frequency vector

To rank the documents, we can construct a term-frequency vector from the search keywords, which we'll call the keyterm-frequency vector. At this point we can add additional terms corresponding to synonyms of the keywords. We can also specialize the query by including hyponyms (x 'is a kind of' y) or generalize it by using hypernyms (y 'is a kind of' x).

5. Implement the cosine angle of two term-frequency vectors using the 'magnitude' and 'inner' (dot) product operations you defined earlier.

documents + keyterm-frequency vector -> document ranking

The degree of similarity between the keyterm-frequency vector and each of the document term-frequency vectors is then calculated, and the documents are sorted using the calculated similarities and presented to the user in the resulting order.
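Here's a minimal sketch of that last step, assuming the 'cos-angle' function from the earlier vector exercises has been adapted to term-frequency vectors represented as association lists; 'rank-documents' and 'doc-vectors' are made-up names, and each element of 'doc-vectors' is assumed to be a (document . term-frequency-vector) pair:

(require (lib "list.ss")) ; for 'quicksort'

;; Sort the (document . term-frequency-vector) pairs so that the
;; documents most similar to the query vector come first.
(define (rank-documents query-vector doc-vectors)
  (quicksort doc-vectors
             (lambda (a b)
               (> (cos-angle query-vector (cdr a))
                  (cos-angle query-vector (cdr b))))))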
1. Suppose that you want to compute the minimum and the maximum of a list of items. Here's a simple algorithm for computing the minimum:

simple-min of a list of items
  set min to be the first item in items
  foreach item in the rest of the items
    if item < min then set min to be item endif
  end
  return min

And here's an algorithm for computing the maximum of a list of items:

simple-max of a list of items
  set max to be the first item in items
  foreach item in the rest of the items
    if item > max then set max to be item endif
  end
  return max

Here's the most obvious way of computing both the minimum and the maximum:

simple-min-and-max of a list of items
  set min to be simple-min applied to items
  set max to be simple-max applied to items
  return ( min, max )

We can be a little more clever and only iterate over the list of items once:

combined-min-and-max of a list of items
  set min to be the first item in items
  set max to be the first item in items
  for each item in the rest of the items
    if item > max then set max to be item endif
    if item < min then set min to be item endif
  end
  return ( min, max )

Or we can be downright devious and do a little preprocessing of the list:

tricky-min-and-max of a list of items
  set mins to be the empty list
  set maxs to be the empty list
  for each consecutive pair of items in the list of items
    if first item in the pair > second item in the pair
      then add the first item in the pair to maxs and
           add the second item in the pair to mins
      else add the first item in the pair to mins and
           add the second item in the pair to maxs
    endif
  end
  set min to be simple-min applied to mins
  set max to be simple-max applied to maxs
  return ( min, max )

Note that if the list of items is (item_1 item_2 item_3 item_4) then the list of consecutive pairs is ((item_1 item_2) (item_3 item_4)). Don't worry about the case in which the length of the list of items is odd.

For each of the three algorithms, 'simple-min-and-max', 'combined-min-and-max' and 'tricky-min-and-max', determine how many comparisons (applications of < or >) are required to compute the min and max of a list of items as a function of n, the length of the list. Note that 2 * n, n/4, n^2, and log n are all functions of n, as are any combinations of these such as 2 * log n/4. Show your work; by this I mean write out the steps that you took to arrive at your solutions, e.g., the for-each loop performs ... comparisons and the call to 'simple-min' applied to the list of mins performs ... comparisons ...

2. Suppose you want to insert a new item into a list of items that is already sorted so that the new list with the inserted item is also sorted; how many comparisons would this require for a list of length n?

3. Use the idea of inserting items in an already sorted list to define a sorting algorithm called 'insertion-sort' and suggest an upper bound on the number of comparisons it performs.
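If you want to check your comparison counts experimentally, here's one possible Scheme rendering of 'combined-min-and-max' using a 'do' loop; instrumenting it to count the comparisons is an easy extension:

;; Return (min max) for a nonempty list of numbers, scanning it once.
(define (combined-min-and-max items)
  (do ((rest (cdr items) (cdr rest))
       (min (car items) (if (< (car rest) min) (car rest) min))
       (max (car items) (if (> (car rest) max) (car rest) max)))
      ((null? rest) (list min max))))

> (combined-min-and-max '(3 1 4 1 5 9 2 6))
(1 9)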
Here's what I'd like you to get out of the first part of this lecture:

a. a basic understanding of how you measure the performance of algorithms
b. an understanding of how we measure the difficulty of problems and assign problems to classes

Computational Complexity

An algorithm is an abstract description of a computation: a recipe in a reasonably precise mathematical language that describes the steps required to carry out a given computation. Algorithms differ from particular implementations of an algorithm, e.g., in Scheme, C, Java or C++, only in the level of detail and, indeed, these languages are perfectly suited to precisely specifying algorithms. There are many ways to measure the performance of algorithms. For example, you might want to count the number of required time-consuming operations, e.g., the number of string comparisons or arithmetic operations. Alternatively, you might want to figure out how much space (or memory) an algorithm requires. And you might want to understand how an algorithm performs in the worst case, the best case, or the average case. We typically describe the performance of algorithms as a function of n, where n is a measure of the size of the problem, such as the number of items in a list of items to be searched or sorted, or the number of nodes in a graph in which we want to find the shortest path or the shortest tour (in the case of a traveling salesman problem). For the reasons I gave on Monday, we group together (or classify) problems that are 'asymptotically' (in the limit) equivalent, e.g.,

linear - O(n), read 'order of n' using so-called big-oh notation,
logarithmic - O(log n),
quadratic - O(n^2),
cubic - O(n^3),
or, more generally, polynomial - O(n^k) for some fixed k,
exponential - O(2^n),
doubly exponential - O(2^(2^n)), etc.

By contrast, the difficulty of a 'problem' is typically associated with the class of the 'most efficient' algorithm for solving the problem. Note that this is not the most efficient algorithm we know of, but rather the most efficient algorithm possible. For example, computer scientists have shown that the problem of sorting a list of items using comparisons (we have to be specific given that there are other approaches to sorting - look up 'radix sort' using Google if you're curious about this) is in the class of problems that can be computed in O(n log n); O(n log n) is referred to as the 'asymptotic worst-case complexity' of the problem. I also mentioned that there are some problems that we can't even find solutions for. These are the so-called 'undecidable problems' and one of the best known such problems is the so-called 'halting problem'. We'll look at the halting problem in more detail in just a few minutes.

For some problems, like sorting, we actually know the performance of the most efficient algorithm. For others we don't know for sure. In some cases, computer scientists have identified abstract classes that characterize computational problems. The class of NP-complete problems is one of the most famous such classes. A 'nondeterministic polynomial time algorithm' is one that tries to guess a solution to a problem and can determine whether or not a proposed solution is indeed a solution to a specified problem in polynomial time. The class of NP-complete problems (the long version is 'the class of problems complete for nondeterministic polynomial time') is such that if you could solve any one of these problems in polynomial time you could solve them all in polynomial time.
A 'decision' variant of the traveling salesman problem is in the class of NP-complete problems. In this particular variant, the task is to determine for a given graph whether or not there exists a tour that is shorter than a specified threshold distance. Other problems involve solving various kinds of puzzles, configuring computer networks, scheduling assembly lines, etc.

1. Argue that you can check a proposed solution to an instance of the traveling salesman problem in polynomial time.

So far the best known algorithms for these problems all run in time exponential in the size of their input for some inputs. Computer scientists conjecture, but don't know for sure, that P != NP, that is to say that the class of NP-complete problems is distinct from the class of polynomial-time problems. A few computer scientists believe that P = NP and that we just haven't been clever enough in devising algorithms; sooner or later we'll find a polynomial-time algorithm for some problem in the class of NP-complete problems, and this problem will cause the class of NP-complete problems to 'collapse' like a house of cards into the class of polynomial-time problems.

The question of whether or not P = NP produces all sorts of cranks: people who think they have a proof that either P != NP or P = NP. One of the most common challenges involves the invention of alternative forms of computation, for example, DNA computing, quantum computing, or even computing using ant colonies. There are proposals for using biological processes, and in particular DNA and cellular processes, to compute the solutions to problems. A European research group claims that it can solve instances of TSP using a colony of ants. Perhaps the most interesting and credible challenge concerns the application of quantum mechanics to building a so-called quantum computer. It has been conjectured that quantum computing will enable us to solve a particular problem relating to security and code breaking faster than on a conventional computer.

Half a century ago a mathematician by the name of Alan Turing thought he had solved the problem of there being different models of computing by proving that all sufficiently powerful computing machines are equivalent. He did this as part of proving that the halting problem is undecidable.

The Halting Problem

What might you expect from the following program:

while ( n != 1 ) {
  if n is even then n = n / 2 ;
  else n = 3n + 1 ;
}

Or, if you prefer, in Scheme:

(define (steps n)
  (if (not (= n 1))
      (steps (if (even? n) (/ n 2) (+ (* 3 n) 1)))))

Does it halt or not? No one actually knows. For a power of two, it clearly halts (for n = 1024 it halts in 10 steps). But for some other inputs it bounces around wildly. Even for relatively simple programs involving integer inputs it's difficult to predict their behavior. But that doesn't mean it's impossible. How can we show that there is no program that can predict the behavior of an arbitrary program on an arbitrary input? We're going to present a proof by contradiction: we'll assume that such a program exists and then we'll show that this assumption leads us to a contradiction.

2. Convince yourself of the following proof by contradiction. In the following, P, Q and S are programs and X represents a possible input to a program. Since programs are just strings of symbols they can be used as inputs to programs.
To begin the proof by contradiction, assume that there exists a program Q defined as follows:

Define Q(P,X) as follows:
  If P(X) halts then return YES else return NO

Now we define another program S that uses Q as a subroutine and that definitely does not halt on some inputs:

Define S(X) as follows:
  If Q(X,X) = YES then loop forever else terminate

Suppose that we apply S to itself as in S(S). First, suppose that S(S) terminates; but this can't be, since the only way for S(S) to terminate is for Q(S,S) to return NO, which indicates that S(S) doesn't terminate. Then suppose that S(S) doesn't terminate; but this can't be either, since if S(S) doesn't terminate that means that Q(S,S) must have returned YES, which indicates that S(S) does terminate. Since in either case we arrive at a contradiction, we are led to reject our initial assumption, namely that there actually is a program that can determine for any program and any input whether the program will halt on that input. In the process of filling in the details of this proof, Turing devised an abstract computing device which now bears his name.

Turing Machines

Alan Turing proved that there is no mechanical method that can tell for any program whether or not it will halt. To complete his proof he imagined a mathematical abstraction that reduces computation to a simple, almost elemental form. Here are the main features of a Turing machine:

<-- ... | # | 0 | 1 | 0 | 0 | 0 | 0 | 0 | # | ... -->  TAPE
                  ^
                  |  READ-WRITE HEAD
                  |
           | CONTROLLER |
           |   RULES    |
           |  STATE: 1  |

* a finite alphabet of symbols; in this case 0, 1 and #,
* a finite set of states that the machine might be in; for example, a machine might have the set of three states: MOVING FORWARD LOOKING FOR A #, MOVING BACKWARD LOOKING FOR A 1, and DONE,
* an infinite TAPE arranged as a sequence of locations each capable of recording a single symbol,
* a READ-WRITE HEAD capable of reading the symbol written on the tape at the location under the head or writing a symbol at the location under the head, and
* a CONTROLLER consisting of a finite set of rules each of which can, depending on the current state and the symbol written on the tape under the read-write head, change the current state to be a new state, write a new symbol at the location currently under the head, and move the tape forward or backward under the head.

We can represent the rules used to define a controller for a particular machine as a table with the following columns:

CurrentState  CurrentSymbol  NewState  NewSymbol  Move

Here's the controller for the 'NO2ONES' machine which doesn't like consecutive ones in a string; whenever NO2ONES finds two consecutive ones, it replaces the second one with a zero. When it gets to the end of a string (encounters a blank space indicated by a #) then it stops. The rules assume that the head starts at the beginning (the leftmost symbol) of the string that we want to work on. The plus symbol '+' indicates move the head forward, the minus '-' move the head back, and the asterisk '*' stay where you are, don't move the head in either direction.
CurrentState  CurrentSymbol  NewState  NewSymbol  Move
1             1              2         1          +
1             0              1         0          +
1             #              HALT      #          *
2             1              1         0          +
2             0              1         0          +
2             #              HALT      #          *

Another exercise in writing Turing machines is to define a machine that can recognize palindromes (ignore the spaces and punctuation):

"A man, a plan, a canal: Panama"
"A nut for a jar of tuna"
"A daffodil slid off Ada"

You can find many more palindromes at http://www.palindromes.org/ Here's the table of rules for recognizing even-length palindromes:

CurrentState  CurrentSymbol  NewState  NewSymbol  Move
1             0              4         #          +
1             1              2         #          +
1             #              HALT      *          *
2             0              *         *          +
2             1              *         *          +
2             #              3         *          -
3             1              6         #          -
3             #              HALT      *          *
3             0              HALT      *          *
4             0              *         *          +
4             1              *         *          +
4             #              5         *          -
5             0              6         #          -
5             #              HALT      *          *
5             1              HALT      *          *
6             0              *         *          -
6             1              *         *          -
6             #              1         *          +

An asterisk '*' in the NewState or NewSymbol column means 'no change'.

3. Here's the graphical representation of the controller shown as a so-called finite-state transition diagram:

     /-----------------> 6 <-----------------\
    /                    |                    \
   3 <------- 2 <------- 1 -------> 4 -------> 5

Fill in the transitions by labeling each arc with the current symbol on the tape, the new symbol on the tape, and the direction to move the tape. Note that there are some arcs corresponding to self transitions that you'll have to add to the diagram.

4. Devise a Turing machine to erase every third 1 in a sequence of 1s and 0s:

#01011011010# => #01010011000#

5. If you found the palindrome machine easy, try the copy machine:

#0101###### => #0101#0101#

6. For a more ambitious challenge, implement a rule set for the adding machine:

#0101#0110###### => #0101#0110#1011#

This is significantly more difficult than exercises (4) or (5). Here we are assuming numbers are represented 'big endian' with the least significant bit on the right.

What can you do with Turing machines? Arithmetic? Logic? The answer is that you can do anything that you can do with a traditional computer; more, in fact, if you assume that you have an infinite tape. In particular, you can build a Turing machine called a Universal Turing Machine that can simulate any other Turing machine.

Experimenting with Turing machines: We've provided a Turing machine simulator for you to experiment with (take a look at ./bin/turing.scm). In the Scheme implementation of this simulator, we used '_'s for '#'s since '#' has a special significance in Scheme. The 'execute' function takes four arguments: a set of rules, a list of symbols, the starting state and the starting position on the tape (positions are zero-based).

> (load "turing.scm")
> (define NO2ONES
    '((1 1 2 1 +)
      (1 0 1 0 +)
      (1 _ HALT _ *)
      (2 1 1 0 +)
      (2 0 1 0 +)
      (2 _ HALT _ *)))
> (execute NO2ONES '(_ 1 1 0 1 1 0 1 _) 1 1)
(_ 1 1 0 1 1 0 1 _)
(_ 1 0 0 1 0 0 1 _)

There's also a version of 'execute' called 'verbose' that spits out lots of diagnostic information that might prove useful in doing exercises (4), (5) and (6).
> (verbose NO2ONES '(_ 1 1 0 1 1 0 1 _) 1 1)
(_ 1 1 0 1 1 0 1 _)
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = 1, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 0, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = 1, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 0, Next state = 1, New symbol = 0, Direction = +
State = 1, Symbol = 1, Next state = 2, New symbol = 1, Direction = +
State = 2, Symbol = _, Next state = halt, New symbol = _, Direction = *
(_ 1 0 0 1 0 0 1 _)

The 'execute' function also allows the use of wildcards, which can reduce the number of rules in some cases. If you want to use wildcards, read the comments in the code concerning their use. Here's the palindrome machine without using wildcards:

(define PALINDROMES
  '((1 0 4 _ +) (1 1 2 _ +) (1 _ HALT * *)
    (2 0 2 0 +) (2 1 2 1 +) (2 _ 3 _ -)
    (3 1 6 _ -) (3 _ HALT _ *) (3 0 HALT 0 *)
    (4 0 4 0 +) (4 1 4 1 +) (4 _ 5 _ -)
    (5 0 6 _ -) (5 _ HALT _ *) (5 1 HALT 1 *)
    (6 0 6 0 -) (6 1 6 1 -) (6 _ 1 _ +)))

And here's the palindrome machine using five fewer rules with wildcards:

(define palindromes
  '((1 0 4 _ +) (1 1 2 _ +) (1 _ HALT * *)
    (2 * * * +) (2 _ 3 * -)
    (3 1 6 _ -) (3 * HALT * *)
    (4 * * * +) (4 _ 5 * -)
    (5 0 6 _ -) (5 * HALT * *)
    (6 * * * -) (6 _ 1 * +)))

7. Implement the Turing machines that you described in exercises (4), (5) and (6).

8. For Friday, December 3, 2004 you are to read Chapter 16: 'Ain't Nobody Here But Us Machines' and also read the on-line exercise http://www.cs.brown.edu/~tld/talk/home-Z-H-18.html#node_sec_16.1.1 titled 'Pet Rocks and Things That Think'. You're to do the exercise that requires you to specify two measures of complexity and then rank the six systems listed: an ant, an elm tree, etc. Come to class prepared to discuss the questions raised in this piece.
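Finally, in case you're curious how a simulator like 'execute' might work under the hood, here's a minimal sketch of the core loop. This is my own guess at an implementation - the actual ./bin/turing.scm no doubt differs - and it handles neither wildcards nor running off the end of the tape:

;; Find the first rule matching the current state and tape symbol.
(define (find-rule rules state symbol)
  (cond ((null? rules) (error "no rule for" state symbol))
        ((and (equal? (car (car rules)) state)
              (equal? (cadr (car rules)) symbol))
         (car rules))
        (else (find-rule (cdr rules) state symbol))))

;; Return a copy of the tape with 'symbol' written at position 'pos'.
(define (write-symbol tape pos symbol)
  (if (= pos 0)
      (cons symbol (cdr tape))
      (cons (car tape) (write-symbol (cdr tape) (- pos 1) symbol))))

;; Apply rules until the machine enters the HALT state; return the tape.
(define (run rules tape state pos)
  (if (eq? state 'HALT)
      tape
      (let ((rule (find-rule rules state (list-ref tape pos))))
        (run rules
             (write-symbol tape pos (cadddr rule)) ; new symbol
             (caddr rule)                          ; new state
             (case (car (cddddr rule))             ; move the head
               ((+) (+ pos 1))
               ((-) (- pos 1))
               (else pos))))))

> (run NO2ONES '(_ 1 1 0 1 1 0 1 _) 1 1)
(_ 1 0 0 1 0 0 1 _)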
I started this course by talking about the writings of Vannevar Bush who, right after World War II, set forth a vision that encompassed a vast information store anticipating the World Wide Web and suggested that computers would amplify our ability to communicate, calculate and perceive in much the same way that the internal combustion engine and other technologies amplified our physical abilities. Computer scientists, some of them at least, think they can explain not only decision making and perception in computational terms but also the very stuff that makes us human (or so we believe): consciousness, emotions, altruism, and intelligence in all its manifestations. In the 17th century, scientists tried to explain human behavior in terms of clockwork mechanisms, and later in terms of steam engines, electrical signals, and a host of other technological advancements. So it's worth asking the question: is the current optimism about unraveling the mystery of the mind warranted, or are the current set of computational explanations just another fad?

Some computer scientists such as Hans Moravec and Ray Kurzweil believe that one day relatively soon (at least by 2100) we'll be able to manufacture the stuff of life, replace not only our organs but our brains, codon-by-codon, cell-by-cell, neuron-by-neuron, thereby providing us with unlimited life spans and unlimited opportunities to amplify our selves. These are truly wondrous possibilities. Technologists such as Kurzweil are banking on the exponential improvements in computing technology and biology and nanotechnology, which are closely intertwined. This exponential trend is fueled by Moore's Law - which basically predicts that the number of transistors (elementary computing components) on a chip will double approximately every two years and therefore that the computing power available on a microchip will double every two years. O(2^n) gets very big very fast. Moravec and Kurzweil imagine riding this exponential curve to immortality and cognitive amplification without bounds.

But critics of this sort of technological exuberance challenge these goals on the basis of their having little or nothing to do with being human. The very idea is that we are machines - from Moravec's and Kurzweil's perspective, really just information and computational processes - and that these processes can be transferred from one machine, such as the human body, to another, such as the silicon chips that comprise a computer. If we are just machines, what does that say about being human? And if you believe we're 'just' machines and thereby subject to a set of rules as are all machines, then what does that say about free will and self-determination? Are we different from any other animal except by some measure of how fast our processors run or how much memory we have or how many rules we can learn from experience?

1. You were asked to come up with two measures of 'complexity' so as to situate six systems along a continuum. What are your measures and what do they imply about our relationship to 'others'?

2. What do you think about Moravec's and Kurzweil's vision of the future? How might the availability of cognitive amplification affect jobs, admission to college, etc.? Assuming we can deal with any resulting population pressures, how do you feel about the prospect of living indefinitely long?

Closing Remarks

This course is likely different from any you've experienced in high school and it may be different from many if not most of the courses you'll take in college.
For one thing, there were no exams, the quizzes were meant as gentle persuasion to read the material for class, and the exercises were really meant for your edification, not to keep track of what you were learning. The projects were what you made of them. They were designed to reinforce the things you learned in class and in the exercises, but they were offered as an opportunity for you to apply what you learned and stretch yourself. Some of you took advantage of this opportunity and some of you did the minimum to get by.

You are rapidly approaching a time of your life in which there won't be any formal structure for guiding your performance: no teachers, no coaches, no explicit instructions regarding what to do, when to do it, or how much is enough. Well, perhaps that's an exaggeration; there are jobs where someone is always telling you what to do and when to do it, but I'd argue that these jobs aren't worth having. This class offered you ideas, resources, lots of help and an opportunity to explore ideas and come up with some of your own. The final projects offer yet another such opportunity; I hope I convinced those of you who came to office hours on Tuesday that all three of the projects offer interesting opportunities to explore. You can make of them what you will.

I can't imagine any of you getting less than a C in this class, but some of you will get less than an A. But really that should be much less important than your learning to take advantage of what life offers by applying yourself energetically to getting the most out of every experience, and especially the next four years. If you learn this ability now, so many other things in life will fall into line. Take advantage of these last two weeks to get something substantial out of completing your final project - don't worry about grades, because sooner than you think they won't make any difference. What will make a difference is your attitude and your ability to learn and exploit the opportunities that come your way.
Help session for the 'Dueling Chatterbots' project

The first four exercises lead up to getting 'albert' and 'eliza' talking with one another:

1. Write two functions 'one' and 'two' such that each takes a single argument corresponding to an integer, adds 1 to its argument, prints the result on a line preceded by 'i -> ' where i = 1 in the case of 'one' and i = 2 in the case of 'two', and then returns the result.

> (one 1)
1 -> 2
2
> (two 2)
2 -> 3
3

Note the behavior when the two functions are alternated:

> (one (two (one (two (one (two 1))))))
2 -> 2
1 -> 3
2 -> 4
1 -> 5
2 -> 6
1 -> 7
7

2. Now write a loop using 'do' that first calls 'one' with 0, then calls 'two' with the result returned by 'one', then calls 'one' with the result returned by 'two', and so on indefinitely. There are lots of ways to do this, but you might consider using an integer counter 'i' and applying 'one' in the cases where 'i' is odd and 'two' in the cases where 'i' is even. The predicate 'even?' returns #t if its single integer argument is even and #f otherwise.

3. Now produce the same behavior by redefining 'one' and 'two' to be mutually recursive, i.e., 'one' calls 'two' and 'two' calls 'one'.

4. Rename 'one' and 'two' as 'eliza' and 'albert' and rewrite the body using 'aux-chatter' as a model so that each function calls 'apply-rules' with an appropriate set of rules ('eliza-rules' or 'albert-rules'), prints out the output and then sends it along as appropriate.

This next set of exercises involves writing rules. Recall that rules are of the form (cons pattern list-of-responses) so that

'((I went to the ?x)
  (Did you enjoy going to the ?x)
  (What did you find when you got to the ?x))

matches

I went to the museum

and would produce either the response

Did you enjoy going to the museum

or the response

What did you find when you got to the museum

with equal probability.

5. As a warmup, create a rule that matches any of the following

I think you are crazy
I think you are schizophrenic
I think you are psychopathic

but not

I think you are going too far
I think you are an idiot

and produces responses like

I think you are [crazy | schizophrenic | psychopathic] too
I certainly am not [crazy | schizophrenic | psychopathic]
Why do you think I am [crazy | schizophrenic | psychopathic]

6. Create a rule that will match all of the following

When I get home I want to sleep for a month
After exams I want to watch stupid TV for a week
During the break I want to go skiing in Taos

but not either of the next two sentences

I want another cup of coffee and a donut
I'm fed up and I want some peace and quiet

and produce responses like

That would be nice to be able to sleep for a month
I too would like to watch stupid TV for a week

The next set of exercises involves writing special-purpose matching functions to augment the pattern matcher. The first exercise tests your understanding of how 'special's are handled in the pattern matcher and allows the application of special-purpose matching functions to be a little more compact; instead of writing (special my-match-fun ...) you use (my-match-fun ...)

7. Modify the mechanism for handling 'special' matching functions so that it doesn't need the 'special' keyword; any list whose 'car' is a procedure (i.e., answers #t to the Scheme predicate 'procedure?') is defined as a special-purpose matching function.
You can write special-purpose matching functions that are like 'variable-match' and 'segment-match' in that they consume words in the input, bind variables if the match succeeds, and return the empty set if they fail. Writing a special-purpose matcher like 'synonym-match' involves just this sort of matching. Generalizing the special-purpose matching functions illustrated in the project README file (./exercises/final/chatter/README), there are several basic types of matching function. (Note: there's a bug in the pattern matcher - it can't handle a segment variable right before a 'special' match function; see the README file for more detail. A pattern like (?x (special ... ) ... ) will work just fine, but a pattern like ((?* ?x) (special ... ) ... ) will not work correctly.)

a. Functions that test for some condition and, on the basis of the result of the test, either fail or call the pattern matcher recursively, consuming input and pattern both.

(define (some-match-function pattern input bindings)
  (if (not (some-input-predicate (car input)))
      fail
      (pattern-match (cdr pattern) (cdr input) bindings)))

pattern = ((special some-match-function) ...)

b. Functions like (a) except that they include a variable as an additional argument and, if they succeed, extend the bindings to include an additional binding.

(define (some-match-function pattern input bindings var)
  (if (not (some-input-predicate (car input)))
      fail
      (pattern-match (cdr pattern) (cdr input)
                     (extend-bindings var (car input) bindings))))

pattern = ((special some-match-function var) ...)

c. Functions like (a) except that they consume no input and only the part of the pattern corresponding to themselves. These functions are present only for their side effects and generally appear at the end of the pattern, only after all the input has been consumed.

(define (some-match-function pattern input bindings)
  (if (not (some-input-predicate (car input)))
      fail
      (pattern-match (cdr pattern) input bindings)))

8. Write a special-purpose matcher that keys off emotional words like 'sad' or 'depressed' and outputs either a positive remark like 'Would you like a handkerchief?' or a negative remark like 'Oh, boo hoo! I'm all broken up.' This remark will be output in addition to the response for the first rule whose pattern matches the input.

> (pattern-match '((?* ?x) I feel (special mean-match ?y) (?* ?z))
                 '(Ever since I turned eighty I feel depressed))
BAD ALBERT - Oh, boo hoo! Who cares if you're depressed
GOOD ALBERT - ((?z) (?y . depressed) (?x Ever since I turned eighty))

> (let ((input '(Ever since I turned eighty I feel depressed))
        (rules '((((?* ?x) I feel (special mean-match ?y) (?* ?z))
                  (Do you feel ?y often?)
                  (That's interesting that you feel ?y)))))
    (printf "~%ELIZA - ")
    (map (lambda (word) (printf "~a " word)) input)
    (printf "~%ALBERT - ")
    (map (lambda (word) (printf "~a " word)) (apply-rules rules input))
    (printf "~%ELIZA - ...~%")
    (void))

ELIZA - Ever since I turned eighty I feel depressed
ALBERT - BAD ALBERT - Oh, boo hoo! Who cares if you're depressed
GOOD ALBERT - Do you feel depressed often?
ELIZA - ...

Note that not all of the processing has to be embedded in the rules. For example, you could augment the 'albert' function to keep track of how long 'eliza' takes to respond and, if it takes longer than, say, two or three seconds, then have 'albert' make some impatient remark.

Examples of more advanced natural language processing techniques:

9.
Write a single rule whose pattern matches any of the following sentences

I went to the grocery store.
I walked to a library.
I rode to the convenience store.
I ran to the nearest phone.

and generates a response of the form

Do you think it was wise to go to [the grocery store | a library | the convenience store | the nearest phone]

You can think of the verbs 'went', 'walked', 'rode' and 'ran' as being all of the same sort: involving the physical transfer of a person from one location to another. You might identify other sorts of generalized sentence forms and create rules that work for all instances of a given generalized sentence (see Inside Computer Understanding [Schank and Riesbeck 1981] for more on this approach). A simple version of this can be implemented by an extension of your special-pattern matcher used to handle synonyms.

10. How might you write a special-purpose matching function that figures out the object of the second appearance of the preposition 'to' in the following sentences:

I want to go to a restaurant that serves sushi.
I really want to go to bed when I get back to the dorm.
I want to go to the baseball game this afternoon.

How would you deal with indefinite articles - 'a' and 'an', the definite article - 'the', adjectives like 'some', 'all', 'any' and demonstrative pronouns such as 'this', 'that', 'these', and 'those'?

In the fields of natural language processing, computational linguistics, and information retrieval, researchers use programs called 'part-of-speech taggers' to identify the parts of speech of words appearing in sentences. You might think about incorporating a part-of-speech tagger into your project - it's a bit ambitious, but I'll provide you with some extra help to get you started. Here's a part-of-speech tagger from Helmut Schmid, University of Stuttgart, Germany, that runs under both Mac OS X and Linux.

% cd ~tld/bin/tagger/
% ls
FILES  bin  install-hints.txt  LICENSE  cmd  sentence.txt  README  doc  english.par
% cat sentence.txt
The boy went to the store for a loaf of bread
% bin/tree-tagger -quiet english.par sentence.txt
DT NN VVD TO DT NN IN DT NN IN NN
% bin/tree-tagger -quiet english.par sentence.txt | paste - sentence.txt
DT	The
NN	boy
VVD	went
TO	to
DT	the
NN	store
IN	for
DT	a
NN	loaf
IN	of
NN	bread

It did a pretty good job! If you want to learn what the tags mean, search the web for the Penn Treebank Tagset. Here are the tags used in the example: DT - Determiner, NN - Noun, singular or mass, IN - preposition or subordinating conjunction, TO - the word 'to', and VVD - past tense verb.

The file ~tld/tmp/exercises/04/12/06/bin/tagger.scm provides convenient access to the tagger in Scheme:

> (load "~tld/tmp/exercises/04/12/06/bin/tagger.scm")
> (tagger '(The boy went to the store for a loaf of bread))
("DT" "NN" "VVD" "TO" "DT" "NN" "IN" "DT" "NN" "IN" "NN")
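To pair each word with its tag in Scheme, mirroring the 'paste' trick above, you can map over the sentence and the tagger's output together. This assumes only the 'tagger' function loaded above; your output formatting may differ slightly:

;; 'map' applied to a two-argument procedure and two lists walks the
;; lists in lockstep, producing one (word tag) pair per word.
> (map list
       '(The boy went to the store for a loaf of bread)
       (tagger '(The boy went to the store for a loaf of bread)))
((The "DT") (boy "NN") (went "VVD") (to "TO") (the "DT") (store "NN") (for "IN") (a "DT") (loaf "NN") (of "IN") (bread "NN"))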