Chapter 5

Final Projects

5.1  Project #1: Algorithm Analysis

/Users/tld/email/book/intro/exercises/final/analysis
/Users/tld/email/book/intro/exercises/final/analysis/bin
/Users/tld/email/book/intro/exercises/final/analysis/bin/binary.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/hashing.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/instrument.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/load-once.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/red-black.scm
/Users/tld/email/book/intro/exercises/final/analysis/private
/Users/tld/email/book/intro/exercises/final/analysis/private/bin
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/benchmark.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/hash-table.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.csh
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.txt
/Users/tld/email/book/intro/exercises/final/analysis/private/README
/Users/tld/email/book/intro/exercises/final/analysis/README

In this project you are to evaluate three algorithms for performing
dictionary insertions and lookups.  The three algorithms consist of
the linear-time (worst-case) search algorithm modeled after 'assoc'
using association lists, the logarithmic-time (worst-case) algorithm
using balanced binary trees, and the constant-time (expected-case)
algorithm using hash tables.  You must implement the linear-time and
constant-time algorithms in Scheme, but we will provide you with a
Scheme implementation of balanced binary trees.  You must implement
your hash tables from scratch, i.e., you can't use the built-in
functions 'make-hash-table' or 'hash-table-get'.  You will
have to do some research to learn about the theoretical analyses of
these algorithms and summarize the main results as part of your
documentation.  You will also propose, implement and run a set of
experiments based on benchmarks of your choosing and compare the
empirical results with the theoretical predictions.  This project is
actually significantly easier than it sounds. Here are the pieces:

   a. Implement 'insert' and 'lookup' functions using hash tables.
   b. Read and summarize handouts on balanced trees and hashing.
   c. Instrument all three algorithms for running the experiments.
   d. Identify a set of keys to be used in benchmark experiments.
   e. Run your benchmark experiments and summarize the results. 


See ./bin/red-black.scm for the implementation of balanced binary
trees and ./bin/hashing.scm for some useful functions for implementing
hash tables.  To implement hash tables you may want to make use of the
following vector operations:

(make-vector size fill)
(vector-ref vector k)
(vector-set! vector k obj)

> (define vec (make-vector 16 0))
> (vector-ref vec 8)
0
> (vector-set! vec 8 1)
> (vector-ref vec 8)
1
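
For example, here is a minimal sketch of a chained (bucket-based)
hash table built on these operations.  The names 'make-table',
'table-insert!' and 'table-lookup' are hypothetical, and the sketch
assumes a 'hash' function that maps a string key to a bucket index
between 0 and table-size - 1; one possible 'hash' is sketched below,
after 'char->bits'.

;; A minimal sketch, not a required design: each bucket is an
;; association list of (key . value) pairs, and 'hash' is assumed to
;; map a string key to a valid bucket index.
(define table-size 64)

(define (make-table) (make-vector table-size '()))

(define (table-insert! table key value)
  (let* ((k (hash key))
         (bucket (vector-ref table k))
         (entry (assoc key bucket)))
    (if entry
        (set-cdr! entry value)            ; key present: update value
        (vector-set! table k              ; key absent: add a new pair
                     (cons (cons key value) bucket)))))

(define (table-lookup table key)
  (let ((entry (assoc key (vector-ref table (hash key)))))
    (if entry (cdr entry) #f)))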

We suggest that you use strings for keys rather than symbols.  The
following functions and built-in string operations will come in handy.

(string->list string)
(string=? string1 string2)
(string<? string1 string2)
(string<=? string1 string2)
(string>=? string1 string2)

Here are some additional functions found in ./bin/hashing.scm

;; Convert a string to a list of bits. 
(define (string->bits str)
  (apply append! (map char->bits (string->list str))))

;; Convert a character to the eight-bit binary code 
;; representing its ASCII equivalent, an integer 0-255.
;; Note that A -> 65, Z -> 90, a -> 97 and z -> 122. 
(define (char->bits char)
  (let ((n (char->integer char)))
    (map (lambda (m)
           (if (< n m) 0
               (begin (set! n (- n m)) 1)))
         '(128 64 32 16 8 4 2 1))))
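
One way (an assumption, not a requirement) to build a hash function
from these helpers is to fold the bit list into an integer and reduce
it modulo the table size; this 'hash' matches the hash-table sketch
given earlier.

;; A minimal sketch of a hash function over string keys: interpret
;; the output of 'string->bits' as a binary number and reduce it
;; modulo 'table-size' to obtain a bucket index.
(define (bits->integer bits)
  (let loop ((bits bits) (n 0))
    (if (null? bits)
        n
        (loop (cdr bits) (+ (* 2 n) (car bits))))))

(define (hash key)
  (modulo (bits->integer (string->bits key)) table-size))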

5.2  Project #2: Dueling Chatterbots

/Users/tld/email/book/intro/exercises/final/chatter
/Users/tld/email/book/intro/exercises/final/chatter/bin
/Users/tld/email/book/intro/exercises/final/chatter/bin/eliza.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/simple.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/verbose.scm
/Users/tld/email/book/intro/exercises/final/chatter/private
/Users/tld/email/book/intro/exercises/final/chatter/private/bin
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/match.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/README
/Users/tld/email/book/intro/exercises/final/chatter/README

Eliza was designed to interact with a human and, more often than not,
any perceived intelligence on Eliza's part could be attributed to the
variety and subtlety of the human.  In this project, you are to modify
the Eliza code so that Eliza interacts with a second chatterbot named
Albert.  You can give Albert any sort of personality that you like and
alter Eliza to respond appropriately.  For example, assuming that
Eliza retains the persona of a Rogerian therapist as in the original
version, Albert might be a patient with delusions of grandeur,
incipient paranoia or some other interesting pathology.  Eliza and
Albert must be able to sustain a conversation for at least eight
exchanges (that's sixteen distinct sentences total) and must employ
randomization so as to produce different conversations starting from
the same initial sentence.  We will suggest several ways to extend the
pattern matcher and the basic conversational loop to make Eliza's and
Albert's responses more flexible and realistic; you will select two of
these extensions to implement as part of your project.


Additional Implementation Details and Support Code

See ~/tmp/exercises/04/12/06/exercises.txt for a set of exercises
designed to help with this project.

Eliza is animated by a list of rules bound to the global variable
'eliza-rules'.  To create Albert all you have to do is define another
set of rules called, say, 'albert-rules' to govern Albert's behavior.
Here's a simple example of one of Eliza's rules in ./bin/rules.scm.

   (((?* ?x) computer (?* ?y))
    (Are you afraid of computers?)
    (What do you think computers have to do with your problem?))

The 'car' of the rule is always a pattern and the 'cdr' of the rule
is a list of responses.  This rule is designed to trigger when anyone
mentions the word 'computer'.  There are two types of variables 
recognized by the pattern matcher: simple variables of the form ?var
where 'var' can be any symbol and segment variables of the form 
(?* ?var). A simple variable matches a single symbol whereas a 
segment variable matches zero or more symbols; the '*' in segment
variables is like the '*' in regular expressions.  More sophisticated
rules have variables in both the pattern and responses; this allows
you to incorporate pieces of the user input into your responses.

   (((?* ?x) worry about (?* ?y)) 
    (What is it about ?y that worries you?))

Eliza employs a simple 'switch-viewpoint' routine that changes
pronouns so that if the above rule fires on the input "I worry about
what I eat" it comes out as "What is it about what you eat that
worries you?" You can simulate rule application and response
substitution by invoking the functions 'pattern-match' and
'substitute' directly; 'pattern-match' takes a pattern, a list of
words corresponding to the user input and an optional set of variable
bindings.  If 'pattern-match' succeeds it returns a set of bindings:
the same set if the pattern introduced no new variables and an
extended set if it did; it returns the empty list if it fails.
'substitute' takes a response pattern and a set of
variable bindings and uses the variable bindings to make substitutions
in the response pattern.

> (load "eliza.scm")

> (define bindings (pattern-match '((?* ?x) worry about (?* ?y)) 
                                  '(I worry about what I eat)))
> bindings
((?y what I eat) (?x I))

> (set! bindings (switch-viewpoint bindings))
> bindings
((?y what you eat) (?x you))

> (substitute bindings '(What is it about ?y that worries you?))
(What is it about (what you eat) that worries you?)

The unwanted layer of parens around 'what you eat' is an artifact of
the way 'substitute' deals with segment variables and is eliminated
using the 'flatten' function.

> (flatten (substitute bindings '(What is it about ?y that worries you?)))
(What is it about what you eat that worries you?)

There are a few things to watch out for when writing rules.  Case does 
matter; since my version of Eliza accepted user input I had to deal
with 'I' versus 'i' and the various intricacies of the Scheme 'reader'. 
I strongly recommend that you create your rules all in lower case and 
don't use any additional punctuation.  If you must include punctuation, 
you should 'escape' the punctuation characters as in: (I'm tired of
hearing the word "nuclear" pronounced incorrectly) => (I\'m tired
of hearing the word \"nuclear\" pronounced incorrectly).

To get Albert and Eliza to talk with one another you have to create a
loop or employ recursive procedures in which the output generated by
one chatterbot is provided as input to the other.  For an elegant
recursive implementation think about writing 'albert' so it calls
'eliza' and 'eliza' so it calls 'albert'.  Look at the definitions for
'chatter' and 'apply-rules' in ./bin/eliza.scm.  Using 'apply-rules'
you can make Eliza talk with herself:

> (load "rules.scm")
> (apply-rules eliza-rules '(are you a computer))
(Do you worry about computers?)
> (apply-rules eliza-rules (apply-rules eliza-rules '(are you a computer)))
(You really should not worry about computers?)
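
For instance, here is a minimal sketch of the mutually recursive
approach; the names 'eliza-turn' and 'albert-turn' are hypothetical,
and the sketch assumes you have already defined 'albert-rules'.  The
argument 'n' bounds the total number of sentences exchanged.

;; A minimal sketch of two mutually recursive chatterbots: each turn
;; computes a reply with 'apply-rules', prints it, and hands it to
;; the other bot until the countdown reaches zero.
(define (eliza-turn sentence n)
  (if (> n 0)
      (let ((reply (apply-rules eliza-rules sentence)))
        (display "eliza:  ") (display reply) (newline)
        (albert-turn reply (- n 1)))))

(define (albert-turn sentence n)
  (if (> n 0)
      (let ((reply (apply-rules albert-rules sentence)))
        (display "albert: ") (display reply) (newline)
        (eliza-turn reply (- n 1)))))

> (albert-turn '(i am afraid of computers) 16)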

In order to make Albert and Eliza's conversation a little more
realistic, you might want to add a short delay between responses.  The
Scheme 'sleep' procedure makes it easy to generate such a delay;
'sleep' takes a single positive integer argument interpreted as the
number of seconds to have Scheme pause in its execution.

> (sleep 4)

A random delay is much more realistic.  The next invocation puts
Scheme to sleep for between 1 and 3 seconds with the interval chosen
randomly from the set {1, 2, 3}.

> (sleep (+ (random 3) 1))

I've augmented the pattern matcher so that it handles special-purpose
pattern matchers that you can write yourself to implement more
sophisticated forms of matching.  Here's an example of a special-purpose
matcher that matches numbers.  Every special-purpose matcher requires
at least three arguments; specifically, the first three arguments to a
special-purpose matcher must consist of a pattern, an input expression
that the pattern is to be matched against and a list of bindings.  If
the special-purpose matcher fails in matching the input, then it
returns the empty list signaling failure.  If it succeeds, then
typically it calls the top-level pattern matcher 'pattern-match' with
the 'cdr' of the pattern, the 'cdr' of the input and the bindings
possibly augmented with new bindings.  In this typical case, the
special-purpose matcher 'consumes' the 'car' of both the pattern and
the input.

> (define (number-match pattern input bindings)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern) 
                       (cdr input) bindings)))

> (pattern-match '((special number-match)) '(3.14))
((t . t))

> (pattern-match '(Fred won (special number-match) games) '(Fred won 5 games))
((t . t))

Note that the following pattern and input do not match; the 'special'
form must appear as an element of a list pattern, and the input must
itself be a list.

> (pattern-match '(special number-match) 3.14)
()

We can also define special-purpose matchers that take additional
arguments.  These arguments will be tacked onto the end of the
'special' pattern after the symbol corresponding to the
special-purpose function.  Here we redefine 'number-match' to take an
additional argument corresponding to a variable which is bound to the
input if it corresponds to a number.  Note that while the symbol
corresponding to the special function is evaluated, none of the
additional arguments are.

> (define (number-match pattern input bindings var)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern) 
                       (cdr input) 
                       (extend-bindings var (car input) bindings))))

> (pattern-match '(Fred earned (special number-match ?x) dollars)
                 '(Fred earned 1000000 dollars))
((?x . 1000000))


There's a bug in the pattern matcher.  It can't handle a segment
variable right before a 'special' match function.  A pattern like 

  (?x (special ... ) ... ) 

will work just fine but a pattern like 

  ((?* ?x) (special ... ) ... ) 

will not work correctly.

Here's an extended example to illustrate the problem:

(define (test-1)
  (let ((eliza-rules
         '((((special syn-match jocks ?x) are 
             (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x))
            (No match!)))))
    (apply-rules eliza-rules '(swimmers are smart))))

(define (test-2)
  (let ((eliza-rules
         '((((?* ?v) (special syn-match jocks ?x) ?z 
             (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x))
            (No match!)))))
    (apply-rules eliza-rules '(i think women are inept))))


> (test-1)
(What do you know about swimmers idiot?)

> (test-2)
(No match!)


Here are some ways that you might extend the code in 'eliza.scm' for 
your project (remember that you have to choose two of these):


a. Add the ability to match words with similar meaning (synonyms). 

   (define synonyms '((tall (high lofty))
                      (light (luminous bright))))

   > (pattern-match '(The fence was (special similar-match tall ?x))
                    '(The fence was high))
   ((?x . high))

   For five extra credit points rewrite the relevant parts of the
   pattern matcher so that it always matches similar words without
   having to provide a special-purpose match function. For the extra
   credit points you'll have to create a general data structure for
   keeping track of synonyms; this data structure will operate like
   a thesaurus allowing you to easily insert new words and extract 
   synonyms.  Each word in your thesaurus should point to a set of 
   synonyms and if two words are synonyms of one another then they 
   should point to the same set of synonyms. 
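
   Here is a minimal sketch of such a thesaurus, offered only as one
   possible starting point; the names 'thesaurus', 'synonym-set' and
   'add-synonym!' are hypothetical.  Each word maps to a mutable list
   representing its synonym class; words declared synonymous share
   one list, so extending the class updates every member at once.

   (define thesaurus '())

   ;; Return the shared synonym class containing 'word', or #f.
   (define (synonym-set word)
     (let ((entry (assq word thesaurus)))
       (and entry (cdr entry))))

   ;; Declare 'word' a synonym of 'of': extend of's class (creating
   ;; it if necessary) and point 'word' at that same shared list.
   (define (add-synonym! word of)
     (let ((set (synonym-set of)))
       (if set
           (begin (set-cdr! set (cons word (cdr set)))
                  (set! thesaurus (cons (cons word set) thesaurus)))
           (let ((new-set (list of word)))
             (set! thesaurus
                   (cons (cons word new-set)
                         (cons (cons of new-set) thesaurus)))))))

   > (add-synonym! 'high 'tall)
   > (add-synonym! 'lofty 'tall)
   > (synonym-set 'high)
   (tall lofty high)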


b. Add the ability for Albert and Eliza to keep state, i.e., remember
   things from one exchange to another.  To implement this ability 
   you might add 'mood' or 'memory' variables or define an association 
   list of state variables for each agent.

   > (define eliza-state '((happy-sad 2) 
                           (sleepy-alert -3) 
                           (anxious-calm 0)))

   > (define albert-state '((happy-sad -1) 
                            (sleepy-alert 4) 
                            (anxious-calm -3)))

   > (pattern-match 
        '((?* ?x) am sad (?* ?y) (special state-match happy-sad -1))
        '(I am sad that the days are so short this time of year))

   Note that 'state-match' should not consume any input since it
   doesn't actually match anything.  In most of the cases that I can
   think of, the state-match special pattern should appear at the end
   of a pattern and it should always succeed.  You could also have
   Eliza ask questions and look for particular responses; for example,
   following the question 'Are you male or female?' you might have a
   rule that looks for answers that match the pattern (I am (special
   gender-match ?x)) and subsequently favor responses that assume
   Eliza is conversing with a male or female as the case may be.
   Since this extension doesn't by itself have any consequence for the
   conversations between Albert and Eliza, if you choose this
   extension for one of your two, then you have to implement the next
   extension as well.
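
   As one purely illustrative reading of the example above, here is a
   sketch of 'state-match'; it assumes the extra 'special' arguments
   arrive as trailing parameters (as with 'number-match') and that
   the agent's state lives in the 'eliza-state' association list.

   ;; A minimal sketch: bump the named state variable by 'delta' and
   ;; continue matching the rest of the pattern against the unchanged
   ;; input, so the special consumes nothing and always succeeds.
   (define (state-match pattern input bindings var delta)
     (let ((entry (assq var eliza-state)))
       (if entry
           (set-car! (cdr entry) (+ (cadr entry) delta)))
       (pattern-match (cdr pattern) input bindings)))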


c. Add the ability for Albert and Eliza to respond depending on their
   current state.  There are several ways that you might go about
   doing this.  One strategy is to define a special-purpose pattern
   matcher that augments the current bindings conditionally based upon
   the state of the agent.  Another strategy is to require the
   procedure 'substitute' - which is used to instantiate a response
   pattern given a set of bindings - to take state into account when
   selecting a response.  Obviously, this extension would go well with
   the previous one, but you don't have to do (b) in order to do (c).
   You can demonstrate how your extension works by setting the state
   prior to having Albert and Eliza talk to one another and then
   altering the state to show how their conversation changes as a
   result of their altered state.
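
   As a sketch of the second strategy (the names here are
   hypothetical and the state representation is the association-list
   one from the previous extension), response selection might consult
   the agent's mood:

   ;; A minimal sketch of state-sensitive response selection: pick
   ;; randomly from one of two response sets depending on the sign of
   ;; the agent's happy-sad variable.
   (define (random-elt lst)
     (list-ref lst (random (length lst))))

   (define (choose-response state happy-responses sad-responses)
     (let ((mood (cadr (assq 'happy-sad state))))
       (random-elt (if (>= mood 0) happy-responses sad-responses))))

   > (choose-response eliza-state
                      '((That is wonderful news) (Tell me more))
                      '((I suppose that figures) (How tiresome)))
   (Tell me more)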


d. Alter your program to take time into account.  Albert and Eliza
   could change their behavior as the length of the exchange
   continues.  They might exhibit signs of impatience or boredom.  You
   can combine this with other special-purpose matching functions to
   have Eliza ask Albert the time or prompt Eliza at a particular
   time.  Even a gratuitous use of a date or oblique reference to the
   hour ('hurry up I have another appointment at 3:15 PM') makes the
   chatterbot seem more grounded in the moment and hence more
   realistic.  There are a number of ways to keep track of time in
   Scheme.  The following invocation returns an instance of the 'date'
   structure which has fields for second, minute, hour, day, month,
   year, week-day, etc.

   > (seconds->date (current-seconds)) 
   #<struct:date>

   Real-time dates in Scheme are as accurate as the corresponding dates
   supported by the operating system. 

   % date
   Sat Nov 20 15:35:39 EST 2004
   % scheme
   > (date-hour (seconds->date (current-seconds)))
   15
   > (date-minute (seconds->date (current-seconds)))
   35


e. Introduce 'typed' variables of the form: (?x number?) that consist
   of a variable and a procedure of one argument that tests for  
   type membership - basically anything you can write a predicate for.
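
   One way to prototype this without touching the matcher itself (a
   sketch only) is a 'type-match' special; since the extra arguments
   arrive unevaluated, the predicate symbol is resolved with 'eval':

   ;; A minimal sketch of typed matching via the 'special' machinery:
   ;; (special type-match ?x number?) binds ?x to the next input word
   ;; when the predicate holds for it.
   (define (type-match pattern input bindings var pred)
     (if (and (pair? input)
              ((eval pred (interaction-environment)) (car input)))
         (pattern-match (cdr pattern) (cdr input)
                        (extend-bindings var (car input) bindings))
         fail))

   > (pattern-match '(Fred is (special type-match ?x number?) years old)
                    '(Fred is 9 years old))
   ((?x . 9))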


f. Implement 'response' specials that will allow you to run programs  
   in the midst of printing out the selected response in a rule. 


g. Make use of part-of-speech information in matching and generating 
   responses; see ~/tmp/exercises/04/12/06/exercises.txt, Exercise 10
   for tips and some useful software/libraries.


h. Invent a concise notation for matching specials and typed variables,
   e.g., ?x:number?, ?x:synonym? word, ?x:special-match-function args, 
   ?x:(lambda (x) (and ... ...)) and then augment the matcher to handle
   your notational conventions. 

5.3  Project #3: Desktop Searching

/Users/tld/email/book/intro/exercises/final/desktop
/Users/tld/email/book/intro/exercises/final/desktop/bin
/Users/tld/email/book/intro/exercises/final/desktop/bin/phreqs
/Users/tld/email/book/intro/exercises/final/desktop/bin/porter
/Users/tld/email/book/intro/exercises/final/desktop/bin/README
/Users/tld/email/book/intro/exercises/final/desktop/bin/smrize
/Users/tld/email/book/intro/exercises/final/desktop/bin/txtify
/Users/tld/email/book/intro/exercises/final/desktop/private
/Users/tld/email/book/intro/exercises/final/desktop/private/algebra.scm
/Users/tld/email/book/intro/exercises/final/desktop/README

For this project, you will build a basic version of Google Desktop.
Your program will process documents by creating an inverted index for
the terms appearing in each document and summarizing each document as
a term-frequency vector; it will then retrieve documents using
keyword search, ranking the retrieved documents with a measure such
as the cosine of the angle between term-frequency vectors.  You will
probably want to preprocess the documents using a shell script which
we will provide.  This is arguably the most complicated of the three
projects and we will provide extra assistance for students interested
in taking it on.  Any student completing this project earns five
extra points automatically.

Here's a partial list of processes you'll have to implement:

  a. Process a document by inserting it into an inverted index.
  b. Summarize each document as a term-frequency vector.
  c. Return the documents that include all of the keys in a keyword query.
  d. Rank the documents using an appropriate measure of similarity.

Support files: You'll find some useful C-shell and Perl scripts 
for processing documents in ./bin - the text file ./bin/README 
provides a description of these scripts. You'll probably want
to modify these scripts to suit your project design. 

How does a search engine represent and store documents to support
keyword queries?  Here are the basic procedures involved in
representing/summarizing documents to support keyword queries:

document -> bag of words

For the most part, word order is ignored. Search engines like Google 
typically also store the location of each word in a document to support
proximity queries, e.g., search for documents including 'search' and 
'engine' such that these two terms are within one word of each other. 
Google is also starting to store short phrases, but storing phrases 
has to be managed carefully to avoid an explosion in terms.

bag of words -> bag of terms 

Typically the words in a document are preprocessed by removing 'stop
words' such as 'a', 'the', 'it', 'and', 'that', etc., words that add
little meaning given that the document is represented as a bag of
words with little additional semantic analysis.  In addition to
removing stop words, usually the remaining words are 'stemmed' to map
morphological variants of a given word to the same root, e.g., 'run',
'runs' and 'running' all map to 'run'.  The set of all terms in the
corpus is called the 'dictionary'.  In most cases, stemming reduces
the size of the dictionary and the length of term vectors, but can in
some cases conflate terms that we would like to remain separate, e.g.,
'universe', 'university' and 'universal' all map to root 'univers'.
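
As a small illustration, stop-word removal is a few lines of Scheme;
the stop-word list and the name 'remove-stop-words' here are
placeholders, and real stemming would use the Porter stemmer in
./bin/porter rather than Scheme code.

;; A minimal sketch of stop-word removal over a bag of words
;; represented as a list of symbols.
(define stop-words '(a an the it and that of to in is))

(define (remove-stop-words words)
  (cond ((null? words) '())
        ((memq (car words) stop-words)
         (remove-stop-words (cdr words)))
        (else (cons (car words)
                    (remove-stop-words (cdr words))))))

> (remove-stop-words '(the engine is searching the index))
(engine searching index)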

bag of terms -> inverted index 

In order to retrieve documents that contain specific terms we build 
an inverted index which points from each term to the documents that 
contain the associated word. Usually the inverted index also records
the position of each term/word in the document to support proximity 
searches. 
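
Here is a minimal sketch of an inverted index as an association list
from terms to document identifiers; the names are hypothetical,
positions are omitted, and a hash table like the one in Project #1
would scale far better than 'assq' over a long list.

;; A minimal sketch: map each term to the list of documents that
;; contain it.  Recording term positions as well would support
;; proximity searches.
(define inverted-index '())

(define (index-term! term doc)
  (let ((entry (assq term inverted-index)))
    (if entry
        (if (not (member doc (cdr entry)))
            (set-cdr! entry (cons doc (cdr entry))))
        (set! inverted-index
              (cons (list term doc) inverted-index)))))

(define (lookup-term term)
  (let ((entry (assq term inverted-index)))
    (if entry (cdr entry) '())))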

bag of terms -> term-frequency vector

The bag of terms is then processed to determine the number of times
each term appears in the document - this number is called the
frequency of the term - and the resulting numbers are used to
construct a term frequency vector which maps each term to its
frequency in the document.  You can think of this as a vector of
numbers where the 'i'th number represents the frequency of the 'i'th
term in the dictionary, but in fact term frequency vectors are never
actually implemented in this fashion.  A more efficient way of
implementing this mapping would be as an association list or hash
table.
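
For example, here is a sketch of the association-list version; the
name 'term-frequencies' is hypothetical.

;; A minimal sketch: build a list of (term . frequency) pairs from a
;; bag of terms.
(define (term-frequencies terms)
  (let loop ((terms terms) (freqs '()))
    (if (null? terms)
        freqs
        (let ((entry (assq (car terms) freqs)))
          (if entry
              (begin (set-cdr! entry (+ 1 (cdr entry)))
                     (loop (cdr terms) freqs))
              (loop (cdr terms)
                    (cons (cons (car terms) 1) freqs)))))))

> (term-frequencies '(run run index run))
((index . 1) (run . 3))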

Now that we have a representation for each document and an index
mapping terms/words to documents, keyword queries are handled as
follows:

keywords -> keyterms

Eliminate stop words and stem the remaining keywords.  

keyterms -> documents

To retrieve a set of candidate documents we use the inverted index to
map the keyterms to documents containing them.  The search engine
compiles a list of all documents that contain all (or some) of the
keyterms in preparation for ranking.
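
Continuing the inverted-index sketch above, retrieval for the
'contains all keyterms' case might look like this (again, the names
are hypothetical):

;; A minimal sketch: intersect the document lists returned by
;; 'lookup-term' for each keyterm.
(define (intersect xs ys)
  (cond ((null? xs) '())
        ((member (car xs) ys)
         (cons (car xs) (intersect (cdr xs) ys)))
        (else (intersect (cdr xs) ys))))

(define (candidate-documents keyterms)
  (if (null? keyterms)
      '()
      (let loop ((docs (lookup-term (car keyterms)))
                 (terms (cdr keyterms)))
        (if (null? terms)
            docs
            (loop (intersect docs (lookup-term (car terms)))
                  (cdr terms))))))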

keyterms -> keyterm-frequency vector

To rank the documents, we can construct a term-frequency vector from
the search keywords which we'll call the keyterm-frequency vector.  At
this point we can add additional terms corresponding to synonyms of
the keywords.  We can also specialize the query by including hyponyms
(x 'is a kind of' y) or generalize it by using hypernyms (y 'is a kind
of' x).

documents + keyterm-frequency vector -> document ranking

The degree of similarity between the keyterm-frequency vector and each
of the document term-frequency vectors is then calculated and the
documents are sorted using the calculated similarities and presented
to the user in the resulting order.
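
To make the similarity computation concrete, here is a sketch of the
cosine measure over association-list term-frequency vectors like
those produced by the 'term-frequencies' sketch above; the names are
hypothetical.

;; A minimal sketch of cosine similarity: the dot product of two
;; frequency vectors divided by the product of their magnitudes.
(define (dot-product freqs1 freqs2)
  (let loop ((fs freqs1) (sum 0))
    (if (null? fs)
        sum
        (let ((entry (assq (caar fs) freqs2)))
          (loop (cdr fs)
                (if entry
                    (+ sum (* (cdar fs) (cdr entry)))
                    sum))))))

(define (magnitude freqs)
  (sqrt (dot-product freqs freqs)))

(define (cosine-similarity freqs1 freqs2)
  (let ((m (* (magnitude freqs1) (magnitude freqs2))))
    (if (zero? m) 0 (/ (dot-product freqs1 freqs2) m))))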