/Users/tld/email/book/intro/exercises/final/analysis
/Users/tld/email/book/intro/exercises/final/analysis/bin
/Users/tld/email/book/intro/exercises/final/analysis/bin/binary.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/hashing.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/instrument.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/load-once.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/red-black.scm
/Users/tld/email/book/intro/exercises/final/analysis/private
/Users/tld/email/book/intro/exercises/final/analysis/private/bin
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/benchmark.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/hash-table.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.csh
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.txt
/Users/tld/email/book/intro/exercises/final/analysis/private/README
/Users/tld/email/book/intro/exercises/final/analysis/README
In this project you are to evaluate three algorithms for performing
dictionary insertions and lookups. The three algorithms consist of
the linear-time (worst-case) search algorithm modeled after 'assoc'
using association lists, the logarithmic-time (worst-case) algorithm
using balanced binary trees, and the constant-time (expected-case)
algorithm using hash tables. You must implement the linear-time and
constant-time algorithms in Scheme, but we will provide you with a
Scheme implementation of balanced binary trees. You must implement
your hash tables from scratch, i.e., you can't use the built-in
functions 'make-hash-table' or 'hash-table-get'. You will
have to do some research to learn about the theoretical analyses of
these algorithms and summarize the main results as part of your
documentation. You will also propose, implement and run a set of
experiments based on benchmarks of your choosing and compare the
empirical results with the theoretical predictions. This project is
actually significantly easier than it sounds. Here are the pieces:
a. Implement 'insert' and 'lookup' functions using hash tables.
b. Read and summarize handouts on balanced trees and hashing.
c. Instrument all three algorithms for running the experiments.
d. Identify a set of keys to be used in benchmark experiments.
e. Run your benchmark experiments and summarize the results.
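To fix ideas, here is a minimal sketch of the linear-time algorithm
using association lists, assuming string keys; the names 'lookup' and
'insert' are our own, and your versions will also need the
instrumentation described in (c).
;; Linear-time dictionary: a list of (key . value) pairs. 'lookup'
;; scans the list front to back; 'insert' conses a new pair onto the
;; front, shadowing any earlier binding for the same key.
(define (lookup key alist)
  (cond ((null? alist) #f)
        ((string=? key (caar alist)) (cdar alist))
        (else (lookup key (cdr alist)))))

(define (insert key value alist)
  (cons (cons key value) alist))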
See ./bin/red-black.scm for the implementation of balanced binary
trees and ./bin/hashing.scm for some useful functions for implementing
hash tables. To implement hash tables you may want to make use of the
following vector operations:
(make-vector size fill)
(vector-ref vector k)
(vector-set! vector k obj)
> (define vec (make-vector 16 0))
> (vector-ref vec 8)
0
> (vector-set! vec 8 1)
> (vector-ref vec 8)
1
We suggest that you use strings for keys rather than symbols. The
following functions and built-in string operations will come in handy.
(string->list string)
(string=? string1 string2)
(string<? string1 string2)
(string<=? string1 string2)
(string>=? string1 string2)
Here are some additional functions found in ./bin/hashing.scm
;; Convert a string to a list of bits.
(define (string->bits str)
  (apply append! (map char->bits (string->list str))))

;; Convert a character to the eight-bit binary code
;; representing its ASCII equivalent integer 0-255.
;; Note that A -> 65, Z -> 90, a -> 97 and z -> 122.
(define (char->bits char)
  (let ((n (char->integer char)))
    (map (lambda (m)
           (if (< n m)
               0
               (begin (set! n (- n m)) 1)))
         '(128 64 32 16 8 4 2 1))))
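As a hint of how these pieces fit together, here is a minimal sketch
of mapping a string key to a bucket index; the helper 'string->hash'
and the choice of table size are our own, not part of the support
code.
;; Fold the key's bits back into an integer, then reduce it modulo
;; the number of buckets; your project should design its own hash
;; function and collision strategy.
(define (string->hash str)
  (let loop ((bits (string->bits str)) (acc 0))
    (if (null? bits)
        acc
        (loop (cdr bits) (+ (* 2 acc) (car bits))))))

(define (bucket-index key size)
  (modulo (string->hash key) size))

> (bucket-index "apple" 16)
5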
/Users/tld/email/book/intro/exercises/final/chatter
/Users/tld/email/book/intro/exercises/final/chatter/bin
/Users/tld/email/book/intro/exercises/final/chatter/bin/eliza.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/simple.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/verbose.scm
/Users/tld/email/book/intro/exercises/final/chatter/private
/Users/tld/email/book/intro/exercises/final/chatter/private/bin
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/match.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/README
/Users/tld/email/book/intro/exercises/final/chatter/README
Eliza was designed to interact with a human, and more often than not
any perceived intelligence on Eliza's part could be attributed to the
variety and subtlety of the human's side of the conversation. In this
project, you are to modify
the Eliza code so that Eliza interacts with a second chatterbot named
Albert. You can give Albert any sort of personality that you like and
alter Eliza to respond appropriately. For example, assuming that
Eliza retains the persona of a Rogerian therapist as in the original
version, Albert might be a patient with delusions of grandeur,
incipient paranoia or some other interesting pathology. Eliza and
Albert must be able to sustain a conversation for at least eight
exchanges (that's sixteen distinct sentences total) and must employ
randomization so as to produce different conversations starting from
the same initial sentence. We will suggest several ways to extend the
pattern matcher and the basic conversational loop to make Eliza's and
Albert's responses more flexible and realistic; you will select two of
these extensions to implement as part of your project.
Additional Implementation Details and Support Code
See ~/tmp/exercises/04/12/06/exercises.txt for a set of exercises
designed to help with this project.
Eliza is animated by a list of rules bound to the global variable
'eliza-rules'. To create Albert all you have to do is define another
set of rules called, say, 'albert-rules' to govern Albert's behavior.
Here's a simple example of one of Eliza's rules in ./bin/rules.scm.
(((?* ?x) computer (?* ?y))
 (Are you afraid of computers?)
 (What do you think computers have to do with your problem?))
The 'car' of the rule is always a pattern and the 'cdr' of the rule
is a list of responses. This rule is designed to trigger when anyone
mentions the word 'computer'. There are two types of variables
recognized by the pattern matcher: simple variables of the form ?var
where 'var' can be any symbol and segment variables of the form
(?* ?var). A simple variable matches a single symbol whereas a
segment variable matches zero or more symbols; the '*' in segment
variables is like the '*' in regular expressions. More sophisticated
rules have variables in both the pattern and responses; this allows
you to incorporate pieces of the user input into your responses.
(((?* ?x) worry about (?* ?y))
 (What is it about ?y that worries you?))
Eliza employs a simple 'switch-viewpoint' routine that changes
pronouns so that if the above rule fires on the input "I worry about
what I eat" it comes out as "What is it about what you eat that
worries you?" You can simulate rule application and response
substitution by invoking the functions 'pattern-match' and
'substitute' directly; 'pattern-match' takes a pattern, a list of
words corresponding to user input and an optional set of variable
bindings. If 'pattern-match' succeeds, it returns a set of bindings:
the same set if there were no new variables in the pattern, or an
extended set if there were new variables; it returns the empty list
if it fails. 'substitute' takes a response pattern and a set of
variable bindings and uses the variable bindings to make substitutions
in the response pattern.
> (load "eliza.scm")
> (define bindings (pattern-match '((?* ?x) worry about (?* ?y))
                                  '(I worry about what I eat)))
> bindings
((?y what I eat) (?x I))
> (set! bindings (switch-viewpoint bindings))
> bindings
((?y what you eat) (?x you))
> (substitute bindings '(What is it about ?y that worries you?))
(What is it about (what you eat) that worries you?)
The unwanted layer of parens around 'what you eat' is an artifact of
the way 'substitute' deals with segment variables and is eliminated
using the 'flatten' function.
> (flatten (substitute bindings '(What is it about ?y that worries you?)))
(What is it about what you eat that worries you?)
There are a few things to watch out for when writing rules. Case does
matter; since my version of Eliza accepted user input I had to deal
with 'I' versus 'i' and the various intricacies of the Scheme 'reader'.
I strongly recommend that you create your rules all in lower case and
don't use any additional punctuation. If you must include punctuation,
you should 'escape' the punctuation characters as in: (I'm tired of
hearing the word "nuclear" pronounced incorrectly) => (I\'m tired
of hearing the word \"nuclear\" pronounced incorrectly).
To get Albert and Eliza to talk with one another you have to create a
loop or employ recursive procedures in which the output generated by
one chatterbot is provided as input to the other. For an elegant
recursive implementation think about writing 'albert' so it calls
'eliza' and 'eliza' so it calls 'albert'. Look at the definitions for
'chatter' and 'apply-rules' in ./bin/eliza.scm. Using 'apply-rules'
you can make Eliza talk with herself:
> (load "rules.scm")
> (apply-rules eliza-rules '(are you a computer))
(Do you worry about computers?)
> (apply-rules eliza-rules (apply-rules eliza-rules '(are you a computer)))
(You really should not worry about computers?)
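Here is a minimal sketch of the mutually recursive conversation loop,
assuming a second rule set 'albert-rules' has been defined; the
procedure names and the countdown used to bound the exchange are our
own.
;; Alternate turns between the two chatterbots, printing each
;; response and stopping after n turns.
(define (eliza-turn input n)
  (when (> n 0)
    (let ((response (flatten (apply-rules eliza-rules input))))
      (display response) (newline)
      (albert-turn response (- n 1)))))

(define (albert-turn input n)
  (when (> n 0)
    (let ((response (flatten (apply-rules albert-rules input))))
      (display response) (newline)
      (eliza-turn response (- n 1)))))

> (eliza-turn '(i am worried about computers) 16)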
In order to make Albert and Eliza's conversation a little more
realistic, you might want to add a short delay between responses. The
Scheme 'sleep' procedure makes it easy to generate such a delay;
'sleep' takes a single positive integer argument interpreted as the
number of seconds Scheme should pause in its execution.
> (sleep 4)
A random delay is much more realistic. The next invocation puts
Scheme to sleep for between 1 and 3 seconds with the interval chosen
randomly from the set {1, 2, 3}.
> (sleep (+ (random 3) 1))
I've augmented the pattern matcher so that it handles special-purpose
pattern matchers that you can write yourself to implement more
sophisticated forms of matching. Here's an example of a special-purpose
matcher that matches numbers. Every special-purpose matcher requires
at least three arguments; specifically, the first three arguments to a
special-purpose matcher must consist of a pattern, an input expression
that the pattern is to be matched against and a list of bindings. If
the special-purpose matcher fails in matching the input, then it
returns the empty list signaling failure. If it succeeds, then
typically it calls the top-level pattern matcher 'pattern-match' with
the 'cdr' of the pattern, the 'cdr' of the input and the bindings
possibly augmented with new bindings. In this typical case, the
special-purpose matcher 'consumes' the 'car' of both the pattern and
the input.
> (define (number-match pattern input bindings)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern)
                       (cdr input)
                       bindings)))
> (pattern-match '((special number-match)) '(3.14))
((t . t))
> (pattern-match '(Fred won (special number-match) games) '(Fred won 5 games))
((t . t))
Note that the following pattern and input do not match.
> (pattern-match '(special number-match) 3.14)
()
We can also define special-purpose matchers that take additional
arguments. These arguments will be tacked onto the end of the
'special' pattern after the symbol corresponding to the
special-purpose function. Here we redefine 'number-match' to take an
additional argument: a variable that is bound to the matched number.
Note that while the symbol
corresponding to the special function is evaluated none of the
additional arguments are evaluated.
> (define (number-match pattern input bindings var)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern)
                       (cdr input)
                       (extend-bindings var (car input) bindings))))
> (pattern-match '(Fred earned (special number-match ?x) dollars)
                 '(Fred earned 1000000 dollars))
((?x . 1000000))
There's a bug in the pattern matcher. It can't handle a segment
variable right before a 'special' match function. A pattern like
(?x (special ... ) ... )
will work just fine but a pattern like
((?* ?x) (special ... ) ... )
will not work correctly.
Here's an extended example to illustrate the problem:
(define (test-1)
  (let ((eliza-rules
         '((((special syn-match jocks ?x) are
             (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x))
            (No match!)))))
    (apply-rules eliza-rules '(swimmers are smart))))
(define (test-2)
  (let ((eliza-rules
         '((((?* ?v) (special syn-match jocks ?x) ?z
             (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x))
            (No match!)))))
    (apply-rules eliza-rules '(i think women are inept))))
> (test-1)
(What do you know about swimmers idiot?)
> (test-2)
(No match!)
Here are some ways that you might extend the code in 'eliza.scm' for
your project (remember that you have to choose two of these):
a. Add the ability to match words with similar meaning (synonyms).
(define synonyms '((tall (high lofty))
                   (light (luminous bright))))
> (pattern-match '(The fence was (special similar-match tall ?x))
                 '(The fence was high))
((?x . high))
For five extra credit points rewrite the relevant parts of the
pattern matcher so that it always matches similar words without
having to provide a special-purpose match function. For the extra
credit points you'll have to create a general data structure for
keeping track of synonyms; this data structure will operate like
a thesaurus allowing you to easily insert new words and extract
synonyms. Each word in your thesaurus should point to a set of
synonyms and if two words are synonyms of one another then they
should point to the same set of synonyms.
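Here is a minimal sketch of what 'similar-match' might look like
under the special-matcher protocol described earlier, assuming the
'synonyms' table above and the helpers 'pattern-match',
'extend-bindings' and 'fail' from eliza.scm.
;; Match the next input word if it is 'word' itself or one of its
;; listed synonyms, binding 'var' to the word actually seen.
(define (similar-match pattern input bindings word var)
  (let ((entry (assoc word synonyms)))
    (if (and (pair? input)
             (or (eq? (car input) word)
                 (and entry (memq (car input) (cadr entry)))))
        (pattern-match (cdr pattern)
                       (cdr input)
                       (extend-bindings var (car input) bindings))
        fail)))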
b. Add the ability for Albert and Eliza to keep state, i.e., remember
things from one exchange to another. To implement this ability
you might add 'mood' or 'memory' variables or define an association
list of state variables for each agent.
> (define eliza-state '((happy-sad 2)
                        (sleepy-alert -3)
                        (anxious-calm 0)))
> (define albert-state '((happy-sad -1)
                         (sleepy-alert 4)
                         (anxious-calm -3)))
> (pattern-match
   '((?* ?x) am sad (?* ?y) (special state-match happy-sad -1))
   '(I am sad that the days are so short this time of year))
Note that 'state-match' should not consume any input since it
doesn't actually match anything. In most of the cases that I can
think of, the state-match special pattern should appear at the end
of a pattern and it should always succeed. You could also have
Eliza ask questions and look for particular responses; for example,
following the question 'Are you male or female?' you might have a
rule that looks for answers that match the pattern (I am (special
gender-match ?x)) and subsequently favor responses that assume
Eliza is conversing with a male or female as the case may be.
Since this extension doesn't by itself have any consequence for the
conversations between Albert and Eliza, if you choose this
extension for one of your two, then you have to implement the next
extension as well.
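Here is a minimal sketch of 'state-match' under the special-matcher
protocol, hardwired to 'eliza-state' for brevity; updating state by
mutation is just one design choice.
;; Adjust the named state variable by 'delta' as a side effect,
;; consuming no input; the rest of the pattern is matched against
;; the unchanged input.
(define (state-match pattern input bindings key delta)
  (let ((entry (assq key eliza-state)))
    (when entry
      (set-car! (cdr entry) (+ (cadr entry) delta)))
    (pattern-match (cdr pattern) input bindings)))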
c. Add the ability for Albert and Eliza to respond depending on their
current state. There are several ways that you might go about
doing this. One strategy is to define a special-purpose pattern
matcher that augments the current bindings conditionally based upon
the state of the agent. Another strategy is to require the
procedure 'substitute' - which is used to instantiate a response
pattern given a set of bindings - to take state into account when
selecting a response. Obviously, this extension would go well with
the previous one, but you don't have to do (b) in order to do (c).
You can demonstrate how your extension works by setting the state
prior to having Albert and Eliza talk to one another and then
altering the state to show how their conversation changes as a
result of their altered state.
d. Alter your program to take time into account. Albert and Eliza
could change their behavior as the exchange grows longer. They
might exhibit signs of impatience or boredom. You
can combine this with other special-purpose matching functions to
have Eliza ask Albert the time or prompt Eliza at a particular
time. Even a gratuitous use of a date or oblique reference to the
hour ('hurry up I have another appointment at 3:15 PM') makes the
chatterbot seem more grounded in the moment and hence more
realistic. There are a number of ways to keep track of time in
Scheme. The following invocation returns an instance of the 'date'
structure which has fields for second, minute, hour, day, month,
year, week-day, etc.
> (seconds->date (current-seconds))
#<struct:date>
Real-time dates in Scheme are as accurate as the corresponding dates
supported by the operating system.
% date
Sat Nov 20 15:35:39 EST 2004
% scheme
> (date-hour (seconds->date (current-seconds)))
15
> (date-minute (seconds->date (current-seconds)))
35
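You can package the time check as a special-purpose matcher as well;
here is a minimal sketch under the matcher protocol from above, with
the name 'hour-match' our own.
;; Succeed, consuming no input, only when the current hour falls in
;; the half-open range [lo, hi).
(define (hour-match pattern input bindings lo hi)
  (let ((hour (date-hour (seconds->date (current-seconds)))))
    (if (and (>= hour lo) (< hour hi))
        (pattern-match (cdr pattern) input bindings)
        fail)))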
e. Introduce 'typed' variables of the form: (?x number?) that consist
of a variable and a procedure of one argument that tests for
type membership - basically anything you can write a predicate for.
f. Implement 'response' specials that will allow you to run programs
in the midst of printing out the selected response in a rule.
g. Make use of part-of-speech information in matching and generating
responses; see ~/tmp/exercises/04/12/06/exercises.txt, Exercise 10
for tips and some useful software/libraries.
h. Invent a concise notation for matching specials and typed variables,
e.g., ?x:number?, ?x:synonym? word, ?x:special-match-function args,
?x:(lambda (x) (and ... ...)) and then augment the matcher to handle
your notational conventions.
/Users/tld/email/book/intro/exercises/final/desktop
/Users/tld/email/book/intro/exercises/final/desktop/bin
/Users/tld/email/book/intro/exercises/final/desktop/bin/phreqs
/Users/tld/email/book/intro/exercises/final/desktop/bin/porter
/Users/tld/email/book/intro/exercises/final/desktop/bin/README
/Users/tld/email/book/intro/exercises/final/desktop/bin/smrize
/Users/tld/email/book/intro/exercises/final/desktop/bin/txtify
/Users/tld/email/book/intro/exercises/final/desktop/private
/Users/tld/email/book/intro/exercises/final/desktop/private/algebra.scm
/Users/tld/email/book/intro/exercises/final/desktop/README
For this project, you will build a basic version of Google Desktop.
Your program will process documents by creating an inverted index of
the terms appearing in each document and summarizing each document as
a term-frequency vector; it will retrieve documents using keyword
search, ranking the retrieved documents by a measure such as the
cosine of the angle between term-frequency vectors. You will probably want
to preprocess the documents using a shell script which we will
provide. This is arguably the most complicated of the three projects
and we will provide extra assistance for students interested in taking
it on. Any student completing this project earns five extra points
automatically.
Here's a partial list of processes you'll have to implement:
a. Process a document by inserting it into an inverted index.
b. Summarize each document as a term-frequency vector.
c. Return each document containing all the keywords in a query.
d. Rank the documents using an appropriate measure of similarity.
Support files: You'll find some useful C-shell and Perl scripts
for processing documents in ./bin - the text file ./bin/README
provides a description of these scripts. You'll probably want
to modify these scripts to suit your project design.
How does a search engine represent and store documents to support
keyword queries? Here are the basic procedures involved in
representing/summarizing documents to support keyword queries:
document -> bag of words
For the most part, word order is ignored. Search engines like Google
typically also store the location of each word in a document to support
proximity queries, e.g., search for documents including 'search' and
'engine' such that these two terms are within one word of each other.
Google is also starting to store short phrases, but storing phrases
has to be managed carefully to avoid an explosion in terms.
bag of words -> bag of terms
Typically the words in a document are preprocessed by removing 'stop
words' such as 'a', 'the', 'it', 'and', 'that', etc., words that add
little meaning given that the document is represented as a bag of
words with little additional semantic analysis. In addition to
removing stop words, usually the remaining words are 'stemmed' to map
morphological variants of a given word to the same root, e.g., 'run',
'runs' and 'running' all map to 'run'. The set of all terms in the
corpus is called the 'dictionary'. In most cases, stemming reduces
the size of the dictionary and the length of term vectors, but can in
some cases conflate terms that we would like to remain separate, e.g.,
'universe', 'university' and 'universal' all map to root 'univers'.
bag of terms -> inverted index
In order to retrieve documents that contain specific terms we build
an inverted index which points from each term to the documents that
contain the associated word. Usually the inverted index also records
the position of each term/word in the document to support proximity
searches.
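Here is a minimal sketch of an inverted index kept as an association
list from terms to document ids; the names are our own, and a hash
table would be the natural replacement as the corpus grows.
;; Record that document 'doc-id' contains 'term'; look up the
;; documents recorded for a term.
(define inverted-index '())

(define (index-term! term doc-id)
  (let ((entry (assoc term inverted-index)))
    (if entry
        (unless (member doc-id (cdr entry))
          (set-cdr! entry (cons doc-id (cdr entry))))
        (set! inverted-index
              (cons (list term doc-id) inverted-index)))))

(define (lookup-term term)
  (let ((entry (assoc term inverted-index)))
    (if entry (cdr entry) '())))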
bag of terms -> term-frequency vector
The bag of terms is then processed to determine the number of times
each term appears in the document - this number is called the
frequency of the term - and the resulting numbers are used to
construct a term frequency vector which maps each term to its
frequency in the document. You can think of this as a vector of
numbers where the 'i'th number represents the frequency of the 'i'th
term in the dictionary, but in fact term frequency vectors are never
actually implemented in this fashion. A more efficient way of
implementing this mapping would be as an association list or hash
table.
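Here is a minimal sketch of that mapping as an association list,
assuming the bag of terms arrives as a list of strings; the name
'term-frequencies' is our own.
;; Count how many times each term occurs in the bag, returning an
;; alist of (term . count) pairs.
(define (term-frequencies terms)
  (let loop ((terms terms) (freqs '()))
    (if (null? terms)
        freqs
        (let ((entry (assoc (car terms) freqs)))
          (if entry
              (begin (set-cdr! entry (+ (cdr entry) 1))
                     (loop (cdr terms) freqs))
              (loop (cdr terms)
                    (cons (cons (car terms) 1) freqs)))))))

> (term-frequencies '("run" "run" "search"))
(("search" . 1) ("run" . 2))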
Now that we have a representation for each document and an index
mapping terms/words to documents, keyword queries are handled as
follows:
keywords -> keyterms
Eliminate stop words and stem the remaining keywords.
keyterms -> documents
To retrieve a set of candidate documents we use the inverted index to
map the keyterms to documents containing them. The search engine
compiles a list of all documents that contain all (or some) of the
keyterms in preparation for ranking.
keyterms -> keyterm-frequency vector
To rank the documents, we can construct a term-frequency vector from
the search keywords which we'll call the keyterm-frequency vector. At
this point we can add additional terms corresponding to synonyms of
the keywords. We can also specialize the query by including hyponyms
(x 'is a kind of' y) or generalize it by using hypernyms (y 'is a kind
of' x).
documents + keyterm-frequency vector -> document ranking
The degree of similarity between the keyterm-frequency vector and each
of the document term-frequency vectors is then calculated and the
documents are sorted using the calculated similarities and presented
to the user in the resulting order.
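To make the ranking step concrete, here is a minimal sketch of cosine
similarity over frequency alists of the form ((term . count) ...);
the procedure names are our own.
;; Dot product over the terms the two vectors share, treating
;; missing terms as zero.
(define (dot-product u v)
  (apply + (map (lambda (entry)
                  (let ((other (assoc (car entry) v)))
                    (if other (* (cdr entry) (cdr other)) 0)))
                u)))

;; Euclidean length of a frequency vector.
(define (norm u)
  (sqrt (apply + (map (lambda (entry)
                        (* (cdr entry) (cdr entry)))
                      u))))

;; Cosine of the angle between u and v; zero if either is empty.
(define (cosine-similarity u v)
  (let ((scale (* (norm u) (norm v))))
    (if (zero? scale) 0 (/ (dot-product u v) scale))))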