/Users/tld/email/book/intro/exercises/final/analysis
/Users/tld/email/book/intro/exercises/final/analysis/bin
/Users/tld/email/book/intro/exercises/final/analysis/bin/binary.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/hashing.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/instrument.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/load-once.scm
/Users/tld/email/book/intro/exercises/final/analysis/bin/red-black.scm
/Users/tld/email/book/intro/exercises/final/analysis/private
/Users/tld/email/book/intro/exercises/final/analysis/private/bin
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/benchmark.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/hash-table.scm
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.csh
/Users/tld/email/book/intro/exercises/final/analysis/private/bin/words.txt
/Users/tld/email/book/intro/exercises/final/analysis/private/README
/Users/tld/email/book/intro/exercises/final/analysis/README

In this project you are to evaluate three algorithms for performing dictionary insertions and lookups: the linear-time (worst-case) search algorithm modeled after 'assoc' using association lists, the logarithmic-time (worst-case) algorithm using balanced binary trees, and the constant-time (expected-case) algorithm using hash tables. You must implement the linear-time and constant-time algorithms in Scheme, but we will provide you with a Scheme implementation of balanced binary trees. You must implement your hash tables from scratch, i.e., you can't use the built-in functions 'make-hash-table' or 'hash-table-get'. You will have to do some research to learn about the theoretical analyses of these algorithms and summarize the main results as part of your documentation.
You will also propose, implement and run a set of experiments based on benchmarks of your choosing and compare the empirical results with the theoretical predictions. This project is actually significantly easier than it sounds. Here are the pieces:

a. Implement 'insert' and 'lookup' functions using hash tables.
b. Read and summarize the handouts on balanced trees and hashing.
c. Instrument all three algorithms for running the experiments.
d. Identify a set of keys to be used in benchmark experiments.
e. Run your benchmark experiments and summarize the results.

See ./bin/red-black.scm for the implementation of balanced binary trees and ./bin/hashing.scm for some useful functions for implementing hash tables. To implement hash tables you may want to make use of the following vector operations:

(make-vector size fill)
(vector-ref vector k)
(vector-set! vector k obj)

> (define vec (make-vector 16 0))
> (vector-ref vec 8)
0
> (vector-set! vec 8 1)
> (vector-ref vec 8)
1

We suggest that you use strings for keys rather than symbols. The following built-in string operations will come in handy:

(string->list string)
(string=? string1 string2)
(string<? string1 string2)
(string<=? string1 string2)
(string>=? string1 string2)

Here are some additional functions found in ./bin/hashing.scm:

;; Convert a string to a list of bits.
(define (string->bits str)
  (apply append! (map char->bits (string->list str))))

;; Convert a character to the eight-bit binary code
;; representing its ASCII equivalent integer 0-255.
;; Note that A -> 65, Z -> 90, a -> 97 and z -> 122.
(define (char->bits char)
  (let ((n (char->integer char)))
    (map (lambda (m)
           (if (< n m)
               0
               (begin (set! n (- n m)) 1)))
         '(128 64 32 16 8 4 2 1))))
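To make piece (a) concrete, here is one possible shape for the hash-table code. This is a sketch only, under our own assumptions: a fixed table size, buckets stored as association lists, and a deliberately naive hash function. The names 'make-table', 'hash', 'table-insert!' and 'table-lookup' are ours, not part of the support code, and a real solution should build its hash from 'string->bits' and handle updates of existing keys.

```
;; A minimal sketch, not the required implementation.  The
;; table is a vector of buckets; each bucket is an association
;; list of (key . value) pairs.
(define table-size 64)

(define (make-table) (make-vector table-size '()))

;; Hypothetical hash function: sum the character codes of the
;; key and reduce modulo the table size.  A real hash should
;; mix the bits more thoroughly (see ./bin/hashing.scm).
(define (hash key)
  (modulo (apply + (map char->integer (string->list key)))
          table-size))

;; Prepend the new pair onto the bucket the key hashes to.
(define (table-insert! table key value)
  (let ((k (hash key)))
    (vector-set! table k
                 (cons (cons key value) (vector-ref table k)))))

;; Scan the key's bucket linearly; return #f if absent.
(define (table-lookup table key)
  (let loop ((bucket (vector-ref table (hash key))))
    (cond ((null? bucket) #f)
          ((string=? (car (car bucket)) key) (cdr (car bucket)))
          (else (loop (cdr bucket))))))
```

Collisions are handled by chaining: two keys that hash to the same index simply share a bucket, and 'table-lookup' scans that bucket linearly, which is why the expected constant-time bound depends on keeping buckets short.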
/Users/tld/email/book/intro/exercises/final/chatter
/Users/tld/email/book/intro/exercises/final/chatter/bin
/Users/tld/email/book/intro/exercises/final/chatter/bin/eliza.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/simple.scm
/Users/tld/email/book/intro/exercises/final/chatter/bin/verbose.scm
/Users/tld/email/book/intro/exercises/final/chatter/private
/Users/tld/email/book/intro/exercises/final/chatter/private/bin
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/match.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/reader.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/bin/rules.scm
/Users/tld/email/book/intro/exercises/final/chatter/private/README
/Users/tld/email/book/intro/exercises/final/chatter/README

Eliza was designed to interact with a human, and more often than not any perceived intelligence on Eliza's part could be attributed to the variety and subtlety of the human. In this project, you are to modify the Eliza code so that Eliza interacts with a second chatterbot named Albert. You can give Albert any sort of personality that you like and alter Eliza to respond appropriately. For example, assuming that Eliza retains the persona of a Rogerian therapist as in the original version, Albert might be a patient with delusions of grandeur, incipient paranoia or some other interesting pathology. Eliza and Albert must be able to sustain a conversation for at least eight exchanges (that's sixteen distinct sentences total) and must employ randomization so as to produce different conversations starting from the same initial sentence. We will suggest several ways to extend the pattern matcher and the basic conversational loop to make Eliza's and Albert's responses more flexible and realistic; you will select two of these extensions to implement as part of your project.
Additional Implementation Details and Support Code

See ~/tmp/exercises/04/12/06/exercises.txt for a set of exercises designed to help with this project. Eliza is animated by a list of rules bound to the global variable 'eliza-rules'. To create Albert, all you have to do is define another set of rules called, say, 'albert-rules' to govern Albert's behavior. Here's a simple example of one of Eliza's rules in ./bin/rules.scm:

(((?* ?x) computer (?* ?y))
 (Are you afraid of computers?)
 (What do you think computers have to do with your problem?))

The 'car' of the rule is always a pattern and the 'cdr' of the rule is a list of responses. This rule is designed to trigger when anyone mentions the word 'computer'. There are two types of variables recognized by the pattern matcher: simple variables of the form ?var, where 'var' can be any symbol, and segment variables of the form (?* ?var). A simple variable matches a single symbol whereas a segment variable matches zero or more symbols; the '*' in segment variables is like the '*' in regular expressions. More sophisticated rules have variables in both the pattern and the responses; this allows you to incorporate pieces of the user input into your responses.

(((?* ?x) worry about (?* ?y))
 (What is it about ?y that worries you?))

Eliza employs a simple 'switch-viewpoint' routine that changes pronouns so that if the above rule fires on the input "I worry about what I eat" it comes out as "What is it about what you eat that worries you?" You can simulate rule application and response substitution by invoking the functions 'pattern-match' and 'substitute' directly; 'pattern-match' takes a pattern, a list of words corresponding to user input and an optional set of variable bindings. If 'pattern-match' succeeds, it returns a set of bindings: the same set if there were no new variables in the pattern, an extended set if there were; it returns the empty list if it fails.
'substitute' takes a response pattern and a set of variable bindings and uses the variable bindings to make substitutions in the response pattern.

> (load "eliza.scm")
> (define bindings
    (pattern-match '((?* ?x) worry about (?* ?y))
                   '(I worry about what I eat)))
> bindings
((?y what I eat) (?x I))
> (set! bindings (switch-viewpoint bindings))
> bindings
((?y what you eat) (?x you))
> (substitute bindings '(What is it about ?y that worries you?))
(What is it about (what you eat) that worries you?)

The unwanted layer of parens around 'what you eat' is an artifact of the way 'substitute' deals with segment variables and is eliminated using the 'flatten' function.

> (flatten (substitute bindings '(What is it about ?y that worries you?)))
(What is it about what you eat that worries you?)

There are a few things to watch out for when writing rules. Case does matter; since my version of Eliza accepted user input, I had to deal with 'I' versus 'i' and the various intricacies of the Scheme 'reader'. I strongly recommend that you create your rules all in lower case and don't use any additional punctuation. If you must include punctuation, you should 'escape' the punctuation characters as in:

(I'm tired of hearing the word "nuclear" pronounced incorrectly)
=> (I\'m tired of hearing the word \"nuclear\" pronounced incorrectly)

To get Albert and Eliza to talk with one another you have to create a loop or employ recursive procedures in which the output generated by one chatterbot is provided as input to the other. For an elegant recursive implementation, think about writing 'albert' so it calls 'eliza' and 'eliza' so it calls 'albert'. Look at the definitions for 'chatter' and 'apply-rules' in ./bin/eliza.scm. Using 'apply-rules' you can make Eliza talk with herself:

> (load "rules.scm")
> (apply-rules eliza-rules '(are you a computer))
(Do you worry about computers?)
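The mutually recursive implementation suggested above might be sketched as follows. This assumes 'apply-rules', 'eliza-rules' and a user-defined 'albert-rules' are already loaded; the turn procedures, their names and the turn counter are our own invention, not part of the support code.

```
;; A sketch of the conversational loop: each turn prints one
;; response, pauses briefly, then hands the response to the
;; other chatterbot with one fewer turn remaining.
(define (eliza-turn input n)
  (if (> n 0)
      (let ((response (apply-rules eliza-rules input)))
        (display response) (newline)
        (sleep (+ (random 3) 1))   ; random 1-3 second pause
        (albert-turn response (- n 1)))))

(define (albert-turn input n)
  (if (> n 0)
      (let ((response (apply-rules albert-rules input)))
        (display response) (newline)
        (sleep (+ (random 3) 1))
        (eliza-turn response (- n 1)))))

;; Start an eight-exchange (sixteen-sentence) conversation:
;; (albert-turn '(i am feeling anxious today) 16)
```

Counting down a turn parameter is one simple way to guarantee the conversation terminates; the time-based extension below suggests richer ways to wind a conversation down.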
> (apply-rules eliza-rules (apply-rules eliza-rules '(are you a computer)))
(You really should not worry about computers?)

In order to make Albert and Eliza's conversation a little more realistic, you might want to add a short delay between responses. The Scheme 'sleep' procedure makes it easy to generate such a delay; 'sleep' takes a single positive integer argument interpreted as the number of seconds to have Scheme pause in its execution.

> (sleep 4)

A random delay is much more realistic. The next invocation puts Scheme to sleep for between 1 and 3 seconds, with the interval chosen randomly from the set {1, 2, 3}.

> (sleep (+ (random 3) 1))

I've augmented the pattern matcher so that it handles special-purpose pattern matchers that you can write yourself to implement more sophisticated forms of matching. Here's an example of a special-purpose matcher that matches numbers. Every special-purpose matcher requires at least three arguments; specifically, the first three arguments to a special-purpose matcher must consist of a pattern, an input expression that the pattern is to be matched against, and a list of bindings. If the special-purpose matcher fails in matching the input, then it returns the empty list signaling failure. If it succeeds, then typically it calls the top-level pattern matcher 'pattern-match' with the 'cdr' of the pattern, the 'cdr' of the input and the bindings, possibly augmented with new bindings. In this typical case, the special-purpose matcher 'consumes' the 'car' of both the pattern and the input.

> (define (number-match pattern input bindings)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern) (cdr input) bindings)))
> (pattern-match '((special number-match)) '(3.14))
((t . t))
> (pattern-match '(Fred won (special number-match) games)
                 '(Fred won 5 games))
((t . t))

Note that the following pattern and input do not match.
> (pattern-match '(special number-match) 3.14)
()

We can also define special-purpose matchers that take additional arguments. These arguments will be tacked onto the end of the 'special' pattern after the symbol corresponding to the special-purpose function. Here we redefine 'number-match' to take an additional argument corresponding to a variable which is bound to the input if it corresponds to a number. Note that while the symbol corresponding to the special function is evaluated, none of the additional arguments are evaluated.

> (define (number-match pattern input bindings var)
    (if (not (and (pair? input) (number? (car input))))
        fail
        (pattern-match (cdr pattern) (cdr input)
                       (extend-bindings var (car input) bindings))))
> (pattern-match '(Fred earned (special number-match ?x) dollars)
                 '(Fred earned 1000000 dollars))
((?x . 1000000))

There's a bug in the pattern matcher: it can't handle a segment variable right before a 'special' match function. A pattern like (?x (special ... ) ... ) will work just fine, but a pattern like ((?* ?x) (special ... ) ... ) will not work correctly. Here's an extended example to illustrate the problem:

(define (test-1)
  (let ((eliza-rules
         '((((special syn-match jocks ?x) are (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x)) (No match!)))))
    (apply-rules eliza-rules '(swimmers are ))))

(define (test-2)
  (let ((eliza-rules
         '((((?* ?v) (special syn-match jocks ?x) ?z (special syn-match smart ?y))
            (What do you know about ?x idiot?))
           (((?* ?x)) (No match!)))))
    (apply-rules eliza-rules '(i think women are inept))))

> (test-1)
(What do you know about swimmers idiot?)
> (test-2)
(No match!)

Here are some ways that you might extend the code in 'eliza.scm' for your project (remember that you have to choose two of these):

a. Add the ability to match words with similar meaning (synonyms).
(define synonyms
  '((tall (high lofty))
    (light (luminous bright))))

> (pattern-match '(The fence was (special similar-match tall ?x))
                 '(The fence was high))
((?x . high))

For five extra credit points, rewrite the relevant parts of the pattern matcher so that it always matches similar words without having to provide a special-purpose match function. For the extra credit points you'll have to create a general data structure for keeping track of synonyms; this data structure will operate like a thesaurus, allowing you to easily insert new words and extract synonyms. Each word in your thesaurus should point to a set of synonyms, and if two words are synonyms of one another then they should point to the same set of synonyms.

b. Add the ability for Albert and Eliza to keep state, i.e., remember things from one exchange to another. To implement this ability you might add 'mood' or 'memory' variables or define an association list of state variables for each agent.

> (define eliza-state '((happy-sad 2) (sleepy-alert -3) (anxious-calm 0)))
> (define albert-state '((happy-sad -1) (sleepy-alert 4) (anxious-calm -3)))
> (pattern-match '((?* ?x) am sad (?* ?y) (special state-match happy-sad -1))
                 '(I am sad that the days are so short this time of year))

Note that 'state-match' should not consume any input since it doesn't actually match anything. In most of the cases that I can think of, the state-match special pattern should appear at the end of a pattern and it should always succeed. You could also have Eliza ask questions and look for particular responses; for example, following the question 'Are you male or female?' you might have a rule that looks for answers that match the pattern (I am (special gender-match ?x)) and subsequently favor responses that assume Eliza is conversing with a male or female, as the case may be.
Since this extension doesn't by itself have any consequence for the conversations between Albert and Eliza, if you choose this extension as one of your two, then you have to implement the next extension as well.

c. Add the ability for Albert and Eliza to respond depending on their current state. There are several ways that you might go about doing this. One strategy is to define a special-purpose pattern matcher that augments the current bindings conditionally based upon the state of the agent. Another strategy is to require the procedure 'substitute' - which is used to instantiate a response pattern given a set of bindings - to take state into account when selecting a response. Obviously, this extension would go well with the previous one, but you don't have to do (b) in order to do (c). You can demonstrate how your extension works by setting the state prior to having Albert and Eliza talk to one another and then altering the state to show how their conversation changes as a result of their altered state.

d. Alter your program to take time into account. Albert and Eliza could change their behavior as the length of the exchange continues. They might exhibit signs of impatience or boredom. You can combine this with other special-purpose matching functions to have Eliza ask Albert the time or prompt Eliza at a particular time. Even a gratuitous use of a date or an oblique reference to the hour ('hurry up I have another appointment at 3:15 PM') makes the chatterbot seem more grounded in the moment and hence more realistic. There are a number of ways to keep track of time in Scheme. The following invocation returns an instance of the 'date' structure, which has fields for second, minute, hour, day, month, year, week-day, etc.

> (seconds->date (current-seconds))
#<struct:date>

Real-time dates in Scheme are as accurate as the corresponding dates supported by the operating system.
% date
Sat Nov 20 15:35:39 EST 2004
% scheme
> (date-hour (seconds->date (current-seconds)))
15
> (date-minute (seconds->date (current-seconds)))
35

e. Introduce 'typed' variables of the form (?x number?) that consist of a variable and a procedure of one argument that tests for type membership - basically anything you can write a predicate for.

f. Implement 'response' specials that will allow you to run programs in the midst of printing out the selected response in a rule.

g. Make use of part-of-speech information in matching and generating responses; see ~/tmp/exercises/04/12/06/exercises.txt, Exercise 10 for tips and some useful software/libraries.

h. Invent a concise notation for matching specials and typed variables, e.g., ?x:number?, ?x:synonym? word, ?x:special-match-function args, ?x:(lambda (x) (and ... ...)), and then augment the matcher to handle your notational conventions.
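For extension (a) and its extra credit, the shared-synonym-set thesaurus might be organized along these lines. This is a sketch under our own naming: 'thesaurus', 'add-synonyms!', 'synonyms-of' and the body of 'similar-match' are all hypothetical, while 'pattern-match', 'extend-bindings' and 'fail' are assumed from eliza.scm.

```
;; The thesaurus is an association list mapping each word to a
;; synonym set.  Words entered together share one list, so two
;; words that are synonyms point to the same set.
(define thesaurus '())

(define (synonyms-of word)
  (let ((entry (assq word thesaurus)))
    (if entry (cdr entry) '())))

(define (add-synonyms! words)
  ;; Every word in 'words' points at the same list 'words'.
  (for-each (lambda (word)
              (set! thesaurus (cons (cons word words) thesaurus)))
            words))

;; A special-purpose matcher in the style of 'number-match':
;; (special similar-match tall ?x) matches 'tall' or any of its
;; synonyms and binds the matched word to ?x.
(define (similar-match pattern input bindings word var)
  (if (and (pair? input)
           (memq (car input) (cons word (synonyms-of word))))
      (pattern-match (cdr pattern) (cdr input)
                     (extend-bindings var (car input) bindings))
      fail))

;; (add-synonyms! '(tall high lofty))
;; (synonyms-of 'high) => (tall high lofty)
```

For the full extra credit you would fold this lookup into the matcher itself so that every word comparison consults the thesaurus, rather than requiring an explicit 'special' in each pattern.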
/Users/tld/email/book/intro/exercises/final/desktop
/Users/tld/email/book/intro/exercises/final/desktop/bin
/Users/tld/email/book/intro/exercises/final/desktop/bin/phreqs
/Users/tld/email/book/intro/exercises/final/desktop/bin/porter
/Users/tld/email/book/intro/exercises/final/desktop/bin/README
/Users/tld/email/book/intro/exercises/final/desktop/bin/smrize
/Users/tld/email/book/intro/exercises/final/desktop/bin/txtify
/Users/tld/email/book/intro/exercises/final/desktop/private
/Users/tld/email/book/intro/exercises/final/desktop/private/algebra.scm
/Users/tld/email/book/intro/exercises/final/desktop/README

For this project, you will build a basic version of Google Desktop. Your program will process documents by creating an inverted index for the terms appearing in each document and summarizing each document as a term-frequency vector, and it will retrieve documents using keyword search, ranking the retrieved documents by a measure such as the cosine of the angle between term-frequency vectors. You will probably want to preprocess the documents using a shell script which we will provide. This is arguably the most complicated of the three projects and we will provide extra assistance for students interested in taking it on. Any student completing this project earns five extra points automatically. Here's a partial list of processes you'll have to implement:

a. Process a document by inserting it into an inverted index.
b. Summarize each document as a term-frequency vector.
c. Return each document including all keys in a keyword query.
d. Rank the documents using an appropriate measure of similarity.

Support files: You'll find some useful C-shell and Perl scripts for processing documents in ./bin - the text file ./bin/README provides a description of these scripts. You'll probably want to modify these scripts to suit your project design.

How does a search engine represent and store documents to support keyword queries?
Here are the basic procedures involved in representing/summarizing documents to support keyword queries:

document -> bag of words

For the most part, word order is ignored. Search engines like Google typically also store the location of each word in a document to support proximity queries, e.g., search for documents including 'search' and 'engine' such that these two terms are within one word of each other. Google is also starting to store short phrases, but storing phrases has to be managed carefully to avoid an explosion in terms.

bag of words -> bag of terms

Typically the words in a document are preprocessed by removing 'stop words' such as 'a', 'the', 'it', 'and', 'that', etc. - words that add little meaning given that the document is represented as a bag of words with little additional semantic analysis. In addition to removing stop words, usually the remaining words are 'stemmed' to map morphological variants of a given word to the same root, e.g., 'run', 'runs' and 'running' all map to 'run'. The set of all terms in the corpus is called the 'dictionary'. In most cases, stemming reduces the size of the dictionary and the length of term vectors, but it can in some cases conflate terms that we would like to remain separate, e.g., 'universe', 'university' and 'universal' all map to the root 'univers'.

bag of terms -> inverted index

In order to retrieve documents that contain specific terms, we build an inverted index which points from each term to the documents that contain the associated word. Usually the inverted index also records the position of each term/word in the document to support proximity searches.

bag of terms -> term-frequency vector

The bag of terms is then processed to determine the number of times each term appears in the document - this number is called the frequency of the term - and the resulting numbers are used to construct a term-frequency vector which maps each term to its frequency in the document.
You can think of this as a vector of numbers where the 'i'th number represents the frequency of the 'i'th term in the dictionary, but in fact term-frequency vectors are never actually implemented in this fashion. A more efficient way of implementing this mapping would be as an association list or hash table. Now that we have a representation for each document and an index mapping terms/words to documents, keyword queries are handled as follows:

keywords -> keyterms

Eliminate stop words and stem the remaining keywords.

keyterms -> documents

To retrieve a set of candidate documents, we use the inverted index to map the keywords to documents containing them. The search engine compiles a list of all documents that contain all (or some) of the keywords in preparation for ranking.

keyterms -> keyterm-frequency vector

To rank the documents, we can construct a term-frequency vector from the search keywords, which we'll call the keyterm-frequency vector. At this point we can add additional terms corresponding to synonyms of the keywords. We can also specialize the query by including hyponyms (x 'is a kind of' y) or generalize it by using hypernyms (y 'is a kind of' x).

documents + keyterm-frequency vector -> document ranking

The degree of similarity between the keyterm-frequency vector and each of the document term-frequency vectors is then calculated, and the documents are sorted using the calculated similarities and presented to the user in the resulting order.
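The term-frequency and ranking steps above can be sketched in Scheme. Here the term-frequency "vector" is an association list from term to count, as suggested earlier, and the ranking measure is the cosine of the angle between two such vectors; all of the names ('term-frequencies', 'bump-count', 'freq', 'dot', 'norm', 'cosine') are ours, chosen for illustration, not part of the support code.

```
;; bag of terms -> term-frequency vector (an association list).
(define (term-frequencies terms)
  (let loop ((terms terms) (tf '()))
    (if (null? terms)
        tf
        (loop (cdr terms) (bump-count (car terms) tf)))))

;; Increment the count for 'term', adding an entry if absent.
(define (bump-count term tf)
  (let ((entry (assoc term tf)))
    (if entry
        (begin (set-cdr! entry (+ 1 (cdr entry))) tf)
        (cons (cons term 1) tf))))

;; Frequency of a term in a term-frequency vector, 0 if absent.
(define (freq term tf)
  (let ((entry (assoc term tf)))
    (if entry (cdr entry) 0)))

;; Dot product, norm and cosine over the sparse representation:
;; only terms present in tf1 can contribute to the dot product.
(define (dot tf1 tf2)
  (apply + (map (lambda (entry)
                  (* (cdr entry) (freq (car entry) tf2)))
                tf1)))

(define (norm tf)
  (sqrt (dot tf tf)))

(define (cosine tf1 tf2)
  (/ (dot tf1 tf2) (* (norm tf1) (norm tf2))))
```

Identical vectors score 1 and vectors sharing no terms score 0, so the retrieved documents would be presented in order of decreasing (cosine keyterm-tf document-tf).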