You've Got (Junk) Email

Supplementary References

The title for this chapter is a variation on "You've Got Mail", the title of a film starring Tom Hanks and Meg Ryan and directed by Nora Ephron about two people who start a relationship by exchanging email. Nowadays the sincere solicitations of an aspiring cyber suitor are as likely as not to be mistaken for yet another unwanted advertisement.

Example Exercises

Scheme for Filtering Spam

In this exercise, we'll look at an implementation of the machine learning algorithm described in Chapter 8 for filtering junk email. Before you can use this code to actually filter spam, you'll have to write additional code to process email messages, turning them into lists of feature assignments that can then be used as training examples in the Scheme code below. We'll leave that step up to you and assume that we have the messages in the desired format. Skim through the following code (which you can download as a file in zip format) so as to get the gist of what it does. You'll need to read the chapter first to understand the basic ideas from machine learning and some of the implementation details (for example, those involving the use of the Scheme quasi-quote notation). Your assignment is described after the code.

;; Here's what a list of processed messages might look like:
(define examples '(((CLASS #t)
                    (SENDER-DOT-COM #f)
                    (SENDER-DOT-EDU #t)
                    (SUBJECT-IS-REPLY #t)
                    (SUBJECT-CONTAINS-FREE-WORD #f)
                    (BODY-CONTAINS-SALE-WORD #f)
                    (BODY-CONTAINS-BUY-WORD #f)
                    (BODY-FORMATTED-HTML #f)
                    (ATTACHED-PDF-DOC #f)
                    (ATTACHED-JPG-IMG #f))
                   ((CLASS #f)
                    (SENDER-DOT-COM #t)
                    (SENDER-DOT-EDU #f)
                    (SUBJECT-IS-REPLY #t)
                    (SUBJECT-CONTAINS-FREE-WORD #t)
                    (BODY-CONTAINS-SALE-WORD #f)
                    (BODY-CONTAINS-BUY-WORD #f)
                    (BODY-FORMATTED-HTML #f)
                    (ATTACHED-PDF-DOC #f)
                    (ATTACHED-JPG-IMG #f))
                   ((CLASS #t)
                    (SENDER-DOT-COM #t)
                    (SENDER-DOT-EDU #f)
                    (SUBJECT-IS-REPLY #f)
                    (SUBJECT-CONTAINS-FREE-WORD #t)
                    (BODY-CONTAINS-SALE-WORD #f)
                    (BODY-CONTAINS-BUY-WORD #f)
                    (BODY-FORMATTED-HTML #f)
                    (ATTACHED-PDF-DOC #f)
                    (ATTACHED-JPG-IMG #f))))

;; Here are the features used in specifying the above examples:
(define features '(SENDER-DOT-COM
                   SENDER-DOT-EDU
                   SUBJECT-IS-REPLY
                   SUBJECT-CONTAINS-FREE-WORD
                   BODY-CONTAINS-SALE-WORD
                   BODY-CONTAINS-BUY-WORD
                   BODY-FORMATTED-HTML
                   ATTACHED-PDF-DOC
                   ATTACHED-JPG-IMG))

;; Here are the functions implementing the learning algorithm:

;; Return the value assigned to a feature in a given example.
(define (value feature example)
  (cadr (assoc feature example)))

;; Return the class of a given example.
(define (class example)
  (cadr (assoc 'CLASS example)))

;; Return a classification procedure learned from the examples.
;; Unlike the Scheme code in the book, this implementation uses
;; an auxiliary procedure to build an executable classifier.
(define (build examples features)
  (eval 
   `(lambda (example) 
      ,(aux-build examples features))))

;; Recursively build a decision tree from the root down. The resulting
;; tree is expressed in the form of a nested if-then-else expression.
;; The invocation (andmap procedure list) returns #t if the procedure
;; returns #t when applied to each element of list and #f otherwise.
(define (aux-build examples features)
  (if (andmap (lambda (e) (class e)) examples)
      #t
    (if (andmap (lambda (e) (not (class e))) examples)
        #f
      (let ((feature (choose-feature examples features)))
         `(if (value ',feature example)
              ,(aux-build (findall examples
                                   (lambda (example) 
                                     (value feature example)))
                          (remove feature features))
              ,(aux-build (findall examples
                                   (lambda (example) 
                                     (not (value feature example))))
                          (remove feature features)))))))

;; Try each feature in turn and return the feature that best
;; discriminates among the examples.  Note that the evaluation
;; function returns values between 0 and 1 with smaller values
;; being more desirable (we want to minimize the disorder).
(define (choose-feature examples features) 
  (do ((features features (cdr features)) 
       (best-feature (car features))
       (best-valuation (evaluate examples (car features)))
       (valuation null))
      ((null? features) best-feature)
    (set! valuation (evaluate examples (car features)))
    (cond ((< valuation best-valuation)
           (set! best-valuation valuation)
           (set! best-feature (car features))))))
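
As a quick check on choose-feature, consider which feature it selects for the three examples defined at the top of the file (this is just a sketch to try once you've loaded the code; the numbers come from working the evaluation function out by hand). Each of SENDER-DOT-COM, SENDER-DOT-EDU, SUBJECT-IS-REPLY and SUBJECT-CONTAINS-FREE-WORD yields an evaluation of 2/3, the remaining features yield about 0.92, and ties are broken in favor of the feature tried first:

;; Should return SENDER-DOT-COM, the first of the four features
;; that tie for the lowest (best) evaluation:
(choose-feature examples features)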

;; The following evaluation function uses the notion of disorder or
;; entropy from information theory to evaluate the discriminatory
;; power of a feature with respect to a set of examples.
(define (evaluate examples feature)
  (let ((total (length examples)))
    (apply +
           (map (lambda (subset)
                  (* (/ (length subset) total) 
                     (disorder (classify subset))))
                (partition examples feature)))))

;; Partition a set of examples according to a boolean class.
(define (classify examples)
  (list
   (findall examples (lambda (example) (class example)))
   (findall examples (lambda (example) (not (class example))))))

;; Partition a set of examples according to a boolean feature.
(define (partition examples feature)
  (list
   (findall examples (lambda (example) (value feature example)))
   (findall examples (lambda (example) (not (value feature example))))))
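
To see what these two procedures compute, you might try the following invocation once the code is loaded; it should return a two-element list whose first element contains the second and third examples (those in which SENDER-DOT-COM is assigned #t) and whose second element contains the first example:

;; Split the examples defined above on the SENDER-DOT-COM feature:
(partition examples 'SENDER-DOT-COM)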

;; Try to write the following function in mathematical notation.  Note
;; that we take some care not to take the log of 0.  Why don't we have
;; to worry about this from a mathematical perspective?  The answer is
;; that 0 log 1/0 is taken to be zero by convention on the strength of
;; the fact that the limit of x log 1/x as x approaches 0 is 0.
(define (disorder partition)
  (let* ((sizes (map length partition)) (total (apply + sizes)))
    (if (= total 0)
        0
      (- (log_2 total)
         (apply +
                (map (lambda (size) 
                       (if (= size 0)
                           0 
                         (* (/ size total) 
                            (log_2 size))))
                     sizes))))))
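
A couple of hand-worked cases may help here (the symbols a, b, c and d just stand for arbitrary items; only the sizes of the sublists matter). A partition in which one class is empty has the minimum disorder, and a partition split evenly between the two classes has the maximum:

;; Should return 0: log2 2 - (2/2) * log2 2 = 1 - 1 = 0
(disorder '((a b) ()))

;; Should return 1: log2 4 - (2/4) * log2 2 - (2/4) * log2 2 = 2 - 1 = 1
(disorder '((a c) (b d)))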

;; Here are some additional support functions:

;; Remove all occurrences of an item from a list of items.
(define (remove item items)
  (cond ((null? items) (list))
        ((equal? item (car items))
         (remove item (cdr items)))
        (else (cons (car items) 
                    (remove item (cdr items))))))

;; Return all items that satisfy the given test procedure.  Note
;; that the results come back in reverse order of appearance, which
;; is harmless for our purposes.
(define (findall items test)
  (do ((items items (cdr items)) (results (list)))
      ((null? items) results)
    (if (test (car items))
        (set! results (cons (car items) results)))))

;; In Scheme, the log function is the logarithm base e (also called
;; the natural logarithm) where e is Euler's number.  The Scheme
;; invocation (/ (log n) (log m)) computes the logarithm of n base m,
;; and, hence, (/ (log n) (log 2)) computes the logarithm of n base 2.
(define (log_2 n) 
  (/ (log n) (log 2)))

Hopefully the bit about disorder and entropy didn't throw you for a loop. Chapter 8 talks about choosing features in very general terms, but when you're writing a program you have to get down to particulars. The method of evaluating the discriminatory ability of features in the above code is based on ideas from information theory.17

As an exercise, you might convince yourself that the invocation (evaluate examples feature) returns 0 when the feature splits the list of examples so that the then branch has all the examples with the class assigned #t and the else branch has all the examples with the class assigned #f (or the other way around, exchanging then and else). It returns 1 when the feature splits the list of examples so that there are equal numbers of #t and #f examples in each of the then and else branches (in which case the feature makes no useful distinctions whatsoever). The value 0 indicates minimum disorder and the value 1 maximum disorder.
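To make this concrete, here's a small sketch you might run once the code is loaded (the feature name F and these tiny example lists are invented purely for illustration). In the first list F agrees exactly with the class, and in the second F carries no information about the class:

;; F perfectly separates the classes, so this should return 0:
(evaluate '(((CLASS #t) (F #t))
            ((CLASS #f) (F #f)))
          'F)

;; F leaves each branch evenly mixed, so this should return 1:
(evaluate '(((CLASS #t) (F #t))
            ((CLASS #f) (F #t))
            ((CLASS #t) (F #f))
            ((CLASS #f) (F #f)))
          'F)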

Your task for this exercise is to create a set of training examples which, if given to build (as in the invocation (build examples features)), produces an if-then-else expression that looks like the following skeleton (you get to determine the features indicated below by the ...):

(if ...
    (if ...
        #t
        #f)
    (if ...
        #t
        #f))

Since it would otherwise be difficult to determine whether your examples produce the desired effect, we've included the following code to enable you to view the resulting classifiers.

;; Load the DrScheme library for printing formatted output:
(require (lib "pretty.ss"))

;; An alternative build that displays the learned classifier:
(define (show-build examples features)
  (pretty-print 
   `(lambda (example) 
      ,(aux-build examples features))))

Assuming that you're using DrScheme with one of the PLT languages selected (for example, Textual) and that you've loaded the file containing all of the above code into Scheme (if you're using some other version of Scheme, you'll have to find or define your own pretty-print procedure), you can display your learned classifiers and test them as the following exchange illustrates, using the examples and features defined above:

> (define filter (build examples features))
> (define test_message (cdr (car examples)))
> (pretty-print test_message)
((sender-dot-com #f)
 (sender-dot-edu #t)
 (subject-is-reply #t)
 (subject-contains-free-word #f)
 (body-contains-sale-word #f)
 (body-contains-buy-word #f)
 (body-formatted-html #f)
 (attached-pdf-doc #f)
 (attached-jpg-img #f))
> (filter test_message)
#t
> (show-build examples features)
(lambda (example)
  (if (value 'sender-dot-com example)
    (if (value 'subject-is-reply example) #f #t)
    #t))

Code Fragments

Download all the code fragments in this chapter as a file in zip format.


17 If you'd like to learn more about information theory, you'll find a nice introduction on the Lucent Technologies web site with examples, history, and even a link to the original paper by Claude Shannon entitled "A Mathematical Theory of Communication," which really started the field [Shannon1948].